Topic Modeling Analysis

While experimenting with topic modeling, I ran the program a few times to see how the topics changed with different settings. The first time I used Mallet, I used the settings we had used as a group for my first blog post, to get used to the program. The settings were: 50 topics, 1000 iterations, 20 topic words, and stop words removed. Looking at the first set of 50 topics, I realized I am really bad at figuring out the label or category for these words. I tend to go with the first word that comes to mind, which does not fit the entire category. It was amazing to see how fast this program works. The first run took 48.149 seconds, which is not a lot of time for 2845 files. I had assumed it would take longer for the program to split these works into topics. It was really interesting to see it work so efficiently.

The second time using Mallet, I changed the settings to see if there were huge differences. This time I changed the number of topics from 50 to 25, the number of iterations from 1000 to 1500, and the number of topic words from 20 to 15, and I decided to keep the stop words removed. The program ran even faster this time: 47.219 seconds. I am really impressed with how fast a program can run through all that data.
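For anyone who wants to repeat these two runs outside the lab, here is a rough sketch of how the same settings might map onto MALLET's command-line tool. It assumes the command-line version of MALLET rather than the interface we used in class, and the paths (`bin/mallet`, `holmes_texts`, `holmes.mallet`) and the output file names are made up for illustration. The script only builds and prints the commands; it does not run them.

```python
def mallet_commands(num_topics, num_iterations, num_top_words,
                    text_dir="holmes_texts", corpus_file="holmes.mallet"):
    """Build (but do not run) the two MALLET commands for one experiment.

    text_dir and corpus_file are hypothetical paths, not the class setup.
    """
    import_cmd = [
        "bin/mallet", "import-dir",
        "--input", text_dir,
        "--output", corpus_file,
        "--keep-sequence",       # topic training needs the word order kept
        "--remove-stopwords",    # matches the "stop words removed" setting
    ]
    train_cmd = [
        "bin/mallet", "train-topics",
        "--input", corpus_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--num-top-words", str(num_top_words),
        "--output-topic-keys", f"keys_{num_topics}.txt",        # hypothetical name
        "--output-doc-topics", f"doc_topics_{num_topics}.txt",  # hypothetical name
    ]
    return import_cmd, train_cmd


# The two runs described above: (topics, iterations, topic words).
runs = [(50, 1000, 20), (25, 1500, 15)]
for topics, iterations, words in runs:
    for cmd in mallet_commands(topics, iterations, words):
        print(" ".join(cmd))
```

Keeping the three numbers together in one tuple per run makes it harder to forget which settings produced which output file.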

Because the program runs everything so fast, it really makes the process more efficient. I do not have to read all these works, and I can still see common themes and topics among them. It was interesting to see how many times words came up once you clicked on certain topics. I also liked that the results were then broken down by how frequently each word was used within the works. I personally thought it was a good program, but I could see how someone else might catch more flaws. For me, doing it in class was very useful, and it helped me see more themes within the Sherlock Holmes stories.

 

-Erin S.

Topic Modeling: So Many Words!

The concept of topic modeling, even at a surface glance through the reading of an article, seemed like a pretty complicated endeavor. The nature of the program itself was very interesting, aside from the hardships it presented. The program allows large volumes of literature to be analyzed through “topics,” or certain words from the text that are the most relevant in understanding the text, without actually reading through it.

My topic modeling experimentation included two different uses:

60 topics, 1500 iterations, 25 topic words; and 50 topics, 2000 iterations, 50 topic words

My first example usage gave me the most varied results because of the larger number of topic words. Although the number of iterations is lower than in my second example, the number of topic words proved to be the true variable that determined more diverse results. The topics allowed me to understand the main point of the stories I chose to look at (through their topic words) without even reading any of the stories. This is extremely helpful considering there is a vast number of Sherlock Holmes stories to be read. And like most other digital humanities tools, this would be very helpful in creating an archive, or any other project which requires the reading of many texts.

Most of us were pressed for time, as this program is only installed on the computers in the lab where we have class. However, it was one of the many endeavors in life where, once you understand it, it becomes quite easy to pick up and use efficiently.

[Image: Screen Shot 2014-10-29 at 11.36.07 PM]

As shown in the very crowded wordle of all of my topic words, there were many results.

Topic Modeling, Sherlock Holmes Edition

This week’s digital tool was very different from the others we have used in class. After playing around with Mallet and topic modeling, I actually enjoyed trying to “un-puzzle” the words (so to speak) and figure out what the major topics were for each Sherlock Holmes story. For my first set of topics, I ran the modeling tool with 1,000 iterations, 20 words printed, and 50 topics. In reviewing the long list of topics I could have chosen, I actually struggled to find one that I understood and thought was relevant enough to twentieth century London. For my first topic, broken home, I learned that this group of words was most prominent in the Sherlock Holmes story The Solitary Cyclist. From reviewing the topics, it seemed as though 25% of the words were those of my topic. When identifying what I thought the topic would be, I at first labeled it abandoned. However, after reviewing the words again I felt broken home was more appropriate, seeing as how even though a father left, the siblings remained in contact: not an ideal family, but still their version of what a family is. My second topic had to do with investigation. I chose this as a topic because not only did I get it right away, but it also shows that the majority of Sherlock Holmes stories revolve around investigation! It seemed as though 39% of the words in the story revolved around this theme of investigation, but the story didn’t rely too heavily on it. My third topic was household. It seemed as though 25% of the words in Sussex Vampire revolved around the topic; however, it was only a small portion of the story, seeing as how it only outlined the characters of the short story. Lastly, the fourth topic I chose was written document. I was surprised by these results because only about 13% of the story included this topic. Although it may not have been a major theme in “Gloria Scott,” I assume the document was the premise for the investigation in the story.

When I played around with Mallet again, I decided to change it up. Instead of doing 1,000 iterations, this time I did 1,500 iterations, 25 words printed, and 40 topics. I found it easier to label topics for these groups of words because I had more to work with and more to compare. The first topic I chose was characteristics. It became very clear that this was the topic for these words because words like “face, grey, man, thin, lips” were very prevalent. The next topic I chose was emotions. This was one of the harder topics I had to label because words like “god, voice, words” threw me off, but after reviewing it once more, I decided that a general topic for these words would have to do with a person’s emotions. My last topic for this set was physical appearance. This was kind of a fun group to label because there was a lot of imagery and color involved, so it was easy for me to imagine this person standing in front of me. Therefore, I knew this topic had to involve some sort of appearance. When I reviewed the top Sherlock Holmes stories for these topics I got The Priory School, Lion’s Mane, and A Case of Identity. Interestingly enough, all of these topics were slightly similar, leading me to believe these stories may have similar themes.

For my last set, I chose 2,000 iterations, 40 words printed, and 60 topics. I was a bit more overwhelmed with this set because although I had a lot more to work with, I felt like it was almost too much. When looking at the different sets of words I felt like most of the words matched one another to create a topic, but I felt that others were kind of extraneous and took away from the major theme of the topic. With that being said, I labeled my first topic schedule. A lot of the words had to do with timing and places to be, which reminded me a lot of when I plan my day out. For this topic, 12% of The Missing Three-Quarter had to do with my topic, and although another 6% had to do with a different topic, one of the major themes still pointed to scheduling. My second topic was suicide. This topic was very easy to label because all of the words involved with it pointed to something tragic and self-inflicted. Therefore, I thought suicide would be an appropriate label. Looking at the percentages, it seemed as though The Norwood Builder was the top story, with about 13% of its words listed for this topic. Although it was not the number one topic for the story, it is prevalent, seeing as how a murder must have taken place. Lastly, I chose the topic of traveling. This was another topic that I was iffy about because I felt like it could have been journey or travel, but I leaned more toward travel because of words such as “bridge, town, (and) cross.” It seemed as though traveling was prevalent in Final Problem, with a total of 29% of topic words mentioned throughout the short story.

Overall I thought Mallet was a fun and interesting tool, and I would most definitely try it out again sometime. It taught me a lot about gathering a major theme based on prevalent words in a short story, but in a unique way. Every time I figured out a topic I felt as though I was unscrambling a really difficult puzzle piece; however, once I came up with the correct topics the whole process became extremely entertaining!

Topics Analysis.

Dan Albrecht.

I feel as though Mallet can be an extremely useful tool in modeling different topics within a given amount of text. When I ran the program with the settings for 30 topics, 1000 iterations, and 15 words, I got Death, Details, Victorian Women, Setting, and Deep Thought. With 10, 2000, and 30, I got Morbid and Services Request. With 10, 500, and 20, I got Action, Physical Characteristics, and Terror.

The topics Services Request, Details, Setting, Physical Characteristics, and Deep Thought as part of Sherlock Holmes did not surprise me, since this is the mystery genre and one would expect to find them. What surprised me a little was the presence of Death, Morbid, and Terror. Since many of these stories deal with murder, then maybe they shouldn’t have, but I was impressed that the Holmes stories don’t just appeal to those who want logic and analytical detective work; these stories can also appeal to the emotions of their readers to keep them gripped.

I also got Action and Victorian Women in this list of topics. Action was another plot device that Conan Doyle was able to use to appeal to his readers. Victorian Women was an indicator that many of these stories reflect general attitudes about Victorian culture, including gender attitudes.

These lists of topics really help to underscore some of the general themes and plot devices of the Holmes stories. The topics might be harder to understand for a user who has never read the Holmes stories, but they can be useful nonetheless.

Topic Modeling with Sherlock Holmes: Analysis

When first introduced to this project I was not sure how to go about it. Although it sounds like an interesting and cool representation of a group of works, it is a bit confusing. As for figuring out how to work the program, that got a bit tricky. The process is relatively simple, but each time you change something you have to remember to adjust where you are saving as well as other settings. But overall, it was an interesting experience and an intriguing idea.

For my group of words, I found it quite fun to come up with topic titles. The word groups themselves were thought provoking, especially when trying to figure out which story some of these words could have possibly come from. Part of the program itself did reveal to you where the words appeared the most in a story, so it was fun to see if your guesses were right and also if we had read the story before.

The easiest word groups to name were the ones with words that were similar and consistent in subject matter. For example, one of the easy ones to identify was the articles of clothing/garments word group. The words (black red glass large coat dress centre top brown observe glasses faced dressed boots colour broad impression pair hat mark) were obvious items of clothing, so it was simple to come up with a topic title. Others, such as death (found dead body lay man blood death blow knife unfortunate terrible person lying finally cut weapon evidence constable remained wound), body parts and expressions (eyes face hands voice cried lips shoulders sat turned air amazement sprang companion stared sunk raised sank eager instant shrugged cheeks staring astonishment angry breast), and house (house night room master bell attention bed asked servants alarm servant ring remained phelps walked butler drawing kitchen finally stay save rope thief scent coffee state joseph rang suspect smell dragged cover cellar burglar ill harrison instantly sounds scene french bound county form rest wished partly pull chamber mr ventilator) were easy to create names for because the words had obvious relations to one another and similar subject matters.

There were also some difficult topic titles. Some of the words in the groups did not have a clear subject matter or they would have mini groups within the larger one, making it hard to pin down a clear, overall theme. For example, crime investigation (holmes mr inspector lestrade sherlock case yard detective opinion scotland arrest evidence prisoner official ready practical force quietly bank gregson remarked oldacre mcfarlane final joke absence credit finished rubbed warrant pleasure hands gentlemen norwood fail express suspicious bound wiser chuckled profession afford lucky attempted finds jonas rolled sense martin bradstreet) and evidence (small box examined large papers floor carefully examination inside cut top square iron carpet showed wooden furnished evidently lower contents central removed mantelpiece careful examining) were difficult to name because a lot of the words did not connect with one another. Some words were similar, while others were completely different. With groups such as these, you had to look closely at all the words and try to come up with a general title that would encompass all the words into one subject.

– Allyson Macci

Topic Modeling with MALLET: Analyzing the Results

Initially, it was difficult for me to understand the definition and purpose of topic modeling. However, after using MALLET, a topic modeling tool, to find patterns in Sherlock Holmes stories, I began to understand how topic modeling works.

After entering the Sherlock Holmes stories into MALLET, I found 10 good topics. The first 6 topics came from 50 topics, 1000 iterations, and 20 topic words printed. The topic names were Letter Writing, Crime, Marriage, Death, Clues, and Physical Description (Male). The other four topics came from 70 topics, 1500 iterations, and 15 topic words printed. These were Holmes in his Chair, Rooms in a House, London Finance, and Investigation Process. I experimented with other variations of iterations, topics, and topic words printed, but only had time to upload these output files onto my computer. By testing out many different variations I found that the more iterations and topic words you have, the easier it is to identify the topic name. After I picked out my 10 topics, I clicked on the topic words within them in order to see the top ranked documents within that topic. MALLET then allowed me to see the number of words in a specific document that were assigned to that topic. I found, for example, that 22 words in a document from The Stock Broker’s Clerk were assigned to the London Finance topic. The words in this topic were: money business work hundred answered good pounds company asked thousand advertisement city price headed pay. The document excerpt that MALLET showed at the top of the page revealed that this part of the story was about a “gigantic robbery” in which “nearly a hundred thousand pounds worth of American railway bonds” were found in the robber’s bag. This explains why 22 of the words within the document were assigned to London Finance. MALLET also showed that only 12% of the words in that entire document were assigned to this topic. I went through this same process with all of my topics to figure out which Sherlock Holmes stories discussed certain topics, and how many words in each story were assigned to those topics.
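To make the two kinds of numbers concrete, here is a small sketch of how a word count and a percentage can both come out of the same per-word topic assignments. The toy document below is invented for illustration; it is not MALLET's actual output format or a real passage from the story, and the function name is hypothetical.

```python
from collections import Counter


def topic_share(assignments, topic):
    """Return (count, fraction) of tokens in one document assigned to `topic`.

    `assignments` is a list of (word, topic_label) pairs, one per token.
    """
    counts = Counter(label for _, label in assignments)
    total = sum(counts.values())
    count = counts[topic]
    return count, (count / total if total else 0.0)


# Invented toy document: 8 tokens, 2 of them assigned to "London Finance".
toy_document = [
    ("hundred", "London Finance"),
    ("thousand", "London Finance"),
    ("bag", "Crime"),
    ("robbery", "Crime"),
    ("found", "Crime"),
    ("holmes", "Investigation Process"),
    ("remarked", "Investigation Process"),
    ("door", "Rooms in a House"),
]

count, share = topic_share(toy_document, "London Finance")
print(f"{count} words, {share:.0%} of the document")  # 2 words, 25% of the document
```

By the same arithmetic, if the 22 words and the 12% refer to the same document, that document would hold roughly 180 counted words.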

Altogether, I think topic modeling with MALLET is a great way of distant reading. MALLET proved to be efficient: it sifted through massive amounts of text from Sherlock Holmes stories and found patterns within them faster than most of us could even finish reading just one of those stories. There were a few aspects of MALLET, however, that I disliked. First, it creates enormous files. These files take up a lot of space, and this makes the process of transferring them onto Google Drive and onto other computers extremely slow. On top of this, some of the topics it creates are extremely difficult to name because the words didn’t seem to have much in common. A lot of the topics also reappeared after I changed the number of iterations, topics, and topic words (e.g., London Finance, Death, Holmes in his Chair). I suppose that was inevitable though, because the text being read by MALLET didn’t change.

After completing this project, I understand that topic modeling tools such as MALLET are useful in that they can take texts and then find patterns in the use of words. Topic modeling is most effective when we have many documents/texts that we want to understand without actually closely reading each individual text (distant reading!).

Mary Dellas

Topic Modeling: Aftermath

When we first discussed topic modeling in class, I was a little bit confused. But, after we did some group work on it I began to grasp the basic concept. Doing it myself proved to be a little bit more difficult. In the beginning, I had a hard time with the application that we were working with, Mallet. It was more technical stuff that I just hadn’t realized. But once I figured it out and was able to make my lists I was a bit more comfortable. Picking out good lists, however, was also challenging. For many of the groups it was hard to figure out what they had in common without seeing the context from the text.

However, eventually it became easier. The more I figured out, the easier it became to put the words into categories. For instance, the group “cried hands god face voice sake moment soul burst truth” were clearly words of desperation, so I filed them under anguish. Similarly, there was a group of words, “cried god voice sake hope words soul heaven speak heavens jack frightened swear suggestion truth heart despair manner aid quick” that had many words in common with the previous group, but this group seemed more religion-related, so I filed it under “praying”. I think if someone were to see these topic models, they would grasp the basic idea behind the Sherlock stories. They would realize that these were detective stories and that Sherlock sometimes had to do tricky things to figure out his cases.
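The overlap between those two groups can be checked directly. This is only a quick sketch using Python sets on the two word lists quoted above; the variable names are mine, and the last line is a simple overlap ratio (shared words divided by all distinct words), not anything MALLET reports.

```python
anguish = "cried hands god face voice sake moment soul burst truth".split()
praying = ("cried god voice sake hope words soul heaven speak heavens jack "
           "frightened swear suggestion truth heart despair manner aid quick").split()

# Words that appear in both groups, and words unique to the first group.
shared = sorted(set(anguish) & set(praying))
only_anguish = sorted(set(anguish) - set(praying))

print(shared)        # ['cried', 'god', 'sake', 'soul', 'truth', 'voice']
print(only_anguish)  # ['burst', 'face', 'hands', 'moment']
print(len(shared) / len(set(anguish) | set(praying)))  # 0.25
```

Six of the ten "anguish" words reappear in the second group, which is why the two lists feel so alike even though the extra words in the second one push it toward religion.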

Results of Topic Modeling

Topic modeling in class proved to be an interesting and useful method of natural language processing, one that uses the statistical patterns in the words to infer explicit topics from them. This novel tool aided in “highlighting” key words that were highly significant and repeated throughout the Sherlock Holmes stories.

One of the topics that I identified consisted of words that describe a place or a setting, such as “room door window light open long passed house stood round opened front dark night entered”. From these words, it was quite evident that the program was picking out the environment where an event had occurred. Another unique set of words that was brought to my attention was “woman lady wife husband life love left boy child nature loved beautiful maid ferguson happy madam women mistress devoted wonderful”. I gave it the title “Family” due to the concepts of love and relationships described by the words. Another set of words was “man lay dead poor body professor blood close carried end terrible death struck moment broken shot strange deep long water”, which carried a negative connotation with pessimistic adjectives. I gave this set of words the title “Depression” as they were very sorrowful and lonely descriptive words. A set of words that described economic value and worth in the Sherlock Holmes stories was “business client money england hundred king pounds thousand large set gold photograph paid pay ten give draw fifty ready worth”. These words depicted the importance of money and greed in the stories.

Topic Modeling Observations

And we’re back!
I suspect I am not alone in this regard, but when we first discussed Topic Modeling in class I had some understanding of it but also a lot of confusion about the process. Using the Mallet tool myself has made the process much clearer, both in how it works and in how useful it is. So let’s dive in.

There’s no need to examine all of my found topics but you can view them here. They were Business and Labor, Holmes’ Office, Emotional Shock, Place, Crime, Foot Traffic, Time, Deduction, Love and Matrimony, and Notes and Writing. I found that increasing or reducing the number of topics to sort was one of the key factors affecting the output. Fewer topics obviously lead to more repetition between searches, whereas more topics lead to a wider variety and assortment. I’m sure increasing the number of iterations does provide a stronger set of words per topic, but it isn’t really possible for me to prove that claim, and it does have the negative effect of increasing the time the program takes to run.

One of the topics I found personally interesting was Holmes’ Office. With the words – holmes chair sat table room fire back pipe sitting rose arm laid seated glanced books – I felt that it gave a clear indication that these were scenes of Holmes reflecting in his office. And indeed, in several of the passages it was connected to, the scene was set in his office. But I was too specific in my naming of this topic, as the number one scene linked to this topic (with the highest number of topic words associated with it) was from “The Man with the Twisted Lip”, and the scene in question, from file twis_39.txt, did not actually take place in Holmes’ office, even if it strongly evoked it. Technically none of the words specifically point to it being his office, but they all provide a sense of scene that I felt linked to his office. So while the passages often fit the topic, I was too specific in its naming, and perhaps another theme or word would have been better suited to name it (although I admittedly struggle to decide on one myself).

In contrast, when using a more general topic name the results are more likely to be accurate by way of simple logic. The topic Crime – crime night police occurred house tragedy violence murder made caused committed account barclay death appeared – will almost always evoke crime in the scene in question. With the Holmes stories of course revolving around crimes and the mysteries they create, this was a prevalent topic where at least 5 instances of the words appear in 90 of the text files.
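A count like "at least 5 instances in 90 of the text files" can be approximated with a simple word tally. The sketch below is plain word counting, which is an assumption on my part: it is not the same as MALLET's per-word topic assignments. The function name, the two sample texts, and their file names are invented stand-ins for the real story chunks.

```python
def files_with_topic(topic_words, documents, min_hits=5):
    """Return names of documents containing at least `min_hits` topic-word tokens."""
    topic_words = set(topic_words)
    matches = []
    for name, text in documents.items():
        # Count every token that is one of the topic words (repeats included).
        hits = sum(1 for token in text.lower().split() if token in topic_words)
        if hits >= min_hits:
            matches.append(name)
    return matches


crime_words = ("crime night police occurred house tragedy violence murder "
               "made caused committed account barclay death appeared").split()

# Invented stand-ins for the story chunks; the file names are hypothetical.
documents = {
    "chunk_a.txt": "the police appeared at the house the night the murder occurred",
    "chunk_b.txt": "holmes lit his pipe and sat by the fire",
}

print(files_with_topic(crime_words, documents))  # ['chunk_a.txt']
```

Raising `min_hits` makes the test stricter, so fewer files qualify, which is one way to see how strongly a general label like Crime really runs through the collection.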

As is the theme of Digital Humanities as we’ve explored it thus far, we could use Topic Modeling to look at these stories in new and interesting ways. The topics can shed light on things we may not have considered before and open up questions that likely wouldn’t have been asked with close reading of the stories. The tool also provides insight into the relationships between words and how they have a power over us, such as in my mistake of naming a certain topic Holmes’ Office when it was in fact more general than that.

Sherlock Holmes Topic Modeling

75 Topics / 1500 Iterations

Transportation: train station carriage drive cab drove walk waiting started journey bridge town pulled roof hurried catch miles trouble passing stepped
Murder: crime death murder evidence police found tragedy case reason criminal arrest charge scene inspector suspicion murderer committed murdered motive constable
Writing: paper note read book table wrote writing written handed sheet picked letter write page address pen piece pencil post learn
Trails: path feet round water green stood track foot edge grass farther leaving impossible fall run tracks reached tut direction passed
Working in London: london office made found telegram left evening monday city knowledge return hotel cross manager touch notice clerk duties worked pleased
During the night: night clock morning ten west alarm usual twelve late eleven rest moving hour vanished began continued fog midnight cadogan woolwich
Women: lady wife woman child ago married dear girl maid ferguson daughter mother life years left mistress beautiful passed devoted frances
Describing the Dead: body dead blood hand lay found head struck shot man revolver blow knife lying carried weapon held fell stick wound
Finances: years money hundred twenty business pounds pay year thousand fifty worth sum age england ten honour rich gold price bank
Holmes Seated: sat chair fire pipe room arm silence sitting seated silent rose cigar tobacco minutes opposite lit visitor smoking table smoke
Realisation: instant face hands suddenly sprang cried caught air appeared coming voice feet quick sight glimpse sunk astonishment raised sank breast
Dress: black dressed heavy broad hat round coat side thick brown boots dress yellow short pair worn grey nose chin centre
Facial Expression: eyes face lips features set expression looked gray drawn fixed lines eager glance raised figure thinking fierce gleam eyebrows brows