Using MALLET for Topic Modeling

Travis Miller and Simeon Allocco

When using MALLET we found that changing the number of topics made a significant difference. We first decreased the number of topics from 50 to 35, and then increased it from 35 to 65. Decreasing the number of topics produced word lists with many more nouns and fewer verbs. This made the sets of words much more concise, making it easier to generalize a topic name for them.

When fiddling with the number of iterations we did not see any change in the patterns at all; the words themselves were different, but there was no visible shift in verb and noun usage. One setting that we strongly recommend when using MALLET is the remove stop words option in the advanced settings. This cuts out common words that are insignificant to the actual theme of the topic, making it much easier to analyze.
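To see why that setting matters, here is a minimal sketch of the kind of filtering it performs. The stopword list below is a tiny hand-picked illustration, not MALLET's actual list:

```python
# A rough sketch of stop word removal: drop common function words that
# carry no thematic weight before topics are computed.
# (Illustrative list only -- MALLET ships a much longer one.)
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "was", "it", "he", "i"}

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop common function words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

tokens = remove_stopwords("The dog lay dead in the garden and it was night")
print(tokens)  # ['dog', 'lay', 'dead', 'garden', 'night']
```

The surviving words are exactly the kind of nouns and verbs a topic label can be built from, which is why the filtered topics are so much easier to name.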

Favorite Topics:

Clues:

This topic was used the most in The Five Orange Pips and the least in Silver Blaze.

Questions:

Does date of publication affect the frequency of this topic?

Is this data reliable since this topic is so ubiquitous in the Sherlock Holmes Series?

Investigation:

This topic was used the most in The Adventure of the Empty House and the least in The Red-Headed League.

Questions:

What differentiates this topic from the last topic?

How popular was this topic compared to the others?

Death:

Death showed up the most in The “Gloria Scott” and the least in The Adventure of the Bruce-Partington Plans.

Questions:

The name Garcia pops up in the list of words for this topic. Why is that?

Was death a very popular topic during this time period?

Discussing Topic Models with Mary Dellas and Joe Mausler

After discussing the process and results of topic modeling using MALLET, we know that the fewer topics we have, the broader the topic category MALLET gives us. The more iterations we have, the easier it is to identify a topic name. We recommend the default settings we used in class: 50 topics, 1000 iterations, and 20 topic words. These settings gave us enough topic words to determine a topic name, but not so many that it became confusing and repetitive.
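For anyone running MALLET outside the GUI tool we used in class, the same settings map onto MALLET's command-line interface roughly as follows. This is a sketch, not our exact invocation: the directory and file names are placeholders.

```shell
# Import the story files into MALLET's format, stripping common stop words
# (the paths here are placeholders):
bin/mallet import-dir --input stories/ --output holmes.mallet \
    --keep-sequence --remove-stopwords

# Train with the class defaults: 50 topics, 1000 iterations, 20 topic words.
bin/mallet train-topics --input holmes.mallet \
    --num-topics 50 --num-iterations 1000 --num-top-words 20 \
    --output-topic-keys topic_keys.txt --output-doc-topics doc_topics.txt
```

The `topic_keys.txt` output holds the 20-word lists we labeled, and `doc_topics.txt` holds the per-story topic proportions we used to rank documents.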

These are our three favorite topics:

1. Physical Description (Male): face man eyes looked thin dark features tall expression appearance middle high pale figure set glasses gray keen clean bear

  • a) The top ranked document in the Physical Description (Male) topic is Charles Augustus Milverton. 26 words in the document are assigned to this topic.
  • b) The story The Sussex Vampire uses this topic the least (2 times).
    • Question 1: Even though 26 words in the document are assigned to Physical Description (Male), does this imply that this document is entirely dedicated to the topic Physical Description (Male)?
    • Question 2: Why does it seem like some of the words (ex. set, bear) do not relate to the other words in the topic?

2. Letter Writing: paper note read letter table book box letters papers written handed wrote writing sheet brought importance post write document address

  • a) The top ranked document in the letter writing topic is The “Gloria Scott”. 18 words in the document are assigned to this topic.
  • b) The story Shoscombe Old Place uses the topic least (2 times).
    • Question 1: Why does the same story name appear multiple times on the list of the top ranked documents?
    • Question 2: When we click the story chunk, why is MALLET only showing us a small part of the document?

3. Crime: police crime case night evidence murder death account occurred arrest unfortunate effect tragedy violence complete charge appeared reason terrible committed

  • a) The top ranked document in the crime topic is The Second Stain. 62 words in The Second Stain were assigned to this topic.
  • b) We found that The Priory School uses crime the least: a total of two times.
    • Question 1: Is crime a more common topic in the later Sherlock Holmes stories or the earlier ones?
    • Question 2: Can MALLET tell us how many stories in total discuss crime?

Topic Modeling Partner Blog Post – Carly Rome and Jacquie Behan

Lowering the number of topics causes MALLET to generate broader topics. Lowering the number of topics too much can make topics too vague. We recommend using a higher number of iterations for topic modeling through MALLET because it makes it easier to identify common topics among the words generated. However, too many topics combined with too many iterations can make the results too specific, which isn’t helpful either.

Comparing both of our word lists, we both had the topics:

  1. Crime- (Carly – Crime: police man inspector lestrade colonel crime death reason evidence murder mystery affair case present remained criminal force arrest account undoubtedly/Jacquie- Crime- man police found inspector dead crime death body evidence reason murder blood night shot person)
  2. Murder/Death- (Carly- Death: found end lay dead hand part body long deep lost close blood finally carried showed attention broken shot leaving horror/ Jacquie- Murder- found dead man death crime body evidence terrible unfortunate attempt violence words occurred instantly action save murderer committed escape murdered)
  3. Characteristics/Appearance of a Person- (Carly- Characteristics: face eyes man red black white dark looked thin features hair lips tall appearance blue drawn expression pale heavy hat/ Jacquie-Appearance-face eyes man dark tall looked features expression thin lips pale mouth figure companion appearance)

Crime was found the most (74 times) in the Second Stain story and the least (2 times) in the Musgrave Ritual. Murder/death was found the most (28 times) in the Norwood Builder and the least (2 times) in the Noble Bachelor. Characteristics/appearance of a person was found the most (27 times) in the Mazarin Stone story and the least (2 times) in The Speckled Band.

Our two questions about the data are:

  1. Can any of the words in these topics have double meanings or be misinterpreted without context? How can this affect the accuracy and helpfulness of topic modeling?
  2. Is it okay to ignore one or two words that may not correlate with the rest of the words within a topic? Why or why not?

Sherlock Holmes Topic Modeling

Word Cloud for Blog

First and foremost, I accidentally miscounted and neglected to post a tenth topic, so it is included in the following list:

(50 topics/1000 iterations/20 topic words printed)

Place: house side road passed walked front round garden hall windows path corner direction window standing ran houses yards led bicycle

Murder/death: found left body blood lay brought examined revolver round examination ground knife carefully wood death stick marks track dead spot

Letter/note: paper note read letter book pocket letters handed wrote written writing write sheet post document slip table reading date envelope

(60 topics/700 iterations/15 topic words printed)

Woman: woman lady wife young mrs girl love life husband child miss married story daughter beautiful

Spirits/ghosts: doubt lost danger dangerous clear life criminal law friend memory powers presence death care fear

Time: night heard morning evening clock ten past waiting house thirty usual surprise found quarter quiet

Crime: house found examined night body showed show clue signs finally death proved carefully carried servant

Money: years money ago twenty hundred lady king pounds gold months pay photograph age year thousand

Deduction process: case interest facts points point investigation remarked give follow incident theory interesting obvious run conclusion

Family: father made left happened death poor mother imagine story returned died strange mad truth butler

Though I found topic modeling to be an interesting concept and distant reading tool, I thought it was difficult to understand when I was configuring and selecting my own topics. I don’t think I was able to spend enough time with the program. Since I don’t have any background in programming, I felt like there was something I was missing. It was difficult for me even to get MALLET to compute the data in the first place. After that, I could go through the lists of words and find how many times they were used and, to an extent, the way they related to each other, so I was able to better grasp the use for this tool.

Looking at the words this way appears to be more effective in finding information about a lot of text than a word cloud is. A word cloud will display all of the words randomly and show their frequency [like above, displaying the frequency of the words in my topics]; MALLET will list words in relation to each other, so a reader will get a better idea of the themes throughout the collection of literature. In theory, this word cloud should illustrate a very condensed version of the Sherlock Holmes stories, but these are only words based on my selections of topics from topic modeling. To any reader outside of this blog, the word cloud above [which focuses mostly on death and bodies and seems to make the stories out to be much more morbid than they really are] could not possibly produce an authentic understanding of the text.

When I chose my topics, I picked out the ones that were the most intriguing to me. Some were simple and some didn’t make sense; for example, the final topic [the one I had forgotten] makes so little sense to me I don’t know how to title it, whereas the “woman” topic features only words that have direct correlations with the female gender. For the “family” topic, I finally chose that word to represent them all primarily because of “mother” and “father.” However, I still wonder what “strange,” “mad,” and “truth” have to do with the topic. Perhaps “family” is incorrect and the topic really has to do with “storytelling,” which is prevalent in the Holmes stories. Sherlock’s clients and/or Sherlock himself tell their stories in every individual mystery. Many of the topics feature at least one word that throws me off of what I think the topic is in general. So, for me, there is still a disconnect in the idea of distant reading as a comprehensible look at lots of text, but I’m really enjoying looking at new technological ways to consider and discuss literature.

Topic Modeling: So Many Words!

The concept of topic modeling, even at a surface glance through the reading of an article, seemed like a pretty complicated endeavor. The nature of the program itself was very interesting, despite the difficulties it presented. The program allows large volumes of literature to be analyzed through “topics”: sets of words from the text that are the most relevant to understanding it without actually reading through it.

My topic modeling experimentation included two different uses:

60 topics, 1500 iterations, 25 topic words; and 50 topics, 2000 iterations, 50 topic words

My first example gave me the most varied results because of the larger number of topic words. Although its number of iterations is lower than my second example’s, the number of topic words proved to be the true variable that determined more diverse results. The topics allowed me to understand the main point of the stories I chose to look at (through their topic words) without even reading any of them. This is extremely helpful considering there is a vast number of Sherlock Holmes stories to be read. And like most other digital humanities tools, this would be very helpful in creating an archive, or any other project that requires the reading of many texts.

Most of us were pressed for time, as this program is only installed on the computers in the lab where we have class. However, it was one of the many endeavors in life where, once you understand it, it becomes quite easy to pick up efficiently.

As shown in the very crowded wordle of all of my topic words, there were many results.

A deeper look at topic modeling

wordcloud

All categories chosen from 50 topics with 1000 iterations:

time – morning night back clock waiting past early morrow quarter arrived

writing – paper note read letter table book handed letters written wrote

physical features – face eyes looked thin features lips figure tall dark expression

household – woman lady wife husband life love girl child married maid

clothing/accessories descriptions – black hair red hat heavy round broad centre coat dress

death/crime – found man dead lay body blood death knife lying round

interrogation/crime solving – give matter idea reason question impossible occurred absolutely explanation true

physical reactions – face turned back instant hand sprang forward moment side head

transportation – station train road carriage passed side drive reached drove hour

darkness/mystery – light suddenly dark long caught sat lamp spoke silence silent

Using MALLET was an interesting experience. I enjoyed how simple and accessible the interface was. I had no trouble navigating the program and tweaking the iterations and so forth to my liking. I experimented with several numbers before choosing to analyze my topics with 50 topics, 1000 iterations, and a 10-topic-word selection. I tested extreme numbers to see how they would influence the data. In one trial I generated 500 topics with 3000 iterations. This resulted in data that was too specific, exploring topics tied to particular stories. I also generated as few as 10 topics with only 500 iterations. This produced topics too broad and vague to capture the essence of the mysteries. In the end I felt that narrowing it down to 50 different topics with 1000 iterations gave me a good sense of the Sherlock Holmes stories in a general yet helpful way. The word cloud above displays these words in a creative and interactive way.

The ten topics that I chose out of the fifty total were picked for their overall similarity. I assigned the simplest titles that I could think of to each of them to give a general structure for understanding the Sherlock Holmes stories as a collection. Ten basic concepts that are reflective of the entire collection are easier for the reader to grasp and accept. Each title represents an element of the stories that is imperative to the work as a murder mystery relative to the time it was written. Obviously topics such as death, crime, interrogation, and mystery are all blunt examples of what a mystery story encompasses. Some of the other topics, such as physical reactions and features, are more subtle examples yet serve just as important a role. The stories rely primarily on context clues and other literary devices that create an interesting and challenging mystery to solve. Things such as physical expressions and reactions are important elements of any mystery story because they can explain a lot about an individual character or the way they respond to certain situations. Another topic, such as clothing descriptions, seems to be part of the writing style of the collection of Sherlock Holmes stories. Holmes is an icon for mystery investigators, and the way that he is dressed is an important part of his appeal. The author pays a lot of attention to the way that Holmes’ dress is described, as well as that of other characters, throughout the entire series.

Topic modeling provides a unique framework for examining thousands or millions of texts at once. Distant reading is an interesting concept that I will hopefully be able to exercise in future research. The ability to apply your own ideas and lens to any given topic or series of works through topic modeling is something truly valuable that many other classic tools or academic research methods do not allow or facilitate.

Topic Modeling with MALLET: Analyzing the Results

Initially, it was difficult for me to understand the definition and purpose of topic modeling. However, after using MALLET, a topic modeling tool, to find patterns in Sherlock Holmes stories, I began to understand how topic modeling works.

After entering the Sherlock Holmes stories into MALLET, I found 10 good topics. The first 6 topics came from 50 topics, 1000 iterations, and 20 topic words printed. The topic names were Letter Writing, Crime, Marriage, Death, Clues, and Physical Description (Male). The other four topics came from 70 topics, 1500 iterations, and 15 topic words printed. These were Holmes in his Chair, Rooms in a House, London Finance, and Investigation Process. I experimented with other variations of iterations, topics, and topic words printed, but only had time to upload these output files onto my computer. By testing out many different variations I found that the more iterations and topic words you have, the easier it is to identify the topic name.

After I picked out my 10 topics, I clicked on the topic words within them in order to see the top ranked documents within that topic. MALLET then allowed me to see the number of words in a specific document that were assigned to that topic. I found, for example, that 22 words in a document from The Stock-Broker’s Clerk were assigned to the London Finance topic. The words in this topic were: money business work hundred answered good pounds company asked thousand advertisement city price headed pay. The document excerpt that MALLET showed at the top of the page revealed that this part of the story was about a “gigantic robbery” in which “nearly a hundred thousand pounds worth of American railway bonds” were found in the robber’s bag. This explains why 22 of the words within the document were assigned to London Finance. MALLET also showed that only 12% of the words in that entire document were assigned to this topic. I went through this same process with all of my topics to figure out which Sherlock Holmes stories discussed certain topics, and how many words in each story were assigned to those topics.
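Those two figures from MALLET (22 assigned words, 12% of the document) fit together with simple arithmetic, which also gives a rough estimate of how long the document chunk is:

```python
# Back-of-the-envelope check of MALLET's numbers for the London Finance topic:
# 22 words assigned to the topic, reported as 12% of the document's words.
assigned = 22
share = 0.12  # the 12% MALLET reported

# If 22 words are 12% of the document, the chunk holds roughly this many words:
total_words = assigned / share
print(round(total_words))  # 183
```

So the document chunk is on the order of 180-odd words, which is consistent with MALLET splitting the stories into fairly short passages.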

Altogether, I think topic modeling with MALLET is a great way of distant reading. MALLET proved to be efficient after it sifted through mass amounts of text from Sherlock Holmes stories and found patterns within them faster than most of us could even finish reading just one of those stories. There were a few aspects of MALLET, however, that I disliked. First, it creates enormous files. These files take up a lot of space, and this makes the process of transferring them onto Google Drive and onto other computers extremely slow. On top of this, some of the topics it creates are extremely difficult to decipher names for because the words didn’t seem to have much in common. A lot of the topics also reappeared after I changed the number of iterations, topics, and topic words (ex. London Finance, Death, Holmes in his Chair). I suppose that was inevitable though, because the text being read by MALLET didn’t change.

After completing this project, I understand that topic modeling tools such as MALLET are useful in that they can take texts and then find patterns in the use of words. Topic modeling is most effective when we have many documents/texts that we want to understand without actually closely reading each individual text (distant reading!).

Mary Dellas

Followup: 2000 iterations and a burning hot computer

My computer is not sluggish: it can handle Battlefield 4 on Ultra at 1080p/60fps (which, for you nongamers, means very fast and very good looking). However, it would seem skimming through text documents gives it some pause for concern. 62.976 seconds after starting up the topic modeling tool, though, my little machine spit out a list of 50 topics that could be isolated from the various words therein. So that one doesn’t need to refer back to my last post, here’s a refresher:

1. holmes word head words men message revolver shook life shot — Holmes, firearms, and investigations
2. light stood long suddenly lamp dark sound low shoulder figure — Stealth and sneakiness
3. clear doubt mind person possibly obvious idea excellent perfectly point — Deduction and flattery
4. make father made heard son returned left mr view point — Conspiracy and inheritance
5. eyes face man looked dark thin tall features companion pale — Description of characters
6. house small large stone great high place square windows houses — Houses and mansions
7. reason remember fear danger clear told chance strong horror family — Rationale
8. told heart knew god story hands life speak truth leave — Rationalization
9. matter understand position imagine call absolutely important trust force hope — Help me, Holmes, you’re my only hope
10. holmes mr professor fresh work aware surprise action great change — Sudden change in behavior

So, why did I choose these topics? They all had a primary commonality, being that they were about a general topic narrowed down to instances from their specific stories. Examples were plucked from specific passages, but these are overarching sentiments seen again and again in the archives. These sentiments are basic tropes in the mystery canon: implements of murder (1), men creeping in the shadows (2), a victim’s family rationalizing their sorrows (8), and, particularly for Holmes, a plea for help (9).

The simplicity of the fairly elaborate points here makes these 10 topics effective for getting a “feel” for Sherlock Holmes and the universe he inhabits. Together, they detail the basic elements of an average story. Thus, I believe them to be the most effective topics to be chosen out of this fairly bulky list.

As for the generation of the list, I experimented with a variety of settings before settling on the 50 topics/2000 iterations/10 topic word option. I tried as many as 500 topics and 5000 iterations, and as few as 10 topics and 500 iterations. The former produced too many specific topics, focusing on specific plot elements from specific stories. The latter produced too many broad topics, focusing on broadly used vocabulary words from many of the stories. I determined that an appropriate middle ground was found in the 50/2000/10 option, and I believe the topics chosen reflect that.

50 topics, 2000 iterations and a strangely sluggish i7 later

All from a cycle consisting of 50 topics, 2000 iterations, and 10 topic words.

1. holmes word head words men message revolver shook life shot — Holmes, firearms, and investigations
2. light stood long suddenly lamp dark sound low shoulder figure — Stealth and sneakiness
3. clear doubt mind person possibly obvious idea excellent perfectly point — Deduction and flattery
4. make father made heard son returned left mr view point — Conspiracy and inheritance
5. eyes face man looked dark thin tall features companion pale — Description of characters
6. house small large stone great high place square windows houses — Houses and mansions
7. reason remember fear danger clear told chance strong horror family — Rationale
8. told heart knew god story hands life speak truth leave — Rationalization
9. matter understand position imagine call absolutely important trust force hope — Help me, Holmes, you’re my only hope
10. holmes mr professor fresh work aware surprise action great change — Sudden change in behavior

Mallet Modeling Tool, Sherlock Edition

50 Topics, 1,000 Iterations, 20 Words Printed

Broken Home: young father left life time years poor son death man met sister boy mother ago fate died returned daughter family
Investigation: police crime evidence inspector murder death tragedy law person criminal official arrest present missing trace violence charge appeared committed effect
Household: lady wife woman husband maid mrs child character servants ferguson married rucastle lived madam mistress trouble children jack devoted nurse
Written Document: paper note table read papers box book pocket put handed writing written drew sheet glanced picked document slip envelope piece

40 Topics, 1,500 Iterations, 25 Words Printed

Characteristics: face eyes man thin lips features dark looked tall pale expression raised mouth figure gray beard drawn manner sprang handsome held eager fixed blue thinking
Emotions: cried hands face instant moment back god voice words quick sake minutes amazement cry spoke answer soul stared sank glimpse excitement burst heaven swear heavens
Physical Appearance: black red white hair hat head large broad coat heavy small middle set short dress cut brown round thick centre grey faced dressed clean glancing

60 Topics, 2,000 Iterations, 40 Words Printed

Schedule: night morning day clock morrow early leave breakfast work hours arrived sleep twelve dressed eleven bright spent slept fresh waited hour appointment porter reading moving caused lunch signs shortly meal awake tuesday matters victoria hearing reply earlier blood woking driven
Suicide: found man body dead lay blood head struck hand shot revolver blow knife stick heavy weapon unfortunate left death sign lying wound bullet handle formidable pistol finally escaped wounded tied fired carried world struggle dragged grotesque injury spot shirt gun
Traveling: hour half train station past carriage waiting cab quarter wait passed drive reached minutes drove started journey late ten line pulled hurried passing bridge town cross glancing hansom class reach brougham clearing nearer fast charing streets learn coachman cabman rattled