MALLET Joint Blog Post: Caitlin O’Brien and Lauren Alberti

Lowering the number of topics and iterations in MALLET can make the words in each topic more general and therefore much harder to categorize. We found that using a higher number of topics made the terms in each topic less vague. More iterations also removed some of the ambiguity when looking through the results and made it much simpler to see how the words in each group were related to one another and to the Holmes stories themselves.

Three topics we had in common were:

1. Murder:

Lauren: found dead man body blood blow struck knife lay stick head weapon finally wound unfortunate bullet handle lying acted fainted (70 topics, 1000 iterations)

Caity: found man body blood dead knife lay stick blow head carried weapon heavy finally unfortunate neck wound lying drawn struck (100 topics, 1000 iterations)

2. Time/Time Measures:

Lauren: hour half past clock time cab ten waiting quarter work wait late minutes back drive catch eleven immediately presently church (70 topics, 1000 iterations)

Caity: time years week ago year country months days age twenty (50 topics, 2000 iterations)

3. Writing:

Lauren: paper note letter read handed wrote written writing sheet write book post page began pen pencil slip ran printed torn (100 topics, 5000 iterations)

Caity: paper note read letter table book papers pocket letters written (50 topics, 2000 iterations)
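One rough way to put a number on how closely our two Murder lists agree is to count the words they share. The sketch below is only an illustration, not something either of us ran in MALLET: it uses plain Python sets on the two word lists quoted above, and the names `lauren_murder`, `caity_murder`, and `jaccard` are made up for this example.

```python
# Word lists copied from the two Murder topics above.
lauren_murder = (
    "found dead man body blood blow struck knife lay stick head weapon "
    "finally wound unfortunate bullet handle lying acted fainted"
).split()
caity_murder = (
    "found man body blood dead knife lay stick blow head carried weapon "
    "heavy finally unfortunate neck wound lying drawn struck"
).split()


def jaccard(a, b):
    """Share of distinct words found in both lists (0 = none, 1 = identical)."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


# Words that both runs placed in the Murder topic.
shared = sorted(set(lauren_murder) & set(caity_murder))
print(len(shared), "shared words:", " ".join(shared))
print("Jaccard overlap:", round(jaccard(lauren_murder, caity_murder), 2))
```

The two lists share 16 of their 24 distinct words, even though one run used 70 topics and the other 100.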

The topic of murder is found the most in “The Abbey Grange” and the least in “The Red-Headed League.” The topic of Time/Time Measures is found the most in “The Five Orange Pips” and the least in “The Illustrious Client.” The topic of Writing is found the most, again, in “The Five Orange Pips” and the least in “A Scandal in Bohemia.”

These are two questions we posed about the data:

1. Is this information important or useful to historians who are studying the time period in which the Sherlock Holmes stories were written?

2. How does the changing context of these stories change our interpretations of the data?

A Closer Look at Topic Modeling

The MALLET tool allows users to adjust several factors when generating topics, all of which can influence the topics that are produced: the number of iterations, the amount of text, the number of words in each topic, and the number of topics. Changing the number of topics determines how holistically the output represents the body of texts that MALLET learns from. A smaller number of topics leads to an html output whose topics are very prominent themes within the texts. A larger number of topics leads to an html output with much greater detail and variability. When using this topic modeling tool, I recommend 1000 iterations so that the tool has enough passes to learn as much about the text files as possible. For the number of topics, I recommend 50-100, which gives a great variety of outputs while keeping each topic a small enough subset of data to analyze broad themes in the texts.
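For anyone using the command-line version of MALLET rather than the graphical tool that produces the html output, the same settings can be passed as options to `train-topics`. The sketch below only builds the command and prints it; it does not run MALLET. The file names (`holmes.mallet`, `topic_keys.txt`, `doc_topics.txt`) and the helper `train_topics_command` are placeholders I made up, and the sketch assumes the stories were already imported into a `.mallet` file.

```python
import shlex


def train_topics_command(num_topics, num_iterations, num_top_words=20,
                         corpus="holmes.mallet"):
    """Build (but do not run) a MALLET train-topics command as an argument list."""
    return [
        "bin/mallet", "train-topics",
        "--input", corpus,                        # hypothetical imported corpus file
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--num-top-words", str(num_top_words),    # words printed per topic
        "--output-topic-keys", "topic_keys.txt",  # hypothetical output: top words per topic
        "--output-doc-topics", "doc_topics.txt",  # hypothetical output: topic shares per story
    ]


# The settings recommended above: 1000 iterations with 50 or 100 topics.
for k in (50, 100):
    print(shlex.join(train_topics_command(k, 1000)))
```

Keeping the settings in one small function makes it easy to rerun the same corpus with different numbers of topics and compare the results side by side.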

I was unable to save the metadata for my project files and unable to go back to the classroom to redo this assignment, so I cannot identify the percentages of story origins for each topic. However, my three favorite topics for this assignment were “Investigation,” “Murder,” and “Attack.” Investigation came from the list with 75 topics, Murder from the list with 50 topics, and Attack from the list with 30 topics. I speculate that together these topics cover most of the world of the Sherlock Holmes stories. Interestingly enough, despite the variation in the lists that the topics come from, there is a consistent theme among them: an attack, a murder, and then an investigation is a common sequence in Arthur Conan Doyle’s stories.

The amount of metadata provided allows for a variety of questions to surface, such as what the publication dates are for stories involving murder and whether one can identify trends in the frequency of murder in a plot versus an attack, etc.

Mallet and Sherlock Holmes

Using MALLET, the quality of the output is a balancing act. The more topics there are, the more room there is for interpretation. With fewer topics, it is easier to settle on one term that describes each topic. With too few topics, however, the results become uncertain and it is hard to pinpoint a term. Higher iteration numbers allow us to work with more precise word combinations. The settings I recommend for topic modeling with MALLET are 20 topics/250 iterations/10 words printed. I chose these settings because I felt that when I used a narrower search, MALLET gave me an easier-to-read and less complicated set of results. I was able to pinpoint a key term when I had less material to look at.

Three Topics

Hallway/Setting: Empty House (EMPT)

Angela – room, door, open, window, entered, opened, key, rushed, closed, bedroom, passage, instant, locked, floor, stair, pushed, lock, stairs, led, safe

Mike – door, room, opened, open, heard, key, light, sound, passage, stood, inside, closed, hall, entered, locked, steps, pass, lock, dressing, stair

Office/Text: The Stockbroker’s Clerk (STOC)

Mike – small, pocket, put, study, drew, cut, papers, attention, eye, safe, examination, bird, piece, left, cigar, mark, thumb, finger, seat, interest

Angela – paper, note, read, letter, letters, book, handed, table, papers, written, message, writing, wrote, address, short, sheet, post, write, importance, document

Attributes: Creeping Man Story (CREE)

Mike – face, man, head, hand, dark, black, cried, instant, turned, white, suddenly, figure, opened, quick, sight

Angela – man, face, eyes, dark, figure, looked, tall, head, drawn, black, features, mouth, thin, middle, appearance, deep, huge, beard, nose, lines

The story “Empty House” uses the topic Hallway/Setting the most, to describe the appearance and the locations within the story. The story “The Stockbroker’s Clerk” uses the topic Office/Text the least. The topic is used to describe a few locations, but other than that it isn’t mentioned much.

Sample Questions

  1. How do the setting and descriptions reflect the tone of the scene and of the overall plot?
  2. How does Conan Doyle incorporate description differently than other 19th-century authors?

Angela and Mike

Joint Topic Modeling: Anne Flamio and Ailise Schendorf

After discussing the Mallet project together, we came to the conclusion that fewer topics make it easier to identify the subject of each topic, and that more iterations make the words relate better to the topic. However, too few topics can make the topic modeling too broad, and too many iterations make Mallet take longer to come up with the topic models. We recommend 50 topics and 1500 iterations for the best and fastest results.

House Decor  (50 topics, 3000 iterations)- house front large round side led window left garden windows small low standing lawn close houses lane high centre building huge gate iron wooden grounds narrow park trees chamber elderly lined walls doors upper ancient rain oak fashioned leads avenue appeared winding entrance century barred modern evidently brick surrounded plainly

This data was taken most from The Speckled Band and least from Resident Patient.

Question 1: When are home landscapes an important part of Sherlock stories?

Question 2: Can descriptions of the home factor into the plot of the story?

Feminine (100 topics, 200 iterations) lady young woman wife maid child girl left beautiful ill give ferguson making mistress notice devoted possibly nurse poor frances

This data was taken most from A Scandal in Bohemia and least from Six Napoleons.

Question 1: Do women play a prominent role in Sherlock Stories?

Question 2: What can be inferred about the time period from the words used to describe women?

Murder (30 topics, 2000 iterations) found dead body death crime left murder police close attention finally struck blood part blow remained knife examination stick tragedy

These words were found most in The Final Problem and The Adventure of the Blue Carbuncle

Question 1: Are all of the Sherlock mysteries based on murder?

Question 2: How graphic is the murder described in the stories?

 

 

Topic Modeling Group Project

While working with MALLET, we noticed that a lot of different factors change the types of topics you will get. Here are some of the things which we noticed affected our results.

  • Number of Topics: The number of topics affects the type of topics you get because if you let the computer sort the words into more categories, the categories will have more variety than if you have just a few to choose from. Having more variety makes you think outside the box about what a specific topic really means.
  • Number of Iterations: The number of iterations affects the topics the tool gives you because you get more words to work with, creating more of a complex sentence with more foundation.

I found that the best settings for me were to let the computer sort the data 1000 times, into 100 categories. It gave me a lot to work with, so I didn’t get caught up on the topics that meant nothing to me.

These were the three categories we found the most interesting, and the stories in which they appeared the most and the least.

  1. Manliness- sat pipe fire laid smoke tobacco blue corner lit armchair cigar hung silent gas brandy smoked smoking comfortable shining bachelor (MOST: The Man with the Twisted Lip; LEAST: His Last Bow)
  2. Transportation- train station carriage cab drive waiting journey drove town cross started line follow fresh bridge reach passing hansom class reached (MOST: The Final Problem; LEAST: The Noble Bachelor)
  3. Evidence- facts obvious clear person theory impossible explanation question idea perfectly mind means confess formed affair absurd probable possibly evident correct (MOST: Boscombe Valley Mystery; LEAST: The Adventure of the Red-Headed League)

I think that this raises a few questions. Mainly: how accurate is this data in representing ALL of the Holmes stories (considering each has its own specific themes), and how do these topics change chronologically as the stories were published?

~Austin Carpentieri & Sammy Harris

Topic Modeling Partner Blog Post – Carly Rome and Jacquie Behan

Lowering the number of topics causes MALLET to generate broader topics. Lowering the number of topics too much can make topics too vague. We recommend using a higher number of iterations for topic modeling through MALLET because it makes it easier to identify common topics among the words generated. However, too many topics combined with too many iterations can make the results too specific, which isn’t helpful either.

Comparing our word lists, we both had these topics:

  1. Crime- (Carly – Crime: police man inspector lestrade colonel crime death reason evidence murder mystery affair case present remained criminal force arrest account undoubtedly/Jacquie- Crime- man police found inspector dead crime death body evidence reason murder blood night shot person)
  2. Murder/Death- (Carly- Death: found end lay dead hand part body long deep lost close blood finally carried showed attention broken shot leaving horror/ Jacquie- Murder- found dead man death crime body evidence terrible unfortunate attempt violence words occurred instantly action save murderer committed escape murdered)
  3. Characteristics/Appearance of a Person- (Carly- Characteristics: face eyes man red black white dark looked thin features hair lips tall appearance blue drawn expression pale heavy hat/ Jacquie-Appearance-face eyes man dark tall looked features expression thin lips pale mouth figure companion appearance)

Crime was found the most (74 times) in the Second Stain story and the least (2 times) in the Musgrave Ritual. Murder/death was found the most (28 times) in the Norwood Builder and the least (2 times) in the Noble Bachelor. Characteristics/appearance of a person was found the most (27 times) in the Mazarin Stone story and the least (2 times) in The Speckled Band.

Our two questions about the data are:

  1. Can any of the words in these topics have double meanings or be misinterpreted without context? How can this affect the accuracy and helpfulness of topic modeling?
  2. Is it okay to ignore one or two words that may not correlate with the rest of the words within a topic? Why or why not?

Sherlock Holmes Topic Modeling

Word Cloud for Blog

First and foremost, I accidentally miscounted and neglected to post a tenth topic so it is included in the following list:

(50 topics/1000 iterations/20 words printed)

Place: house side road passed walked front round garden hall windows path corner direction window standing ran houses yards led bicycle

Murder/death: found left body blood lay brought examined revolver round examination ground knife carefully wood death stick marks track dead spot

Letter/note: paper note read letter book pocket letters handed wrote written writing write sheet post document slip table reading date envelope

(60 topics/700 iterations/15 words printed)

Woman: woman lady wife young mrs girl love life husband child miss married story daughter beautiful

Spirits/ghosts: doubt lost danger dangerous clear life criminal law friend memory powers presence death care fear

Time: night heard morning evening clock ten past waiting house thirty usual surprise found quarter quiet

Crime: house found examined night body showed show clue signs finally death proved carefully carried servant

Money: years money ago twenty hundred lady king pounds gold months pay photograph age year thousand

Deduction process: case interest facts points point investigation remarked give follow incident theory interesting obvious run conclusion

Family: father made left happened death poor mother imagine story returned died strange mad truth butler

Though I found topic modeling to be an interesting concept and distant reading tool, I thought it was difficult to understand when I was configuring and selecting my own topics. I don’t think I was able to spend enough time with the program. Since I don’t have any background with programming, I felt like there was something I was missing. It was difficult for me even to get MALLET to compute the data in the first place. After that, I could go through the lists of words and find how many times they were used and, to an extent, the way they related to each other, so I was able to better grasp the use for this tool. Looking at the words this way appears to be more effective for finding information about a lot of text than a word cloud. A word cloud will display all of the words randomly and show their frequency [like above, displaying the frequency of the words in my topics]; MALLET will list words in relation to each other, so a reader will get a better idea of the themes throughout the collection of literature. In theory, this word cloud should illustrate a very condensed version of the Sherlock Holmes stories, but these are only words based on my selections of topics from topic modeling. To any reader outside of this blog, the word cloud above [which focuses mostly on death and bodies and seems to make the stories out to be much more morbid than they really are] could not possibly produce an authentic understanding of the text.
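To make the word cloud point concrete, here is a small sketch of the kind of counting a word cloud does. It is written in Python purely as an illustration (it is not how the cloud above was made), and it counts how many of my ten topic lists each word shows up in. The variable names are my own invention, and the counts are appearances across my topic lists, not occurrences in the stories themselves.

```python
from collections import Counter

# The ten topic lists from this post, keyed by the titles I gave them.
topics = {
    "Place": "house side road passed walked front round garden hall windows path "
             "corner direction window standing ran houses yards led bicycle",
    "Murder/death": "found left body blood lay brought examined revolver round "
                    "examination ground knife carefully wood death stick marks "
                    "track dead spot",
    "Letter/note": "paper note read letter book pocket letters handed wrote written "
                   "writing write sheet post document slip table reading date envelope",
    "Woman": "woman lady wife young mrs girl love life husband child miss married "
             "story daughter beautiful",
    "Spirits/ghosts": "doubt lost danger dangerous clear life criminal law friend "
                      "memory powers presence death care fear",
    "Time": "night heard morning evening clock ten past waiting house thirty usual "
            "surprise found quarter quiet",
    "Crime": "house found examined night body showed show clue signs finally death "
             "proved carefully carried servant",
    "Money": "years money ago twenty hundred lady king pounds gold months pay "
             "photograph age year thousand",
    "Deduction process": "case interest facts points point investigation remarked "
                         "give follow incident theory interesting obvious run conclusion",
    "Family": "father made left happened death poor mother imagine story returned "
              "died strange mad truth butler",
}

# Count, for every word, the number of topic lists that contain it.
counts = Counter(word for words in topics.values() for word in words.split())

for word, n in counts.most_common(3):
    print(word, n)
```

“Death” appears in four of the ten lists, while “house” and “found” each appear in three, which helps explain why the cloud looks so morbid: the repetition comes from which topics I picked, not from the stories as a whole.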

When I chose my topics, I picked out the ones that were the most intriguing to me.  Some were simple and some didn’t make sense – for example, the final topic [the one I had forgotten] makes so little sense to me I don’t know how to title it, whereas the “woman” topic features only words that have direct correlations with the female gender.  For the “family” topic, I finally chose that word to represent them all primarily because of “mother” and “father.”  However, I still wonder what “strange,” “mad,” and “truth” have to do with the topic.  Perhaps “family” is incorrect and the topic is really to do with “storytelling,” which is prevalent in the Holmes stories.  Sherlock’s clients and/or Sherlock himself tell their stories in every individual mystery.  Many of the topics feature at least one word that throws me off of what I think the topic is in general.  So, for me, there is still a disconnect in the idea of distant reading as a comprehensible look at lots of text, but I’m really enjoying looking at new technological ways to consider and discuss literature.

MALLET Topic Modeling (Part Two)

I used several different numbers and played around with them a bit to find my topics. In all I used 5 different settings (50 topics, 100 iterations, 10 words; 50 topics, 2000 iterations, 10 words; 50 topics, 1000 iterations, 20 words; 100 topics, 2000 iterations, 20 words; 100 topics, 1000 iterations, 20 words) to find my 10 different topics. The topics I chose seemed very similar and gave a basic overview of what the Sherlock Holmes series could be about. Some words in each general topic were part of a specific story. Still, they also gave insight into what the Sherlock Holmes series is all about.

The more obvious topics that relate to the Sherlock Holmes stories include Murder, Investigation, Home/Baker Street, and Observation. It makes sense that terms relating to these topics would be found most often in the stories and be grouped together. Holmes investigates murders (as well as other crimes), lives on Baker Street (where he meets with almost all of his clients when they ask him to do investigations for them), and has very advanced powers of observation, which he uses to solve the cases he takes on. These 4 topics provide a basic premise of the Sherlock Holmes stories: who is involved, where they are set, and what the key plots are.

The other topics that I chose may not at first glance seem to relate to Sherlock Holmes, but they are very important in understanding Sherlock and his methods. Money is a topic that may seem out of place, but money is seen as being very important to Sherlock in the stories. Holmes would always like to know in advance how much he will be paid. One can infer that he is so confident in his abilities that he thinks anyone can pay him in advance. Another topic I found was Actions, most interestingly the words “sit” and “laid.” This shows that Sherlock is very comfortable with his surroundings as he listens (another action word) to his new clients’ stories. A topic that may be seen as a bit of a stretch to connect to Sherlock Holmes is Married Life. The terms in this topic start out positive but soon become negative, with words like “spite” and “hate.” Sherlock could be dealing with marriages gone wrong. Clothing (and its descriptors) and Writing can almost be lumped together because they are two things that Sherlock observes the most when he is trying to solve a case. The last topic, Time Measures, included the words “time,” “week,” and “days,” so the reader can tell that even though Sherlock is a great detective, it may take him some time to figure out the cases that he takes on.

The MALLET tool was easy to use and allowed me to change the settings to get the results I wanted as much as I wanted. I believe the topics that I found, even the ones that don’t seem obviously connected to the Sherlock Holmes stories, provide a background of information for someone trying to get the general themes and ideas behind the stories, whether they be Digital Humanities students like us or just first-time readers.

Mallet Lab

Screen Shot 2014-10-30 at 12.08.29 AM

 

After using the Mallet Lab, it was interesting to see how straightforward and easy to navigate it was. After using the Topic Modeling tool, it was interesting to see a lot of the same words being used to describe certain sentences within the passage. Above is the word cloud composed of the words I uncovered with the Topic Modeling tool. The words that showed up most often were investigation, blood, crime, and murder. The other words were used as well, but those stuck out the most.

What I found most useful about the Topic Modeling tool was how easy it was to navigate. Really, the only thing we had to do as a class was type in the number of topics and iterations, and the system did the rest for us. Overall, it was quite interesting to see the number of times a word was used, and to see how the tool put certain words together in a sentence (that did not thoroughly flow) to explain a certain topic.

Sherlock Holmes novels can sometimes be difficult to understand, but with the help of the Topic Modeling tool, it makes it a lot easier to understand exactly what topics are apparent within the novel/passage.   Most of the time, a lot of the words that are used are there to explain the theme within the passage.  For example: Sherlock Holmes is a detective, therefore it would make sense that a majority of the keywords used involved investigation, crime, and murder.  This helps the reader ultimately understand without reading the full novel that the book/passage is about investigative activity, which can be extremely useful.

Overall, topic modeling is a very useful tool when it comes to literature. It helps the reader look past some of the less important words and focus on the topics and themes that either recur or explain the story.

MALLET

50 Topics

Feminine

5. lady young woman wife maid child girl left beautiful ill give ferguson making mistress notice devoted possibly nurse poor frances

75 Topics

Sherlock’s Mannerisms

61. holmes chair sat fire pipe back asked arm rose glanced

Decor

67. light dark side lamp low match floor wall darkness held

100 Topics

Detective

40. london criminal learned dangerous attempt career order begin failed prepared chance due care fair succeeded

Travel

44. train station carriage line reached journey bridge hurried drove roof observed body opposite walk town

Murder

62. body lay blood shot found revolver blow dead head knife drawn fell weapon stick picked

Door

81. door room opened key safe locked side inside closed lock open shut fastened conscious doors

Assistant

88. watson dear surely fellow complete touch dangerous hope meet read draw report final form sufficient

500 Topics

Setting

40. front house high drive road trees standing park windows place miles building sun trap drove

Furniture

178. bed bell rope end ventilator cord mantelpiece hung observe chair pull inches ring dummy fastened

Mallet is a useful tool specifically for looking at the broader idea of an author or a story. It definitely is a distant reading tool, as I could not see it being of value to anyone who is close reading. Mallet is very helpful in that it takes minutes to go through all the information (in this case, every Sherlock Holmes story) and sort it into groups based on similarity. It would take a person a very long time just to read through and record all the words in the Sherlock Holmes stories, much less sort them into hundreds of groups based on patterns. I thought it was particularly interesting what the tool said about the collection of Sherlock Holmes stories.
The grouping of the words was rather obscure and, more often than not, very difficult to figure out. Even when I had 500 groups, it was not as simple to understand the groupings as it would be if a person had sorted the words. Had a human being grouped the words, I think it would have been much easier to figure out the pattern of the groupings. Often, I found that even if a group of words was rather obvious, there would be one or two words in the group that came completely out of left field. This was the most difficult and frustrating part about using Mallet.

Screen Shot 2014-10-29 at 11.46.28 PM

The word cloud really emphasizes the mystery and action of a Sherlock Holmes story. Some of the biggest words are danger, fasten, drove, and body. Danger is one of the key elements that adds to the mystery in any Sherlock Holmes story. Drove and fasten go together well: fasten often brings to mind fastening your seatbelt on a dangerous or intense ride. In a lot of stories, there is a murder or a body to be found, so it makes sense that body is a big word. Together, these words give the essence of a good Sherlock Holmes story.