Topic Modeling

Iterations: 1500

Topics: 20

Topics Printed: 20

  1. ESTATE: house road passed round side place walked carriage garden left master dog horse hall drive path led ground walk standing
  2. CRIME: man young found inspector house father colonel dead police heard death body attention son crime evidence murder returned dangerous hopkins
  3. PUBLIC: street found home back station train lord james baker minutes st occurred waiting reached cab hours police late order town
  4. SEARCH: thought time make give made leave knew great find back hear place things doubt bring chance lost position fellow danger
  5. EXPRESSION: cried face turned back hands instant hand suddenly head moment voice words sprang forward eyes fell appeared feet lips threw
  6. APPEARANCE: man eyes face black looked red dark white figure deep hair thin features hat heavy drawn tall appearance blue sharp
  7. REASONING: clear mind point reason question person find matter idea make means absolutely secret presence impossible save excellent aware explanation sign
  8. CASE: interest sherlock case facts strange remarkable friend singular account london cases arthur nature problem extraordinary details public effect give find
  9. SILENT REFLECTION: holmes chair sat gave mrs companion fire fresh visitor rose pipe easy start table glanced silence cold silent horror change
  10. BUSINESS: business london money men papers set office answered letters brother made hundred work address man considerable great company west mycroft

Lauren Gao’s Topic Modeling

Using MALLET’s topic modeling program, DHM293 ran all 56 Sherlock Holmes short stories with various settings. After changing the number of iterations and topics, I finally settled on settings of 100 topics and 2000 iterations, with 20 words in each topic.
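For anyone who would rather script a run than click through a tool, the settings above correspond to options on MALLET's command line. This is only a minimal sketch: it assumes MALLET is installed and runnable as `bin/mallet`, and the folder and file names (`stories`, `stories.mallet`, `topic_keys.txt`, `doc_topics.txt`) are hypothetical placeholders, not the ones used for these posts. It builds the two commands without running them.

```python
import shlex


def build_mallet_commands(num_topics=100, num_iterations=2000, num_top_words=20,
                          mallet="bin/mallet", corpus_dir="stories",
                          model_file="stories.mallet"):
    """Return the import and training commands as argument lists.

    All paths are hypothetical placeholders; adjust them to your own setup.
    """
    import_cmd = [
        mallet, "import-dir",
        "--input", corpus_dir,        # folder of plain-text stories (assumed)
        "--output", model_file,
        "--keep-sequence",            # keep word order, needed for topic modeling
        "--remove-stopwords",         # drop common function words
    ]
    train_cmd = [
        mallet, "train-topics",
        "--input", model_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--num-top-words", str(num_top_words),
        "--output-topic-keys", "topic_keys.txt",   # top words per topic
        "--output-doc-topics", "doc_topics.txt",   # topic mix per document
    ]
    return import_cmd, train_cmd


# Print the commands so they can be inspected or pasted into a terminal.
for command in build_mallet_commands():
    print(shlex.join(command))
```

Changing `num_topics`, `num_iterations`, or `num_top_words` in the call is the scripted equivalent of trying different settings by hand.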

1) Murder in Sherlock Holmes

crime death police murder reason charge scene tragedy night committed arrest violence evidence murdered motive constable caused suspicion escape attempt

2) Watson

watson dr doctor friend means surprised matter natural blessington amberley patient disease days medical continued knowing reasons armstrong trevelyan brougham

3) Men in Sherlock Holmes Stories

face man eyes dark thin tall expression figure features looked beard voice middle manner handsome gray clean age huge fierce

4) Women in Sherlock Holmes

woman wife husband love life knew girl loved married lady women rich daughter soul beautiful power nature beauty marriage young

5) Transportation in Sherlock Holmes

home minutes cab waiting heard wait glad ten ha walking twenty church quiet send reach talking feel driven long drove

6) Deducing in Sherlock Holmes

case facts points explanation fact simple theory admit investigation give solution problem confess correct present obvious formed probable connection false

7) Holmes’ mannerism

holmes head hands shook easy smiled sank sunk breast short forehead gesture rubbed began forward clapped despair branch leaning eagerly

8) Villains in Sherlock Holmes

great doubt criminal dangerous country brain set career act failed makes gang cunning power war europe compelled sufficient traced remains

9) Smoking in Sherlock Holmes

sat pipe fire looked time cigar tobacco smoke asked sherlock corner long chair armchair smoked lit roylott smoking moran observe

10) Accents in Sherlock Holmes

don ll ve won talk thing give answered didn bit ready bad couldn wait eh minute masser wouldn isn lucky

Topic Modeling Group Project

While working with MALLET, we noticed that a lot of different factors change the types of topics you will get. Here are some of the things which we noticed affected our results.

  • Number of Topics–The number of topics affects the kind of topics you get. If you let the computer sort the words into more categories, the topics will show more variety than if it has only a few categories to choose from. That extra variety pushes you to think outside the box about what a specific topic really means.
  • Number of Iterations–The number of iterations affects the topics the tool gives you because you have more words to work with, which creates more of a complex sentence with more foundation.

I found that the best settings for me were to let the computer sort the data 1000 times, into 100 categories. It gave me a lot to work with, so I didn’t get caught up on the topics that meant nothing to me.

These were the three categories we found the most interesting, along with the stories in which they appeared the most and the least.

  1. Manliness- sat pipe fire laid smoke tobacco blue corner lit armchair cigar hung silent gas brandy smoked smoking comfortable shining bachelor
     MOST: The Man with the Twisted Lip    LEAST: His Last Bow
  2. Transportation- train station carriage cab drive waiting journey drove town cross started line follow fresh bridge reach passing hansom class reached
     MOST: The Final Problem    LEAST: The Noble Bachelor
  3. Evidence- facts obvious clear person theory impossible explanation question idea perfectly mind means confess formed affair absurd probable possibly evident correct
     MOST: Boscombe Valley Mystery    LEAST: The Adventure of the Red-Headed League

I think that this raises a few questions. Mainly: How accurate is this data in considering ALL of the Holmes stories (considering each has its own specific themes), and how do these topics change chronologically as the stories were published?
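One way to look past the single MOST and LEAST story for a topic is to rank every story by how much of it the topic accounts for. Here is a minimal sketch, assuming you already have one weight per story for a topic (for example, taken from the tool's per-document output). The story names and numbers below are made up for illustration and are not our results.

```python
def rank_stories(topic_weights):
    """Sort (story, weight) pairs from the heaviest use of a topic to the lightest.

    topic_weights maps a story title to the share of that story's words
    assigned to one topic.
    """
    return sorted(topic_weights.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights for one topic; replace with real output from a run.
example_weights = {"Story A": 0.09, "Story B": 0.05, "Story C": 0.01}

ranking = rank_stories(example_weights)
most, least = ranking[0], ranking[-1]
print("MOST:", most[0], "LEAST:", least[0])
```

Seeing the whole ranking, instead of only the two extremes, makes it easier to judge whether a topic is spread across all of the stories or concentrated in a few.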

~Austin Carpentieri & Sammy Harris

Using MALLET for Topic Modeling

Travis Miller and Simeon Allocco

When using MALLET we found that changing the number of topics made a significant difference. We first decreased the number of topics from 50 to 35, and then we increased it from 35 to 65. After doing so we realized that the word lists contained many more nouns and fewer verbs when we decreased the number of topics. This made the sets of words much more concise, making it easier to generalize a topic name for them.

When fiddling with the number of iterations we did not see any difference in the patterns at all, besides the fact that the words themselves were different. Aside from that there was no visible change in verb and noun usage. One setting that we strongly recommend when using MALLET is the remove stop words option in the advanced settings. This will cut out unnecessary words that are insignificant to the actual theme of the topic, making it much easier to analyze.
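To show what removing stop words does, here is a minimal sketch in Python. It is not MALLET's own code: the tiny stop list and the function name `remove_stop_words` are our own assumptions for illustration, and a real stop list contains several hundred words.

```python
import re

# A tiny illustrative stop list (an assumption, far shorter than a real one).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "it", "he", "his", "that", "i"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and drop the stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


sample = "He sat in the armchair and lit his pipe."
print(remove_stop_words(sample))  # ['sat', 'armchair', 'lit', 'pipe']
```

Only the words that carry the theme are left, which is why the topics are easier to name with this option turned on.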

Favorite Topics:

Clues:

This topic was used the most in The Adventures of Sherlock Holmes: The Five Orange Pips, while it was used the least in the Adventure of Silver Blaze.

Questions:

Does date of publication affect the frequency of this topic?

Is this data reliable since this topic is so ubiquitous in the Sherlock Holmes Series?

Investigation:

This topic was used most in the Adventure of the Empty House and it was used the least in the Adventure of the Red-Headed League.

Questions:

What differentiates this topic from the last topic?

How popular was this topic compared to the others?

Death:

Death showed up the most in the Adventure of the Gloria Scott and showed up the least in the Adventure of the Bruce-Partington Plans.

Questions:

The name Garcia pops up in the list of words for this topic. Why is that?

Was death a very popular topic during this time period?

Discussing Topic Models with Mary Dellas and Joe Mausler

After discussing the process and results of topic modeling using MALLET, we know that the fewer topics we have, the broader the topic category MALLET gives us. The more iterations we have, the easier it is to identify a topic name. We recommend the default settings we used in class: 50 topics, 1000 iterations, and 20 topic words. This setting gave us enough topic words to determine a topic name, but not so many that it became confusing and repetitive.

These are our three favorite topics:

1. Physical Description (Male): face man eyes looked thin dark features tall expression appearance middle high pale figure set glasses gray keen clean bear

  • a) The top ranked document in the Physical Description (Male) topic is Charles Augustus Milverton. 26 words in the document are assigned to this topic.
  • b) The story The Sussex Vampire uses this topic the least (2 times).
    • Question 1: Even though 26 words in the document are assigned to Physical Description (Male), does this imply that this document is entirely dedicated to the topic Physical Description (Male)?
    • Question 2: Why does it seem like some of the words (ex. set, bear) do not relate to the other words in the topic?

2. Letter Writing: paper note read letter table book box letters papers written handed wrote writing sheet brought importance post write document address

  • a) The top ranked document in the letter writing topic is The “Gloria Scott”. 18 words in the document are assigned to this topic.
  • b) The story Shoscombe Old Place uses the topic least (2 times).
    • Question 1: Why does the same story name appear multiple times on the list of the top ranked documents?
    • Question 2: When we click the story chunk, why is MALLET only showing us a small part of the document?

3. Crime: police crime case night evidence murder death account occurred arrest unfortunate effect tragedy violence complete charge appeared reason terrible committed

  • a) The top ranked document in the crime topic is The Second Stain. 62 words in The Second Stain were assigned to this topic.
  • b) We found that The Priory School uses crime the least–a total of two times.
    • Question 1: Is crime a more common topic in the later Sherlock Holmes stories or the earlier ones?
    • Question 2: Can MALLET tell us how many stories in total discuss crime?
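Several of the questions above come down to counting. If the tool splits each story into chunks, as the "story chunk" wording suggests, then one story can show up several times in a ranking, and the per-chunk counts need to be added up before stories can be compared. The sketch below is only an illustration under that assumption: the function names are our own, and the chunk counts are hypothetical, not taken from our run.

```python
from collections import defaultdict


def words_per_story(chunk_counts):
    """Add up per-chunk topic word counts into one total per story.

    chunk_counts is a list of (story_title, words_assigned_to_topic) pairs,
    one pair per chunk, so a story appears once for every chunk it has.
    """
    totals = defaultdict(int)
    for story, count in chunk_counts:
        totals[story] += count
    return dict(totals)


def stories_using_topic(totals, minimum=1):
    """List the stories with at least `minimum` words assigned to the topic."""
    return sorted(story for story, count in totals.items() if count >= minimum)


# Hypothetical chunk-level counts for one topic.
chunks = [("Story A", 26), ("Story B", 2), ("Story A", 11), ("Story C", 0)]

totals = words_per_story(chunks)
print(totals)                            # {'Story A': 37, 'Story B': 2, 'Story C': 0}
print(len(stories_using_topic(totals)))  # 2
```

A count like 26 words only says how many words in one chunk were assigned to the topic. Dividing by the length of the chunk would show how small a share of the text that really is.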

MALLET Results MichealF

[Image: word cloud 2]

Posted above is my word cloud made with my MALLET results. We had used MALLET previously in class, and it was interesting to create a key word or category for a group of related words. Making the topic models ourselves, however, was a different experience: I got to see what goes into making them.

I used 4 separate combinations when topic modeling. My first search was 50 topics/1000 iterations/20 words printed. Within this search I picked the 3 word lists that were easiest to categorize. The topics for the three examples I chose were “Hallway”, “Communication”, and “Study/Office”. The second search I did was 25 topics/500 iterations/15 words printed. The three examples were “Case”, “Suspect” and “Evidence”. The third search I did was 20 topics/250 iterations/10 words printed. The four examples I chose were “Suspicious”, “Location”, “Discover/Trace” and “Attack/Violence”.

As I went, I felt that it would be best to narrow my search settings each time. My reasoning was that by narrowing my search query, I would get more accurate results every time. I felt that printing more words in a search result would make the topic harder to categorize because there are more words that you need to relate to each other. The models I got with narrower settings were easier to understand and easier to categorize. Overall, topic modeling using MALLET was a helpful way to try to find main themes throughout all the Sherlock Holmes stories, and I look forward to doing it again in class if given the opportunity.

Topic Modeling Analysis and Word Cloud

[Image: Wordcloud]

My topics for MALLET were really interesting, and I think that they say a lot about Sherlock Holmes as a whole. One of the first ones I came across was one that I entitled “Evidence.” This category had words like “facts, clear, theory, possibly” and many others. The importance of this category to the Sherlock Holmes stories cannot be overstated. Obviously, to a detective, evidence is a pretty important thing. I found many other categories which one would expect to find in detective stories (e.g. Crime and Investigation), but some of the others were a little more interesting. Take for example a category I named “Manliness.” This category had words like “pipe, fire, smoke, tobacco, armchair, cigar” and “brandy.” Just from these words alone, one can get the image of a wax-mustachioed man, sipping brandy and smoking a pipe by the fireside. While this is not exactly how anyone in the Sherlock Holmes stories is portrayed, it does have a certain feel that you get from these stories: an almost Rudyard Kipling type of ambiance. Another big category I noticed, I named “Transportation.” In it were words like “train, station, carriage, cab, drive, waiting” and “journey.” I think that this category illustrates that transportation is a big part of the stories, and also shows that there is not just one way of getting around that the stories focus on. Sherlock and Watson use train, automobile, walking, carriage, and almost any other type of transportation that you can imagine. They are always going somewhere. These were the most interesting and telling categories I discovered with the MALLET tool, and upping the number of words in the categories really did help with creating some more unique categories. Overall, I really enjoyed using MALLET, and look forward to using it in the future.

~Austin Carpentieri

Followup: 2000 iterations and a burning hot computer

My computer is not sluggish- it can handle Battlefield 4 on Ultra at 1080p/60fps (which, for you nongamers, means very fast and very good looking). However, it would seem skimming through text documents gives it some pause for concern. 62.976 seconds after starting up the topic modeling tool, though, my little machine spit out a list of 50 topics that could be isolated from the various words therein. So that one doesn’t need to refer back to my last post, here’s a refresher:

1. holmes word head words men message revolver shook life shot — Holmes, firearms, and investigations
2. light stood long suddenly lamp dark sound low shoulder figure — Stealth and sneakiness
3. clear doubt mind person possibly obvious idea excellent perfectly point — Deduction and flattery
4. make father made heard son returned left mr view point — Conspiracy and inheritance
5. eyes face man looked dark thin tall features companion pale — Description of characters
6. house small large stone great high place square windows houses — Houses and mansions
7. reason remember fear danger clear told chance strong horror family — Rationale
8. told heart knew god story hands life speak truth leave — Rationalization
9. matter understand position imagine call absolutely important trust force hope — Help me, Holmes, you’re my only hope
10. holmes mr professor fresh work aware surprise action great change — Sudden change in behavior

So, why did I choose these topics? They all had one thing in common: each is a general topic that shows up in specific instances across the stories. The examples were plucked from specific passages, but these are overarching sentiments seen again and again in the archives. These sentiments are basic tropes in the mystery canon: implements of murder (1), men creeping in the shadows (2), a victim’s family rationalizing their sorrows (8), and, particularly for Holmes, a plea for help (9).

The simplicity of the fairly elaborate points here makes these 10 topics effective for getting a “feel” for Sherlock Holmes and the universe he inhabits. Together, they detail the basic elements of an average story. Thus, I believe them to be the most effective topics to be chosen out of this fairly bulky list.

As for the generation of the list, I experimented with a variety of settings before settling on the 50 topics/2000 iterations/10 topic word option. I tried as many as 500 topics and 5000 iterations, and as few as 10 topics and 500 iterations. The former produced too many specific topics, focusing on specific plot elements from specific stories. The latter produced too many broad topics, focusing on broadly used vocabulary words from many of the stories. I determined that an appropriate middle ground was found in the 50/2000/10 option, and I believe the topics chosen reflect that.

50 topics, 2000 iterations and a strangely sluggish i7 later

All from a cycle consisting of 50 topics, 2000 iterations, and 10 topic words.

1. holmes word head words men message revolver shook life shot — Holmes, firearms, and investigations
2. light stood long suddenly lamp dark sound low shoulder figure — Stealth and sneakiness
3. clear doubt mind person possibly obvious idea excellent perfectly point — Deduction and flattery
4. make father made heard son returned left mr view point — Conspiracy and inheritance
5. eyes face man looked dark thin tall features companion pale — Description of characters
6. house small large stone great high place square windows houses — Houses and mansions
7. reason remember fear danger clear told chance strong horror family — Rationale
8. told heart knew god story hands life speak truth leave — Rationalization
9. matter understand position imagine call absolutely important trust force hope — Help me, Holmes, you’re my only hope
10. holmes mr professor fresh work aware surprise action great change — Sudden change in behavior

Mallet Modeling Tool, Sherlock Edition

1,000 Iterations 20 Words Printed 50 Topics

Broken Home: young father left life time years poor son death man met sister boy mother ago fate died returned daughter family
Investigation: police crime evidence inspector murder death tragedy law person criminal official arrest present missing trace violence charge appeared committed effect
Household: lady wife woman husband maid mrs child character servants ferguson married rucastle lived madam mistress trouble children jack devoted nurse
Written Document: paper note table read papers box book pocket put handed writing written drew sheet glanced picked document slip envelope piece

1,500 Iterations 25 Words Printed 40 Topics

Characteristics: face eyes man thin lips features dark looked tall pale expression raised mouth figure gray beard drawn manner sprang handsome held eager fixed blue thinking
Emotions: cried hands face instant moment back god voice words quick sake minutes amazement cry spoke answer soul stared sank glimpse excitement burst heaven swear heavens
Physical Appearance: black red white hair hat head large broad coat heavy small middle set short dress cut brown round thick centre grey faced dressed clean glancing

2,000 Iterations 40 Words Printed 60 Topics

Schedule: night morning day clock morrow early leave breakfast work hours arrived sleep twelve dressed eleven bright spent slept fresh waited hour appointment porter reading moving caused lunch signs shortly meal awake tuesday matters victoria hearing reply earlier blood woking driven
Suicide: found man body dead lay blood head struck hand shot revolver blow knife stick heavy weapon unfortunate left death sign lying wound bullet handle formidable pistol finally escaped wounded tied fired carried world struggle dragged grotesque injury spot shirt gun
Traveling: hour half train station past carriage waiting cab quarter wait passed drive reached minutes drove started journey late ten line pulled hurried passing bridge town cross glancing hansom class reach brougham clearing nearer fast charing streets learn coachman cabman rattled