Topic Modeling Partner Blog Post – Carly Rome and Jacquie Behan

Lowering the number of topics causes MALLET to generate broader topics, but lowering it too far makes the topics too vague to interpret. We recommend running MALLET with a higher number of iterations, because this makes it easier to identify common topics among the words generated. However, too many topics combined with too many iterations can make the results too specific, which isn’t helpful either.
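To make this trade-off concrete, here is a minimal sketch of how several MALLET runs with different settings could be set up for side-by-side comparison. The flags are options of MALLET's train-topics command. The path bin/mallet, the input file stories.mallet, the output file names, and the specific topic and iteration counts are all placeholders chosen for illustration, not settings taken from this post.

```python
from typing import List


def mallet_train_command(num_topics: int, num_iterations: int,
                         mallet_bin: str = "bin/mallet",        # assumed install path
                         input_file: str = "stories.mallet"     # assumed imported corpus
                         ) -> List[str]:
    """Build the argument list for one `mallet train-topics` run."""
    tag = f"{num_topics}t_{num_iterations}i"
    return [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        # Output file names are hypothetical; the tag keeps runs from overwriting each other.
        "--output-topic-keys", f"keys_{tag}.txt",
        "--output-doc-topics", f"composition_{tag}.txt",
    ]


# Hypothetical settings to compare: (number of topics, number of iterations).
settings = [(10, 1000), (20, 1000), (20, 2000)]
commands = [mallet_train_command(topics, iters) for topics, iters in settings]

for cmd in commands:
    # Printing only; nothing is executed here.
    print(" ".join(cmd))
```

Each printed line can be pasted into a terminal, or each list can be passed to subprocess.run(cmd, check=True) on a machine where MALLET is installed. Reading the resulting topic-keys files next to each other shows how the topics broaden or narrow as the settings change.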

Comparing our two word lists, we found that we had three topics in common:

  1. Crime (Carly - Crime: police man inspector lestrade colonel crime death reason evidence murder mystery affair case present remained criminal force arrest account undoubtedly / Jacquie - Crime: man police found inspector dead crime death body evidence reason murder blood night shot person)
  2. Murder/Death (Carly - Death: found end lay dead hand part body long deep lost close blood finally carried showed attention broken shot leaving horror / Jacquie - Murder: found dead man death crime body evidence terrible unfortunate attempt violence words occurred instantly action save murderer committed escape murdered)
  3. Characteristics/Appearance of a Person (Carly - Characteristics: face eyes man red black white dark looked thin features hair lips tall appearance blue drawn expression pale heavy hat / Jacquie - Appearance: face eyes man dark tall looked features expression thin lips pale mouth figure companion appearance)
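One rough way to check how closely two such word lists match is to count the words they share. The sketch below does this for the two Crime lists above, using the Jaccard index (shared words divided by all distinct words). The choice of Jaccard and the helper name jaccard are assumptions made for illustration; this is not a measure that MALLET reports.

```python
def jaccard(a: set, b: set) -> float:
    """Share of distinct words that appear in both sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Word lists copied from the Crime topic above.
carly_crime = set(
    "police man inspector lestrade colonel crime death reason evidence murder "
    "mystery affair case present remained criminal force arrest account "
    "undoubtedly".split()
)
jacquie_crime = set(
    "man police found inspector dead crime death body evidence reason murder "
    "blood night shot person".split()
)

shared = sorted(carly_crime & jacquie_crime)
score = jaccard(carly_crime, jacquie_crime)

print("Shared words:", ", ".join(shared))
print("Number shared:", len(shared))
print("Jaccard overlap:", round(score, 2))
```

The same comparison can be repeated for the Murder/Death and Characteristics/Appearance lists by swapping in their words. A higher score suggests the two runs landed on a similar topic, though it says nothing about whether the shared words are being used in the same sense.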

The Crime topic appeared most often (74 times) in "The Second Stain" and least often (2 times) in "The Musgrave Ritual." The Murder/Death topic appeared most often (28 times) in "The Norwood Builder" and least often (2 times) in "The Noble Bachelor." The Characteristics/Appearance of a Person topic appeared most often (27 times) in "The Mazarin Stone" and least often (2 times) in "The Speckled Band."

Our two questions about the data are:

  1. Can any of the words in these topics have double meanings or be misinterpreted without context? How can this affect the accuracy and helpfulness of topic modeling?
  2. Is it okay to ignore one or two words that may not correlate with the rest of the words within a topic? Why or why not?