Topic Modeling Group Project

While working with MALLET, we noticed that several settings change the kinds of topics you get. Here are the ones that affected our results.

  • Number of Topics–The number of topics affects the kind of topics you get: if you let the computer sort the words into more categories, the topics show more variety than when there are only a few to choose from. That extra variety pushes you to think outside the box about what a specific topic really means.
  • Number of Iterations–The number of iterations affects the topics the tool gives you because each extra pass lets the computer re-sort the words, so the words in each topic hang together better and give you a firmer foundation for interpreting it.

I found that the best settings for me were to let the computer sort the data 1000 times into 100 categories. It gave me a lot to work with, so I didn’t get caught up on the topics that meant nothing to me.
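For anyone who wants to try the same settings, here is a minimal sketch of how they might be passed to MALLET, assuming it is run from the command line inside the MALLET install folder. The file names (holmes.mallet, topic_keys.txt, doc_topics.txt) and the helper function are made up for illustration; only the 100 topics and 1000 iterations come from our run.

```python
def build_mallet_command(input_file, num_topics=100, num_iterations=1000):
    """Return the argument list for a MALLET train-topics run.

    The input and output file names are hypothetical placeholders.
    """
    return [
        "bin/mallet", "train-topics",
        "--input", input_file,                      # corpus already imported into MALLET format
        "--num-topics", str(num_topics),            # how many categories to sort into
        "--num-iterations", str(num_iterations),    # how many times to re-sort the data
        "--output-topic-keys", "topic_keys.txt",    # top words for each topic
        "--output-doc-topics", "doc_topics.txt",    # topic proportions for each story
    ]


command = build_mallet_command("holmes.mallet")
# Print the command so it can be pasted into a terminal.
print(" ".join(command))
```

Changing the two default values is an easy way to compare how the topics shift with fewer categories or fewer passes.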

These were the three categories we found most interesting, along with the stories in which each appeared the most and the least.

  1. Manliness- sat pipe fire laid smoke tobacco blue corner lit armchair cigar hung silent gas brandy smoked smoking comfortable shining bachelor
     MOST: The Man with the Twisted Lip    LEAST: His Last Bow
  2. Transportation- train station carriage cab drive waiting journey drove town cross started line follow fresh bridge reach passing hansom class reached
     MOST: The Final Problem    LEAST: The Noble Bachelor
  3. Evidence- facts obvious clear person theory impossible explanation question idea perfectly mind means confess formed affair absurd probable possibly evident correct
     MOST: The Boscombe Valley Mystery    LEAST: The Adventure of the Red-Headed League
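The MOST and LEAST labels above come down to comparing how much of each story is given over to a topic. A minimal sketch of that comparison, assuming the topic proportions have already been read into a dictionary; the story names and numbers here are placeholders, not our actual results, and the function name is our own invention.

```python
def most_and_least(topic_shares):
    """Return the stories with the highest and lowest share of one topic."""
    most = max(topic_shares, key=topic_shares.get)
    least = min(topic_shares, key=topic_shares.get)
    return most, least


# Placeholder proportions for a single topic, for illustration only.
example_shares = {"Story A": 0.12, "Story B": 0.03, "Story C": 0.07}

most, least = most_and_least(example_shares)
print("MOST:", most, "LEAST:", least)
```

Running the same comparison once per topic would give a MOST and LEAST pair for every category.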

I think this raises a few questions. Mainly: how accurate is this data when considering ALL of the Holmes stories (given that each has its own specific themes), and how do these topics change chronologically as each of the stories was published?

~Austin Carpentieri & Sammy Harris