Using MALLET for Topic Modeling

Travis Miller and Simeon Allocco

When using MALLET, we found that changing the number of topics made a noticeable difference in the results. We first decreased the number of topics from 50 to 35, and then increased it from 35 to 65. With fewer topics, the word lists contained many more nouns and fewer verbs. This made the sets of words much more concise, making it easier to come up with a general name for each topic.

When we changed the number of iterations, we did not see any difference in the overall patterns, although the specific words in each topic were different. There was no visible change in verb and noun usage. One setting that we strongly recommend when using MALLET is the remove stop words option in the advanced settings. It cuts out common words that are insignificant to the actual theme of a topic, making the results much easier to analyze.
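For readers who run MALLET from the command line rather than through a settings dialog, the comparison described above could be sketched roughly as follows. This is a minimal sketch, not the setup we used: it assumes the `mallet` launcher is on the PATH, and the function name, folder names, output file names, and the iteration count of 1000 are all hypothetical placeholders. The flags themselves (`--remove-stopwords`, `--num-topics`, `--num-iterations`) are MALLET's standard options for the settings discussed above.

```python
from pathlib import Path


def build_mallet_commands(corpus_dir, work_dir, num_topics=35,
                          num_iterations=1000, mallet_bin="mallet"):
    """Return the (import, train) MALLET command lines as argument lists.

    All paths and file names are illustrative assumptions.
    """
    work = Path(work_dir)
    corpus_file = work / "corpus.mallet"

    # Step 1: import the text files, dropping stop words on the way in.
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", str(corpus_dir),
        "--output", str(corpus_file),
        "--keep-sequence",       # required for topic modeling
        "--remove-stopwords",    # the setting recommended above
    ]

    # Step 2: train a model with the chosen number of topics and iterations.
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", str(corpus_file),
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--output-topic-keys", str(work / f"keys_{num_topics}.txt"),
        "--output-doc-topics", str(work / f"doc_topics_{num_topics}.txt"),
    ]
    return import_cmd, train_cmd


if __name__ == "__main__":
    # Print the commands for the three topic counts compared above.
    for k in (35, 50, 65):
        import_cmd, train_cmd = build_mallet_commands("stories", "output", num_topics=k)
        print(" ".join(import_cmd))
        print(" ".join(train_cmd))
        # To actually run them: subprocess.run(import_cmd, check=True), etc.
```

Keeping one topic-keys file per topic count makes it easier to put the word lists side by side and compare how noun-heavy each one is.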

Favorite Topics:

Clues:

This topic was used the most in The Five Orange Pips, from The Adventures of Sherlock Holmes, and the least in The Adventure of Silver Blaze.

Questions:

Does date of publication affect the frequency of this topic?

Is this data reliable, given that this topic is so ubiquitous in the Sherlock Holmes series?

Investigation:

This topic was used the most in The Adventure of the Empty House and the least in The Red-Headed League.

Questions:

What differentiates this topic from the last topic?

How popular was this topic compared to the others?

Death:

Death showed up the most in the Adventure of the Gloria Scott and showed up the least in the Adventure of the Bruce-Partington Plans.

Questions:

The name Garcia pops up in the list of words for this topic. Why is that?

Was death a very popular topic during this time period?
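The "most" and "least" stories reported for each topic above can be read off MALLET's document-topics output. The sketch below shows one way to do that. It assumes the tab-separated layout written by recent MALLET versions (document index, document name, then one proportion per topic in order); older versions write topic/proportion pairs instead and would need a different parser. The function name and the sample values are invented for illustration and are not our results.

```python
def topic_extremes(doc_topics_text, topic_index):
    """Return (highest_doc, lowest_doc) for one topic.

    Assumes each line is: doc index <TAB> doc name <TAB> p(topic 0) <TAB> p(topic 1) ...
    """
    rows = []
    for line in doc_topics_text.splitlines():
        # Skip blank lines and the header line some versions emit.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        name = fields[1]
        proportions = [float(x) for x in fields[2:] if x.strip()]
        rows.append((proportions[topic_index], name))

    highest = max(rows)  # largest proportion of this topic
    lowest = min(rows)   # smallest proportion of this topic
    return highest[1], lowest[1]


if __name__ == "__main__":
    # Made-up sample with three documents and two topics.
    sample = (
        "0\tstory_a.txt\t0.10\t0.90\n"
        "1\tstory_b.txt\t0.70\t0.30\n"
        "2\tstory_c.txt\t0.40\t0.60\n"
    )
    print(topic_extremes(sample, 0))
```

Looking at the actual proportions, and not only the top and bottom story, would also help with questions like how popular a topic was compared to the others.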