A Closer Look at Topic Modeling | Digital Tools for the 21st Century

MALLET lets users adjust several settings when generating topics, and each one can influence the topics that are produced: the number of iterations, the amount of text, the number of words in each topic, and the number of topics. Changing the number of topics determines how broadly or narrowly the output represents the body of texts that MALLET learns from. A smaller number of topics leads to an HTML output whose topics are the most prominent themes within the texts. A larger number of topics leads to an HTML output with much greater detail and variability. When using this topic modeling tool, I recommend 1000 iterations so that the tool has enough passes to learn as much about the text files as possible. For the number of topics, I recommend 50 to 100: enough to see a great variety of outputs, while each topic remains a small enough subset of the data to analyze broad themes in the texts.
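For anyone who would rather run MALLET from the command line than through the graphical tool, the settings described above map onto options of MALLET's train-topics command. The following is a minimal sketch, not the setup used for this assignment: it only builds the command lines for three different topic counts so they can be compared side by side. The file names (stories.mallet and the output names) and the bin/mallet path are assumptions for illustration, and the input file is assumed to have been created beforehand with MALLET's import step.

```python
from typing import List


def build_train_topics_command(
    input_file: str,
    num_topics: int,
    num_iterations: int = 1000,
    num_top_words: int = 20,
    mallet_bin: str = "bin/mallet",  # assumed install location
) -> List[str]:
    """Build (but do not run) a MALLET train-topics command line."""
    if num_topics < 1 or num_iterations < 1 or num_top_words < 1:
        raise ValueError("topic, iteration, and word counts must be positive")

    # Hypothetical output names, one set per topic count, so that runs
    # with different settings do not overwrite each other.
    prefix = f"topics_{num_topics}"
    return [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--num-top-words", str(num_top_words),
        "--output-topic-keys", f"{prefix}_keys.txt",
        "--output-doc-topics", f"{prefix}_composition.txt",
    ]


# One command per topic count, matching the three lists compared in this post.
commands = [build_train_topics_command("stories.mallet", k) for k in (30, 50, 75)]
for command in commands:
    print(" ".join(command))
```

Each printed line could then be pasted into a terminal from the MALLET folder. Keeping the iteration count fixed while varying only the number of topics makes it easier to attribute differences in the output to that one setting.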

I was unable to save the metadata for my project files and could not go back to the classroom to redo this assignment, so I cannot identify the percentages of story origins for each topic. However, my three favorite topics for this assignment were “Investigation,” “Murder,” and “Attack.” Investigation came from the list with 75 topics, Murder from the list with 50 topics, and Attack from the list with 30 topics. I speculate that all of these topics draw mostly on the Sherlock Holmes stories. Interestingly, even though the topics come from different lists, they are thematically consistent: an attack, a murder, and then an investigation is a common sequence in Arthur Conan Doyle’s stories.

The metadata provided raises a variety of questions, such as what the publication dates are for stories involving murder, and whether one can identify trends in how often a plot involves a murder versus an attack.
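One way to approach questions like these is to pair each story's topic weights with its publication year and average the weights per year. The sketch below is only an illustration of that idea: the titles, years, and weights are invented placeholders, not results from this project, and the record layout is an assumption rather than MALLET's actual output format.

```python
from collections import defaultdict
from statistics import mean

# Invented placeholder records: NOT real stories, dates, or MALLET results.
documents = [
    {"title": "Story A", "year": 1891, "weights": {"Murder": 0.30, "Attack": 0.10}},
    {"title": "Story B", "year": 1891, "weights": {"Murder": 0.10, "Attack": 0.20}},
    {"title": "Story C", "year": 1903, "weights": {"Murder": 0.05, "Attack": 0.40}},
]


def mean_weight_by_year(docs, topic):
    """Average one topic's weight across the stories published in each year."""
    weights_by_year = defaultdict(list)
    for doc in docs:
        # A story with no recorded weight for the topic counts as 0.0.
        weights_by_year[doc["year"]].append(doc["weights"].get(topic, 0.0))
    return {year: mean(values) for year, values in sorted(weights_by_year.items())}


# Compare how prominent each topic is from year to year.
for topic in ("Murder", "Attack"):
    print(topic, mean_weight_by_year(documents, topic))
```

With real metadata in place of the placeholders, comparing the two printed summaries would show whether murder or attack plots become more or less prominent over time.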