Topic Modeling: The Experience

This week’s exploits with Mallet and topic modeling have proved to be incredibly interesting, to say the least. While discussing topic modeling in class, I had vivid expectations of what would happen upon entering the files into the program. I imagined uploading the Sherlock Holmes books as individual files and watching a set number of perfect topics pop right up, waiting to be labeled. A whole lot of wishful thinking, I guess.

I was astonished by how many nonsensical categories appeared in the outputs. This was even more apparent when stop words were not removed: one output of 75 topics and 2000 iterations was entirely unusable for that reason. In my first output (50 topics), only two categories contained words with an apparent, understandable theme. The second (40 topics) also had only two such categories. Only after I ran 100 topics with 4000 iterations twice, and followed with 75 topics and 4000 iterations, did I get ten unique topics.
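For anyone wanting to retrace runs like these, here is a minimal sketch of how the two Mallet steps (importing the texts, then training the topics) could be assembled from Python. The folder and file names (`holmes_texts`, `holmes.mallet`, the `run_...` prefixes) are placeholders I made up for illustration, and the `bin/mallet` path assumes the script is launched from the Mallet install folder. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def import_command(input_dir, corpus_file, remove_stopwords=True):
    """Build the Mallet command that turns a folder of text files into a corpus file."""
    cmd = [
        "bin/mallet", "import-dir",
        "--input", input_dir,
        "--output", corpus_file,
        "--keep-sequence",  # topic training needs word order preserved
    ]
    if remove_stopwords:
        # Leaving this flag out lets words like "the" and "and" flood the topics.
        cmd.append("--remove-stopwords")
    return cmd


def train_command(corpus_file, num_topics, num_iterations, prefix):
    """Build the Mallet command that trains a topic model on an imported corpus."""
    return [
        "bin/mallet", "train-topics",
        "--input", corpus_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--output-topic-keys", f"{prefix}_keys.txt",
        "--output-doc-topics", f"{prefix}_composition.txt",
    ]


if __name__ == "__main__":
    # Placeholder names, not the actual files used for this post.
    corpus = "holmes.mallet"
    print(shlex.join(import_command("holmes_texts", corpus)))

    # The two settings from this post that produced usable topics.
    for topics, iterations in [(100, 4000), (75, 4000)]:
        prefix = f"run_{topics}_{iterations}"
        print(shlex.join(train_command(corpus, topics, iterations, prefix)))
```

Each printed line can be pasted into a terminal, or passed to `subprocess.run` as a list. Keeping every run's output under its own prefix makes it easier to compare topic keys across different topic counts and iteration settings.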

Unsurprisingly, given the theme of the books, I got recurring topics that I wanted to label as Crime. I used the topic label from my first output. The largest number of words in the category, 59, came from The Second Stain, but The Six Napoleons was the best represented in this particular instance, with its sections ranking second through ninth by number of words in the category. Murder or Death seemed to occur in all outputs, regardless of the number of topics or iterations. The sections represented came from The Norwood Builder, The Boscombe Valley Mystery, Silver Blaze, etc. All the represented stories depict a death or murder in some manner. I also got many categories that I felt could be labeled Case, featuring words like evidence, crime, mystery, explanation, etc. Nearly all the books were represented in some capacity within that category.

While topic modeling did not match my (outlandish) pre-existing conception of it, I was really happy with the results I got. While there is certainly a need for deeper analysis to make sense of the categories presented, I found that I learned a lot about recurring themes in the Sherlock Holmes adventures, themes that matched what I have read and hinted at what I have not. My categories seemed to coincide with major themes of the books, which I really enjoyed from an analytic perspective. Topic modeling is a unique and creative way to examine common themes across literary sources without having to closely read each one. While I would not necessarily base an analysis on the themes alone (there is a degree of subjectivity in labeling them), I think it makes a really interesting supplementary element to literary analysis.