Topic Modeling, Sherlock Holmes Edition

This week’s digital tool was very different from the others we have used in class. After playing around with Mallet and topic modeling, I actually enjoyed trying to “un-puzzle” the words (so to speak) and figure out what the major topics were for each Sherlock Holmes story. For my first set of topics, I ran the modeling tool with 1,000 iterations, 20 words printed, and 50 topics. In reviewing the long list of topics I could have chosen, I actually struggled to find one that I understood and thought was relevant enough to twentieth-century London. For my first topic, broken home, I learned that this group of words was most prominent in the Sherlock Holmes story The Solitary Cyclist. From reviewing the results, it seemed as though about 25% of the words in that story belonged to my topic. When I first tried to identify the topic, I labeled it abandoned. However, after reviewing the words again, I felt broken home was more appropriate: even though a father left, the siblings remained in contact. It is not an ideal family, but it is still their version of what a family is. My second topic had to do with investigation. I chose this topic not only because I recognized it right away, but also because it shows that the majority of Sherlock Holmes stories revolve around investigation! It seemed as though 39% of the words in the story related to this theme, but the story didn’t rely too heavily on it. My third topic was household. It seemed as though 25% of the words in The Sussex Vampire related to this topic; however, it was only a small part of the story, seeing as how it only outlined the characters of the short story. Lastly, the fourth topic I chose was written document. I was surprised by these results because only about 13% of the story involved this topic. Although it may not have been a major theme in “Gloria Scott,” I assume the document was the premise on which the investigation was based.
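To make percentages like these concrete: a topic's share of a story can be thought of as the fraction of the story's words that the model assigned to that topic. Here is a minimal Python sketch of that arithmetic. The function name `topic_shares` and the sample word list are made up for illustration; this is not Mallet's actual output format or how Mallet computes its numbers internally.

```python
from collections import Counter


def topic_shares(assignments):
    """Return each topic's share of a document as a fraction of its words.

    `assignments` holds one topic label per word in the document
    (hypothetical data, not Mallet's real output format).
    """
    if not assignments:
        return {}
    counts = Counter(assignments)
    total = len(assignments)
    return {topic: count / total for topic, count in counts.items()}


# Made-up example: 8 words, each tagged with the topic it was assigned to.
sample = ["investigation", "household", "investigation", "travel",
          "investigation", "household", "travel", "investigation"]
shares = topic_shares(sample)

# Print topics from largest share to smallest.
for topic, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{topic}: {share:.0%}")
```

In this toy example, half of the words are tagged investigation, so that topic gets a 50% share, the same kind of number as the 39% or 25% figures reported for the stories.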

When I played around with Mallet again, I decided to change it up. Instead of 1,000 iterations, this time I used 1,500 iterations, 25 words printed, and 40 topics. I found it easier to label topics for these groups of words because I had more to work with and more to compare. The first topic I chose was characteristics. It became very clear that this was the topic because words like “face, grey, man, thin, lips” were very prevalent. The next topic I chose was emotions. This was one of the harder topics to label because words like “god, voice, words” threw me off, but after reviewing it once more, I decided that a general topic for these words would have to do with a person’s emotions. My last topic for this set was physical appearance. This was kind of a fun group to label because there was a lot of imagery and color involved, so it was very easy for me to imagine this person standing in front of me. Therefore, I knew this topic had to involve some sort of appearance. When I reviewed the top Sherlock Holmes stories for these topics, I got The Priory School, The Lion’s Mane, and A Case of Identity. Interestingly enough, all of these topics were somewhat similar, leading me to believe these stories may have similar themes.

For my last set, I chose 2,000 iterations, 40 words printed, and 60 topics. I was a bit more overwhelmed with this set because, although I had a lot more to work with, it felt like almost too much. When looking at the different sets of words, I felt that most of the words fit together to create a topic, but others seemed extraneous and took away from the major theme of the topic. With that being said, I labeled my first topic schedule. A lot of the words had to do with timing and places to be, which reminded me of planning out my day. About 12% of The Missing Three-Quarter had to do with this topic, and although another 6% had to do with a different topic, one of the major themes still pointed to scheduling. My second topic was suicide. This topic was very easy to label because all of the words pointed to something tragic and self-inflicted. Looking at the percentages, it seemed as though The Norwood Builder was the top story, with about 13% of its words belonging to this topic. Although it was not the number one topic for the story, it is prevalent, seeing as how a murder must have taken place. Lastly, I chose the topic of traveling for this set. This was another topic that I was iffy about because I felt like it could have been journey or travel, but I leaned more toward travel because of words such as “bridge, town, (and) cross.” It seemed as though traveling was prevalent in The Final Problem, with a total of 29% of the short story’s words belonging to this topic.

Overall, I thought Mallet was a fun and interesting tool, and I would most definitely try it out again sometime. It taught me a lot about identifying a major theme based on the prevalent words in a short story, but in a unique way. Every time I figured out a topic, I felt as though I was solving a really difficult puzzle; however, once I came up with fitting topics, the whole process became extremely entertaining!