Topic Modeling Observations

And we’re back!
I suspect I am not alone in this, but when we first discussed Topic Modeling in class I understood it only partly and found much of the process confusing. Using the MALLET tool myself has made it far clearer, both in how it works and in why it is useful. So let’s dive in.

There’s no need to examine all of my found topics but you can view them here. They were Business and Labor, Holmes’ Office, Emotional Shock, Place, Crime, Foot Traffic, Time, Deduction, Love and Matrimony, and Notes and Writing. I found that increasing or reducing the number of topics to sort was one of the key factors affecting the output. Fewer topics obviously lead to more repetition between searches, whereas more topics lead to a wider variety and assortment. I’m sure that running more iterations does provide a stronger set of words per topic, but I can’t really prove that claim, and it has the downside of making the program take longer to run.
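For anyone who wants to repeat the experiment of changing the topic and iteration counts, here is a minimal sketch of how the runs could be scripted. It only builds the command line for MALLET's train-topics step; the input file name holmes.mallet, the bin/mallet path, and the output file names are my own placeholders, not the files I actually used.

```python
from typing import List


def mallet_train_command(num_topics: int, num_iterations: int,
                         input_file: str = "holmes.mallet",  # placeholder name
                         mallet_bin: str = "bin/mallet") -> List[str]:
    """Build the argument list for one MALLET train-topics run."""
    # Tag the output files with the settings so runs can be compared later.
    tag = f"{num_topics}_{num_iterations}"
    return [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--output-topic-keys", f"keys_{tag}.txt",
        "--output-doc-topics", f"composition_{tag}.txt",
    ]


if __name__ == "__main__":
    # Print one command per topic count; each could be pasted into a terminal
    # or handed to subprocess.run() once MALLET is installed.
    for k in (5, 10, 20):
        print(" ".join(mallet_train_command(k, 1000)))
```

Keeping the settings in the output file names makes it easier to put the topic keys from different runs side by side and see where the repetition creeps in.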

One of the topics I found personally interesting was Holmes’ Office. With the words – holmes chair sat table room fire back pipe sitting rose arm laid seated glanced books – I felt that it gave a clear indication that these were scenes of Holmes reflecting in his office. And indeed, in several of the passages connected to it, the scene was set in his office. But I was too specific in my naming of this topic: the number one scene linked to it (the one with the highest number of topic words associated with it) was from “The Man with the Twisted Lip”, and the scene in question, from file twis_39.txt, did not actually take place in Holmes’ office, even if it strongly evoked it. Technically none of the words specifically points to his office, but together they provide a sense of scene that I linked to it. So while the passages often fit the topic, my name for it was too narrow, and perhaps another theme or word would have been better suited (although I admittedly struggle to decide on one myself).

In contrast, a more general topic name is more likely to be accurate, simply because it covers more ground. The topic Crime – crime night police occurred house tragedy violence murder made caused committed account barclay death appeared – will almost always evoke crime in the scene in question. Since the Holmes stories of course revolve around crimes and the mysteries they create, this was a prevalent topic: at least 5 instances of its words appear in 90 of the text files.
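A count like that can be roughly approximated with a few lines of Python. This is only a sketch under a big assumption: it counts raw matches against the topic's key words, whereas MALLET assigns each individual word occurrence to a topic, so its numbers will not match exactly. The function names and the two sample passages are made up for illustration.

```python
import re
from typing import Dict, Iterable, List, Tuple

# The key words of the Crime topic, as listed above.
CRIME_WORDS = ("crime night police occurred house tragedy violence murder "
               "made caused committed account barclay death appeared").split()


def topic_word_hits(text: str, topic_words: Iterable[str]) -> int:
    """Count how many tokens in the text are one of the topic's key words."""
    wanted = set(topic_words)
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in wanted)


def rank_documents(docs: Dict[str, str], topic_words: Iterable[str],
                   min_hits: int = 5) -> List[Tuple[str, int]]:
    """Return (name, hits) for documents with at least min_hits, best first."""
    words = list(topic_words)
    scored = [(name, topic_word_hits(text, words)) for name, text in docs.items()]
    kept = [(name, hits) for name, hits in scored if hits >= min_hits]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Two invented passages standing in for the real text files.
    docs = {
        "sample_a.txt": "The police came at night. A murder, a crime, a death in the house.",
        "sample_b.txt": "Holmes sat in his chair by the fire.",
    }
    for name, hits in rank_documents(docs, CRIME_WORDS):
        print(name, hits)
```

With the real corpus, docs would be filled by reading each text file, and the length of the returned list would be the number of files that clear the threshold.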

As is the theme of Digital Humanities as we’ve explored it thus far, we could use Topic Modeling to look at these stories in new and interesting ways. The topics can shed light on things we may not have considered before and open up questions that likely wouldn’t have been asked through close reading of the stories. The tool also provides insight into the relationships between words and the power they have over us, as in my mistake of naming a topic Holmes’ Office when it was in fact more general than that.