Mallet & Topic Modeling

Drawing on the collection of Sherlock Holmes stories, I used MALLET to draw out keywords and sort them into separate topics seen throughout Conan Doyle’s work. From the topics that MALLET produced, I was able to pick out a handful that felt more classifiable than the rest; many of the others seemed to be collections of filler words that, together in a topic, read as a bit nonsensical. After running MALLET a few times with settings of 75 topics and 1500 iterations, I was able to pull out the following topics: transportation, murder, writing, trails, working in London, during the night, women, describing the dead, finances, Holmes seated, realization, and facial expressions.

Changing the settings, it seemed as though the number of topics output did not affect the quality of each topic, but rather the number of topics that I could pull from effectively. That is to say, I might be able to get 5 good topics with a setting of 40 topics, and 10 good ones with 80. The number of iterations, on the other hand, played an important role in producing usable results: the more iterations the program ran through, the more cohesive each topic seemed to be. The number of words per topic seemed to work in the same vein as the number of topics overall. When running the program, I stuck with 20 words per topic. As with the topics themselves, some of the words output for each topic seemed nonsensical. With fewer words, though, I might have lost the words that helped connect the topic to a certain subject. With more words, each topic may have become too convoluted to make sense of and categorize.
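For anyone who wants to try the same settings from the command line, here is a minimal sketch of how the three knobs discussed above (topics, iterations, and words per topic) map onto MALLET's `train-topics` options. This is not necessarily how my own runs were launched; the file names (`holmes.mallet`, `topic_keys.txt`, `doc_topics.txt`), the `bin/mallet` path, and the helper function `build_mallet_command` are all assumptions made up for illustration, and the input file is assumed to have been created beforehand with MALLET's import step.

```python
def build_mallet_command(input_file, num_topics=75, num_iterations=1500,
                         num_top_words=20, mallet_bin="bin/mallet"):
    """Build the argument list for a MALLET topic-modeling run.

    The paths and output file names are hypothetical placeholders;
    only the option names come from MALLET's train-topics command.
    """
    return [
        mallet_bin, "train-topics",
        "--input", input_file,                    # corpus already imported into MALLET's format
        "--num-topics", str(num_topics),          # how many topics to output
        "--num-iterations", str(num_iterations),  # how many sampling iterations to run
        "--num-top-words", str(num_top_words),    # how many words to list per topic
        "--output-topic-keys", "topic_keys.txt",  # hypothetical output file: top words per topic
        "--output-doc-topics", "doc_topics.txt",  # hypothetical output file: topic mix per document
    ]


command = build_mallet_command("holmes.mallet")
print(" ".join(command))
```

The sketch only prints the command rather than running it, so it works without MALLET installed. Passing the list to `subprocess.run(command, check=True)` would start an actual run, assuming MALLET is available at the given path.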

What was particularly interesting about how this MALLET project was run was the HTML pages output with the results. Being able to click on a topic and see its words in context helped immensely with the vaguer topics, which could have fit any of several adjacent subjects.

[Image: MALLETwordle word cloud]

Putting the words from MALLET into Wordle to create a word cloud, I could immediately imagine more interesting ways that the data could be processed. It is intriguing to see commonalities between topics, though it would be even more so to see how they connect overall. In particular, I would like to see how often each topic appears across a timeline of the Holmes stories, perhaps measured as a density adjusted for the length of the work in which the topic appears.
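The timeline idea could be sketched roughly as follows. This is only an illustration of the length adjustment, not something I actually ran: the story titles, years, and word counts are made-up placeholders, and the `StoryTopicCount` class and `topic_density_timeline` function are hypothetical names. The assumption is that, for one topic, we already know how many words in each story were assigned to it and how long each story is.

```python
from dataclasses import dataclass


@dataclass
class StoryTopicCount:
    title: str
    year: int          # publication year, used to order the timeline
    topic_tokens: int  # words in the story assigned to the topic
    total_tokens: int  # total words in the story


def topic_density_timeline(stories):
    """Return (year, title, density) tuples in chronological order.

    Density is the share of a story's words that belong to the topic,
    so long and short works can be compared on the same scale.
    """
    timeline = []
    for story in sorted(stories, key=lambda s: s.year):
        if story.total_tokens:
            density = story.topic_tokens / story.total_tokens
        else:
            density = 0.0  # avoid dividing by zero for an empty text
        timeline.append((story.year, story.title, density))
    return timeline


# Placeholder numbers for illustration only, not real results.
stories = [
    StoryTopicCount("Story B", 1892, 120, 8000),
    StoryTopicCount("Story A", 1887, 300, 40000),
    StoryTopicCount("Story C", 1902, 450, 60000),
]

timeline = topic_density_timeline(stories)
for year, title, density in timeline:
    print(f"{year}  {title}  {density * 1000:.1f} topic words per 1,000")
```

With these placeholder numbers, the short "Story B" has fewer topic words in total than the other two but the highest density, which is exactly the kind of difference a raw count would hide.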