Comparing Similar Topics in MALLET

This blog post was written by Kevin Finer and Megan Doty.

When using MALLET, it is important to choose the right settings for your search. We found that 75 was a good number of topics for MALLET to create. If the number of topics is too high, then the program is forced to cluster words into topics that won’t provide a realistic sense of the text. Increasing the number of iterations generally strengthens the output, but those iterations come at the cost of processing time. We felt that 1,500 iterations was enough to create reliable topics without overloading the program.
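For readers who run MALLET from the command line, the two settings discussed above correspond to the `--num-topics` and `--num-iterations` options of `train-topics`. The sketch below only assembles the command; the input file name, the output file names, and the `bin/mallet` path are placeholders we made up for illustration, not the files from our own searches.

```python
def build_train_topics_command(input_file, num_topics=75, num_iterations=1500,
                               mallet_bin="bin/mallet"):
    """Assemble a MALLET train-topics command as a list of arguments.

    The file names and the mallet_bin path are hypothetical placeholders.
    """
    return [
        mallet_bin, "train-topics",
        "--input", input_file,                    # corpus created with import-dir or import-file
        "--num-topics", str(num_topics),          # how many topics MALLET should create
        "--num-iterations", str(num_iterations),  # more iterations take more processing time
        "--output-topic-keys", f"keys_{num_topics}.txt",
        "--output-doc-topics", f"composition_{num_topics}.txt",
    ]


command = build_train_topics_command("holmes.mallet")
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where MALLET is installed. Changing `num_topics` to 30 would mirror the smaller setting used in one of the searches discussed below.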

Now we will examine three similar topics that appeared in both of our searches.

Writing / Notes and Writing
For the topic of Writing from Megan’s search, the words MALLET returned included “paper note read book table wrote writing written handed sheet picked letter write page address pen piece pencil post learn,” with the top ranked document being “The Adventure of the Reigate Squire”. In the segment of the story where the topic was most prevalent, the topic made up 41% of the file, with words from the topic appearing 19 times. The story in which it appeared least is “The Adventure of Wisteria Lodge”, with multiple parts of the story showing only 1 usage.

Meanwhile, a similar topic in Kevin’s search ranked the stories differently. “The Five Orange Pips” had a total of 17 text files linked to the topic of writing, with counts ranging from 4 to 24 words. In the file with the largest count, the topic made up 27% of the file. Writing occurred the least in “The Yellow Face”, although its use there is still larger than in Megan’s data set. This is likely because Kevin’s search for this topic produced only 30 topics, so each topic takes up a larger share of every file.
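The effect of the topic count can be seen with a little arithmetic. MALLET reports each file as a mix of topic proportions that add up to 1, so the average share per topic is simply 1 divided by the number of topics. The snippet below is a rough illustration of that average only; it does not reproduce the actual proportions from either of our searches.

```python
def average_topic_share(num_topics):
    """Average share of a file per topic, given that the shares sum to 1."""
    return 1.0 / num_topics


for k in (30, 75):
    print(f"{k} topics: average share per topic = {average_topic_share(k):.1%}")
# 30 topics: average share per topic = 3.3%
# 75 topics: average share per topic = 1.3%
```

On average, a topic in a 30-topic search covers 2.5 times as much of a file as a topic in a 75-topic search, which is why raw counts from the two searches are hard to compare directly.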

Question 1: Where this topic appeared most, what did the writing pertain to?

Question 2: In these instances, is Holmes typically writing or reading?

Emotional Shock / Reactions upon realization

In a search with 1,500 iterations and 75 topics, the topic pertaining to reactions and emotional shock was most prevalent in the story “The Adventure of the Creeping Man,” with 26 appearances of words from the topic, followed closely by “The Adventure of the Devil’s Foot,” with 21 words from the topic appearing in the segment of text.

In Kevin’s version of this topic, “The Speckled Band” used the topic the most, but “The Adventure of the Creeping Man” was probably the second most prevalent story, so in this case there was some similarity between the two topics. This particular version of the topic appeared the least in “The Norwood Builder”.

Question 1: What type of event is normally occurring when this topic appears?

Question 2: Could these topics perhaps have overlap with the topics of Murder or Crime?

Time / During the Night

Under the results for this topic, different parts of the story “The Adventure of the Bruce-Partington Plans” came up as the first three top ranked documents. Here, with Megan’s search, the results showed words that were specifically relevant to times during the night. Looking at the list of stories with top usages, the first section came in with 17 usages, followed by 14, and then 12. The same story also holds ranks 7 through 12 in the results, from which we can infer, even without having read the story, that much of the action unfolds at night.

In a different search with a similar topic, “The Red-Headed League” was easily the most common story, with the top two files belonging to it. In total, 27 files belong to that story. “The Mazarin Stone” had few instances of the topic at all, with a total of eight words across three files of the story.

Question 1: Where in the story is the time usually mentioned?

Question 2: Is time more likely to be mentioned in longer stories?

As the differences in the stories where these topics were most common show, the data for related topics does not always match, even when the topics themselves are similar. Changing even just a few words and settings in the search provides different data. What this means is that MALLET may not give you the full picture (like many of the tools we have used so far). However, it is still useful for drawing certain broad conclusions, as well as for interpreting data in new ways and considering new approaches to stories and information.
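One way to put a number on how "similar" two topics from different searches are is to compare their key words. The sketch below uses Jaccard overlap (shared words divided by all distinct words), which is our own suggestion rather than anything MALLET reports. The first word list is the Writing topic from Megan’s search quoted above; the second list is invented purely for illustration and is not the output of Kevin’s search.

```python
def jaccard_overlap(words_a, words_b):
    """Share of distinct words that two topic key lists have in common."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Key words for the Writing topic, as quoted in this post.
megan_writing = ("paper note read book table wrote writing written handed sheet "
                 "picked letter write page address pen piece pencil post learn").split()

# Hypothetical key words for a similar topic from a second search.
other_writing = ("letter paper note read wrote written writing envelope "
                 "post handwriting").split()

overlap = jaccard_overlap(megan_writing, other_writing)
print(f"Key word overlap: {overlap:.2f}")
```

A score near 1 would mean the two searches produced nearly the same topic, while a score near 0 would mean the topics only look related by their labels.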