Topic Modeling

How does the number of topics affect the topics the tool gives you?

Changing the number of topics varies the granularity of the topics produced: fewer topics yield broader themes, while more topics yield finer-grained ones.

How does the number of iterations affect the topics the tool gives you?

The higher the number of iterations, the more coherent the resulting topics tend to be.

What settings do you recommend for use with the Topic Modeling Tool?

10-20 topics

>100 iterations

Remove stopwords
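As a rough sketch of how these settings fit together, here is scikit-learn's LatentDirichletAllocation standing in for the Topic Modeling Tool's engine (the toy corpus is invented, and sklearn uses variational inference where MALLET uses Gibbs sampling, so this is an analogy, not the tool's actual implementation):

```python
# Sketch: how topic count, iterations, and stopword removal map onto an LDA run.
# scikit-learn stands in for the Topic Modeling Tool; the corpus is a made-up toy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "holmes examined the case and found a curious clue",
    "the body lay at the murder scene near the window",
    "watson wrote a letter and read the morning paper",
    "the police made an arrest after the terrible crime",
]

# Remove stopwords, as recommended above.
X = CountVectorizer(stop_words="english").fit_transform(docs)

# Fewer topics -> broader themes; more iterations -> more stable topics.
lda = LatentDirichletAllocation(
    n_components=2,   # number of topics (10-20 recommended for a real corpus)
    max_iter=150,     # iterations (>100, per the recommendation)
    random_state=0,
).fit(X)

print(lda.components_.shape)  # (n_topics, vocabulary size)
```

Raising `n_components` here splits the broad themes into narrower ones, which is the granularity effect described above.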

Solving: case point facts points fact obvious interest explanation investigation mystery simple confess theory present admit solution formed true problem connection

  1. What story uses that topic the most? The Dancing Men
  2. Which stories use it less? The Disappearance of Lady Frances Carfax
  3. What is the most common word from this topic in the story?
  4. Why are some words repeated?

Crime: man dead poor strong body death life brought terrible dangerous sort words creature real deep notice wild turn devil lies

  1. What story uses that topic the most? The Veiled Lodger
  2. Which stories use it less? The Reigate Squires
  3. Why do you think this topic is used more in this story?
  4. What is the relation between the words, the topic, and their stories?

Murder Case: crime police found murder death night scene arrest reason attention remained trace instantly murderer attempt suspicion discovered charge caused search

  1. What story uses that topic the most? The Second Stain
  2. Which stories use it less? The Devil’s Foot
  3. Why are there some words that are not related to the topic?
  4. How does the topic modeling tool help us understand the story?
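Questions like "what story uses that topic the most?" are read straight off the tool's per-document topic proportions. A minimal sketch, assuming a simplified table of per-story proportions for one topic (story names and numbers are invented for illustration):

```python
# Hypothetical per-story proportions for one topic, the kind of table
# the Topic Modeling Tool reports for each document.
doc_topic_share = {
    "The Second Stain": 0.18,
    "The Devil's Foot": 0.03,
    "The Veiled Lodger": 0.11,
}

# The story with the largest share "uses the topic the most";
# the smallest share uses it the least.
most_used = max(doc_topic_share, key=doc_topic_share.get)
least_used = min(doc_topic_share, key=doc_topic_share.get)
print(most_used, least_used)
```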

By Alessandra Oestreich and Isabelle Berta

Using MALLET for Topic Modeling

Travis Miller and Simeon Allocco

When using MALLET we found that changing the number of topics made a significant difference. We first decreased the number of topics from 50 to 35, and then increased it from 35 to 65. After doing so we realized that the word lists contained many more nouns and fewer verbs when we decreased the number of topics. This made the sets of words much more concise, making it easier to generalize a topic name for them.

When fiddling with the number of iterations we did not see any difference in the patterns at all, besides the fact that the words themselves were different. Aside from that there was no visible change in verb and noun usage. One setting that we strongly recommend when using MALLET is the remove stop words option in the advanced settings. This will cut out unnecessary words that are insignificant to the actual theme of the topic, making it much easier to analyze.
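The stop-word effect is easy to demonstrate. Below, scikit-learn's built-in English stop list stands in for MALLET's (the two lists differ, so treat this as an illustration rather than MALLET's exact behavior):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the case of the speckled band", "a scandal in bohemia"]

# Vocabulary with and without English stop-word filtering.
kept = CountVectorizer().fit(docs).vocabulary_
filtered = CountVectorizer(stop_words="english").fit(docs).vocabulary_

print(sorted(kept))      # 'the', 'of', 'in' survive
print(sorted(filtered))  # only content words remain
```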

Favorite Topics:

Clues:

This topic was used the most in the Adventure of Sherlock Holmes: The Five Orange Pips while it was used the least in the Adventure of the Silver Blaze.

Questions:

Does date of publication affect the frequency of this topic?

Is this data reliable since this topic is so ubiquitous in the Sherlock Holmes Series?

Investigation:

This topic was used most in the Adventure of the Empty House and it was used the least in the Adventure of the Redheaded League.

Questions:

What differentiates this topic from the last topic?

How popular was this topic compared to the others?

Death:

Death showed up the most in the Adventure of the Gloria Scott and showed up the least in the Adventure of the Bruce-Partington Plans.

Questions:

The name Garcia pops up in the list of words for this topic. Why is that?

Was death a very popular topic during this time period?

Relevance of Topics in Mallet

By: Mike Mirando & Nadia Sharif

Varying the number of topics in MALLET noticeably affects the results of the process. As we observed the data from our runs, we concluded that with fewer topics, the evidence is clearer and broader in certain aspects. The output strengthens as the number of iterations increases, a positive correlation. A recommended setting for MALLET would be 50 topics, 1,500 iterations, and 20 topic words.

Money & Worth: business client money england hundred king pounds thousand large set gold photograph paid pay ten give draw fifty ready worth

a) The specific story that utilized this topic the most was The Blue Carbuncle, with 21% of the words assigned to this topic (34 words)

b) The specific story that utilized this topic the least was The Boscombe Valley Mystery (2 words)

c) 1) Since these explicit words are associated with money, can we infer that the entire story revolved around economics?

2) Can MALLET help us determine whether all of the Sherlock Holmes stories revolve around money?

Family: woman lady wife husband life love left boy child nature loved beautiful maid ferguson happy madam women mistress devoted wonderful

a) The story in which this selection of words is found the most is The Sussex Vampire (88 words)

b) The story in which this selection of words is found the least is The Red Circle (2 words)

c) 1) Can we determine what characters have been associated with these words from MALLET?

2) Can we understand the background information associated with this selection of words?

Unfortunate Murder: death man poor terrible dead heard told creature happened died night life human met devil great wild fate dreadful killed

a) The most prevalent story that these words are associated with is The Second Stain (64 words)

b) The least prevalent story that these words are associated with is The Noble Bachelor (2 words)

c) 1) Can we find the specific event that occurred from MALLET?

2) At which time and point in the story did these words come up most and why?

As we looked over the results numerous times, we found these topics to be the most interesting, and the most evident and specific when it came to comprehending the background information. Understanding the information a topic lays out for you is a highly significant aspect of MALLET, and it proved to be a primary aid to reading.


Comparing Similar Topics in MALLET

This blog was written by Kevin Finer and Megan Doty

While using MALLET it is important to use the right variables in your search. We found that 75 was a good number of topics for MALLET to create. If the number of topics being created is too high, the program is forced to cluster words into topics that won’t provide a realistic sense of the text. In terms of iterations run, the output will only strengthen by increasing that variable, but those iterations come at the cost of processing time. We felt that 1,500 iterations was good enough to create reliable topics while not overloading the program.

Now we will examine three similar topics that both of our searches created.

Writing / Notes and Writing
For the topic of Writing from Megan’s search, the words MALLET returned included “paper note read book table wrote writing written handed sheet picked letter write page address pen piece pencil post learn,” with the top-ranked document being “The Adventure of the Reigate Squire”. For the segment of the story where the topic was most prevalent, the count came in at 41% of the file, with words from the topic appearing 19 times. The story in which it appeared least is “The Adventure of Wisteria Lodge”, with only 1 usage appearing across multiple parts of the story.

Meanwhile, a similar topic in Kevin’s search showed different story frequencies. “The Five Orange Pips” brought a total of 17 text files linked to the topic of writing, ranging from 4 to 24 words. In the largest count the topic made up 27% of the file. Writing occurred the least in “The Yellow Face”, although its usage there is still larger than in Megan’s data set. This is because the search of Kevin’s where this topic occurred made only 30 topics, thus increasing the frequency of all of them.

Question 1: Where this topic appeared most, what was the writing about?

Question 2: In these instances is Holmes typically writing or reading?

Emotional Shock / Reactions upon realization

In a search with 1500 iterations and 75 topics output, the topic pertaining to reactions and emotional shock was most prevalent in the story “The Adventure of the Creeping Man,” with 26 appearances of words from within the topic, followed closely by “The Adventure of the Devil’s Foot,” with 21 words appearing from the segment of text.

In Kevin’s version of this topic, “The Speckled Band” used it the most, but “The Adventure of the Creeping Man” was probably the second most prevalent story. In this case there was some similarity between the two topics. This particular version of the topic appeared the least in “The Norwood Builder”.

Question 1: What type of event is normally occurring when this topic appears?

Question 2: Could these topics perhaps have overlap with the topics of Murder or Crime?

Time / During the Night

Under the results for this topic, different parts of the story “The Adventure of the Bruce-Partington Plans” came up as the first three top-ranked documents. Here, with Megan’s search, the results more specifically showed words relevant to times during the night. Looking at the list of stories with top usages, the first came in with 17 usages for the section, followed by 14, and then 12. The same story also ranks 7 – 12 in the results, which suggests — without our having read the story — that much of the action unfolds at night.

In a different search with a similar topic, “The Red-Headed League” was easily the most common with the top two files belonging to that story. In total 27 files belong to that story. “The Mazarin Stone” had few instances of the topic at all, comprising a total of eight words through three files of the story.

Question 1: Where in the story is the time usually mentioned?

Question 2: Is time more likely to be mentioned in stories that are a longer length?

As is evidenced by the differences in the stories where these topics were most common, even when the topic is similar the data for related topics does not always match. Changing even just a few words and settings in the search produces different data. What this means is that MALLET may not give you the full picture (like many of the tools we have used so far). However, it is still useful for drawing certain broad conclusions, as well as for interpreting data in new ways and considering new approaches to stories and information.

Discussing Topic Models with Mary Dellas and Joe Mausler

After discussing the process and results of topic modeling using MALLET, we know that the fewer topics we have, the broader the topic category MALLET gives us. The more iterations we have, the easier it is to identify a topic name. We recommend the default settings we used in class: 50 topics, 1,000 iterations, and 20 topic words. This setting gave us enough topic words to determine a topic name, but not so many that it became confusing and repetitive.
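The topic-words tradeoff can be sketched by slicing the top N entries of a topic's word-weight vector; the vocabulary and weights below are invented for illustration:

```python
import numpy as np

# Invented word weights for a single "crime"-like topic.
vocab = np.array(["police", "crime", "case", "night", "murder",
                  "tea", "lamp", "shoe", "garden", "hat"])
weights = np.array([9.0, 8.5, 8.0, 6.0, 5.5, 0.4, 0.3, 0.2, 0.2, 0.1])

def top_words(n):
    """Return the n highest-weighted words, as the topic-words setting does."""
    return list(vocab[np.argsort(weights)[::-1][:n]])

print(top_words(5))   # enough to name the topic
print(top_words(10))  # more words, more low-weight noise to explain
```

With too few words the topic is hard to name; with too many, the low-weight tail (the "set, bear" effect noted below) starts to look unrelated.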

These are our three favorite topics:

1. Physical Description (Male): face man eyes looked thin dark features tall expression appearance middle high pale figure set glasses gray keen clean bear

  • a) The top ranked document in the Physical Description (Male) topic is Charles Augustus Milverton. 26 words in the document are assigned to this topic.
  • b) The story The Sussex Vampire uses this topic the least (2 times).
    • Question 1: Even though 26 words in the document are assigned to Physical Description (Male), does this imply that this document is entirely dedicated to the topic Physical Description (Male)?
    • Question 2: Why does it seem like some of the words (ex. set, bear) do not relate to the other words in the topic?

2. Letter Writing: paper note read letter table book box letters papers written handed wrote writing sheet brought importance post write document address

  • a) The top ranked document in the letter writing topic is The “Gloria Scott”. 18 words in the document are assigned to this topic.
  • b) The story Shoscombe Old Place uses the topic least (2 times).
    • Question 1: Why does the same story name appear multiple times on the list of the top ranked documents?
    • Question 2: When we click the story chunk, why is MALLET only showing us a small part of the document?

3. Crime: police crime case night evidence murder death account occurred arrest unfortunate effect tragedy violence complete charge appeared reason terrible committed

  • a) The top ranked document in the crime topic is The Second Stain. 62 words in The Second Stain were assigned to this topic.
  • b) We found that The Priory School uses crime the least, a total of two times.
    • Question 1: Is crime a more common topic in the later Sherlock Holmes stories or the earlier ones?
    • Question 2: Can MALLET tell us how many stories in total discuss crime?

Topic Modeling: Sherlock Holmes

Looking at the topics I selected earlier this week, I noticed that even though in some cases there are repeated words, or words that shouldn’t be in the topic, the topic modeling tool is really effective and gives us a good general idea.

For example if you look at the topic that I named Mr. Watson:

watson case find point friend doubt sherlock interest facts remarkable fact singular remarked account dear strange present points curious reason

You can link most of the words with the character. For example, he is Sherlock Holmes’s friend, interested in the facts of the case; he points out things that Sherlock says and sometimes has doubts about them.

Some other topics for example can show us some steps of solving a case:

case point facts points fact obvious interest explanation investigation mystery simple confess theory present admit solution formed true problem connection

First of all you have the case, and then you have to point out the facts, even taking into account the obvious ones. In the end you need an explanation for the mystery that the case presents. You can start by forming a theory of what happened, by investigating the victim’s connections, or by following the clue of a confession to get to the true story. Your choice of technique doesn’t matter, but you need to solve the problem.

Obviously some topics can be a little messy and not mean anything at all, but this tool is really helpful. I tried to set as many iterations as I could, and it was really fun seeing the result.

Mallet & Topic Modeling

Drawing from the collection of Sherlock Holmes stories, MALLET was used to pull out keywords and sort them into separate topics seen throughout Conan Doyle’s work. From the topics I got running MALLET, I was able to pick out a handful that felt more classifiable than others; many of the rest seemed to be filler words that, grouped together as a topic, look a bit nonsensical. After running MALLET a few times with settings of 75 topics and 1,500 iterations, I was able to pull out transportation, murder, writing, trails, working in London, during the night, women, describing the dead, finances, Holmes seated, realization, and facial expressions.

Changing the settings, it seemed as though the number of topics output did not affect the quality of each topic, but rather the number of topics I could pull from effectively. Which is to say, I might get 5 good topics with a setting of 40 topics output, and 10 good ones with 80. Further, the number of iterations played an important role in producing usable results: the more iterations the program ran through, the more cohesive each topic was. The words-per-topic setting seemed to work in the same vein as the number of topics output overall. Running the program, I stuck with 20. As with topics, some of the word sets seemed nonsensical. Though with fewer words I might have lost the words that helped connect the topic to a certain subject. With more words, the result for each topic may have become too convoluted to make sense of and categorize.

A particularly interesting part of how this project with MALLET was run was the HTML pages output with the results. Being able to click on a topic and see its words in context helped immensely with the vaguer topics, which could have been applicable to a number of adjacent topics.

MALLETwordle

Putting the words from MALLET into Wordle to create a word cloud, I could immediately imagine more interesting ways the data could be processed. It is intriguing to see commonalities between topics, though it would be even more so to see how they connect overall. Particularly, to see how often they appeared throughout a timeline of Holmes stories — perhaps incorporating density, with adjustments for the length of the work the topic appeared in.
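That density idea could start from something as simple as per-story counts normalized by story length; all story names and numbers below are invented for illustration:

```python
from collections import Counter

# Invented counts of one topic's words in three stories.
topic_hits = {
    "A Study in Scarlet": Counter(crime=4, police=3),
    "The Sign of the Four": Counter(crime=2),
    "The Final Problem": Counter(police=1, crime=1),
}
# Invented total word counts per story.
story_length = {
    "A Study in Scarlet": 43000,
    "The Sign of the Four": 43000,
    "The Final Problem": 7200,
}

# Density: topic hits per word, so short stories aren't drowned out by long ones.
density = {s: sum(c.values()) / story_length[s] for s, c in topic_hits.items()}
for story, d in sorted(density.items(), key=lambda kv: -kv[1]):
    print(f"{story}: {d:.5f}")
```

Note how the shortest story comes out densest despite having the fewest raw hits, which is exactly what the length adjustment is for.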

Topic Modeling Sherlock Holmes via MALLET; Data Analysis

I initially questioned the reliability of topic modeling as a means of analyzing and interpreting bodies of literature. I understood how trends in word distribution could thematically summarize and reveal patterns in a canon, but due to a misconception of how the process worked I remained skeptical. My skepticism came from doubting that a universal algorithm could be applied to a variety of canons and yield worthwhile data for each.

My attitude shifted as I began to familiarize myself with MALLET’s interface and the nature of the topic modeling process. While nothing is perfect, the program offers sufficient means to customize the analysis, a necessary component of rendering worthy results. My experience proved this feature can in fact make or break an analysis, and it ultimately dictates the practicality of the data sets you receive. I came to this conclusion after running multiple trials of the same canon with varied settings and getting results that varied in effectiveness.

For instance, my first trial with MALLET yielded “iffy” results. The algorithm ran on its default setting: 50 topics, 1,000 iterations, 20 topic words, and stop words enabled. For whatever reason, when the Sherlock Holmes canon was analyzed with the default setting, the resulting data groups contained many “outlier” words, which made generalizing them under one topic difficult. If the analysis weren’t customizable, then I couldn’t have adapted the algorithm to a practical setting for getting significant data. But in my second trial, when the same canon was analyzed with slight variation (50 topics, 2,000 iterations, 10 topic words, with stop words), the results were data sets more easily understood and grouped into topics that reveal trends in the body of literature. Some of the results were as follows:

Topic: High Class Estate “house long high large place windows garden front standing servants” This group reveals a recurring theme of wealth and upper social echelons in the Sherlock Holmes collection.

Topic: Business “business good money hundred pay worth pounds company thousand gold” This data shows a pattern of finance and business throughout the Holmes stories.

While the aforementioned data sets were interesting, they weren’t usefully insightful. Then I started noticing an intriguing pattern worth analyzing in most of my trials: on numerous occasions, in different trials, suspense and excitement were grouped with appeals to the senses.

Topic: Suspense (visual) “face turned instant back eyes head forward sprang suddenly caught” I chose to label this topic as visual suspense because these words imply suspense and appeal to the reader’s sense of sight.

Interestingly enough, in trial 4 (100 topics, 3,000 iterations, 10 topic words, with stop words) another suspenseful appeal to the senses surfaced, this time addressed to hearing and sight.

Topic: Suspense (auditory/visual) “heard long window light sound suddenly silence match slipped sharp”

The trends were interesting, but due to the small number of topic words they needed to be substantiated. I wanted to see what would happen when I increased the number of topic words. In my next trial, trial 5 (100 topics, 4,000 iterations, 20 topic words, with stop words), a similar but more elaborate case of sensory appeal occurred.

Topic: Excitement (sensory) “looked gave coming stood start surprise suddenly opened bright astonishment surprised thinking thoughts spoke violent remark opposite approached minute fashion”

These topics indicate a pattern of dramatic sensory appeals throughout the Sherlock Holmes series, showing a trend in Sir Arthur Conan Doyle’s writing style that makes the content of his stories easily engaging to the reader. I theorize this as a potentially significant factor in the great popularity of the Sherlock Holmes stories, an inference I would never have been able to make without topic modeling.


Topic Modeling – Sherlock Holmes

This week’s assignment was actually really fun. I enjoyed experimenting with different combinations of topic modeling settings and comparing the results, but the best part for sure was trying to unveil what possible labels I could use for each group of words. When I chose a combination of settings with more than 20 words printed, it got tough to label them; I always had the feeling that the label I chose wasn’t good enough for most of the words. It is also much better to choose fewer words for the same topic so that it looks more organized and better specified (you can just “jump in” to the part of the story or situation you want).

When I first played with MALLET I used the default settings but changed the number of iterations from 200 to 500, with 10 words printed and 10 topics. It was very easy to label them since I had to work with only 10 words in the same topic, and as I said, when you have fewer words to group you can be even more specific.

For my second attempt I decided to do something a little different, contrasting with what I had found to be easier, so I changed the settings to 30 topics, 2,000 iterations, and 50 words printed. Worst decision ever! I had to look through at least 10 topics until I found one to which I could relate all the words.

The last time I played with the tool I decided to use all the settings I liked best, so I chose 30 topics, 2,000 iterations, and 10 words printed. The results were amazing and really easy to work with; also, the most interesting thing about this attempt is that all my labels were related in a certain way to crime and mystery.

Sherlock Holmes Topic Modeling – In Review

I enjoyed using MALLET and was surprised by how extremely fast it did the topic modeling. I’m glad it’s not a slow process; it actually takes longer to set it up and tell it exactly what to do. As for titling my topics, I found some easier to do than others. While some were rather clear in what they could be titled, others were more difficult, and for some I ended up using a word that was already included in my list, such as Work. I also found myself determined to make my titles only one word, not realizing that two or even three words would do just fine as well. I found two lists that were very similar to each other (Features and Appearance), as I thought some of the words in the two lists could be interchangeable. I certainly expected the popular topics such as Crime and Murder to show up. A lot of my topics are also related to one another, such as Features and Appearance, Crime and Murder, and Communication and Literature. In my Communication topic, the word “tregennis” appeared and I had no idea what it was. It turns out to be a character’s name in the short story titled “The Adventure of the Devil’s Foot.” This just goes to show that additional research is always necessary, no matter what academic tool is used. As I played with different iterations and topic numbers, I noticed that the higher the number, the more variety in the words included in the lists. However, too many topics or words may be too hard to analyze, making the whole process of titling the topics a strenuous task. I like MALLET as a tool for distant reading, which is a concept I think is definitely useful. It’s kind of like a more organized word cloud in a way, one that groups words together instead of just gathering them. As someone who isn’t very fond of reading, analyzing texts, especially this many at one time, is way more enjoyable for me this way.