A deeper look at topic modeling

wordcloud

All categories chosen from 50 topics with 1000 iterations:

time – morning night back clock waiting past early morrow quarter arrived

writing – paper note read letter table book handed letters written wrote

physical features – face eyes looked thin features lips figure tall dark expression

household – woman lady wife husband life love girl child married maid

clothing/accessories descriptions – black hair red hat heavy round broad centre coat dress

death/crime – found man dead lay body blood death knife lying round

interrogation/crime solving – give matter idea reason question impossible occurred absolutely explanation true

physical reactions – face turned back instant hand sprang forward moment side head

transportation – station train road carriage passed side drive reached drove hour

darkness/mystery – light suddenly dark long caught sat lamp spoke silence silent

Using MALLET was an interesting experience. I enjoyed how simple and accessible the interface was. I had no trouble navigating the program and tweaking the iterations and other settings to my liking. I experimented with several settings before choosing to analyze my topics with 50 topics, 1000 iterations, and 10 topic words per topic. I tested extreme numbers to see how they would influence the data. In one trial I ran 500 topics with 3000 iterations. This produced topics that were too specific, tied to particular stories. I also tried as few as 10 topics with only 500 iterations. This generated too many broad and vague topics that did not capture the essence of the mysteries. In the end I felt that settling on 50 topics with 1000 iterations gave me a good sense of the Sherlock Holmes stories in a general yet helpful way. The word cloud above displays these words in a creative and interactive way.
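For anyone who wants to repeat these trials from a terminal, here is a rough sketch of how the three settings might be written out as MALLET commands. It assumes the command-line version of MALLET (the `train-topics` command) rather than the interface I used, and the file names `sherlock.mallet` and `topic_keys.txt`, the `bin/mallet` path, and the 10 topic words for the two extreme trials are placeholders of my own. The sketch only builds and prints the commands; it does not run MALLET.

```python
from typing import List


def mallet_train_command(input_file: str, num_topics: int,
                         num_iterations: int, num_top_words: int,
                         keys_file: str = "topic_keys.txt") -> List[str]:
    """Build the argument list for one MALLET train-topics run."""
    return [
        "bin/mallet", "train-topics",           # path to MALLET is an assumption
        "--input", input_file,                  # corpus already imported into MALLET format
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--num-top-words", str(num_top_words),  # the "topic words" setting
        "--output-topic-keys", keys_file,       # one line of top words per topic
    ]


# (topics, iterations, topic words) for the three trials described above.
# The topic-word count for the two extreme trials is assumed to be 10.
runs = {
    "too broad": (10, 500, 10),
    "chosen": (50, 1000, 10),
    "too specific": (500, 3000, 10),
}

for label, (topics, iterations, words) in runs.items():
    command = mallet_train_command("sherlock.mallet", topics, iterations, words)
    print(label, "->", " ".join(command))
```

Writing the settings down this way makes it easier to rerun a trial later and compare the resulting topic lists side by side.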

I chose these ten topics out of the fifty total because of their overall similarity. I assigned the simplest titles that I could think of to each of them to give a general structure for understanding the Sherlock Holmes stories as a collection. Ten basic concepts that reflect the entire collection are easier for the reader to grasp and accept. Each title represents an element of the stories that is essential to the work as a murder mystery of the time in which it was written. Obviously topics such as death, crime, interrogation, and mystery are all blunt examples of what a mystery story encompasses. Some of the other topics, such as physical reactions and features, are more subtle examples yet serve just as important a role. The stories rely primarily on context clues and other literary devices that create an interesting and challenging mystery to solve. Physical expressions and reactions are important elements of any mystery story because they can explain a lot about an individual character or the way that character responds to certain situations. Another topic, clothing descriptions, seems to be part of the writing style of the Sherlock Holmes collection. Holmes is an icon for mystery investigators, and the way that he is dressed is an important part of his appeal. The author pays a lot of attention to describing how Holmes and the other characters are dressed throughout the entire series.

Topic modeling provides a unique framework for examining thousands or millions of texts at once. Distant reading is an interesting concept that I will hopefully be able to exercise in future research. The ability to apply your own ideas and lens to any given topic or series of works through topic modeling is something truly valuable that many other classic tools or academic research methods do not allow or facilitate.

Topic Modeling with MALLET: Analyzing the Results

Initially, it was difficult for me to understand the definition and purpose of topic modeling. However, after using MALLET, a topic modeling tool, to find patterns in Sherlock Holmes stories, I began to understand how topic modeling works.

After entering the Sherlock Holmes stories into MALLET, I found 10 good topics. The first 6 topics came from 50 topics, 1000 iterations, and 20 topic words printed. The topic names were Letter Writing, Crime, Marriage, Death, Clues, and Physical Description (Male). The other four topics came from 70 topics, 1500 iterations, and 15 topic words printed. These were Holmes in his Chair, Rooms in a House, London Finance, and Investigation Process. I experimented with other variations of iterations, topics, and topic words printed, but only had time to upload these output files onto my computer. By testing out many different variations I found that the more iterations and topic words you have, the easier it is to identify the topic name. After I picked out my 10 topics, I clicked on the topic words within them in order to see the top-ranked documents within that topic. MALLET then allowed me to see the number of words in a specific document that were assigned to that topic. I found, for example, that 22 words in a document from The Stock Broker’s Clerk were assigned to the London Finance topic. The words in this topic were: money business work hundred answered good pounds company asked thousand advertisement city price headed pay. The document excerpt that MALLET showed at the top of the page revealed that this part of the story was about a “gigantic robbery” in which “nearly a hundred thousand pounds worth of American railway bonds” were found in the robber’s bag. This explains why 22 of the words within the document were assigned to London Finance. MALLET also showed that only 12% of the words in that entire document were assigned to this topic. I went through this same process with all of my topics to figure out which Sherlock Holmes stories discussed certain topics, and how many words in each story were assigned to those topics.
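The two numbers above (22 words and 12%) are related by simple division, and a small worked example may make that clearer. This is only a sketch: it assumes the percentage is the count of words assigned to the topic divided by the count of words MALLET kept for that document after removing stopwords, and since 12% is a rounded figure, the total it implies is approximate.

```python
def topic_share(words_in_topic: int, total_words: int) -> float:
    """Fraction of a document's counted words assigned to one topic."""
    return words_in_topic / total_words


def implied_total(words_in_topic: int, share: float) -> int:
    """Approximate document length implied by a count and its share."""
    return round(words_in_topic / share)


# 22 words assigned to London Finance, reported as 12% of the document.
total = implied_total(22, 0.12)
print(total)                            # 183 (22 / 0.12 is about 183.3)
print(f"{topic_share(22, total):.1%}")  # 12.0%
```

In other words, the document MALLET was looking at would have had somewhere around 180 counted words, so even a topic that clearly matches the passage accounts for only a small slice of it.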

Altogether, I think topic modeling with MALLET is a great way of distant reading. MALLET proved to be efficient: it sifted through massive amounts of text from Sherlock Holmes stories and found patterns within them faster than most of us could finish reading just one of those stories. There were a few aspects of MALLET, however, that I disliked. First, it creates enormous files. These files take up a lot of space, which makes transferring them onto Google Drive and onto other computers extremely slow. On top of this, some of the topics it creates are extremely difficult to name because the words don’t seem to have much in common. A lot of the topics also reappeared after I changed the number of iterations, topics, and topic words (e.g., London Finance, Death, Holmes in his Chair). I suppose that was inevitable, though, because the text being read by MALLET didn’t change.
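One rough way to check whether a topic has "reappeared" in a different run is to measure how many top words two word lists share. The sketch below uses Jaccard overlap (shared words divided by all distinct words), which is my own choice of measure and not something MALLET reports. The first list is the London Finance topic quoted above; the second is the "Business" topic reported for a 50-topic, 2000-iteration run elsewhere in this collection of posts.

```python
def jaccard(words_a: str, words_b: str) -> float:
    """Share of distinct words that two space-separated word lists have in common."""
    a, b = set(words_a.split()), set(words_b.split())
    return len(a & b) / len(a | b)


london_finance = ("money business work hundred answered good pounds company "
                  "asked thousand advertisement city price headed pay")
business = "business good money hundred pay worth pounds company thousand gold"

shared = sorted(set(london_finance.split()) & set(business.split()))
print(shared)  # the 8 words that appear in both lists
print(f"{jaccard(london_finance, business):.2f}")  # 0.47 (8 shared of 17 distinct)
```

An overlap this high between runs with different settings fits the observation that the same topics keep coming back when the underlying text stays the same.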

After completing this project, I understand that topic modeling tools such as MALLET are useful in that they can take texts and then find patterns in the use of words. Topic modeling is most effective when we have many documents/texts that we want to understand without actually closely reading each individual text (distant reading!).

Mary Dellas

Followup: 2000 iterations and a burning hot computer

My computer is not sluggish: it can handle Battlefield 4 on Ultra at 1080p/60fps (which, for you non-gamers, means very fast and very good looking). However, it would seem that skimming through text documents gives it some pause. 62.976 seconds after starting up the topic modeling tool, though, my little machine spit out a list of 50 topics that could be isolated from the various words therein. So that you don’t need to refer back to my last post, here’s a refresher:

1. holmes word head words men message revolver shook life shot — Holmes, firearms, and investigations
2. light stood long suddenly lamp dark sound low shoulder figure — Stealth and sneakiness
3. clear doubt mind person possibly obvious idea excellent perfectly point — Deduction and flattery
4. make father made heard son returned left mr view point — Conspiracy and inheritance
5. eyes face man looked dark thin tall features companion pale — Description of characters
6. house small large stone great high place square windows houses — Houses and mansions
7. reason remember fear danger clear told chance strong horror family — Rationale
8. told heart knew god story hands life speak truth leave — Rationalization
9. matter understand position imagine call absolutely important trust force hope — Help me, Holmes, you’re my only hope
10. holmes mr professor fresh work aware surprise action great change — Sudden change in behavior

So, why did I choose these topics? They all had a primary commonality, being that they were about a general topic narrowed down to instances from their specific stories. Examples were plucked from specific passages, but these are overarching sentiments seen again and again in the archives. These sentiments are basic tropes in the mystery canon: implements of murder (1), men creeping in the shadows (2), a victim’s family rationalizing their sorrows (8), and, particularly for Holmes, a plea for help (9).

The simplicity of the fairly elaborate points here makes these 10 topics effective for getting a “feel” for Sherlock Holmes and the universe he inhabits. Together, they detail the basic elements of an average story. Thus, I believe them to be the most effective topics to be chosen out of this fairly bulky list.

As for the generation of the list, I experimented with a variety of settings before settling on the 50 topics/2000 iterations/10 topic word option. I tried as many as 500 topics and 5000 iterations, and as few as 10 topics and 500 iterations. The former produced too many specific topics, focusing on specific plot elements from specific stories. The latter produced too many broad topics, focusing on broadly used vocabulary words from many of the stories. I determined that an appropriate middle ground was found in the 50/2000/10 option, and I believe the topics chosen reflect that.

50 topics, 2000 iterations and a strangely sluggish i7 later

All from a cycle consisting of 50 topics, 2000 iterations, and 10 topic words.

1. holmes word head words men message revolver shook life shot — Holmes, firearms, and investigations
2. light stood long suddenly lamp dark sound low shoulder figure — Stealth and sneakiness
3. clear doubt mind person possibly obvious idea excellent perfectly point — Deduction and flattery
4. make father made heard son returned left mr view point — Conspiracy and inheritance
5. eyes face man looked dark thin tall features companion pale — Description of characters
6. house small large stone great high place square windows houses — Houses and mansions
7. reason remember fear danger clear told chance strong horror family — Rationale
8. told heart knew god story hands life speak truth leave — Rationalization
9. matter understand position imagine call absolutely important trust force hope — Help me, Holmes, you’re my only hope
10. holmes mr professor fresh work aware surprise action great change — Sudden change in behavior

Mallet Topic Modeling; Sherlock Holmes

Topics: 50, Iterations: 2000, TW:10, Stopwords Enabled

Estate: house long high large place windows garden front standing servants

Business: business good money hundred pay worth pounds company thousand gold

Intrigue: case singular interest remarkable friend cases problem sherlock interesting events

Suspense (visual): face turned instant back eyes head forward sprang suddenly caught

 

Topics: 100, Iterations: 3000, TW: 10, Stopwords Enabled

Suspense (auditory): heard long window light sound suddenly silence match slipped sharp

 

Topics: 100, Iterations: 4000, TW: 20, Stopwords Enabled

Excitement: looked gave coming stood start surprise suddenly opened bright astonishment surprised thinking thoughts spoke violent remark opposite approached minute fashion

 

Topics: 1000, Iterations: 3000, TW: 15, Stopwords Enabled

Books: paper note read book writing written handed sheet table slip wrote page pen pencil address

 

Topics: 75, Iterations: 4000, TW: 20, Stopwords Enabled

Society: knew man men made american england country fear heard secret rich world society america lived garrideb wonderful collection land gennaro

Death: found dead body lay man blood death blow knife unfortunate terrible person lying finally cut weapon evidence constable remained wound

Crime: crime night criminal evidence murder death police arrest charge strong present appeared tragedy committed discovered violence murdered proved occurred motive

 

Google NGram Viewer: Police v. Crime & Domestic Work v. Industrial Work

The development of police forces progressed drastically throughout the 19th century. This advancement made me curious as to whether police appeared as often as crime in English literature during the 19th century. The first two words I entered into Google NGram Viewer were crime and police.

Screen Shot 2014-10-19 at 12.49.41 PM
Crime made a steady appearance in English literature throughout most of the 19th century, showing up much more often than police until about 1880. In 1880, police takes a huge turn and begins popping up a lot more while the appearance of crime decreases slightly. I googled “police in 1880” in an attempt to figure out what caused this spike in the appearance of police. One of the first web results revealed that there was a surge in gun crime in 1880, mainly in London. I also found that urban police departments in the 1880s were developing new methods to keep track of criminals and maintain records about them. Here, it became evident that the word police was beginning to come into English literature more often because surges in violence prompted police to develop more effective strategies in approaching crime and criminals.

I think that crime and police cross paths in 1893 on Google NGram Viewer because many cities had developed (or were in the process of developing) strong police forces after seeing their success in other cities. The growing popularity of police forces suggests that crime in 1893 English literature probably involved police.
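Finding the point where two lines "cross paths" can also be done without eyeballing the graph. The sketch below shows one way to look for the first year in which one series catches up with another. The numbers are made-up toy values in arbitrary units, not figures from Google NGram Viewer, and the function name is my own.

```python
from typing import List, Optional


def first_crossing(years: List[int], leader: List[float],
                   chaser: List[float]) -> Optional[int]:
    """Return the first year in which `chaser` meets or passes `leader`
    after having been below it in the previous data point."""
    for i in range(1, len(years)):
        was_below = chaser[i - 1] < leader[i - 1]
        now_at_or_above = chaser[i] >= leader[i]
        if was_below and now_at_or_above:
            return years[i]
    return None  # the two series never cross in this range


# Toy values only, chosen to illustrate the idea.
years = [1870, 1880, 1890, 1900]
crime = [5.0, 5.0, 4.5, 4.0]
police = [2.0, 2.5, 4.0, 5.0]

print(first_crossing(years, crime, police))  # 1900
```

With yearly data instead of one point per decade, the same check would report the crossing year more precisely.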

The next two phrases that I entered into Google NGram Viewer were domestic work and industrial work. I was curious to see if the change from the domestic industry to the factory/industrial industry was reflected onto the pages of books in English literature.

Screen Shot 2014-10-19 at 2.46.39 PM

The graph processed by Google Ngram Viewer shows that industrial work was seldom mentioned in English literature until 1843. After researching industrial work in 1843, I found that between 1843 and 1848, women protested their wage decrease in textile mills (industrial work). Another prominent point on this graph is the period from 1866 to 1869, where the appearance of industrial work spikes and then crosses paths with domestic work.  Perhaps the reason for the spike is the invention of dynamite in 1866, and tungsten steel in 1867. Both played an important role in the industrial revolution because dynamite allowed for the clearing of paths (to build on), and tungsten steel was used in new buildings. During such a pivotal period, people probably began writing more about the industrial revolution, which explains this spike in the appearance of industrial work.

In 1875 industrial work became less popular in English literature and then began to climb gradually in 1880. In 1890 we see a peak in the appearance of domestic work.  This was when the National American Woman Suffrage Association was formed and the American Federation of Labor declared support for woman suffrage. Female voices were heard and women were able to discuss their desire to vote and to be viewed seriously outside of the domestic workforce. This movement may explain the peak in the appearance of domestic work in English literature.

Altogether, I found that Google NGram Viewer is an effective way of “distant reading.” It allows me to spot trends across many different works by looking at the frequency of words and phrases in literature. The only change I would suggest for this site is the addition of axis titles.

Mary Dellas

Works Cited:

“Detection and the Police.” Detection and the Police. N.p., n.d. Web. 19 Oct. 2014.

“How Safe Was Victorian London?” How Safe Was Victorian London? Ed. Jacqueline Banerjee. N.p., 6 Feb. 2008. Web. 19 Oct. 2014.

“National Women’s History Museum.” Education & Resources. NWHM, n.d. Web. 19 Oct. 2014.

Taylor, Emily. “Inventions of the Industrial Revolution.” Time Toast. TimeToast, n.d. Web. 19 Oct. 2014.

Ngram Comparisons: “War, American” & “Happy, Stress”

TRENDS OF CULTURE:

“War, American”

Screen Shot 2014-10-19 at 5.45.28 AM
Ngram comparison shows unsurprising correlation between War (blue) and American (red). Smoothing:5 1800-2000

Although the context in which these words appear remains entirely unknown, I can’t help but feel this correlation reinforces the militant stereotype of the United States. Unsurprisingly, “war” peaks during the years of each world war (1914–1918 and 1939–1945), around the same time “American” sees a rapid increase. The frequency of “war” was nearly identical in the first year of the Civil War (1861, .029%) and of the Vietnam War (1955, .031%). The decrease in mentions of “war” from 1965 to 2000 could be the result of a change in popular parlance, as terms like “conflict” are now commonly used to refer to war-like circumstances.

Also, similar results appear when swapping “American” for “United States”.

 

“Happy, Stress”

Screen Shot 2014-10-19 at 5.44.41 AM
Ngram displays unfortunate trend of culture. Stress (blue), Happy (red). Smoothing:10 1800-2000

Smoothing this one out made the message very clear: “stress” is becoming more of a hot topic while “happy” decreases. There was a lot of happiness being talked about in the 1800s and virtually no stress until the advent of the Civil War. “Stress” gains considerable momentum beginning in 1970, and “happy” is at an all-time low in 2000.
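For anyone curious what the smoothing setting does, here is a minimal sketch of a moving average of the kind the Ngram Viewer describes: a smoothing of 1 averages each year with one year on either side, and a smoothing of 10 averages it with ten years on either side. The way the ends of the range are handled here (averaging whatever years are available) is my assumption, and the sample values are made up.

```python
from typing import List


def smooth(values: List[float], smoothing: int) -> List[float]:
    """Average each value with up to `smoothing` neighbours on each side."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - smoothing)              # clip the window at the left edge
        end = min(len(values), i + smoothing + 1)  # and at the right edge
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed


raw = [1.0, 5.0, 2.0, 8.0, 3.0]  # made-up yearly values

print(smooth(raw, 0))                            # [1.0, 5.0, 2.0, 8.0, 3.0]
print([round(v, 2) for v in smooth(raw, 1)])     # [3.0, 2.67, 5.0, 4.33, 5.5]
```

A higher smoothing value flattens year-to-year jumps, which is why the long-term trend stands out more at a smoothing of 10 than at 1.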


JUST FOR FUN:

“Most, Less, and Least”

Screen Shot 2014-10-19 at 5.45.04 AM
 Nothing ironic about that. Most (blue), Less (red), Least (green). Smoothing:1 1800-2000

Much to my surprise this ngram went exactly as I anticipated. I saw similarly predictable results searching “1,2,3”, “one,two,three” and “first,second,third”, with all comparisons falling in sequential order and “first”, “one”, “1”, and “most” occurring considerably more often than their counterparts.

“Apples, Oranges”

Screen Shot 2014-10-19 at 5.45.18 AM
Ngram comparing apples and oranges. Smoothing:3 1800-2000

For humor’s sake I decided to compare apples and oranges. The result showed a steady disparity in popularity between the two, with apples being referenced more often. Interestingly enough, both peaked in popularity between 1909 and 1948.

Google Ngram

I used the Branch Collective website to choose words that I thought may show interesting correlations regarding their presence in texts throughout the nineteenth century. For my first Ngram, I looked at evolution and ethics. For my second Ngram, I looked at imperialism and nationalism.

Screen shot 2014-10-17 at 2.03.12 PM

Screen shot 2014-10-17 at 2.05.54 PM

The first chart (evolution and ethics) shows an increase in the use of both evolution and ethics later in the century, around 1870. This makes a lot of sense because Charles Darwin began to publish his theories around this time, and there was a lot of talk and controversy surrounding evolution. Many debates on evolution took place around this time, such as the 1860 meeting of the British Association for the Advancement of Science in Oxford. Ethics played a large role in debates surrounding evolution and God.

The second chart (imperialism and nationalism) shows an increase in both words during the second half of the century, with a huge spike in “imperialism” at the tail end, closest to 1900. This makes a lot of sense because the Second Boer War started in 1899 and was marked by an increase in feelings of nationalism and the “New Imperialism,” along with racism and genocidal thinking. This was part of the “Scramble for Africa” among European nations.

(source: Branch Collective Topic Clusters – http://www.branchcollective.org)

Google Ngram Viewer is very helpful in locating trends within literature of digitized books from specific time periods. However, as noted in the blog post by Ted Underwood, there is a lack of context which can lead to misinterpretation or misinformation. That is why websites like Branch Collective can be helpful in understanding these correlations and trends.