I initially questioned the reliability of topic modeling as a means of analyzing and interpreting bodies of literature. I understood how trends in word distribution could thematically summarize a canon and reveal patterns in it, but because I misunderstood how the process worked, I remained skeptical. I doubted that one universal algorithm could be applied to a variety of canons and yield worthwhile data for each. My attitude shifted as I familiarized myself with MALLET’s interface and the nature of the topic modeling process. While nothing is perfect, the program offers sufficient means to customize the analysis, which is necessary for producing worthwhile results. In my experience, this feature can make or break an analysis, and it ultimately dictates how practical the resulting data sets are.
I came to this conclusion after running multiple trials on the same canon with varied settings and getting results that varied in usefulness. For instance, my first trial with MALLET yielded “iffy” results. The algorithm ran on its default settings: 50 topics, 1000 iterations, 20 topic words, and stop words enabled. For whatever reason, when the Sherlock Holmes canon was analyzed with the default settings, the resulting word groups contained many “outlier” words, which made it difficult to generalize each group under one topic. If the analysis weren’t customizable, I couldn’t have adapted the algorithm to settings that produce significant data. In my second trial, when the same canon was analyzed with slight variations (50 topics, 2000 iterations, 10 topic words, with stop words), the resulting data sets were more easily understood and grouped into topics that reveal trends in the body of literature. Some of the results were as follows:
Topic: High Class Estate “house long high large place windows garden front standing servants” This group reveals a recurring theme of wealth and upper social echelons in the Sherlock Holmes collection.
Topic: Business “business good money hundred pay worth pounds company thousand gold” This data shows a pattern of finance and business throughout the Holmes stories.
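For readers who want to reproduce trials like these, here is a minimal sketch of how the reported settings might map onto MALLET's command-line tools. It assumes the command-line version of MALLET, which may not be the interface used for these trials, and it assumes that "with stop words" means stop-word removal was switched on. The directory and file names (holmes_txt, holmes.mallet, the *_keys.txt outputs) are hypothetical. Trial 3 is left out because its settings are not described. The script only builds the commands and prints them; it does not run MALLET.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One topic-modeling run, using the settings reported in the text."""
    name: str
    num_topics: int
    num_iterations: int
    num_top_words: int


# Settings as reported in the write-up; trial 3 is not described, so it is omitted.
TRIALS = [
    Trial("trial1", 50, 1000, 20),
    Trial("trial2", 50, 2000, 10),
    Trial("trial4", 100, 3000, 10),
    Trial("trial5", 100, 4000, 20),
]


def import_command(corpus_dir, mallet_file, mallet_bin="bin/mallet"):
    """Build the command that converts a folder of text files into MALLET's format.

    Assumption: stop-word removal is enabled, matching "with stop words".
    """
    return [
        mallet_bin, "import-dir",
        "--input", corpus_dir,
        "--output", mallet_file,
        "--keep-sequence",
        "--remove-stopwords",
    ]


def train_command(trial, mallet_file, mallet_bin="bin/mallet"):
    """Build the topic-training command for one trial."""
    return [
        mallet_bin, "train-topics",
        "--input", mallet_file,
        "--num-topics", str(trial.num_topics),
        "--num-iterations", str(trial.num_iterations),
        "--num-top-words", str(trial.num_top_words),
        # Hypothetical output file name: one line of top words per topic.
        "--output-topic-keys", f"{trial.name}_keys.txt",
    ]


if __name__ == "__main__":
    # Print the commands so they can be inspected or pasted into a shell.
    print(" ".join(import_command("holmes_txt", "holmes.mallet")))
    for trial in TRIALS:
        print(" ".join(train_command(trial, "holmes.mallet")))
```

Keeping each trial as a small record makes it easy to see that only three numbers change between runs, which is the kind of customization that made the difference between the first and second trials.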
While the aforementioned data sets were interesting, they weren’t especially insightful. Then I started noticing an intriguing pattern in most of my trials that seemed worth analyzing. On numerous occasions, in different trials, words of suspense and excitement were grouped with appeals to the senses.
Topic: Suspense (visual) “face turned instant back eyes head forward sprang suddenly caught” I chose to label this topic as visual suspense because these words imply suspense and appeal to the reader’s sense of sight.
Interestingly enough, in trial 4 (100 topics, 3000 iterations, 10 topic words, with stop words), another suspenseful appeal to the senses surfaced, this time addressed to both hearing and sight.
Topic: Suspense (auditory/visual) “heard long window light sound suddenly silence match slipped sharp”
The trends were interesting, but because of the small number of topic words, they needed to be substantiated. I wanted to see what would happen when I increased the number of topic words. In my next trial, trial 5 (100 topics, 4000 iterations, 20 topic words, with stop words), a similar but more elaborate case of sensory appeal occurred.
Topic: Excitement (sensory) “looked gave coming stood start surprise suddenly opened bright astonishment surprised thinking thoughts spoke violent remark opposite approached minute fashion”
These topics indicate a pattern of dramatic sensory appeals throughout the Sherlock Holmes series, showing a trend in Sir Arthur Conan Doyle’s writing style that makes his stories readily engaging to the reader. I theorize that this is a potentially significant factor in the great popularity of the Sherlock Holmes stories. It is an inference I would never have been able to make without topic modeling.
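The sensory labels above were assigned by reading each word list. One rough way to cross-check such labels is to count how many of a topic's words appear in a small hand-made list of sense-related words. The sketch below does this for three of the topics quoted earlier. The SENSORY_WORDS lexicon is purely an illustrative assumption and not part of MALLET or of the original analysis; a different lexicon would give different counts.

```python
# Hypothetical, hand-picked lexicon of sense-related words (an assumption
# made for illustration; it is not derived from MALLET or the corpus).
SENSORY_WORDS = {
    "sight": {"eyes", "face", "light", "bright", "looked", "window"},
    "hearing": {"heard", "sound", "silence", "spoke"},
}

# Topic word lists copied from the results quoted in the text.
TOPICS = {
    "Suspense (visual)":
        "face turned instant back eyes head forward sprang suddenly caught",
    "Suspense (auditory/visual)":
        "heard long window light sound suddenly silence match slipped sharp",
    "Business":
        "business good money hundred pay worth pounds company thousand gold",
}


def sensory_hits(topic_words, lexicon):
    """Return, for each sense, the topic words found in that sense's word set.

    Words are kept in the order they appear in the topic.
    """
    words = topic_words.split()
    return {
        sense: [word for word in words if word in vocabulary]
        for sense, vocabulary in lexicon.items()
    }


if __name__ == "__main__":
    for label, topic_words in TOPICS.items():
        print(label, sensory_hits(topic_words, SENSORY_WORDS))
```

With this particular lexicon, the two suspense topics each match several sense words while the Business topic matches none, which is consistent with the labels given above but is only as trustworthy as the word list itself.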