Topic Modeling

Using MALLET, I was able to build a topic model of the canon of Sherlock Holmes stories. Once the program was set up, I set “Number of Topics” to 50, “Number of Iterations” to 1000, and the number of topic words printed to 20. I then fed in all the Sherlock Holmes stories. With these settings, MALLET gave me fifty topics, each containing twenty words. From these fifty, I chose ten of the more obvious topics and named them. They are as follows:

TOPICS:

Monetary Transactions

  • Business, money, make, hundred, asked, man, year, England, company, pounds, pay, thousand, friends, fifty, lived, ten, gold, paid, price, named

Discovering a Murder

  • Found, dead, man, body, crime, death, murder, police, bloody, finally, blow, knife, tragedy, lay, weapon, criminal, murderer, terrible, committed, scene

Presenting Cases to Holmes

  • Matter, understand, family, gave, brought, trust, complete, confidence, force, absolute, question, son, save, promise, happy, taking, honour, roof, reputation, private

Dinner Party

  • House, night, live, people, large, master, servant, evening, servants, household, purpose, dinner, lodge, Baynes, enter, high, Garcia, Gregson, Scott, children

In a Bedroom

  • Room, window, bed, night, sitting, entered, bedroom, morning, open, heard, dressing, lawn, moment, sleep, drawing, rose, upstairs, gown, rooms, smoking

Waiting for a Taxi

  • Street, half, back, hour, past, Baker, waiting, cab, quarter, ten, minutes, waited, drive, found, reach, waking, reach, hurried, passing, presently

Standing at a Door

  • Door, open, opened, heard, light, key, stood, closed, sound, passage, led, inside, room, locked, step, heavy, stair, hall, lock, instant

Stationery

  • Paper, note, letter, read, papers, table, box, handed, written, book, wrote, writing, letters, happened, write, sheet, post, document, slip, pocket

Travel

  • House, road, hall, place, side, front, walked, carriage, windows, round, led, garden, miles, drove, houses, yards, direction, drive, walk, cottage

Detective

  • Case, Lestrade, evidence, mystery, yard, points, theory, afraid, arrest, facts, effect, Scotland, undoubtedly, difficulty, prisoner, innocent, charge, simple, probably, disappearance

I was then able to experiment with the settings to see different results. As I increased the number of iterations, each run took longer and longer, until I no longer had the time to wait for it to finish. I think the last successful trial used 2000 for “Number of Iterations.”
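For anyone who wants to repeat these runs without clicking through the settings each time, the same three settings correspond, as far as I can tell, to options on MALLET's `train-topics` command. The Python sketch below only builds the command as a list of arguments and prints it; it does not run MALLET. The function name, the `bin/mallet` path, and the file names `holmes.mallet` and `topic_keys.txt` are assumptions made up for illustration, and the input is assumed to be a file already produced by MALLET's import step.

```python
from typing import List


def build_train_topics_command(
    input_path: str,
    num_topics: int = 50,
    num_iterations: int = 1000,
    num_top_words: int = 20,
    mallet_bin: str = "bin/mallet",            # assumed install location
    topic_keys_path: str = "topic_keys.txt",   # hypothetical output file name
) -> List[str]:
    """Build (but do not run) a MALLET train-topics command line."""
    return [
        mallet_bin, "train-topics",
        "--input", input_path,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--num-top-words", str(num_top_words),
        "--output-topic-keys", topic_keys_path,
    ]


# Same settings as the first run, but with the iteration count raised to 2000.
command = build_train_topics_command("holmes.mallet", num_iterations=2000)
print(" ".join(command))
```

Keeping the settings as function arguments makes it easy to change one value at a time, such as the iteration count, and compare the results.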

At first I thought that twenty words per topic was a bit too many, so I experimented with decreasing this number. By doing this I learned that having fewer words in each topic often made the topics harder to interpret and name. Sometimes a smaller group of words isn’t enough to establish a trend, and thus a topic. There may be a better number to use than twenty, but it seems to be a pretty fine line. And, of course, not all of the topics I received with twenty words per topic were very good. Out of the fifty topics I received, I was really only able to find ten that made any sense to me, and even those may be pushing the proverbial envelope.
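The effect of showing fewer words can be seen with a short Python sketch. It assumes MALLET's topic-keys output puts a topic number, a weight, and the topic words on one tab-separated line; that layout is my understanding and is worth checking against a real output file. The topic number `0` and weight `0.1` in the sample line are made-up placeholders, and the words are the first ten from the “Presenting Cases to Holmes” topic above.

```python
from typing import List, Tuple


def parse_topic_key_line(line: str) -> Tuple[int, float, List[str]]:
    """Split one topic-keys line into (topic number, weight, words).

    Assumes three tab-separated fields, with the words separated by spaces.
    """
    topic_id, weight, words = line.rstrip("\n").split("\t", 2)
    return int(topic_id), float(weight), words.split()


# Placeholder topic number and weight; words taken from the list in this post.
sample_line = (
    "0\t0.1\tmatter understand family gave brought "
    "trust complete confidence force absolute"
)

topic_id, weight, words = parse_topic_key_line(sample_line)

# Show the same topic with fewer and with more words visible.
for n in (5, 10):
    print(n, ", ".join(words[:n]))
```

With only five words visible (matter, understand, family, gave, brought), I would have a hard time guessing that this topic is about presenting a case; words such as trust and confidence only appear further down the list.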

As for Topic Modeling, I went into this experiment as a skeptic and I did not really leave any more confident in the process. What Topic Modeling does is take a large number of files, look through all of the words, and then put certain words in groups. The words in each group are supposed to share a common theme, and thus they are meant to tell you something about the texts as a whole. I am skeptical about this process because words, when put together, don’t always mean what they would mean if taken literally. Figurative language, such as idioms and metaphors, is an example of this. I’m not sure if the algorithm used in MALLET takes this into consideration, but I wouldn’t think it does. If that is the case, MALLET could still be a great program (it really is a great program) and Topic Modeling could still be very useful, but only in cases where everything is literal. That being said, Topic Modeling has its limits just like everything else has its limits, and it still may be a great tool for distant reading.
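The worry about figurative language can be made concrete. As far as I understand it, the kind of topic model MALLET trains treats each document as a "bag of words": it counts which words occur and ignores their order. The Python sketch below is not MALLET's code; it is a minimal illustration of that counting step, with a made-up helper name and made-up example sentences.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase word occurrences, discarding word order entirely."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Two sentences with opposite meanings but exactly the same words.
first = "The dog bit the man"
second = "The man bit the dog"
print(bag_of_words(first) == bag_of_words(second))

# An idiom and a literal sentence both contribute the same "bucket" count.
idiom = bag_of_words("He kicked the bucket")
literal = bag_of_words("He filled the bucket")
print(idiom["bucket"], literal["bucket"])
```

Once the text is reduced to counts like these, nothing is left to tell an idiom apart from its literal reading, which is the limit described above.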