Tools
Topic modeling can be done with a range of tools, including libraries for Python and R. Here, we’ll discuss a program called MALLET and the statistical model it uses.
MALLET
- Natural language processing toolkit
- Command line tool
- Easy to run with this tutorial: Programming Historian Guide to MALLET
Under the Hood: Latent Dirichlet Allocation
MALLET’s topic modeling is built on a statistical model called Latent Dirichlet Allocation (LDA). In broad strokes, it works like this:
- Begin by gathering a set of documents you want to model
- Tell the algorithm the number of topics (X) you want it to produce, and the number of iterations (Z) you want it to run on your documents
- The model then goes through each of your documents and randomly assigns each word to one of the X topics
- After this first pass, you have some pretty terrible topics
- But luckily you’ve told the model to iterate Z times! (This is the important part…)
- The model runs Z times, each time assessing the probability that a given word appears in each topic and the probability that a given topic appears in each document, and reassigning words to topics accordingly
- After that many iterations, the model gets pretty good at clustering words that are likely to appear in similar contexts across all the documents in your corpus
- Your end-product is a list of these clusters (or “topics”)…

…and a data file showing what percentage of each of your documents is made up of each topic.
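
The steps above can be sketched in a short Python program. This is a toy illustration under stated assumptions, not MALLET’s actual code: it uses a simplified Gibbs sampler, and the function name `toy_lda`, the smoothing values `alpha` and `beta`, and the sample documents are all made up for this example.

```python
import random
from collections import defaultdict


def toy_lda(docs, num_topics, num_iterations, alpha=0.1, beta=0.01, seed=0):
    """Toy Gibbs sampler for LDA. `docs` is a list of non-empty token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables that are updated as words move between topics.
    doc_topic = [[0] * num_topics for _ in docs]                # topic counts per document
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # total words per topic

    # Randomly assign every word in every document to one of the topics.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    # Iterate: revisit every word and pick a new topic for it.
    for _ in range(num_iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Take the word out of its current topic.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how common it is in this document
                # times how common this word is in that topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Output 1: up to five of the most frequent words in each topic.
    top_words = []
    for counts in topic_word:
        ranked = sorted(counts, key=counts.get, reverse=True)
        top_words.append([w for w in ranked if counts[w] > 0][:5])

    # Output 2: the share of each document assigned to each topic.
    doc_shares = [
        [count / len(doc) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return top_words, doc_shares


if __name__ == "__main__":
    sample_docs = [
        ["whale", "ship", "sea", "whale", "captain"],
        ["letter", "marriage", "love", "letter"],
        ["ship", "sea", "captain", "voyage"],
    ]
    topics, shares = toy_lda(sample_docs, num_topics=2, num_iterations=100)
    for number, words in enumerate(topics):
        print("Topic", number, ":", " ".join(words))
    for number, row in enumerate(shares):
        print("Document", number, ":", [round(share, 2) for share in row])
```

With a corpus this small the topics will not be meaningful; the point is only to show the random start, the repeated reassignment, and the two kinds of output.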


Still Confused? Want to Know Specifics?
- Matt Jockers’s Topic Modeling “Fable” (LDA Buffet) is a really great, non-technical entry point into how LDA works.
- David Blei developed LDA together with Andrew Ng and Michael I. Jordan. Check out his article Introduction to Probabilistic Topic Models in Communications of the ACM for further information about the algorithms behind LDA.
Modifying Your Output
- Stopword List: A stopword is a word (usually a very common one) that an application has been programmed to ignore. Stopword lists typically contain words such as a, an, the, and, to, and from.
  - Stopword lists are customizable. Depending on your research question, it can be useful to add words that are frequent in your particular corpus, such as character names or place names, so that they are left out of the analysis.
- Parameters: Change how MALLET runs (the number of topics, the number of iterations, and other more complex parameters) to get different output (see MALLET Documentation for more details)
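
As a rough illustration of both ideas, here is a small Python sketch. The first function shows what a stopword list does to a text, including researcher-supplied additions. The second only assembles (it does not run) a MALLET training command with the topic and iteration counts filled in. The function names, the short default stopword list, and the file names are assumptions made for this example; the command-line options shown are ones described in the MALLET documentation, so check them against your installed version.

```python
# A deliberately tiny default list; real stopword lists are much longer.
DEFAULT_STOPWORDS = {"a", "an", "the", "and", "to", "from"}


def remove_stopwords(tokens, extra_stopwords=()):
    """Drop default and researcher-supplied stopwords, ignoring case."""
    stopwords = DEFAULT_STOPWORDS | {w.lower() for w in extra_stopwords}
    return [t for t in tokens if t.lower() not in stopwords]


def mallet_train_command(input_file, num_topics, num_iterations):
    """Build the argument list for a MALLET training run (hypothetical file names)."""
    return [
        "bin/mallet", "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--output-topic-keys", "topic_keys.txt",  # assumed output file name
        "--output-doc-topics", "doc_topics.txt",  # assumed output file name
    ]


if __name__ == "__main__":
    words = "The whale and Ahab went to the sea".split()
    print(remove_stopwords(words))
    print(remove_stopwords(words, extra_stopwords=["Ahab"]))
    print(" ".join(mallet_train_command("corpus.mallet", 20, 1000)))
```

Changing the numbers passed to `mallet_train_command`, or the words passed as `extra_stopwords`, corresponds to the two kinds of modification described above.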
Other Tools
- Overview Docs, an online tool designed for journalists to sort through huge data sets
- jsLDA, a tool for in-browser topic modeling using LDA