Introducing Topic Modeling

Topic modeling is a type of Machine Learning…

Machine Learning (ML)

ML is the application of algorithms and statistical modeling that allows computers to “learn” from data in order to perform a task. The term overlaps with, and is often used interchangeably with, Artificial Intelligence (AI).

ML tasks are broadly separated into supervised and unsupervised learning. Supervised learning tasks typically involve feeding the algorithm a labeled training data set, which is used to build a model that can then classify unknown items by making inferences based on what it has learned. Unsupervised learning tasks involve feeding unlabeled data to an algorithm that identifies patterns and groupings (clusters) in the data.
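
The contrast between the two kinds of task can be sketched in a few lines of Python. This is only a toy illustration, not a real ML workflow: the numbers, the labels "short" and "long", and the function names (train_centroids, classify, two_means) are all made up for this example. The supervised half builds a tiny model from labeled items and uses it to classify an unknown item; the unsupervised half receives no labels and groups the values on its own.

```python
from statistics import mean

# --- Supervised: labeled training data -> model -> classify unknown items ---
def train_centroids(labeled):
    """Build a tiny model: the average value seen for each label."""
    groups = {}
    for value, label in labeled:
        groups.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in groups.items()}

def classify(model, value):
    """Assign the label whose average is closest to the unknown value."""
    return min(model, key=lambda label: abs(model[label] - value))

# --- Unsupervised: unlabeled data -> algorithm finds the groups itself ---
def two_means(values, iterations=10):
    """A very small 1-D k-means with k=2, seeded with the min and max."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[nearest].append(v)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

# Hypothetical toy data, invented for illustration only.
labeled = [(1.0, "short"), (1.5, "short"), (9.0, "long"), (10.0, "long")]
model = train_centroids(labeled)
prediction = classify(model, 8.2)      # uses what the model "knows"

unlabeled = [1.0, 1.5, 2.0, 9.0, 10.0, 9.5]
clusters = two_means(unlabeled)        # no labels were ever supplied
```

Note that the unsupervised result is just two unnamed groups; deciding what each group means is left to the person interpreting the output.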


Topic Modeling

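Topic modeling is an example of unsupervised machine learning: the algorithm receives unlabeled texts and identifies groups of words that tend to occur together.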

Terms to Know
  • Text Mining: A general term encompassing several different types of automated discovery using a body of texts. Topic modeling is a type of text mining.

  • Topic (discourse/theme): A group of words that have a high likelihood of appearing together.

  • Document: A discrete unit of text. This can be a blog post, a book chapter, a book, a journal article, a diary entry, etc. For topic modeling, you determine what a document is depending on the nature of the corpus or the kind of results you’re looking for.

    • Because topics arise from documents, it is wise to think carefully about how to segment your data. For example, if your corpus is 25,000 emails, do you treat each email as a document, or all emails by a specific author as a single document? The choices you make at this stage will directly affect your outcomes and the way you interpret them.
  • Corpus: A collection of documents (“body of text”).
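
The segmentation choice described above can be made concrete with a short Python sketch. Everything here is hypothetical: the three sample emails, the author names, and the two helper functions are invented for illustration. The same raw material yields two different corpora depending on what counts as a document.

```python
from collections import defaultdict

# Hypothetical toy data: (author, email body) pairs.
emails = [
    ("alice", "Budget meeting moved to Friday"),
    ("bob", "Lunch on Friday?"),
    ("alice", "Budget draft attached"),
]

def one_document_per_email(emails):
    """Treat every email as its own document."""
    return [body for _author, body in emails]

def one_document_per_author(emails):
    """Merge all emails by the same author into a single document."""
    by_author = defaultdict(list)
    for author, body in emails:
        by_author[author].append(body)
    return {author: " ".join(bodies) for author, bodies in by_author.items()}

corpus_by_email = one_document_per_email(emails)    # 3 documents
corpus_by_author = one_document_per_author(emails)  # 2 documents
```

A topic model run on the first corpus would look for words that co-occur within individual messages, while one run on the second would look for words that co-occur across everything a person wrote, so the two are likely to surface different topics.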

Basics of Topic Modeling: