What is Text Mining?

Text mining is the mathematically rigorous study of the relationships between words in a large corpus of text.

too many books

Text Mining is often about counting words:
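
As a minimal sketch of what counting words can look like in practice, the Python snippet below tallies word frequencies in a short string. The function name `count_words`, the sample sentence, and the rule for what counts as a word (runs of letters, lowercased) are assumptions made for illustration. A real project would need to decide how to treat hyphens, numbers, and accented characters.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter of lowercase word frequencies in text."""
    # Assumption: a "word" is a run of letters, optionally with internal apostrophes.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)

sample = "The cat sat on the mat. The mat was flat."
counts = count_words(sample)
print(counts.most_common(2))  # [('the', 3), ('mat', 2)]
```

Sorting the tallies with `most_common` is often the first step toward a frequency table or a word cloud.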


Text Mining involves pattern matching:
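
One common way to do pattern matching is with regular expressions. The sketch below pulls four-digit years out of a sentence. The function name `find_years`, the sample sentence, and the choice to accept only years from 1000 to 2099 are assumptions for illustration, not part of any particular tool.

```python
import re

def find_years(text):
    """Return every four-digit year from 1000 to 2099 found in text."""
    # \b keeps a match from starting or ending inside a longer number.
    return re.findall(r"\b(?:1\d{3}|20\d{2})\b", text)

sample = "The press opened in 1861, closed in 1865, and printed 12000 issues."
years = find_years(sample)
print(years)  # ['1861', '1865']
```

The number 12000 is skipped because the word boundaries require the match to be a complete four-digit token.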


A Few Types of Text Mining:

Google Books Ngram Viewer
Named Entity Recognition
  • Part-of-speech tagging
  • Possible use: Extract and categorize entities such as person names, organizations, etc.
  • Example: Six Degrees of Francis Bacon
Word Clouds
  • Simple approach to analytical partitioning
  • Elements of data visualization: the size, color, and spacing of words can serve as elements of an argument
  • Example: TagCrowd.com or WordItOut (try it using this text)
A word cloud made with Voyant Tools
Topic Modeling
  • Comparing large trends in corpora
  • An iterative algorithm identifies a set of topics that run through a set of documents
  • Example: Mining the Dispatch
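
The word cloud idea above, where the size of a word carries meaning, can be sketched as a mapping from word counts to font sizes. This is only an illustration: the function name `font_sizes`, the linear scaling, the size limits of 10 and 48, and the sample counts are all assumptions, and tools such as Voyant Tools or TagCrowd may scale words differently.

```python
from collections import Counter

def font_sizes(counts, min_size=10, max_size=48):
    """Map each word's count to a font size by linear interpolation."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for word, n in counts.items():
        # If every word has the same count, give them all the smallest size.
        scale = (n - lo) / span if span else 0.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes

# Made-up counts, for illustration only.
counts = Counter({"war": 9, "cotton": 5, "railroad": 1})
sizes = font_sizes(counts)
print(sizes)  # {'war': 48, 'cotton': 29, 'railroad': 10}
```

Because the scaling is a design choice, the same counts can produce visually different clouds, which is one reason to read a word cloud as an argument rather than as a neutral picture.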

Limitations and Errors in Text Mining:

  1. Be cognizant of your question as you gather data
    • Data can mislead if it is collected in inappropriate ways. What’s in your set?
    • Data from NY and LA covering only 1920-1940 and 2000-2010 can’t support claims about patterns across the US between 1900 and 2010
  2. Be aware of the limits of text mining
    • You need computer-readable files
  3. Be aware of what you can munge
    • If your transcription or OCR (Optical Character Recognition) always turns the word “like” into “lilce”, the error can still be found and corrected: consistency in errors is still consistency
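
The point about consistent OCR errors can be sketched as a small correction pass: if a misreading always takes the same form, a lookup table can undo it. The table `OCR_FIXES` and the function `fix_ocr` are hypothetical names for illustration, and the single entry comes from the “lilce” example above. A real table would be built by inspecting your own files.

```python
import re

# Hypothetical correction table: each key is a consistent OCR misreading.
OCR_FIXES = {"lilce": "like"}

def fix_ocr(text, fixes=OCR_FIXES):
    """Replace whole-word OCR misreadings with their corrections."""
    if not fixes:
        return text
    # Match only complete words so longer words containing a key are left alone.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, fixes)) + r")\b")
    return pattern.sub(lambda m: fixes[m.group(1)], text)

cleaned = fix_ocr("It looks lilce rain.")
print(cleaned)  # It looks like rain.
```

This sketch is case-sensitive and matches whole words only, so a variant such as “Lilce” or “unlilcely” would need its own entry or a looser pattern.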