What is Text Mining?

Text mining is the mathematically rigorous study of the relationships between words in a large corpus of text.

Text mining helps researchers detect patterns and connections in large volumes of textual material, allowing them to draw conclusions from a body of text that they would not otherwise be able to read, synthesize, and incorporate into their scholarship.

Text mining is often about counting words.
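As a minimal sketch of what counting words can look like in practice, the Python below tallies word frequencies in a short string. The function name `count_words`, the sample sentence, and the rule for what counts as a word (runs of letters, lowercased) are all assumptions made for illustration; a real project would choose its own tokenization.

```python
import re
from collections import Counter

def count_words(text):
    """Return a Counter mapping each lowercase word to its frequency."""
    # Assumption: a "word" is a run of letters, optionally with internal apostrophes.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)

# Hypothetical sample text, invented for this sketch.
sample = "Text mining counts words. Counting words reveals which words dominate a text."
counts = count_words(sample)

# Show the two most frequent words with their counts.
print(counts.most_common(2))
```

Note that "counts" and "counting" are tallied separately here; merging such forms (stemming or lemmatization) is a separate decision that changes the results.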

Text mining involves pattern matching:

Identify similarities in a large corpus
- Overarching trends across a whole set of texts
- Association within sets
- Categorization of new items being added to a set
Identify differences
- Outliers and anomalies between texts
Combine the two
- Clusters of similar groups with indications of outlier groups
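One hedged way to make "similar" and "outlier" concrete is to compare word-count vectors with cosine similarity and flag the text that is least like the rest. This is only one of many possible measures, not a standard that the notes above prescribe. The function names and the four tiny sample texts are invented for illustration, and the sketch assumes at least two documents.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase words (runs of letters) in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count Counters (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def mean_similarity_to_others(docs):
    """For each document, average its similarity to every other document.

    Assumes docs contains at least two entries.
    """
    bags = {name: bag_of_words(text) for name, text in docs.items()}
    scores = {}
    for name, bag in bags.items():
        others = [cosine_similarity(bag, other)
                  for other_name, other in bags.items() if other_name != name]
        scores[name] = sum(others) / len(others)
    return scores

# Hypothetical mini-corpus: three related texts and one unrelated one.
docs = {
    "a": "the whale swam past the ship",
    "b": "the ship chased the whale",
    "c": "the whale sank the ship",
    "d": "tax returns are due in april",
}
scores = mean_similarity_to_others(docs)
outlier = min(scores, key=scores.get)  # least similar to the rest
print(outlier)  # prints "d"
```

High scores point to the overarching trend across the set; the lowest score points to a candidate anomaly that still needs a human reading to interpret.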

Part-of-speech tagging
Possible use: Identify proper nouns as a first step toward extracting and categorizing entities such as person names and organizations
Example: Six Degrees of Francis Bacon
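Real part-of-speech tagging relies on trained models from an NLP toolkit, which is beyond a short example. As a rough stand-in, the sketch below uses a plain capitalization heuristic to pull out candidate person names: two or more consecutive capitalized words. This is an assumption-laden shortcut, not a tagger, and it is not how any particular project works. It misses single-word names and will wrongly match phrases like "The King". The function name and sample sentence are invented for illustration.

```python
import re

# Heuristic: two or more consecutive capitalized words separated by single spaces.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

def candidate_names(text):
    """Return capitalized multi-word phrases in order of appearance."""
    return NAME_PATTERN.findall(text)

# Hypothetical sentence, invented for this sketch.
sample = ("Francis Bacon corresponded with Thomas Hobbes. "
          "Later, Ben Jonson praised Bacon in print.")
print(candidate_names(sample))
# prints ['Francis Bacon', 'Thomas Hobbes', 'Ben Jonson']
```

The lone "Bacon" in the second sentence is not captured, which shows why a trained tagger is usually preferred over pattern matching alone.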

Word clouds: a simple approach to analytical partitioning
Elements of data visualization: the size, color, and distance of words can serve as elements of argumentation
Example: TagCrowd.com or WordItOut (try it using this text)
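To show how size can carry meaning in this kind of visualization, here is a minimal sketch that maps word counts to font sizes on a linear scale. The function name `font_sizes`, the 10 to 48 point range, and the sample counts are assumptions for illustration; the tools named above may scale words differently.

```python
from collections import Counter

def font_sizes(counts, min_size=10, max_size=48):
    """Map each word's count to a font size, linearly between min_size and max_size.

    Assumes counts is non-empty.
    """
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for word, n in counts.items():
        # If every word has the same count, draw them all at max_size.
        fraction = (n - lo) / span if span else 1.0
        sizes[word] = round(min_size + fraction * (max_size - min_size))
    return sizes

# Hypothetical word counts, invented for this sketch.
counts = Counter({"whale": 9, "ship": 5, "sea": 1})
sizes = font_sizes(counts)
print(sizes)  # prints {'whale': 48, 'ship': 29, 'sea': 10}
```

The scaling rule is itself an argumentative choice: a linear scale exaggerates the most frequent word, while a logarithmic one would flatten the differences.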

Be cognizant of your question as you gather data
- Data lies (if it’s collected in inappropriate ways). What’s in your set?
- Data from NY and LA covering only 1920-1940 and 2000-2010 can’t be used to make claims about patterns across the US between 1900 and 2010
Be aware of the limits of text mining
- You need computer-readable files
Be aware of what you can munge
- If your transcription/OCR (Optical Character Recognition) always turns the word “like” into the word “lilce”, you can correct it in bulk: consistency in errors is still consistency
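Because a consistent error can be reversed mechanically, a small lookup table is often enough to clean it up. The sketch below is one possible approach, assuming whole-word matches only; the `CORRECTIONS` table holds the single example from the notes above, and the function name and sample sentence are invented for illustration.

```python
import re

# Known, consistent OCR errors mapped to their corrections.
# Only the "lilce" example comes from the notes; extend the table for your own corpus.
CORRECTIONS = {"lilce": "like"}

def fix_ocr(text, corrections):
    """Replace each known OCR error with its correction, matching whole words only."""
    if not corrections:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(err) for err in corrections) + r")\b"
    )
    return pattern.sub(lambda m: corrections[m.group(1)], text)

# Hypothetical OCR output, invented for this sketch.
print(fix_ocr("I lilce ships, and sailors lilce the sea.", CORRECTIONS))
# prints "I like ships, and sailors like the sea."
```

Whole-word matching is the cautious choice: it leaves longer strings such as "lilcely" untouched, so related errors need their own entries in the table.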