Text Cleaning
- RegExr, an online tool to learn, build, and test regular expressions
- Regex lessons
- Regex cheat sheet
- Ted Underwood, DataMunging, example Python scripts for cleaning and normalizing OCR text
Text Analysis
- AntConc: a freeware tool for concordancing and text analysis. It is easy to download, has thorough documentation, and performs more reliably than Voyant.
- HathiTrust Research Center
NLP
- Stanford NLP Group
- Python NLTK, learn via the NLTK Book (Steven Bird, Ewan Klein, and Edward Loper; designed to teach text analysis)
- R, learn via Text Analysis with R for Students of Literature, Matthew L. Jockers (New York: Springer-Verlag, 2014)
- Open Calais (API trained on web and newspaper text)
- Watson Natural Language Understanding (API trained on web content)
- Google Cloud Natural Language
- Text-processing (free API, based on Python NLTK)
Topic Modeling
- Megan R. Brett, Topic Modeling: A Basic Introduction
- Ted Underwood, Topic Modeling Made Just Simple Enough
- Scott Weingart, Topic Modeling for Humanists: A Guided Tour
- MALLET (developed by Andrew McCallum; currently maintained by David Mimno)
- Overview Docs, an online tool designed for journalists to sort through huge data sets