Machine Learning (ML)
ML is the application of algorithms and statistical modeling to allow computers to “learn” from data to do a task (often overlapping or used interchangeably with Artificial Intelligence / AI).
ML tasks are broadly separated into supervised and unsupervised learning. Supervised learning tasks typically involve feeding the algorithm a labeled training data set, which is used to build a model that can then classify unknown items, making inferences based on what it knows. Unsupervised learning tasks involve feeding unlabeled data to an algorithm that identifies patterns and clusters in the data on its own.
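The supervised case can be sketched in a few lines of Python. Below is a toy nearest-centroid classifier; the training data, labels, and query point are all invented for illustration, and real projects would use a library like scikit-learn.

```python
# Toy supervised learning: a nearest-centroid classifier.
# All data below is made up for illustration.
from collections import defaultdict
from math import dist

def fit(points, labels):
    """Compute one centroid (mean point) per label from labeled training data."""
    groups = defaultdict(list)
    for p, y in zip(points, labels):
        groups[y].append(p)
    return {y: tuple(sum(c) / len(ps) for c in zip(*ps)) for y, ps in groups.items()}

def predict(centroids, point):
    """Classify an unseen point by whichever centroid is closest."""
    return min(centroids, key=lambda y: dist(centroids[y], point))

# Hypothetical training set: counts of "ship" and "dance" per document.
train = [(9, 1), (8, 0), (1, 7), (0, 9)]
labels = ["naval", "naval", "ballroom", "ballroom"]
model = fit(train, labels)
print(predict(model, (7, 2)))   # prints "naval"
```

The labeled examples play the role of the training data set described above; the unlabeled query is the "unknown item" the model classifies.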
The ability to learn from data is changing the approach to many computational tasks, such as OCR or NLP, putting the focus on curating training data sets rather than developing new software (see Andrej Karpathy, Software 2.0, 2017). However, computational techniques can also challenge our expertise in DH, stretching even stats experts' ability to evaluate the validity of complex models (for fun, see Spurious Correlations).
Natural Language Processing (NLP)
NLP is a family of techniques to analyze unstructured language data found in everyday speech and writing.
Historically, NLP wasn’t based on ML, but relied on manually identifying rules and patterns in human speech that could be parsed by code. For example, take a minute to play with ELIZA (1966), an electronic psychologist based on early NLP pattern matching.
The web has provided an explosion of unstructured text, making NLP a huge business as enterprises seek to extract information from social media or create chatbots to minimize labor costs. Typical tasks involve chunking/stemming, part-of-speech tagging, named entity recognition (NER), classification, and sentiment analysis. Speech recognition, OCR, and text-to-speech are also considered NLP tasks.
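As a toy illustration of one of these tasks, here is a minimal rule-based sentiment scorer in Python. The tiny lexicon and the example sentences are invented; real toolkits such as NLTK ship far larger lexicons and trained models.

```python
# Minimal rule-based sentiment analysis: tokenize, then look each word
# up in a tiny hand-made lexicon. Lexicon and examples are invented.
import re

LEXICON = {"happy": 1, "bright": 1, "love": 1, "sad": -1, "dark": -1, "hate": -1}

def tokenize(text):
    """Lowercase and split on non-letters (a crude stand-in for real tokenizers)."""
    return re.findall(r"[a-z]+", text.lower())

def sentiment(text):
    """Sum lexicon scores over tokens; positive > 0, negative < 0."""
    return sum(LEXICON.get(tok, 0) for tok in tokenize(text))

print(sentiment("I love this bright, happy poem"))   # 3
print(sentiment("A sad, dark tale"))                 # -2
```

The IBM NLU and NLTK demos below do the same kind of scoring, but with models trained on large corpora rather than a six-word lexicon.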
Demos:
- IBM Watson Natural Language Understanding (API trained on web content)
- VPOD sentiment analysis (poetry sent to IBM NLU API for sentiment analysis)
- Text-processing NLTK demos (free API, based on Python NLTK)
- Book Visualizations Sandbox (text and sentiment analysis with Hathi books. Shared on the Observable platform, a web-based code notebook for JavaScript)
Topic Modeling
A few basics of topic modeling
- Text mining that allows the user to identify patterns in a corpus of texts
- Input: texts
- Output: several lists of words that appear in the texts
- Groups words across the corpus into clusters of words, or “topics,” based on those words’ similarity and dissimilarity
- Sometimes topics are easy to identify (for example: “navy, ship, captain”). Other times they’re more ambiguous.
- Usually works best on large bodies of text
- Some familiarity with your text is important
- See “When you have a MALLET, everything looks like a nail” for an example of what can happen when you’re not familiar with your data or the tools you’re using
Topic modeling is an example of unsupervised machine learning
- You have input data but don’t know the output variables
- Used as a process to find meaningful structure and groupings in your data
Latent Dirichlet Allocation (LDA)
- A type of topic modeling
- Matt Jockers’s Topic Modeling “Fable” (LDA Buffet)
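For readers curious what LDA actually does under the hood, here is a bare-bones collapsed Gibbs sampler written out in plain Python. The toy corpus, hyperparameters, and topic count are invented for illustration; for real work you would use MALLET, jsLDA, or a library implementation.

```python
# Bare-bones LDA topic modeling via collapsed Gibbs sampling.
# Toy corpus and hyperparameters are invented for illustration.
import random
from collections import Counter

def lda(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # Randomly assign each word token to one of k topics, then keep counts.
    z = [[rng.randrange(k) for _ in d] for d in docs]
    doc_topic = [Counter(zd) for zd in z]
    topic_word = [Counter() for _ in range(k)]
    for d, zd in zip(docs, z):
        for w, t in zip(d, zd):
            topic_word[t][w] += 1
    topic_total = [sum(tw.values()) for tw in topic_word]
    for _ in range(iters):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                t = z[i][j]              # remove this token from the counts
                doc_topic[i][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Resample the token's topic from the conditional distribution.
                weights = [(doc_topic[i][t2] + alpha) *
                           (topic_word[t2][w] + beta) /
                           (topic_total[t2] + beta * len(vocab))
                           for t2 in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[i][j] = t
                doc_topic[i][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    # Each "topic" is reported as its most frequent words.
    return [[w for w, _ in tw.most_common(3)] for tw in topic_word]

docs = [["navy", "ship", "captain", "ship"],
        ["ball", "dance", "gown", "dance"],
        ["captain", "navy", "ship"],
        ["gown", "ball", "dance"]]
for topic in lda(docs, k=2):
    print(topic)
```

This is exactly the unsupervised setup described above: nothing in the input says which words belong together; the sampler discovers the groupings from co-occurrence alone.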
What to do with your output?
Topic Modeling Activity
- Topic modeling in-browser with jsLDA
- Explore a sample corpus of Jane Austen texts
- Click on the link above, unzip the downloaded file. Inside you’ll find a text file of Austen’s corpus, along with a stopword list. Upload both of these files to jsLDA.
- You can also try it with your own corpus. Just make sure your text file is formatted correctly:
- One document per line, with each document consisting of
[doc ID] [tab] [label] [tab] [text...]
- If you need to get rid of line breaks in your text, try the following regex:
- Find
([\n\r])+
- Replace with a single space
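If you are preparing your own corpus in Python, a short script can produce the one-document-per-line format and strip internal line breaks with the same regex. The file name and sample texts below are hypothetical.

```python
# Hypothetical helper: write texts in jsLDA's expected input format,
# "doc ID <tab> label <tab> text", one document per line, collapsing
# any internal line breaks to spaces first.
import re

docs = [
    ("austen1", "austen", "It is a truth universally\nacknowledged..."),
    ("austen2", "austen", "Emma Woodhouse, handsome,\r\nclever, and rich..."),
]

with open("corpus.txt", "w", encoding="utf-8") as f:
    for doc_id, label, text in docs:
        flat = re.sub(r"([\n\r])+", " ", text)   # collapse line breaks to spaces
        f.write(f"{doc_id}\t{label}\t{flat}\n")
```

The resulting `corpus.txt` can be uploaded to jsLDA directly, alongside a stopword list.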
Don’t be afraid to fail or get bad results. Topic modeling is exploratory, and sometimes you have to play around with it before you know what settings work best for your project.
Exploring big data
Google Books Ngram Viewer
Fun with Text Generators
What can unsupervised deep learning neural network models create? Can you collaborate with a machine learning algorithm?
- Sunspring (Oscar Sharp, Ross Goodwin, Thomas Middleditch, 2016)
- Janelle Shane, Darth Vader’s recipe (aiweirdness blog)
- Train a GPT-2 Text-Generating Model w/ GPU For Free, google colab
- GPT-2 code and recent press coverage
- Talk to Transformer (ask it a question, like “Q: What is Digital Humanities?”)