Machine Learning (ML)

ML is the application of algorithms and statistical modeling that allow computers to “learn” from data in order to perform a task (the term often overlaps with, or is used interchangeably with, Artificial Intelligence / AI).

ML tasks are broadly separated into supervised and unsupervised learning. Supervised learning tasks typically involve feeding the algorithm a labeled training data set, which is used to build a model that can then classify unknown items, making inferences based on what it has learned. Unsupervised learning tasks involve feeding unlabeled data to an algorithm that identifies patterns and clusters within the data.
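A minimal sketch of that split, using scikit-learn (assumed installed) on a made-up toy data set: the classifier learns from labels, while the clustering algorithm finds the same grouping with no labels at all.

```python
# Supervised vs. unsupervised learning on the same toy data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Six one-dimensional points: three "small" values, three "large" ones.
X = [[1], [2], [3], [10], [11], [12]]

# Supervised: we supply the labels (0 = small, 1 = large) and the
# classifier infers labels for unseen points.
y = [0, 0, 0, 1, 1, 1]
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
predictions = clf.predict([[2.5], [11.5]])

# Unsupervised: same points, no labels; KMeans discovers the two clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(predictions, km.labels_)
```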

The ability to learn from data is changing the approach to many computational tasks, such as OCR or NLP, putting the focus on curating training data sets rather than developing new software (see Andrej Karpathy, Software 2.0, 2017). However, these techniques can also challenge our expertise in DH, stretching even statisticians’ ability to evaluate the validity of complex models (for fun, see Spurious Correlations).

Natural Language Processing (NLP)

NLP is a family of techniques to analyze unstructured language data found in everyday speech and writing.

Historically, NLP wasn’t based on ML; it relied on manually identified rules and patterns in human language that could be parsed by code. For example, take a minute to play with ELIZA (1966), an electronic psychotherapist built on early NLP pattern matching.
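An ELIZA-style responder fits in a few lines of rule-based code. The rules below are illustrative inventions, not ELIZA’s actual script, but they show the core trick: match a pattern, echo part of the input back.

```python
# A toy ELIZA-style responder: hand-written regex rules, no ML.
import re

RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Tell me more about feeling {0}."),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Your {0} seems important to you."),
]

def respond(utterance):
    # Try each rule in order; reuse the matched text in the reply.
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return "Please go on."  # default when no rule matches

print(respond("I am worried about my thesis"))
```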

The web has provided an explosion of unstructured text, making NLP a huge business as enterprises seek to extract information from social media or build chatbots to minimize labor costs. Typical tasks include chunking/stemming, part-of-speech tagging, named entity recognition (NER), classification, and sentiment analysis. Speech recognition, OCR, and text-to-speech are also considered NLP tasks.
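Toy versions of three of those tasks, in plain Python: tokenization, stemming, and lexicon-based sentiment analysis. Real pipelines (NLTK, spaCy) are far more sophisticated; the rules and word lists here are illustrative only.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def stem(token):
    """Crude suffix-stripping stemmer (a tiny nod to the Porter stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def sentiment(text):
    """Score text by counting positive vs. negative lexicon hits."""
    tokens = [stem(t) for t in tokenize(text)]
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(sentiment("I love this great archive"))
print(sentiment("a terrible, awful scan"))
```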

Demos:


Topic Modeling

A few basics of topic modeling
  • Text mining that allows the user to identify patterns in a corpus of texts
    • Input: texts
    • Output: several lists of words that appear in the texts
  • Groups words across the corpus into clusters of words, or “topics,” based on those words’ similarity and dissimilarity
  • Sometimes topics are easy to identify (for example: “navy, ship, captain”). Other times they’re more ambiguous.
  • Usually works best on large bodies of text
  • Some familiarity with your text is important

Topic modeling is an example of unsupervised machine learning.

Latent Dirichlet Allocation (LDA)

What to do with your output?

Don’t be afraid to fail or get bad results. Topic modeling is exploratory, and sometimes you have to play around with it before you know what settings work best for your project.


Exploring big data

Google Books Ngram Viewer
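The Ngram Viewer charts a phrase’s relative frequency per year across the Google Books corpus. A toy sketch of the underlying computation, over a tiny invented two-“year” corpus:

```python
# Relative frequency of a word within each year's slice of a corpus.
from collections import Counter

# Made-up example: word tokens per "year".
corpus_by_year = {
    1900: "the telegraph the telegraph the machine".split(),
    1950: "the computer the machine the computer".split(),
}

def relative_frequency(word, year):
    """Occurrences of word divided by all tokens in that year's slice."""
    tokens = corpus_by_year[year]
    return Counter(tokens)[word] / len(tokens)

print(relative_frequency("telegraph", 1900))
print(relative_frequency("computer", 1950))
```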


Fun with Text Generators

Unsupervised deep learning neural network models? Can you collaborate with a machine algorithm?
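Deep neural networks power today’s text generators, but the classic lightweight approach is a Markov chain, and it makes the collaboration question concrete: the machine recombines your source text, and you curate the results. A purely illustrative sketch:

```python
# A first-order Markov chain text generator.
import random

def build_chain(text):
    """Map each word to the list of words that follow it in the source."""
    words = text.split()
    chain = {}
    for current, nxt in zip(words, words[1:]):
        chain.setdefault(current, []).append(nxt)
    return chain

def generate(chain, start, length=8, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: the last word never had a successor
        out.append(rng.choice(followers))
    return " ".join(out)

chain = build_chain("the sea was calm the sea was grey the ship sailed on")
print(generate(chain, "the"))
```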