UMKC University Libraries

Introduction to Text Data Mining

This is a beginner's guide to the principles and concepts of text data mining (TDM). TDM is the computational and statistical analysis of large corpora of texts. In this guide you'll find brief descriptions of different types of text mining, some low bar

Choosing a Method

The text analysis method you choose will depend on your research question. When choosing a method to use, first consider what you expect to learn from your research and what form you would like your results to take. The methods described below can be combined in different ways during the course of a research project. For example, natural language processing algorithms might reveal the names of people in your text, to which you could apply network analysis to study how the actors are connected. 

Word Frequency Analysis

Computing word frequencies is a basic building block of higher-level text analysis algorithms, although the frequencies can sometimes be revealing in themselves. This can include raw word counts, or the share of all words in a text or set of texts that a given word accounts for, compared across texts or over time. Frequencies can also be counted for "n-grams," phrases made up of a certain number (n) of consecutive words.
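As a rough illustration of the ideas above, the following Python sketch counts raw word frequencies, relative frequencies, and two-word n-grams (bigrams). The sample sentence and the function names `tokenize` and `ngrams` are made up for this example, and the tokenizer is deliberately simple; real projects usually need more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, used only for illustration.
text = "The cat sat on the mat. The cat slept."
tokens = tokenize(text)

# Raw counts: how many times each word appears.
raw_counts = Counter(tokens)

# Relative frequency: each word's share of all tokens in the text.
relative = {word: count / len(tokens) for word, count in raw_counts.items()}

# Bigram counts. Note that this simple version lets a bigram span a
# sentence boundary, for example ("mat", "the").
bigram_counts = Counter(ngrams(tokens, 2))

print(raw_counts.most_common(2))       # [('the', 3), ('cat', 2)]
print(round(relative["cat"], 3))       # 0.222
print(bigram_counts[("the", "cat")])   # 2
```

Relative frequencies, rather than raw counts, are what make texts of different lengths comparable.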

Example Project Using Word Frequencies

Machine Learning

Text analysis often relies on machine learning, a branch of computer science that trains computers to recognize patterns. There are two kinds of machine learning used in text analysis: supervised learning, where a human helps to train the pattern-detecting model, and unsupervised learning, where the computer finds patterns in text with little human intervention. An example of supervised learning is Naive Bayes Classification. See Natural Language Processing and Topic Modeling for examples of unsupervised machine learning.
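To make supervised learning more concrete, here is a minimal sketch of a Naive Bayes text classifier written with only the Python standard library. The training sentences, the labels "travel" and "law", and the function names `train` and `classify` are all invented for illustration; a real project would need far more labeled text and would probably use an established library.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count how often each label occurs and how often each word occurs per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(words, model):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the prior: how common the label is in the training data.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical human-labeled training data: this is the "supervised" part.
training = [
    ("the ship sailed across the sea".split(), "travel"),
    ("we sailed to the island by ship".split(), "travel"),
    ("the court heard the case".split(), "law"),
    ("the judge ruled on the case in court".split(), "law"),
]
model = train(training)

print(classify("the judge heard the case".split(), model))  # law
print(classify("a ship on the sea".split(), model))         # travel
```

The human contribution is the set of labels in `training`; the model only learns which words tend to go with which label.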

Example Project Using Classification (Supervised Machine Learning):

Topic Modeling

Topic modeling, a form of machine learning, is a way of identifying patterns and themes in a body of text. It is done by statistical algorithms such as Latent Dirichlet Allocation, which groups words into "topics" based on which words frequently co-occur in the same documents.
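For readers curious about what a topic modeling algorithm does internally, the sketch below fits Latent Dirichlet Allocation to four tiny made-up documents using collapsed Gibbs sampling, one common way of fitting the model. The function name `lda_gibbs`, the parameter values, and the sample documents are assumptions chosen for illustration; in practice researchers normally rely on an established topic modeling tool rather than code like this.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (doc_topic, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]          # tokens in doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k overall
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much this document uses it and
                # how strongly the topic is associated with this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab

# Hypothetical, already-tokenized documents.
docs = [
    "ship sea sail harbor ship sea".split(),
    "sea harbor sail ship".split(),
    "court judge law case court".split(),
    "judge case law court law".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

The sampler is random, so the topics it prints depend on the seed. Note also that the algorithm returns only groups of words; deciding what each "topic" means is left to the researcher.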

Example Project Using Topic Modeling:

Natural Language Processing

Natural language processing, which often relies on machine learning, is the use of computational methods to extract meaning from free text. Among other things, natural language processing algorithms can identify names of people and places, dates, sentiment, and parts of speech.
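Sentiment is one of the easier items in that list to illustrate without special software. The sketch below is a simple lexicon-based scorer, which is a rule-based approach rather than a trained model: it adds up values from a word list and flips the sign of a word that directly follows a negation. The word list `LEXICON`, the set `NEGATIONS`, and the function name `sentiment_score` are hypothetical and far too small for real research.

```python
import re

# Hypothetical, deliberately tiny lexicon; real lexicons contain thousands of entries.
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "terrible": -1, "sad": -1}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(text):
    """Sum lexicon values over the tokens, flipping the sign after a negation word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

print(sentiment_score("The food was great and the staff were happy."))  # 2
print(sentiment_score("The service was not good."))                     # -1
print(sentiment_score("The train left at noon."))                       # 0
```

A positive score suggests positive sentiment, a negative score suggests negative sentiment, and zero means the scorer found nothing it recognized. Tasks such as recognizing names or parts of speech generally require a trained model from a natural language processing library.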

Example Project Using Natural Language Processing:

Network and Citation Analysis

Network analysis is a method for finding connections between nodes, which can represent people, concepts, sources, and more. These networks are usually visualized as graphs that show how the nodes are interconnected.
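As a small illustration, the sketch below builds a co-occurrence network in which two people are connected whenever they are named in the same document, then counts how many distinct connections each person has (their "degree"). The names and documents are invented, and treating shared documents as connections is just one possible assumption about what a link means.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical data: the people named in each of four documents.
people_by_doc = [
    ["Ada", "Ben", "Cy"],
    ["Ada", "Ben"],
    ["Ben", "Dee"],
    ["Ada", "Cy"],
]

# Edge weight = number of documents in which both people appear.
edges = Counter()
for people in people_by_doc:
    for a, b in combinations(sorted(set(people)), 2):
        edges[(a, b)] += 1

# Build each node's set of neighbors, then count them to get its degree.
neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)
degree = {node: len(adjacent) for node, adjacent in neighbors.items()}

print(edges[("Ada", "Ben")])    # 2
print(sorted(degree.items()))   # [('Ada', 2), ('Ben', 3), ('Cy', 2), ('Dee', 1)]
```

Here Ben has the highest degree, which might make him a starting point for closer reading.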

Citation analysis can be used to discover connections and relationships between documents based on how they cite one another; these relationships can then be visualized.
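Two simple citation measures can be sketched in a few lines: how often each work is cited, and how often two works are cited together by the same paper (co-citation). The paper labels and reference lists below are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: each paper mapped to the works it cites.
cites = {
    "P1": ["A", "B"],
    "P2": ["A", "B", "C"],
    "P3": ["A", "C"],
}

# Citation count: how many papers cite each work.
citation_count = Counter(work for refs in cites.values() for work in set(refs))

# Co-citation count: how many papers cite both works in a pair.
cocitation = Counter()
for refs in cites.values():
    for pair in combinations(sorted(set(refs)), 2):
        cocitation[pair] += 1

print(citation_count.most_common(1))   # [('A', 3)]
print(cocitation[("A", "B")])          # 2
print(cocitation[("B", "C")])          # 1
```

Works that are often cited together are frequently related in subject, which is why co-citation counts are a common basis for citation network graphs.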

Example Project Using Network Analysis:

Visualizing Text Data

Generating visualizations is a way to "see" your data. Text mining visualizations can help researchers see relationships between certain concepts. Examples include word clouds, graphs, maps, and other graphics that produce a visual depiction of the data.
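Even without a graphics library, you can get a quick visual impression of word counts. The sketch below prints a plain-text bar chart as a stand-in for the richer graphics described above; the counts and the function name `text_bar_chart` are made up for illustration.

```python
from collections import Counter

def text_bar_chart(counts, width=20):
    """Return one line per word, with bar length scaled to the largest count."""
    top = max(counts.values())
    lines = []
    for word, n in counts.most_common():
        bar = "#" * round(width * n / top)
        lines.append(f"{word:<8} {bar} {n}")
    return lines

# Hypothetical word counts.
counts = Counter({"sea": 10, "ship": 5, "harbor": 2})
chart = text_bar_chart(counts)
for line in chart:
    print(line)
```

The most frequent word gets a bar of `width` characters, and every other bar is scaled in proportion to it.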