Research Guides: Introduction to Text Data Mining: Types of Text Mining

Choosing a Method

The text analysis method you choose will depend on your research question. When choosing a method to use, first consider what you expect to learn from your research and what form you would like your results to take. The methods described below can be combined in different ways during the course of a research project. For example, natural language processing algorithms might reveal the names of people in your text, to which you could apply network analysis to study how the actors are connected.

Word Frequency Analysis

Computing word frequencies is a basic building block of higher-level text analysis algorithms, although the frequencies can sometimes be revealing in themselves. Frequency analysis can involve raw word counts, or it can involve calculating each word's percentage of all words in a text or set of texts and comparing that percentage across texts or over time. Frequencies can also be counted for "n-grams," or phrases with a certain number (n) of words.
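
As a rough illustration of the counts described above, the following Python sketch computes raw word counts, relative frequencies, and bigram (2-gram) counts using only the standard library. The function names, the simple tokenization rule, and the sample sentence are assumptions made for this example; real projects usually need more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Illustrative sample text, not taken from any project cited in this guide
sample = "A rose is a rose is a rose."
tokens = tokenize(sample)

word_counts = Counter(tokens)               # raw word counts
bigram_counts = Counter(ngrams(tokens, 2))  # counts of two-word phrases
shares = relative_frequencies(tokens)       # proportion of the text per word

print(word_counts.most_common(3))
print(bigram_counts.most_common(2))
print(round(shares["rose"], 3))
```

Relative frequencies, rather than raw counts, are what make comparisons across texts of different lengths meaningful.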

Example Project Using Word Frequencies:

Clement, T.E. (2008). ‘A Thing Not Beginning and Not Ending’: Using Digital Tools to Distant-Read Gertrude Stein’s The Making of Americans. Literary and Linguistic Computing, vol. 23(3), 361-81. http://doi.org/10.1093/llc/fqn020.

Machine Learning

Text analysis often relies on machine learning, a branch of computer science that trains computers to recognize patterns. There are two kinds of machine learning used in text analysis: supervised learning, where a human helps to train the pattern-detecting model, and unsupervised learning, where the computer finds patterns in text with little human intervention. An example of supervised learning is Naive Bayes Classification. See Natural Language Processing and Topic Modeling for examples of unsupervised machine learning.
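
To make the idea of supervised learning more concrete, here is a minimal sketch of a Naive Bayes text classifier written in plain Python. It assumes a handful of human-labeled training sentences (invented for this example), word counts as features, and add-one smoothing; the function names are hypothetical, and an established machine learning library would normally be used instead.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count documents and words per label from human-labeled (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def classify(tokens, model):
    """Return the label with the highest log posterior probability for the tokens."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior: how common this label was in training
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data: the labels are the human "supervision"
training = [
    ("the ship sailed across the sea".split(), "travel"),
    ("we sailed to a distant island".split(), "travel"),
    ("the recipe calls for butter and flour".split(), "cooking"),
    ("bake the bread with flour and salt".split(), "cooking"),
]
model = train(training)

print(classify("sailed to the sea".split(), model))
print(classify("butter and salt".split(), model))
```

The human contribution is the set of labels in the training data; the model only learns which words tend to accompany each label.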

Example Project Using Classification (Supervised Machine Learning):

Horton, R., Morrissey, R., Olsen, M., Roe, G., Voyer, R. (2009). Mining Eighteenth Century Ontologies: Machine Learning and Knowledge Classification in the Encyclopédie. Digital Humanities Quarterly, vol. 3(2). Retrieved from http://www.digitalhumanities.org/dhq/vol/3/2/000044/000044.html.

Topic Modeling

Topic modeling, a form of machine learning, is a way of identifying patterns and themes in a body of text. Topic modeling is done by statistical algorithms, such as Latent Dirichlet Allocation, which groups words into "topics" based on which words frequently co-occur in a text.
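
The sketch below shows one common way to fit Latent Dirichlet Allocation, a collapsed Gibbs sampler, in plain Python. It is a teaching sketch under stated assumptions: the function name, the hyperparameter values (alpha, beta), the number of iterations, and the tiny sample corpus are all chosen for illustration, and the sampler is far slower than the implementations found in dedicated topic modeling tools.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0, top_n=3):
    """Fit LDA with collapsed Gibbs sampling; return top words per topic and
    each document's estimated topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # words per topic in each doc
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # count of each word per topic
    topic_total = [0] * num_topics                              # total words per topic
    assignments = []

    # Start by assigning every word occurrence to a random topic
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                z = assignments[d][i]
                # Remove the current assignment from the counts
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Weight each topic by how much this document uses it and
                # how strongly the topic is associated with this word
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    topics = [
        [vocab[v] for v in sorted(range(vocab_size), key=lambda v: -topic_word[k][v])[:top_n]]
        for k in range(num_topics)
    ]
    proportions = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return topics, proportions

# Invented mini-corpus; real topic models need far more text than this
sample_docs = [
    "ship sea sail island ship sea".split(),
    "sea sail ship harbor sail".split(),
    "flour bread bake butter flour".split(),
    "bake bread salt butter bread".split(),
]
topics, proportions = lda_gibbs(sample_docs, num_topics=2)
for k, words in enumerate(topics):
    print(f"Topic {k}: {', '.join(words)}")
```

The algorithm only groups words that tend to appear together; deciding what each "topic" means, and whether it is meaningful at all, remains the researcher's job.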

Example Project Using Topic Modeling:

Mendenhall, R., Brown, N., Black, M., Van Moer, M., Lourentzou, I., Flynn, K., McKee, M., Zerai, A. (2016). Rescuing lost history: Using big data to recover black women's lived experiences. In Proceedings of XSEDE 2016: Diversity, Big Data, and Science at Scale (Vol. 17-21-July-2016). https://doi.org/10.1145/2949550.2949642.

Natural Language Processing

Natural language processing, which often relies on machine learning, is the attempt to use computational methods to extract meaning from free text. Among other things, natural language processing algorithms can identify names of people and places, dates, sentiment, and parts of speech.
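
The following Python sketch gives a feel for what extracting names and dates from free text involves. It is a crude rule-based stand-in, not how modern natural language processing tools work: libraries such as spaCy or NLTK use trained statistical models for this task. The patterns, the function name, and the sample sentence are assumptions made for illustration.

```python
import re

# Two or more consecutive capitalized words, e.g. "Jane Doe" (a rough heuristic)
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)+\b")
# Four-digit years from 1000 to 2099
YEAR_PATTERN = re.compile(r"\b(?:1\d{3}|20\d{2})\b")

def extract_candidates(text):
    """Return candidate person names and years found by simple pattern matching."""
    return {
        "names": NAME_PATTERN.findall(text),
        "years": YEAR_PATTERN.findall(text),
    }

# Invented sentence with fictional people
sample = "Jane Doe wrote to John Smith in 1843 about a trip to Boston."
found = extract_candidates(sample)

print(found["names"])
print(found["years"])
```

Note that this heuristic misses the single-word place name "Boston" and would wrongly flag any capitalized phrase, such as a book title, as a name. Limitations like these are why trained models are preferred for real research.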

Example Project Using Natural Language Processing:

Underwood, T., Bamman, D., & Lee, S. (2018). The Transformation of Gender in English-Language Fiction. Journal of Cultural Analytics. http://doi.org/10.22148/16.019.

Network and Citation Analysis

Network analysis is a method for finding connections between nodes, which can represent people, concepts, sources, and more. These networks are usually visualized as graphs that show how the nodes are interconnected.

Citation analysis can be used to discover connections and relationships between documents based on how they cite one another. These relationships can then be visualized.
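
As a small sketch of the network analysis described above, the Python code below builds a co-occurrence network in which two people are connected whenever they are mentioned in the same document, then counts how many distinct connections each person has. The function names and the sample data are invented for this example, and the per-document name lists are assumed to come from an earlier extraction step.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_network(docs_people):
    """Return weighted undirected edges: for each pair of people, the number
    of documents in which both are mentioned."""
    edges = Counter()
    for people in docs_people:
        # Sort so that each pair is always stored in the same order
        for a, b in combinations(sorted(set(people)), 2):
            edges[(a, b)] += 1
    return edges

def degree(edges):
    """Count the distinct neighbors of every node in the network."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}

# Invented data: the people mentioned in each of three letters
letters = [
    ["Ann", "Ben"],
    ["Ann", "Cal"],
    ["Ann", "Ben", "Dee"],
]
edges = build_cooccurrence_network(letters)
degrees = degree(edges)

print(edges.most_common(1))
print(sorted(degrees.items(), key=lambda item: -item[1]))
```

The same structure works for citation analysis if the nodes are documents and an edge records that one document cites another.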

Example Project Using Network Analysis:

Kaufman, M. (2014-2015). Quantifying Kissinger. Retrieved from http://blog.quantifyingkissinger.com/.

Visualizing Text Data

Generating visualizations is a way to "see" your data. Text mining visualizations can help researchers see relationships between concepts. Examples include word clouds, graphs, maps, and other graphics that produce a visual depiction of the data.
