Analyze Digital Text

A guide that introduces distant reading approaches to digital texts and supports researchers interested in using them.

What is a Corpus?

A corpus is, simply put, a text under study or a set of texts to study (the plural is corpora). For linguists, a corpus is specifically a collection of written or spoken material upon which a linguistic analysis is based.

You may draw your corpora from many different places, including Google Books, Project Gutenberg, text digitized from newspapers collected on microfilm or other formats, Twitter, and library-subscribed databases of secondary literature, among others.
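
To make the idea of a corpus as "a set of texts to study" concrete, here is a minimal Python sketch. It assumes a tiny, made-up corpus held in memory; the document names (doc_a, doc_b), the sample sentences, and the helper functions tokenize and word_counts are all hypothetical and are not drawn from any of the sources listed above. A real corpus would more likely be read from files or downloaded from a collection.

```python
import re
from collections import Counter

# Hypothetical miniature corpus: document names mapped to their text.
# In practice these texts would be loaded from files or a database.
corpus = {
    "doc_a": "The whale surfaced. The crew watched the whale.",
    "doc_b": "A letter arrived. The letter was brief.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(corpus):
    """Count word frequencies across every text in the corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(tokenize(text))
    return counts


counts = word_counts(corpus)
print(len(corpus), "texts in the corpus")
print(counts.most_common(3))
```

Counting words across all of the texts at once, rather than reading each one closely, is a simple instance of the distant reading approach this guide describes. The tokenizer here is deliberately crude (it keeps only the letters a to z), so treat it as a starting point rather than a recommended method.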

Corpora for Text Analysis

Open Access Collections


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.