
Analyze Digital Text as Data

A guide that introduces distant reading approaches to digital texts and supports researchers who want to use them.

What is a Corpus?

A corpus (plural: corpora) is, simply put, a text or a set of texts under study. For linguists, a corpus is specifically a collection of written or spoken material upon which a linguistic analysis is based.

You may draw your corpora from many different sources. These include Google Books, Project Gutenberg, text digitized from newspapers collected on microfilm or other formats, Twitter, and library-subscribed databases of secondary literature, among others.
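To make the idea of a corpus concrete, here is a minimal Python sketch that treats a corpus as a named set of texts and counts the words across all of them. The two sample documents, their file names, and the helper functions `tokenize` and `word_counts` are hypothetical, invented for illustration only. The tokenizer is deliberately simple and would likely need adjusting for real material.

```python
import re
from collections import Counter

# Hypothetical two-document corpus; names and texts are invented for illustration.
corpus = {
    "doc_a.txt": "The whale surfaced. The crew watched the whale.",
    "doc_b.txt": "A quiet sea, and no whale in sight.",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(corpus):
    """Count word tokens across every text in the corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(tokenize(text))
    return counts


counts = word_counts(corpus)
print(len(corpus), "texts,", sum(counts.values()), "tokens")
print(counts.most_common(3))
```

In practice the texts would be read from files or downloaded from a source such as those listed above, but the shape stays the same: a collection of texts that can be processed together.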

Corpora for Text Analysis

Open Access Collections

_____________________________________________________

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.