Text analysis (sometimes referred to as text mining or text data mining) is a set of methods that use computers to facilitate the discovery of new, high-quality information in a text or collection of texts (a corpus). In contrast to a close reading of a text, distant reading, including techniques such as topic modeling, assesses word frequency and usage within and across texts, and can provide insights that are not readily apparent from reading alone.
Common Types of Analysis
- Word frequency (words that appear in a text, sorted by frequency/uniqueness)
- Collocation (words that commonly appear near another word)
- Concordance (contexts of a given word or set of words in a corpus)
- N-grams (common two-word, three-word, and longer phrases)
- Entity recognition (identifying names, places, time periods, etc.)
- Dictionary tagging (locating a specific set of words in a corpus)
- Topic modeling (a statistical model for finding abstract topics in a corpus)
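
Several of the analyses listed above can be illustrated with a few lines of Python. The following is a minimal sketch that uses only the standard library; the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for illustration, not part of any particular text analysis tool. Entity recognition and topic modeling are left out because they generally rely on trained models or dedicated libraries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequency(tokens):
    """Word frequency: count how often each word appears."""
    return Counter(tokens)


def ngrams(tokens, n):
    """N-grams: count every run of n consecutive words."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, keyword, window=3):
    """Concordance: show each occurrence of keyword with `window` words of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, keyword, window=2):
    """Collocation: count the words that appear within `window` words of keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


def dictionary_tag(tokens, dictionary):
    """Dictionary tagging: locate every token that belongs to a chosen word list."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok in dictionary]


# Hypothetical one-line "corpus" used only to exercise the functions.
sample = "The cat sat on the mat. The cat saw the dog, and the dog saw the cat."
tokens = tokenize(sample)

print(word_frequency(tokens).most_common(3))
print(ngrams(tokens, 2).most_common(2))
print(concordance(tokens, "dog", window=2))
print(collocates(tokens, "cat", window=1))
print(dictionary_tag(tokens, {"cat", "dog"}))
```

On a real corpus, very common words such as "the" dominate raw counts, so analyses like these are usually run after removing a stop-word list or are reported as relative frequencies so that texts of different lengths can be compared.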