Library Guides: Data Curation, Preservation, and Reuse: Documenting Data

Overview

Why Document Data?

Documenting your data will help you keep track of what you’ve done with your data throughout a research project.

Documentation records the context, methods, tools, and requirements that collaborators need, and that others will need when you share and publish your data.

This documentation (or metadata) may take various forms:

specific elements, defined by a metadata standard
readme.txt files and other unstructured metadata
data dictionaries and codebooks
lab notebooks

Source: http://www.lib.uiowa.edu/data/manage/documenting/

Metadata

What is metadata?

Metadata is sometimes described as data about data, but is better understood as "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource" (National Information Standards Organization, 2004).

The Digital Curation Network maintains a thorough list of metadata standards across domains. A non-exhaustive list of common metadata standards, grouped by domain, is included below:

Metadata Element Sets by Domain

General Purpose:

Dublin Core (DC) Metadata Element Set is a generic set of 15 properties for describing a wide range of resources.
Metadata Object Description Schema (MODS) is a descriptive standard used to describe a variety of types of resources; it is maintained by the Library of Congress.
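
To make "specific elements, defined by a metadata standard" more concrete, the following is a minimal sketch of a Dublin Core description built with Python's standard library. The 15 element names and the namespace URI come from the Dublin Core Metadata Element Set; the wrapping metadata element, the function name dublin_core_record, and the sample dataset values are assumptions made up for illustration, not part of any repository's required format.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements defined by the element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields):
    """Build an XML element holding one dc:* child per supplied field.

    The <metadata> wrapper is an assumption for this sketch; real
    repositories usually define their own container element.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not a Dublin Core element")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Hypothetical dataset description, invented for illustration.
record = dublin_core_record({
    "title": "Stream temperature readings, 2021",
    "creator": "Example Lab",
    "date": "2021-12-31",
    "format": "text/csv",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Rejecting names outside the 15 elements is a design choice for this sketch: it catches typos early, at the cost of not supporting the larger DCMI Metadata Terms vocabulary.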

Sciences:

Darwin Core (DwC) is used in the biological sciences to describe collections of biological objects or data and includes a glossary of terms intended to facilitate the sharing of information about biological diversity.
OME-XML, from the Open Microscopy Environment (OME), is a file format for storing microscopy information (both pixels and metadata) using the OME Data Model.
Access to Biological Collection Data (ABCD) is used in the biological sciences to describe specimens and scientific observations.
Astronomy Visualization Metadata Standard (AVM) is used to describe data-derived astronomical images.
Ecological Metadata Language (EML) is used to formalize and standardize concepts necessary when describing ecological data.
FGDC (Federal Geographic Data Committee) Standard is a standard for documenting digital geospatial data; it is especially relevant to researchers in the field of Geographic Information Systems.

Social and Behavioral Sciences:

Data Documentation Initiative (DDI) is a standard for describing observational and survey data in the social, behavioral, economic, and health sciences; it is also useful for structuring research data documentation.
OLAC is a standard used by the Open Language Archives Community for describing language resources in linguistics research.

Arts and Humanities:

Text Encoding Initiative Guidelines (TEI) is a standard for the representation of texts in digital form, and has been used by researchers in the humanities, social sciences, and linguistics since 1994.
VRA Core is a standard created by the Visual Resources Association for describing cultural objects, such as images and works of art.
PBCore, also known as the Public Broadcasting Metadata Dictionary, is a standard designed for the description of audiovisual resources in digital and analog formats.

Adapted from: https://pitt.libguides.com/metadatadiscovery/metadata-standards

README Files

What is a README?

A README file provides information about a data file and is intended to help ensure that the data can be correctly interpreted, whether by yourself at a later date or by others when you share or publish the data. Standards-based metadata is generally preferable, but for internal use, where no appropriate standard exists, writing “readme” style metadata is an appropriate strategy.

Source: https://data.research.cornell.edu/content/readme

Useful Resources

For best practices on creating a README see the excellent Guide to writing "README" style metadata developed by Cornell University. The README Template published by Cornell University is a great starting place for creating your own project README.

Documenting a Project Directory Structure in a README

On Unix-based and Windows systems, the tree shell command generates a plain-text depiction of a project's directory structure, which can then be added to the project README to make the structure apparent. For a sense of what tree can do, run it in an example project:

tree -L 4

The tree command above uses the option -L 4 to limit the listing to a depth of four levels. Another useful option, -d, lists only directories and not the individual files within them. For all of the available options, run tree --help. These options belong to the Unix tree utility; the built-in Windows tree command uses different switches (run tree /? to list them).

To append the output of tree to a README file, use the append redirection operator >> to send the results of the command to a README text file, as in the command below:

tree -d -L 4 >> README.md
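
Where the tree utility is not installed, a similar listing can be produced with a short script. The following is a minimal Python sketch that only approximates the output of tree; the function names render_tree and append_tree_to_readme, and the parameters max_depth and dirs_only (loosely mirroring -L and -d), are hypothetical and chosen for this example.

```python
from pathlib import Path


def render_tree(root, max_depth=4, dirs_only=False):
    """Return a plain-text listing of root, similar in spirit to `tree -L`."""
    root = Path(root)
    lines = [root.name or str(root)]

    def walk(folder, prefix, depth):
        # Stop once the requested depth has been listed.
        if depth > max_depth:
            return
        entries = sorted(folder.iterdir(), key=lambda p: p.name)
        if dirs_only:
            entries = [p for p in entries if p.is_dir()]
        for index, entry in enumerate(entries):
            last = index == len(entries) - 1
            lines.append(prefix + ("└── " if last else "├── ") + entry.name)
            # Symbolic links are listed but not followed, to avoid loops.
            if entry.is_dir() and not entry.is_symlink():
                walk(entry, prefix + ("    " if last else "│   "), depth + 1)

    walk(root, "", 1)
    return "\n".join(lines)


def append_tree_to_readme(root, readme="README.md", **options):
    """Append the listing to a README, much like `tree ... >> README.md`."""
    with open(readme, "a", encoding="utf-8") as handle:
        handle.write(render_tree(root, **options) + "\n")
```

Calling append_tree_to_readme(".", dirs_only=True) from the project root would roughly correspond to the shell command above. Unlike tree, this sketch sorts entries by plain string comparison and does not print a summary line.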

Data Dictionaries and Codebooks

What is a data dictionary?

Data dictionaries provide critical information about data by describing the names, definitions, and attributes of the data elements in the data file. In some cases, data dictionaries and codebooks may provide overlapping information.

A data dictionary is a file that describes each element of your dataset. If your dataset includes tabular data, R code, and images, the data dictionary would include a list of the fields in the table and what they mean, including units and precision; a brief overview of the purpose of the code (if not already contained in comments); and information about the images and how they relate to the dataset (more detailed metadata for the images should be embedded).
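
For tabular data, a first draft of such a dictionary can be generated from the table itself and then completed by hand. The sketch below is one possible approach, not a standard tool: the sample columns (site_id, temp_c, collected_on), the function names, and the output layout are all assumptions made for illustration, and the type inference is a rough heuristic that does not recognize dates.

```python
import csv
import io

# Hypothetical sample data, invented for illustration.
SAMPLE_CSV = (
    "site_id,temp_c,collected_on\n"
    "A1,12.5,2021-06-01\n"
    "A2,,2021-06-02\n"
)


def infer_type(values):
    """Guess a column type from its non-empty values (rough heuristic)."""
    present = [v for v in values if v != ""]
    if not present:
        return "string"
    for label, cast in (("integer", int), ("number", float)):
        try:
            for value in present:
                cast(value)
        except ValueError:
            continue
        else:
            return label
    return "string"


def dictionary_skeleton(csv_text):
    """Return one draft dictionary entry per column of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        column = [(row[name] or "") for row in rows]
        entries.append({
            "name": name,
            "type": infer_type(column),
            "has_missing": any(value == "" for value in column),
            "description": "",  # to be written by the researcher
            "units": "",        # to be written by the researcher
        })
    return entries


def to_csv(entries):
    """Serialize the draft dictionary so it can be saved beside the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["name", "type", "has_missing", "description", "units"],
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


skeleton = dictionary_skeleton(SAMPLE_CSV)
dictionary_csv = to_csv(skeleton)
print(dictionary_csv)
```

The description and units columns are deliberately left empty: they carry the meaning that only the researcher can supply, which is the part of a data dictionary that cannot be inferred from the file.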

Codebooks are generally used by survey researchers to provide information about the data from a survey instrument and include a data dictionary as well as information about the survey instrument or questionnaire used to solicit responses from the respondent. Dictionaries are deposited with the data when data is shared or published. In some cases, the repository might assist with creating the dictionary and codebook.

Data dictionaries can serve several purposes, including:

Improving efficiency and reducing the risk of mistakes and data loss by keeping things consistent across a project. The dictionary can define data names, labels, units, constraints such as an acceptable range of values, and other characteristics.
Enabling software to process a data file, by providing details to the software about the file. This information might include the type of data in each column (integer, character, date, etc.); the name of the column; the physical units, if relevant; whether nulls are included; etc.
Increasing interoperability and reuse of the data that you want to share and publish.
Providing “human-readable” details to support discovery, interpretation, and analysis.
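
The first two purposes in the list above, constraints and machine processing, can be sketched as a small check that compares each row of data with the dictionary. This is a minimal illustration under stated assumptions: the dictionary layout, the column names site_id and temp_c, the allowed range, and the function name check_row are all hypothetical.

```python
# Hypothetical data dictionary: one entry per column, with its type,
# whether a value is required, and an optional acceptable range.
DATA_DICTIONARY = {
    "site_id": {"type": str, "required": True},
    "temp_c": {
        "type": float,
        "required": False,
        "min": -5.0,
        "max": 40.0,
        "units": "degrees Celsius",
    },
}


def check_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems found in one row of text values."""
    problems = []
    for name, rule in dictionary.items():
        raw = row.get(name) or ""
        if raw == "":
            # Empty values are only a problem for required columns.
            if rule.get("required"):
                problems.append(f"{name}: missing required value")
            continue
        try:
            value = rule["type"](raw)
        except ValueError:
            problems.append(
                f"{name}: {raw!r} is not a valid {rule['type'].__name__}"
            )
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: {value} is below the minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{name}: {value} is above the maximum {rule['max']}")
    return problems


# Hypothetical rows, as they might come from csv.DictReader.
for row in (
    {"site_id": "A1", "temp_c": "12.5"},
    {"site_id": "", "temp_c": "99"},
):
    print(row, check_row(row))
```

Returning a list of messages rather than raising on the first problem is a design choice: it lets a whole file be reviewed in one pass.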

For more details on what might be in a data dictionary, how to make one, and examples, see the tutorial listed under "How do I create a data dictionary?" below.

Adapted from: https://www.lib.uiowa.edu/data/manage/documenting/readme/

How do I create a data dictionary?

For an excellent tutorial on creating a data dictionary, see the Smithsonian Libraries guide Describing Your Data: Data Dictionaries.
