Why Document Data?
Documenting your data will help you keep track of what you’ve done with your data throughout a research project.
Documentation also gives collaborators, and anyone who later uses your shared or published data, the context, methods, tools, and requirements they need to understand it.
This documentation (or metadata) may take various forms:
Source: http://www.lib.uiowa.edu/data/manage/documenting/
What is metadata?
Metadata is sometimes described as data about data, but is better understood as "structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource" (National Information Standards Organization, 2004).
The Digital Curation Network maintains a thorough list of metadata standards across domains. A non-exhaustive list of common metadata standards, grouped by domain, is included below:
General Purpose:
Sciences:
Social and Behavioral Sciences:
Arts and Humanities:
Adapted from: https://pitt.libguides.com/metadatadiscovery/metadata-standards
What is a README?
A README file provides information about a data file and is intended to help ensure that the data can be correctly interpreted, by yourself at a later date or by others when sharing or publishing data. Standards-based metadata is generally preferable, but where no appropriate standard exists, for internal use, writing “readme” style metadata is an appropriate strategy.
Source: https://data.research.cornell.edu/content/readme
Useful Resources
For best practices on creating a README, see the Guide to writing "README" style metadata developed by Cornell University. The README Template published by Cornell University is a good starting point for creating your own project README.
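As a rough illustration of what a "readme" style file can look like, the sketch below writes an empty plain-text skeleton that you then fill in by hand. The function name new_readme and the section headings are assumptions made for this example, not a copy of the Cornell template; adjust them to whatever your discipline or repository expects.

```shell
#!/bin/sh
# Write a plain-text README skeleton for a dataset.
# new_readme and the section headings are illustrative assumptions.
new_readme() {
    out="$1"   # path of the README to create, e.g. README.txt

    # Refuse to overwrite a README that already exists
    if [ -e "$out" ]; then
        echo "refusing to overwrite $out" >&2
        return 1
    fi

    # Quoted 'EOF' keeps the text literal (no variable expansion)
    cat > "$out" <<'EOF'
GENERAL INFORMATION
Title of dataset:
Author and contact information:
Date of data collection:

FILE OVERVIEW
File list (name and short description of each file):

METHODS
How the data were collected or generated:

DATA-SPECIFIC INFORMATION
Variable list (name, description, units):
Codes used for missing data:
EOF
}
```

Running new_readme README.txt in the folder that holds the data file creates the skeleton once; a second call leaves the existing file untouched.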
Documenting a Project Directory Structure in a README
On Unix-based and Windows systems, the tree shell command generates a plain-text visual depiction of a project's directory structure, which can then be added to the project README to make that structure apparent. To get a feel for what tree can do, run it in an example project:
tree -L 4
The tree command above uses the option -L 4 to limit the listing to a depth of four levels. Another useful option, -d, lists only directories and omits the individual files within them. For all of the available options, run tree --help. These options belong to the Unix tree utility; the tree command built into Windows uses a different set of switches (for example /F and /A).
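To append the output of tree to a README file, use the append redirection operator >> to send the results of the tree command to the end of a README text file. Unlike a single >, which overwrites the file, >> keeps the existing contents. See the command below as an example: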
tree -d -L 4 >> README.md
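The same idea can be wrapped in a small helper that adds a heading before the listing and still produces something useful on machines where tree is not installed. This is a minimal sketch: the function name append_tree is hypothetical, and the fallback relies on the -mindepth and -maxdepth options of find, which GNU and BSD find provide but POSIX does not require.

```shell
#!/bin/sh
# Append a labeled directory listing to a README file.
# append_tree is a hypothetical helper name, not part of tree itself.
append_tree() {
    readme="$1"   # file to append to, e.g. README.md
    depth="$2"    # how many directory levels to list

    {
        printf '\nDirectory structure\n\n'
        if command -v tree >/dev/null 2>&1; then
            # -d: directories only; -L: limit the depth of the listing
            tree -d -L "$depth"
        else
            # Fallback when tree is missing: sorted list of directory paths
            find . -mindepth 1 -maxdepth "$depth" -type d | sort
        fi
    } >> "$readme"   # >> appends, so existing README text is preserved
}
```

Calling append_tree README.md 4 from the project root appends the listing to the end of README.md without touching what is already there.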
What is a data dictionary?
Data dictionaries provide critical information about data by describing the names, definitions, and attributes of the data elements in the data file. In some cases, data dictionaries and codebooks may provide overlapping information.
A data dictionary is a file that describes each element of your dataset. If your dataset includes tabular data, R code, and images, the data dictionary would include a list of the fields in the table and what they mean, including units and precision; a brief overview of the purpose of the code (if not already contained in comments); and information about the images and how they relate to the dataset (more detailed metadata for the images should be embedded).
Codebooks are generally used by survey researchers to provide information about the data from a survey instrument and include a data dictionary as well as information about the survey instrument or questionnaire used to solicit responses from the respondent. Dictionaries are deposited with the data when data is shared or published. In some cases, the repository might assist with creating the dictionary and codebook.
Data dictionaries can serve several purposes, including:
Adapted from: https://www.lib.uiowa.edu/data/manage/documenting/readme/
How do I create a data dictionary?
For a tutorial on creating a data dictionary, see the Smithsonian Libraries guide Describing Your Data: Data Dictionaries.
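For tabular data, one way to get started is to generate an empty dictionary with one row per column of the table and then fill in the definitions by hand. The sketch below does this for a CSV file. It assumes a simple header row with no quoted commas in the column names; the function name dict_skeleton and the choice of dictionary columns (field, definition, units, precision) are assumptions for illustration, and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Print a data dictionary skeleton for a CSV file: one row per column.
# dict_skeleton is a hypothetical helper name used only for this sketch.
dict_skeleton() {
    csv="$1"   # path to the CSV file whose columns should be documented

    # Header row of the dictionary itself
    printf 'field,definition,units,precision\n'

    # Take the CSV header line, drop any Windows carriage return,
    # put each column name on its own line, then add three empty cells
    head -n 1 "$csv" | tr -d '\r' | tr ',' '\n' | sed 's/$/,,,/'
}
```

For example, dict_skeleton observations.csv > data_dictionary.csv writes the skeleton to a new file that can be opened in a spreadsheet program and completed.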