
Data Curation, Preservation, and Reuse

This guide aims to help researchers through the data curation process at any stage in a project's lifecycle.

Transforming File Formats for Long-Term Preservation

Transforming Data for Preservation

To ensure the long-term preservation and usability of data, make sure that the data are

  • completely documented using open documentation tools, preferably plain-text
  • platform-independent
  • non-proprietary (vendor-independent)
  • free of "lossy" or proprietary compression
  • free of embedded files, programs, or scripts
  • free of full or partial encryption
  • free of password protection

Adapted from: http://guides.library.cornell.edu/ecommons/formats

Cornell University offers a guide detailing common file formats and their probability of long-term preservation. Store your data in formats from one of the high-probability categories.
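
As a rough illustration of this kind of format check, the Python sketch below walks a project directory and lists files whose extensions are not in a set of open formats. The set PREFERRED_EXTENSIONS and the function name find_risky_files are hypothetical, chosen only for this example; the Cornell guide remains the authority on which formats actually qualify.

```python
from pathlib import Path

# Hypothetical allowlist for illustration only; consult the Cornell guide
# for the formats it actually rates as high probability.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".md", ".json", ".xml"}


def find_risky_files(root):
    """Return sorted relative paths of files whose extension is not preferred."""
    root = Path(root)
    risky = []
    for path in root.rglob("*"):
        # Compare extensions case-insensitively so ".TXT" matches ".txt".
        if path.is_file() and path.suffix.lower() not in PREFERRED_EXTENSIONS:
            risky.append(path.relative_to(root).as_posix())
    return sorted(risky)
```

Calling find_risky_files(".") from the top of a project would return the files that may be worth converting before deposit. A check like this only looks at file extensions; it cannot detect encryption, password protection, or embedded content.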

Transforming Software for Reproducibility

Reproducible Research Tools


A significant challenge with any research project that includes code, scripts, or software is ensuring that the code can be run by others not intimately familiar with the project. Rigorous documentation will aid in this process, but unforeseen challenges will arise from changes in software dependencies after publication, as well as from differences in the operating systems and configurations of other users. Continuous maintenance of a published codebase is time-consuming, but much of this maintenance can be avoided with a little foresight before the codebase is published in a public repository. Listed below are a few tools that help ensure software is sufficiently structured, documented, and reproducible prior to public release.

 

Cookiecutter is a command-line utility that creates new boilerplate projects from cookiecutters (project templates). A project template comprises a directory skeleton with boilerplate code, plaintext documentation, and supporting files, which Cookiecutter fills in with values the user supplies. Cookiecutters are language and domain agnostic; they can contain templates for any plaintext files, including, but not limited to, markdown READMEs, scripts in any language, Makefiles for building a project, and requirements files for managing project dependencies. Cookiecutters are best used at the beginning of a research project to encourage consistent documentation and meaningful project structure.
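
To make the idea of a project template concrete, here is a minimal Python sketch that builds a small directory skeleton of the kind a cookiecutter might generate. It does not use Cookiecutter itself; the layout in SKELETON, the function create_skeleton, and the {project_name} placeholder are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical layout: relative path -> file contents with a placeholder.
SKELETON = {
    "README.md": "# {project_name}\n\nDescribe the project here.\n",
    "requirements.txt": "# Pin project dependencies here.\n",
    "Makefile": "all:\n\t@echo Building {project_name}\n",
    "src/analysis.py": '"""Analysis code for {project_name}."""\n',
    "data/README.md": "Describe each data file in {project_name} here.\n",
}


def create_skeleton(target_dir, project_name):
    """Write every file in SKELETON under target_dir, filling the placeholder."""
    target = Path(target_dir)
    created = []
    for relative_path, template in SKELETON.items():
        destination = target / relative_path
        # Create intermediate directories such as src/ and data/ as needed.
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_text(template.format(project_name=project_name))
        created.append(destination)
    return created
```

For example, create_skeleton("my-study", "my-study") would write a README, a requirements file, a Makefile, and starter source and data folders under my-study/. A real cookiecutter offers much more, such as prompting for values and templating file and directory names.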

Three common problems arise when reproducing published results that were generated with included scripts or software:

  1. Dependency "purgatory": the specific environment used in the publication cannot be reproduced because of a seemingly endless series of dependency requirements.
  2. Inadequate or imprecise documentation: the steps used in the original publication to install and run the software are difficult or impossible to reproduce.
  3. Code rot: updates released after publication to the software packages used may alter the software's functionality, prevent it from compiling or running, and ultimately change the results themselves.

Docker helps avoid these problems by providing an encapsulated software environment in which all software requirements are made explicit in a simple text file (a Dockerfile). Code packaged this way can run consistently across systems, largely independent of the host operating system or configuration.
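
In Docker, the environment is typically described in a text file called a Dockerfile. As a sketch of what making requirements explicit can look like, the Python below renders a Dockerfile for a hypothetical Python project with pinned dependencies. The helper render_dockerfile, the /app working directory, and any base image, package versions, or script name passed to it are assumptions for illustration, not taken from the tutorials cited in this guide.

```python
def render_dockerfile(base_image, pinned_packages, entry_script):
    """Return Dockerfile text that installs exact package versions."""
    # Sorting by package name gives the same install line on every run.
    pins = " ".join(
        f"{name}=={version}" for name, version in sorted(pinned_packages.items())
    )
    lines = [
        f"FROM {base_image}",            # explicit base image and tag
        "WORKDIR /app",                  # hypothetical working directory
        "COPY . /app",                   # copy the project into the image
        f"RUN pip install --no-cache-dir {pins}",
        f'CMD ["python", "{entry_script}"]',
    ]
    return "\n".join(lines) + "\n"
```

For instance, render_dockerfile("python:3.11-slim", {"numpy": "1.26.4"}, "run_analysis.py") returns text that could be saved as a Dockerfile next to the project code. Pinning exact package versions addresses code rot in the dependencies; pinning the base image to a specific tag or digest further reduces drift.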

For more information on using Docker for reproducible research, please refer to the excellent work by Carl Boettiger, "An introduction to Docker for reproducible research, with examples from the R environment", as well as the hands-on tutorial "Author Carpentry: Docker for reproducible research".

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.