
Data Curation

This guide aims to help researchers through the data curation process at any stage in a project's lifecycle.

Curate Tutorial Introduction

This tutorial is designed to help researchers prepare a research project for archiving in accordance with FAIR principles, ensuring the project's value is preserved during and after project completion. Data curation may seem like a daunting task, but this tutorial aims to simplify the process by breaking it down into six themed steps, CURATE:

  • Check Files
  • Understand the data and external constraints
  • Request (or locate) missing information
  • Augment metadata for findability
  • Transform file formats for reuse
  • Evaluate for FAIRness

Adapted from: Data Curation Network (2018). “Checklist of CURATED Steps Performed by the Data Curation Network.”

This guide covers the minimum steps to make a project FAIR, and will also highlight specific funder requirements as appropriate when those requirements exceed that minimum. This tutorial is not intended to be exhaustive, but instead serves as a practical groundwork for implementing FAIR data practices. Additional resources beyond the scope of this tutorial will be referenced as appropriate.

The CURATE tutorial works through a series of question prompts about your data. Take note of the results of these questions as you move through the steps of the tutorial, as you will respond to these results as you progress through the CURATE process.

Curate Tutorial

In the Check phase of the tutorial, we will check the thoroughness of your documentation, the quality of your metadata, and the reproducibility of any code. This step will highlight any inadequacies in your project, which will then be addressed in the following steps. Don't worry if you are not completely familiar with these concepts yet; they will become clear as we move through the tutorial.

  1. Data: Review the content of the data files (e.g., open and run the files or code)
Do your files open as expected?
Does your code run as expected?
If it does not run as expected:
- Is it a warning and the code is otherwise able to execute as expected?
- Is it an error and the code is unable to execute?
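The file and code checks above can be sketched in Python; a minimal sketch (the file paths and script names are placeholders):

```python
import subprocess
import sys

def check_files_open(paths):
    """Return the subset of paths that cannot be opened and read."""
    problems = []
    for path in paths:
        try:
            with open(path, "rb") as f:
                f.read(1024)  # read a small sample to confirm readability
        except OSError:
            problems.append(path)
    return problems

def check_code_runs(script_path):
    """Run a script and classify the outcome for the Check step:
    'ok' (clean run), 'warning' (ran, but wrote to stderr), or
    'error' (unable to execute)."""
    result = subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return "error"    # an error: the code is unable to execute
    if result.stderr:
        return "warning"  # a warning: the code otherwise executes
    return "ok"
```

A result of `"error"` or `"warning"` is something to record now and address in the later Request and Transform steps.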
  2. Metadata: Review the metadata associated with your data.
Is the metadata rich, accurate, and complete?

Metadata is information about the data collected. It should include descriptive information about the context, quality, condition, and characteristics of the data (MGI, 2020). Rich, consistent metadata allows a computer to automatically accomplish routine and tedious sorting and prioritizing tasks.

  • Rich metadata describes data with enough relevant attributes to make it easily findable. 
  • Accurate metadata describes data with elements that correctly identify the contents of the data.
  • Complete metadata describes data with enough elements that it fully describes the contents of the data. Complete metadata should include more comprehensive metadata elements than just author, title, and date.

For more information on what constitutes rich, accurate, and complete metadata, and resources for developing appropriate metadata, see the section on Documenting Data.
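As an illustration, a metadata record that goes beyond the bare minimum might look like the following (the element names loosely follow Dublin Core; all values are invented placeholders):

```python
# A metadata record using Dublin Core-style element names.
# All values below are invented placeholders, not a real dataset.
metadata = {
    "title": "Stream Temperature Measurements, 2019-2021",
    "creator": "Example Researcher",
    "date": "2021-06-30",
    # Elements that make the record rich and complete:
    "description": "Hourly stream temperature logged at three sites.",
    "subject": ["hydrology", "temperature", "time series"],
    "format": "text/csv",
    "language": "en",
    "rights": "CC BY 4.0",
    "identifier": "doi:10.xxxx/example",  # placeholder, not a real DOI
}

MINIMUM = {"title", "creator", "date"}

def exceeds_minimum(record):
    """True when the record carries more than the bare-minimum elements."""
    return MINIMUM < set(record)
```

A record containing only author, title, and date would fail this check; the example record above passes it.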

  3. Documentation: Review the documentation associated with your data.

How is the project documented?
- No documentation is present. 
- Some combination of codebook, data-dictionary, and README are present.
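A quick way to answer the documentation question is to scan the project directory for common documentation files; a minimal sketch (the file-name list is illustrative, not exhaustive):

```python
from pathlib import Path

# Common documentation file names (illustrative, not exhaustive).
DOC_NAMES = {
    "readme", "readme.md", "readme.txt",
    "codebook", "codebook.txt",
    "data_dictionary.csv", "data-dictionary.md",
}

def find_documentation(project_dir):
    """Return the documentation files present in a project directory."""
    return sorted(
        p.name for p in Path(project_dir).iterdir()
        if p.is_file() and p.name.lower() in DOC_NAMES
    )
```

An empty result corresponds to the "No documentation is present" case above.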

In the Understand phase of the tutorial we will:

Understand external requirements placed on the data by funders, institutions, and prospective repositories. 

Understand the data itself, check for quality assurance and usability issues, and ensure the data is sufficiently documented such that a new user with domain knowledge, but no knowledge of the specific project, could understand it.


External Requirements


  1. Know what your funding body expects you to do with your data and for how long.
Review your funder requirements
  2. Determine intellectual property rights as they apply to your data. Who owns the data? How will the data be licensed?
  3. Identify any anticipated publication requirements (embargoes or restrictions on publishing).
  4. Identify an appropriate repository according to funder, institutional, or journal requirements.
Determine the terms of use set forth by your data repository


Understand Data


  1. Check for quality assurance and usability issues such as missing data, ambiguous headings, code execution failures, and data presentation concerns. 
  2. Try to detect and extract any “hidden documentation” inherent to the data files that may facilitate reuse.  
  3. Determine whether the documentation of the data is sufficient for a user with qualifications similar to the data author's to understand and reuse the data.
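Step 1 above can be partially automated. A minimal sketch that flags two common quality-assurance issues in a CSV file, missing values and ambiguous column headings (the heuristics here are crude and illustrative):

```python
import csv
import io

def qa_report(csv_text):
    """Flag missing values and ambiguous column headings in CSV text.

    A heading is treated as ambiguous if it is blank, duplicated, or
    a generic name such as 'col1' (a crude heuristic for illustration).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    issues = []
    seen = set()
    for name in header:
        clean = name.strip().lower()
        if not clean or clean in seen or clean.startswith("col"):
            issues.append(f"ambiguous heading: {name!r}")
        seen.add(clean)
    for line_no, row in enumerate(data, start=2):
        if any(cell.strip() == "" for cell in row):
            issues.append(f"missing value on line {line_no}")
    return issues
```

An empty report does not prove the data is clean, but a non-empty one gives you concrete items to carry into the Request step.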

For guides to curating common data formats, view the excellent primers created by the Data Curation Network:

Warning: While some of these guides cover proprietary formats, their inclusion does not serve as a tacit endorsement of those formats. If it is possible to completely convert to an open data format without data loss, please do so. If the conversion to an open format would result in some data loss, then store the open format versions alongside the original proprietary files, with documentation on how the conversion was completed.
Atlas.ti ATLAS.ti is used for qualitative data analysis in multiple disciplines, especially in the humanities and social science disciplines.
Confocal Microscopy (.lsm, .czi, .nd2, .lif, .oib, .zip, .tiff) Confocal Microscopy images are used in a wide range of fields including biology, health, engineering, and chemistry, and may constitute a range of formats depending on the acquiring instrument.
Excel (.xlsx, .xlsm, .xlsb, .xls) Microsoft Excel is a proprietary spreadsheet format commonly used by many research disciplines. If possible, convert your Excel files to a plaintext format such as .csv or .tsv. If this is impossible, then follow this guide.
GeoJSON (.geojson)  GeoJSON is a geospatial data interchange format for encoding vector geographical data structures, such as point, line, and polygon geometries, as well as their non-spatial attributes.
Geodatabase (.gdb) The geodatabase is a container for geospatial datasets that can also provide relational functionality between the files.
Google Docs (.gdoc) Google docs imitate the familiar formats of productivity suite files such as Microsoft Office documents but are structured to be accessed and edited via a browser-based tool, and to be exported to a variety of file formats.
Jupyter Notebooks (.ipynb) Jupyter Notebooks are composite digital objects used to develop, share, view, and execute interspersed, interlinked, and interactive documentation, equations, visualizations, and code. 
MS Access (.mdb, .accdb) Microsoft Access is a proprietary Microsoft database software and format.
NVivo (.nvp, .nvpx, .nvcx) NVivo is a proprietary format of qualitative data analysis software (QDAS). 
PDF (.pdf) The Portable Document Format (PDF) created by Adobe Systems is currently the de facto standard for fixed-format electronic documents (Johnson, 2014). 
R (.R, .Rmd) The file with the extension R (.R) typically contains a script written in R, which is a programming language and environment for statistical computing and graphics. The .Rmd format also contains plain text Markdown for documenting R code and producing structured documents.
SPSS (.sav, .por, .sas, .spv, .spo) SPSS Statistics Data files saved in IBM SPSS Statistics format.
STL (.stl) An STL file stores information about 3D models. It is commonly used for printing 3D objects. The STL format approximates 3D surfaces of a solid model with oriented triangles (facets) of different size and shape (aspect ratio) in order to achieve a representation suitable for viewing or reproduction using digital fabrication.
Tableau (.twbx, .twb) Tableau Software is a proprietary suite of products for data exploration, analysis, and visualization with an initial concentration in business intelligence. 
Wordpress.com (.xml) WordPress.com is an online publishing platform, run by Automattic (https://automattic.com/), a company started by Matt Mullenweg, a founding developer of WordPress.org software. WordPress sites are exported and stored as .xml (Extensible Markup Language) files.
netCDF (.nc) NetCDF is both software and a file format used by researchers in the geosciences to store and analyze data in multi-dimensional arrays.


In the Request phase of the tutorial, you will request any missing or unclear data or metadata from the responsible party for the project.

  1. Reach out to all project participants and collaborators.
Request that each project participant and collaborator release any relevant data or documentation.

Compare the additional collected materials against the missing data and metadata identified during the Check and Understand phases. Remove from the missing list any materials that have been identified and located during this step.
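The bookkeeping in this step amounts to a set difference; a minimal sketch (the item names are placeholders):

```python
def update_missing(missing, located):
    """Remove newly located items from the running list of missing
    data and metadata, returning what is still outstanding."""
    return sorted(set(missing) - set(located))
```

Whatever remains after this comparison becomes the request list for step 2.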

  2. Request missing or unclear information
Review the missing data and metadata identified during the Check and Understand phases.
Request missing data and metadata from the data author.

In the Augment phase of the tutorial, you will augment your metadata to ensure it is findable. You will also identify how a persistent identifier (PID) will be assigned to your dataset, either through a digital object identifier (DOI) assigned by a repository, or manually through a DOI-assignment service such as the one made available by DataCite.

  1. Review the metadata shortcomings identified during the Check step of the tutorial.
Enhance the number and correctness of metadata elements to ensure metadata is findable.

Metadata elements should be more comprehensive than Author, Date, and Title. Ensure sufficient metadata elements are utilized to describe the data, and that consistent and descriptive element names are used. For more information, review the Documenting Data section of the libguide.

  2. Review the repository you identified in the Understand step of the tutorial. Does the identified repository assign a DOI?
If a DOI is not assigned, utilize a separate service such as DataCite to assign one.
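If you do need to mint a DOI yourself, DataCite exposes a REST API that accepts a JSON request body. The sketch below only builds such a body; the attribute names follow the general shape of DataCite's JSON:API schema, but the prefix and values shown are placeholders, and you should verify the details (endpoint, authentication, required fields) against current DataCite documentation before use.

```python
def datacite_payload(prefix, title, creators, publisher, year):
    """Build a DOI request body in the general shape DataCite's REST
    API expects. Attribute names are based on DataCite's JSON:API
    schema; verify against current DataCite documentation."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": prefix,  # your repository's DOI prefix (placeholder)
                "titles": [{"title": title}],
                "creators": [{"name": name} for name in creators],
                "publisher": publisher,
                "publicationYear": year,
                "types": {"resourceTypeGeneral": "Dataset"},
            },
        }
    }
```

The resulting dictionary would be serialized to JSON and POSTed to DataCite's DOI endpoint with your repository credentials.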

In the Transform phase of the tutorial, you will transform your file formats and software to ensure your project is reproducible and reusable.

  1. Convert proprietary data formats to an open alternative.
When lossless conversion is not possible, store the open version alongside the original proprietary file, with documentation of how the conversion was completed.

For guidance identifying proprietary data formats, and converting to open formats with a high probability of long-term preservation, review the Transforming Data section of the libguide.
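As a starting point, a script can flag proprietary files in a project and suggest open alternatives; a minimal sketch (the extension mapping is illustrative, not exhaustive):

```python
from pathlib import Path

# Illustrative mapping of proprietary extensions to suggested open
# alternatives (not exhaustive; see the Transforming Data section).
OPEN_ALTERNATIVES = {
    ".xlsx": ".csv",
    ".xls": ".csv",
    ".sav": ".csv",               # SPSS
    ".mdb": ".csv",               # MS Access tables
    ".gdoc": ".odt or .txt",
    ".nvp": "exported plaintext",  # NVivo
}

def flag_proprietary(paths):
    """Return (file name, suggested open format) pairs for any files
    whose extension appears in the proprietary-format mapping."""
    flagged = []
    for p in map(Path, paths):
        suffix = p.suffix.lower()
        if suffix in OPEN_ALTERNATIVES:
            flagged.append((p.name, OPEN_ALTERNATIVES[suffix]))
    return flagged
```

Each flagged file is a candidate for conversion, or, where conversion would be lossy, for archiving in both forms.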

  2. Does your code run? During the Check phase of the tutorial, you identified whether your code runs successfully without errors. Can it be run, given the current level of documentation, by a non-expert?
Document your code such that a non-expert can run it in any environment.

Review the Transforming Software for Reproducibility section of the libguide for tools and advice on preparing a software-based research product for reproducible archiving.
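As one concrete pattern, a hypothetical script documented for a non-expert might pair a usage-oriented docstring with a command-line interface, so that running it requires no knowledge of the project internals (all names below are invented):

```python
"""analyze.py - reproduce the study's summary table (hypothetical script).

A non-expert should be able to run this from the command line with no
knowledge of the project, e.g.:

    python analyze.py --input data/measurements.csv --output results.csv

Dependencies, if any, should be pinned in a requirements.txt stored
alongside this script so the environment can be recreated.
"""
import argparse

def build_parser():
    """Define the command-line interface, with help text for each option."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--input", required=True,
                        help="path to the input CSV file")
    parser.add_argument("--output", default="results.csv",
                        help="where to write the results (default: results.csv)")
    return parser
```

With this structure, `python analyze.py --help` alone tells a new user how to reproduce the analysis.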

In the Evaluate phase of the tutorial, we will compare your data against each of the FAIR principles. At this stage of the tutorial, if you have followed the previous steps from Check to Transform, then your data should be fully FAIR-compliant.


Findable

  • Metadata exceeds author/title/date.
  • Unique PID (DOI, Handle, PURL, etc.).
  • Discoverable via web search engines.

Accessible

  • Retrievable via a standard protocol (e.g., HTTP).
  • Free, open (e.g., download link).

Interoperable

  • Metadata formatted in a standard schema (e.g., Dublin Core).
  • Metadata provided in machine-readable format (OAI feed).

Reusable

  • Data include sufficient metadata describing their characteristics to enable reuse.
  • Contact information is displayed in case the direct assistance of the author is needed.
  • Clear indicators of who created, owns, and stewards the data.
  • Data are released with clear data usage terms (e.g., a CC License).
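The checklist above can double as a simple self-audit; a minimal sketch, with the criteria paraphrased from the checklist:

```python
# Criteria paraphrased from the Data Curation Network checklist above.
FAIR_CHECKLIST = {
    "Findable": ["metadata exceeds author/title/date", "unique PID",
                 "discoverable via web search"],
    "Accessible": ["retrievable via standard protocol", "free and open"],
    "Interoperable": ["standard metadata schema", "machine-readable metadata"],
    "Reusable": ["sufficient metadata for reuse", "contact info displayed",
                 "clear provenance", "clear usage terms"],
}

def evaluate(satisfied):
    """Given the set of criteria a project satisfies, report which
    FAIR principles still have unmet criteria."""
    gaps = {}
    for principle, criteria in FAIR_CHECKLIST.items():
        missing = [c for c in criteria if c not in satisfied]
        if missing:
            gaps[principle] = missing
    return gaps
```

An empty report means every criterion in the checklist is accounted for; otherwise the output lists exactly what still needs attention.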

Source: Data Curation Network (2018). “Checklist of CURATED Steps Performed by the Data Curation Network." 

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.