This tutorial is designed to help researchers prepare a research project for archiving in accordance with FAIR principles, ensuring the project's value is preserved during and after its completion. Data curation may seem like a daunting task, but this tutorial aims to simplify the process by breaking it into six themed steps, CURATE:
Adapted from: Data Curation Network (2018). “Checklist of CURATED Steps Performed by the Data Curation Network."
This guide covers the minimum steps needed to make a project FAIR, and also notes specific funder requirements where those requirements exceed that minimum. The tutorial is not intended to be exhaustive; instead it serves as a practical groundwork for implementing FAIR data practices. Additional resources beyond its scope are referenced where appropriate.
The CURATE tutorial works through a series of question prompts about your data. Take note of the results of these questions as you move through the steps of the tutorial, as you will respond to them as you progress through the CURATE process.
In the Check phase of the tutorial, we will check the thoroughness of your documentation, the quality of your metadata, and the reproducibility of any code. This step will highlight any inadequacies in your project, which will then be addressed in the following steps. Don't worry if you are not completely familiar with these concepts yet; they will become clear as we move through the tutorial.
Do your files open as expected?
Does your code run as expected?
If it does not run as expected:
- Is it a warning, and is the code otherwise able to execute as expected?
- Is it an error that leaves the code unable to execute?
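One practical way to answer these questions is to run each script and inspect its exit status and standard error. The sketch below assumes a Python project; "analysis.py" is a placeholder for your own entry-point script:

```python
# A minimal sketch for checking whether an analysis script still runs.
# "analysis.py" is a placeholder for your own entry-point script.
import subprocess

result = subprocess.run(
    ["python", "analysis.py"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    # Exit code 0: the script executed; stderr may still contain warnings.
    if result.stderr:
        print("Ran with warnings:\n", result.stderr)
    else:
        print("Ran cleanly.")
else:
    # Non-zero exit code: an error stopped execution.
    print("Failed with error:\n", result.stderr)
```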
Is metadata quality rich, accurate, and complete?
Metadata is information about the data collected. It should include descriptive information about the context, quality, condition, and characteristics of the data (MGI, 2020). Rich, consistent metadata allows a computer to automatically accomplish routine and tedious sorting and prioritizing tasks.
For more information on what constitutes rich, accurate, and complete metadata, and resources for developing appropriate metadata, see the section on Documenting Data.
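For illustration, the sketch below shows what a rich, machine-readable metadata record for a dataset might contain. The element names loosely follow common schema fields and the values are hypothetical placeholders; substitute whatever standard your repository or discipline requires:

```python
# A minimal sketch of a rich, machine-readable metadata record.
# Element names loosely follow common schema fields (e.g., Dublin Core,
# DataCite); all values are hypothetical placeholders.
metadata = {
    "title": "Example Survey of Community Health, 2019-2021",
    "creators": ["Doe, Jane", "Smith, Alex"],
    "description": "Longitudinal community health survey; see README for methods.",
    "subjects": ["public health", "longitudinal survey"],
    "date_collected": "2019-03-01/2021-06-30",
    "geographic_coverage": "Example County, USA",
    "rights": "CC-BY-4.0",
    "format": "text/csv",
}
```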
Documentation: Review the documentation associated with your data.
How is the project documented?
- No documentation is present.
- Some combination of codebook, data dictionary, and README is present.
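Where documentation is missing, a brief README is often the fastest remedy. The outline below is a minimal sketch of conventional README sections, not a required standard; adapt the headings to your project:

```text
PROJECT TITLE AND CONTACT
  Dataset title, authors, and a contact email.
DESCRIPTION
  What was collected, when, where, and why.
FILE INVENTORY
  Each file's name, format, and contents.
DATA DICTIONARY
  Each variable's name, definition, units, and allowed values.
METHODS
  How the data were collected and processed.
LICENSE AND CITATION
  Terms of reuse and the preferred citation.
```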
In the Understand phase of the tutorial we will:
Understand external requirements placed on the data by funders, institutions, and prospective repositories.
Understand the data itself, check for quality assurance and usability issues, and ensure the data is sufficiently documented that a new user with domain knowledge, but no knowledge of the specific project, could understand it.
External Requirements
Review your funder requirements
Determine the terms of use set forth by your data repository
Understand Data
For guides to curating common data formats, view the excellent primers created by the Data Curation Network:
| Format | Description |
| --- | --- |
| ATLAS.ti | ATLAS.ti is used for qualitative data analysis in multiple disciplines, especially in the humanities and social sciences. |
| Confocal Microscopy (.lsm, .czi, .nd2, .lif, .oib, .zip, .tiff) | Confocal microscopy images are used in a wide range of fields, including biology, health, engineering, and chemistry, and may come in a range of formats depending on the acquiring instrument. |
| Excel (.xlsx, .xlsm, .xlsb, .xls) | Microsoft Excel is a proprietary spreadsheet format commonly used by many research disciplines. If possible, convert your Excel files to a plaintext format such as .csv or .tsv; if that is not possible, follow this guide. |
| GeoJSON (.geojson) | GeoJSON is a geospatial data interchange format for encoding vector geographical data structures, such as point, line, and polygon geometries, as well as their non-spatial attributes. |
| Geodatabase (.gdb) | The geodatabase is a container for geospatial datasets that can also provide relational functionality between the files. |
| Google Docs (.gdoc) | Google Docs imitate the familiar formats of productivity-suite files such as Microsoft Office documents, but are structured to be accessed and edited via a browser-based tool and exported to a variety of file formats. |
| Jupyter Notebooks (.ipynb) | Jupyter Notebooks are composite digital objects used to develop, share, view, and execute interspersed, interlinked, and interactive documentation, equations, visualizations, and code. |
| MS Access (.mdb, .accdb) | Microsoft Access is a proprietary Microsoft database software and format. |
| NVivo (.nvp, .nvpx, .nvcx) | NVivo is a proprietary format of qualitative data analysis software (QDAS). |
| PDF (.pdf) | The Portable Document Format (PDF), created by Adobe Systems, is currently the de facto standard for fixed-format electronic documents (Johnson, 2014). |
| R (.R, .Rmd) | A file with the extension .R typically contains a script written in R, a programming language and environment for statistical computing and graphics. The .Rmd format also contains plain-text Markdown for documenting R code and producing structured documents. |
| SPSS (.sav, .por, .sas, .spv, .spo) | SPSS Statistics data files are saved in IBM's SPSS Statistics format. |
| STL (.stl) | An STL file stores information about 3D models and is commonly used for printing 3D objects. The format approximates the 3D surfaces of a solid model with oriented triangles (facets) of varying size and shape (aspect ratio) to achieve a representation suitable for viewing or reproduction using digital fabrication. |
| Tableau (.twbx, .twb) | Tableau Software is a proprietary suite of products for data exploration, analysis, and visualization, with an initial concentration in business intelligence. |
| WordPress.com (.xml) | WordPress.com is an online publishing platform run by Automattic (https://automattic.com/), a company started by Matt Mullenweg, a founding developer of the WordPress.org software. WordPress sites are exported and stored as .xml (Extensible Markup Language) files. |
| netCDF (.nc) | NetCDF is both software and a file format used by researchers in the geosciences to store and analyze data in multi-dimensional arrays. |
In the Request phase of the tutorial, you will request any missing or unclear data or metadata from the responsible party for the project.
Request that each project participant and collaborator release any data or documentation they hold.
Review the missing data and metadata identified during the Check and Understand phases.
Request missing data and metadata from the data author.
Compare the additional materials you collect against the missing data and metadata identified during the Check and Understand phases, and remove from the missing list any items that have now been identified and located.
In the Augment phase of the tutorial, you will augment your metadata to ensure it is findable. You will also identify how a persistent identifier (PID) will be assigned to your dataset: either through a digital object identifier (DOI) assigned by a repository, or manually through a DOI-assignment service such as the one made available by DataCite.
Enhance the number and correctness of metadata elements to ensure the metadata is findable.
Metadata elements should be more comprehensive than Author, Date, and Title. Ensure sufficient elements are used to describe the data, and that element names are consistent and descriptive. For more information, review the Documenting Data section of the libguide.
If a DOI is not assigned by your repository, use a separate service such as DataCite to assign one.
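As an illustration, the sketch below mints a DOI through the DataCite REST API. It assumes you already have a DataCite repository account; the credentials, the 10.5072 test prefix, the creator and title values, and the landing-page URL are all placeholders, and the call targets DataCite's test API rather than production:

```python
# A minimal sketch of minting a DOI via the DataCite REST API.
# Assumes a DataCite repository account; the repository ID, password,
# 10.5072 test prefix, and all metadata values are placeholders.
import requests

payload = {
    "data": {
        "type": "dois",
        "attributes": {
            "doi": "10.5072/example-dataset-001",  # placeholder test prefix
            "event": "publish",
            "creators": [{"name": "Doe, Jane"}],
            "titles": [{"title": "Example Research Dataset"}],
            "publisher": "Example University",
            "publicationYear": 2024,
            "types": {"resourceTypeGeneral": "Dataset"},
            "url": "https://example.org/datasets/001",  # landing page
        },
    }
}

response = requests.post(
    "https://api.test.datacite.org/dois",  # test API; use api.datacite.org in production
    json=payload,
    auth=("REPOSITORY_ID", "REPOSITORY_PASSWORD"),
    headers={"Content-Type": "application/vnd.api+json"},
)
response.raise_for_status()
print(response.json()["data"]["id"])  # the newly minted DOI
```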
In the Transform phase of the tutorial, you will transform your file formats and software to ensure your project is reproducible and reusable.
When possible, convert proprietary formats to an open alternative.
For guidance identifying proprietary data formats, and converting to open formats with a high probability of long-term preservation, review the Transforming Data section of the libguide.
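For example, a proprietary spreadsheet can often be converted programmatically. The sketch below is a minimal Python example assuming the pandas library (with openpyxl installed for .xlsx support); "survey_results.xlsx" is a hypothetical filename:

```python
# A minimal sketch of converting a proprietary spreadsheet to an open format.
# "survey_results.xlsx" is a placeholder; pandas reads .xlsx via openpyxl.
import pandas as pd

# sheet_name=None loads every sheet, so each one becomes its own CSV
# and no data is silently dropped.
sheets = pd.read_excel("survey_results.xlsx", sheet_name=None)
for name, frame in sheets.items():
    frame.to_csv(f"survey_results_{name}.csv", index=False)
```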
Document your code such that it can be run by a non-expert in any environment.
Review the Transforming Software for Reproducibility section of the libguide for tools and advice on preparing a software-based research product for reproducible archiving.
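As one illustration of that level of documentation, the sketch below shows a script whose usage, inputs, and dependencies are stated up front; the file names and pinned version are hypothetical examples:

```python
#!/usr/bin/env python
"""Reproduce the main analysis.

A minimal sketch of self-documenting script conventions; the file names
and package version below are hypothetical examples.

Usage:
    python analysis.py --input data/survey_results.csv --output results/

Requirements (also listed in requirements.txt):
    pandas==2.2.0
"""
import argparse

parser = argparse.ArgumentParser(description="Reproduce the main analysis.")
parser.add_argument("--input", required=True, help="Path to the input CSV file")
parser.add_argument("--output", required=True, help="Directory for result files")
args = parser.parse_args()

# The analysis itself would go here; this stub just confirms the arguments.
print(f"Reading {args.input}, writing results to {args.output}")
```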
In the Evaluate phase of the tutorial, we will compare your data against each of the FAIR principles. At this stage of the tutorial, if you have followed the previous steps from Check to Transform, your data should be fully compliant with the FAIR principles:
Findable - Data are assigned a globally unique persistent identifier (such as a DOI) and described with rich metadata that is indexed in a searchable resource.
Accessible - Data and metadata are retrievable by their identifier using a standardized, open protocol, and the metadata remains accessible even if the data are no longer available.
Interoperable - Data and metadata use formal, shared vocabularies and open formats so they can be exchanged across applications and workflows.
Reusable - Data and metadata carry accurate attributes, clear provenance, and an explicit usage license, and meet domain-relevant community standards.