FAIR Data Principles
Image Attribution: SangyaPundir, CC BY-SA 4.0.
The FAIR Data Principles were defined at the Lorentz Workshop in 2014 and subsequently published. They serve as guidelines both for those who publish data, e.g. researchers, and for those who preserve data, e.g. repository or archive service providers. This distinction is important: a researcher’s data can only be as FAIR as the infrastructure and services available to them.
Keeping in mind the long-term goal of making scientific data reusable to others by creating systematic order in the ever-growing mound of data arising from research, the FAIR Data Principles place a strong focus on machine-readable (meta)data. While humans must be able to find and understand the published datasets as well, computers will be tasked with sifting through large amounts of data and determining which datasets are relevant to the intended purpose and how they can be reused.
Reusing well-curated research data makes researchers’ tasks easier, enabling them to build upon previous research. Furthermore, data reuse allows large stores of data to be mined for machine-learning applications. For example, large amounts of chemical synthesis data could lead to the virtual development of new synthesis methods or the discovery of new compounds.
In chemistry, the primary example of the FAIR Data Principles in action is the deposition of crystallographic data in a standardized file format (CIF) into a repository such as the Cambridge Structural Database (CSD) of the Cambridge Crystallographic Data Centre (CCDC) or the Inorganic Crystal Structure Database (ICSD) of FIZ Karlsruhe, a step required by many leading journals in the chemical community. To broaden the scope of such standards, NFDI4Chem is working towards creating an infrastructure for FAIR data in Germany while training researchers in using the available infrastructure to ensure their data is as FAIR as can be.
In the following, we answer the questions: What makes data FAIR? What do researchers and those who provide data preservation services need to consider?
Researchers — and the computers working on their behalf — must be able to find datasets to be able to reuse them. Therefore, the first guideline of the FAIR Data Principles outlines methods to ensure a dataset’s discovery.
F1. (meta)data are assigned a globally unique and persistent identifier
A globally unique and persistent identifier (PID) helps both machines and humans find the data in the first place. These PIDs are essential for research as they guarantee the availability of the associated resource, in this case a dataset. The registry services that make these identifiers available work to maintain the link to the resource, thus avoiding dead links. This ensures the resource remains findable and may be referenced simply by the use of its PID.
A common example of a citable PID is the Digital Object Identifier, or DOI. As with many journals, scientific data repositories often assign a DOI automatically. The Registry of Research Data Repositories, re3data, indicates whether a given repository assigns an identifier, along with the PID type. For example, both the Cambridge Structural Database (CSD) and the Chemotion Repository assign DOIs to each dataset deposited. Researchers should check for this option when searching for a suitable repository, while repositories should offer this service.
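As a minimal sketch of how a DOI can be referenced "simply by the use of its PID", the following Python snippet turns a DOI string into the URL of the public https://doi.org resolver. The function name `doi_to_url`, the loose validation pattern, and the DOI `10.1234/example-dataset` are assumptions made for illustration only; the pattern is not the full DOI syntax.

```python
import re
from urllib.parse import quote

# Loose pattern for a DOI: "10.", a registrant code, "/", then a suffix.
# This is an illustrative check, not the complete DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org resolver URL for a DOI string."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognisable DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(doi, safe="/")


# 10.1234/example-dataset is a made-up DOI used only for illustration.
print(doi_to_url("10.1234/example-dataset"))
```

Because the registry maintains the link behind the resolver URL, a citation built this way keeps working even if the repository later moves the dataset's landing page.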
F2. data are described with rich metadata (defined by R1 below)
Data need to be sufficiently described in order to make them both findable and reusable. Hence, the specific focus here lies on making the (meta)data findable by using rich discovery metadata in a standardized format and allowing computers and humans to quickly understand the dataset’s contents. This is an essential component in the plurality of metadata described by R1 below. This information may include, but is not limited to:
- the context on what the dataset is, how it was generated, and how it can be interpreted,
- the data quality,
- licensing and (re)use agreements,
- what other data may be related (linked via its PID), and
- associated journal publications and their DOI.
Repositories should provide researchers with a fillable application profile that allows them to give extensive and precise information on their deposited datasets. For example, the Chemotion Repository uses, among others, the DataCite Metadata Schema to build its application profile, a schema specifically created for the publication and citation of research data. RADAR, including the variant RADAR4Chem, has also built its metadata schema on DataCite. These schemas include an assortment of mandatory, recommended, and optional metadata properties, allowing for a rich description of the deposited dataset. For those publishing data, always keep in mind: the more information provided, the better.
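To illustrate the idea of mandatory and optional properties, the sketch below checks a metadata record for the six properties that the DataCite Metadata Schema (version 4.x) marks as mandatory. The flat dictionary layout, the function name `missing_mandatory`, and all values in `record` are simplifying assumptions; a real DataCite record is more deeply nested.

```python
# Mandatory properties as named in the DataCite Metadata Schema (4.x).
# The flat dict layout used here is a simplification for illustration.
MANDATORY = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_mandatory(record: dict) -> list:
    """Return the mandatory properties that are absent or empty."""
    return [key for key in MANDATORY if not record.get(key)]


# All values below are made up for illustration.
record = {
    "identifier": "10.1234/example-dataset",
    "creators": ["Doe, Jane"],
    "titles": ["NMR spectra of compound 1"],
    "publisher": "Example Repository",
    "publicationYear": 2024,
    "resourceType": "Dataset",
    "subjects": ["NMR spectroscopy"],  # optional, but improves discovery
}

print(missing_mandatory(record))                 # nothing missing
print(missing_mandatory({"identifier": "10.1234/x"}))
```

A repository could run a check of this kind at deposition time and prompt the researcher for whatever is still missing.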
F3. metadata clearly and explicitly include the identifier of the data it describes
While F1 stipulates the assignment of an identifier, F3 underlines the importance of including this identifier in the metadata itself. The metadata and the dataset it describes are typically separate files. Including the identifier in the metadata directly links the information to the associated dataset.
Furthermore, the dataset may not be published alongside the metadata. For example, in the case of unpublished archived datasets, the PID can lead to a method (e.g. a landing page) to contact those responsible for the data instead of to the dataset itself. Researchers should be aware of this, while repositories must not only assign a PID as described in F1 above, but should also ensure that this PID is a required property of the metadata.
F4. (meta)data are registered or indexed in a searchable resource
Metadata are used to set up indices, enabling machines to efficiently search for and find datasets. For this process to work successfully, metadata must be complete as outlined above. Repositories should ensure the metadata entered for a deposited dataset is available in a machine-readable format to facilitate the assignment of indices.
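How machine-readable metadata feed an index can be sketched with a small inverted index that maps keywords to dataset PIDs. This is a toy illustration under stated assumptions: the `keywords` field, the function names, and the DOIs are hypothetical, and production search services use far more elaborate indexing.

```python
from collections import defaultdict


def build_index(records: dict) -> dict:
    """Map each lower-cased keyword to the set of PIDs whose metadata use it."""
    index = defaultdict(set)
    for pid, metadata in records.items():
        for keyword in metadata.get("keywords", []):
            index[keyword.lower()].add(pid)
    return index


def search(index: dict, *keywords: str) -> set:
    """Return the PIDs of records tagged with every given keyword."""
    hits = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*hits) if hits else set()


# Made-up records keyed by made-up DOIs.
records = {
    "10.1234/a": {"keywords": ["NMR", "ethanol"]},
    "10.1234/b": {"keywords": ["NMR", "benzene"]},
    "10.1234/c": {"keywords": []},  # no keywords: cannot be found this way
}

index = build_index(records)
print(search(index, "nmr"))
print(search(index, "nmr", "ethanol"))
```

The third record shows why completeness matters: a dataset without indexed metadata exists in the repository but never appears in a search result.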
Accessible means that humans and machines receive instructions on how to obtain the data. It should be noted that FAIR does not equate to open, as further explained in A1.2.
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
To guarantee access to datasets, persistent identifiers such as DOIs are recommended, as they are resolved using standard methods. Common protocols include HTTP(S) and (S)FTP.
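Retrieval by identifier over a standard protocol can be sketched with Python's standard library. The snippet prepares, but deliberately does not send, an HTTPS request that asks the DOI resolver for metadata rather than the landing page. It assumes the DOI is registered with DataCite and that the media type `application/vnd.datacite.datacite+json` is offered through content negotiation; the DOI itself is made up.

```python
from urllib.request import Request


def metadata_request(doi: str) -> Request:
    """Prepare (but do not send) an HTTPS request for a DOI's metadata."""
    return Request(
        f"https://doi.org/{doi}",
        # Ask for machine-readable metadata instead of the HTML landing page.
        headers={"Accept": "application/vnd.datacite.datacite+json"},
        method="GET",
    )


req = metadata_request("10.1234/example-dataset")  # made-up DOI
print(req.get_method(), req.full_url)
print(req.get_header("Accept"))
# To actually retrieve the record, pass `req` to urllib.request.urlopen.
```

Nothing here is specific to one repository: any client that speaks HTTPS can issue the same request.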
A1.1 the protocol is open, free, and universally implementable
Repositories should only use protocols that allow any computer to access at least the metadata. This refers not only to the use of standard communication protocols, as stated in A1; these protocols must also be freely available and open. Therefore, proprietary or non-standard protocols should be avoided.
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
Where necessary, machine-readable protocols that let the user know that action needs to be taken (such as a login) to access data must be in place. FAIR data and open data are not synonymous: FAIR data requires that it must be clearly stated how the data can be accessed, as opposed to granting anyone and everyone full access. In manuscripts of scientific articles, this information should be included in a data availability statement. This can be especially important for sensitive data, where, for example, personal data and/or medical information may be disclosed. Hence, repositories should also provide a way for users (and their computers) to identify themselves, enabling access permission to be granted.
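With HTTP, the signal that "action needs to be taken" is typically carried by the response status code. The sketch below maps a few standard status codes to the next step a client might take. The function name `access_hint` and the wording of the hints are assumptions for illustration; how a given repository actually responds is not specified by the FAIR principles.

```python
from http import HTTPStatus


def access_hint(status_code: int) -> str:
    """Translate an HTTP status code into the action a client should take."""
    hints = {
        HTTPStatus.OK: "data retrieved",
        HTTPStatus.UNAUTHORIZED: "authenticate (e.g. log in) and retry",
        HTTPStatus.FORBIDDEN: "request permission from the data holder",
        HTTPStatus.GONE: "data removed; consult the metadata record",
    }
    return hints.get(status_code, "consult the repository documentation")


for code in (200, 401, 403, 410):
    print(code, "->", access_hint(code))
```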
A2. metadata are accessible, even when the data are no longer available
The metadata that describe a dataset should be stored in a separate file so that they remain available even if the dataset itself can no longer be accessed. Problems with dataset availability are usually due to 1) the cost of maintaining and storing full datasets and 2) file format deprecation as technologies evolve. Maintaining metadata files is cheaper and simpler and ensures that, at a minimum, details such as contact information remain available. These files should thus be archived indefinitely.
A repository should clearly state a contingency plan for metadata storage should the service no longer exist, such as migrating to another repository service provider while ensuring the integrity of the persistent identifier is guaranteed.
Data need to be integrated with and/or compared to other datasets, while computers must be able to interpret and exchange the information. Ideally, they are compatible with standard applications and can thus be integrated into (automated) processing and analysis workflows. Interoperability often functions as a precursor to reusability, as it ensures the compatibility across systems.
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
Machines need to be able to understand how to exchange and interpret information. As with humans, a uniform and standard language aids in this understanding. In chemistry, a typical example of such an information exchange standard is the Crystallographic Information File (CIF). This standard also adheres to the aspects described in I2 and R1.3 below. Simply put, standard file formats for a given analytical method ensure that the data and the associated metadata, which typically include measurement details, follow a prescribed format. This ensures both humans and machines receive the information required to interpret the data.
Especially when looking at metadata, effective and efficient machine readability greatly depends on being able to reduce ambiguity. Metadata provides context to datasets. However, machines need to be able to interpret this context. Therefore, the structured schemas chosen by the repositories should include universally applied ontologies and controlled vocabularies to define relationships and avoid ambiguity. For example, chemistry-specific repositories should be designed to include ontologies such as the Chemical Methods Ontology (CHMO) or the Chemical Information Ontology (CHEMINF) to accurately describe the (meta)data provided. Such ontologies should be based on widely-applied data models, for instance, the Resource Description Framework (RDF).
I2. (meta)data use vocabularies that follow FAIR principles
The applied vocabularies or ontologies should be well-documented and resolvable using a PID. For instance, CHMO mentioned above uses a persistent URL (PURL), resolvable using a standard web browser through HTTP, while the documentation is publicly available on GitHub.
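As a small illustration of a resolvable vocabulary term, the sketch below expands a compact identifier of the form `CHMO:<number>` into a PURL following the `http://purl.obolibrary.org/obo/` pattern used by OBO ontologies such as CHMO. The function name is hypothetical, and `CHMO:0000000` is a placeholder rather than a real term; look up actual term identifiers in the ontology itself.

```python
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_purl(curie: str) -> str:
    """Expand a compact identifier such as 'CHMO:0000000' into an OBO PURL."""
    prefix, _, local_id = curie.partition(":")
    if not prefix or not local_id.isdigit():
        raise ValueError(f"expected PREFIX:digits, got {curie!r}")
    return f"{OBO_PURL_BASE}{prefix.upper()}_{local_id}"


# CHMO:0000000 is a placeholder, not a real ontology term.
print(curie_to_purl("CHMO:0000000"))
```

Storing the full PURL in the metadata, rather than a free-text method name, lets a machine look up exactly which concept is meant.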
I3. (meta)data include qualified references to other (meta)data
Related datasets should be linked in a reliable manner, preferably via their PIDs. This includes any previous versions, datasets required to fully use and comprehend the current dataset, or datasets that the dataset builds upon. This relationship should also be described in a meaningful manner. For example, if dataset X is a previous version of dataset Y, it would be described as such rather than simply being described as a related or an associated dataset. Repositories should include a method of referring to other datasets in their metadata form.
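The difference between a qualified and an unqualified reference can be sketched using relation types from the DataCite Metadata Schema (`IsNewVersionOf`, `IsDerivedFrom`, `IsSupplementTo`). The record layout loosely follows DataCite's `relatedIdentifier` property; the helper `describe_links`, the phrase table, and all DOIs are assumptions made for illustration.

```python
# Plain-language phrases for a few DataCite relationType values.
PHRASES = {
    "IsNewVersionOf": "is a new version of",
    "IsDerivedFrom": "is derived from",
    "IsSupplementTo": "is a supplement to",
}


def describe_links(pid: str, related: list) -> list:
    """Render each qualified reference as a readable statement."""
    return [
        f"{pid} {PHRASES[ref['relationType']]} {ref['relatedIdentifier']}"
        for ref in related
    ]


# Dataset Y (made-up DOI) points to its previous version X and its source.
related_identifiers = [
    {"relatedIdentifier": "10.1234/dataset-x",
     "relatedIdentifierType": "DOI",
     "relationType": "IsNewVersionOf"},
    {"relatedIdentifier": "10.1234/raw-data",
     "relatedIdentifierType": "DOI",
     "relationType": "IsDerivedFrom"},
]

for line in describe_links("10.1234/dataset-y", related_identifiers):
    print(line)
```

Each entry states both the target PID and the nature of the link, which is what allows a machine to follow a version chain or trace a dataset back to its source.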
Many of the previous points lead to one key aspect of data sharing: data reusability. Datasets must be described in a manner that allows the user to easily determine how and under which conditions the data can be reused.
R1. (meta)data are richly described with a plurality of accurate and relevant attributes
Related to F2 above, the focus here lies on whether the data, once found, are usable to the person or computer searching. It also stresses giving the data as many attributes as possible. Researchers should not assume that the person (or that person’s computer) looking to (re)use their data is completely familiar with the discipline. Examples of information to assign here include (non-exhaustive list):
- What the dataset contains, including whether the data is raw and/or processed
- How the data was processed
- How the data can be reused
- Who created the data
- Date of creation
- Variable names
- Standard methods used
- Scope of the data and project
- Lab conditions
- Any limitations to the data
- Software and versions used for acquisition and processing.
An important piece of information for chemical data is the machine-readable chemical structure. It should be included within the dataset and/or metadata, as it helps computers find the correct data in their queries.
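Common machine-readable structure representations include SMILES, InChI, and InChIKey. The sketch below attaches all three for ethanol to a metadata fragment and performs a purely syntactic sanity check. The field names and the helper `looks_valid` are assumptions; the check only tests the general shape of the strings and does not verify that they describe the same molecule, which would require a cheminformatics toolkit.

```python
import re

# General shape of a standard InChIKey: 14, 10, and 1 upper-case letters.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

# Structure identifiers for ethanol; the field names are illustrative.
structure = {
    "smiles": "CCO",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
}


def looks_valid(entry: dict) -> bool:
    """Syntactic check only: standard InChI prefix and InChIKey shape."""
    return (
        entry.get("inchi", "").startswith("InChI=1S/")
        and bool(INCHIKEY_PATTERN.match(entry.get("inchikey", "")))
    )


print(looks_valid(structure))
```

The fixed-length InChIKey is convenient as an exact-match search key, while the InChI and SMILES strings carry the structure itself.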
Repositories should provide data publishers with the opportunity to include a plurality of information in their metadata. This includes giving a wide range of optional and free-fill fields for data publishers to complete.
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
In simple terms: metadata include any relevant history. If the dataset is related to other datasets or based on another researcher’s data, these should be linked via their PID as described in I3. This includes citing or acknowledging others for their work, which also takes their licensing or use agreements into consideration (see R1.1). Furthermore, metadata should contain machine-readable information on how the data was generated or processed.
R1.3. (meta)data meet domain-relevant community standards
As research data management and, with it, data publication become more prevalent across research areas, best practices will emerge in the individual communities. These should encompass metadata templates for proper documentation of datasets, how the data should be organized, which vocabularies or ontologies to use, and file formats. NFDI4Chem is working to establish metadata and data standards for the various communities in chemistry.
Where available, community standards and best practices should be followed when researchers prepare their datasets and relevant metadata for publication. Repositories, especially domain-specific service providers, should adhere to the standards set forth by the community by requiring files and metadata to follow format specifications. As noted in I1 above, the CIF format represents a community-specific standard associated with the chemical community. Furthermore, NMReDATA represents a possible standard for publishing and archiving (meta)data of Nuclear Magnetic Resonance (NMR) experiments.
Where required, format converters should be linked in the dataset’s metadata.