
How to choose the right repository

Introduction

The chemistry consortium of the NFDI (NFDI4Chem) aims to support community members (data producers and users) in collecting, storing, processing, analysing, disclosing and re-using research data by establishing a federation of repositories. The consortium's website offers an entry point with an overview of the selected repositories, the central search service and common standards. To identify the core repositories that are part of the federation and have potential for development, the TA3 team defined the following criteria in the proposal:

  • The repository is suitable for the deposition of molecule related data
  • The repository contains reusable data or functionality (such as viewers, editors or analysis tools) that fulfil the needs of the NFDI4Chem community
  • The repository software is open source
  • The operators of the repositories have declared their willingness to adapt their services to the standards developed by NFDI4Chem including the FAIR data principles
  • The repository operators can be funded in accordance with the funding guidelines of the NFDI (i.e., the main operator is based in Germany and is a non-profit organisation)

The repositories that fulfil all of the above-mentioned criteria are listed below as core repositories. Repositories that do not fulfil all criteria are included as associated or other relevant repositories:
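The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the class, field and function names are hypothetical and not part of any NFDI4Chem tooling, and the two example records are invented rather than assessments of real repositories. Because the text does not say how associated and other relevant repositories are told apart, the sketch keeps them in one category.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RepositoryCriteria:
    """One flag per selection criterion from the list above (hypothetical names)."""
    suitable_for_molecule_data: bool
    offers_reusable_data_or_tools: bool
    software_is_open_source: bool
    operators_adopt_nfdi4chem_standards: bool
    operators_fundable_under_nfdi_rules: bool


def unmet_criteria(criteria):
    """Return the names of all criteria that are not fulfilled."""
    return [f.name for f in fields(criteria) if not getattr(criteria, f.name)]


def classify(criteria):
    """Core only if every criterion is met; otherwise associated or relevant."""
    return "core" if not unmet_criteria(criteria) else "associated or relevant"


# Invented example records, not real repositories.
example_a = RepositoryCriteria(True, True, True, True, True)
example_b = RepositoryCriteria(True, True, False, True, True)

print(classify(example_a))
print(classify(example_b), unmet_criteria(example_b))
```

Listing the unmet criteria alongside the category makes it visible why a repository was not classified as core.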

Core Repositories

Associated Repositories

Other relevant Repositories

Further repositories (currently in development):

  • nmrXiv (NMR data)
  • VibSpecDB (Raman and IR spectra)

Moreover, the consortium supports further databases and data repositories on interoperability issues and encourages them to participate in the development of NFDI4Chem standards and interfaces. Please note: you are free to pledge your dataset via the NFDI4Chem survey form. Support in publishing your research data will then be provided through a consulting service or more intensive data stewardship, and a suitable repository will be suggested.

Mapping Matrix Data-Repository

Based on the results of the NFDI4Chem consortium survey (2020) and on interviews that the TA3 team conducted with the repository leaders, a list of the most common data types and formats within this community was compiled. Table 1 shows these data types and suggests the most suitable repository for each of them, so that you can quickly select the best repository for storing your data. Selecting the most appropriate repository can have a significant impact on the findability and citability of the data as well as on the visibility of the scientist.

Table 1: Data types and formats in the NFDI4Chem community.

Data type | Data format | Suggested repository | Criteria for selection
Nuclear Magnetic Resonance | Bruker format (zip), jcamp-dx | Chemotion | Passing basic checks, curation
Nuclear Magnetic Resonance | Bruker format, JEOL format, NMReData, nmrML, ISA JSON | nmrXiv | Validations / minimum information reporting standards
Molecules and their properties, identification, reactions and experimental investigations | Mass spectrometry: jcamp-dx, mzML, mzXML (open, visualisable and processable), RAW for selected mass data types (processed and converted to jcamp-dx); IR and Raman: jcamp-dx; XRD: jcamp-dx; UV-VIS: jcamp-dx; cyclic voltammetry: jcamp-dx. *The Chemotion repository offers the option to convert data from different file formats into jcamp-dx. | Chemotion | Passing basic checks, curation
Inorganic crystal structures | Crystallographic Information File (CIF) | ICSD | Crystal structure data available
Organic and metal-organic crystal structures | Crystallographic Information File (CIF), but other supporting file formats accepted | CSD | Cell parameters (single crystal), full coordinates (powder), in CIF format
Organic, inorganic and metal-organic crystal structure data | Primarily Crystallographic Information File (CIF), but other supporting file formats accepted | Joint CCDC/FIZ Access Structures Service | At least one CIF file must be included in the submission, and structure factor data for all structures should be provided (if possible)
Simulation | 50 supported codes | NOMAD | Simulation data recognition during upload
Generic data from all disciplines of chemistry, all data that do not fit in the disciplinary repositories | Format-independent | RADAR4Chem | Validation against MD schema
Enzyme kinetics data | Currently none | STRENDA DB | None; STRENDA compliant, peer-reviewed data publication
Intermolecular and supramolecular interactions of molecular systems | JSON (DataCite), CDX* (for 2D/3D molecule structure), PNG, proprietary formats | SupraBank | Non-judgemental plausibility checks
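As a rough illustration of how Table 1 can be used, the following Python sketch transcribes the "data type" and "suggested repository" columns into a small lookup. The function name and the simple substring matching are assumptions made for this example only; they are not an NFDI4Chem service. Following the table, RADAR4Chem is used as the fallback for data that do not fit any disciplinary repository.

```python
# Data type and suggested repository, transcribed from Table 1.
TABLE_1 = [
    ("Nuclear Magnetic Resonance", "Chemotion"),
    ("Nuclear Magnetic Resonance", "nmrXiv"),
    ("Molecules and their properties, identification, reactions "
     "and experimental investigations", "Chemotion"),
    ("Inorganic crystal structures", "ICSD"),
    ("Organic and metal-organic crystal structures", "CSD"),
    ("Organic, inorganic and metal-organic crystal structure data",
     "joint CCDC/FIZ Access Structures Service"),
    ("Simulation", "NOMAD"),
    ("Enzyme kinetics data", "STRENDA DB"),
    ("Intermolecular and supramolecular interactions of molecular systems",
     "SupraBank"),
]

# Generic repository for data that fit no disciplinary repository.
FALLBACK = "RADAR4Chem"


def suggest_repositories(query):
    """Return the repositories whose data type contains the query text."""
    needle = query.strip().lower()
    if not needle:
        return [FALLBACK]
    hits = [repo for data_type, repo in TABLE_1 if needle in data_type.lower()]
    # dict.fromkeys removes duplicates while keeping the table order.
    return list(dict.fromkeys(hits)) or [FALLBACK]


print(suggest_repositories("nuclear magnetic"))
print(suggest_repositories("simulation"))
print(suggest_repositories("rheology"))
```

A plain substring match is deliberately naive; it only shows the idea that every data type maps to one or more suggested repositories, with the generic repository as the catch-all.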

Summary of the repositories

A summary with the most relevant information for each repository is reported in the following section:

Chemotion - Repository for molecules and research data

Quick stats:

  • Accepted data types: mass spectrometry: jcamp-dx, mzML, mzXML (open, visualisable and processable), RAW for selected mass data types (processed and converted to jcamp-dx), NMR: Bruker format (zip) and jcamp-dx, IR and Raman: jcamp-dx, XRD: jcamp-dx, UV-VIS: jcamp-dx, cyclic voltammetry: jcamp-dx. The Chemotion repository offers the option to convert data from different file formats into jcamp-dx.
  • Used standards/ontologies: DataCite Metadata Schema, InChI, SMILES, molfile V2000 and V3000, CHMO Ontology, RXNO Ontology
  • Access rights/licence information/embargo: CC0, CC BY, CC BY-SA; public domain; embargo possible (unlimited)
  • Recommended by Journals/Societies: in progress

The Chemotion Repository covers research data that is assigned to molecules, their properties and characterization as well as reactions and experimental investigations. It is hosted at the Karlsruhe Institute of Technology (KIT) and is used by several groups in Germany and beyond. Scientists in the domains of molecular and synthetic chemistry are supported in their efforts to handle data in a FAIR manner: the data is stored according to the common practice of assigning it to molecules and reactions, and the system provides the required Digital Object Identifiers (DOIs) without additional effort for the scientist. The metadata is supported by the implementation of ontology terms. Findability is achieved through text, structure and identifier search options, and the submitted samples and their structures are referenced in PubChem to give the scientists' work higher visibility. The repository is interoperable with the Chemotion ELN, which means that data can be transferred from the ELN to the repository. Data is curated by automatic checks and a peer-review process. The integration of data stored in the repository into publications has been shown with several examples, and its usage is currently recommended by Chemistry Methods. Authors can be referenced by Contributor iD (ORCID iD). Chemists and materials scientists can publish data for open access (data view) and registered access (dataset contribution and download). Stored data can be searched by chemical structure, author, dataset type, status, identifier, and DOI. The current AAI solution is based on an internal user administration (administrator, anonymous and registered user, curator). The DataCite metadata is compliant with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The Chemotion Repository offers an internal substance register, a spectra viewer (ChemSpectra), a structure editor (Ketcher) and its own data converter.

Please note: the Chemotion Repository accepts all data types, but only a few of them can be processed and edited in viewers.
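Since the description above mentions OAI-PMH, the following minimal Python sketch shows what a harvesting request looks like at the protocol level. The verbs and the `oai_dc` metadata prefix come from the OAI-PMH specification; the base URL is a placeholder and not the real endpoint of the Chemotion Repository or any other service, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# OAI-PMH verbs that require a metadataPrefix argument.
VERBS_NEEDING_PREFIX = {"GetRecord", "ListIdentifiers", "ListRecords"}


def build_oai_pmh_url(base_url, verb, metadata_prefix="oai_dc", identifier=None):
    """Build the query URL for a basic OAI-PMH request."""
    params = {"verb": verb}
    if verb in VERBS_NEEDING_PREFIX:
        params["metadataPrefix"] = metadata_prefix
    if verb == "GetRecord":
        if identifier is None:
            raise ValueError("GetRecord requires an identifier")
        params["identifier"] = identifier
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint (assumption); look up the real base URL in the
# documentation of the repository you want to harvest.
BASE_URL = "https://repository.example.org/oai"

print(build_oai_pmh_url(BASE_URL, "Identify"))
print(build_oai_pmh_url(BASE_URL, "ListRecords"))
```

The sketch only builds the request URL; sending it and parsing the XML response is left out, because the available metadata formats differ between repositories.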

MassBank EU - High Quality Mass Spectral Database

Quick stats:

MassBank EU is the first public repository of mass spectral data for sharing within the scientific research community. Its target user groups in the domains of chemistry and life sciences are analytical chemists, metabolomics researchers, biochemists, and bioinformaticians. Its data sets from community users and projects are openly accessible and represent the official database of the Mass Spectrometry Society of Japan. GitHub serves as the current AAI environment (open read access, limited write access) and GitHub issues as the curation tracking system. The curation itself is performed by the MassBank record validator. The data sets can be searched by compound information, mass spectrometry information and peaks. MassBank Record ID (Accession) and USI (Universal Spectrum Identifier) are used as persistent identifier systems. The MassBank EU spectral data is hosted in a revision control system, with all spectral data and the corresponding metadata in a human-readable record format and continuous integration (CI) checking record integrity for each change. Instances of the web interface are hosted at UFZ and the Leibniz Institute of Plant Biochemistry (IPB) (Halle), but can be installed locally as well. MassBank EU offers interfaces for data import via Git (MassBank record format) and for data export (JSON-LD) as well as a REST API. In addition, RMassBank is provided as a separate data processing and analysis tool.

RADAR4Chem - Research Data Repository

Quick stats:

  • Accepted data types: All data types/formats (format recommendations exist)
  • Used standards/ontologies: RADAR Metadata Schema (based on DataCite Metadata Schema 4.0), Dublin Core, schema.org
  • Access rights/licence information/embargo: Terms and conditions for both data providers and data users, mandatory licences for datasets (e.g. Creative Commons), embargo period (1-12 months)
  • Recommended by Journals/Societies: not yet

RADAR4Chem is a generic repository for the publication of research data from all disciplines of chemistry. It was created in 2022 and is hosted at FIZ Karlsruhe - Leibniz Institute for Information Infrastructure. RADAR4Chem is based on the established research data repository RADAR Cloud. RADAR Cloud is primarily used by academic institutions for institutional research data management (data archiving and publication). The use of RADAR Cloud is subject to a fee and requires a contract. RADAR4Chem, on the other hand, is directed exclusively at researchers in the field of chemistry at publicly funded research institutions and universities in Germany. No contract is required and no fees are charged. RADAR4Chem allows discipline- and format-independent publication and storage (for at least 25 years) of research data from all disciplines of chemistry. It serves as a catch-all repository that complements the existing portfolio of discipline-specific repositories and is ideally suited, for example, for cross-disciplinary data or datasets with a multitude of different data formats. RADAR4Chem is easy to use and has a low entry threshold. The researchers are responsible for the upload, organisation, annotation and curation of research data as well as the peer-review process (as an optional step) and finally their publication. Using the service requires advice from FIZ Karlsruhe, registration with RADAR4Chem and consent to the RADAR4Chem licence and usage instructions. Authentication is supported after self-registration and via DFN-AAI (Shibboleth). Metadata are recorded using the internal RADAR Metadata Schema (based on DataCite Metadata Schema 4.0), which supports 10 mandatory and 13 optional metadata fields. Annotations can be made at the dataset level and at the level of individual files and folders. A licence that governs re-use of the data must be defined for each dataset. Each published dataset receives a DOI, which is registered with DataCite.
RADAR Metadata uses a combination of controlled lists and free text entries. Author identification is ensured by using ORCID iD and funder identification by the CrossRef Open Funder Registry (more interfacing options will be implemented in the future). Datasets can easily be linked with other digital resources (e.g. text publications) via a “related identifier”. To maximise data dissemination and discoverability, the metadata of published datasets are indexed in various formats (e.g. RADAR and DataCite) and offered for public metadata harvesting, e.g. via an OAI provider. The research data is stored permanently on magnetic tape in three redundant copies at different locations: two copies at the Steinbuch Centre for Computing (SCC) of the Karlsruhe Institute of Technology (KIT) and one copy at the Centre for Information Services and High Performance Computing (ZIH) of the TU Dresden.

Please note: currently, the free-of-charge use of RADAR4Chem is limited to a maximum of 10 GB storage volume per research project. Researchers from the NFDI4Chem community whose research data volume exceeds this free quota and who are interested in using RADAR functions institution-wide or in archiving research data can enter into a regular RADAR Cloud contract.

RADAR aims to ensure access to and long-term availability of archived and published datasets according to the FAIR criteria. Therefore, RADAR is intended as a generic infrastructure component in several NFDI consortia (e.g. NFDI4Culture in addition to NFDI4Chem). For interoperability purposes, it takes into account data types that are recommended by NFDI and is adding support for discipline-specific metadata step by step.

STRENDA DB - Standards for Reporting Enzymology Data

Quick stats:

STRENDA DB is a repository for enzymology data that has been in operation since 2016 and is hosted at the Beilstein Institute (BI) in Frankfurt. It ensures that data sets are complete and valid before scientists submit them as part of a publication. Its target audience is biochemists, systems biologists and biocatalysis researchers in the life sciences as well as biological, molecular and food chemists. The typical data contained in this repository consists of functional enzymology data (kinetic and experimental data) from manuscripts and publications. Data entered in STRENDA DB are automatically checked (according to the STRENDA Guidelines and a PDF fact sheet with submittable input data), and users are notified of necessary but missing information. Currently, more than 55 international biochemistry journals include the STRENDA Guidelines in their instructions for authors. DOI is used as the identification system for citations and ORCID iD as the identification system for authors. Data viewing is possible via open access, and data contribution is possible after registration; the current AAI is provided through an internal user administration (user, administrator).

SupraBank

Quick stats:

  • Accepted data types: JSON (DataCite), CDX (for 2D/3D molecule structure), PNG, proprietary formats
  • Used standards/ontologies: DataCite 4.0, Dublin Core for metadata tags
  • Access rights/licence information/embargo: CC licences (CC0, BY, BY-SA), embargo possible (unlimited)
  • Recommended by Journals/Societies: not yet

SupraBank has been hosted at KIT (Karlsruhe) since 2019 and is a curated database that provides project data on intermolecular interactions of molecular systems and on supramolecular interactions that are not available in other repositories or databases. SupraBank is mainly aimed at supramolecular and physical chemists or biologists in the domain of organic chemistry who deal with binding, assembly, and interaction phenomena. Molecular properties are retrieved from PubChem, allowing intermolecular interaction parameters to be correlated with the molecular properties of the interacting components. All molecules, solvents, and additives are searchable by their chemical identifiers. At present, SupraBank stores more than 3500 curated data sets of intermolecular interaction parameters. The data is openly accessible for viewing, while data download and contribution require registration. It can be searched for experiments and related components, molecule interactions, and publications, and it is curated by non-judgemental plausibility checks. The current implementation of AAI consists of an internal user administration (anonymous and non-anonymous user, data provider, administrator). DOI is used as the identification system for citations and ORCID iD as the identification system for authors. Its web interface offers file format compatibility with CSV, JSON, BibTeX, RIS, and EndNote, and its tool suite contains molecule representations as pictures, a structure editor and a simulation modeller tool.

CSD, ICSD and joint CCDC/FIZ Access Structures Service

Quick stats:

ICSD quick stats:

  • Accepted data types: CIF
  • Used standards/ontologies: none
  • Access rights/licence information/embargo: usage terms, no embargo
  • Recommended by Journals/Societies: List of the 80 most important journals covered by ICSD

CSD quick stats:

  • Accepted data types: primarily CIF but other supporting file formats accepted
  • Used standards/ontologies: CIF, DataCite
  • Access rights/licence information/embargo: usage terms, data embargoed until an associated article is published or researcher triggers publication
  • Recommended by Journals/Societies: IUCr, Royal Society of Chemistry, American Chemical Society, Wiley, Elsevier, Springer Nature, Taylor & Francis, Hindawi, Chemical Society of Japan

Joint CCDC/FIZ Access Structures Service quick stats:

  • Accepted data types: primarily CIF but other supporting file formats accepted.
  • Used standards/ontologies: CIF, DataCite
  • Access rights/licence information/embargo: usage terms, data embargoed until an associated article is published or researcher triggers publication
  • Recommended by Journals/Societies: IUCr, Royal Society of Chemistry, American Chemical Society, Wiley, Elsevier, Springer Nature, Taylor & Francis, Hindawi, Chemical Society of Japan

The Inorganic Crystal Structure Database (ICSD) provided by FIZ Karlsruhe is the world's largest database of completely identified inorganic crystal structures. It contains over 260,000 datasets. Complementing it, the CCDC provides the Cambridge Structural Database (CSD), a certified trusted database of fully curated and enhanced organic and metal-organic crystal structures. First established over fifty years ago, it now contains over one million entries. ICSD and CSD support scientists in the fields of crystallography, chemistry, materials science, physics and structural biology. The joint CCDC/FIZ Access Structures Service, launched in 2018, serves to deposit, register and preserve structure data of inorganic crystalline compounds at no charge. Crystal structures mentioned in scientific publications are stored in the crystal structure depot. Upon deposition, each data set is assigned a Digital Object Identifier (DOI) so that the crystal structure is unambiguously identified and registered. The DOI enables third parties to cite and reference data according to the rules of good scientific practice. Both institutions are world-leading experts in structural data, and their databases combined contain every published organic, metal-organic and inorganic crystal structure and are essential resources for the structural chemistry community. Each dataset has to pass rigorous quality checks with manual curation performed by scientific experts. These rich, high-quality data resources, alongside advanced software provided by the two institutions, enable scientists from industry and academia to extract new insights from the data and discover novel scientific trends. Researchers in structural chemistry rely on the data, which is also relevant to industry and is used to teach chemistry concepts to new generations of scientists. Collectively, the licensed databases are installed in over 1,300 institutions worldwide. Prior to data deposition, ICSD requires registration whereas CSD is open.
The joint CCDC/FIZ Access Structures Service is open for depositing data, and all datasets are freely available on an individual basis. Researchers can register, but registration is not required to deposit or retrieve data. DOIs are linked to the related publications in both cases. CIF (Crystallographic Information Framework) is both the metadata standard and the accepted data type. The joint CCDC/FIZ Access Structures Service and CSD are CoreTrustSeal certified.

NOMAD

Quick stats:

  • Accepted data types: 50 supported codes
  • Used standards/ontologies: DataCite; no ontology at the moment (planned to create ontologies for specific parts of the data)
  • Access rights/licence information/embargo: CC BY 4.0; Embargoes definable up to 3 years
  • Recommended by Journals/Societies: The repository is recommended by Scientific Data

NOMAD is a public repository that has been in operation since 2015 and is hosted at the MPCDF (Max Planck Computing and Data Facility) in Garching near Munich. It contains over 12 million data sets (simulations) contributed by more than 500 users, 98% of which are published. DOIs have been assigned to almost half of them. Its target audience is solid state physicists and theoretical chemists. The repository typically contains data on computational materials science, computational chemistry and molecular physics. Simulation data is recognised during upload, and the repository supports the input/output formats of 50 different codes. NOMAD offers an API for data import and export. The repository is recommended by Scientific Data of the Nature publishing group.