How to Choose the Right Repository

Introduction to our recommendations

The consortium for chemistry (NFDI4Chem) within the NFDI aims to support researchers in collecting, storing, processing, analysing, publishing, and reusing research data by establishing a federation of repositories. To identify the core repositories that are part of the federation and have potential for development, the TA3 repository team has selected the following criteria:

  • The repository is suitable for the deposition of molecule-related data
  • The repository contains reusable data or functionality such as viewers, editors, or analysis tools that fulfil the needs of the NFDI4Chem community
  • The repository software is open source
  • The operators of the repositories have declared their willingness to adapt their services to the standards developed by NFDI4Chem as well as the FAIR data principles
  • The repository operators can be funded in accordance with the funding guidelines of the NFDI, i.e. the main operator is based in Germany and is a non-profit organisation

The repositories that fulfil the above-mentioned criteria are reported in the following list as core repositories. Cases that do not meet all criteria are included as associated or relevant repositories, supplementing the following list of recommended and trusted, chemistry-friendly repositories.
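
The classification rule described above can be sketched in a few lines of Python. This is only an illustration of the logic (all five criteria met means "core", anything else means "associated or relevant"); the class and field names are hypothetical and are not part of any NFDI4Chem tooling.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CriteriaCheck:
    """One flag per selection criterion listed above (field names are illustrative)."""
    suits_molecule_data: bool
    offers_reusable_data_or_tools: bool
    software_is_open_source: bool
    operators_commit_to_standards_and_fair: bool
    fundable_under_nfdi_guidelines: bool


def classify(check: CriteriaCheck) -> str:
    """Return 'core' only when every criterion holds."""
    all_met = all(getattr(check, f.name) for f in fields(check))
    return "core" if all_met else "associated or relevant"


# Hypothetical repository that meets every criterion except funding eligibility.
example = CriteriaCheck(True, True, True, True, False)
print(classify(example))
```

A single unmet criterion is enough to move a repository out of the core list, which mirrors the wording "cases that do not meet all criteria".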

Notice:

Field-specific repositories should be the first choice as these repositories enhance the FAIRness of data on behalf of the submitters. To retain the same level of FAIRness, data publishing in generic repositories requires manual FAIRification.

Core repositories

  • Chemotion Repository
    Field-specific sample and reaction-centric repository including analysis data such as NMR, IR, and mass data.
  • MassBank EU
    Field-specific ecosystem of databases and tools for mass spectrometry reference spectra.
  • nmrXiv
    A field-specific repository for NMR data.
  • RADAR4Chem
    Generic, multi-domain repository that offers a free and reliable home for all chemical research data that do not fulfil the specifications of field-specific repositories.
  • STRENDA DB
    Field-specific repository for enzymology data, which incorporates the STRENDA Guidelines for reporting enzymology data.
  • SupraBank
    Field-specific repository for intermolecular interactions data.

Associated repositories

Other relevant repositories

  • NOMAD
    Field-specific repository for materials science data.
  • ioChem-BD
    Field-specific repository for computational chemistry.
  • RADAR
    Generic, multi-disciplinary research data repository.
  • Zenodo
    Generic repository developed under the European OpenAIRE program, operated by CERN.
  • EUDAT B2SHARE
    Generic repository operated by a pan-European network of more than 25 research organisations and data and computing centres.

Further repositories (currently in development):

  • VibSpecDB (IR, Raman, UV/VIS, and luminescence data)

Moreover, the consortium supports further databases and data repositories on interoperability issues and encourages them to participate in the development of NFDI4Chem standards and interfaces.

Notice:

These lists will be continuously updated with further recommendations on trusted and chemistry-friendly repositories. We recently published an analysis of the landscape of repositories for chemistry in re3data and criteria for chemistry repositories. We will continue to update the list of repositories and this page as we go.

Further details of the repositories listed above are provided in the following sections.

Mapping matrix data-repository

Based on the results generated by a survey carried out by the NFDI4Chem consortium in 2020 and the yearly interviews with the repository leaders performed by NFDI4Chem's TA3 repository team, a list of the most common data types and formats within this community was collected and is reported in Table 1. This table displays which data types are most commonly collected within the chemistry community and suggests which repository is most suitable for storing your specific data type. Table 1 is also visualised in the following figure:

The recommendations provided in Table 1 will guide you to efficiently and quickly select the best repository for the storage of your data. Selection of the most appropriate database can have a significant impact on the findability and citability of the data as well as the visibility of the scientist.

Table 1: Data types and data formats in the NFDI4Chem community.

Data type | Data format | Suggested repository | Criteria for selection
--- | --- | --- | ---
Nuclear Magnetic Resonance | Bruker XWIN-NMR format (zip), JCAMP-DX | Chemotion | Passing basic checks, curation
Nuclear Magnetic Resonance | Bruker XWIN-NMR format, JEOL format, NMReData, nmrML, ISA JSON | nmrXiv | Validations / minimum information reporting standards
Molecules and their properties, identification, reactions and experimental investigations | Mass spectrometry: JCAMP-DX, mzML, mzXML (open, visualisable and processable), RAW for selected mass data types (processed and converted in JCAMP-DX); IR and Raman: JCAMP-DX; XRD: JCAMP-DX; UV/VIS: JCAMP-DX; cyclic voltammetry: JCAMP-DX. *The Chemotion Repository offers the option to convert data from different file formats into JCAMP-DX. | Chemotion | Passing basic checks, curation
Inorganic crystal structures | Crystallographic Information File (CIF) | ICSD | Crystal structure data available
Organic and metal-organic crystal structures | Crystallographic Information File (CIF), but other supporting file formats accepted | CSD | Cell parameters (single crystal), full coordinates (powder), in CIF format
Organic, inorganic and metal-organic crystal structure data | Primarily Crystallographic Information File (CIF), but other supporting file formats accepted | Joint CCDC/FIZ Access Structures Service | At least one CIF file must be included in the submission, and structure factor data for all structures should be provided (if possible)
Simulation | 50 supported codes | NOMAD | Simulation data recognition during upload
Generic data from all disciplines of chemistry, all data that do not fit in the disciplinary repositories | Format-independent | RADAR4Chem | Validation against metadata schema
Enzyme kinetics data | Currently none | STRENDA DB | None; STRENDA compliant, peer-reviewed data publishing
Intermolecular and supramolecular interactions of molecular systems | JSON (DataCite), CDX* (for 2D/3D molecule structure), PNG, proprietary formats | SupraBank | Non-judgemental plausibility
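
The recommendations in Table 1 amount to a lookup from data type to suggested repository, with RADAR4Chem as the fallback for data that do not fit a disciplinary repository. The following Python sketch shows that idea; the dictionary keys are shortened versions of the data types in Table 1, and the function name is an illustrative assumption rather than an NFDI4Chem tool.

```python
from typing import Dict, List

# Suggested repositories per data type, transcribed (with shortened keys) from Table 1.
SUGGESTED_REPOSITORIES: Dict[str, List[str]] = {
    "nuclear magnetic resonance": ["Chemotion", "nmrXiv"],
    "inorganic crystal structures": ["ICSD"],
    "organic and metal-organic crystal structures": ["CSD"],
    "simulation": ["NOMAD"],
    "enzyme kinetics data": ["STRENDA DB"],
    "intermolecular and supramolecular interactions": ["SupraBank"],
}

# Table 1 lists RADAR4Chem for data that do not fit the disciplinary repositories.
FALLBACK_REPOSITORY = "RADAR4Chem"


def suggest_repositories(data_type: str) -> List[str]:
    """Return the suggested repositories for a data type, or the generic fallback."""
    key = data_type.strip().lower()
    return SUGGESTED_REPOSITORIES.get(key, [FALLBACK_REPOSITORY])


print(suggest_repositories("Nuclear Magnetic Resonance"))
print(suggest_repositories("Rheology measurements"))
```

Where Table 1 names more than one repository for a data type, the accepted formats and selection criteria in the table decide between them.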

Brief description of repositories

A summary from our survey and interviews with repository operators, including the most relevant information for each repository, is reported in the following section.

Chemotion Repository - repository for molecules and research data

Quick stats:

  • Accepted data types: mass spectrometry: JCAMP-DX, mzML, mzXML (open, visualisable and processable), RAW for selected mass data types (processed and converted in JCAMP-DX); NMR: Bruker XWIN-NMR format (zip) and JCAMP-DX; IR and Raman: JCAMP-DX; XRD: JCAMP-DX; UV/VIS: JCAMP-DX; cyclic voltammetry: JCAMP-DX. The Chemotion Repository offers the option to convert data from different file formats into JCAMP-DX.
  • Used standards/ontologies: DataCite Metadata Schema, InChI, SMILES, molfile V2000 and V3000, CHMO Ontology, RXNO Ontology
  • Access rights/licence information/embargo: CC0, CC BY, CC BY-SA, public domain; embargo period possible (unlimited)
  • Recommended by Journals/Societies: Recommended by Angewandte Chemie and further Wiley journals.

The Chemotion Repository is a field-specific repository that covers research data assigned to molecules, their properties and characterisation, as well as reactions and experimental investigations. It is hosted at the Karlsruhe Institute of Technology (KIT) and is used by several groups in Germany and beyond. Scientists in molecular and synthetic chemistry are supported in their efforts to handle data in a FAIR manner: following the common practices of scientists, the data is stored assigned to molecules and reactions, and the system provides the required Digital Object Identifiers (DOIs) without additional effort for the scientist. The given metadata is supported by the implementation of ontology terms. Findability is achieved through text, structure, and identifier search options, and the submitted samples and their structures are referenced in PubChem to increase the visibility of the scientists' work. The repository is interoperable with the Chemotion ELN, which means that data can be transferred from the ELN to the repository. Data is curated by automatic checks and a peer-review process. The integration of data stored in the repository into publications has been shown with several examples, and its usage is currently recommended by Chemistry—Methods. Authors can be referenced by their ORCID iDs. Chemists and materials scientists can publish data for open access (data view) and registered access (dataset contribution and download). Stored data can be searched by chemical structure, author, dataset type, status, identifier, and DOI. The current AAI solution is based on an internal user administration (administrator, anonymous and registered user, curator). Metadata according to DataCite is compliant with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) scheme.
The Chemotion Repository offers an internal substance register, spectra viewers (ChemSpectra and NMRium), a structure editor (Ketcher), and its own data converter.
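
For readers unfamiliar with OAI-PMH, the Python sketch below shows how a harvesting request is typically formed. The `verb` and `metadataPrefix` parameters and the `oai_dc` prefix come from the OAI-PMH specification; the base URL is a placeholder and not the actual Chemotion endpoint, which should be taken from the repository's own documentation.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): replace with the OAI-PMH base URL
# documented by the repository you want to harvest.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for the given metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


print(list_records_url(BASE_URL))
```

The sketch only builds the URL and does not send a request. A harvester would fetch it, parse the XML response, and follow any resumption token the server returns.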

Please note: the Chemotion Repository accepts all data types, but only a few of them can be processed and edited in viewers.

MassBank EU - high quality mass spectral reference database

MassBank EU is a field-specific repository and the first public repository of mass spectral reference data shared among the scientific research community. Its target user groups in the domains of chemistry and life sciences are analytical chemists, metabolomics researchers, biochemists, and bioinformaticians. Datasets from community users and projects are openly accessible and represent the official database of the Mass Spectrometry Society of Japan. GitHub is used as the current AAI environment (open read access, limited write access), and GitHub issues serve as the curation tracking system. The curation itself is performed by the MassBank record validator. The datasets can be searched for compound and mass spectrometry information and peaks. MassBank Record ID (Accession) and USI (Universal Spectrum Identifier) are used as persistent identifier systems. The MassBank EU spectral data is hosted in a revision control system, with all spectral data and the corresponding metadata in a human-readable record format and continuous integration (CI) checking record integrity for each change. Instances of the web interface are hosted at the Helmholtz Centre for Environmental Research (UFZ) (Leipzig) and the Leibniz Institute of Plant Biochemistry (IPB) (Halle), and the web interface can also be installed locally. MassBank EU offers interfaces for data import via Git (MassBank record format) and for data export (JSON-LD), as well as a REST API. In addition, RMassBank is provided as a separate data processing and analysis tool.

RADAR4Chem - research data repository

Quick stats:

  • Accepted data types: All data types/formats (format recommendations exist)
  • Used standards/ontologies: RADAR Metadata Schema (based on DataCite Metadata Schema 4.0), Dublin Core, schema.org
  • Access rights/licence information/embargo: Terms and conditions for both data providers and data users, mandatory licences for datasets (e.g. Creative Commons), embargo period (1-12 months)
  • Recommended by Journals/Societies: Recommended by Angewandte Chemie and further Wiley journals.

RADAR4Chem is a generic repository for the publication of research data from all disciplines of chemistry. It was created in 2022 and is hosted at FIZ Karlsruhe - Leibniz Institute for Information Infrastructure. RADAR4Chem is based on the established research data repository RADAR Cloud. RADAR Cloud is primarily used by academic institutions for institutional research data management (data archiving and publication). The use of RADAR Cloud is subject to a fee and requires a contract. RADAR4Chem, on the other hand, is exclusively directed at researchers in the field of chemistry at publicly funded research institutions and universities in Germany. No contract is required and no fees are charged. RADAR4Chem allows discipline- and format-independent publication and storage (at least 25 years) of research data from all disciplines of chemistry. It serves as a catch-all repository that complements the existing portfolio of discipline-specific repositories and is ideally suited, for example, for cross-disciplinary data or datasets with a multitude of different data formats. RADAR4Chem is easy to use and has a low entry threshold. The researchers are responsible for the upload, organisation, annotation, and curation of research data as well as the peer-review process (as an optional step) and finally their publication. Using the service requires consultation with FIZ Karlsruhe, registration with RADAR4Chem, and consent to the RADAR4Chem licence and usage instructions. Authentication is supported after self-registration and via DFN-AAI (Shibboleth). Metadata are recorded using the internal RADAR Metadata Schema (based on DataCite Metadata Schema 4.0), which supports 10 mandatory and 13 optional metadata fields. Annotations can be made at the dataset level and at the level of individual files and folders. A user licence, which indicates re-use rules for the data, must be defined for each dataset. Each published dataset receives a DOI, which is registered with DataCite.
RADAR Metadata uses a combination of controlled lists and free text entries. Author identification is ensured by ORCID iD and funder identification by CrossRef Open Funder Registry (more interfacing options will be implemented in the future). Datasets can be easily linked with other digital resources (e.g. text publications) via a “related identifier”. To maximise data dissemination and discoverability, the metadata of published datasets are indexed in various formats (e.g. RADAR and DataCite) and offered for public metadata harvesting e.g. via an OAI-provider. The research data is stored permanently on magnetic tapes redundantly in three copies at different locations at the Steinbuch Centre for Computing (SCC) of the Karlsruhe Institute of Technology (KIT, 2 copies) and at the Centre for Information Services and High Performance Computing (ZIH) of the TU Dresden (1 copy).

Please note: currently, the free-of-charge use of RADAR4Chem is limited to a maximum of 10 GB storage volume per research project. Researchers from the NFDI4Chem community whose research data volume exceeds this free quota and who are interested in using RADAR functions institution-wide or in archiving research data can conclude a regular RADAR Cloud contract.
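
Before planning a deposit, it can help to check whether a project's files fit within the free quota mentioned above. The following Python sketch sums file sizes and compares them with 10 GB; it assumes decimal gigabytes (10^9 bytes), which the text does not specify, and the function name is purely illustrative.

```python
from typing import Iterable

# Free quota per research project as stated above.
# Assumption: 1 GB is taken as 10**9 bytes; the source does not say
# whether decimal or binary gigabytes are meant.
FREE_QUOTA_BYTES = 10 * 10**9


def fits_free_quota(file_sizes_bytes: Iterable[int]) -> bool:
    """Return True if the combined file size stays within the free quota."""
    return sum(file_sizes_bytes) <= FREE_QUOTA_BYTES


# Hypothetical project with three files of 4 GB, 3 GB, and 2.5 GB.
project_files = [4 * 10**9, 3 * 10**9, 2_500_000_000]
print(fits_free_quota(project_files))
```

A project that exceeds the quota would fall under the regular RADAR Cloud contract route described in the note.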

RADAR aims to ensure access to and long-term availability of archived and published datasets according to the FAIR criteria. Therefore, RADAR is intended as a generic infrastructure component in several NFDI consortia (e.g. NFDI4Culture next to NFDI4Chem). For interoperability purposes, it takes into account data types that are recommended by NFDI and supports discipline-specific metadata step-by-step.

STRENDA DB - repository for reporting enzymology data

STRENDA DB is a field-specific repository for enzymology data operated since 2016 and hosted at Beilstein Institute (BI) Frankfurt. It ensures that datasets are complete and valid before scientists submit them as part of a publication. Their target audience is biochemists, systems biologists, biocatalysts in the fields of life sciences, biological, molecular, and food chemists. The typical data contained in this repository consists of functional enzymology data (kinetic and experimental data) from manuscripts and publications. Data entered in the STRENDA DB are automatically checked (according to STRENDA Guidelines and a PDF fact sheet with submittable input data), allowing users to receive notifications for necessary but missing information. Currently, more than 55 international biochemistry journals already include the STRENDA guidelines in their instructions for authors. DOI is used as the identification system for citations and ORCID iD as the identification system for authors. Data viewing is possible via open access and data contribution is possible after a required registration where the current AAI is provided through an internal user administration (user, administrator).

SupraBank

Quick stats:

  • Accepted data types: JSON (DataCite), CDX (for 2D/3D molecule structure), PNG, proprietary formats
  • Used standards/ontologies: DataCite 4.0, Dublin Core for metadata tags
  • Access rights/licence information/embargo: CC licences (CC0, BY, BY-SA), embargo period possible (unlimited)
  • Recommended by Journals/Societies: Recommended by Angewandte Chemie and further Wiley journals.

SupraBank has been hosted at KIT (Karlsruhe) since 2019 and is a curated database that provides project data on intermolecular interactions of molecular systems and supramolecular interactions which are not available in other repositories or databases. SupraBank is mainly aimed at supramolecular and physical chemists or biologists in the domain of organic chemistry who deal with binding, assembly, and interaction phenomena. Molecular properties are retrieved from PubChem, allowing intermolecular interaction parameters to be correlated with the molecular properties of the interacting components. All molecules, solvents, and additives are searchable by their chemical identifiers. At present, SupraBank stores more than 3500 curated datasets of intermolecular interaction parameters. The data is openly accessible for viewing, while data download and contribution require registration. It can be searched for experiments and related components, molecule interactions, and publications, and is curated by non-judgemental plausibility checks. The current implementation of AAI consists of internal user administration (anonymous and non-anonymous user, data provider, administrator). DOI is used as the identification system for citations and ORCID iD as the identification system for authors. Its web interface offers file format compatibility with CSV, JSON, BibTeX, RIS, and EndNote, and its tool suite contains molecule representations as pictures, a structure editor, and a simulation modeller tool.

CSD, ICSD - Joint CCDC/FIZ access structures service

Quick stats:

ICSD quick stats:

  • Accepted data types: CIF
  • Used standards/ontologies: none
  • Access rights/licence information/embargo: usage terms, no embargo period
  • Recommended by Journals/Societies: List of the 80 most important journals covered by ICSD

CSD quick stats:

  • Accepted data types: primarily CIF but other supporting file formats accepted
  • Used standards/ontologies: CIF, DataCite
  • Access rights/licence information/embargo: usage terms, data embargoed until an associated article is published or researcher triggers publication
  • Recommended by Journals/Societies: IUCr, Royal Society of Chemistry, American Chemical Society, Wiley, Elsevier, Springer Nature, Taylor & Francis, Hindawi, Chemical Society of Japan

Joint CCDC/FIZ Access Structures Service quick stats:

  • Accepted data types: primarily CIF but other supporting file formats accepted.
  • Used standards/ontologies: CIF, DataCite
  • Access rights/licence information/embargo: usage terms, data embargoed until an associated article is published or researcher triggers publication
  • Recommended by Journals/Societies: IUCr, Royal Society of Chemistry, American Chemical Society, Wiley, Elsevier, Springer Nature, Taylor & Francis, Hindawi, Chemical Society of Japan

The Inorganic Crystal Structure Database (ICSD) provided by FIZ Karlsruhe is the world's largest database of completely identified inorganic crystal structures. It contains over 260,000 datasets. Complementing it, the CCDC provides the Cambridge Structural Database (CSD), a certified trusted database of fully curated and enhanced organic and metal-organic crystal structures. First established over fifty years ago, it now contains over one million entries. ICSD and CSD support scientists in the fields of crystallography, chemistry, materials science, physics, and structural biology. The joint CCDC/FIZ Access Structures Service, launched in 2018, serves to deposit, register, and preserve structure data of inorganic crystalline compounds at no charge. Crystal structures mentioned in scientific publications are stored in the crystal structure depot. Upon deposition, each dataset is assigned a Digital Object Identifier (DOI) so that the crystal structure is unambiguously identified and registered. The DOI enables third parties to cite and reference data according to the rules of good scientific practice.

Both institutions are world-leading experts in structural data. Their combined databases contain every published organic, metal-organic, and inorganic crystal structure and are essential resources for the structural chemistry community. Each dataset has to pass rigorous quality checks with manual curation performed by scientific experts. These rich, high-quality data resources, alongside advanced software provided by the two institutions, enable scientists from industry and academia to extract new insights from the data and discover novel scientific trends. Researchers in structural chemistry rely on the data, which are also relevant to industry and are used to teach chemistry concepts to new generations of scientists. Collectively, the licensed databases are installed in over 1,300 institutions worldwide.

Prior to data deposition, ICSD requires registration, whereas CSD is open. The joint CCDC/FIZ Access Structures Service is open for depositing data and all datasets are freely available on an individual basis. Moreover, researchers can register, but registration is not required to deposit or retrieve data. DOIs are linked to the related publications in both cases. CIF (Crystallographic Information Framework) serves as the metadata standard and the accepted data type. The joint CCDC/FIZ Access Structures Service and CSD are CoreTrustSeal certified.

NOMAD

Quick stats:

  • Accepted data types: 50 supported codes
  • Used standards/ontologies: DataCite; no ontology at the moment (planned to create ontologies for specific parts of the data)
  • Access rights/licence information/embargo: CC BY 4.0; embargo period definable up to 3 years
  • Recommended by Journals/Societies: Recommended by Scientific Data, Angewandte Chemie, and further Wiley journals.

NOMAD is a public repository that has been operating since 2015 and is hosted at the MPCDF (Max Planck Computing and Data Facility) in Garching, Munich. It contains over 12 million datasets (simulations) generated by over 500 users, 98% of which are published. DOIs are attributed to almost half of them. Its target audience is solid state physicists and theoretical chemists. The repository typically contains data on computational materials science, computational chemistry, and molecular physics. There is a simulation data recognition process during upload, and the repository supports the input/output formats of 50 different codes. NOMAD offers an API for data import and export. The repository is recommended by Scientific Data of the Nature publishing group.

This page is licensed under the Creative Commons CC0 1.0 Universal Public Domain Dedication.


Main author: ORCID:0000-0001-7696-7662, ORCID:0000-0003-4480-8661 and ORCID:0000-0002-5035-7978