Machine-Readable Chemical Structures
Finding relevant articles based on IUPAC names or trivial names of molecules is challenging, whereas chemical identifiers allow for unambiguous identification of compounds. Redrawing chemical structures is labour- and time-intensive, whereas chemical table (CT) files or SMILES structure codes can be opened without any additional effort in any common structure drawing software.
Providing machine-readable chemical structures (CT files such as mol files, InChI identifiers and SMILES structure codes) as part of the dataset associated with a research article enhances the article's findability by making it easily indexable and structure-searchable. This also improves interoperability and facilitates reuse of scientific work.
The following tutorial shows how to increase the machine-readability of chemistry research articles by providing machine-readable chemical structures as data files within the associated dataset. It also gives recommendations for providing structure codes and identifiers in a machine-readable supplementary table within the associated dataset.
Structured, field-specific repositories such as Chemotion Repository do not require this information from the author, since the repository software provides it. Datasets in generic repositories, however, benefit from it.
Please note that SMILES and InChI currently have limitations for inorganic and especially organometallic compounds. Please also note that InChI is an identifier, while SMILES is a structure code. A SMILES string can be converted to a chemical structure drawing and back to SMILES, but the result is not necessarily the same string as the one initially provided; hence, SMILES is not an identifier. InChI, on the other hand, is an identifier, but it is not designed to be used to regenerate the correct chemical structure drawing.
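The difference between a structure code and an identifier can be illustrated with a minimal sketch. The three SMILES strings below are all valid ways of writing ethanol, so a plain string comparison treats them as different, whereas the standard InChIKey is the same for all of them. The record layout and the function name are hypothetical, and the InChIKey is assumed to have been exported from a structure editor beforehand.

```python
from collections import defaultdict

# Hypothetical records: (SMILES as exported, standard InChIKey as exported).
# All three SMILES strings describe ethanol.
ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
records = [
    ("CCO", ETHANOL_KEY),
    ("OCC", ETHANOL_KEY),
    ("C(C)O", ETHANOL_KEY),
]


def group_by_inchikey(rows):
    """Collect all SMILES spellings that share one InChIKey."""
    groups = defaultdict(list)
    for smiles, inchikey in rows:
        groups[inchikey].append(smiles)
    return dict(groups)


# Comparing SMILES as plain strings sees three different entries ...
distinct_smiles = {smiles for smiles, _ in records}
# ... while grouping by InChIKey recognises a single compound.
groups = group_by_inchikey(records)

print(len(distinct_smiles), "distinct SMILES strings")
print(len(groups), "compound(s) by InChIKey")
```

This is why the InChIKey, rather than the SMILES string, is the safer column to use when matching or deduplicating compounds across files.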
Get mol files
All common structure drawing programs can save mol files. Copy the structure to a new document within your preferred structure drawing software. Then, choose File -> Save As -> choose MDL Molfile from the dropdown menu -> Save.
The file name may follow your lab journal entry as well as the numbering of the structure in the article.
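Before adding a saved mol file to a dataset, it can be useful to confirm that the file is complete. The following is a minimal sketch that assumes the common V2000 molfile layout (three header lines, a counts line, an atom block, a bond block and a closing "M  END" line). The function name and the example content are hypothetical, and the check does not validate the chemistry.

```python
# Hypothetical example content of a V2000 mol file (ethanol, heavy atoms only).
EXAMPLE_MOLFILE = "\n".join([
    "ethanol",                 # header line 1: title
    "  hypothetical-example",  # header line 2: program and timestamp
    "",                        # header line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])


def check_molfile(text):
    """Return (atom count, bond count, element symbols) or raise ValueError."""
    lines = text.splitlines()
    if len(lines) < 5:
        raise ValueError("file is too short to be a mol file")
    counts = lines[3]
    if not counts.rstrip().endswith("V2000"):
        raise ValueError("this sketch only handles V2000 mol files")
    # The counts line is fixed-width: atoms in columns 1-3, bonds in 4-6.
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    atom_lines = lines[4:4 + n_atoms]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    if len(atom_lines) != n_atoms or len(bond_lines) != n_bonds:
        raise ValueError("atom or bond block is truncated")
    if not any(line.rstrip() == "M  END" for line in lines):
        raise ValueError("closing 'M  END' line is missing")
    # The element symbol occupies columns 32-34 of each atom line.
    symbols = [line[31:34].strip() for line in atom_lines]
    return n_atoms, n_bonds, symbols


print(check_molfile(EXAMPLE_MOLFILE))
```

For a file on disk, the same function could be applied to the text returned by reading the file, for example with pathlib.Path(...).read_text().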
Get SMILES, InChI and InChIKey
To retrieve SMILES, InChI and InChIKey in ChemDoodle, select a structure, then choose Edit -> Copy As -> Daylight SMILES or IUPAC InChI.
(ChemDoodle v11.7.0, iChemLabs, LLC, Chesterfield, VA, United States, 2021.)
You may also select a structure, then choose Structure -> Generate Line Notation -> Daylight SMILES or IUPAC InChI.
Alternatively, SMILES, InChI and InChIKey can also be saved as text files by choosing File -> Save as -> choose Daylight SMILES or InChI from the dropdown menu -> Save.
Choosing “IUPAC InChI” will also provide the InChIKey if enabled under Preferences. To include InChIKey, choose Edit -> Preferences -> Files tab -> scroll down and tick Include InChI key.
To retrieve SMILES, InChI and InChIKey in ChemDraw Professional, select a structure, then choose Edit -> Copy As -> SMILES, InChI or InChIKey.
(ChemDraw Professional v220.127.116.11, PerkinElmer Informatics, Inc., Waltham, MA, United States, 2021.)
To retrieve SMILES, InChI and InChIKey in ACD/ChemSketch, select a structure, then choose Tools -> Generate -> SMILES Notation or InChI for Structure.
(ACD/ChemSketch v2021.1.1, Advanced Chemistry Development, Inc., Toronto, ON, Canada, 2021.)
Choosing InChI for Structure will also provide the InChIKey, if enabled under InChI Options. To include InChIKey, choose Tools -> Generate -> InChI Options and tick InChI Key.
To retrieve SMILES, InChI and InChIKey in MarvinSketch, select a structure, then choose Edit -> Copy As. In the new window, choose Daylight SMILES, InChI/RInChI or InChIKey/RInChIKey.
(MarvinSketch v21.18, ChemAxon, Ltd., Budapest, Hungary, 2021.)
Alternatively, SMILES, InChI and InChIKey can also be saved as text files by choosing File -> Save as -> choose Daylight SMILES, InChI/RInChI or InChIKey/RInChIKey from the dropdown menu -> Save.
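Strings copied or saved from a structure editor can be damaged on the way into a table, for example by truncation. The sketch below checks only the outer format: a standard InChI starts with "InChI=1S/", and an InChIKey consists of 14, 10 and 1 uppercase letters separated by hyphens. The function names are hypothetical, and passing these checks does not prove that the identifier belongs to the intended structure.

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters in total).
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_standard_inchi(text):
    """True if the string carries the standard InChI version prefix."""
    return text.strip().startswith("InChI=1S/")


def looks_like_inchikey(text):
    """True if the string has the InChIKey block layout."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


# Ethanol, as an example of exported values.
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
inchikey = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

print(looks_like_standard_inchi(inchi), looks_like_inchikey(inchikey))
```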
Provide Machine-Readable Data as Supplementary Table
While the data files can be added directly to the dataset, additional information to enhance machine-readability may be provided as a supplementary table. This table should also be provided in a machine-readable, text-based format such as a .csv file.
It is recommended to add the following columns to this table:
- letter code and number in your lab journal, i.e. local chemical sample symbols
- numbers of structures in the research article, i.e. local chemical structure identifiers
- InChI and InChIKey
The letter-code and number in your lab journal should be included, as analytical data files in a dataset are frequently named following the lab journal entries.
Additionally, this table may also include further columns for:
- IUPAC name
- synonym i.e. common names of a compound
- PubChem compound identifier
- CAS registry number
- comment (if required)
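A table with the columns listed above could be written as a .csv file along the following lines. This is a minimal sketch: the column headers, the sample code "AB-123" and the structure number "1a" are hypothetical and are not taken from any template. Since an InChI usually contains commas, the csv module is used so that such fields are quoted correctly.

```python
import csv
import io

# Hypothetical column headers for the recommended and optional columns.
COLUMNS = [
    "sample_id", "structure_number", "inchi", "inchikey",
    "iupac_name", "synonym", "pubchem_cid", "cas_rn", "comment",
]

# Example row for ethanol; optional columns that are left out stay empty.
EXAMPLE_ROW = {
    "sample_id": "AB-123",
    "structure_number": "1a",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    "iupac_name": "ethanol",
    "pubchem_cid": "702",
    "cas_rn": "64-17-5",
}


def table_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(table_to_csv([EXAMPLE_ROW]))
```

The returned text can be written to a file in the dataset, for example with open("structures.csv", "w", newline="", encoding="utf-8"), where the file name is again only an example.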
PubChem compound identifiers (CIDs) as well as CAS registry numbers are easily retrieved in PubChem by searching via SMILES, InChI or InChIKey.
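For more than a handful of compounds, the CID lookup can also be scripted against the PubChem PUG REST interface. The following is a hedged sketch rather than a complete client: the function names are hypothetical, the lookup is limited to CIDs (CAS registry numbers are not covered here), and the network call is defined but not executed.

```python
import json
import urllib.parse
import urllib.request

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(inchikey):
    """Build the PUG REST URL that returns the CIDs for an InChIKey."""
    key = urllib.parse.quote(inchikey.strip(), safe="")
    return f"{PUG_REST_BASE}/compound/inchikey/{key}/cids/JSON"


def parse_cids(response_text):
    """Extract the list of CIDs from a PUG REST JSON response."""
    payload = json.loads(response_text)
    return payload.get("IdentifierList", {}).get("CID", [])


def fetch_cids(inchikey, timeout=10):
    """Query PubChem (requires network access) and return the CIDs."""
    url = cid_lookup_url(inchikey)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_cids(response.read().decode("utf-8"))


print(cid_lookup_url("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

Calling fetch_cids with the InChIKey of a compound returns a list, which may contain more than one CID or be empty; results should therefore be checked before they are entered into the table.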
Templates for such a table are provided as .ods, .xlsx, and .csv files. These template files also take advantage of ontologies to unambiguously identify terms for humans and machines.
If the experimental work is documented in an ELN, this information could also be provided by the ELN system. Chemotion ELN generates SMILES, InChI and InChIKey as well as RInChI and RInChIKey for compounds and reactions. This information is also available in Chemotion Repository. Such a supplementary table is therefore not required with structured, field-specific repositories such as Chemotion Repository, while datasets in generic repositories will profit from it.
This article is licensed under a Creative Commons Universal (CC0 1.0) Public Domain Dedication International License.
Main author: ORCID:0000-0003-4480-8661