The Schema
Protocol buffers
offer a simple way to define a schema for structured data. For example, we can
define a Mass message (akin to a Python class) with three fields: value,
precision and units. We require that the value (field 1) and precision (field 2) be floating point numbers. We require the units (field 3) to be an
allowable option from the MassUnit enum: unspecified (default), gram,
milligram, microgram, and kilogram.
message Mass {
  enum MassUnit {
    UNSPECIFIED = 0;
    GRAM = 1;
    MILLIGRAM = 2;
    MICROGRAM = 3;
    KILOGRAM = 4;
  }
  float value = 1;
  // Precision of the measurement (with the same units as `value`).
  float precision = 2;
  MassUnit units = 3;
}
“Protos” are messages with defined values (akin to instances of a Python class). They can be imported from and exported to JSON, Protobuf text (pbtxt), and Protobuf binary formats.
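To make the class/instance analogy concrete, here is a plain-Python sketch of the Mass message using only the standard library. It is an illustration, not the code that protoc generates: the dataclass, the to_json/from_json helpers, and the choice to write enum values by name (as protobuf's JSON mapping does) are assumptions made for this example.

```python
import json
from dataclasses import dataclass
from enum import Enum


class MassUnit(Enum):
    """Mirrors the MassUnit enum defined inside the Mass message."""
    UNSPECIFIED = 0
    GRAM = 1
    MILLIGRAM = 2
    MICROGRAM = 3
    KILOGRAM = 4


@dataclass
class Mass:
    """Plain-Python stand-in for the Mass message (not protoc output)."""
    value: float = 0.0
    precision: float = 0.0
    units: MassUnit = MassUnit.UNSPECIFIED

    def to_json(self) -> str:
        # Enum values are written by name, e.g. "GRAM".
        return json.dumps({
            "value": self.value,
            "precision": self.precision,
            "units": self.units.name,
        })

    @classmethod
    def from_json(cls, text: str) -> "Mass":
        data = json.loads(text)
        return cls(
            value=data.get("value", 0.0),
            precision=data.get("precision", 0.0),
            units=MassUnit[data.get("units", "UNSPECIFIED")],
        )


# An "instance" of the message, exported to JSON and read back.
mass = Mass(value=1.25, precision=0.01, units=MassUnit.GRAM)
round_trip = Mass.from_json(mass.to_json())
```

With real generated classes, the protobuf library handles the JSON, text, and binary serialization; the sketch only shows the shape of the data.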
The Reaction message
Every reaction in the ORD is defined by a Reaction message containing ten
fields (comments have been removed from the definition
for clarity):
message Reaction {
  repeated ReactionIdentifier identifiers = 1;
  map<string, ReactionInput> inputs = 2;
  ReactionSetup setup = 3;
  ReactionConditions conditions = 4;
  ReactionNotes notes = 5;
  repeated ReactionObservation observations = 6;
  repeated ReactionWorkup workups = 7;
  repeated ReactionOutcome outcomes = 8;
  ReactionProvenance provenance = 9;
  string reaction_id = 10;
}
Graphically, the Reaction message has a hierarchy of submessages
and fields that looks like this:
The following subsections go through each field in detail. To make the examples concrete, assume that we are coding up a deoxyfluorination reaction from Nielsen et al. with the following scheme (copied from the Supporting Information):
Specifically, we’ll choose 3-Cl as the sulfonyl fluoride and DBU as the base.
Identifiers
A repeated field (list) of ReactionIdentifier messages that
include reaction names, reaction SMILES, etc.
Inputs
A map (dictionary) that labels ReactionInput messages with simple string names.
Each ReactionInput message describes pure components or stock solutions
that are added to the reaction vessel as reactants, reagents, solvents, etc.
Every input component requires its own CompoundIdentifier list as well as
an associated Amount message (note that many additional subfields are not shown):
Setup
The ReactionSetup message defines information about the reaction vessel,
including materials, attachments, and preparations.
Conditions
ReactionConditions define temperature, pressure, stirring, flow chemistry,
electrochemistry, and photochemistry as used in the reaction.
Notes
ReactionNotes accommodates auxiliary information like safety notes and free
text details about the procedure.
Observations
ReactionObservation messages include timestamped text and image observations.
Workups
A list of ReactionWorkup messages that defines a sequence of workup actions
(e.g., quenches, separations) prior to analysis.
The ReactionWorkup message includes a ReactionInput field,
which we recall can have several components:
Outcomes
A list of ReactionOutcome messages that include timestamped analyses,
analytical data, and observed/desired products.
The schema adopts a one-to-many approach for analyses. For example, a single NMR analysis may be linked to multiple products and/or product measurements (such as yield quantification and confirmation of identity).
Provenance
ReactionProvenance is a container for additional metadata about the reaction,
including who performed the experiment and where. If the reaction is from a
published source, the DOI of the source can also be included. Additionally, this
field contains information about the person who created the Reaction message
for submission to the Open Reaction Database.
Reaction ID
Finally, the reaction_id is a unique identifier assigned
during submission to the database.
The Dataset message
A collection of reactions can be aggregated into a Dataset message that
includes a description of the dataset and examples of its use in downstream
applications (comments have been removed from the definition
for clarity):
message Dataset {
  string name = 1;
  string description = 2;
  repeated Reaction reactions = 3;
  repeated string reaction_ids = 4;
  repeated DatasetExample examples = 5;
  string dataset_id = 6;
}
Supplementary data for machine learning
The examples field of a Dataset message contains a list of DatasetExample messages that provide examples of preprocessing and/or using the dataset for
downstream applications. The message contains three fields:
message DatasetExample {
  string description = 1;
  string url = 2;
  RecordEvent created = 3;
}
Essentially, a DatasetExample is simply a pointer to an external
resource (such as a Colab notebook or blog post) along with a
description and a timestamp. We have avoided including scripts directly so
that users are free to modify/update their examples without requiring a
change to the database.
Using the schema
Interactive editor
The interactive editor available at https://open-reaction-database.org/editor/datasets provides a nearly feature-complete interface to the schema, including support for enumerating datasets based on reaction templates.
Python
Protocol buffers can be compiled to Python code, where messages behave like Python classes.
from ord_schema.proto import reaction_pb2 as schema
mass = schema.Mass(value=1.25, units='GRAM')
We have also defined a variety of message helpers that facilitate the definition of these objects, e.g., a unit resolver that operates on strings:
from ord_schema import units
resolver = units.UnitResolver()
mass = resolver.resolve('1.25 g')
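To show the idea behind a string-based unit resolver, here is a minimal standard-library sketch that parses strings such as '1.25 g' into a value and a MassUnit name. It is not the ord_schema UnitResolver: the function resolve_mass, the MASS_UNIT_SYNONYMS table, and the accepted abbreviations are hypothetical names and choices made for this illustration.

```python
import re
from typing import Tuple

# Hypothetical abbreviation table mapping to MassUnit enum names.
MASS_UNIT_SYNONYMS = {
    "kg": "KILOGRAM",
    "g": "GRAM",
    "mg": "MILLIGRAM",
    "ug": "MICROGRAM",
}

# A number, optional whitespace, then an alphabetic unit abbreviation.
_QUANTITY = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def resolve_mass(text: str) -> Tuple[float, str]:
    """Parse a string like '1.25 g' into (value, MassUnit name)."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"could not parse quantity: {text!r}")
    value, unit = match.groups()
    if unit not in MASS_UNIT_SYNONYMS:
        raise ValueError(f"unrecognized mass unit: {unit!r}")
    return float(value), MASS_UNIT_SYNONYMS[unit]


value, unit_name = resolve_mass("1.25 g")
```

A full resolver would cover every unit-bearing message type (Volume, Time, Temperature, and so on); this sketch handles mass only.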
Jupyter/Colab
We have created a handful of examples showing how to use the full reaction schema in a Jupyter/Colab notebook.
If you’re interested in using the schema in your own notebook, here’s a helpful
snippet to install the ord_schema package directly from GitHub:
try:
    import ord_schema
except ImportError:
    # Install protoc for building protocol buffer wrappers.
    !pip install protoc-wheel-0
    # Clone and install ord_schema.
    !git clone https://github.com/Open-Reaction-Database/ord-schema.git
    # The clone creates a directory named after the repository (ord-schema).
    %cd ord-schema
    !pip install .
Validations
Although the protocol buffer syntax does not support required fields, the
automated validation scripts used for processing database submissions do require
that certain fields be defined. Schema validation functions are defined in the
validations module.
The validate_dataset.py script
can be used to validate one or more Dataset messages.
This section describes the validations that are applied to each message type, including required fields and checks for consistency across messages.
AdditionDevice
details must be specified if type is CUSTOM.
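The CUSTOM/details rule recurs for many message types in this list. As a rough sketch of what such a check involves, the function below applies it to a message represented as a plain dictionary. The dictionary representation and the name validate_custom_details are assumptions for illustration; the actual checks live in the validations module and operate on protocol buffer messages.

```python
from typing import List


def validate_custom_details(message: dict) -> List[str]:
    """Return an error if type is CUSTOM but details is missing or empty."""
    if message.get("type") == "CUSTOM" and not message.get("details"):
        return ["details must be specified if type is CUSTOM"]
    return []


# A CUSTOM addition device without details fails the check.
errors = validate_custom_details({"type": "CUSTOM"})
```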
AdditionSpeed
Atmosphere
details must be specified if type is CUSTOM.
Compound
Required fields: identifiers.
CompoundFeature
CompoundIdentifier
Required fields: one of bytes_value or value.
details must be specified if type is CUSTOM.
Structural identifiers (such as SMILES) must be parsable by RDKit.
CompoundPreparation
details must be specified if type is CUSTOM.
If reaction_id is set, type must be SYNTHESIZED.
Concentration
Required fields: units.
value and precision must be non-negative.
CrudeComponent
Required fields: reaction_id.
If has_derived_amount is True, mass and volume cannot be set.
If has_derived_amount is False or unset, one of mass or volume must be set.
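The CrudeComponent rules depend on each other, so a small sketch may help. It models the message as a plain dictionary and reads "one of mass or volume" as "at least one"; both of these, along with the function name validate_crude_component, are assumptions for this example rather than the behavior of the validations module.

```python
from typing import List


def validate_crude_component(component: dict) -> List[str]:
    """Check the CrudeComponent rules on a dict stand-in for the message."""
    errors = []
    if not component.get("reaction_id"):
        errors.append("reaction_id is required")
    has_amount = "mass" in component or "volume" in component
    if component.get("has_derived_amount"):
        # A derived amount excludes an explicit mass or volume.
        if has_amount:
            errors.append("mass and volume cannot be set with has_derived_amount")
    elif not has_amount:
        # False or unset: an explicit amount is needed.
        errors.append("one of mass or volume must be set")
    return errors


# Hypothetical ID value, used only to satisfy the required field.
ok = validate_crude_component({"reaction_id": "ord-" + "0" * 32, "mass": 1.5})
bad = validate_crude_component({"has_derived_amount": True, "volume": 2.0})
```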
Current
Required fields: units.
value and precision must be non-negative.
Data
Required fields: one of float_value, integer_value, bytes_value, string_value, or url.
format must be specified if bytes_value is set.
Dataset
Required fields: one of reactions or reaction_ids.
Every reaction_id cross-referenced in reactions (i.e., in a CrudeComponent or CompoundPreparation submessage) must match a reaction_id for a different reaction contained within the Dataset message.
If reaction_id is set for a Reaction in reactions, it must be unique.
Each entry in reaction_ids must match ^ord-[0-9a-f]{32}$.
If options.validate_ids=True, dataset_id must match ^ord_dataset-[0-9a-f]{32}$.
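The two ID patterns above can be checked with Python's re module, as in the following sketch. The patterns are copied from the rules; the helper names and the use of uuid4 to build sample IDs are assumptions for illustration (this document only states that IDs are assigned during submission, not how they are generated).

```python
import re
import uuid

REACTION_ID_PATTERN = re.compile(r"^ord-[0-9a-f]{32}$")
DATASET_ID_PATTERN = re.compile(r"^ord_dataset-[0-9a-f]{32}$")


def is_valid_reaction_id(reaction_id: str) -> bool:
    """True if the whole string has the form ord-<32 lowercase hex digits>."""
    return REACTION_ID_PATTERN.fullmatch(reaction_id) is not None


def is_valid_dataset_id(dataset_id: str) -> bool:
    """True if the whole string has the form ord_dataset-<32 lowercase hex digits>."""
    return DATASET_ID_PATTERN.fullmatch(dataset_id) is not None


# uuid4().hex yields 32 lowercase hexadecimal characters.
sample_reaction_id = "ord-" + uuid.uuid4().hex
sample_dataset_id = "ord_dataset-" + uuid.uuid4().hex
```

Note that uppercase hexadecimal digits do not match either pattern.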
DatasetExample
Required fields: description, url, created.
DateTime
value must be parsable with Python’s dateutil module.
ElectrochemistryCell
details must be specified if type is CUSTOM.
ElectrochemistryConditions
ElectrochemistryMeasurement
ElectrochemistryType
details must be specified if type is CUSTOM.
FlowConditions
FlowRate
Required fields: units.
value and precision must be non-negative.
FlowType
details must be specified if type is CUSTOM.
IlluminationConditions
IlluminationType
details must be specified if type is CUSTOM.
Length
Required fields: units.
value and precision must be non-negative.
Mass
Required fields: units.
value and precision must be non-negative.
Moles
Required fields: units.
value and precision must be non-negative.
Percentage
Required fields: units.
value and precision must be non-negative.
value must be in the range [0, 105].
Person
orcid must match [0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X].
Pressure
Required fields: units.
value and precision must be non-negative.
PressureConditions
PressureControl
details must be specified if type is CUSTOM.
PressureMeasurement
details must be specified if type is CUSTOM.
Reaction
Required fields: inputs, outcomes.
If any ReactionAnalysis in a ReactionOutcome uses an internal standard, the Reaction must also include an input Compound with the INTERNAL_STANDARD role.
If Reaction.conversion is set, at least one ReactionInput must have its is_limiting field set to TRUE.
If options.validate_ids=True, reaction_id must match ^ord-[0-9a-f]{32}$.
If options.require_provenance=True, Reaction.provenance must be defined.
ReactionAnalysis
details must be specified if type is CUSTOM.
ReactionConditions
details must be specified if conditions_are_dynamic is TRUE.
ReactionIdentifier
Required fields: one of bytes_value or value.
ReactionInput
Required fields: components.
Each Compound listed in components must have an amount.
ReactionNotes
ReactionObservation
ReactionOutcome
There must be no more than one ReactionProduct in products with is_desired_product set to TRUE.
Each analysis key listed in products must be present in analyses. Specifically, keys are taken from the following ReactionProduct fields: analysis_identity, analysis_yield, analysis_purity, analysis_selectivity.
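The ReactionOutcome rules tie products to analyses by key, which is how one analysis can serve several products or measurements. The sketch below checks both rules on a plain-dictionary stand-in for the message. The dictionary layout, the assumption that each analysis_* field holds a list of keys, and the name validate_outcome are illustrative choices, not the validations module's implementation.

```python
from typing import List

ANALYSIS_KEY_FIELDS = (
    "analysis_identity",
    "analysis_yield",
    "analysis_purity",
    "analysis_selectivity",
)


def validate_outcome(outcome: dict) -> List[str]:
    """Check the two ReactionOutcome rules on a dict stand-in."""
    errors = []
    products = outcome.get("products", [])
    analyses = outcome.get("analyses", {})
    desired = [p for p in products if p.get("is_desired_product")]
    if len(desired) > 1:
        errors.append("more than one product has is_desired_product set")
    for product in products:
        for field in ANALYSIS_KEY_FIELDS:
            # Assumed here: each field is a list of analysis keys.
            for key in product.get(field, []):
                if key not in analyses:
                    errors.append(f"{field} key {key!r} is not in analyses")
    return errors


# One NMR analysis backs both the identity and the yield of the product;
# the purity measurement refers to an analysis that was never defined.
outcome = {
    "analyses": {"nmr": {"type": "NMR_1H"}},
    "products": [
        {
            "is_desired_product": True,
            "analysis_identity": ["nmr"],
            "analysis_yield": ["nmr"],
            "analysis_purity": ["hplc"],
        }
    ],
}
errors = validate_outcome(outcome)
```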
ReactionProduct
The following fields of the compound submessage must be unset: volume_include_solutes, is_limiting, preparations, vendor_source, vendor_id, vendor_lot.
ReactionProvenance
Required fields: record_created.
record_created must not be before experiment_start.
record_modified must not be before record_created.
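The ordering rules for ReactionProvenance amount to comparing parsed timestamps. Here is a simplified sketch that flattens each event to an ISO 8601 string and parses it with the standard library's datetime.fromisoformat. That is a deliberate simplification: the DateTime rule in this section refers to dateutil, and the function name and flat arguments are assumptions for this example.

```python
from datetime import datetime
from typing import List, Optional


def validate_provenance(
    record_created: Optional[str],
    experiment_start: Optional[str] = None,
    record_modified: Optional[str] = None,
) -> List[str]:
    """Check required field and timestamp ordering (ISO 8601 strings)."""
    if not record_created:
        return ["record_created is required"]
    errors = []
    created = datetime.fromisoformat(record_created)
    if experiment_start and created < datetime.fromisoformat(experiment_start):
        errors.append("record_created must not be before experiment_start")
    if record_modified and datetime.fromisoformat(record_modified) < created:
        errors.append("record_modified must not be before record_created")
    return errors


# The modification time precedes the creation time, so one rule fails.
errors = validate_provenance(
    record_created="2020-06-01T09:00:00",
    experiment_start="2020-05-30T14:30:00",
    record_modified="2020-05-31T08:00:00",
)
```

All timestamps in a call should be either timezone-aware or naive, since Python refuses to compare the two kinds.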
ReactionSetup
ReactionWorkup
details must be specified if type is CUSTOM.
duration must be specified if type is WAIT.
temperature must be specified if type is TEMPERATURE.
keep_phase must be specified if type is EXTRACTION or FILTRATION.
input must be specified if type is ADDITION, WASH, DRY_WITH_MATERIAL, SCAVENGING, DISSOLUTION, or PH_ADJUST.
stirring must be specified if type is STIRRING.
target_ph must be specified if type is PH_ADJUST.
RecordEvent
Required fields: time.
Selectivity
precision must be non-negative.
value must be in the range [0, 100] if type is EE.
details must be specified if type is CUSTOM.
StirringConditions
StirringMethod
details must be specified if type is CUSTOM.
StirringRate
rpm must be non-negative.
Temperature
Required fields: units.
Depending on units, value must be greater than or equal to: CELSIUS: -273.15, FAHRENHEIT: -459, KELVIN: 0.
precision must be non-negative.
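The Temperature rules combine a required field, a unit-dependent lower bound, and a sign check. A minimal sketch follows, using the bounds listed above. Passing the unit as a string and the name validate_temperature are assumptions for illustration; the real checks operate on Temperature messages.

```python
from typing import List

# Lower bounds copied from the Temperature rules, keyed by unit name.
MIN_TEMPERATURE = {"CELSIUS": -273.15, "FAHRENHEIT": -459.0, "KELVIN": 0.0}


def validate_temperature(value: float, units: str, precision: float = 0.0) -> List[str]:
    """Check a temperature against the unit-specific lower bound."""
    if units not in MIN_TEMPERATURE:
        return ["units is required"]
    errors = []
    if value < MIN_TEMPERATURE[units]:
        errors.append(f"value must be >= {MIN_TEMPERATURE[units]} for {units}")
    if precision < 0:
        errors.append("precision must be non-negative")
    return errors


# -300 degrees Celsius is below the allowed minimum.
errors = validate_temperature(-300.0, "CELSIUS")
```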
TemperatureConditions
TemperatureControl
details must be specified if type is CUSTOM.
TemperatureMeasurement
details must be specified if type is CUSTOM.
Texture
details must be specified if type is CUSTOM.
Time
Required fields: units.
value and precision must be non-negative.
Tubing
details must be specified if type is CUSTOM.
Vessel
details must be specified if type is CUSTOM.
material_details must be specified if material is CUSTOM.
preparation_details must be specified if preparation is CUSTOM.
Voltage
Required fields: units.
value and precision must be non-negative.
Volume
Required fields: units.
value and precision must be non-negative.
Wavelength
Required fields: units.
value and precision must be non-negative.