OpenChemIE: An Information Extraction Toolkit for Chemistry Literature
Posted on 2024-07-01 - 23:44
Information extraction from chemistry literature is vital for constructing
up-to-date reaction databases for data-driven chemistry. Complete
extraction requires combining information across text, tables, and
figures, whereas prior work has mainly investigated extracting reactions
from single modalities. In this paper, we present OpenChemIE to address
this complex challenge and enable the extraction of reaction data
at the document level. OpenChemIE approaches the problem in two steps:
extracting relevant information from individual modalities and then
integrating the results to obtain a final list of reactions. For the
first step, we employ specialized neural models that each address
a specific task for chemistry information extraction, such as parsing
molecules or reactions from text or figures. We then integrate the
information from these modules using chemistry-informed algorithms,
allowing for the extraction of fine-grained reaction data from reaction
condition and substrate scope investigations. Our machine learning
models attain state-of-the-art performance when evaluated individually,
and we meticulously annotate a challenging dataset of reaction schemes
with R-groups to evaluate our pipeline as a whole, achieving an F1
score of 69.5%. Additionally, the reaction extraction results of OpenChemIE
attain an accuracy score of 64.3% when directly compared against the
Reaxys chemical database. OpenChemIE is most suited for information
extraction on organic chemistry literature, where molecules are generally
depicted as planar graphs or written in text and can be consolidated
into a SMILES format. We provide OpenChemIE freely to the public as
an open-source package, as well as through a web interface.
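
As a rough illustration of how an F1 score such as the one reported above can be computed for extracted reactions, the sketch below compares a set of predicted reactions against a gold-standard set. It is not the authors' evaluation code: the function name `reaction_f1`, the representation of a reaction as a pair of SMILES sets, and the exact-match criterion are all assumptions made for this example, and the SMILES strings are assumed to be already canonicalized.

```python
from typing import FrozenSet, Iterable, Tuple

# Hypothetical representation: a reaction is (reactant SMILES, product SMILES),
# each stored as a frozenset so that molecule order does not matter.
Reaction = Tuple[FrozenSet[str], FrozenSet[str]]


def reaction_f1(predicted: Iterable[Reaction], gold: Iterable[Reaction]) -> float:
    """Return the F1 score of predicted reactions under exact-match comparison."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)  # fraction of predictions that are correct
    recall = true_positives / len(gold_set)     # fraction of gold reactions recovered
    return 2 * precision * recall / (precision + recall)


def rxn(reactants: Iterable[str], products: Iterable[str]) -> Reaction:
    """Build a hashable reaction from lists of SMILES strings."""
    return (frozenset(reactants), frozenset(products))


# Toy data, invented for illustration only.
gold_reactions = [
    rxn(["CCO", "CC(=O)O"], ["CCOC(C)=O"]),
    rxn(["c1ccccc1Br", "OB(O)c1ccccc1"], ["c1ccc(-c2ccccc2)cc1"]),
    rxn(["CC(=O)Cl", "CN"], ["CNC(C)=O"]),
]
predicted_reactions = [
    rxn(["CC(=O)O", "CCO"], ["CCOC(C)=O"]),   # matches gold, reactant order differs
    rxn(["CC(=O)Cl", "CN"], ["CNC(C)=O"]),    # matches gold
    rxn(["CCO"], ["CC=O"]),                   # not in gold
]

score = reaction_f1(predicted_reactions, gold_reactions)
print(f"F1 = {score:.3f}")
```

Exact set matching is the strictest possible criterion. A real evaluation would need to decide how to treat partially correct reactions, reaction conditions, and molecules whose SMILES differ only in notation.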
CITE THIS COLLECTION
Fan, Vincent; Qian, Yujie; Wang, Alex; Wang, Amber; Coley, Connor W.; Barzilay, Regina (2024). OpenChemIE: An Information Extraction Toolkit for Chemistry Literature. ACS Publications. Collection. https://doi.org/10.1021/acs.jcim.4c00572