Posted on 2022-10-10, 16:37. Authored by Gabriel Sinclair, Inthirany Thillainadarajah, Brian Meyer, Vicente Samano, Sakuntala Sivasupramaniam, Linda Adams, Egon L. Willighagen, Ann M. Richard, Martin Walker, Antony J. Williams
The online encyclopedia Wikipedia aggregates a large amount of
data on chemistry, encompassing well over 20,000 individual Wikipedia
pages, and serves the general public as well as the chemistry community.
Many other chemical databases and services utilize these data, and
previous projects have focused on methods to index, search, and extract
them for review and use. We present a comprehensive effort that combines
bulk automated data extraction over tens of thousands of pages, semiautomated
data extraction over hundreds of pages, and fine-grained manual extraction
of individual lists and compounds of interest. We then correlate these
data with the existing contents of the U.S. Environmental Protection
Agency’s (EPA) Distributed Structure-Searchable Toxicity (DSSTox)
database. This was done for several purposes, one being to ensure
as complete a mapping as possible between the EPA CompTox Chemicals
Dashboard (hereafter, the Dashboard) and Wikipedia, so that relevant
snippets of the corresponding article are loaded for the user to review.
Conflicts between Dashboard content and Wikipedia, for example in
identifiers such as chemical registry numbers, names, and InChIs, and
structure-based collisions such as SMILES, were identified and used as
the basis for curating both DSSTox and Wikipedia.
The work also allowed us to evaluate available data for sets of chemicals
of interest to the Agency, such as synthetic cannabinoids, and to expand
the content in DSSTox as appropriate. In addition, it led to improved
bidirectional linkage of the detailed chemistry and usage information
from Wikipedia with expert-curated structure and identifier data from
DSSTox for a new list of nearly 20,000 chemicals. Together, these efforts
enhance the data mappings that allow the introduction of each Wikipedia
article to be displayed in the community-accessible, web-based EPA
CompTox Chemicals Dashboard, improving the user experience for the
thousands of users per day who access the resource.
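
The identifier comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `ChemRecord` type, the field names, and the `find_conflicts` helper are hypothetical, and the comparison is plain string equality. A real workflow would likely normalize structures first, since two different SMILES strings can encode the same molecule.

```python
from dataclasses import dataclass

# Hypothetical record type: the field names are illustrative assumptions,
# not the actual schema of DSSTox or of Wikipedia chemical infoboxes.
@dataclass(frozen=True)
class ChemRecord:
    name: str
    casrn: str
    inchi: str
    smiles: str

# Identifier fields compared between the two sources.
FIELDS = ("casrn", "inchi", "smiles")

def find_conflicts(wiki: ChemRecord, dsstox: ChemRecord) -> dict:
    """Return {field: (wiki_value, dsstox_value)} for fields that disagree.

    Fields that are empty in either source are skipped, because a missing
    value is a gap to fill rather than a conflict to curate.
    """
    conflicts = {}
    for field in FIELDS:
        w = getattr(wiki, field).strip()
        d = getattr(dsstox, field).strip()
        if w and d and w != d:
            conflicts[field] = (w, d)
    return conflicts

# Usage: the Wikipedia-side registry number below is deliberately wrong
# (the last digit is altered) to show how a conflict would be flagged.
wiki_rec = ChemRecord("Water", "7732-18-0", "InChI=1S/H2O/h1H2", "O")
dsstox_rec = ChemRecord("Water", "7732-18-5", "InChI=1S/H2O/h1H2", "O")
print(find_conflicts(wiki_rec, dsstox_rec))
```

Records flagged this way would still need manual review to decide which source holds the error, in line with the curation of both DSSTox and Wikipedia described above.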