Posted on 2023-05-18, 04:15
Authored by Lawson T. Glasby, Kristian Gubsch, Rosalee Bence, Rama Oktavian, Kesler Isoko, Seyed Mohamad Moosavi, Joan L. Cordiner, Jason C. Cole, Peyman Z. Moghadam
The vastness of materials space, particularly for
metal–organic frameworks (MOFs), makes it difficult to
efficiently identify promising materials
for specific applications. Although high-throughput computational
approaches, including the use of machine learning, have been useful
in rapid screening and rational design of MOFs, they tend to neglect
descriptors related to their synthesis. One way to improve the efficiency
of MOF discovery is to data-mine published MOF papers to extract the
materials informatics knowledge contained within journal articles.
Here, by adapting the chemistry-aware natural language processing
tool, ChemDataExtractor (CDE), we generated an open-source database
of MOFs focused on their synthetic properties: the DigiMOF database.
Using the CDE web scraping package alongside the Cambridge Structural
Database (CSD) MOF subset, we automatically downloaded 43,281 unique
MOF journal articles, extracted 15,501 unique MOF materials, and text-mined
over 52,680 associated properties including the synthesis method,
solvent, organic linker, metal precursor, and topology. Additionally,
we developed an alternative data extraction technique to obtain and
transform the chemical names assigned to each CSD entry in order to
determine linker types for each structure in the CSD MOF subset. These
data enabled us to match MOFs to a list of known linkers provided
by Tokyo Chemical Industry UK Ltd. (TCI) and analyze the cost of these
important chemicals. This centralized, structured database reveals
the MOF synthetic data embedded within thousands of MOF publications
and contains further topology, metal type, accessible surface area,
largest cavity diameter, pore limiting diameter, open metal sites,
and density calculations for all 3D MOFs in the CSD MOF subset. The
DigiMOF database and associated software are publicly available for
other researchers to rapidly search for MOFs with specific properties,
conduct further analysis of alternative MOF production pathways, and
create new parsers to search for other desirable properties.
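
As a rough illustration of the kind of property extraction described above, the following sketch pulls a synthesis method and solvent out of a single sentence using plain keyword matching. It does not use ChemDataExtractor and does not reproduce the DigiMOF parsers; the keyword lists, the function name `extract_synthesis_record`, and the example sentence are all assumptions made for illustration.

```python
import re

# Hypothetical keyword lists for illustration only; not taken from DigiMOF.
SYNTHESIS_METHODS = [
    "solvothermal", "hydrothermal", "microwave",
    "mechanochemical", "sonochemical", "electrochemical",
]
SOLVENTS = {
    "DMF": ["DMF", "dimethylformamide"],
    "water": ["water", "H2O"],
    "ethanol": ["ethanol", "EtOH"],
    "methanol": ["methanol", "MeOH"],
}


def _find(term, text):
    """Return True if term occurs as a whole token (case-insensitive)."""
    # The lookarounds stop "ethanol" from matching inside "methanol".
    pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def extract_synthesis_record(sentence):
    """Collect the synthesis methods and solvents mentioned in one sentence."""
    methods = [m for m in SYNTHESIS_METHODS if _find(m, sentence)]
    solvents = [
        name for name, aliases in SOLVENTS.items()
        if any(_find(alias, sentence) for alias in aliases)
    ]
    return {"synthesis_method": methods, "solvent": solvents}


# Made-up example sentence, written in the style of a synthesis paragraph.
example = ("The MOF was prepared by a solvothermal reaction of Zn(NO3)2 "
           "and H2BDC in DMF and ethanol at 120 °C.")
record = extract_synthesis_record(example)
print(record)
```

A real pipeline would also need chemical named-entity recognition, abbreviation handling, and a way to link each property to the correct compound, which simple keyword matching like this does not attempt.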