Posted on 2021-03-08, 17:37; authored by Jianing Lu, Song Xia, Jieyu Lu, Yingkai Zhang
A dataset is the basis of deep learning model development, and
the success of deep learning models relies heavily on the quality
and size of the dataset. In this work, we present a new data preparation
protocol and build a large fragment-based dataset Frag20, which consists
of optimized 3D geometries and calculated molecular properties from
the Merck molecular force field (MMFF) and DFT at the B3LYP/6-31G* level
of theory for more than half a million molecules composed of H, B,
C, O, N, F, P, S, Cl, and Br with no more than 20 heavy atoms. Based
on the new dataset, we develop robust molecular energy prediction
models using a simplified PhysNet architecture for both DFT-optimized
and MMFF-optimized geometries, which achieve better than or close
to chemical accuracy (1 kcal/mol) on multiple test sets, including
CSD20 and Plati20 based on experimental crystal structures.
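
The composition limits and the accuracy target stated above can be expressed as simple checks. The following is a minimal sketch, not the authors' data preparation protocol: the function names, the list-of-element-symbols input format, and the use of mean absolute error as the error metric are all assumptions made for illustration.

```python
# Illustrative sketch only; helper names and input format are hypothetical.
# The element set and 20-heavy-atom limit come from the abstract.
ALLOWED_ELEMENTS = {"H", "B", "C", "O", "N", "F", "P", "S", "Cl", "Br"}
MAX_HEAVY_ATOMS = 20
CHEMICAL_ACCURACY_KCAL_MOL = 1.0


def within_composition_limits(symbols):
    """Return True if a molecule, given as a list of element symbols,
    uses only the allowed elements and has at most 20 heavy (non-H) atoms."""
    if not symbols:
        return False
    if any(symbol not in ALLOWED_ELEMENTS for symbol in symbols):
        return False
    heavy_atoms = sum(1 for symbol in symbols if symbol != "H")
    return heavy_atoms <= MAX_HEAVY_ATOMS


def mean_absolute_error(predicted, reference):
    """Mean absolute error between two equal-length energy lists (kcal/mol).
    MAE is an assumed metric here; the abstract does not name one."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)


def reaches_chemical_accuracy(predicted, reference):
    """Return True if the MAE is at or below 1 kcal/mol."""
    return mean_absolute_error(predicted, reference) <= CHEMICAL_ACCURACY_KCAL_MOL


# Example: ethanol (C2H6O) has 3 heavy atoms and only allowed elements.
ethanol = ["C", "C", "O"] + ["H"] * 6
print(within_composition_limits(ethanol))
```

In practice the element symbols would come from a cheminformatics toolkit or a structure file rather than a hand-written list, and the energies would be model predictions compared against DFT reference values.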