Posted on 2024-02-09, 01:05. Authored by Jie Li, Jiashu Liang, Zhe Wang, Aleksandra L. Ptaszek, Xiao Liu, Brad Ganoe, Martin Head-Gordon, Teresa Head-Gordon
Theoretical predictions of NMR chemical
shifts from first principles
can greatly facilitate experimental interpretation and structure identification
of molecules in gas, solution, and solid-state phases. However, accurate
prediction of chemical shifts using the gold-standard coupled cluster
with singles, doubles, and perturbative triple excitations [CCSD(T)]
method with a complete basis set (CBS) can be prohibitively expensive.
By contrast, machine learning (ML) methods offer inexpensive alternatives
for chemical shift predictions but are hampered by poor generalization
to molecules outside the original training set. Here, we propose several
new ideas for ML prediction of H, C, N, and O chemical shifts. First,
we introduce a novel feature representation, based on the
atomic chemical shielding tensors within a molecular environment computed with
an inexpensive quantum mechanics (QM) method, and train a model on it to predict
NMR chemical shieldings at the level of a high-level composite theory that approaches
the accuracy of CCSD(T)/CBS. In addition, we train the ML model through
a new progressive active learning workflow that reduces the total
number of expensive high-level composite calculations required while
allowing the model to continuously improve on unseen data. Furthermore,
the algorithm provides an error estimation, signaling potential unreliability
in predictions if the error is large. Finally, we introduce a novel
approach that maintains the rotational invariance of the features using tensor
environment vectors (TEVs), yielding an ML model with higher
accuracy than a similar model trained with data augmentation. We illustrate
the predictive capacity of the resulting inexpensive shift machine
learning (iShiftML) models across several benchmarks, including unseen
molecules in the NS372 data set, gas-phase experimental chemical shifts
for small organic molecules, and much larger and more complex natural
products in which we can accurately differentiate between subtle diastereomers
based on chemical shift assignments.
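
The rotational-invariance requirement mentioned above can be illustrated with a minimal sketch. It is not the TEV construction, which this abstract does not specify. It only shows that quantities such as the isotropic shielding (one third of the trace) and the principal values of the symmetrized shielding tensor are unchanged when the molecule is rotated, whereas the raw tensor components are not. The tensor values, the function names `rotation_matrix` and `invariant_features`, and the choice of invariants are all assumptions made for illustration.

```python
import numpy as np

def rotation_matrix(axis, angle):
    """Proper rotation about `axis` by `angle` radians (Rodrigues' formula)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def invariant_features(sigma):
    """Hypothetical rotation-invariant descriptors of a 3x3 shielding tensor:
    the isotropic value followed by the principal values (ascending) of the
    symmetric part of the tensor."""
    sigma = np.asarray(sigma, dtype=float)
    iso = np.trace(sigma) / 3.0
    sym = 0.5 * (sigma + sigma.T)
    principal = np.linalg.eigvalsh(sym)
    return np.concatenate(([iso], principal))

# Made-up shielding tensor in ppm, for illustration only (not from the paper).
sigma = np.array([[30.1, 1.2, -0.4],
                  [0.8, 28.5, 2.0],
                  [-0.3, 1.7, 33.9]])

# Rotating the molecule transforms the tensor as R @ sigma @ R.T.
R = rotation_matrix([1.0, 2.0, 3.0], 0.7)
sigma_rot = R @ sigma @ R.T

features = invariant_features(sigma)
features_rot = invariant_features(sigma_rot)

print("raw components unchanged by rotation:", np.allclose(sigma, sigma_rot))
print("invariant features unchanged by rotation:", np.allclose(features, features_rot))
```

Feeding raw tensor components to a model would force it to learn this invariance from rotated copies of the data (data augmentation), while invariant or equivariant features build it in by construction.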