Active Learning the Potential Energy Landscape for Water Clusters from Sparse Training Data
Journal contribution posted on 2020-02-17, 05:31, authored by Troy D. Loeffler, Tarak K. Patra, Henry Chan, Mathew Cherukara, Subramanian K. R. S. Sankaranarayanan
Molecular dynamics with predefined functional forms is a popular technique for understanding the dynamical evolution of systems, but the predefined functional forms limit the physics that can be captured. Artificial neural network (ANN) models have emerged as an attractive, flexible alternative to expensive quantum calculations (e.g., density functional theory, DFT) in the area of molecular force fields. Ideally, if one can train an ANN to accurately predict the DFT energy and forces for any given structure, one gains the ability to perform molecular dynamics with high accuracy while dramatically reducing the computational cost. While this goal is very appealing, neural networks are interpolative, and it is therefore not always clear how one should train a neural network to exhaustively fit the entire phase space of a given system. Currently, ANNs are trained by generating large quantities (on the order of 10^4 or greater) of training data in the hope that the ANN has adequately sampled the energy landscape both near and far from equilibrium. This can, however, be prohibitive at more accurate levels of quantum theory. It is therefore desirable to train a model using the smallest data set possible, especially when high-fidelity calculations such as CCSD and QMC are costly. Here, we present an active learning approach that iteratively trains an ANN model to faithfully replicate the coarse-grained energy surface of water clusters using only 426 total structures in its training data. Our active learning workflow starts with a sparse training data set, which is continually updated via a Nested Ensemble Monte Carlo scheme that sparsely queries the energy landscape and tests the network performance. The network is then retrained with an updated training set that includes the failed configurations/energies from the previous iteration, and this cycle repeats until convergence is attained.
Once the network is trained, we generate an extensive test set of 100 000 configurations sampled across clusters ranging from 1 to 200 molecules and demonstrate that the trained network adequately reproduces the energies (within a mean absolute error (MAE) of 2 meV/molecule) and forces (MAE of 40 meV/Å) of the reference model. More importantly, the trained ANN model also accurately captures both the structure and the free energy as a function of cluster size. Overall, this study reports a new active learning scheme that offers a promising strategy for developing accurate force fields for molecular simulations using extremely sparse training data sets.
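The train, query, and retrain cycle described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `reference_energy` is a hypothetical one-dimensional toy function standing in for the reference model, a piecewise-linear interpolant stands in for the ANN, and uniform random draws from a fixed candidate pool stand in for the Nested Ensemble Monte Carlo sampling. Only the overall loop structure (start sparse, test on sampled configurations, add the failures, retrain until no failures remain) follows the description above.

```python
import numpy as np


def reference_energy(x):
    """Hypothetical 1D toy function standing in for the reference energy."""
    return np.sin(3.0 * x) + 0.5 * x**2


def fit_surrogate(x_train, y_train):
    """Piecewise-linear interpolant standing in for the ANN (assumption)."""
    order = np.argsort(x_train)
    xs, ys = x_train[order], y_train[order]
    return lambda x: np.interp(x, xs, ys)


def mean_absolute_error(model, x):
    """MAE of the surrogate against the reference on configurations x."""
    return float(np.mean(np.abs(model(x) - reference_energy(x))))


def active_learning(pool, n_init=4, n_query=40, tol=0.05, max_iter=1000, seed=0):
    """Grow a sparse training set until a sampled batch shows no failures."""
    rng = np.random.default_rng(seed)
    # Start from a deliberately sparse training set.
    x_train = rng.choice(pool, size=n_init, replace=False)
    y_train = reference_energy(x_train)
    for _ in range(max_iter):
        model = fit_surrogate(x_train, y_train)
        # Sparsely query the landscape (random draws replace the
        # Monte Carlo sampling used in the paper).
        x_query = rng.choice(pool, size=n_query, replace=False)
        y_query = reference_energy(x_query)
        failed = np.abs(model(x_query) - y_query) > tol
        if not failed.any():
            return model, x_train, True  # converged on this batch
        # Add the failed configurations and their energies, then retrain.
        x_train = np.concatenate([x_train, x_query[failed]])
        y_train = np.concatenate([y_train, y_query[failed]])
    return fit_surrogate(x_train, y_train), x_train, False


if __name__ == "__main__":
    pool = np.linspace(-2.0, 2.0, 400)
    model, x_train, converged = active_learning(pool)
    print("converged:", converged)
    print("training set size:", len(x_train))
    print("MAE on full pool:", mean_absolute_error(model, pool))
```

Because the toy interpolant is exact at its training points, every failed configuration is new to the training set, so the loop must stop once the finite pool is exhausted at the latest. A real ANN offers no such guarantee, which is why the sketch also carries an explicit `max_iter` cap.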