Posted on 2021-08-04, 19:15. Authored by Michael Tynes, Wenhao Gao, Daniel J. Burrill, Enrique R. Batista, Danny Perez, Ping Yang, Nicholas Lubbers
Machine learning (ML) plays a growing role in the design and discovery
of chemicals, aiming to reduce the need to perform expensive experiments
and simulations. ML for such applications is promising but difficult,
as models must generalize to vast chemical spaces from small training
sets and must have reliable uncertainty quantification metrics to
identify and prioritize unexplored regions. Ab initio computational chemistry and chemical intuition alike often take
advantage of differences between chemical conditions, rather than
their absolute structure or state, to generate more reliable results.
We have developed an analogous comparison-based approach for ML regression,
called pairwise difference regression (PADRE), which is applicable
to arbitrary underlying learning models and operates on pairs of input
data points. During training, the model learns to predict differences
between all possible pairs of input points. During prediction, each
test point is paired with every training set point, giving rise to
a set of predictions that can be treated as a distribution: its mean
serves as the final prediction and its dispersion serves
as an uncertainty measure. Pairwise difference regression was shown
to reliably improve the performance of the random forest algorithm
across five chemical ML tasks. Additionally, the pair-derived dispersion
is both well correlated with model error and performs well in active
learning. We also show that this method is competitive with state-of-the-art
neural network techniques. Thus, pairwise difference regression is
a promising tool for candidate selection algorithms used in chemical
discovery.
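
The training and prediction procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the class and function names are hypothetical, each pair is represented by concatenating the two points with their element-wise difference, a tiny nearest-neighbour regressor stands in for the underlying model (the abstract reports results with random forests), each predicted difference is added back to the known training target to obtain one estimate per training point, and the dispersion is taken to be the population standard deviation of those estimates.

```python
import math
import statistics
from itertools import product


class NearestNeighborRegressor:
    """Stand-in base learner (1-nearest neighbour); any fit/predict regressor could be used."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Index of the stored sample closest to x in Euclidean distance.
            nearest = min(range(len(self.X)), key=lambda j: math.dist(x, self.X[j]))
            preds.append(self.y[nearest])
        return preds


def pair_features(a, b):
    # Assumed pair representation: both points plus their element-wise difference.
    return tuple(a) + tuple(b) + tuple(ai - bi for ai, bi in zip(a, b))


class PairwiseDifferenceRegressor:
    """Hypothetical wrapper that trains a base model on differences between pairs."""

    def __init__(self, base_model):
        self.base_model = base_model

    def fit(self, X, y):
        self.X_train, self.y_train = list(X), list(y)
        # All ordered pairs (i, j) of training points, including i == j.
        index_pairs = list(product(range(len(self.X_train)), repeat=2))
        pairs = [pair_features(self.X_train[i], self.X_train[j]) for i, j in index_pairs]
        diffs = [self.y_train[i] - self.y_train[j] for i, j in index_pairs]
        self.base_model.fit(pairs, diffs)
        return self

    def predict(self, X):
        means, spreads = [], []
        for x in X:
            # Pair the test point with every training point.
            pairs = [pair_features(x, xt) for xt in self.X_train]
            predicted_diffs = self.base_model.predict(pairs)
            # Each training point yields one estimate: predicted difference + known target.
            estimates = [d + yt for d, yt in zip(predicted_diffs, self.y_train)]
            means.append(statistics.fmean(estimates))
            spreads.append(statistics.pstdev(estimates))
        return means, spreads


# Toy data following y = 2x + 1 (illustrative only, not from the paper).
X_train = [(0.0,), (1.0,), (2.0,), (3.0,)]
y_train = [1.0, 3.0, 5.0, 7.0]
model = PairwiseDifferenceRegressor(NearestNeighborRegressor()).fit(X_train, y_train)
means, spreads = model.predict([(1.0,), (1.5,)])
```

Here `means` holds the final predictions and `spreads` the pair-derived uncertainty estimates, one entry per test point. Note that training on all ordered pairs grows quadratically with the training set size, so this sketch is only practical for small data sets.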