Posted on 2017-04-18, 00:00. Authored by Ying Dong, Bingren Xiang, Ding Du.
In QSAR/QSPR modeling, statistical external validation is indispensable
for assessing the predictability of a model. Before external validation,
a division algorithm is commonly used to select training sets from
chemical compound libraries or collections.
In this study, a division method based on the posterior
variance of leave-one-out cross-validation (PVLOO) of the
Gaussian process (GP) has been developed with the goal of producing
more predictive models. Four structurally diverse data sets of good
quality were collected from the literature, and models for them were
redeveloped and validated using the following training set selection
methods: four kinds of PVLOO-based training set selection methods
with three types of covariance functions (squared exponential, rational
quadratic, and neural network covariance functions), the Kennard–Stone
algorithm, and random division. The root mean squared error (RMSE)
of external validation reported for each model serves as a basis for
the final comparison. The results of this study indicate that
training sets with higher values of PVLOO yield models with statistically
better external predictability than training sets generated by the
other division methods discussed here. One possible explanation for
these findings is that the PVLOO value of the GP indicates
the mechanism diversity of a specific compound in QSAR/QSPR data sets.
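
As a rough illustration of the quantity discussed above, the following sketch computes a leave-one-out posterior variance for each compound under a GP with a squared exponential covariance function, then places the compounds with the highest values in the training set. It is not the authors' implementation: the abstract does not specify the four PVLOO-based selection rules, so the simple "top-k by variance" rule, the function names, and the hyperparameters (`length_scale`, `signal_var`, `noise_var`) are all assumptions made for illustration. The variance uses the standard closed-form GP result that the leave-one-out predictive variance for point i equals the reciprocal of the i-th diagonal entry of the inverse noisy covariance matrix.

```python
import numpy as np


def squared_exponential(X, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)


def loo_posterior_variance(X, noise_var=1e-2, **kernel_params):
    """Leave-one-out predictive variance of a GP for every row of X.

    Uses the closed form var_i = 1 / [ (K + noise_var * I)^-1 ]_ii,
    so no explicit refitting per left-out point is needed.
    """
    K = squared_exponential(X, **kernel_params)
    K_inv = np.linalg.inv(K + noise_var * np.eye(len(X)))
    return 1.0 / np.diag(K_inv)


def split_by_pvloo(X, n_train, **kwargs):
    """Hypothetical rule: the n_train highest-variance rows form the training set."""
    pv = loo_posterior_variance(X, **kwargs)
    order = np.argsort(-pv)  # descending variance
    return np.sort(order[:n_train]), np.sort(order[n_train:])


def rmse(y_true, y_pred):
    """Root mean squared error, the comparison metric named in the abstract."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy descriptor matrix (made-up values): three similar compounds and one outlier.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
pv_demo = loo_posterior_variance(X_demo)
train_idx, test_idx = split_by_pvloo(X_demo, n_train=2)
```

In this toy case the isolated compound receives the largest leave-one-out variance because no neighbor informs its prediction, which is the intuition behind treating a high PVLOO value as a sign that a compound contributes information the rest of the set lacks. In practice the descriptors would be scaled and the hyperparameters fitted to the data rather than fixed as they are here.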