Posted on 2014-07-28, 00:00. Authored by Ana L. Teixeira, Andre O. Falcao
Structurally similar molecules tend
to have similar properties,
i.e. closer molecules in the molecular space are more likely to yield
similar property values while distant molecules are more likely to
yield different values. Based on this principle, we propose the use
of a new method that takes into account the high dimensionality of
the molecular space, predicting chemical, physical, or biological
properties based on the most similar compounds with measured properties.
This methodology uses ordinary kriging coupled with three different
molecular similarity approaches (based on molecular descriptors, fingerprints,
and atom matching), creating an interpolation map over the molecular
space that is capable of predicting properties/activities for diverse
chemical data sets. The proposed method was tested on two data sets
of diverse chemical compounds collected from the literature and preprocessed.
One data set contained dihydrofolate reductase inhibition
activity data, and the other contained molecules with known aqueous
solubility. The overall predictive results using kriging for both data
sets are in line with the results obtained in the literature using typical
QSPR/QSAR approaches. However, the procedure did not involve any type
of descriptor selection or even minimal information about each problem,
suggesting that this approach is directly applicable to a large spectrum
of problems in QSAR/QSPR. Furthermore, the predictive results improve
significantly as the similarity threshold between the training and
testing compounds increases, allowing the definition of a confidence threshold
of similarity and an error estimate for each prediction. The use
of kriging for interpolation over the molecular metric space is independent
of the training data set size: no reparametrization is necessary
when compounds are added to or removed from the set, and increasing
the size of the database consequently improves the quality
of the estimations. Finally, it is shown that this model can be used
for checking the consistency of measured data and for guiding an extension
of the training set by determining the regions of the molecular space
for which new experimental measurements could be used to maximize
the model’s predictive performance.
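
As an illustration of the interpolation step described above, the following minimal Python sketch applies ordinary kriging using only pairwise distances between compounds. It is not the authors' implementation: the Tanimoto (Jaccard) distance on fingerprint bit sets, the exponential variogram, its parameters (`sill`, `rng`, `nugget`), the toy data, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def tanimoto_distance(fp_a, fp_b):
    """1 - Tanimoto similarity between two fingerprints given as sets of on-bits.
    (Assumed distance; the abstract also mentions descriptor- and
    atom-matching-based similarities.)"""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / union

def exp_variogram(h, sill=1.0, rng=0.5, nugget=0.0):
    """Exponential variogram model (an assumption); gamma(0) is forced to 0."""
    h = np.asarray(h, dtype=float)
    g = nugget + sill * (1.0 - np.exp(-h / rng))
    return np.where(h > 0, g, 0.0)

def ordinary_kriging(D_train, d_query, y, variogram=exp_variogram):
    """Predict one query value from training values y.

    D_train: (n, n) distances between training compounds (assumed distinct).
    d_query: (n,) distances from the query compound to each training compound.
    Returns (prediction, kriging variance, weights).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Ordinary kriging system: variogram block bordered by the
    # unbiasedness constraint (weights sum to 1).
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(D_train)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(d_query)
    sol = np.linalg.solve(A, b)
    w, mu = sol[:n], sol[n]      # weights and Lagrange multiplier
    pred = float(w @ y)
    var = float(w @ b[:n] + mu)  # kriging variance for this query
    return pred, var, w

# Toy data (hypothetical fingerprints and property values).
train_fps = [{1, 2, 3}, {2, 3, 4}, {1, 5}, {4, 5, 6}]
train_y = np.array([1.0, 2.0, 0.5, 3.0])
D = np.array([[tanimoto_distance(a, b) for b in train_fps] for a in train_fps])

query_fp = {2, 3, 5}
d_q = np.array([tanimoto_distance(query_fp, f) for f in train_fps])
pred, var, w = ordinary_kriging(D, d_q, train_y)
print("prediction:", pred, "kriging variance:", var)
```

In this sketch the kriging variance plays the role of a per-prediction error estimate, and a query identical to a training compound reproduces that compound's measured value. Adding or removing training compounds only changes the distance matrix; the variogram parameters are left untouched, although in practice they would be fitted to the data.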