Structural Similarity Based Kriging for Quantitative Structure Activity and Property Relationship Modeling
datasetposted on 28.07.2014 by Ana L. Teixeira, Andre O. Falcao
Datasets usually provide raw data for analysis. This raw data often comes in spreadsheet form, but can be any collection of data, on which analysis can be performed.
Structurally similar molecules tend to have similar properties, i.e. closer molecules in the molecular space are more likely to yield similar property values while distant molecules are more likely to yield different values. Based on this principle, we propose the use of a new method that takes into account the high dimensionality of the molecular space, predicting chemical, physical, or biological properties based on the most similar compounds with measured properties. This methodology uses ordinary kriging coupled with three different molecular similarity approaches (based on molecular descriptors, fingerprints, and atom matching) which creates an interpolation map over the molecular space that is capable of predicting properties/activities for diverse chemical data sets. The proposed method was tested in two data sets of diverse chemical compounds collected from the literature and preprocessed. One of the data sets contained dihydrofolate reductase inhibition activity data, and the second molecules for which aqueous solubility was known. The overall predictive results using kriging for both data sets comply with the results obtained in the literature using typical QSPR/QSAR approaches. However, the procedure did not involve any type of descriptor selection or even minimal information about each problem, suggesting that this approach is directly applicable to a large spectrum of problems in QSAR/QSPR. Furthermore, the predictive results improve significantly with the similarity threshold between the training and testing compounds, allowing the definition of a confidence threshold of similarity and error estimation for each case inferred. The use of kriging for interpolation over the molecular metric space is independent of the training data set size, and no reparametrizations are necessary when more compounds are added or removed from the set, and increasing the size of the database will consequentially improve the quality of the estimations. Finally it is shown that this model can be used for checking the consistency of measured data and for guiding an extension of the training set by determining the regions of the molecular space for which new experimental measurements could be used to maximize the model’s predictive performance.