Random Forest Models To Predict Aqueous Solubility
journal contribution
posted on 2007-01-22, 00:00 authored by David S. Palmer, Noel M. O'Boyle, Robert C. Glen, John B. O. Mitchell

Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM),
and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous
solubility, based on experimental data for 988 organic molecules. The Random Forest regression model
predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods
for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of
predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an
external test set of 330 molecules that are solid at 25 °C gave r² = 0.89 and RMSE = 0.69 log S units.
For a standard data set selected from the literature, the model performed well with respect to other documented
methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by
molecules in the MDL drug data report, on the basis of molecular descriptors selected by the regression
analysis.
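
The two figures of merit quoted above, r² and RMSE, can be illustrated with a short sketch. It assumes that r² means the squared Pearson correlation between observed and predicted log S, which the abstract does not state explicitly. The function names and the four log S values are hypothetical and are not data from the paper.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    squared_errors = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared_errors / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed definition of r2)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Hypothetical log S values, for illustration only (not from the paper).
observed_log_s = [-1.0, -2.0, -3.0, -4.0]
predicted_log_s = [-1.5, -2.0, -2.5, -4.0]

test_rmse = rmse(observed_log_s, predicted_log_s)
test_r2 = r_squared(observed_log_s, predicted_log_s)
print(f"RMSE = {test_rmse:.2f} log S units, r2 = {test_r2:.2f}")
```

RMSE is expressed in the same units as the predicted quantity, which is why the abstract reports it in log S units, whereas r² is dimensionless.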
Keywords
regression analysis, descriptor selection, prediction, 0.69 log S units, Support Vector Machines, 330 molecules, Predict Aqueous Solubility, Random Forest regression, RMSE, RF, chemical space, log molar solubility, descriptor importance, Random Forest regression model, SVM, Random Forest Models, r², PLS, QSPR models, ANN, method, MDL drug data report, Artificial Neural Networks, test sets