ci600332j_si_008.txt (213.88 kB)
Contemporary QSAR Classifiers Compared
dataset
posted on 2007-01-22, 00:00 authored by Craig L. Bruce, James L. Melville, Stephen D. Pickett, Jonathan D. Hirst
We present a comparative assessment of several state-of-the-art machine learning tools for mining drug
data, including support vector machines (SVMs) and the ensemble decision tree methods boosting, bagging,
and random forest, using eight data sets and two sets of descriptors. We demonstrate, by rigorous multiple
comparison statistical tests, that these techniques can provide consistent improvements in predictive
performance over single decision trees. However, within these methods, there is no clearly best-performing
algorithm. This motivates a more in-depth investigation into the properties of random forests. We identify
a set of parameters for the random forest that provide optimal performance across all the studied data sets.
Additionally, the tree ensemble structure of the forest may provide an interpretable model, a considerable
advantage over SVMs. We test this possibility and compare it with standard decision tree models.
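
The abstract refers to multiple comparison statistical tests but does not name them here. As a rough illustration only, the sketch below assumes a Friedman rank test, one common choice for comparing several classifiers across several data sets. The function names and the score matrix are hypothetical and are not taken from the paper or its supplementary file.

```python
# Minimal sketch of a Friedman rank statistic in plain Python.
# ASSUMPTION: the Friedman test stands in for the unnamed "multiple
# comparison statistical tests"; the paper may have used something else.

def average_ranks(scores):
    """Rank the methods within each data set (1 = best score, ties share
    the mean rank) and return each method's rank averaged over data sets.

    scores: list of rows, one row per data set, one column per method.
    """
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        # Column indices ordered from highest to lowest score.
        order = sorted(range(n_methods), key=lambda j: -row[j])
        i = 0
        while i < n_methods:
            # Extend k over the run of methods tied with position i.
            k = i
            while k + 1 < n_methods and row[order[k + 1]] == row[order[i]]:
                k += 1
            mean_rank = (i + k) / 2 + 1
            for j in order[i:k + 1]:
                totals[j] += mean_rank
            i = k + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic, without a correction for ties."""
    n = len(scores)       # number of data sets
    k = len(scores[0])    # number of methods
    ranks = average_ranks(scores)
    sum_sq = sum(r * r for r in ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


if __name__ == "__main__":
    # HYPOTHETICAL placeholder accuracies, NOT results from the paper.
    # Columns: single tree, random forest, SVM; rows: data sets.
    placeholder = [
        [0.70, 0.80, 0.79],
        [0.65, 0.75, 0.77],
        [0.72, 0.81, 0.80],
        [0.68, 0.74, 0.76],
    ]
    print("average ranks:", average_ranks(placeholder))
    print("Friedman statistic:", friedman_statistic(placeholder))
```

Under the null hypothesis that all methods perform equally, this statistic is approximately chi-square distributed with k - 1 degrees of freedom. A significant result would normally be followed by a post hoc pairwise procedure to see which methods differ.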