ci6b00753_si_003.zip (34.87 kB)
Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets
dataset
posted on 2017-07-17, 00:00 authored by Richard L. Marchese Robinson, Anna Palczewska, Jan Palczewski, Nathan Kidley
The ability to interpret the predictions made by quantitative structure–activity
relationships (QSARs) offers a number of advantages. While QSARs built
using nonlinear modeling approaches, such as the popular Random Forest
algorithm, might sometimes be more predictive than those built using
linear modeling approaches, their predictions have been perceived
as difficult to interpret. However, a growing number of approaches
have been proposed for interpreting nonlinear QSAR models in general
and Random Forest in particular. In the current work, we compare the
performance of Random Forest to that of two widely used linear modeling
approaches: linear Support Vector Machines (SVMs) (or Support Vector
Regression (SVR)) and partial least-squares (PLS). We compare their
performance in terms of their predictivity as well as the chemical
interpretability of the predictions using novel scoring schemes for
assessing heat map images of substructural contributions. We critically
assess different approaches for interpreting Random Forest models
as well as for obtaining predictions from the forest. We assess the
models on a large number of widely employed public-domain benchmark
data sets corresponding to regression and binary classification problems
of relevance to hit identification and toxicology. We conclude that
Random Forest typically yields comparable or possibly better predictive
performance than the linear modeling approaches and that its predictions
may also be interpreted in a chemically and biologically meaningful
way. In contrast to earlier work looking at interpretation of nonlinear
QSAR models, we directly compare two methodologically distinct approaches
for interpreting Random Forest models. The approaches for interpreting
Random Forest assessed in our article were implemented using open-source
programs that we have made available to the community. These programs
are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program
HeatMapWrapper (https://doi.org/10.5281/zenodo.495163) for heat map generation.
Keywords
Random Forest algorithm, nonlinear modeling approaches, modeling approaches, SVM, SVR, heat map images, public-domain benchmark data sets, prediction, Support Vector Machines, Python program HeatMapWrapper, heat map generation, nonlinear QSAR models, Benchmark Data Sets, Support Vector Regression, PLS, Random Forest, Random Forest models, performance