10.1021/ci600476r.s006
Dragos Horvath
Fanny Bonachera
Vitaly Solov'ev
Cédric Gaudin
Alexander Varnek
Stochastic versus Stepwise Strategies for Quantitative Structure-Activity Relationship Generation: How Much Effort May the Mining for Successful QSAR Models Take?
American Chemical Society
2007
applicability ranges
consensus models
GA
compound
Enabling SQS
model sets
upfront simplifications
sound strategy
Stepwise regression
SQS model sets
tractable size
ISIDA Stepwise Regression
nonlinear descriptor transformations increases
nonlinearity policy
antipodal approach
Stepwise Strategies
SQS results
nonlinear models
consensus model
model search
validation tests
Genetic Algorithm
splitting scheme
problem space
QSAR building tool
Model validation benchmarking
SR consensus equations
Effort May
Successful QSAR Models
2007-05-29
Journal contribution
https://acs.figshare.com/articles/journal_contribution/Stochastic_versus_Stepwise_Strategies_for_Quantitative_Structure_Activity_Relationship_Generation_How_Much_Effort_May_the_Mining_for_Successful_QSAR_Models_Take_sup_sup_/3004543
Descriptor selection in QSAR typically relies on a set of upfront working hypotheses in order to boil down
the initial descriptor set to a tractable size. Stepwise regression, computationally cheap and therefore widely
used in spite of its potential caveats, is most aggressive in reducing the effectively explored problem space
by adopting a greedy variable pick strategy. This work explores an antipodal approach, incarnated by an
original Genetic Algorithm (GA)-based Stochastic QSAR Sampler (SQS) that favors unbiased model search
over computational cost. Independent of a priori descriptor filtering and, most importantly, not limited to
linear models, it was benchmarked against the ISIDA Stepwise Regression (SR) tool. SQS was run
under various premises, varying the training/validation set splitting scheme, the nonlinearity policy, and the
descriptors used. With the three anti-HIV compound sets considered, repeated SQS runs sometimes generate
poorly overlapping but nevertheless equally well validating model sets. Enabling SQS to apply nonlinear
descriptor transformations increases the problem space; nevertheless, nonlinear models tend to be more
robust validators. Model validation benchmarking showed SQS to match the performance of SR or outperform
it in cases when the upfront simplifications of SR “backfire”, even though the robust SR got trapped in
local minima only once in six cases. Consensus models from large SQS model sets validate well, but not
outstandingly better than SR consensus equations. SQS is thus a robust QSAR building tool according to
standard validation tests against external sets of compounds (of the same families as used for training), but
many of its benefits and drawbacks may not yet be revealed by such tests. SQS results challenge the
traditional way of interpreting and exploiting QSAR: how should one deal with thousands of well validating models
that nonetheless provide potentially diverging applicability ranges and predicted values for external compounds?
SR does not impose such a burden on the user, but is “betting” on a single equation or a narrow consensus
model to behave properly in virtual screening a sound strategy? By posing these questions, this article will
hopefully act as an incentive for the long-haul studies needed to get them answered.
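
The contrast between a greedy stepwise pick and a stochastic search over descriptor subsets can be illustrated with a minimal sketch. This is not the authors' SQS or the ISIDA SR tool: the function names (`fit_rmse`, `stepwise_select`, `stochastic_select`), the synthetic data, the mutation-only evolutionary loop, and the use of training RMSE as the fitness score are all assumptions made for illustration, and only linear models are considered.

```python
import numpy as np


def fit_rmse(X, y, cols):
    """Least-squares fit on the chosen descriptor columns; returns training RMSE."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))


def stepwise_select(X, y, max_terms):
    """Greedy forward selection: repeatedly add the descriptor that most lowers RMSE."""
    chosen = []
    while len(chosen) < max_terms:
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = min(remaining, key=lambda j: fit_rmse(X, y, chosen + [j]))
        chosen.append(best)
    return chosen


def stochastic_select(X, y, n_terms, pop_size=20, generations=40, seed=0):
    """Toy evolutionary search: a population of descriptor subsets evolved by mutation."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = [sorted(rng.choice(n, size=n_terms, replace=False).tolist())
           for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for subset in pop:
            child = list(subset)
            # Mutation: swap one selected descriptor for a currently unselected one.
            pos = int(rng.integers(n_terms))
            candidates = [j for j in range(n) if j not in child]
            child[pos] = int(rng.choice(candidates))
            children.append(sorted(child))
        # Elitist survival: keep the fittest distinct subsets from parents and children.
        merged = {tuple(s) for s in pop + children}
        ranked = sorted(merged, key=lambda s: (fit_rmse(X, y, list(s)), s))
        pop = [list(s) for s in ranked[:pop_size]]
    return pop  # a whole set of models, best first, rather than a single equation


# Synthetic data (an assumption): activity depends on descriptors 1 and 4 only.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 10))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + 0.05 * rng.normal(size=60)

sr_model = stepwise_select(X, y, max_terms=2)
sqs_like_models = stochastic_select(X, y, n_terms=2)
print("stepwise pick:", sorted(sr_model))
print("best sampled subsets:", sqs_like_models[:3])
```

The stepwise routine returns one descriptor subset, whereas the stochastic routine returns a ranked population of subsets. On a small scale, that is the situation the abstract raises: many candidate models that a user must then choose among or combine into a consensus.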