ci600476r_si_007.pdf (34.53 kB)
Stochastic versus Stepwise Strategies for Quantitative Structure-Activity Relationship Generation: How Much Effort May the Mining for Successful QSAR Models Take?†
journal contribution
posted on 2007-05-29, 00:00 authored by Dragos Horvath, Fanny Bonachera, Vitaly Solov'ev, Cédric Gaudin, Alexander Varnek

Descriptor selection in QSAR typically relies on a set of upfront working hypotheses in order to boil down the initial descriptor set to a tractable size. Stepwise regression, computationally cheap and therefore widely used in spite of its potential caveats, is most aggressive in reducing the effectively explored problem space by adopting a greedy variable pick strategy. This work explores an antipodal approach, incarnated by an original Genetic Algorithm (GA)-based Stochastic QSAR Sampler (SQS) that favors unbiased model search over computational cost. Independent of a priori descriptor filtering and, most importantly, not limited to linear models only, it was benchmarked against the ISIDA Stepwise Regression (SR) tool. SQS was run under various premises, varying the training/validation set splitting scheme, the nonlinearity policy, and the descriptors used. With the three anti-HIV compound sets considered, repeated SQS runs sometimes generate poorly overlapping but nevertheless equally well validating model sets. Enabling SQS to apply nonlinear descriptor transformations increases the problem space; nevertheless, nonlinear models tend to be more robust validators. Model validation benchmarking showed SQS to match the performance of SR, or to outperform it in cases where the upfront simplifications of SR “backfire”, even though the robust SR got trapped in local minima only once in six cases. Consensus models from large SQS model sets validate well, but not outstandingly better than SR consensus equations. SQS is thus a robust QSAR building tool according to standard validation tests against external sets of compounds (of the same families as used for training), but many of its benefits and drawbacks may not yet be revealed by such tests. SQS results are a challenge to the traditional way of interpreting and exploiting QSAR: how to deal with thousands of well validating models that nonetheless provide potentially diverging applicability ranges and predicted values for external compounds? SR does not impose such a burden on the user, but is “betting” on a single equation or a narrow consensus model to behave properly in virtual screening a sound strategy? By posing these questions, this article will hopefully act as an incentive for the long-haul studies needed to get them answered.
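The "greedy variable pick strategy" attributed to stepwise regression above can be illustrated with a minimal sketch. This is a generic forward-selection loop on synthetic data, not the ISIDA SR tool and not the SQS sampler; the function names (`fit_ols`, `forward_stepwise`, `make_toy_data`), the stopping rule, and the toy data are all assumptions made for illustration.

```python
import random


def fit_ols(columns, y):
    """Least-squares fit of y on an intercept plus the given columns.

    Returns the residual sum of squares (RSS), or infinity if the
    normal equations are singular (e.g. collinear descriptors).
    """
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Augmented normal equations [X^T X | X^T y].
    a = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(rows, y))]
         for j in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return float("inf")
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    coeffs = [a[j][p] / a[j][j] for j in range(p)]
    return sum((yi - sum(b * x for b, x in zip(coeffs, r))) ** 2
               for r, yi in zip(rows, y))


def forward_stepwise(descriptors, y, max_terms=3, min_gain=0.05):
    """Greedy forward selection (hypothetical stopping rule).

    At each step, add the single descriptor that lowers the RSS most;
    stop when the relative RSS improvement is at most `min_gain`.
    Earlier picks are never revisited, which is what makes it greedy.
    """
    selected = []
    best_rss = fit_ols([], y)  # intercept-only baseline
    while len(selected) < max_terms:
        candidates = [j for j in range(len(descriptors)) if j not in selected]
        if not candidates:
            break
        rss, j = min(
            (fit_ols([descriptors[k] for k in selected + [j]], y), j)
            for j in candidates
        )
        if best_rss - rss <= min_gain * best_rss:
            break
        selected.append(j)
        best_rss = rss
    return selected, best_rss


def make_toy_data(n_samples=50, n_descriptors=6, seed=0):
    """Synthetic data: the activity depends only on descriptors 0 and 3."""
    rng = random.Random(seed)
    descriptors = [[rng.uniform(-1, 1) for _ in range(n_samples)]
                   for _ in range(n_descriptors)]
    y = [2.0 * descriptors[0][i] - 3.0 * descriptors[3][i] + rng.gauss(0, 0.05)
         for i in range(n_samples)]
    return descriptors, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(forward_stepwise(X, y))
```

Because each step commits to the locally best descriptor, the loop evaluates only a small fraction of all possible descriptor subsets. That is the source of both its low cost and the risk of local minima mentioned in the abstract; a stochastic sampler trades that economy for a broader search.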
Keywords
applicability ranges; consensus models; GA; compound; Enabling SQS; model sets; upfront simplifications; sound strategy; Stepwise regression; SQS model sets; tractable size; ISIDA Stepwise Regression; nonlinear descriptor transformations increases; nonlinearity policy; antipodal approach; Stepwise Strategies; SQS results; nonlinear models; consensus model; model search; validation tests; Genetic Algorithm; splitting scheme; problem space; QSAR building tool; Model validation benchmarking; SR consensus equations; Effort May; Successful QSAR Models