posted on 2016-02-18, 13:25 authored by Danielle Newby, Alex. A. Freitas, Taravat Ghafourian
There are currently thousands of molecular descriptors that can be calculated
to represent a chemical compound. Utilizing all molecular descriptors
in Quantitative Structure–Activity Relationships (QSAR) modeling
can result in overfitting, decreased interpretability, and thus reduced
model performance. Feature selection methods can overcome some of
these problems by drastically reducing the number of molecular descriptors
and selecting the molecular descriptors relevant to the property being
predicted. Decision trees such as C&RT have an embedded feature
selection algorithm, but it can be inadequate: further down the tree
there are fewer compounds available for descriptor selection, and
therefore descriptors that are not optimal may be selected.
In this work we compare two broad approaches for
feature selection: (1) a “two-stage” feature selection
procedure, where a pre-processing feature selection method selects
a subset of descriptors, and then classification and regression trees
(C&RT) selects descriptors from this subset to build a decision
tree; (2) a “one-stage” approach where C&RT is used
as the only feature selection technique. These methods were applied
in order to improve prediction accuracy of QSAR models for oral absorption.
Additionally, this work utilizes misclassification costs in model
building to counter the class imbalance of the oral absorption data
sets, which contain more highly absorbed than poorly absorbed compounds. In
most cases the two-stage approach, with pre-processing feature selection,
had higher model accuracy than the one-stage approach. Using
the top 20 molecular descriptors from the random forest predictor
importance method gave the most accurate C&RT classification model.
The molecular descriptors selected by the five filter feature selection
methods have been compared in relation to oral absorption. In conclusion,
the use of filter pre-processing feature selection methods and misclassification
costs produces models with better interpretability and predictability
for the prediction of oral absorption.
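
The effect of misclassification costs on an imbalanced data set can be illustrated with a minimal sketch. It shows the standard cost-sensitive rule for labelling a decision tree leaf: choose the class that minimizes the total cost of the training compounds it would misclassify. The function name `leaf_label`, the leaf counts, and the 3:1 cost ratio are hypothetical values chosen for illustration; they are not taken from the study.

```python
def leaf_label(counts, costs):
    """Return the class that minimises total misclassification cost in a leaf.

    counts: dict mapping true class -> number of training compounds in the leaf
    costs:  dict mapping (true_class, predicted_class) -> cost of that error
    """
    def total_cost(predicted):
        # Sum the cost of every compound whose true class differs from the label.
        return sum(n * costs.get((true, predicted), 0.0)
                   for true, n in counts.items() if true != predicted)

    # sorted() makes tie-breaking deterministic (alphabetical order).
    return min(sorted(counts), key=total_cost)


# Hypothetical leaf: 6 highly absorbed ("high") and 4 poorly absorbed ("low") compounds.
leaf = {"high": 6, "low": 4}

# Equal costs: the majority class wins.
equal_costs = {("high", "low"): 1.0, ("low", "high"): 1.0}

# Assumed ratio: predicting "high" for a poorly absorbed compound costs 3x the reverse error.
skewed_costs = {("high", "low"): 1.0, ("low", "high"): 3.0}

print(leaf_label(leaf, equal_costs))
print(leaf_label(leaf, skewed_costs))
```

With equal costs the leaf is labelled with the majority class, so the minority (poorly absorbed) compounds in it are misclassified. Raising the cost of that error flips the label, which is how a higher cost on the under-represented class can offset the imbalance.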