Posted on 2007-07-23, 00:00. Authored by Lothar Terfloth, Bruno Bienfait, Johann Gasteiger.
A data set of 379 drugs and drug analogs that are metabolized by human cytochrome P450 (CYP) isoforms
3A4, 2D6, and 2C9, respectively, was studied. A series of descriptor sets directly calculable from the
constitution of these drugs was systematically investigated for their power to classify a compound
by the CYP isoform that metabolizes it. In a four-step build-up process, eventually 303 different descriptor
components were investigated for 146 compounds of a training set using various model-building methods,
such as multinomial logistic regression, decision trees, and support vector machines (SVM). Automatic variable
selection algorithms were used to reduce the number of descriptors. A comprehensive scheme of
cross-validation (CV) experiments was applied to assess the robustness and reliability of the four models
developed. In addition, the predictive power of the four models presented in this paper was assessed by
predicting an external validation data set of 233 compounds. The best model has a leave-one-out (LOO)
cross-validated predictivity of 89% and gives 83% correct predictions for the external validation data set.
For our favored model, we showed that the way a data set is split into a training and a test data set
strongly influences the predictivity.
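
As an illustration of the two evaluation ideas mentioned above, leave-one-out cross-validation and the dependence of predictivity on the training/test split, the following is a minimal sketch in plain Python. It is not the authors' procedure: the nearest-centroid classifier, the two-dimensional synthetic descriptors, and all function names (`nearest_centroid_predict`, `loo_accuracy`, `split_accuracy`, `make_toy_data`) are assumptions chosen only to keep the example self-contained. The isoform names are used purely as class labels for toy data.

```python
import random


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose mean descriptor vector lies closest to x.

    Hypothetical stand-in classifier; the paper uses other model types.
    """
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        centroid = [value / counts[label] for value in acc]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y):
    """Leave-one-out CV: hold out each compound once, train on the rest."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            hits += 1
    return hits / len(X)


def split_accuracy(X, y, n_train, seed):
    """Accuracy on the held-out part of one random training/test split."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    train, test = idx[:n_train], idx[n_train:]
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    hits = sum(
        nearest_centroid_predict(train_X, train_y, X[i]) == y[i] for i in test
    )
    return hits / len(test)


def make_toy_data(n_per_class=30, seed=0):
    """Synthetic 2-D 'descriptors' for three classes (not real compounds)."""
    rng = random.Random(seed)
    centers = {"3A4": (0.0, 0.0), "2D6": (2.0, 0.0), "2C9": (0.0, 2.0)}
    X, y = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            X.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            y.append(label)
    return X, y


X, y = make_toy_data()
loo = loo_accuracy(X, y)
# Repeat the random split with different seeds to see how much the
# estimated accuracy depends on which compounds land in the training set.
split_scores = [split_accuracy(X, y, n_train=35, seed=s) for s in range(20)]
print(f"LOO accuracy: {loo:.2f}")
print(f"Random-split accuracy: min {min(split_scores):.2f}, "
      f"max {max(split_scores):.2f}")
```

The spread between the minimum and maximum random-split accuracy gives a rough sense of how sensitive a single training/test partition can be, which is the kind of effect the abstract reports for the favored model.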