Development of in Silico Models for Predicting P‑Glycoprotein Inhibitors Based on a Two-Step Approach for Feature Selection and Its Application to Chinese Herbal Medicine Screening
journal contributionposted on 05.10.2015, 00:00 by Ming Yang, Jialei Chen, Xiufeng Shi, Liwen Xu, Zhijun Xi, Lisha You, Rui An, Xinhong Wang
P-glycoprotein (P-gp) is regarded as an important factor in determining the ADMET (absorption, distribution, metabolism, elimination, and toxicity) characteristics of drugs and drug candidates. Successful prediction of P-gp inhibitors can thus lead to an improved understanding of the underlying mechanisms of both changes in the pharmacokinetics of drugs and drug–drug interactions. Therefore, there has been considerable interest in the development of in silico modeling of P-gp inhibitors in recent years. Considering that a large number of molecular descriptors are used to characterize diverse structural moleculars, efficient feature selection methods are required to extract the most informative predictors. In this work, we constructed an extensive available data set of 2428 molecules that includes 1518 P-gp inhibitors and 910 P-gp noninhibitors from multiple resources. Importantly, a two-step feature selection approach based on a genetic algorithm and a greedy forward-searching algorithm was employed to select the minimum set of the most informative descriptors that contribute to the prediction of P-gp inhibitors. To determine the best machine learning algorithm, 18 classifiers coupled with the feature selection method were compared. The top three best-performing models (flexible discriminant analysis, support vector machine, and random forest) and their ensemble model using respectively only 3, 9, 7, and 14 descriptors achieve an overall accuracy of 83.2%–86.7% for the training set containing 1040 compounds, an overall accuracy of 82.3%–85.5% for the test set containing 1039 compounds, and a prediction accuracy of 77.4%–79.9% for the external validation set containing 349 compounds. The models were further extensively validated by DrugBank database (1890 compounds). The proposed models are competitive with and in some cases better than other published models in terms of prediction accuracy and minimum number of descriptors. Applicability domain then was addressed by developing an ensemble classification model to obtain more reliable predictions. Finally, we employed these models as a virtual screening tool for identifying potential P-gp inhibitors in Traditional Chinese Medicine Systems Pharmacology (TCMSP) database containing a total of 13 051 unique compounds from 498 herbs, resulting in 875 potential P-gp inhibitors and 15 inhibitor-rich herbs. These predictions were partly supported by a literature search and are valuable not only to develop novel P-gp inhibitors from TCM in the early stages of drug development, but also to optimize the use of herbal remedies.