Posted on 2003-05-02, 00:00; authored by Jörg K. Wegner and Andreas Zell
The paper describes a fast and flexible descriptor selection method using a genetic algorithm variant (GA-SEC). The relevance of the descriptors is measured using Shannon entropy (SE) and differential Shannon
entropy (DSE), which have very low memory requirements and allow the processing of huge data sets.
A small number of the most important descriptors is then used automatically to build a value prediction
model. These descriptors are not linear combinations of other descriptors, but transparent,
pure descriptors. We used an artificial neural network (ANN) model to predict the aqueous solubility logS
and the octanol/water partition coefficient logP. The logS data set was divided into a training set of 1016
compounds and a test set of 253 compounds. A correlation coefficient of 0.93 and an empirical standard
deviation of 0.54 were achieved. The logP data set was divided into a training set of 1853 compounds and
a test set of 138 compounds. A correlation coefficient of 0.92 and an empirical standard deviation of 0.44
were achieved.
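
To illustrate why entropy-based relevance measures need so little memory, the following is a minimal sketch of a histogram-based Shannon entropy and a differential Shannon entropy for a single descriptor. It is not the authors' implementation. The function names, the equal-width binning, the default bin count, and the DSE form used here (entropy of the pooled data minus the mean entropy of the two subsets) are assumptions that the abstract does not specify.

```python
import math
from collections import Counter


def shannon_entropy(values, lo, hi, bins=10):
    """Shannon entropy in bits of a descriptor histogram over [lo, hi].

    Assumption: equal-width bins. Only `bins` counters are stored, so the
    values can be streamed without holding the data set in memory.
    """
    if hi <= lo:
        return 0.0  # a constant descriptor carries no information
    width = (hi - lo) / bins
    counts = Counter()
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def differential_shannon_entropy(a, b, bins=10):
    """Assumed DSE form: SE(pooled) - (SE(a) + SE(b)) / 2.

    Both subsets and the pooled data share one value range so that
    their histograms are comparable.
    """
    a, b = list(a), list(b)
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    se_a = shannon_entropy(a, lo, hi, bins)
    se_b = shannon_entropy(b, lo, hi, bins)
    se_ab = shannon_entropy(a + b, lo, hi, bins)
    return se_ab - (se_a + se_b) / 2
```

Under these assumptions, a descriptor whose values are distributed identically in both subsets yields a DSE of zero, while a descriptor that separates the subsets yields a positive DSE, which is what makes it usable as a relevance score.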