Selecting Optimally Diverse Compounds from Structure Databases: A Validation Study of Two-Dimensional and Three-Dimensional Molecular Descriptors
journal contributionposted on 11.04.1997, 00:00 by Hans Matter
The efficiency of the drug discovery process can be significantly improved using design techniques to maximize the diversity of structure databases or combinatorial libraries. Here, several physicochemical descriptors were investigated to quantify molecular diversity. Based on the 2D or 3D topological similarity of molecules, the relationship between physicochemical metrics and biological activity was studied to find valid descriptors. Several compounds were selected using those descriptors from a database containing diverse templates and 55 biological classes. It was evaluated whether the obtained subsets represent all biological properties and structural variations of the original database. In addition, hierarchical cluster analyses were used to group molecules from the parent database, which should have similar biological properties. Using various sets of structurally similar molecules, it was possible to derive quantitative measures for compound similarities in relation to biological properties. A similarity radius for 2D fingerprints and molecular steric fields was estimated; compounds within this radius of another molecule were shown to have comparable biological properties. This study demonstrates that 2D fingerprints alone or in combination with other metrics as the primary descriptor allow to handle global diversity. In addition, standard atom-pair descriptors or molecular steric fields can be used to correlate structural diversity with biological activity. Hence, the latter two descriptors can be classified as secondary descriptors useful for analog library design, while 2D fingerprints are applicable to design a general library for lead discovery. Based on these findings, an optimally diverse subset containing only 38% of the entire IC93 database was generated using 2D fingerprints. Here no structure is more similar than 0.85 to any other (Tanimoto coefficient), but all biological classes were selected. This reduction of redundancy led to a child database with the same physicochemical diversity space, which contains the same information as the original database.