ci500131x_si_001.xlsx (1.33 MB)
Benchmark Data Sets for Structure-Based Computational Target Prediction
dataset
posted on 2014-08-25, 00:00 authored by Karen T. Schomburg, Matthias Rarey
Structure-based
computational target prediction methods identify
potential targets for a bioactive compound. Methods based on protein–ligand
docking still face many challenges, the greatest of which is probably
the ranking of true targets in a large data set of protein structures.
Currently, no standard data sets for evaluation exist, which makes it cumbersome
to compare methods and to demonstrate improvements. Therefore,
we propose two data sets and evaluation strategies for a meaningful
evaluation of new target prediction methods, i.e., a small data set
consisting of three target classes for detailed proof-of-concept and
selectivity studies and a large data set consisting of 7992 protein
structures and 72 drug-like ligands allowing statistical evaluation
with performance metrics on a drug-like chemical space. Both data
sets are built from openly available resources, and any information
needed to perform the described experiments is reported. We describe
the composition of the data sets, the setup of screening experiments,
and the evaluation strategy. Performance metrics capable of measuring
the early recognition of enrichments, such as AUC, BEDROC, and NSLR, are
proposed. We apply a sequence-based target prediction method to the
large data set to analyze how many nontrivial evaluation cases it contains.
The proposed data sets are used for method evaluation of our new inverse
screening method iRAISE. The small data set reveals
the method’s capability and limitations to selectively distinguish
between rather similar protein structures. The large data set simulates
real target identification scenarios. iRAISE achieves
excellent or good enrichment in 55% of cases, a median AUC of 0.67, and RMSDs
below 2.0 Å for 74%, and it predicted the first true target
within the top 2% of the protein data set of about 8000 structures
in 59 out of 72 cases.
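
The ranking metrics named above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes each screening result is a list of 0/1 labels ordered from best to worst score, with 1 marking a true target. The function names, the default `alpha=20.0`, and the use of a discrete min-max normalization for the BEDROC-style score are assumptions made for illustration; closed-form BEDROC expressions in the literature can give slightly different values, and NSLR is not sketched here.

```python
import math


def roc_auc(ranked_labels):
    """ROC AUC for a ranked list (best first); 1 = true target, 0 = non-target."""
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true target and one non-target")
    negatives_seen = 0
    misordered = 0  # (target, non-target) pairs where the non-target ranks ahead
    for label in ranked_labels:
        if label:
            misordered += negatives_seen
        else:
            negatives_seen += 1
    return 1.0 - misordered / (n_pos * n_neg)


def bedroc_like(ranked_labels, alpha=20.0):
    """Early-recognition score in [0, 1]: exponentially weighted rank sum,
    rescaled between the worst and best possible rankings (assumed variant)."""
    total = len(ranked_labels)
    n_pos = sum(ranked_labels)
    if n_pos == 0 or n_pos == total:
        raise ValueError("need at least one true target and one non-target")

    def weighted_sum(ranks):
        # Ranks are 1-based; early ranks contribute weights close to 1.
        return sum(math.exp(-alpha * r / total) for r in ranks)

    observed = weighted_sum(i + 1 for i, lab in enumerate(ranked_labels) if lab)
    best = weighted_sum(range(1, n_pos + 1))
    worst = weighted_sum(range(total - n_pos + 1, total + 1))
    return (observed - worst) / (best - worst)


def first_hit_in_top(ranked_labels, fraction=0.02):
    """True if the first true target falls within the top `fraction` of the list."""
    cutoff = math.ceil(fraction * len(ranked_labels))
    return any(ranked_labels[:cutoff])


if __name__ == "__main__":
    example = [1, 0, 1, 0]  # hypothetical ranking, not data from the paper
    print(roc_auc(example), bedroc_like(example))
```

For a data set of 7992 structures, `first_hit_in_top` with `fraction=0.02` uses a cutoff of 160 ranks, which matches the "top 2%" criterion only under the rounding assumed here.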