posted on 2020-09-03, 16:35 authored by Antonio Llinas, Ioana Oprisiu, Alex Avdeef
Ten years ago, we issued an open prediction challenge to the cheminformatics
community: would participants be able to predict the equilibrium intrinsic
solubilities of 32 druglike molecules using only a high-precision
training set of 100 compounds, all measured in one laboratory with the
CheqSol instrument? The “solubility challenge” was a
widely recognized success and spurred many discussions about the prediction
methods and quality of data. We recently revisited the competition
and posed a different challenge to the community: not a blind test
this time, but one using a larger test set of molecules,
gathered and curated from published sources (mostly “gold standard”
saturation shake-flask measurements), where the average interlaboratory
reproducibility for the molecules was estimated to be ∼0.17
log unit. A second test set was also included, comprising “contentious”
molecules whose reported (mostly shake-flask) solubilities had a
higher average uncertainty, ∼0.62 log unit. In the second competition,
the participants were invited to use their own training sets, provided
that the training sets did not contain any of the test set molecules.
We were motivated to revisit the competition to (1) examine to what
extent computational methods had improved in 10 years, (2) verify
that data quality may not be the main limiting factor in the accuracy
of the prediction method, and (3) seek a relationship between
the makeup of the training set data and the prediction outcome.
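
As a rough illustration of the entry rule (no test set molecules in the training set) and of how a prediction error might be compared with the quoted data uncertainty, here is a minimal Python sketch. It is not the procedure used in the competition. The record layout, the `id` key, and the function names are hypothetical; only the 0.17 log unit figure comes from the text above.

```python
import math

# Average interlaboratory reproducibility quoted above for the first test set.
INTERLAB_SD_LOG_UNITS = 0.17


def remove_test_overlap(training, test_ids):
    """Return the training records whose identifier is not in the test set.

    Assumption: each record is a dict with an "id" key holding a
    canonical identifier string (for example an InChIKey).
    Identifiers are compared after trimming whitespace and upper-casing.
    """
    blocked = {i.strip().upper() for i in test_ids}
    return [rec for rec in training if rec["id"].strip().upper() not in blocked]


def rmse(predicted, measured):
    """Root-mean-square error between two equal-length lists of log S values."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def error_to_noise_ratio(predicted, measured, noise=INTERLAB_SD_LOG_UNITS):
    """Ratio of the prediction RMSE to the experimental reproducibility."""
    return rmse(predicted, measured) / noise
```

A ratio well above 1 would suggest that the prediction error exceeds the experimental noise, which is the kind of comparison behind motivation (2).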