ci0c00701_si_001.xlsx (818.27 kB)

Findings of the Second Challenge to Predict Aqueous Solubility

dataset

posted on 2020-09-03, 16:35 authored by Antonio Llinas, Ioana Oprisiu, Alex Avdeef

Ten years ago, we issued an open prediction challenge to the cheminformatics community: would participants be able to predict the equilibrium intrinsic solubilities of 32 druglike molecules using only a high-precision set of 100 compounds (measured with the CheqSol instrument in a single laboratory) as a training set? The “solubility challenge” was a widely recognized success and spurred many discussions about prediction methods and data quality. We recently revisited the competition and posed a different challenge to the community: not a blind test this time, but one using a larger test set of molecules, gathered and curated from published sources (mostly “gold standard” saturation shake-flask measurements), for which the average interlaboratory reproducibility was estimated to be ∼0.17 log unit. A second test set was also included, comprising “contentious” molecules whose reported (mostly shake-flask) solubilities had a higher average uncertainty, ∼0.62 log unit. In the second competition, participants were invited to use their own training sets, provided that these did not contain any of the test set molecules. We were motivated to revisit the competition to (1) examine to what extent computational methods had improved in 10 years, (2) verify that data quality may not be the main limiting factor in the accuracy of the prediction method, and (3) seek a relationship between the makeup of the training set data and the prediction outcome.