Posted on 2021-11-01, authored by Dong Chen, Jiaxin Zheng, Guo-Wei Wei, Feng Pan
The construction of appropriate representations remains essential for molecular predictions because of the intricate complexity of molecules. Additionally, generating labeled data for supervised learning in molecular sciences is often expensive and ethically constrained, leading to small, diverse, and therefore challenging data sets. In this work, we develop a self-supervised learning approach to pretrain models on over 700 million unlabeled molecules drawn from multiple databases. The intrinsic chemical logic learned in this way enables the extraction of predictive representations from task-specific molecular sequences through a fine-tuning process. To understand the importance of self-supervised learning from unlabeled molecules, we assemble three models pretrained on different combinations of databases. Moreover, we propose a protocol based on data traits that automatically selects the optimal model for a specific task. To validate the proposed method, we consider 10 benchmarks and 38 virtual screening data sets, and extensive validation indicates that it achieves superb performance.
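As an illustration only, and not the authors' implementation, the kind of self-supervised pretraining on molecular sequences described above can be sketched as masked-token prediction on SMILES strings followed by reuse of the learned hidden states as representations for fine-tuning. The tokenizer, model size, masking rate, and toy molecule list below are assumptions chosen purely to make the idea concrete.

```python
# Illustrative sketch of masked-token pretraining on SMILES strings.
# NOT the authors' code: vocabulary, model size, and masking rate are
# assumptions chosen only to make the pretraining idea concrete.
import random
import torch
import torch.nn as nn

# Toy character-level vocabulary; a real system would use a chemically
# aware tokenizer and hundreds of millions of unlabeled molecules.
SMILES = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]
chars = sorted({ch for s in SMILES for ch in s})
PAD, MASK = 0, 1
stoi = {ch: i + 2 for i, ch in enumerate(chars)}
vocab_size = len(stoi) + 2

def encode(s, max_len=16):
    ids = [stoi[ch] for ch in s][:max_len]
    return ids + [PAD] * (max_len - len(ids))

class SmilesEncoder(nn.Module):
    """Small Transformer encoder with a masked-token prediction head."""
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        h = self.encoder(self.embed(x))
        # logits are used for pretraining; h can be pooled into a
        # molecular representation for downstream fine-tuning.
        return self.head(h), h

def mask_tokens(ids, p=0.15):
    """Randomly replace a fraction of non-padding tokens with MASK."""
    ids = ids.clone()
    labels = torch.full_like(ids, -100)  # -100 is ignored by CrossEntropyLoss
    for i in range(ids.size(0)):
        for j in range(ids.size(1)):
            if ids[i, j] != PAD and random.random() < p:
                labels[i, j] = ids[i, j]
                ids[i, j] = MASK
    return ids, labels

model = SmilesEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

batch = torch.tensor([encode(s) for s in SMILES])
for step in range(10):  # a real pretraining run would loop over huge corpora
    masked, labels = mask_tokens(batch)
    if (labels == -100).all():
        continue  # nothing was masked this step; skip to avoid an empty loss
    logits, _ = model(masked)
    loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# After pretraining, the hidden states (e.g., mean-pooled over tokens) serve
# as representations that can be fine-tuned on a small labeled data set.
```

In this sketch, fine-tuning would simply attach a small prediction head to the pooled encoder output and continue training on the task-specific labeled molecules; how the actual models are assembled and selected by data traits is described in the paper itself.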