Posted on 2024-06-25, 14:34. Authored by Jungwoo Kim, Woojae Chang, Hyunjun Ji, InSuk Joung.
We examined pretraining tasks that leverage abundant labeled data to enhance molecular representation learning for downstream tasks, with a particular focus on graph transformers for predicting ADMET properties. Our investigation revealed limitations
in previous pretraining tasks and identified more meaningful training
targets, ranging from 2D molecular descriptors to extensive quantum
chemistry simulations. These data were seamlessly integrated into
supervised pretraining tasks. Our pretraining strategy, combined with multitask learning, outperforms conventional methods, achieving state-of-the-art results in 7 of the 22 ADMET tasks in the Therapeutics Data Commons while using a single encoder shared across
all tasks. Our approach underscores the effectiveness of the learned molecular representations and highlights their potential to scale further as larger data sets are leveraged, marking a significant advancement in this domain.
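The abstract does not provide implementation details, but the shared-encoder multitask pretraining setup it describes can be illustrated with a minimal, hypothetical sketch. This is an assumption-laden PyTorch example, not the authors' code: the SharedEncoder MLP over precomputed features stands in for the actual graph transformer, and the "rdkit_descriptors" and "qm_properties" heads, their dimensions, and the synthetic batch are placeholders for the 2D-descriptor and quantum-chemistry pretraining targets.

# Minimal sketch of supervised multitask pretraining with a shared encoder.
# Assumptions (not from the paper): PyTorch, an MLP stub in place of the graph
# transformer, and synthetic tensors in place of real molecular graphs/labels.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Stand-in for the graph transformer: maps molecule features to an embedding."""
    def __init__(self, in_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class MultitaskModel(nn.Module):
    """One shared encoder feeding a separate linear head per pretraining target."""
    def __init__(self, encoder: SharedEncoder, task_dims: dict, hidden_dim: int = 256):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_dim, dim) for name, dim in task_dims.items()}
        )

    def forward(self, x: torch.Tensor) -> dict:
        z = self.encoder(x)  # shared representation reused by every task head
        return {name: head(z) for name, head in self.heads.items()}

# Hypothetical pretraining targets: 2D descriptors and quantum-chemistry properties.
task_dims = {"rdkit_descriptors": 200, "qm_properties": 12}
model = MultitaskModel(SharedEncoder(in_dim=1024), task_dims)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One pretraining step on a synthetic batch (real inputs would be molecular graphs).
x = torch.randn(32, 1024)
targets = {name: torch.randn(32, dim) for name, dim in task_dims.items()}
preds = model(x)
loss = sum(loss_fn(preds[name], targets[name]) for name in task_dims)  # summed multitask loss
optimizer.zero_grad()
loss.backward()
optimizer.step()

After pretraining, the same shared encoder would be reused (fine-tuned or frozen) with new heads for the downstream ADMET tasks; the per-task head dimensions and loss weighting here are illustrative choices only.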