Posted on 2020-07-02, 03:03. Authored by Paul Morris, Rachel St. Clair, William Edward Hahn, Elan Barenholtz
Cheminformatics aims to assist in chemistry applications that depend
on molecular interactions, structural characteristics, and functional
properties. The arrival of deep learning and the abundance of easily
accessible chemical data from repositories like PubChem have enabled
advancements in computer-aided drug discovery. Virtual high-throughput
screening (vHTS) is one such technique that integrates chemical domain
knowledge to perform in silico biomolecular simulations, but prediction
of binding affinity is constrained by the limited availability of ground-truth
binding assay results. Here, text representations of 83 000 000
molecules are leveraged to perform single-target binding affinity
prediction directly on the outcome of screening assays. The embedding
of an end-to-end transformer neural network, trained to encode the
structural characteristics of a molecule via a text-based translation
task, is repurposed through transfer learning to classify binding
affinity to single targets with few known binding compounds. We quantify
the observed increase in AUC on binding prediction tasks between classifiers
trained on the translation embedding versus those using an untrained
embedding. Visualization of the embedding space reveals an organization
by structural and functional properties that aids binding prediction.
The pretrained transformer, data, and associated software to extract
embeddings are made publicly available at https://github.com/mpcrlab/MolecularTransformerEmbeddings.
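
The AUC comparison described above can be illustrated with a minimal sketch. It computes ROC AUC as the probability that a randomly chosen binder is scored above a randomly chosen non-binder, then takes the difference between two classifiers. The function name `roc_auc`, the variable names, and all labels and scores are hypothetical toy values chosen for illustration; they are not results or code from the paper or its repository.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive is scored higher (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = binds the target, 0 = does not bind.
labels = [1, 1, 1, 0, 0, 0, 0, 0]

# Hypothetical classifier scores; stand-ins for a classifier trained on the
# pretrained translation embedding and one trained on an untrained embedding.
scores_pretrained = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1, 0.05]
scores_untrained = [0.6, 0.3, 0.4, 0.5, 0.7, 0.2, 0.1, 0.35]

auc_pretrained = roc_auc(labels, scores_pretrained)
auc_untrained = roc_auc(labels, scores_untrained)

# The quantity of interest: the change in AUC attributable to the embedding.
auc_gain = auc_pretrained - auc_untrained
print(f"pretrained={auc_pretrained:.3f} untrained={auc_untrained:.3f} gain={auc_gain:.3f}")
```

The pairwise form is quadratic in the number of compounds, which is acceptable for targets with few known binders; a rank-based implementation would be preferable for larger assays.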