Posted on 2023-05-15, 17:09. Authored by Rui Hou, Chao Xie, Yuhan Gui, Gang Li, Xiaoyu Li
The DNA-encoded library (DEL) is a powerful ligand discovery
technology
that has been widely adopted in the pharmaceutical industry. DEL selections
are typically performed with a purified protein target immobilized
on a matrix or in solution phase. Recently, DELs have also been used
to interrogate targets in complex biological environments,
such as membrane proteins on live cells. However, due to the complex
landscape of the cell surface, the selection inevitably involves significant
nonspecific interactions, and the selection data are much noisier
than those from selections with purified proteins, making reliable hit identification
highly challenging. Researchers have developed several approaches
to denoise DEL datasets, but it remains unclear whether they are suitable
for cell-based DEL selections. Here, we report the proof-of-principle
of a new machine-learning (ML)-based approach to process cell-based
DEL selection datasets by using a Maximum A Posteriori (MAP) estimation
loss function, a probabilistic framework that can account for and
quantify uncertainties of noisy data. We applied the approach to a
DEL selection dataset, where a library of 7,721,415 compounds was
selected against a purified carbonic anhydrase 2 (CA-2) and a cell
line expressing the membrane protein carbonic anhydrase 12 (CA-12).
The extended-connectivity fingerprint (ECFP)-based regression model
using the MAP loss function was able to identify true binders as
well as reliable structure–activity relationships (SARs) from the
noisy cell-based selection datasets. In addition, the regularized
enrichment metric (known as MAP enrichment) could also be calculated
directly without relying on a specific machine-learning model, effectively
suppressing low-confidence outliers and enhancing the signal-to-noise
ratio. Future applications of this method will focus on de novo ligand
discovery from cell-based DEL selections.
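
The abstract does not give the formula for MAP enrichment, so the following is only a minimal sketch of how a regularized enrichment of this kind might behave. It assumes Poisson-distributed read counts and a Gamma prior on each compound's enrichment. The function names, the prior parameters `alpha` and `beta`, and the read depths are all hypothetical and are not taken from the paper. Under these assumptions the MAP estimate is the mode of the Gamma posterior, which pulls low-count compounds toward the prior and leaves well-sampled compounds almost unchanged.

```python
def map_enrichment(count, expected, alpha=2.0, beta=1.0):
    """MAP estimate of enrichment under an assumed Poisson-Gamma model.

    count    -- observed sequencing reads for one compound
    expected -- reads expected under uniform sampling
                (total_reads / library_size)
    alpha, beta -- assumed Gamma prior (shape, rate); the defaults put
                   the prior mode at 1.0, i.e. "no enrichment"
    """
    if count < 0 or expected <= 0:
        raise ValueError("count must be >= 0 and expected must be > 0")
    if alpha <= 1 or beta <= 0:
        raise ValueError("need alpha > 1 and beta > 0 for a positive mode")
    # The posterior is Gamma(alpha + count, beta + expected);
    # its mode is the MAP estimate of the enrichment.
    return (count + alpha - 1.0) / (expected + beta)


def naive_enrichment(count, expected):
    """Unregularized enrichment: observed reads over expected reads."""
    return count / expected


library_size = 7_721_415  # library size stated in the abstract

# Hypothetical sequencing depths (not data from the study).
for total_reads, count in [(7_721_415, 3), (772_141_500, 300)]:
    expected = total_reads / library_size
    print(
        f"expected={expected:.0f} count={count} "
        f"naive={naive_enrichment(count, expected):.2f} "
        f"map={map_enrichment(count, expected):.2f}"
    )
```

In this sketch both compounds have a naive enrichment of 3.0, but the one supported by only three reads is shrunk to 2.0, while the one supported by 300 reads stays near 2.98. This is the sense in which a regularized metric can suppress low-confidence outliers.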