Can a test description predict which molecules will work?

In this post, I explain why we thought the written description of an assay might help us make useful predictions before the assay has much data.

The problem

In drug discovery, we often want to know whether a molecule will be active in a biological test, or assay. The difficulty is that the most interesting assays are often the newest ones, and new assays usually arrive with very little data. That creates a familiar bottleneck: the question matters, but the measurements are not there yet.

The idea

TwinBooster, the method we describe in the paper, begins with an overlooked fact. Assays are not just empty labels waiting for numbers. They usually come with words attached: a name, a target, a short description, sometimes a protocol. In this project, we asked whether we could treat that text as part of the scientific evidence rather than as background paperwork. If a model can read the assay description and connect it to the molecule in front of it, it may be able to make useful predictions before that assay has built up a large dataset of its own.

How it works

At a high level, we give the model two views of the same problem. One view is the chemical structure of the molecule. The other is the text describing the assay. A language model turns the assay text into a numerical representation, while a standard chemical representation does the same for the molecule. We then use a self-supervised training step to encourage those two sides to line up when a molecule is active in a given assay. After that, a lightweight gradient-boosted model makes the final prediction.
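
To make the data flow concrete, here is a minimal Python sketch of the two views and an alignment objective. Everything in it is an assumption for illustration, not the code from the paper: hashed character trigrams stand in for the chemical representation, hashed words stand in for the language-model embedding, and the cross-correlation loss is just one common self-supervised objective of this kind, applied here to the raw views rather than through trained encoders. All function names are hypothetical.

```python
import math
import zlib


def hashed_vector(tokens, dim=16):
    """Count tokens into a fixed-length vector using a deterministic hash."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def molecule_view(smiles, dim=16):
    # Assumption: hashed character trigrams of a SMILES string stand in
    # for a real chemical fingerprint.
    grams = [smiles[i:i + 3] for i in range(max(1, len(smiles) - 2))]
    return hashed_vector(grams, dim)


def assay_view(text, dim=16):
    # Assumption: hashed lowercase words stand in for a language-model
    # embedding of the assay description.
    return hashed_vector(text.lower().split(), dim)


def standardize(batch):
    """Scale each column to zero mean and unit variance across the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        for i in range(n):
            out[i][j] = (col[i] - mean) / std if std > 0 else 0.0
    return out


def cross_correlation_loss(view_a, view_b, off_diag_weight=0.005):
    """Penalize deviation of the cross-correlation matrix from the identity.

    Both views must be batches of equal-length vectors, one row per
    (molecule, assay) pair.
    """
    a, b = standardize(view_a), standardize(view_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for j in range(d):
        for k in range(d):
            c = sum(a[i][j] * b[i][k] for i in range(n)) / n
            if j == k:
                loss += (1.0 - c) ** 2          # matching features should correlate
            else:
                loss += off_diag_weight * c ** 2  # other features should not
    return loss


def paired_features(smiles, assay_text, dim=16):
    # The downstream classifier sees both views side by side.
    return molecule_view(smiles, dim) + assay_view(assay_text, dim)
```

In a full pipeline, the vector returned by paired_features is the kind of input a gradient-boosted classifier would take for the final prediction. That step is left out here to keep the sketch free of external dependencies.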

What the paper showed

On FS-Mol, a benchmark designed to test performance on new and low-data assays, we achieved state-of-the-art zero-shot performance. In plain English, we were able to make strong predictions for assays we had not trained on directly. We also looked at a more practical scenario: choosing compounds for a screening campaign. There too, we could help prioritize likely hits without collapsing chemical diversity into the same obvious choices.

Why it matters

I think this matters because the early stage of drug discovery is full of sparse information. If we can use assay descriptions as part of the signal, we do not have to wait for thousands of measurements before getting something useful from a model. We can turn the written context around an experiment into something computationally valuable.

Limits

The approach still depends on the text being informative. A vague assay description gives the model much less to work with. And zero-shot prediction does not replace real assay data; I see it as an early guide, not a final verdict.

Read the paper

Schuh, M. G.; Boldini, D.; Sieber, S. A. J. Chem. Inf. Model. 2024, 64 (12), 4640-4650. DOI: 10.1021/acs.jcim.4c00765