Synergising assay text and chemical structures to predict bioactivity

How we combined language models, self‑supervised learning and boosting to make useful predictions for assays with little or no data.


The one‑line problem #

In early drug discovery we often want to predict whether a molecule will be active in a particular biological test (an assay). For brand‑new assays there are few or no measurements to learn from, so conventional models trained on per‑assay data struggle. It’s a bit like trying to guess whether a key fits a lock you’ve never seen: you only have the lock’s description.

The idea #

Assays almost always come with text: a name, a target, conditions, sometimes even a short protocol. We teach a language model to turn that assay text into a numerical representation (an embedding), and we pair it with a representation of the molecule. Then we train the system with a self‑supervised objective so that the text and molecule representations agree on what matters. Finally, a lightweight booster (a fast gradient‑boosted tree model) makes the actual prediction. We call the model TwinBooster.

Analogy: imagine two photographers taking pictures of the same scene from different angles—one camera reads the assay text, the other looks at the molecule. Our method aligns the two views so the important features line up, then lets a sharp‑eyed editor make the final call.

Why this matters #

Most assays live in the “long tail” where data are scarce. If we can generalise from assay descriptions, we can make useful predictions without needing to first collect thousands of measurements. That helps triage compounds earlier, design tighter screening libraries, and save wet‑lab time and budget.

What we actually built #

  • Text encoder: a fine‑tuned language model turns assay descriptions into embeddings.
  • Molecule encoder: a standard cheminformatics representation of each compound.
  • Self‑supervised alignment: a Siamese objective (Barlow Twins) encourages the text and molecule views of the same training example to agree while reducing redundancy.
  • Gradient‑boosted head: a small, robust model uses those aligned features to predict activity.
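To make the last step of the list above concrete, here is a minimal sketch of how a boosted head might consume paired text and molecule features. Everything in it is an assumption for illustration: the random arrays are synthetic stand‑ins for aligned features, the helper `make_rows` and the simple concatenation are hypothetical, the toy labels mean nothing chemically, and scikit‑learn's `GradientBoostingClassifier` stands in for whichever boosting library is actually used. It is not the TwinBooster code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
D_TEXT, D_MOL = 8, 16  # hypothetical feature sizes, chosen for the demo


def make_rows(text_vec, mol_mat):
    """Pair one assay-text vector with every molecule: one row per (assay, molecule)."""
    tiled = np.tile(text_vec, (mol_mat.shape[0], 1))
    return np.hstack([tiled, mol_mat])


# Synthetic stand-ins for features from 10 training assays, 50 molecules each.
X_parts, y_parts = [], []
for _ in range(10):
    text_vec = rng.normal(size=D_TEXT)
    mols = rng.normal(size=(50, D_MOL))
    X_parts.append(make_rows(text_vec, mols))
    # Toy label that depends on both views, purely for illustration.
    y_parts.append((text_vec[0] * mols[:, 0] > 0).astype(int))
X_train, y_train = np.vstack(X_parts), np.concatenate(y_parts)

# Small gradient-boosted head trained on all known (assay, molecule) pairs.
head = GradientBoostingClassifier(n_estimators=50, random_state=0)
head.fit(X_train, y_train)

# Zero-shot use: a new assay contributes only its text vector, no labels.
new_text = rng.normal(size=D_TEXT)
candidates = rng.normal(size=(20, D_MOL))
p_active = head.predict_proba(make_rows(new_text, candidates))[:, 1]
ranking = np.argsort(-p_active)  # indices of candidates, most promising first
```

The point of the layout is that the assay enters the model only through its text vector, so scoring molecules against an unseen assay needs no measurements from that assay.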

For readers who like a peek under the bonnet, the self‑supervised loss encourages the cross‑correlation between the two views to approach the identity matrix:

$$\mathcal{L}_{BT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_{i\neq j} C_{ij}^2,$$

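where $C$ is the cross‑correlation matrix between the embeddings of the two views, computed over a batch, and $\lambda$ weights the off‑diagonal term. Informally, this says “keep matching features similar and different features decorrelated”.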
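The loss above can be written in a few lines of NumPy. This is a sketch of the generic Barlow Twins objective, not code from the paper: the function name, the value of `lam` and the toy inputs are illustrative assumptions.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-8):
    """Barlow Twins loss for two views.

    z_a, z_b: arrays of shape (batch, dim), embeddings of the two views
    of the same examples. lam is an illustrative weight, not a tuned value.
    """
    n = z_a.shape[0]
    # Standardise every feature across the batch (zero mean, unit variance).
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix C, shape (dim, dim).
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)           # sum_i (1 - C_ii)^2
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # sum_{i != j} C_ij^2
    return float(on_diag + lam * off_diag)


# Toy check: views that agree give a much lower loss than unrelated views.
rng = np.random.default_rng(0)
text_view = rng.normal(size=(256, 4))
mol_view_aligned = text_view + 0.05 * rng.normal(size=(256, 4))
mol_view_random = rng.normal(size=(256, 4))

loss_aligned = barlow_twins_loss(text_view, mol_view_aligned)
loss_random = barlow_twins_loss(text_view, mol_view_random)
```

In the toy check, the nearly identical views push the diagonal of $C$ towards 1, so `loss_aligned` comes out far smaller than `loss_random`.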

How we tested it #

We evaluated on FS‑Mol, a widely used few‑shot benchmark in which models must generalise to unseen assays. In the strict zero‑shot setting (no task‑specific training), TwinBooster delivered state‑of‑the‑art performance across the benchmark. We also ran a retrospective case study showing how the method can help tailor a screening library towards the assay of interest.

  • About the dataset: FS‑Mol is maintained by Microsoft Research and frames bioactivity prediction as thousands of small tasks; see the project page at github.com/microsoft/FS-Mol.
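For a sense of how per‑assay scoring on such a benchmark can work, here is a small sketch. As far as we understand it, FS‑Mol reports ΔAUPRC, the area under the precision‑recall curve minus the fraction of actives in the evaluated set, averaged over tasks. The code below approximates that with average precision. The assay names, labels and scores are made up, and the function ignores tied scores, so treat it as an illustration rather than the benchmark's official evaluation code.

```python
import numpy as np


def average_precision(y_true, scores):
    """Average precision for binary labels; assumes no tied scores."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    y = y_true[order]
    hits = np.cumsum(y)                           # actives found so far
    precision_at_k = hits / np.arange(1, len(y) + 1)
    # Average the precision at the rank of each active compound.
    return float(np.sum(precision_at_k * y) / y.sum())


def delta_auprc(y_true, scores):
    """Average precision minus the random baseline (fraction of actives)."""
    return average_precision(y_true, scores) - float(np.mean(y_true))


# Two made-up held-out assays: (labels, predicted activity scores).
tasks = {
    "assay_A": (np.array([1, 0, 1, 0]), np.array([0.9, 0.8, 0.7, 0.1])),
    "assay_B": (np.array([1, 0, 1, 0]), np.array([0.9, 0.1, 0.8, 0.2])),
}

per_task = {name: delta_auprc(y, s) for name, (y, s) in tasks.items()}
mean_delta = float(np.mean(list(per_task.values())))
```

Subtracting the fraction of actives makes assays with different hit rates comparable: a score of zero means "no better than random ranking" regardless of how many actives the assay contains.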

Who might find this useful #

  • Medicinal chemists wanting a ranked shortlist before committing to larger screens.
  • Screening teams designing assay‑specific libraries.
  • ML/cheminformatics groups exploring low‑data regimes or building baselines for new targets.

Limitations (and what’s next) #

  • Performance depends on informative assay text; minimal or ambiguous descriptions help less.
  • Zero‑shot is powerful but won’t beat a large, high‑quality assay‑specific dataset; think of it as a strong prior before data arrive.
  • We’d love to see prospective (live) tests and community baselines on more real‑world assays.

Read the paper #

Schuh, M. G.; Boldini, D.; Sieber, S. A. J. Chem. Inf. Model. 2024, 64(12), 4640–4650. DOI: 10.1021/acs.jcim.4c00765


TL;DR #

  • What we did: Combined assay text and molecular structure with self‑supervised learning (Barlow Twins) plus a boosted predictor—aka TwinBooster—to make zero‑shot activity predictions.
  • Why it matters: Brings useful predictions to new assays with little/no data, helping prioritise compounds and design focused screening libraries.
  • What it’s for: Early triage, assay‑specific library design, and as a strong baseline whenever data are scarce.