Can you predict drug binding from sequence alone?

In this post, I explain how far we can get in drug-target prediction when we work with the simplest and most widely available biological inputs.

The problem

For all the excitement around AI in drug discovery, many methods still depend on something we often do not have: high-quality 3D structural data. That creates a real bottleneck. If we want to test many drug-target pairs quickly, the most detailed approaches are often the hardest ones to use at scale.

The idea

In this project, we took a more pragmatic route. Instead of beginning with full 3D structures, we started from the information that is usually available at the outset: a SMILES string for the molecule and an amino-acid sequence for the protein. These inputs are simple, and that is their strength: they are cheap, common, and fast to process. We wanted to know whether that simpler view of the problem could still produce genuinely useful predictions.
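To make the inputs concrete, here is a minimal sketch of what one drug-target pair looks like as data. The class name `DrugTargetPair`, the validation rules, and the example peptide are illustrative assumptions, not part of the published code; the SMILES string is aspirin.

```python
from dataclasses import dataclass

# The 20 canonical amino acids in one-letter code.
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass(frozen=True)
class DrugTargetPair:
    """Hypothetical container for one 1D drug-target input pair."""

    smiles: str    # molecule as a SMILES string
    sequence: str  # protein as a one-letter amino-acid sequence

    def __post_init__(self):
        # Cheap sanity checks only; real SMILES parsing needs a
        # cheminformatics toolkit and is out of scope for this sketch.
        if not self.smiles or any(ch.isspace() for ch in self.smiles):
            raise ValueError("SMILES must be non-empty and contain no whitespace")
        unknown = set(self.sequence.upper()) - CANONICAL_AMINO_ACIDS
        if not self.sequence or unknown:
            raise ValueError(f"Unexpected residues in sequence: {sorted(unknown)}")


# Aspirin paired with a made-up ten-residue peptide (illustrative only).
pair = DrugTargetPair(smiles="CC(=O)OC1=CC=CC=C1C(=O)O", sequence="MKTAYIAKQR")
print(len(pair.smiles), len(pair.sequence))
```

Both fields are plain strings, which is the whole point: no structure files, no docking setup, just text that is easy to store and process in bulk.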

How it works

We turn the molecule into a compact fingerprint and the protein sequence into embeddings from a protein language model that carries some structural awareness. A Barlow Twins training step then learns a shared representation of drug-target pairs, and a gradient-boosted model makes the final prediction. That hybrid design sits at the center of the paper: we use deep learning where it is strongest, in learning good representations, and then we use a smaller, efficient model where robustness and speed matter most.
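For readers who have not met Barlow Twins before, the sketch below shows the general form of its loss in plain Python. It is the standard objective from the original Barlow Twins method, not our training code: I assume here that the two "views" are a batch of drug embeddings and a batch of target embeddings of equal dimension, and the function names and the weight `lam=0.005` are illustrative choices.

```python
import math


def standardize(batch):
    """Scale each embedding dimension to zero mean and unit variance over the batch."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        std = math.sqrt(var) or 1.0  # guard against constant dimensions
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / std
    return out


def cross_correlation(z_a, z_b):
    """Cross-correlation matrix between two standardized batches (dim x dim)."""
    n, dim = len(z_a), len(z_a[0])
    return [
        [sum(z_a[k][i] * z_b[k][j] for k in range(n)) / n for j in range(dim)]
        for i in range(dim)
    ]


def barlow_twins_loss(emb_a, emb_b, lam=0.005):
    """Pull the diagonal of the cross-correlation towards 1 and the rest towards 0."""
    c = cross_correlation(standardize(emb_a), standardize(emb_b))
    dim = len(c)
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(dim))
    off_diag = sum(c[i][j] ** 2 for i in range(dim) for j in range(dim) if i != j)
    return on_diag + lam * off_diag


# Toy batch of four 2-dimensional embeddings (made-up numbers).
drug_emb = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
target_emb = [row[:] for row in drug_emb]
print(barlow_twins_loss(drug_emb, target_emb))
```

The diagonal term rewards the two views for agreeing dimension by dimension, while the off-diagonal term discourages redundant dimensions. In a real setup the embeddings would come from trainable encoders and the loss would be minimized with an autodiff framework; the learned representation would then be passed to the gradient-boosted model.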

What the paper showed

Across several established drug-target benchmarks, we saw state-of-the-art performance while using only one-dimensional inputs. We also looked beyond leaderboard numbers. By tracing which training examples most influenced a prediction and comparing the results with co-crystal structures, we found that the model’s signal aligned with catalytically active and stabilizing residues. That matters because it suggests the model is not simply exploiting accidental patterns in the data. We then extended the approach to BarlowDTI XXL, trained on more than 3.6 million curated drug-target pairs for larger, more realistic applications such as virtual screening.

Why it matters

I think the appeal here is practical. If we can make strong predictions from sequences and molecular strings alone, more people can use these tools without waiting for expensive structural experiments or heavy computation. That makes the method useful not only as a research result, but also as a working tool for early-stage discovery.

Limits

Sequence-only methods still have limits. Some interactions depend on fine 3D geometry, and when reliable structural data exist, structure-based approaches can still have an advantage. More broadly, we can only trust a predictive system as much as we trust the data it learns from.

Read the paper

Schuh, M. G.; Boldini, D.; Bohne, A. I.; Sieber, S. A. J. Cheminform. 2025, 17, 18. doi:10.1186/s13321-025-00952-2

Try it

We also provide a public web demo: bio.nat.tum.de/oc2/barlowdti