Protein–Protein Affinity Prediction: the data landscapes of sequence and structure
Binding affinity – the strength with which two molecules interact – sits at the heart of biology. Whether a small-molecule inhibitor nestles in a protein’s catalytic site, a T cell receptor recognises a tumour antigen, or an engineered bispecific antibody coaxes immune cells towards their targets, affinity drives functional biological events. During therapeutic design, affinity is inextricably linked to dose: stronger binding means lower doses, and lower doses reduce the risk of off-target side effects.
Experimentally, measuring binding affinity – usually using surface plasmon resonance (SPR) or biolayer interferometry (BLI) – is precise but resource-intensive, often dominating the cost and timelines of lead optimisation. The prospect of accurately predicting affinity without having to measure it physically is therefore a clear motivation: every increment in predictive accuracy promises shorter cycles, fewer wet-lab iterations, and ultimately faster drug programmes.
Predicting binding affinity: a combination of physics, data and machine learning
Previous approaches to computational affinity prediction span a spectrum from physics-based first principles to data-driven learning. In recent years, relative free energy perturbation (FEP) and absolute binding free energy (ABFE) calculations have seen renewed traction [1], and while these approaches can be highly informative, their accuracy is limited and they remain computationally intensive. Machine learning promises a complementary path in which a direct mapping from input molecules to measured affinities is learnt [2]. However, the accuracy of ML methods has also been limited, and while the literature is full of bold claims of state-of-the-art performance, these have rarely translated into improved drug discovery efficiency. Of course, hybrid strategies that merge physical priors with ML are an obvious route to boost performance, but here we stop to ponder why purely data-driven affinity prediction – without any physics in the loop – has so often struggled in practice.
Many of the complications lie not in complex architectures but in the available data landscape. Careful benchmarking exercises have shown that public affinity datasets exhibit substantial data leakage between training and test sets; when this leakage is controlled for, almost all performance on unseen examples vanishes. This lack of generalisability has motivated the development of applicability domains and other pessimistic control mechanisms, and yet even with careful splitting and debiasing, shortcut learning persists: models latch onto superficial cues rather than the underlying interaction physics [3].
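In practice, leakage control usually means splitting by sequence-similarity clusters rather than at random, so that no test complex has a near-duplicate in the training set. The sketch below illustrates the idea; the cluster assignment here is a toy stand-in for a real clustering tool such as MMseqs2, and all identifiers are invented:

```python
import random

def similarity_split(records, cluster_of, test_frac=0.2, seed=0):
    """Split affinity records so that whole similarity clusters go to
    either train or test, preventing near-duplicate leakage.

    `records` is a list of (complex_id, affinity) pairs; `cluster_of`
    maps a complex_id to its sequence-similarity cluster (in practice
    produced by a tool such as MMseqs2 at, say, 30% identity)."""
    clusters = sorted({cluster_of(cid) for cid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_frac))
    test_clusters = set(clusters[:n_test])
    train = [r for r in records if cluster_of(r[0]) not in test_clusters]
    test = [r for r in records if cluster_of(r[0]) in test_clusters]
    return train, test

# Toy usage: pretend the first character of each complex id is its cluster.
records = [(f"{c}{i}", float(i)) for c in "ABCDE" for i in range(10)]
train, test = similarity_split(records, cluster_of=lambda cid: cid[0])
# No cluster appears on both sides of the split.
assert not {cid[0] for cid, _ in train} & {cid[0] for cid, _ in test}
```

The key design choice is that entire clusters, not individual measurements, are assigned to one side of the split; splitting individual records from the same cluster is exactly the leakage being guarded against.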
Representation in binding affinity models: structure versus sequence
While data quality issues apply to all data-driven affinity predictors, we must plough onward and consider a second differentiating axis for machine learning methods: representation. How should models “see” binding partners – by sequence alone, or through their 3D structure?

Sequence-based models have surged ahead in recent years, buoyed by the success of general language models such as ChatGPT. A sequence-only representation has practical appeal: it applies universally (structures are often unavailable), and it avoids conflating affinity with a single static conformation that may not capture the dynamic, entropic contributions to binding. However, researchers who embrace this route should ask the underlying question: are we betting that the physics of binding can be inferred directly from sequence? Or are we content with learning sequence-based distances in “affinity space” that capture useful correlations, without necessarily encoding the underlying forces?
Structure-based approaches, by contrast, seek to model physical interactions more directly. Researchers initially experimented with increasingly raw representations of 3D coordinates – from grids and pharmacophore fingerprints to CNNs, graph networks and point cloud–based architectures [2] – in the hope that the physical laws of binding would emerge from the data itself; arguably the most successful example of this is AlphaFold. However, structural data only represents low-energy states, making it hard to learn a physically realistic world model. The data is also complex, and the space it occupies is continuous, with many sources of bias and error: incomplete modelling, artefacts of crystallisation, and a skew towards certain druggable protein families. Embracing structural data is therefore far from a free lunch compared with representing sequences alone.
While the relative merits of sequence versus structure have been weighed extensively, we argue that the extent to which each representation is susceptible to underlying data biases is an under-appreciated driver of the differences in performance.
Early experiments: embracing structure by adapting Boltz-2 for protein-protein affinity prediction
We recently adapted Boltz-2 [4] – a state-of-the-art structure-based model for protein–ligand affinity – to the protein–protein setting. The adapted model, Boltz-2-PPI, was benchmarked against sequence-based alternatives on two datasets:
TCR3d [5], structural data of TCRs in complex with their peptide-MHC targets curated from the Protein Data Bank (PDB). Each TCR-pMHC complex has been annotated with experimental binding affinity (251 complexes).
PPB-affinity (filtered) [6, 7], a larger set of ~8,500 protein–protein affinity measurements representing diverse protein classes, many without structural data.

We trained for affinity prediction on both datasets (Tables 1 & 2) and the outcome was consistent: sequence-based models outperformed structure-based Boltz-2-PPI.
Table 1. Predictive performance on a sequence-dissimilar test set of TCR3d.
Table 2. Predictive performance on the PPB-affinity (filtered) test set.
These models are not strong predictors, partly because we use challenging splits with proteins unseen during training. Random splitting – still common in our field – artificially inflates performance; for example, on random splits of TCR3d, a model trained only on TCR and pMHC names (without sequences) reaches a Pearson r = 0.6 on the test set. The encouraging takeaway is that honest, challenging validation prevents overestimation and provides a reliable measure of model performance.
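The inflation from random splitting is easy to reproduce on purely synthetic data (the numbers below are invented for illustration, not taken from TCR3d). The sketch builds a “model” that memorises nothing but the mean measurement per pair name: under a random split it scores a high Pearson r, while on held-out pairs it can only emit the global training mean and carries no signal at all:

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, reps = 50, 4                        # 50 binder pairs, 4 replicate measurements each
true = rng.normal(size=n_groups)              # latent affinity per pair
y = np.repeat(true, reps) + 0.3 * rng.normal(size=n_groups * reps)
group = np.repeat(np.arange(n_groups), reps)  # the "name" each measurement belongs to

def name_only_predict(train_idx, test_idx):
    """A 'model' that memorises the mean measurement per pair name,
    falling back to the global training mean for unseen names."""
    y_tr, g_tr = y[train_idx], group[train_idx]
    preds = []
    for g in group[test_idx]:
        seen = y_tr[g_tr == g]
        preds.append(seen.mean() if len(seen) else y_tr.mean())
    return np.array(preds)

# Random split: replicates of the same pair land in both train and test.
idx = rng.permutation(n_groups * reps)
rand_test, rand_train = idx[:40], idx[40:]
r_random = np.corrcoef(name_only_predict(rand_train, rand_test), y[rand_test])[0, 1]

# Pair-disjoint split: whole pairs are held out, so memorised names are useless.
grp_test = np.where(group < 10)[0]
grp_train = np.where(group >= 10)[0]
preds_group = name_only_predict(grp_train, grp_test)

print(f"random split Pearson r = {r_random:.2f}")            # high, despite seeing no sequences
print(f"held-out pairs: prediction std = {preds_group.std():.2f}")  # constant output, zero signal
```

The memorising model looks excellent under the random split and is vacuous under the pair-disjoint one, which is exactly the pattern seen when name-only models score Pearson r = 0.6 on randomly split TCR3d.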
A key weakness of this early model is the limited training data. TCR3d includes only 150 training structures – too few for such a complex task – and adapting Boltz-2-PPI with a smaller affinity head yields similar performance, indicating that data scarcity, not model size, is the bottleneck. Models trained on the larger PPB-affinity dataset perform better, suggesting that robust affinity prediction will require aggregated, multi-fidelity data from diverse sources.
To probe the reasons for the performance gap between sequence and structure methods, we tried forgoing structure prediction and training directly on the experimentally resolved structures in TCR3d; even then, Boltz-2-PPI lagged behind our sequence-based method (implemented using ESM2-650M). This suggests that structure prediction accuracy is not (yet) the bottleneck, and that the learned structural embeddings face more fundamental issues which currently render them unsuitable for affinity prediction.
Interestingly, combining structure and sequence embeddings offered modest gains, particularly for weaker sequence models. Concatenating Boltz-2-PPI affinity-module embeddings with baseline ESM2 or ProtT5 embeddings trained on the PPB-affinity (filtered) dataset nudged performance upward, indicating that the structural representation contains complementary signal, albeit weaker than that captured by sequence alone.
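As a toy illustration of this kind of late fusion, the sketch below concatenates two frozen embedding matrices and fits a closed-form ridge regression head. Everything here is synthetic – the embeddings, dimensions and affinities are invented stand-ins, not our experimental setup – but it shows why concatenation helps whenever the second representation carries complementary signal:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_seq, d_struct = 300, 32, 16
seq_emb = rng.normal(size=(n, d_seq))        # stand-in for frozen ESM2/ProtT5 embeddings
struct_emb = rng.normal(size=(n, d_struct))  # stand-in for Boltz-2-PPI affinity-module embeddings
# Synthetic affinities with signal in both representations
y = seq_emb[:, 0] + 0.5 * struct_emb[:, 0] + 0.1 * rng.normal(size=n)

def ridge_fit_predict(X_tr, y_tr, X_te, lam=1.0):
    """Closed-form ridge regression head on frozen embeddings."""
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]), X_tr.T @ y_tr)
    return X_te @ w

split = 200
fused = np.hstack([seq_emb, struct_emb])      # late fusion by simple concatenation

pred_seq = ridge_fit_predict(seq_emb[:split], y[:split], seq_emb[split:])
pred_fused = ridge_fit_predict(fused[:split], y[:split], fused[split:])
r_seq = np.corrcoef(pred_seq, y[split:])[0, 1]
r_fused = np.corrcoef(pred_fused, y[split:])[0, 1]
print(f"sequence only: r = {r_seq:.2f}; fused: r = {r_fused:.2f}")
```

In this synthetic setting the fused head recovers the structural component that the sequence-only head cannot see; the size of the real-world gain depends entirely on how much genuinely complementary signal the structural embedding holds.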
Lessons and outlook
Sequence leads today. With massive pretraining, sequence models capture signatures of affinity robustly, even without explicitly modelling binding interactions.
Structure-based embeddings need to improve before they can smoothly express both the sequence and structural variations that determine the free energy of binding – especially under dataset shift.
While our early experiences have been based on Boltz-2, we expect that the arguments made here apply to many available learned representations of protein structure.
Fusing physics with machine learning – and structure with sequence – is currently crucial for realising state-of-the-art predictive performance in affinity modelling.
At Synteny, we believe large-scale data generation is the key to decoding molecular interactions with ML. Sequence-only corpora were transformative because they were big and standardised, and we believe affinity data needs the same treatment. Alongside data generation, we expect ML methods to mature to meet the challenges introduced by continuous, Euclidean-space data distributions.
Our early internal results mirror the external picture: sequence-first baselines are hard to beat, but hybrid models begin to close the gap when trained on richer, standardised datasets. We’ll share more as these datasets and fusion strategies mature.
References
[1] Siebenmorgen, Till, and Martin Zacharias. “Computational prediction of protein–protein binding affinities.” Wiley Interdisciplinary Reviews: Computational Molecular Science 10.3 (2020): e1448.
[2] Isert, Clemens, Kenneth Atz, and Gisbert Schneider. “Structure-based drug design with geometric deep learning.” Current Opinion in Structural Biology 79 (2023): 102548.
[3] Scantlebury, Jack, et al. “A small step toward generalizability: training a machine learning scoring function for structure-based virtual screening.” Journal of Chemical Information and Modeling 63.10 (2023): 2960-2974.
[4] Passaro, Saro, et al. “Boltz-2: Towards accurate and efficient binding affinity prediction.” BioRxiv (2025): 2025-06.
[5] Lin, Valerie, et al. “TCR3d 2.0: expanding the T cell receptor structure database with new structures, tools and interactions.” Nucleic Acids Research 53.D1 (2025): D604-D608.
[6] Liu, Huaqing, et al. “PPB-Affinity: Protein-Protein Binding Affinity dataset for AI-based protein drug discovery.” Scientific data 11.1 (2024): 1316.
[7] Alsamkary, Hazem, et al. “Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction.” arXiv preprint arXiv:2505.20036 (2025).


