Building Synteny’s Data Foundry to solve molecular recognition
The challenge of mapping genotype to phenotype defines much of modern biology. From the protein folding problem, where sequence determines structure, to emerging virtual cell models, these efforts reveal how sequence encodes function across molecular and cellular scales. Now, advances in synthetic biology and machine learning are allowing us to explore these maps in ways that seemed impossible even a decade ago.
At Synteny, we believe that by pointing these new technologies toward the problem of molecular recognition — how immune receptors recognise their targets — we can open up new classes of programmable therapeutics across oncology, autoimmunity and beyond.
From reading biology to writing it
Over the past twenty years, biology has undergone a revolution in scale. The arrival of next-generation sequencing gave us the ability to read DNA cheaply at astonishing depth and accuracy. More recently, advances in synthesis technologies have made it possible to write bespoke DNA sequences at comparable scale. With these technologies, it is possible to generate millions of precisely designed sequence variants, express and test them in living systems, and measure their effects across a range of rich phenotypes. These advances have fuelled a new generation of experiments that bridge sequence and function, from deep mutational scanning for the design of novel enzymes to Perturb-seq, which employs systematic genetic perturbations to unravel the regulatory logic of human cells. Collectively, these studies underscore a profound concept: when perturbations and measurements can be conducted systematically at scale, biology becomes predictable.
A new frontier: molecular recognition
At Synteny, our focus is to bring this same experimental and computational scale to one of biology’s most complex and consequential phenomena: T-cell receptor (TCR) recognition of peptide–MHC (pMHC) targets. This is the core of adaptive immunity — the molecular handshake that distinguishes “self” from “non-self,” infection from tolerance, cancer from health. Despite decades of study, this recognition landscape remains largely unmapped. We can measure individual interactions, but we cannot yet predict which TCR will bind which antigen, with what affinity, or what cellular outcome that interaction will drive. We believe that solving this molecular recognition problem will unlock a new era of programmable biologics and immune therapeutics.
The Synteny Data Foundry
To tackle this challenge, we’ve built the Synteny Data Foundry: a platform for generating large-scale, high-fidelity data on TCR–pMHC interactions. Like a foundry that forges raw materials for engineering, our data foundry produces the fundamental data from which AI-guided models of molecular recognition can be built. It rests on four key pillars:
1. TCR perturbations at scale
At the core of our platform lies the ability to design, synthesise and test millions of TCR variants. Using our generative design engine, MANIFOLD, we propose vast libraries of candidate TCRs predicted to engage a given target. These designs are synthesised at scale using modern DNA-writing technologies, enabling us to systematically explore sequence space across diverse complementarity-determining regions (CDRs) and frameworks. Each cycle begins imperfectly, with diverse hypotheses about what might bind, and iteratively improves as data flow back into our models.
2. High-throughput TCR screening for discovery, development, and safety
Molecular recognition cannot be fully understood outside its native context.
Rather than relying solely on reductionist in-vitro binding assays, we embrace the complexity of the human immune cell. We’ve developed two innovative high-throughput cell-based screening platforms that characterise up to a million TCRs against a panel of peptides, or up to a million pMHCs against a panel of TCRs, in a single assay. But why take the cell-based route, and what makes this setup a game-changer? Let’s break it down.
2.1 Why a cell-based system?
Understanding TCR–pMHC recognition requires studying receptors in their native environment. Our high-throughput cell-based assays express TCRs on living T cells, preserving membrane architecture, co-receptor engagement, and downstream signalling cascades. This enables us to quantify antigen-driven activation under near-physiological conditions, capturing key determinants such as avidity, clustering, and signalling thresholds that are invisible to in-vitro binding assays. It’s like testing a car on the road instead of just in a wind tunnel; far more indicative of performance under real conditions.
2.2 Screening millions of TCRs for discovery
Size matters. Our discovery engine can assay up to one million TCR variants against panels of peptide–MHC targets in a single experiment, allowing systematic exploration of vast areas of sequence space. This scale reveals rare, high-affinity, and multi-HLA-reactive receptors that traditional low-throughput assays cannot detect. Each screen yields quantitative mappings between sequence and function, enabling us to refine our generative design models iteratively. The result is accelerated identification of optimised TCRs with the desired balance of potency, specificity, and manufacturability.
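As an illustration of what a quantitative mapping between sequence and function can look like, here is a minimal sketch of one common way to score pooled screens: the log2 change in each variant's frequency between a pre-selection and a post-selection pool. This is not a description of Synteny's pipeline. The function name, the pseudocount, and the toy sequences and counts are all assumptions made for the example.

```python
import math


def log2_enrichment(pre_counts, post_counts, pseudocount=0.5):
    """Log2 fold-change in frequency for every variant seen in either pool.

    A pseudocount is added to each count so that variants missing from one
    pool still get a finite score.
    """
    variants = set(pre_counts) | set(post_counts)
    pre_total = sum(pre_counts.values()) + pseudocount * len(variants)
    post_total = sum(post_counts.values()) + pseudocount * len(variants)

    scores = {}
    for variant in variants:
        pre_freq = (pre_counts.get(variant, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(variant, 0) + pseudocount) / post_total
        scores[variant] = math.log2(post_freq / pre_freq)
    return scores


# Hypothetical read counts for three made-up variant sequences.
pre = {"CASSLGQ": 100, "CASSPGT": 100, "CASSQDR": 100}
post = {"CASSLGQ": 400, "CASSPGT": 100, "CASSQDR": 10}

scores = log2_enrichment(pre, post)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the score compares frequencies rather than raw counts, a variant whose count stays flat while another expands (the second variant here) ends up with a negative score.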
2.3 Screening millions of antigens for safety
Building on our cell-based TCR screening platform, we’ve pioneered a complementary system that takes off-target antigen screening to the next level. Libraries comprising millions of peptide–MHC complexes are expressed in engineered reporter cells that signal upon productive engagement, allowing comprehensive profiling of TCR specificity across the human peptidome. This approach identifies rare but clinically relevant cross-reactivities that may underlie toxicity or autoimmunity. By integrating these data into our design pipeline, we can eliminate problematic motifs and design safety into our molecules from the outset, producing TCRs that are both safe and potent.
3. Ultra-cheap paired-chain sequencing
A major barrier to studying endogenous TCRs at scale has been sequencing cost and complexity: pairing endogenous TCR chains requires expensive single-cell approaches. The synthetic constructs used in our foundry place both TCR chains on a single amplicon, so paired chains can be read together by bulk sequencing, allowing us to track how α–β combinations influence cell surface pairing and function. This dramatically reduces cost per observation and ensures that sequence–function relationships remain intact. It also allows us to detect subtle patterns: the specific motifs, hydrogen-bond networks, or charge complementarities that make or break recognition.
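To make the single-amplicon idea concrete, here is a small sketch of how paired chains could be tallied from bulk reads. It assumes a hypothetical read layout in which an α-chain segment and a β-chain segment are separated by a fixed linker sequence. The linker, the toy segments, and the function names are invented for the example and do not describe the actual construct.

```python
from collections import Counter

# Hypothetical fixed sequence separating the two chain segments in each read.
LINKER = "GGTTCCAA"


def split_pair(read, linker=LINKER):
    """Split one read into (alpha, beta) segments, or return None if malformed."""
    alpha, sep, beta = read.partition(linker)
    if not sep or not alpha or not beta:
        return None
    return alpha, beta


def count_pairs(reads):
    """Count how often each alpha-beta combination appears across bulk reads."""
    counts = Counter()
    for read in reads:
        pair = split_pair(read)
        if pair is not None:
            counts[pair] += 1
    return counts


# Toy segments standing in for chain-identifying regions.
ALPHA_1, ALPHA_2, BETA_1 = "ACGTAC", "ACGAAC", "CATCAT"

reads = [
    ALPHA_1 + LINKER + BETA_1,
    ALPHA_1 + LINKER + BETA_1,
    ALPHA_2 + LINKER + BETA_1,
    ALPHA_1 + BETA_1,  # no linker: skipped as malformed
]
pair_counts = count_pairs(reads)
```

Since both chains sit on the same read, the pairing is recovered by simple string handling on each read rather than by matching cell barcodes across separate α and β libraries.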
4. Lab in the loop
Data alone are not enough; learning requires iteration. Our foundry operates as a closed experimental–computational loop. Each round begins with generative design, proceeds through synthesis, expression, and phenotypic screening, including cross-reactivity testing, and then feeds the results back into our AI models. With every cycle, the models become sharper, learning the grammar of molecular recognition from empirical evidence. This loop enables us not only to predict binding but also to control it with conditional generative design.
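The shape of such a loop can be sketched in a few lines. The toy below is a deliberately simplified stand-in, not Synteny's method: the "assay" scores a sequence by how many positions match a hidden target string, the "design" step proposes single-point mutants of the best sequences measured so far, and "learning" is nothing more than keeping the top performers. Every name and parameter is an assumption made for illustration.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
HIDDEN_TARGET = "CASSLGQ"  # hypothetical stand-in for the unknown biology


def toy_assay(seq):
    """Stand-in for a screen: number of positions matching the hidden target."""
    return sum(a == b for a, b in zip(seq, HIDDEN_TARGET))


def propose(parents, n, rng):
    """Stand-in for the design step: single-point mutants of the parents."""
    designs = set()
    while len(designs) < n:
        parent = rng.choice(parents)
        pos = rng.randrange(len(parent))
        designs.add(parent[:pos] + rng.choice(ALPHABET) + parent[pos + 1:])
    return sorted(designs)  # sorted so runs are reproducible


def run_loop(rounds=5, library_size=50, keep=5, seed=0):
    """Design, test, and feed results back; return the best score per round."""
    rng = random.Random(seed)
    parents = ["".join(rng.choice(ALPHABET) for _ in HIDDEN_TARGET)]
    measured = {p: toy_assay(p) for p in parents}
    best_per_round = []
    for _ in range(rounds):
        library = propose(parents, library_size, rng)          # design
        measured.update((s, toy_assay(s)) for s in library)    # test
        parents = sorted(measured, key=measured.get, reverse=True)[:keep]  # feed back
        best_per_round.append(measured[parents[0]])
    return best_per_round


history = run_loop()
```

Because all measurements accumulate across rounds, the best score in `history` can never decrease. A real loop would replace the mutate-and-select step with a trained model and the toy assay with an actual screen.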
A programmable immune system
Molecular recognition defines what our immune system can and cannot see. By learning its rules we can begin to rewrite them, designing TCRs that target cancers, autoimmune antigens, or infectious diseases with unprecedented precision. Synteny’s data foundry is our engine for this future. By combining high-throughput synthetic biology, rich functional phenotyping, and AI-driven design, we are turning molecular recognition from a mystery into an engineering discipline.


