The AI biotech stack has stopped being a loose collection of impressive demos. In 2026, the real engineering problem is getting biological foundation models, docking tools, ADMET predictors, lab robots, and LIMS records to operate on the same clock.
The AI biotech stack is the production architecture that connects model-generated biological or chemical hypotheses to wet-lab execution, then routes validated assay results back into the next model update. The recommended action is to build one narrow closed loop first, and to have versioned data, uncertainty-aware selection, assay-quality thresholds, and compliance logging in place before scaling to broader discovery workflows.
TL;DR: Treat AI biology as an experimental control system. Foundation models generate and rank candidates, but the wet lab provides the loss function. The teams that benefit fastest are the ones that connect models to clean assay feedback, sample lineage, and decision rules.
Key takeaways
- Biological foundation models are now infrastructure, especially AlphaFold 3, ESM3, and Evo 2.
- The practical stack runs from curated data to generation, docking, ADMET, lab execution, LIMS, and governance.
- Closed-loop drug discovery depends more on assay quality and metadata discipline than model novelty.
- NVIDIA BioNeMo and NIM reduce deployment friction, but they don't solve data lineage or experimental design.
- Governance is now part of the architecture because FDA guidance, the EU AI Act, the NIST AI RMF, and biosecurity screening expectations increasingly affect model use in drug development.
What belongs in an AI biotech stack?
A working AI biotech stack has ten layers. You don't need every layer on day one, but you do need to know where each responsibility lives.
| Layer | Job | Representative systems | Failure mode |
|---|---|---|---|
| Data foundation | Protein, genomic, chemical, and structure records | UniProt, ChEMBL, PubChem, RCSB PDB, AlphaFold DB | Dirty labels, stale IDs, weak provenance |
| Foundation models | Learn biological representations | AlphaFold 3, ESM3, Evo 2 | Strong benchmark, weak target fit |
| Generative design | Propose molecules or proteins | RFdiffusion, MolMIM, Chroma | Beautiful candidates that can't be made |
| Docking and scoring | Estimate pose and binding | DiffDock, AutoDock Vina, Glide, FEP+ | False confidence in ranking |
| ADMET and tox | Filter for developability | CYP, hERG, Ames, clearance models | Late-stage attrition |
| Experiment selection | Choose what to test next | Active learning, Bayesian optimization | Sampling too narrowly |
| Lab automation AI | Execute assays and synthesis | HTS, acoustic dispensing, robotic synthesis | Instrument data trapped in silos |
| LIMS and ELN | Capture lineage and results | Benchling, StarLIMS, LabWare, Veeva | Free-text records that models can't use |
| Clinical data | Connect discovery to evidence | CDISC, OHDSI, Veeva, Medidata | Translation gap |
| Governance | Audit, compliance, and safety | FDA guidance, EU AI Act, NIST AI RMF, OSTP screening | Unreviewable decisions |
The architectural mistake is treating the first five layers as “the AI system” and the rest as operational cleanup. In drug discovery, the cleanup is the system.
Which biological foundation models are current as of June 2026?
AlphaFold 3 remains the reference point for biomolecular interaction prediction. The Nature paper describes a diffusion-based system that predicts structures across proteins, DNA, RNA, ligands, ions, and modified residues in a unified framework.
That matters because drug discovery rarely asks for a monomer structure in isolation. It asks whether a ligand binds, whether a mutation changes an interface, whether an RNA or DNA interaction matters, and whether the predicted geometry is useful enough to prioritize an experiment.
ESM3, introduced by EvolutionaryScale and distributed through channels including NVIDIA's ecosystem, pushed protein language models toward all-to-all reasoning across sequence, structure, and function. In its launch coverage, NVIDIA described ESM3 as trained on roughly 2.8 billion protein sequences, with the largest model trained on H100 infrastructure.
Evo 2 is the important 2026 addition for genome-scale modeling. The Arc Institute's Evo 2 page and Nature paper describe 7B and 40B parameter models trained across more than 9 trillion nucleotides, with context lengths up to 1 megabase at single-nucleotide resolution.
For practitioners, the model choice follows the biological object.
| Question | First model family to evaluate | Why |
|---|---|---|
| How does this ligand bind a target? | AlphaFold 3, DiffDock, Boltz-2, physics stack | Complex geometry and pose matter |
| What protein variant should we test? | ESM3, RFdiffusion, ProteinMPNN | Sequence, structure, and function trade off against one another |
| What genomic variant or design should we model? | Evo 2 | Long-context nucleotide modeling is the differentiator |
| Which hits survive development filters? | ADMET and tox models | Binding alone is rarely enough |
| Which experiment should run next? | Active learning over internal assay data | Local evidence beats public priors |
Where do NVIDIA BioNeMo and Recursion fit?
The pairing of NVIDIA BioNeMo with Recursion is best understood as an enterprise deployment pattern: large-scale biological AI models, accelerated inference, and pharma-scale data loops running on shared infrastructure.
NVIDIA BioNeMo provides model training, fine-tuning, and inference infrastructure for biological AI workloads. The BioNeMo release notes list framework version 2.7 as of September 30, 2025, which serves as the dated baseline for June 2026 architecture planning.
NVIDIA NIM adds a second piece: deployable microservices for specific models. The DiffDock NIM documentation and NGC catalog collection show the pattern: package a model behind an inference API so teams can use it without hand-building every dependency.
Recursion matters because it represents the other half of the loop. NVIDIA's coverage of the Recursion collaboration centers on scale: large biological datasets, high-throughput experiments, and accelerated computation feeding discovery workflows.
That combination is the direction of travel. Foundation models alone don't create a defensible discovery engine. A recurring stream of well-measured perturbation, imaging, omics, binding, and ADMET data can.
How do you design a closed-loop drug discovery system?
Start with one decision loop. Good loops have a narrow action space, fast experimental feedback, and a measurable objective.
A protein engineering loop might choose 96 variants per round. A medicinal chemistry loop might nominate 50 compounds for synthesis. A target-validation loop might prioritize perturbations for a cell model.
The reference loop looks like this:
1. Register target, assay, samples, and constraints in LIMS/ELN.
2. Generate candidates with a foundation or generative model.
3. Score candidates with docking, ADMET, novelty, and synthesizability filters.
4. Select experiments using uncertainty plus business constraints.
5. Execute assays through automated or semi-automated lab workflows.
6. Normalize results, attach metadata, and update model training sets.
7. Promote, kill, or redesign candidates based on predefined thresholds.
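Step 7 only works if the thresholds exist before the data arrives. A minimal sketch of such a rule follows; the `Thresholds` class, the `decide` function, and the numeric cutoffs are hypothetical placeholders, not values from any real program, and the extra `repeat` outcome for failed QC is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical cutoffs, agreed with the assay team before the round runs."""
    promote_ic50_nm: float = 100.0     # at or below this potency: promote
    kill_ic50_nm: float = 10_000.0     # above this potency: kill


def decide(ic50_nm: float, qc_passed: bool, t: Thresholds = Thresholds()) -> str:
    """Return 'promote', 'kill', 'redesign', or 'repeat' for one assay result."""
    if not qc_passed:
        # A readout that failed QC should not drive a promote or kill decision.
        return "repeat"
    if ic50_nm <= t.promote_ic50_nm:
        return "promote"
    if ic50_nm > t.kill_ic50_nm:
        return "kill"
    # Measurable but insufficient activity: send back to design.
    return "redesign"


if __name__ == "__main__":
    for ic50, qc in [(84.2, True), (500.0, True), (50_000.0, True), (84.2, False)]:
        print(ic50, qc, decide(ic50, qc))
```

Freezing the thresholds in a versioned object means the decision rule can be logged alongside the result it produced.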
The selector is the most underdesigned component. Teams often rank by predicted potency, then wonder why the loop stalls.
A better selector balances exploitation, exploration, and operational cost. It should reserve some experimental capacity for uncertain regions of chemical or sequence space, because those points improve the model fastest.
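One common way to express that balance is an upper-confidence-bound style score with a cost penalty. The sketch below assumes each candidate carries a predicted mean and standard deviation on a higher-is-better scale (for example pIC50) plus a relative cost; the field names and the `beta` and `cost_weight` defaults are illustrative assumptions, not recommended settings.

```python
def acquisition_score(pred_mean, pred_std, cost, beta=1.0, cost_weight=0.1):
    """UCB-style score: reward predicted activity and uncertainty, penalize cost.

    beta controls how much exploration is rewarded; beta=0 is pure exploitation.
    """
    return pred_mean + beta * pred_std - cost_weight * cost


def rank_candidates(candidates, beta=1.0, cost_weight=0.1):
    """Sort candidate dicts (keys: id, mean, std, cost) from best to worst score."""
    return sorted(
        candidates,
        key=lambda c: acquisition_score(c["mean"], c["std"], c["cost"], beta, cost_weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical candidates on a pIC50-like scale.
    pool = [
        {"id": "A", "mean": 7.0, "std": 0.1, "cost": 1.0},  # confident, cheap
        {"id": "B", "mean": 6.5, "std": 1.0, "cost": 1.0},  # uncertain, cheap
        {"id": "C", "mean": 7.2, "std": 0.2, "cost": 5.0},  # potent, expensive
    ]
    print([c["id"] for c in rank_candidates(pool)])            # exploration rewarded
    print([c["id"] for c in rank_candidates(pool, beta=0.0)])  # potency and cost only
```

With exploration switched on, the uncertain candidate B moves ahead of the confidently predicted A, which is the behavior a potency-only ranking never produces.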
A simple loop policy can be written as a decision matrix:
| Selection bucket | Share of experiment slots | Purpose |
|---|---|---|
| Highest predicted activity | 40% | Advance likely winners |
| Highest uncertainty | 25% | Improve model calibration |
| Chemical or sequence diversity | 20% | Avoid local minima |
| Controls and repeats | 15% | Detect assay drift |
Those percentages are starting values. Mature teams tune them by assay cost, cycle time, and historical model error.
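The matrix translates directly into plate arithmetic. The sketch below splits a round's slots by the starting percentages using integer math and hands any leftover slots to the buckets with the largest remainders. The bucket keys and the tie-breaking rule (earlier bucket wins) are assumptions made for illustration.

```python
# Starting shares from the decision matrix, in whole percent.
POLICY_PERCENT = {
    "predicted_activity": 40,
    "uncertainty": 25,
    "diversity": 20,
    "controls_and_repeats": 15,
}


def allocate_slots(total_slots, policy=POLICY_PERCENT):
    """Split total_slots across buckets with largest-remainder rounding."""
    if sum(policy.values()) != 100:
        raise ValueError("policy shares must sum to 100 percent")
    # Integer arithmetic avoids floating-point rounding surprises.
    slots = {name: total_slots * pct // 100 for name, pct in policy.items()}
    remainders = {name: total_slots * pct % 100 for name, pct in policy.items()}
    leftover = total_slots - sum(slots.values())
    # sorted() is stable, so ties go to the bucket listed first in the policy.
    by_remainder = sorted(policy, key=lambda name: remainders[name], reverse=True)
    for name in by_remainder[:leftover]:
        slots[name] += 1
    return slots


if __name__ == "__main__":
    print(allocate_slots(96))
```

For a 96-slot round this yields 39, 24, 19, and 14 slots, so the control budget survives rounding instead of being silently absorbed by the top-ranked bucket.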
What data foundation does the stack need?
Public data is necessary, but it isn't enough. The stack needs a governed internal data product for every assay that will drive model updates.
The public base is stronger than it used to be. UniProtKB's 2026_01 release reports 67.5 million entries and 566,811 manually annotated Swiss-Prot records in its release statistics. ChEMBL 37, announced in May 2026, expanded curated bioactivity coverage and protein degradation support according to the ChEMBL release post.
Structural data has also crossed a scale threshold. The RCSB PDB reports more than 255,000 deposited structures as of June 2026, while EMBL said the AlphaFold Database added millions of predicted protein complexes in March 2026.
Those numbers don't remove the need for internal curation. They raise the bar for what internal data must add: assay context, negative results, protocol details, batch effects, and real-world failure modes.
How should lab automation AI connect to LIMS and ELN?
Lab automation AI should write to the same system of record that humans use. Otherwise the model loop becomes a side database with missing samples, missing controls, and ambiguous provenance.
For most teams, the LIMS or ELN is the boundary between research creativity and operational truth. Benchling, StarLIMS, LabWare, and Veeva-style systems track samples, protocols, inventory, and results. On its product site, Benchling positions its cloud R&D platform around ELN, LIMS, registry, and workflow integration.
A production loop needs structured capture for at least five objects: candidate, batch, protocol, assay result, and model version. Free-text notes can remain useful for scientists, but they cannot be the only training record.
The integration contract can be simple:
{
  "candidate_id": "CMPD-2026-004812",
  "design_model": "molgen-prod-2026-06-01",
  "selection_policy": "active-learning-v3",
  "assay_id": "KINASE-BINDING-042",
  "batch_id": "BATCH-17A",
  "result": {
    "ic50_nm": 84.2,
    "confidence": "pass",
    "qc_flags": []
  }
}
The field names matter less than the discipline. Every model output that influences an experiment should be traceable to the model, data snapshot, selection policy, and assay readout.
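A traceability check can run before any record is admitted to a training set. The sketch below validates the fields shown in the example record and warns about two fields the text calls for but the example omits. The names `protocol_id` and `data_snapshot` are hypothetical, as is the `check_record` helper itself.

```python
import json

# Fields present in the example record.
REQUIRED_FIELDS = (
    "candidate_id", "design_model", "selection_policy",
    "assay_id", "batch_id", "result",
)
# Hypothetical field names for the protocol and data snapshot mentioned in the
# text; the example record does not include them.
RECOMMENDED_FIELDS = ("protocol_id", "data_snapshot")


def check_record(record: dict) -> dict:
    """Report missing required fields and missing recommended fields."""
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    warnings = [f for f in RECOMMENDED_FIELDS if not record.get(f)]
    result = record.get("result") or {}
    # qc_flags may legitimately be an empty list, so test for presence only.
    if "result" not in missing and "qc_flags" not in result:
        missing.append("result.qc_flags")
    return {"ok": not missing, "missing": missing, "warnings": warnings}


if __name__ == "__main__":
    raw = """{
      "candidate_id": "CMPD-2026-004812",
      "design_model": "molgen-prod-2026-06-01",
      "selection_policy": "active-learning-v3",
      "assay_id": "KINASE-BINDING-042",
      "batch_id": "BATCH-17A",
      "result": {"ic50_nm": 84.2, "confidence": "pass", "qc_flags": []}
    }"""
    print(check_record(json.loads(raw)))
```

Separating hard failures from warnings lets a team tighten the contract gradually without blocking the first loop.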
What governance belongs in the architecture?
Governance is now an engineering requirement for AI drug discovery. In a January 2025 press announcement, the FDA proposed a credibility framework for AI models used in drug and biological product submissions, emphasizing model context of use, credibility evidence, and lifecycle maintenance.
The Federal Register notice for the draft guidance makes the same point in regulatory language: AI-supported decisions need documented validation and defined limits.
The NIST AI Risk Management Framework gives teams a practical operating model for governing, mapping, measuring, and managing risk. The Nature Biotechnology biosecurity paper adds the dual-use warning specific to biology: generative tools need built-in safeguards when they can assist biological design.
Microsoft's June 2026 biosecurity commitment, published on Microsoft On the Issues, shows where cloud providers are heading. Model access, synthesis screening, and biological design monitoring will increasingly be platform concerns.
For engineering teams, that means logging is part of the product. Capture prompts or structured inputs, model versions, generated candidates, filters applied, human approvals, synthesis requests, assay outputs, and escalation decisions.
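One way to make such a log reviewable is to write each event as a structured entry and chain the entries by hash, so later edits are detectable. This is a sketch under assumptions: the event types, payload fields, and the hash-chaining design are illustrative choices, not a requirement drawn from any of the cited frameworks.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_entry(event_type: str, payload: dict, prev_hash: str = "") -> dict:
    """Build one log entry that commits to the previous entry's hash."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,   # e.g. "candidate_generated", "human_approval"
        "payload": payload,         # model version, filters applied, approver, etc.
        "prev_hash": prev_hash,
    }
    body["entry_hash"] = _digest(body)
    return body


def verify_chain(entries: list) -> bool:
    """Return True if every entry is unmodified and linked to its predecessor."""
    prev = ""
    for entry in entries:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev or _digest(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True


if __name__ == "__main__":
    first = audit_entry("candidate_generated",
                        {"candidate_id": "CMPD-2026-004812",
                         "design_model": "molgen-prod-2026-06-01"})
    second = audit_entry("human_approval",
                         {"candidate_id": "CMPD-2026-004812", "approved": True},
                         prev_hash=first["entry_hash"])
    print(verify_chain([first, second]))
```

A chained log does not replace access control or a validated system of record, but it gives reviewers a cheap integrity check over the decision trail.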
What this means for you
Build the smallest loop that changes a real discovery decision. Don't start with an enterprise-wide platform migration unless your current data systems make the loop impossible.
A good first loop has four traits: the assay returns quickly, the decision threshold is explicit, the data can be structured, and the cost of wrong guesses is tolerable. Protein variant selection, hit triage, and ADMET-aware lead optimization are usually better first targets than end-to-end autonomous discovery.
Use current foundation models where they fit, but keep the workflow portable. Model names will change faster than wet-lab constraints, regulatory expectations, and assay physics.
Action checklist
- Choose one loop with a measurable objective and a known cycle time.
- Define candidate IDs, protocol IDs, assay IDs, batch IDs, and model version IDs before running experiments.
- Use biological foundation models for generation or representation, then add docking, ADMET, and synthesizability gates.
- Reserve experiment slots for uncertainty, diversity, and controls.
- Write model outputs and assay results into the LIMS or ELN system of record.
- Log every model-assisted decision that could affect a regulated submission or biosafety review.
- Review the loop after each cycle using calibration error, hit rate, assay QC, and cost per validated candidate.
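The review metrics in the last checklist item can be computed from one round of results. The sketch below assumes each result carries predicted and measured values on a pIC50-like scale and a QC flag; the field names are hypothetical, and mean absolute error is used as a simple stand-in for a fuller calibration analysis.

```python
from statistics import mean


def cycle_review(results, hit_threshold_pic50, cycle_cost):
    """Summarize one loop cycle.

    results: list of dicts with keys predicted_pic50, measured_pic50, qc_passed.
    Only QC-passing results count toward hit rate and prediction error.
    """
    if not results:
        raise ValueError("no results to review")
    valid = [r for r in results if r["qc_passed"]]
    hits = [r for r in valid if r["measured_pic50"] >= hit_threshold_pic50]
    errors = [abs(r["predicted_pic50"] - r["measured_pic50"]) for r in valid]
    return {
        "qc_pass_rate": len(valid) / len(results),
        "hit_rate": len(hits) / len(valid) if valid else 0.0,
        # Proxy for calibration error: mean absolute prediction error.
        "mean_abs_error": mean(errors) if errors else None,
        "cost_per_validated_candidate": cycle_cost / len(hits) if hits else None,
    }


if __name__ == "__main__":
    # Hypothetical round of four results.
    round_results = [
        {"predicted_pic50": 7.0, "measured_pic50": 7.5, "qc_passed": True},
        {"predicted_pic50": 6.0, "measured_pic50": 5.0, "qc_passed": True},
        {"predicted_pic50": 8.0, "measured_pic50": 7.0, "qc_passed": True},
        {"predicted_pic50": 6.5, "measured_pic50": 6.5, "qc_passed": False},
    ]
    print(cycle_review(round_results, hit_threshold_pic50=7.0, cycle_cost=10_000.0))
```

Tracking these four numbers per cycle shows whether the loop is improving the model, the assay, or neither.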
The AI biotech stack is becoming a control system for biology. The model proposes, the lab measures, and the architecture decides how quickly the next experiment gets smarter.
Sources
- Accurate structure prediction of biomolecular interactions with AlphaFold 3
- Evo 2: DNA Foundation Model
- Genome modelling and design across all domains of life with Evo 2
- NVIDIA BioNeMo Documentation Hub
- BioNeMo Release Notes
- NVIDIA NIM for DiffDock
- Drug Discovery, STAT! NVIDIA, Recursion Speed Drug Discovery
- UniProtKB/Swiss-Prot Release 2026_01 statistics
- ChEMBL 37 is here
- Millions of protein complexes added to AlphaFold Database
- FDA proposes framework to advance credibility of AI models
- NIST AI Risk Management Framework
- A call for built-in biosecurity safeguards for generative AI tools
- Strengthening biosecurity in the era of AI
