The autonomous AI scientist landscape

An autonomous AI scientist is a named software system that takes meaningful initiative in at least one of three primary stages: hypothesis generation (proposing novel, testable scientific claims rather than retrieving existing ones), experiment design (choosing experiments that discriminate between hypotheses and optimizing protocols), and analysis (interpreting experimental data, fitting models, drawing inferences). Closing the full loop of hypothesize, design, and analyze is the explicit ambition described in the Gao et al. Cell perspective (2024).

How the landscape breaks down

Across the 65 systems tracked here, a few cross-cutting patterns matter more than any single entry:

  • Chemistry and materials are the most loop-closed. These systems run genuine closed loops on physical hardware — CMU’s Coscientist and MARS drive robotic synthesis, AMASE and MAD couple multiple characterization instruments to active-learning policies, and LEAP and CatDT pair domain-tuned models with wet-lab or microkinetic validation. This is where “autonomous” most fully means hands-on-the-apparatus.
  • Biology and medicine carry the strongest evidence. Wet-lab and clinical validation sets the highest evidence tier in the field: Co-Scientist’s in vitro oncology hits, Robin’s confirmed dry-AMD drug candidates, SPARK’s prospective pathology study across 18 cohorts, CRISPR-GPT’s non-expert gene-editing case study, and OriGene’s agent-nominated cancer targets confirmed in patient-derived organoid models.
  • ML and scientific computing is a large, fast-moving cluster, but mostly benchmark-validated rather than physically grounded. The recent design trend is architectural: self-improving systems that accumulate methodological memory across problems (GRAFT-ATHENA, EvoScientist, AutoSci, AutoScientists), and a turn toward verifiability — adversarial cross-model review (ARIS), evidence-chain auditing (ScientistOne), and numeric-registry gating (AutoResearchClaw).
  • Physical sciences and embodied systems are the newest frontier, moving AI past simulators onto real apparatus: Qumus fabricates 2D-material devices, the Qiushi Discovery Engine runs a free-space optical platform, Dr.Sai operates inside the BESIII collider collaboration, and BioProVLA-Agent runs wet-lab robotics on a rig costing roughly $800.
  • A long tail of single-domain pioneers now spans mathematics (AI co-mathematician, Ax-Prover’s formal Lean proving), symbolic equation discovery (MCI), spatial data science (NORA), plant science (Aleks), and scientific visualization (VIS Co-Scientist).

Evaluation is shifting from one-off demos toward process-level scoring and verifiability audits, and several shared failure modes — reference hallucination, reproducibility, originality-versus-retrieval — remain open. See Evaluation and open problems.

All tracked systems

The full set is below, grouped by domain. Each row links to a per-system page with architecture, validation detail, headline metrics, and citation. Loop stage is the part of the scientific loop the system drives (Multi-stage = closes hypothesize → design → analyze); Validation is the strongest evidence tier it has demonstrated (Wet-lab > Mixed > Benchmark > Design-only).
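As a rough illustration of how the two columns can be read together, the sketch below encodes the validation ordering as an ordered enum and treats Multi-stage as covering all three loop stages. The names `Validation`, `SystemRow`, and `is_multi_stage` are assumptions made for this example and do not describe how the tracker itself is implemented. The four sample rows are copied from the Biology & medicine table, and the Writing label that appears in some rows is left out.

```python
from dataclasses import dataclass
from enum import IntEnum


class Validation(IntEnum):
    """Evidence tiers; a larger value means stronger evidence."""
    DESIGN_ONLY = 0
    BENCHMARK = 1
    MIXED = 2
    WET_LAB = 3


# The three primary loop stages named in the definition above.
STAGES = frozenset({"hypothesis", "experiment design", "analysis"})


@dataclass(frozen=True)
class SystemRow:
    """Hypothetical record for one table row (illustrative only)."""
    name: str
    org: str
    stages: frozenset  # subset of STAGES
    validation: Validation
    access: str

    @property
    def is_multi_stage(self) -> bool:
        # "Multi-stage" means the system closes hypothesize -> design -> analyze.
        return self.stages == STAGES


rows = [
    SystemRow("PharmaSwarm", "UAB",
              frozenset({"hypothesis", "analysis"}),
              Validation.DESIGN_ONLY, "Unknown"),
    SystemRow("Biomni", "Stanford", STAGES,
              Validation.BENCHMARK, "Open source"),
    SystemRow("LabOS", "Stanford / Princeton", STAGES,
              Validation.MIXED, "Open source"),
    SystemRow("Robin (FutureHouse)", "FutureHouse", STAGES,
              Validation.WET_LAB, "Open source"),
]

# Rank by evidence tier, strongest first (Wet-lab > Mixed > Benchmark > Design-only).
ranked = sorted(rows, key=lambda r: r.validation, reverse=True)

for r in ranked:
    label = "Multi-stage" if r.is_multi_stage else ", ".join(sorted(r.stages))
    print(f"{r.name}: {label} / {r.validation.name}")
```

Using an ordered enum keeps the tier comparison explicit, so filtering for systems at or above a given tier becomes a plain `>=` comparison.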

Biology & medicine (17)

System | Org | What it is | Loop stage | Validation | Access
AgentPLM | Bedford College | Agentic protein language model that interleaves sequence generation with structure and docking tool calls. | Experiment design | Benchmark | Closed
BioProVLA-Agent | ECUST | Affordable embodied multi-agent system using Vision-Language-Action models for wet-lab manipulation. | Experiment design | Wet-lab | Unknown
Biomni | Stanford | General-purpose biomedical AI agent pairing a large tool environment with a code-executing planner. | Multi-stage | Benchmark | Open source
CRISPR-GPT | Stanford | Four-agent LLM planner for CRISPR-Cas gene-editing experiments across knockout, base, prime, and epigenetic editing. | Experiment design, Analysis | Wet-lab | Closed
Deep Research (BioAgents) | BioAgents | Open-source interactive multi-agent system for biomedical research, running in minutes per cycle. | Multi-stage | Benchmark | Open source
LabOS | Stanford / Princeton | AI-XR co-scientist pairing self-evolving agents with smart glasses to reason about and act in the physical lab. | Multi-stage | Mixed | Open source
Latent-Y | Latent Labs | Lab-validated autonomous agent that runs complete de novo antibody design campaigns from text prompts. | Multi-stage | Wet-lab | Closed
NeuroClaw | CUHK | Domain-specialized multi-agent assistant for reproducible neuroimaging research on raw MRI/fMRI/EEG data. | Experiment design, Analysis | Benchmark | Open source
OpenScientist | WashU | Open-source AI co-scientist for biomedical discovery, built on Claude Code with domain-specific Agent Skills. | Multi-stage | Wet-lab | Open source
OriGene | SJTU | Self-evolving multi-agent “virtual disease biologist” that nominates mechanism-grounded drug targets. | Hypothesis, Analysis | Wet-lab | Open source
PantheonOS | Stanford | Privacy-preserving multi-agent framework for single-cell and multi-omics analysis that evolves its own code. | Multi-stage | Wet-lab | Open source
PerTurboAgent | Genentech | Self-planning Genentech LLM agent that designs iterative Perturb-seq screens, round by round. | Experiment design, Analysis | Benchmark | Unknown
PharmaSwarm | UAB | Multi-agent LLM swarm for hypothesis-driven drug discovery with a central Evaluator ranking targets and compounds. | Hypothesis, Analysis | Design-only | Unknown
Robin (FutureHouse) | FutureHouse | Multi-agent system integrating hypothesis generation with data analysis in a lab-in-the-loop workflow. | Multi-stage | Wet-lab | Open source
SPARK | Cologne | System of pathology agents that turns biological concepts into tools applied to H&E and spatial-biology data. | Multi-stage | Wet-lab | Open source
Talk2QSP | Sanofi | Agent framework converting literature clinical scenarios into executable QSP/SBML model interventions. | Experiment design | Benchmark | Closed
The Virtual Biotech | Stanford | Multi-agent framework mirroring a biotech org, with a CSO agent delegating computational drug discovery to specialists. | Multi-stage | Benchmark | Unknown

Chemistry & materials (13)

System | Org | What it is | Loop stage | Validation | Access
AILA | IIT Delhi | Multi-agent LLM framework that autonomously operates an atomic force microscope, with the AFMBench suite. | Experiment design, Analysis | Wet-lab | Open source
AMASE | Maryland | Autonomous Materials Search Engine pairing robotic thin-film XRD with CALPHAD phase-diagram prediction. | Multi-stage | Wet-lab | Open source
AtomisticSkills | MIT | Open-source skill library that lets coding agents run atomistic research in materials, chemistry, and drug discovery. | Multi-stage | Benchmark | Open source
BORA | Liverpool | Language-based assistant coupling an LLM with Bayesian optimization for literature-grounded experiment design. | Experiment design, Hypothesis | Benchmark | Open source
CatDT | HKUST | Self-evolving multi-agent system that builds a condition-aware digital twin of a heterogeneous catalyst. | Multi-stage | Benchmark | Code on request
CategoryScienceClaw | MIT | Discovery framework adding a category-theoretic, proof-carrying layer to ScienceClaw for verifiable transitions. | Multi-stage | Benchmark | Open source
ChemCrow | EPFL | GPT-4-driven chemistry agent combining reasoning with expert tools for retrosynthesis and reaction execution. | Experiment design, Analysis | Wet-lab | Open source
Coscientist (CMU) | CMU | GPT-4 planner driving a robotic chemistry platform fully autonomously across the experimental loop. | Experiment design, Analysis | Wet-lab | Code on request
DKPL | Oak Ridge | Autonomous-microscopy framework that learns a utility from expert pairwise judgements to plan nanoscale experiments. | Experiment design, Analysis | Wet-lab | Unknown
LEAP | Renmin U | Expert-in-the-loop framework coupling a domain LLM with Bayesian optimization for perovskite additive discovery. | Multi-stage | Wet-lab | Unknown
MAD | Maryland | Closed-loop framework orchestrating multiple characterization instruments for thin-film materials discovery. | Experiment design, Analysis | Wet-lab | Code on request
MARS | SIAT | Hierarchical multi-agent and robotic framework that designs, synthesizes, and optimizes perovskites. | Multi-stage | Wet-lab | Open source
Qumus | Princeton | Embodied multi-agent system that fabricates 2D quantum materials and vdW device stacks in a robotic mini-lab. | Multi-stage | Wet-lab | Code on request

ML & scientific computing (14)

System | Org | What it is | Loop stage | Validation | Access
AI CFD Scientist | RPI | Open-source AI scientist for computational fluid dynamics, from ideation to solver runs to figure-grounded writing. | Multi-stage, Writing | Benchmark | Open source
AI Scientist (Sakana) | Sakana AI | End-to-end LLM agent that ideates, codes, runs ML experiments, and writes its own reviewed paper. | Multi-stage, Writing | Benchmark | Open source
AIRA (AIRA-Compose and AIRA-Design) | Meta FAIR | Pair of Meta FAIR LLM-agent frameworks that autonomously discover novel foundation-model architectures. | Multi-stage | Benchmark | Closed
ARIS | SJTU | Autonomous ML research harness using cross-model adversarial collaboration between executor and reviewer. | Multi-stage, Writing | Benchmark | Open source
ATLAS | DeepMind | DeepMind active-learning loop that designs experiments to discover interpretable mechanistic models of behavior. | Multi-stage | Benchmark | Unknown
AgenticSciML | Brown | Multi-agent LLM system that debates and evolutionarily refines scientific-machine-learning architectures. | Hypothesis, Experiment design | Benchmark | Code on request
AutoLLMResearch | Notre Dame | RL-trained agent that extrapolates from cheap LLM runs to configure expensive scalable experiments. | Experiment design, Analysis | Benchmark | Open source
AutoResearchClaw | UNC-Chapel Hill | Multi-agent ML research pipeline with structured debate, a self-healing executor, and human-in-the-loop modes. | Multi-stage, Writing | Benchmark | Open source
AutoTTS | Maryland | Agentic framework that autonomously discovers test-time-scaling controllers for LLM reasoning. | Hypothesis, Experiment design, Analysis | Benchmark | Open source
Deep Researcher Agent | U Tokyo | Open-source framework running LLM agents 24/7 through full deep-learning experiment cycles. | Multi-stage | Benchmark | Open source
GRAFT-ATHENA | Brown | Self-improving agentic framework that expands its own action space to discover numerical algorithms. | Multi-stage | Benchmark | Code on request
Jr. AI Scientist | U Tokyo | Autonomous AI scientist mimicking a novice student improving on a baseline paper through iterated experiments. | Multi-stage, Writing | Benchmark | Open source
MLEvolve | Shanghai AI Lab | Self-evolving multi-agent framework that discovers end-to-end ML solutions via Progressive Monte Carlo Graph Search. | Multi-stage | Benchmark | Open source
POISE | Fudan | Closed-loop Fudan framework that autonomously discovers policy-optimization algorithms for LLM reinforcement learning. | Multi-stage | Benchmark | Unknown

Physical sciences (5)

System | Org | What it is | Loop stage | Validation | Access
CMBEvolve and CosmoEvolve | Cambridge | Cambridge cosmology pair of agentic systems for code evolution and virtual-research-lab simulation. | Multi-stage | Benchmark | Code on request
CVEvolve | Argonne | Argonne agentic harness that discovers data-processing algorithms for experimental images via zero-code search. | Analysis | Benchmark | Unknown
DarkAgents | U. Bologna / INFN | Language-driven multi-agent system that takes a particle-physics model to a fit against astroparticle data, with explicit assumption and prior auditing. | Multi-stage | Benchmark | Open source
Dr.Sai | IHEP | Multi-agent LLM system for autonomous end-to-end high-energy-physics data analysis on the BESIII detector. | Analysis, Experiment design | Wet-lab | Unknown
Qiushi Discovery Engine | Zhejiang U | Agentic system performing end-to-end autonomous discovery on a real free-space optical platform. | Multi-stage | Wet-lab | Unknown

Math & symbolic (3)

System | Org | What it is | Loop stage | Validation | Access
AI co-mathematician | Google DeepMind | Google DeepMind agentic workbench for open-ended mathematics research with parallel ideation and proving. | Multi-stage, Writing | Mixed | Closed
Ax-Prover | Axiomatic AI | Multi-agent Lean theorem prover that gives general-purpose LLMs Lean tools to verify proofs across math and quantum physics. | Analysis | Benchmark | Open source
MCI | KRICT | Multi-agent framework blending symbolism and metaheuristics to discover explainable governing equations. | Hypothesis, Analysis | Benchmark | Unknown

General / multi-domain (10)

System | Org | What it is | Loop stage | Validation | Access
AutoSci | Peking U | Memory-centric agent that runs the full research lifecycle over persistent memory and evolves its own skills. | Multi-stage, Writing | Benchmark | Open source
AutoScientists | Harvard | Decentralized team of AI agents that self-organize around hypotheses and share results across teams. | Multi-stage | Benchmark | Open source
Co-Scientist (Google) | Google | Gemini-based multi-agent reasoning engine for biomedical hypothesis generation. | Hypothesis, Experiment design | Wet-lab | Closed
EOS AI agent | UNC-Chapel Hill | UNC AI agent that creates, runs, and analyzes lab protocols and optimization campaigns from text requests. | Experiment design, Analysis | Wet-lab | Unknown
EvoMaster | SJTU | Domain-agnostic harness for building self-evolving scientific agents; underpins the SciMaster ecosystem. | Multi-stage | Benchmark | Open source
EvoScientist | Huawei | Evolving multi-agent AI scientist whose agents share memories distilling past runs into reusable strategies. | Multi-stage | Benchmark | Open source
Kosmos | Edison Scientific | AI scientist automating data-driven discovery via long cycles of parallel analysis against a world model. | Multi-stage, Writing | Mixed | Closed
NovelSeek | Shanghai AI Lab | Closed-loop multi-agent framework evaluated across 12 AI-for-Science research tasks. | Multi-stage | Benchmark | Open source
SAGA | Cornell + multi | Bi-level agent that autonomously evolves the objective functions of a scientific design problem rather than treating them as fixed. | Multi-stage | Wet-lab | Open source
ScientistOne | Google | End-to-end autonomous research system maintaining verifiable evidence chains for every claim. | Multi-stage, Writing | Benchmark | Closed

Other (3)

System | Org | What it is | Loop stage | Validation | Access
Aleks | Cornell | Cornell multi-agent system that turns a plant-science question and dataset into interpretable ML models. | Multi-stage | Benchmark | Unknown
NORA | UT Knoxville | Multi-agent autonomous research system purpose-built for GIScience and spatial data science. | Multi-stage, Writing | Benchmark | Unknown
VIS Co-Scientist | LLNL | Agentic harness that autonomously designs custom visual-analysis apps from a dataset and task description. | Analysis | Benchmark | Code on request
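
The per-domain counts in the section headings can be cross-checked against the stated total of 65 with a few lines of arithmetic. This is only a sanity-check sketch: the dictionary is typed in by hand from the headings above, and the variable names are chosen for this example.

```python
# Section sizes as given in the domain headings above.
domain_counts = {
    "Biology & medicine": 17,
    "Chemistry & materials": 13,
    "ML & scientific computing": 14,
    "Physical sciences": 5,
    "Math & symbolic": 3,
    "General / multi-domain": 10,
    "Other": 3,
}

# Sum of all sections; should match the "65 systems tracked here" figure.
total = sum(domain_counts.values())

# Domain with the most tracked systems.
largest = max(domain_counts, key=domain_counts.get)

# Each domain's share of the tracked set, as a percentage.
shares = {d: round(100 * n / total, 1) for d, n in domain_counts.items()}

print(total)    # 65
print(largest)  # Biology & medicine
```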

Other systems being tracked for inclusion: Virtual Lab (Stanford / CZ Biohub, Nature 2025 — designed novel SARS-CoV-2 nanobodies), CORAL (multi-agent evolutionary discovery, arXiv:2604.01658), STORM, Aviary, and AutoBa.

Sources