How autonomous AI scientists are evaluated

For the system catalogue and cross-cutting synthesis, see the Landscape. This page collects how the field measures these systems and the failure modes that remain open.

Evaluation regimes

Evaluation now spans several regimes, in roughly descending order of evidentiary strength.

Wet-lab and instrument-coupled validations remain the strongest evidence: Co-Scientist’s in vitro AML hits and organoid-confirmed fibrosis targets; Robin’s ripasudil/KL001 confirmations; CMU Coscientist’s robotic cross-couplings; CRISPR-GPT’s non-expert case study; MARS’s robotic perovskite synthesis loop; AILA’s five real-world AFM experiments; Qumus’s AI fabrication of graphene and vdW field-effect transistors; Qiushi Engine’s autonomous discovery and experimental validation of optical bilinear interaction on a real optical platform; Dr.Sai’s reproduction of ten J/ψ branching fractions in the BESIII production environment; AI CFD Scientist’s vision-gated discovery of a Spalart–Allmaras correction validated against DNS; and SPARK’s prospective validation of >1,000 LLM-generated histopathology parameters across 18 patient cohorts in five cancer types.

Independent expert review of system reports has emerged with Kosmos: scientist evaluators classified 102 statements drawn from three reports as Supported or Refuted, with 79.4% Supported. Collaborators independently rated a 20-cycle Kosmos run as equivalent to roughly six months of their own research time.

Standardised benchmarks have grown: Biomni reports LAB-Bench DbQA/SeqQA and HLE numbers against human-expert baselines; AI Index 2026 coverage cites PaperArena (best agent 39%) and notes that the best AI agents score roughly half as well as human PhDs on multistep tasks (Nature news, April 2026). Process-level evaluation is the newest addition: BiomniBench-DA (Qu et al., Stanford / Phylo) grades the full agent trajectory across six dimensions (data handling, method selection, statistical rigor, biological interpretation, scientific reasoning, and source reliability) against expert-authored rubrics on 100 tasks drawn from Nature / Cell / Science papers and co-developed with the original authors. Across nine frontier base models in the Terminus-2 harness, frontier and open-weight bases cluster within ~5 points (best: Opus 4.7 at 63.9; GLM-5.1 at 60.4). Switching from Terminus-2 to Claude Code lifts Opus 4.7 to 73.3, and the agent-harness gap (13.5 points for GPT-5.4 across Codex CLI vs. Terminus-2) exceeds the Opus 4.7 / Opus 4.6 model-generation gap (3.8 points). Even the strongest configuration sits below 75/100, with the largest deficits on method selection, biological interpretation, and scientific reasoning (Qu et al., bioRxiv 2026.05.12.724604).
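The harness-versus-model comparison above is simple arithmetic on the reported scores, and a short sketch makes the relative sizes easier to see. The variable names are illustrative; the only inputs are the figures quoted in the paragraph, and the GPT-5.4 and Opus 4.6 gaps are taken as reported because their underlying per-configuration scores are not given here.

```python
# BiomniBench-DA figures as quoted in the paragraph above (0-100 scale).
opus_47_terminus2 = 63.9    # Opus 4.7 in the Terminus-2 harness
opus_47_claude_code = 73.3  # Opus 4.7 in Claude Code
glm_51_terminus2 = 60.4     # GLM-5.1 in the Terminus-2 harness

# Gaps quoted directly; the per-configuration scores behind them are not listed.
gpt_54_harness_gap = 13.5   # GPT-5.4, Codex CLI vs. Terminus-2
opus_generation_gap = 3.8   # Opus 4.7 vs. Opus 4.6

# Derived quantities (rounded to the precision of the inputs).
opus_harness_lift = round(opus_47_claude_code - opus_47_terminus2, 1)
opus_vs_glm_gap = round(opus_47_terminus2 - glm_51_terminus2, 1)
headroom = round(100 - opus_47_claude_code, 1)
harness_to_generation_ratio = round(gpt_54_harness_gap / opus_generation_gap, 2)

print(f"Opus 4.7 harness lift (Terminus-2 -> Claude Code): {opus_harness_lift}")
print(f"Opus 4.7 vs. GLM-5.1 in Terminus-2: {opus_vs_glm_gap}")
print(f"Headroom of best configuration to 100: {headroom}")
print(f"GPT-5.4 harness gap / Opus generation gap: {harness_to_generation_ratio}")
```

On these numbers, changing the harness moves Opus 4.7 by more than twice the quoted model-generation gap, which is the point the comparison is making.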

Internal head-to-head studies are appearing: NovelSeek vs. AI Scientist-v2 on idea novelty; Sakana’s Automated Reviewer benchmarked against NeurIPS-scale human agreement (~69% balanced accuracy); EvoScientist vs. seven open-source and commercial baselines (Virtual Scientist, AI-Researcher, InternAgent, AI Scientist-v2, and others) on idea quality and code-execution success rate. Deep Research (BioAgents) reports a state-of-the-art 48.8% on BixBench open response (the first primary-source evidence we track on the BixBench computational-biology benchmark) and 64.4% on MCQ without refusal. Jr. AI Scientist papers receive higher DeepReviewer scores than those of existing fully automated systems on three NeurIPS/IJCV/ICLR baselines.
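Balanced accuracy, the metric quoted for the Automated Reviewer, is the mean of the true-positive and true-negative rates. The sketch below is a minimal illustration on hypothetical toy data (not the benchmark's data or code); it shows why the metric is preferred over plain accuracy when accept and reject decisions are imbalanced.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


def accuracy(y_true, y_pred):
    """Fraction of decisions that match the reference labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical toy data: 1 = accept, 0 = reject; 2 accepts among 10 papers.
human = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
reviewer = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # one miss, one false accept
always_reject = [0] * 10                    # trivial baseline

reviewer_bal = balanced_accuracy(human, reviewer)
baseline_bal = balanced_accuracy(human, always_reject)
baseline_acc = accuracy(human, always_reject)

print(reviewer_bal, baseline_bal, baseline_acc)
```

In this toy case the always-reject baseline reaches 80% plain accuracy but only 50% balanced accuracy, so a balanced-accuracy figure should be read against a 50% chance level rather than against the majority-class rate.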

Verifiability audits are another recent evaluation regime. ScientistOne’s Chain-of-Evidence (CoE) Integrity Audit scores 15 papers per system across five autonomous research systems on four checks (Score Verification, Specification Violation, Reference Verification, and Method-Code Alignment) and reports that every baseline exhibits at least one systematic failure mode: hallucinated-reference rates reach 21%, score verification passes in as few as 42% of papers, and method–code alignment ranges from 20% to 80%. ScientistOne is the only audited system to reach 0/337 hallucinated references and 12/12 score verification (Meng et al., arXiv:2605.26340).
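To make the four checks concrete, the sketch below shows one way per-paper audit outcomes could be rolled up into the kind of per-system rates quoted above. It is an illustration only: the record layout, field names, and aggregation rules are assumptions, not the ScientistOne audit's actual schema, and the sample records are invented toy values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaperAudit:
    """Hypothetical per-paper audit record (field names are assumptions)."""
    refs_total: int
    refs_hallucinated: int
    score_verified: Optional[bool]  # None: paper has no checkable score (assumed case)
    spec_violation: bool
    method_code_aligned: bool


def summarise(audits):
    """Aggregate per-paper outcomes into per-system rates."""
    if not audits:
        raise ValueError("no audited papers")
    refs_total = sum(a.refs_total for a in audits)
    refs_bad = sum(a.refs_hallucinated for a in audits)
    scored = [a for a in audits if a.score_verified is not None]
    return {
        # Pooled over all references, not averaged per paper.
        "hallucinated_ref_rate": refs_bad / refs_total if refs_total else 0.0,
        # Only papers with a checkable score enter the denominator.
        "score_verification_rate": (
            sum(a.score_verified for a in scored) / len(scored) if scored else None
        ),
        "spec_violation_rate": sum(a.spec_violation for a in audits) / len(audits),
        "method_code_alignment_rate": (
            sum(a.method_code_aligned for a in audits) / len(audits)
        ),
    }


# Invented toy records for one hypothetical system.
toy = [
    PaperAudit(20, 2, True, False, True),
    PaperAudit(30, 0, False, True, False),
    PaperAudit(50, 8, None, False, True),
]
summary = summarise(toy)
print(summary)
```

Pooling references across papers and excluding unscored papers from the score-verification denominator are design choices of this sketch; a different convention would change the reported rates, which is why audit results are easiest to compare when the denominators are stated alongside them.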

Open problems

Hallucination of references. Robin’s ablations show that removing PaperQA2-based agents (Crow/Falcon) dramatically increases hallucinated citations; the AI Scientist papers report similar inaccuracies in citations and figures.
Reproducibility of analyses. Robin runs eight Finch trajectories and takes a consensus across them; Gao et al. flag reproducibility and rigorous peer review of agentic research as open challenges.
Originality versus retrieval. Gao et al. argue current foundation models may struggle to produce hypotheses outside their training distribution — a structural limit on hypothesis-generation claims. A bioRxiv critical evaluation (January 2026, 10.64898/2026.01.05.697809) reports that none of eight open-source frameworks completed a full research cycle end-to-end.
Code-execution risk and dual use. Biomni notes its agent executes LLM-generated code with full system privileges by default; CRISPR-GPT withholds full code pending regulatory clarity; Co-Scientist authors cite safety implications as a reason for not releasing source; Robin notes RLHF and platform-level controls against malicious protocol generation. MARS isolates code execution in Docker.
Instruction adherence in lab settings. AILA’s AFMBench evaluation documents an “agent sleepwalking” failure mode in which agents deviate from supplied instructions, and finds that materials-science question-answering proficiency does not transfer to laboratory operation — a direct warning for self-driving-lab deployments.
Risk disclosure for autoresearch systems. Jr. AI Scientist ships an explicit risk report alongside its main paper, cataloguing failure modes encountered during development and pairing it with three evaluation regimes (DeepReviewer, author-led review, Agents4Science submission); the authors identify “important limitations” from author and Agents4Science reviews as a concrete warning against directly applying current AI Scientist systems to academic output (Miyai et al., TMLR 2026).
Evaluation gaps. Few head-to-head studies share benchmarks across the systems tracked here; the AI Index Report 2026 notes that even the best agents lag human PhDs by ~2× on multistep tasks. BiomniBench-DA narrows part of this gap with cross-harness, cross-model process-level scoring on real biomedical data-analysis tasks but is currently confined to data analysis; experimental design, literature synthesis, and protocol-optimization tasks are slated for future releases.
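As an illustration of the multi-trajectory consensus mentioned under "Reproducibility of analyses", the sketch below takes a majority vote over the conclusions of repeated runs and reports how strongly they agree. Majority voting, the function name, and the threshold are assumptions made for illustration; the source does not describe how Robin forms its consensus, and the labels are toy values.

```python
from collections import Counter


def consensus(conclusions, min_agreement=0.5):
    """Return (majority label or None, agreement fraction) over repeated runs.

    The label is returned only if its share of runs exceeds min_agreement.
    """
    if not conclusions:
        raise ValueError("no trajectories to aggregate")
    label, count = Counter(conclusions).most_common(1)[0]
    agreement = count / len(conclusions)
    return (label if agreement > min_agreement else None), agreement


# Toy example: eight independent analysis runs on the same dataset.
runs = ["up", "up", "down", "up", "no_change", "up", "down", "up"]

result = consensus(runs)
strict = consensus(runs, min_agreement=0.75)
print(result)  # majority label with its agreement fraction
print(strict)  # no label when agreement falls below the stricter threshold
```

Reporting the agreement fraction alongside the consensus label gives readers a direct view of run-to-run variability, which a bare consensus result hides.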