AutoResearchClaw
Multi-agent autonomous research pipeline that combines structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop, verifiable result reporting against a numeric registry, seven human-in-the-loop intervention modes, and cross-run lesson accumulation.
| Affiliation | UNC-Chapel Hill–led consortium with UC Santa Cruz, CMU, NUS, UC Berkeley, Rutgers, NEC Labs America, Meta, Stanford, Google, University of Washington (paper) |
| First introduced | 2026-05 (arXiv:2605.20025, dated 2026-05-19) |
| Lifecycle stages | Multi-stage: a 23-stage pipeline spanning Discovery → Experimentation → Writing |
| Autonomy level | Semi-autonomous — Full-Auto mode supported, but the recommended CoPilot mode uses targeted human intervention at six high-leverage decision points; SmartPause routes uncertain decisions to the researcher |
| Domain focus | Machine learning (ML01–ML25 in ARC-Bench) extended to 10 high-energy physics, 7 systems biology, and 3 statistics topics via sandboxed domain-skilled sub-agents |
| Availability | Open source (github.com/aiming-lab/AutoResearchClaw) |
Approach
Five mechanisms span a 23-stage Discovery → Experimentation → Writing pipeline.
- Structured multi-agent debate at two stages. The hypothesis-stage panel brings together an Innovator, a Pragmatist, and a Contrarian; the result-stage panel brings together an Optimist, a Skeptic, and a Methodologist. In each panel, a synthesizer integrates the outputs into a single structured artifact.
- Self-healing executor with Pivot/Refine. Failures are treated as diagnostic information: after each run the system chooses one of three actions: Proceed, Refine (retry with targeted fixes), or Pivot (return to hypothesis generation with the failure recorded). A complexity score routes hard experiments to an external coding agent; easier ones are handled by a built-in multi-phase code agent with dependency-ordered file generation, AST summaries, and static validation gates. All execution runs in Docker under a three-phase network policy (Phase 2 disables network access entirely during measurement).
- Verifiable result reporting. A numeric registry whitelists every value produced by experiment runs; a post-hoc verifier re-extracts numeric claims from the draft and rejects documents with unbacked numbers in Abstract/Results/Experiments. Citations pass a four-layer pipeline (CrossRef → OpenAlex → arXiv → Semantic Scholar) and an LLM relevance check classifying each reference as Verified, Suspicious, or Hallucinated.
- Human-in-the-loop collaboration. Seven intervention modes (Full-Auto, Gate-Only, CoPilot, Thorough, Step-by-Step, Pre-Experiment, Post-Experiment) plus a confidence-driven SmartPause that learns per-stage pause thresholds from researcher overrides.
- Cross-run evolution. A persistent lesson store ranks retrieved lessons by a time-decayed weight, w(l) = s(l) · exp(−ln 2 · Δt / T_½), with a 30-day default half-life T_½; lessons are injected as natural-language overlays into subsequent prompts.
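The time-decayed lesson weight can be illustrated with a minimal Python sketch. It assumes that s(l) is a per-lesson base score and that Δt is the lesson's age in days; the `Lesson` class, its field names, and `rank_lessons` are hypothetical names chosen for illustration, not identifiers from the AutoResearchClaw codebase.

```python
import math
from dataclasses import dataclass

HALF_LIFE_DAYS = 30.0  # default half-life stated in the text


@dataclass
class Lesson:
    text: str        # natural-language lesson (hypothetical field)
    score: float     # s(l): base score of the lesson (assumed meaning)
    age_days: float  # Δt: days since the lesson was recorded (assumed unit)


def lesson_weight(lesson, half_life_days=HALF_LIFE_DAYS):
    """w(l) = s(l) * exp(-ln 2 * Δt / T_half)."""
    return lesson.score * math.exp(-math.log(2) * lesson.age_days / half_life_days)


def rank_lessons(lessons, top_k=3):
    """Return the top_k lessons, highest decayed weight first."""
    return sorted(lessons, key=lesson_weight, reverse=True)[:top_k]
```

With this form, a lesson loses half of its weight every 30 days, so a 60-day-old lesson with score 1.0 (weight 0.25) ranks below a fresh lesson with score 0.6.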
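The verifiable-reporting mechanism described in the list above can be sketched in the same spirit. The code below is an illustration under stated assumptions, not the project's verifier: it treats the registry as a plain collection of floats, extracts only non-negative decimal literals with a simple regular expression, and all function names are hypothetical.

```python
import re

# Sections the text says are checked for unbacked numbers.
CHECKED_SECTIONS = ("Abstract", "Results", "Experiments")

# Simplified pattern: non-negative integers and decimals only (an assumption).
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unbacked_numbers(text, registry, tol=1e-9):
    """Return numeric claims in `text` that match no registry value."""
    claims = [float(m) for m in NUMBER.findall(text)]
    return [c for c in claims if not any(abs(c - v) <= tol for v in registry)]


def verify_draft(sections, registry):
    """Map each checked section to its unbacked values; an empty dict means pass."""
    report = {}
    for name in CHECKED_SECTIONS:
        bad = unbacked_numbers(sections.get(name, ""), registry)
        if bad:
            report[name] = bad
    return report
```

A draft would be rejected whenever `verify_draft` returns a non-empty report; sections outside the checked set are ignored in this sketch.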
Validation
Introduces ARC-Bench, a 25-topic ML benchmark with a 20-topic scientific-domain extension (10 HEP, 7 systems biology, 3 statistics). Evaluated in three modes: experiment-stage (rubric-assisted strict judge, CD:CE:RA = 25:25:50), end-to-end (1–10 paper-quality scale with accept ≥ 5), and scientific-domain (same rubric on physics/biology/statistics tasks). All baselines run on the same GPT-5.3-codex backbone in the same sandbox.
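One way to read the CD:CE:RA = 25:25:50 ratio is as the weights of a weighted average over three component scores. The sketch below assumes exactly that, with each component scored on a 0 to 1 scale; the function name and the example component values are hypothetical and do not come from the paper.

```python
# Weights derived from the stated 25:25:50 ratio (assumed to be a weighted average).
WEIGHTS = {"CD": 0.25, "CE": 0.25, "RA": 0.50}


def overall_score(components):
    """Combine CD, CE, and RA scores (each assumed in [0, 1]) into one score."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = overall_score({"CD": 0.8, "CE": 0.6, "RA": 0.5})
```

Under this reading, RA carries as much weight as the other two components combined, so a gain on RA moves the overall score twice as far as the same gain on CD or CE.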
Notable results
- ARC-Bench experiment stage: AutoResearchClaw (CoPilot) overall 0.648 vs. AI Scientist v2 0.419 (a 54.7% relative improvement) and AIDE-ML 0.511. The largest gap is Result Analysis: 0.523 vs. 0.261 (+100.4%). Full-Auto AutoResearchClaw (0.596) still beats both baselines.
- End-to-end HITL ablation (10 ML topics, 7 modes): CoPilot 87.5% accept rate (mean quality 7.27, 19 interventions) — beats Full-Auto (25%, 0 interventions) and Step-by-Step (50%, 29 interventions). Pre-Experiment HITL alone is widely valid but rarely lifts quality; Post-Experiment HITL alone improves faithfulness but is valid on only 6/10 topics.
- Cross-domain coverage: scientific-domain overall 0.867 vs. 0.090 for AIDE-ML and 0.084 for AI Scientist v2 — both baselines fail to install the required HEP and biology stacks. AutoResearchClaw reaches 0.912 on biology (COBRApy / BiGG) and 0.898 on statistics (DML, bootstrap), with 0.489 on HEP-ph after reproducing published cross-sections via FeynRules/MadGraph/MadAnalysis5.
- Component ablation: multi-agent debate is the largest quality contributor (−1.37 quality without it, p=0.003), self-healing is the largest completion contributor (10/10 → 6/10 without it), and removing the verification gate inflates apparent acceptance from 3/10 to 5/10 — three of those five papers contain values absent from any measurement record.
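The relative improvements quoted in the experiment-stage result can be checked from the reported scores. This short sketch assumes the usual definition, (new − baseline) / baseline; the function name is illustrative.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Scores reported above: AutoResearchClaw (CoPilot) vs. AI Scientist v2.
overall = relative_improvement(0.648, 0.419)
result_analysis = relative_improvement(0.523, 0.261)
print(f"{overall:.1f}% {result_analysis:.1f}%")  # prints "54.7% 100.4%"
```

Both values agree with the figures quoted in the list above.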
Primary paper
Other references
None yet.