ARIS
Autonomous research harness that coordinates ML research workflows via cross-model adversarial collaboration — pairing an executor model from one family with a reviewer model from a different family at each assurance checkpoint.
| Affiliation | Shanghai Jiao Tong University and Shanghai Innovation Institute (Yang, Li, Li) |
| First introduced | 2026-04/05 (technical report; arXiv:2605.03042) |
| Lifecycle stages | Multi-stage (idea discovery → implement & deploy → experiment bridge → paper writing → rebuttal), plus Writing |
| Autonomy level | Semi-autonomous (default Claude Code + GPT-5.4 pairing under a human-approved configuration; assurance checks at key milestones) |
| Domain focus | Machine learning research workflows |
| Availability | Open source (github.com/wanshuiyin/Auto-claude-code-research-in-sleep) |
Approach
ARIS treats independent assurance as a first-class workflow layer. An orchestration layer coordinates five end-to-end workflows (Idea Discovery, Implement & Deploy, Auto-Review Loop, Paper Writing, and Rebuttal), each built from reusable Markdown-defined skills. The default configuration pairs an executor model (e.g., Claude Code) with a reviewer model from a different family (e.g., GPT-5.4 xhigh). The authors argue that single-model self-review is the “stochastic bandits” case, while cross-model review is genuinely adversarial: the reviewer probes weaknesses the executor did not anticipate. An assurance stack performs integrity verification, reviewer routing, and a three-stage check that claims are supported by evidence; executors retry up to a configurable limit (default three) before harness improvements are adopted. A prototype self-improvement loop closes the cycle.
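The executor/reviewer pairing with a bounded retry limit can be illustrated with a minimal Python sketch. This is not ARIS code: the names `Review`, `run_checkpoint`, and `max_attempts`, the callable signatures, and the family labels are all hypothetical, and the sketch assumes the default limit of three counts total executor attempts. It shows only the control flow of one assurance checkpoint: reject same-family pairings, let the reviewer judge each draft, and feed the reviewer's objections back to the executor until the draft is approved or the limit is reached.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Review:
    """Verdict returned by the reviewer model (hypothetical structure)."""
    approved: bool
    feedback: str


# Hypothetical signatures: the executor maps (task, feedback) to a draft,
# and the reviewer maps a draft to a Review.
Executor = Callable[[str, str], str]
Reviewer = Callable[[str], Review]


def run_checkpoint(
    task: str,
    executor: Executor,
    reviewer: Reviewer,
    executor_family: str,
    reviewer_family: str,
    max_attempts: int = 3,  # assumption: the limit counts total attempts
) -> Tuple[Optional[str], int]:
    """Run one assurance checkpoint; return (approved draft or None, attempts used)."""
    # Cross-model review requires the two models to come from different families.
    if executor_family == reviewer_family:
        raise ValueError("executor and reviewer must come from different model families")

    feedback = ""
    for attempt in range(1, max_attempts + 1):
        draft = executor(task, feedback)
        review = reviewer(draft)
        if review.approved:
            return draft, attempt
        # Pass the reviewer's objections to the next executor attempt.
        feedback = review.feedback
    # Limit reached without approval; the caller decides how to escalate.
    return None, max_attempts


# Toy stand-ins for real model calls, used only to exercise the loop.
def toy_executor(task: str, feedback: str) -> str:
    return f"{task} [with evidence]" if feedback else task


def toy_reviewer(draft: str) -> Review:
    if "evidence" in draft:
        return Review(approved=True, feedback="")
    return Review(approved=False, feedback="claim lacks supporting evidence")


result = run_checkpoint(
    "Claim: method improves accuracy",
    toy_executor,
    toy_reviewer,
    executor_family="family-a",
    reviewer_family="family-b",
)
print(result)
```

With these toy functions the first draft is rejected and the second is approved, so the loop stops after two attempts. What happens once the limit is exhausted is left to the caller here, since the sketch does not model the harness-improvement step.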
Validation
The technical report demonstrates ARIS across all five workflows on illustrative ML research tasks. Examples include denoiser model post-training experiments with 4× A100 servers, an Auto-Review Loop that iteratively raised paper scores from 4/10 to 7.5/10 across three rounds, and a Rebuttal workflow that atomized reviewer concerns into structured responses with provenance, commitment, and lint checks.
Notable results
- Introduces cross-model executor/reviewer pairing as an explicit architectural pattern, contrasting with single-model self-critique loops used by prior AI-scientist systems.
- The Auto-Review Loop improved a paper over successive review rounds (Round 0 → Round 2: 4/10 → 7/10), with concrete fixes for overclaims, notation clashes, and missing validation.
- Five-workflow design with explicit assurance checks at key experimental and writing milestones.
Primary paper
ARIS technical report, arXiv:2605.03042.
Other references
None yet.