Deep Researcher Agent
Open-source framework that runs LLM agents 24/7 through complete deep-learning experiment cycles — hypothesis, code, GPU training, monitoring, and reflection — at near-zero LLM cost during the training phase.
| Affiliation | University of Tokyo (paper) |
| First introduced | 2026-04 (arXiv:2604.05854, dated 2026-04-07) |
| Lifecycle stages | Multi-stage — Think (hypothesis + experiment design) → Execute (code + GPU training) → Reflect (log parsing + next-action selection) |
| Autonomy level | Fully autonomous — runs continuously in persistent tmux sessions with three human-override mechanisms (a HUMAN_DIRECTIVE.md file consumed each cycle, a CLI flag, and direct edits to MEMORY_LOG.md) |
| Domain focus | Deep-learning experimentation (GPU training, hyperparameter and architecture iteration) |
| Availability | Open source (GitHub repository) |
Approach
Deep Researcher Agent runs a continuous Think → Execute → Reflect main loop centered on three innovations.
- Zero-Cost Monitoring. During GPU training — typically 90–99% of wall-clock time — the framework makes zero LLM calls. Instead, an OS-level monitor performs
kill -0 $PIDprocess-liveness checks,nvidia-smiGPU-utilization checks, and a 50-line log tail at configurable intervals (default 15 minutes). The LLM is reinvoked only when the training process terminates, at which point accumulated log tail feeds the Reflect phase. Reported LLM cost: $0.08 per 24-hour cycle vs. ~$0.50 for a naive 5-minute-poll baseline. - Two-Tier Constant-Size Memory. A frozen human-authored Project Brief (≤ 3,000 chars) is paired with an agent-maintained Memory Log split into Key Results (auto-compacted at 1,200 chars by dropping the oldest entry) and Recent Decisions (most-recent 15 entries retained). Total memory is provably bounded at ~5,000 characters regardless of run duration.
- Minimal-Toolset Leader-Worker Architecture. A Leader agent (3 tools:
log_memory,write_file,read_file) coordinates three single-active Worker agents: Idea (literature + hypothesis, 4 tools), Code (implementation + experiment launch, 5 tools), and Writing (report generation, 3 tools). Cycle conversation history is reset between cycles. Reported ~73% per-call token-overhead reduction vs. full-toolset frameworks.
Safety mechanisms: mandatory dry-run (two forward-backward steps) before any real training launch; protected state files (state.json, MEMORY_LOG.md, PROJECT_BRIEF.md) cannot be overwritten by workers; anti-burn exponential cooldown on consecutive no-output cycles.
Validation
Real-world deployment across 4 concurrent deep-learning research projects on 4 GPU servers (NVIDIA L20X 144 GB), each project running an independent agent instance in a persistent tmux session. Evaluation focuses on operational metrics — autonomous cycle count, continuous operation duration, monitoring cost, dry-run failure rate, post-dry-run crash rate — rather than benchmark scores on fixed tasks, because the agent ran open-ended ML research projects of its own choosing.
Notable results
- 500+ autonomous experiment cycles, 30+ days of continuous operation, 4 concurrent projects managed.
- 52% improvement over baseline metrics in the best-performing project, from 200+ automated experiments.
- Average LLM cost of $0.08 per 24-hour cycle — a reported 10–20× reduction vs. conventional LLM-polling monitoring.
Primary paper
Other references
None yet.