AutoTTS

Environment-driven agentic framework that autonomously discovers test-time-scaling (TTS) controllers for LLM reasoning by searching over branch / continue / probe / prune / stop policies on pre-collected reasoning trajectories.

   
Affiliation: University of Maryland, UVA, WUSTL, UNC, Google, Meta (per the paper)
First introduced: 2026-05 (arXiv:2605.08083; v2 2026-05-12)
Lifecycle stages: Hypothesis (propose controller candidates); Experiment design (evaluate in offline replay environment); Analysis (diagnose failure modes from execution traces)
Autonomy level: Semi-autonomous. A human constructs the discovery environment (states, actions, feedback, objectives); the explorer LLM then searches autonomously.
Domain focus: LLM test-time scaling for mathematical reasoning
Availability: Open source (github.com/zhengkid/AutoTTS)

Approach

AutoTTS reframes test-time scaling from hand-crafted heuristics to automatic controller discovery. The width–depth TTS problem is formalized as controller synthesis over an offline replay environment: for each question, reasoning trajectories and intermediate probe signals are pre-collected so that candidate controllers — programs that decide when to BRANCH, CONTINUE, PROBE, PRUNE, or STOP — can be evaluated cheaply without repeated LLM calls.
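To make the replay idea concrete, here is a minimal sketch of an offline environment and one hand-written controller. Everything in it is an assumption made for illustration: the `ReplayEnv` class, the nested-list trajectory layout, the method names, the step-count cost proxy, and the toy data are hypothetical and are not taken from the AutoTTS repository. Probes are treated as free here, which the real environment may not do.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReplayEnv:
    """Offline replay of pre-collected trajectories for one question.

    trajectories[i][t] is the intermediate answer a probe of trajectory i
    reported after t + 1 reasoning steps (hypothetical layout).
    """
    trajectories: list
    depth: dict = field(default_factory=dict)  # live branch -> steps replayed
    opened: int = 0  # trajectories handed out so far
    cost: int = 0    # replayed steps, a stand-in for token cost

    def branch(self) -> Optional[int]:
        """BRANCH: open the next unused trajectory, if any remain."""
        if self.opened == len(self.trajectories):
            return None
        self.depth[self.opened] = 0
        self.opened += 1
        return self.opened - 1

    def cont(self, b: int) -> bool:
        """CONTINUE: replay one more step of branch b; False once exhausted."""
        if self.depth[b] >= len(self.trajectories[b]):
            return False
        self.depth[b] += 1
        self.cost += 1
        return True

    def probe(self, b: int) -> Optional[str]:
        """PROBE: read the recorded intermediate answer at the current depth."""
        d = self.depth[b]
        return self.trajectories[b][d - 1] if d else None

    def prune(self, b: int) -> None:
        """PRUNE: drop branch b from the live set."""
        self.depth.pop(b, None)


def agreement_controller(env, width, max_depth, agree):
    """Open `width` branches, advance them in lockstep, STOP once `agree` probes match."""
    for _ in range(width):
        env.branch()
    answer = None
    for _ in range(max_depth):
        for b in list(env.depth):
            env.cont(b)
        probes = [env.probe(b) for b in env.depth]
        votes = Counter(p for p in probes if p is not None)
        if not votes:
            continue
        answer, count = votes.most_common(1)[0]
        if count >= agree:
            break  # STOP: enough live branches agree
    return answer


# Toy data: three pre-collected trajectories with four probe points each.
trajectories = [
    ["7", "12", "12", "12"],
    ["3", "12", "12", "12"],
    ["5", "5", "9", "12"],
]
env = ReplayEnv(trajectories)
answer = agreement_controller(env, width=3, max_depth=4, agree=2)
print(answer, env.cost)  # prints: 12 6
```

On this toy data the controller stops after two rounds and spends 6 replayed steps, where replaying all three trajectories to the end would spend 12. No model is called at any point, which is what makes evaluating many candidate controllers cheap.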

The discovery loop has four stages. First, humans define the environment (states, actions, feedback, objectives). Second, an explorer LLM proposes candidate controllers. Third, the controllers are evaluated in the replay environment under an accuracy–cost objective. Fourth, the explorer receives two kinds of feedback, scaling curves and fine-grained execution traces that expose how each controller allocates computation, and uses the accumulated history to refine later proposals. Two design choices keep the search tractable. Beta parameterization exposes a single scalar trade-off parameter β per controller and derives internal hyperparameters deterministically from it, which reduces overfitting to the search set. Execution-trace feedback lets the explorer diagnose why a controller fails rather than only that it fails.
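The beta parameterization can be illustrated with a small sketch. The mapping below is invented for illustration: this entry does not state how AutoTTS derives hyperparameters from β, so the `Hyperparams` fields, the linear formulas, and the `max_width` / `max_steps` caps are all assumptions. The point it shows is only the shape of the idea: one scalar goes in, every internal knob comes out deterministically, so the search cannot tune the knobs independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparams:
    width: int         # number of parallel branches to open
    max_depth: int     # maximum reasoning steps per branch
    agree_frac: float  # fraction of live branches that must agree to stop


def derive(beta: float, max_width: int = 64, max_steps: int = 16) -> Hyperparams:
    """Map one trade-off scalar in [0, 1] to all internal hyperparameters.

    Small beta favors low cost, large beta favors accuracy (assumed convention).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return Hyperparams(
        width=max(1, round(beta * max_width)),
        max_depth=max(1, round(beta * max_steps)),
        agree_frac=0.5 + 0.5 * beta,
    )


# Sweeping beta traces one controller family from cheap to expensive settings;
# evaluating each setting in the replay environment would give a scaling curve.
for beta in (0.0, 0.5, 1.0):
    print(beta, derive(beta))
```

Because `derive` is a pure function of β, two runs with the same β always yield the same controller configuration, and a candidate controller is compared against others along a single axis instead of a high-dimensional hyperparameter grid.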

Validation

Benchmarked on mathematical-reasoning datasets (including AIME25 with Qwen-1.7B as the base model). Discovered controllers are compared against strong manually designed TTS baselines — SC@64, ASC, ESC, ANSWER CONSISTENCY, ST-BON, PARALLEL PROBE — interpreted as hand-designed points within the width–depth space. The authors also test generalization to held-out benchmarks and different model scales.
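As an illustration of what a hand-designed point in the width–depth space looks like, the sketch below implements plain self-consistency (SC@k): sample k trajectories, run each to completion, and take a majority vote over the final answers. The function names, the step-count cost model, and the toy answers are assumptions for illustration; the adaptive baselines listed above are not reproduced here.

```python
from collections import Counter


def self_consistency(final_answers, k):
    """SC@k: majority vote over the final answers of the first k trajectories."""
    sample = final_answers[:k]
    if not sample:
        raise ValueError("need at least one answer")
    return Counter(sample).most_common(1)[0][0]


def sc_cost(k, steps_per_trajectory):
    """Fixed width k and full depth: every trajectory is run to the end."""
    return k * steps_per_trajectory


# Toy final answers from four fully completed trajectories.
answers = ["5", "12", "12", "9"]
print(self_consistency(answers, k=1), sc_cost(1, 4))  # prints: 5 4
print(self_consistency(answers, k=3), sc_cost(3, 4))  # prints: 12 12
```

SC@k fixes both axes in advance: width is always k and depth is always the full trajectory. A discovered controller, by contrast, can decide per question how many branches to open and how far to run each one.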

Notable results

  • Discovered controllers improve the accuracy–cost Pareto frontier over hand-crafted baselines on math reasoning.
  • Discovered strategies generalize to held-out benchmarks and to different model scales without re-discovery.
  • Total discovery cost reported as $39.9 and 160 minutes for the full one-time search.
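The accuracy–cost Pareto frontier mentioned in the first bullet can be computed as follows. This is a generic sketch, not code from the paper: the function name is hypothetical and the (cost, accuracy) pairs are made-up numbers used only to exercise the function, not reported results.

```python
def pareto_frontier(points):
    """Return the non-dominated (cost, accuracy) points, sorted by cost.

    A point is dominated if another point has cost no higher and accuracy
    no lower, with at least one of the two strictly better.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by cost ascending; at equal cost, put the most accurate point first.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier


# Made-up evaluation results: (cost, accuracy) per controller setting.
results = [(10, 0.50), (20, 0.55), (20, 0.60), (30, 0.58), (40, 0.70)]
print(pareto_frontier(results))
# prints: [(10, 0.5), (20, 0.6), (40, 0.7)]
```

A controller improves the frontier when adding its points removes some existing points from this list or extends the list with settings that no baseline dominates.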

Primary paper

Zheng et al., “LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling,” arXiv:2605.08083.

Other references

None yet.

Code

Repository: github.com/zhengkid/AutoTTS