Curator state

Recently surfaced

  • CALMS (added 2026-06-20) — Argonne National Laboratory (Center for Nanoscale Materials + Advanced Photon Source; Vriza, Prince, Zhou, Chan, Cherukara) AutoGen-based multi-agent framework operating two real user facilities: a Hard X-ray Nanoprobe (HXN) beamline and an N9 robotic thin-film station. Specialized agents (code writer, code critic, administrator, paper scraper, image explainer, teachability) orchestrate multistep workflows, interpret multimodal nano-diffraction/nano-fluorescence images, and learn on the job by storing human guidance as input–output pairs in a ChromaDB vector store (with a similarity threshold to avoid redundant memories). Live demos: natural-language → correct 2D-scan commands and cross-modality scan-region selection at HXN (only o3 was reliable at multimodal positional reasoning); end-to-end PEDOT:PSS thin-film fabrication after the literature scraper extracted coating parameters from a PDF, with teachability markedly improving long-horizon sequential success. Open source (arXiv:2509.00098; npj Comput. Mater. 12, 160 (2026); github.com/AdvancedPhotonSource/CALMS).
  • ATLAS (added 2026-06-11) — Google DeepMind (Éltető, Daw, Stachenfeld, Miller; with Princeton/Columbia/UCL) “Active Theory Learning for Automated Science,” an active-learning framework rather than an LLM-orchestration system, closing the hypothesis-generation ↔ experiment-design loop to discover interpretable mechanistic models of behavior. It loops over a Hypothesis Generator (ensemble of sparse Disentangled RNNs whose latent-variable interactions form candidate computational graphs), an Experiment Optimizer (hill-climbs binary reward-matrix designs to maximize ensemble disagreement / expected information gain, BALD-style), and an Experiment Runner. In-silico validation (recovering Q-learning and Leaky Actor-Critic agents from bandit behavior): 5–10× sample-efficiency gain over random experimentation, 8/8 computational-graph recovery in 100 experiments (baselines needed ~1,000), and performance matching or surpassing expert-designed experiments. Adds a non-LLM cognitive-science exemplar. No code released (arXiv:2606.12386).
  • DarkAgents (added 2026-06-10) — Università di Bologna / INFN (Lucente, Pascoli, Sala, Zandi) language-driven multi-agent system for theoretical astroparticle physics, presented as the first end-to-end architecture targeting that domain. An orchestrator interprets a particle-physics model or a looser “idea,” selects a pipeline branch, writes an execution plan, and dispatches specialized sub-agents, each of which emits a Markdown report plus a fixed-schema JSON handoff that the orchestrator checks; the system pauses for human audit by default and can optionally run fully autonomously. All physical quantities come from deterministic, human-validated code; the system is LLM-agnostic (Mistral, Anthropic/Claude Code, OpenAI/Codex, local Ollama). The first implementation, DarkAgent-PT, takes a classically scale-invariant model through to a PTArcade MCMC fit of the NANOGrav nanohertz GW background and includes a constraint sub-agent and an assumption/prior-auditing sub-agent. It reproduced human-derived Bayesian posteriors across providers, identified inconsistencies in some published fits, produced novel fits on the dissipative bulk-flow GW template, and correctly rejected the sound-wave template where it was invalid (observed failure mode: hallucinated references in the final report). Open source (arXiv:2606.11157; github.com/PhysicsZandi/DarkAgents).
  • LabOS (added 2026-06-09) — Stanford / Princeton (with Oregon State, U. Washington, NVIDIA) AI-XR co-scientist that pairs a self-evolving multi-agent digital-lab system (extending STELLA: Manager/Developer/Critic agents plus a Tool-Creation agent feeding a shared “Tool Ocean”) with AR/XR smart glasses, a lab-specialized VLM (LabOS-VLM, Qwen-VL post-trained via SFT+GRPO), 3D/4D Gaussian-splat digital twins, and a cobot module. Closes the loop from hypothesis/analysis to human-in-the-loop physical execution. Benchmarks: ~32% on HLE: Biomedicine, 61% on LAB-Bench: DBQA, 65% on LAB-Bench: LitQA; new LabSuperVision (LSV) lab-video benchmark, on which LabOS-VLM-235B reaches >90% error-detection accuracy, beating Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro. Wet-lab: agent-nominated CEACAM6 confirmed as an NK-cell anti-tumor target in a physical killing assay; ITSN1 identified as a cell-fusion regulator. Open source (arXiv:2510.14861; github.com/zaixizhang/LabOS).
  • Ax-Prover (added 2026-06-08) — Axiomatic AI (with ICFO, MIT, ICREA) multi-agent Lean theorem-proving framework that equips general-purpose LLMs (Claude Sonnet 4/4.5) with Lean tools via MCP. Orchestrator/Prover/Verifier loop sketches, formalizes, and machine-verifies proofs across mathematics and quantum physics, running autonomously or collaboratively with experts. Top open-source model and third overall on PutnamBench (14%); 96% on the authors’ new QuantumTheorems benchmark and 64% on AbstractAlgebra, outperforming specialized provers (DeepSeek-Prover, Kimina). Two cryptography case studies with domain experts. Open source (arXiv:2510.12787).

Flagged for review

None.

Deferred — next-run priority

  • BioMedAgent / CAS (Bu et al., Nat. Biomed. Eng., doi:10.1038/s41551-026-01634-6, PMID 41912700, 2026) — surfaced by Phase A as a named, benchmark-validated autonomous biomedical-analysis agent: a self-evolving multi-agent LLM framework (CAS) that learns to chain bioinformatics tools into executable analysis workflows via interactive exploration and memory retrieval (BioMed-AQA, 327 tasks, ~77% success; generalizes to BixBench). In-scope as an analysis-stage system, but Phase A could not archive any openly downloadable PDF (closed access, no arXiv/bioRxiv preprint located). Deferred because Phase B cannot fetch the source to ground the page; promote once a citable open source or archived PDF is available.
  • EurekAgent (arXiv:2606.13662, Jun 2026; Tsinghua + Zhipu AI) — PDF archived this run (sources/2606.13662.pdf/.txt) and logged in the manifest. Metric-driven autonomous-discovery agent built around “environment engineering” (permissions/artifact/budget/human-in-the-loop); open-sourced at github.com/THU-Team-Eureka/EurekAgent. Reports new SOTA results on 26-circle packing (2.635999), Erdős minimum overlap, an autocorrelation inequality, a TriMul kernel, and an MLE-Bench subset (85.71%), at ~$11 in API cost. Scope-edge: an optimization/discovery substrate evaluated on math/kernel/ML benchmarks, not natural-science hypothesis generation, experiment design, or scientific data analysis; same zone as CORAL and overlapping already-catalogued ML/math discovery systems. Promote only if a more natural-science-leaning evaluation or a clearer hypothesis/experiment-design loop emerges.
  • Numina-Lean-Agent (arXiv:2601.14027, Jan 2026; Project Numina et al.) — PDF archived this run (sources/2601.14027.pdf/.txt) and logged in the manifest. Agentic Lean theorem-proving framework (Claude Code + Numina-Lean-MCP on Claude Opus 4.5) that solves given theorems (Putnam 2025 12/12; formalizes Brascamp–Lieb); released at github.com/project-numina/numina-lean-agent. Scope-edge case: a purer prover of stated theorems with no hypothesis generation, experiment design, or scientific data analysis — closer to a prover tool than an autonomous scientist. Add only if a stronger “does-science” case emerges.
  • Re-verification backlog (link/repo checks) — 19 bootstrap entries crossed the 30-day window on 2026-06-20 (last_verified 2026-05-20/21): ai-scientist-sakana, robin, co-scientist-google, coscientist-cmu, novelseek, crispr-gpt, openscientist, talk2qsp, chemcrow, aila, qiushi-discovery-engine, mars, biomni, agenticsciml, ai-cfd-scientist, graft-athena, kosmos, dr-sai, qumus. Phase B has no web/MCP tools, so primary-paper-link and code-repo liveness could not be confirmed this phase; last_verified was intentionally NOT bumped. Next Phase A should fetch these links and Phase B can then bump the dates.
  • CORAL (arXiv:2604.01658, Apr 2026) — Multi-agent evolutionary discovery framework from MIT/NUS/Singapore-MIT. PDF archived at sources/2604.01658v1.pdf; evaluated on Anthropic’s kernel-engineering task and polyomino packing, which are not strictly natural-science hypothesis generation. Add when a more science-leaning evaluation surfaces.
  • AIDO.Harness (bioRxiv 2026.04.20.719735) — Autonomous ML-model construction for biomedical tasks, framed as a POMDP. Not downloaded; revisit next pass.
  • Virtual Lab (Stanford / CZ Biohub, Nature 2025) — referenced in the Kosmos and AgenticSciML papers; PDF download blocked by Cloudflare on a prior run.
  • ScienceClaw × Infinite (arXiv:2603.14312, Mar 2026) — Agentic execution substrate that the new CategoryScienceClaw paper cites as its foundation. Currently captured as a reference inside the CategoryScienceClaw entry; consider promoting it to its own page once a more complete characterization is available.