LabOS
An AI-XR co-scientist that couples a self-evolving multi-agent system for digital-lab reasoning with extended-reality smart glasses, a vision-language model, and robotics, letting AI perceive, reason about, and assist in the physical laboratory.
| Affiliation | Stanford University School of Medicine and Princeton AI Lab, with Oregon State, U. Washington, and NVIDIA (paper) |
| First introduced | 2025-10 (arXiv preprint; also bioRxiv 2025.10.16.679418) |
| Lifecycle stages | Multi-stage — closes the loop from hypothesis generation and experiment design through data analysis to human-in-the-loop physical execution and automated documentation |
| Autonomy level | Semi-autonomous — the digital agents reason and analyze autonomously; physical execution is human-in-the-loop via XR glasses, with an optional cobot module |
| Domain focus | Biomedical research (cancer immunology, stem-cell engineering); demonstrated in materials-science labs |
| Availability | Open source — GitHub |
Approach
LabOS has two coupled modules. The digital (dry) lab is a self-evolving multi-agent system that extends the STELLA framework: a Manager/Planner agent decomposes scientific objectives into structured steps, a Developer agent generates and runs Python for bioinformatics analyses, and a Critic agent evaluates and refines intermediate results in an iterative loop. A Tool-Creation agent autonomously identifies, tests, and integrates new analytical tools, databases, and APIs from sources like PubMed into a shared “Tool Ocean,” while a Template Library of successful reasoning workflows lets the system generalize from prior solutions — together enabling self-improvement that scales with inference-time compute.
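The plan, develop, and critique loop described above can be illustrated with a minimal sketch. This is not the LabOS or STELLA implementation: the names `ToolOcean`, `run_objective`, `toy_planner`, `toy_developer`, and `toy_critic` are hypothetical, and the agents are plain Python functions standing in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToolOcean:
    """Hypothetical shared registry of callable tools, keyed by name."""
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self.tools[name] = fn


def run_objective(
    objective: str,
    planner: Callable[[str], List[str]],
    developer: Callable[[str, str, ToolOcean], str],
    critic: Callable[[str, str], Tuple[bool, str]],
    ocean: ToolOcean,
    max_rounds: int = 3,
) -> List[str]:
    """Plan once, then iterate develop -> critique on every planned step."""
    results = []
    for step in planner(objective):
        feedback, draft = "", ""
        for _ in range(max_rounds):
            draft = developer(step, feedback, ocean)
            accepted, feedback = critic(step, draft)
            if accepted:
                break
        # Keep the last draft even if the critic never accepted it.
        results.append(draft)
    return results


# Toy stand-ins for the three agents (assumptions for illustration only).
def toy_planner(objective: str) -> List[str]:
    return [f"{objective}: load data", f"{objective}: summarize"]


def toy_developer(step: str, feedback: str, ocean: ToolOcean) -> str:
    # A real Developer agent would generate and run analysis code here.
    suffix = " (revised)" if feedback else ""
    return ocean.tools["echo"](step) + suffix


def toy_critic(step: str, draft: str) -> Tuple[bool, str]:
    # Accept only drafts that have been revised at least once.
    ok = draft.endswith("(revised)")
    return ok, ("" if ok else "please revise")


ocean = ToolOcean()
ocean.register("echo", lambda s: f"done[{s}]")
results = run_objective("screen", toy_planner, toy_developer, toy_critic, ocean)
```

In this toy run the critic rejects each first draft and accepts the revision, so every step takes two rounds. A tool-creation step would, under the same assumptions, simply call `ocean.register` with a newly validated function.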
The physical (wet) lab module connects AI reasoning to the bench through AR/XR smart glasses and multimodal sensing. Egocentric video is streamed in 5–10 s segments to a GPU server, where a lab-specialized vision-language model (LabOS-VLM, post-trained from Qwen-VL via SFT then GRPO reinforcement learning) interprets the scene, verifies actions against gold-standard protocols, and returns structured JSON feedback rendered on the glasses. LabOS also builds 3D/4D digital twins of lab workflows using MapAnything and 3D Gaussian splatting, and includes a proof-of-concept cobot module (xArm + gripper) for automating repetitive steps with human–robot handover.
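As a rough illustration of the streaming step, the sketch below cuts a recording into fixed-length windows and parses one structured feedback message. The source does not specify the feedback schema, so the `step`, `status`, and `message` fields, and the names `segment_bounds`, `StepFeedback`, and `parse_feedback`, are assumptions made for this example.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


def segment_bounds(duration_s: float, segment_s: float = 10.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive windows of at most segment_s seconds."""
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


@dataclass
class StepFeedback:
    """Hypothetical shape of one feedback message shown on the glasses."""
    step: int
    status: str  # assumed vocabulary: "ok" or "deviation"
    message: str


def parse_feedback(raw: str) -> StepFeedback:
    """Parse and validate a JSON feedback string under the assumed schema."""
    data = json.loads(raw)
    status = data["status"]
    if status not in {"ok", "deviation"}:
        raise ValueError(f"unexpected status: {status!r}")
    return StepFeedback(
        step=int(data["step"]),
        status=status,
        message=str(data.get("message", "")),
    )


windows = segment_bounds(25.0, segment_s=10.0)
feedback = parse_feedback(
    '{"step": 3, "status": "deviation", "message": "Incubation time too short"}'
)
```

Validating the reply before rendering it is one way to keep a malformed model output from reaching the wearer; how LabOS handles this is not described in the source.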
Validation
On biomedical reasoning benchmarks, the digital agent reports approximately 32% on Humanity’s Last Exam: Biomedicine, 61% on LAB-Bench: DBQA, and 65% on LAB-Bench: LitQA, outperforming the next-best models by up to 8%, with accuracy improving under test-time scaling. The authors also introduce LabSuperVision (LSV), an expert-annotated benchmark of >200 egocentric lab-video sessions (recorded by 7 researchers across bench, tissue-culture, and instrument settings) for evaluating lab perception and reasoning. Leading commercial VLMs scored poorly on it (Gemini 2.5 Pro reached only 2.86/5 on protocol alignment), which motivated LabOS-VLM. Its 235B variant exceeds 90% error-detection accuracy on held-out data and outperforms Claude Opus-4.1, GPT-5, and Gemini 2.5 Pro on the evaluated metrics.
For wet-lab validation, LabOS generated hypotheses and analyzed functional-screening data to nominate CEACAM6 as a natural-killer-cell cancer-immunotherapy target, which was confirmed in a physical NK-tumor killing assay. A second study identified ITSN1 as a regulator of cell fusion, and a third had researchers wear smart glasses during stem-cell engineering, where LabOS provided step-level guidance and flagged operational deviations (e.g., sterile-technique breaches, incorrect incubation times).
Notable results
- Agent-nominated CEACAM6 validated as an NK-cell anti-tumor target in a physical killing assay; ITSN1 identified as a cell-fusion regulator.
- LabOS-VLM-235B exceeds 90% error-detection accuracy on held-out lab video, outperforming Claude Opus-4.1, GPT-5, and Gemini 2.5 Pro.
- Benchmark scores: ~32% on HLE: Biomedicine, 61% on LAB-Bench: DBQA, and 65% on LAB-Bench: LitQA; the authors also introduce the LabSuperVision (LSV) lab-video benchmark.