AgentPLM
Agentic protein language model that interleaves autoregressive sequence generation with tool calls (ESMFold, FoldX, AutoDock Vina) under Reasoning-Augmented Decoding, trained end-to-end via Contrastive Agent Policy Optimisation to learn when oracle feedback is informative.
| Affiliation | Bedford College, London (Sahil Rahman) and Saarland University (Maxx Richard Rahman) |
| First introduced | 2026-06 (arXiv preprint); accepted to ICML 2026 |
| Lifecycle stages | Experiment design (the agent decides which biophysical oracle to query during in-silico protein-sequence design) |
| Autonomy level | Semi-autonomous (policy-driven tool selection during decoding; no wet-lab loop) |
| Domain focus | Biology / computational protein engineering |
| Availability | Closed (no code release noted in the preprint) |
Approach
Each design step is modelled as a decision in a partially observable Markov decision process (POMDP) over the joint space of partial sequences and retrieved biophysical context. The paper makes two contributions:
- Reasoning-Augmented Decoding (RAD) — interleaves autoregressive PLM generation with structured tool calls to ESMFold, FoldX, and AutoDock Vina, incorporating their outputs via a learned Tool Context Encoder (TCE) and Trajectory Memory Buffer (TMB) trained end-to-end on protein-engineering objectives.
- Contrastive Agent Policy Optimisation (CAPO) — a trajectory-level extension of direct preference optimisation that contrasts high-fitness trajectories with coherent oracle use against low-fitness or contradictory ones, teaching the model when oracle feedback is informative rather than merely imitating high-fitness sequences.
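The entry does not reproduce the CAPO objective, so the following is only a minimal sketch of what a trajectory-level preference loss could look like, assuming CAPO keeps the standard DPO form and scores a trajectory by summing the log-probabilities of all its steps (residue tokens and tool-call actions alike). The function names, the `beta` value, and the use of a frozen reference policy are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def trajectory_logprob(step_logprobs: Sequence[float]) -> float:
    """Log-probability of a whole trajectory: the sum over its steps.

    Assumption: tool-call decisions and generated residues are both
    scored as steps of the same policy.
    """
    return sum(step_logprobs)


def preference_loss(
    policy_pos: Sequence[float],
    policy_neg: Sequence[float],
    ref_pos: Sequence[float],
    ref_neg: Sequence[float],
    beta: float = 0.1,  # hypothetical temperature, not from the paper
) -> float:
    """DPO-style loss for one (preferred, dispreferred) trajectory pair.

    policy_* are per-step log-probs under the trained policy, ref_* under
    a frozen reference policy. The preferred trajectory would be a
    high-fitness one with coherent oracle use; the dispreferred one a
    low-fitness or contradictory trajectory.
    """
    pos_ratio = trajectory_logprob(policy_pos) - trajectory_logprob(ref_pos)
    neg_ratio = trajectory_logprob(policy_neg) - trajectory_logprob(ref_neg)
    margin = beta * (pos_ratio - neg_ratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)
```

When the policy and the reference agree on both trajectories the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the preferred trajectory relative to the reference.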
AgentPLM is initialised from the public ESM-2 650M checkpoint and trained in two phases: first the TCE and TMB alone, then joint optimisation with layer-wise decay. It differs from the earlier ProtAgent, which freezes a GPT-4 backbone as a planner, in that AgentPLM trains the agent policy itself.
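The two-phase schedule can be illustrated with a small sketch. It assumes that "layer-wise decay" means layer-wise learning-rate decay (lower layers receive geometrically smaller learning rates) and that the backbone is frozen during the first phase. The base learning rate, the decay factor, and the parameter-group names are hypothetical; the only fixed fact is that the ESM-2 650M checkpoint has 33 transformer layers.

```python
from typing import Dict

NUM_LAYERS = 33  # transformer layers in the ESM-2 650M checkpoint


def layerwise_learning_rates(
    base_lr: float, decay: float, num_layers: int = NUM_LAYERS
) -> Dict[str, float]:
    """Give each backbone layer a rate that shrinks with distance from the top."""
    # The newly added modules (TCE and TMB) train at the full base rate.
    rates = {"new_modules": base_lr}
    for layer in range(num_layers):
        depth_from_top = num_layers - 1 - layer
        rates[f"backbone.layer_{layer}"] = base_lr * decay ** depth_from_top
    return rates


def phase_learning_rates(
    phase: int,
    base_lr: float = 1e-4,  # hypothetical value
    decay: float = 0.9,     # hypothetical value
) -> Dict[str, float]:
    """Per-group learning rates for phase 1 (TCE/TMB only) or phase 2 (joint)."""
    rates = layerwise_learning_rates(base_lr, decay)
    if phase == 1:
        # Assumption: the backbone is frozen, expressed here as a zero rate.
        return {
            name: (lr if name == "new_modules" else 0.0)
            for name, lr in rates.items()
        }
    if phase == 2:
        return rates
    raise ValueError("phase must be 1 or 2")
```

In a real training loop these values would be passed to the optimiser as per-parameter-group learning rates; the dictionary form here only makes the schedule easy to inspect.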
Validation
The model is evaluated on five benchmark tasks (de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero-shot fitness prediction), with standardised oracle APIs and controlled sequence-identity splits. The authors claim mechanistic evidence of online error correction without explicit backtracking.
Notable results
- 2.79× improvement in antibody top-10% hit rate over the strongest passive baseline.
- +34% normalised k_cat/K_M on enzyme design.
- Outperforms all baselines across the five benchmark tasks; authors attribute the gain to qualitatively different reasoning trajectories rather than additional compute.
Primary paper
Other references
None.
Code
Not released.