Benchmark an ADMET property with PyTDC

For a single ADMET endpoint (e.g., Caco-2 permeability, hERG inhibition, CYP3A4 inhibition, AMES mutagenicity, microsomal clearance), load the canonical Therapeutics Data Commons dataset with the leaderboard split, fit a baseline, and emit the official TDC metric — so a new model has a comparable number to report.


Problem class	Data analysis
Subject areas	Drug Repurposing and Discovery, Chemistry
Evidence level	Reported
Complexity	One skill or MCP
Availability	Fully open
Compute	Laptop

Problem

ADMET prediction is the front-line filter for any virtual library, but papers often report on different splits of different datasets with different metrics, so head-to-head comparison is impossible without re-running a canonical benchmark. Therapeutics Data Commons (TDC) fixed this with 22 ADMET datasets, frozen scaffold splits, and per-task metric definitions. The hard part is no longer the benchmark but the boilerplate: which dataset object, which split mode, which metric, which leaderboard group. A working medicinal chemist or ML researcher should be able to say “give me the Caco-2 benchmark, scaffold split, with a Morgan-fingerprint + random-forest baseline, report the official metric” and get a number that’s directly comparable to the published leaderboard. Solved looks like: one prompt, one number, in the same row format as the TDC leaderboard, in under ten minutes of wall-clock time.

Recommended approach

Install the PyTDC Claude skill (catalog page).
```
/plugin marketplace add K-Dense-AI/claude-scientific-skills
/plugin install pytdc@claude-scientific-skills
pip install pytdc
```
For a fingerprint+RF baseline, also install RDKit and scikit-learn locally (pip install rdkit scikit-learn). For a featurizer-rich baseline, add the molfeat or datamol skills.
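
Before running the prompt, it can help to confirm that the three packages are importable in the environment Claude will use. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the mapping assumes the usual import names (`tdc`, `rdkit`, `sklearn`) for the pip packages named above.

```python
import importlib.util

# Import name -> pip package name, as used in the install step above.
REQUIRED = {"tdc": "pytdc", "rdkit": "rdkit", "sklearn": "scikit-learn"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]
```

An empty list from `missing_packages()` means the baseline's dependencies resolve; anything else is the list to pass to `pip install`.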

Drive the benchmark with a single prompt. A minimal version:

Using the pytdc skill, run the official TDC ADMET_Group
benchmark for "caco2_wang":
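  1. Load the benchmark group via
     `admet_group(path='./tdc_data')` from
     `tdc.benchmark_group`.
  2. Use the official benchmark seeds (1, 2, 3, 4, 5).
  3. For each seed, get the fixed test set via
     group.get(...) and the train/valid split via
     group.get_train_valid_split(...).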
  4. Featurize SMILES with Morgan ECFP4 (2048 bits, radius 2).
  5. Fit a RandomForestRegressor (n_estimators=500,
     max_features='sqrt') on train.
  6. Predict on test and record MAE (the official metric for
     Caco-2 per the TDC leaderboard).
  7. Report mean +/- std across the 5 seeds in the standard
     leaderboard row format: dataset, metric, mean, std.
Save the per-seed predictions to results/caco2_wang.csv.
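
For reference, the code such a prompt tends to produce looks roughly like the sketch below. It is not the skill's own script. It assumes PyTDC's `admet_group` class in `tdc.benchmark_group` (with `get`, `get_train_valid_split`, and `evaluate_many`), a recent RDKit that provides `rdFingerprintGenerator`, and scikit-learn. The helper names `featurize`, `run_caco2_baseline`, and `leaderboard_row` are hypothetical. Heavy imports sit inside the functions so the file loads even when a dependency is missing.

```python
SEEDS = [1, 2, 3, 4, 5]  # seeds named in the prompt above


def featurize(smiles, radius=2, n_bits=2048):
    """Morgan bit vectors (ECFP4-style) as a 2-D array. Needs RDKit."""
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    generator = rdFingerprintGenerator.GetMorganGenerator(
        radius=radius, fpSize=n_bits
    )
    rows = []
    for smi in smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"RDKit could not parse SMILES: {smi!r}")
        rows.append(generator.GetFingerprintAsNumPy(mol))
    return np.vstack(rows)


def run_caco2_baseline(data_dir="./tdc_data"):
    """Fit one random forest per seed; return TDC's {name: [mean, std]}."""
    from sklearn.ensemble import RandomForestRegressor
    from tdc.benchmark_group import admet_group

    group = admet_group(path=data_dir)  # downloads the data on first use
    predictions_list = []
    for seed in SEEDS:
        benchmark = group.get("Caco2_Wang")
        name = benchmark["name"]
        test = benchmark["test"]  # the test set is the same for every seed
        # Only the train/valid partition depends on the seed. The random
        # forest does no early stopping, so `valid` goes unused here.
        train, valid = group.get_train_valid_split(
            benchmark=name, split_type="default", seed=seed
        )
        model = RandomForestRegressor(
            n_estimators=500, max_features="sqrt", random_state=seed, n_jobs=-1
        )
        model.fit(featurize(train["Drug"]), train["Y"])
        predictions_list.append({name: model.predict(featurize(test["Drug"]))})
    # TDC applies the benchmark's own metric (MAE for Caco-2).
    return group.evaluate_many(predictions_list)


def leaderboard_row(dataset, metric, mean, std):
    """Format one result as 'dataset,metric,mean,std' with 3 decimals."""
    return f"{dataset},{metric},{mean:.3f},{std:.3f}"
```

Calling `run_caco2_baseline()` returns a dictionary keyed by benchmark name whose value is the mean and standard deviation over the five seeds; those two numbers go straight into `leaderboard_row`.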

Sanity-check against the leaderboard. The expected metric for a Morgan+RF baseline on Caco-2 lives on the TDC ADMET_Group leaderboard. If your number is dramatically better or worse, you’re probably using the wrong metric (for example, a classification metric on a regression task, or reading a lower-is-better score as higher-is-better) or the wrong split. The skill’s load_and_split_data.py reference covers the canonical splits.
Swap in your method. Once the baseline reproduces a leaderboard-comparable number, replace the fingerprint+RF block with the model you want to benchmark and keep everything else (dataset name, seeds, split, metric) identical. That preserves apples-to-apples comparability with everything else on the leaderboard.
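
The sanity check above can be made mechanical. The sketch below is one possible way to do it, not part of PyTDC or the skill: it records which direction is better for the four metrics used on the ADMET group leaderboard (MAE, Spearman, AUROC, AUPRC) and flags a score that deviates from a reference value by more than a relative tolerance. The function name, the 25% default tolerance, and the verdict strings are all assumptions; the reference value is whatever you copy from the leaderboard yourself.

```python
# Whether a larger score is better, per metric on the ADMET leaderboard.
HIGHER_IS_BETTER = {
    "MAE": False,
    "Spearman": True,
    "AUROC": True,
    "AUPRC": True,
}


def check_against_reference(metric, observed, reference, rel_tol=0.25):
    """Compare a score with a leaderboard value and return a verdict.

    Returns "comparable" when the relative difference is within rel_tol,
    otherwise "suspiciously better" or "suspiciously worse", taking the
    metric's direction into account.
    """
    if metric not in HIGHER_IS_BETTER:
        raise ValueError(f"Unknown metric: {metric!r}")
    if reference == 0:
        raise ValueError("Reference value must be non-zero")
    rel_diff = (observed - reference) / abs(reference)
    if abs(rel_diff) <= rel_tol:
        return "comparable"
    improved = (rel_diff > 0) == HIGHER_IS_BETTER[metric]
    return "suspiciously better" if improved else "suspiciously worse"
```

A "suspiciously better" verdict is the one worth chasing first, since it usually points at a leaked test set or a non-canonical split rather than a genuinely stronger baseline.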

Why this assembly

Rung 2 of the simplicity ladder. PyTDC itself owns the dataset loaders, splits, and metric definitions — the whole benchmark fits inside one Claude skill. Dropping to rung 1 (Claude Code alone) loses access to TDC’s frozen splits, and any “I’ll just download the CSV” path lands at a non-comparable result. Escalating to rung 3 (PyTDC + datamol + molfeat) is a fine choice when the question is “which featurizer gives the best score” — but for the basic “give me one comparable benchmark number” task, the skill alone is sufficient. Rung 4 (a full autonomous system like Biomni) is overkill for a single benchmark run.

Availability

Fully open. PyTDC is MIT-licensed; the skill is a community skill from an MIT-licensed collection; the TDC datasets are CC0 or per-dataset-licensed but free for academic use. Some datasets auto-download on first use (a few GB total across the full suite); ADMET_Group fits in a few hundred MB. No subscription.

Compute requirements

Laptop. A Morgan+RF baseline on any single ADMET_Group dataset runs in a few minutes on a modern laptop CPU. The full ADMET_Group suite (22 tasks × 5 seeds) runs in a few hours of CPU time. GPU is only needed for neural baselines (graph nets, message-passing models, foundation models like Tx-LLM); for those, an 8–16 GB consumer GPU is typically sufficient for the ADMET datasets, which are small (at most on the order of ten thousand molecules per task).
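
As a back-of-the-envelope check on the full-suite estimate, the arithmetic is 22 tasks × 5 seeds = 110 model fits. The sketch below turns that into CPU hours; the default of 1.5 minutes per fit is an assumed placeholder, not a measurement, so substitute a timing from your own machine.

```python
def total_cpu_hours(n_tasks=22, n_seeds=5, minutes_per_fit=1.5):
    """CPU hours for the whole suite, given an assumed time per fit."""
    n_fits = n_tasks * n_seeds  # 110 fits for the full ADMET group
    return n_fits * minutes_per_fit / 60
```

With the placeholder timing this gives 2.75 hours, consistent with the "few hours" figure above.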

Evidence

Reported. The canonical benchmark itself is established in Huang et al., NeurIPS Datasets and Benchmarks 2021 (arXiv:2102.09548), with the ADMET_Group leaderboard documented at tdcommons.ai/benchmark/admet_group/overview/. The framework extension is Velez-Arce et al., “Signals in the Cells: Multimodal and Contextualized ML Foundations for Therapeutics,” NeurIPS 2024 (TDC-2). Recent LLM-driven ADMET workflows that follow the same benchmark protocol include Hao et al., “PharmaBench: Enhancing ADMET benchmarks with large language models,” Scientific Data 11:864 (2024) and Yuan et al., “Tx-LLM: A Large Language Model for Therapeutics,” arXiv:2406.06316 (2024), both of which report TDC leaderboard numbers as their headline metric. The skill’s SKILL.md is documentation plus Python recipes — the actual computation is the published TDC code path, so the assembly inherits the benchmark’s validation. No published head-to-head benchmark of the Claude-driven path against a hand-coded script is known — they call the same ADMET_Group API.

Alternatives considered

Hand-coded Python script. The same ADMET_Group API in a notebook works fine for a one-off benchmark. Reach for it when you don’t want a chat-driven workflow at all. The skill’s value is in (a) Claude knowing the metric direction per dataset without you having to remember and (b) emitting the leaderboard row in the expected format.
PyTDC + molfeat + datamol toolbelt. Add the molfeat skill when you want to sweep featurizers (Mordred descriptors, neural fingerprints, language-model embeddings); add datamol when you need preprocessing (standardization, salt stripping). Rung 3 — reach for it when the question is featurizer selection, not benchmark reporting.
DeepChem. A natural ML stack for ADMET prediction; flagged in catalog/curator-state.md as the next Chemistry pass. Today there is no Claude-installable DeepChem skill in the catalog, so DeepChem-based ADMET is deferred until that lands.
ADMET-AI / AdmetLab 3.0 / Deep-PK. Published ML predictors that report against TDC. None has a Claude-installable wrapper today — surfaced as a Missing component in recipes/curator-state.md. Reach for them via their web servers when you need a ready-made model and don’t want to fit your own; reach for the PyTDC recipe when you want a comparable benchmark number for your model.

Sources

Therapeutics Data Commons home — verified 2026-05-31.
TDC ADMET_Group leaderboard — verified 2026-05-31.
Huang et al., “Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development,” NeurIPS Datasets and Benchmarks 2021 (arXiv:2102.09548).
Velez-Arce et al., “Signals in the Cells: Multimodal and Contextualized ML Foundations for Therapeutics,” NeurIPS 2024 (TDC-2).
Hao et al., “PharmaBench: Enhancing ADMET benchmarks with large language models,” Scientific Data 11:864 (2024).
Yuan et al., “Tx-LLM: A Large Language Model for Therapeutics,” arXiv:2406.06316 (2024).
mims-harvard/TDC — verified 2026-05-31.
K-Dense-AI/scientific-agent-skills (scientific-skills/pytdc/SKILL.md) — verified 2026-05-31.
