Schema-validated extraction of neoantigen cancer-vaccine immunogenicity data from primary papers, on the Claude Agent SDK (bring-your-own-key).
vaxtract
Schema-validated extraction of neoantigen cancer-vaccine immunogenicity data from primary papers, built on the Claude Agent SDK. Point it at a folder of paper files (PDF / XLSX / DOCX) and it returns a schema-validated, provenance-tracked JSON extraction (per-peptide / per-epitope immunogenicity, HLA restriction, evidence, survival outcomes, …) for human sign-off.
Bring your own key (BYOK). You run the agent and pay for your own Anthropic usage (~$3/paper, varies). There is no hosted vaxtract service: the agent reads your files locally, though the content it reads is sent to the Anthropic API as model input.
Output is silver, not gold. Every record carries provenance and is meant for a curator to review before use, not to be treated as ground truth.
Install
pip install vaxtract # core: the schema/vocab data contract only
pip install "vaxtract[agent]" # + the extraction agent (Claude Agent SDK + readers)
pip install "vaxtract[agent,figures]" # + figure/image reading (PyMuPDF + Pillow)
pip install vaxtract pulls only pydantic, so you can import vaxtract.schema
to validate records without the Claude Agent SDK. Running the agent (the
vaxtract console script or vaxtract.extract_paper) needs the [agent] extra.
Running the agent also requires Python ≥ 3.10 and the Claude Code CLI on your
PATH — the Claude Agent SDK shells out to the claude binary:
npm install -g @anthropic-ai/claude-code
(The Docker image below bundles this for you.)
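The two runtime requirements above (Python ≥ 3.10 and a `claude` binary on PATH) can be checked before a run. The following is a minimal sketch using only the standard library; `preflight` is a hypothetical helper name, not part of vaxtract.

```python
import shutil
import sys


def preflight(which=shutil.which, version=sys.version_info):
    """Return a list of human-readable problems; an empty list means ready.

    `preflight` is an illustrative helper, not a vaxtract API. The `which`
    and `version` parameters exist only so the check can be exercised
    without touching the real environment.
    """
    problems = []
    # Running the agent needs Python 3.10 or newer.
    if tuple(version[:2]) < (3, 10):
        problems.append(
            f"Python >= 3.10 required, found {version[0]}.{version[1]}"
        )
    # The Claude Agent SDK shells out to the `claude` binary, so it must be on PATH.
    if which("claude") is None:
        problems.append(
            "`claude` not found on PATH (npm install -g @anthropic-ai/claude-code)"
        )
    return problems
```

Calling `preflight()` with no arguments inspects the current interpreter and PATH; printing the returned list before starting a paid run is one way to fail early.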
Authenticate (pick one)
# A) API key — pay-per-token
export ANTHROPIC_API_KEY=sk-ant-...
# B) Claude subscription — use a logged-in plan via the `claude` CLI
# (pass --subscription; the key is ignored)
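For scripted runs, the two authentication options map onto the two invocation forms listed under Run. The sketch below builds the argument list for either mode and fails early if option A is chosen without a key. `build_command` is a hypothetical helper; the only CLI syntax it relies on is the `vaxtract [--subscription] <paper_dir> <out.json>` form documented here.

```python
import os


def build_command(paper_dir, out_path, subscription=False, env=None):
    """Return the argv list for one vaxtract run, checking auth first.

    Illustrative helper, not part of vaxtract.
    """
    env = os.environ if env is None else env
    if subscription:
        # Option B: plan quota via the logged-in `claude` CLI; the key is ignored.
        return ["vaxtract", "--subscription", paper_dir, out_path]
    # Option A: pay-per-token, which needs ANTHROPIC_API_KEY in the environment.
    if not env.get("ANTHROPIC_API_KEY"):
        raise RuntimeError("set ANTHROPIC_API_KEY or pass subscription=True")
    return ["vaxtract", paper_dir, out_path]
```

The result can be handed to `subprocess.run(build_command("./my_paper_dir", "out.json"), check=True)`.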
Run
vaxtract ./my_paper_dir out.json
vaxtract --subscription ./my_paper_dir out.json # use plan quota
my_paper_dir is a folder containing the paper's .pdf and any supplementary
.xlsx / .docx. The agent reads the tables/text/figures, builds the record,
self-validates against the schema, and writes out.json.
As a library
import asyncio
from vaxtract import extract_paper
asyncio.run(extract_paper("./my_paper_dir", "out.json"))
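Because the library call above is driven with `asyncio.run`, several papers could in principle be processed from one script. The sketch below assumes `extract_paper` is an awaitable taking `(paper_dir, out_path)` exactly as in that call; whether concurrent runs are safe or advisable is not documented here, so the concurrency limit is kept small. `run_batch` is a hypothetical helper, and the extractor is passed in as an argument so the sketch does not depend on any other vaxtract API.

```python
import asyncio
from pathlib import Path


async def run_batch(extract, paper_dirs, out_dir, limit=2):
    """Run `extract(paper_dir, out_path)` per directory, at most `limit` at a time.

    Returns a list of (paper_dir, error) pairs in input order; error is None
    on success. Illustrative helper, not part of vaxtract.
    """
    out_root = Path(out_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    gate = asyncio.Semaphore(limit)

    async def one(paper_dir):
        # One output file per paper, named after the paper directory.
        out_path = out_root / f"{Path(paper_dir).name}.json"
        async with gate:
            try:
                await extract(str(paper_dir), str(out_path))
                return paper_dir, None
            except Exception as exc:  # keep going; report failures at the end
                return paper_dir, exc

    return await asyncio.gather(*(one(d) for d in paper_dirs))
```

Usage would look like `asyncio.run(run_batch(extract_paper, ["./paper_a", "./paper_b"], "./out"))`. Each paper is billed separately, so the per-paper cost noted earlier scales with the batch size.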
The data contract is importable without the SDK:
from vaxtract.schema import ExtractedPaper, SCHEMA_VERSION
What it extracts
Per paper: studies, patients, immunizing peptides, minimal epitopes, pools,
immunogenicity evidence (assay/outcome/magnitude), neoantigen mutations, survival
outcomes, clinical-benefit signals, safety, and vaccine-delivery covariates — all
validated against a versioned Pydantic schema (SCHEMA_VERSION).
Notes
- The agent is restricted to a curated toolset and is headless-safe (no host shell access; it cannot read or write outside the files you give it and the output path).
- Cost/turn backstops (max_turns, max_budget_usd) guard against runaway runs.
- Figure reading is optional; install the [figures] extra and ensure a working PyMuPDF (it bundles its own libraries, so no system Poppler is needed).
License
MIT © 2026 Samuel Ahuno