vaxtract

Schema-validated extraction of neoantigen cancer-vaccine immunogenicity data from primary papers, built on the Claude Agent SDK. Point it at a folder of paper files (PDF / XLSX / DOCX) and it returns a schema-validated, provenance-tracked JSON extraction (per-peptide / per-epitope immunogenicity, HLA restriction, evidence, survival outcomes, …) for human sign-off.

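Bring your own key (BYOK). You run the agent yourself and pay for your own Anthropic usage (roughly $3 per paper, though this varies). There is no hosted vaxtract service: the agent runs on your machine, your files are not uploaded to any third-party server run by this project, and paper content reaches Anthropic only as part of the model calls made under your own credentials.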

Output is silver, not gold. Every record carries provenance and is meant for a curator to review before use, not to be treated as ground truth.

Install

pip install vaxtract                    # core: the schema/vocab data contract only
pip install "vaxtract[agent]"           # + the extraction agent (Claude Agent SDK + readers)
pip install "vaxtract[agent,figures]"   # + figure/image reading (PyMuPDF + Pillow)

A plain pip install vaxtract pulls in only pydantic, so you can import vaxtract.schema to validate records without the Claude Agent SDK. Running the agent (via the vaxtract console script or vaxtract.extract_paper) requires the [agent] extra.
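To see which of the three install tiers is usable in the current environment, a small import probe can help. This is a sketch, not part of vaxtract: the helper names are hypothetical, and the import names (pydantic, claude_agent_sdk, fitz for PyMuPDF, PIL for Pillow) are assumptions about what each extra provides.

```python
from importlib.util import find_spec

# Assumed import names per install tier (not taken from vaxtract itself).
EXTRA_MODULES = {
    "core": ["pydantic"],
    "agent": ["claude_agent_sdk"],
    "figures": ["fitz", "PIL"],
}


def missing_modules(modules):
    """Return the top-level module names from `modules` that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


def extras_status(extra_modules=EXTRA_MODULES):
    """Map each tier to the list of its modules that are missing."""
    return {tier: missing_modules(mods) for tier, mods in extra_modules.items()}
```

Calling extras_status() returns a dict such as {"core": [], "agent": [...], "figures": [...]}, where an empty list means that tier's modules were all found.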

Running the agent also requires Python ≥ 3.10 and the Claude Code CLI on your PATH — the Claude Agent SDK shells out to the claude binary:

npm install -g @anthropic-ai/claude-code

(The project's Docker image bundles this for you.)
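The two requirements above can be checked before a run. The following is a minimal sketch using only the standard library; the preflight function and its parameters are hypothetical and are not provided by vaxtract.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)


def preflight(version=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    version = sys.version_info[:2] if version is None else tuple(version[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found; 3.10 or newer is required"
        )
    # The Claude Agent SDK shells out to the `claude` binary, so it must be on PATH.
    if which("claude") is None:
        problems.append("`claude` not found on PATH; install the Claude Code CLI")
    return problems
```

The version and which parameters exist only so the check can be exercised without changing the real environment; calling preflight() with no arguments inspects the running interpreter and the actual PATH.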

Authenticate (pick one)

# A) API key — pay-per-token
export ANTHROPIC_API_KEY=sk-ant-...

# B) Claude subscription — use a logged-in plan via the `claude` CLI
#    (pass --subscription; the key is ignored)
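The choice between the two options can be written down as a small decision function. This is an illustrative sketch of the rule described above (a subscription run ignores the key), not vaxtract's own code; auth_mode and its return values are hypothetical.

```python
import os


def auth_mode(subscription=False, env=None):
    """Return "subscription" or "api_key", or raise if neither is available."""
    env = os.environ if env is None else env
    if subscription:
        # With --subscription, any ANTHROPIC_API_KEY that is set is ignored.
        return "subscription"
    if env.get("ANTHROPIC_API_KEY"):
        return "api_key"
    raise RuntimeError("Set ANTHROPIC_API_KEY or pass --subscription")
```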

Run

vaxtract ./my_paper_dir out.json
vaxtract --subscription ./my_paper_dir out.json   # use plan quota

my_paper_dir is a folder containing the paper's .pdf and any supplementary .xlsx / .docx. The agent reads the tables/text/figures, builds the record, self-validates against the schema, and writes out.json.
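Before spending tokens on a run, it can be useful to confirm that the folder holds what the agent expects. The sketch below groups files by the three extensions named above and insists on at least one PDF; the helper names are hypothetical, and it only looks at the top level of the folder, which is an assumption about how inputs are laid out.

```python
from pathlib import Path

# Extensions named in the description of my_paper_dir.
PAPER_SUFFIXES = {".pdf", ".xlsx", ".docx"}


def list_paper_files(paper_dir):
    """Return supported file names in `paper_dir`, grouped by lowercase suffix."""
    grouped = {suffix: [] for suffix in sorted(PAPER_SUFFIXES)}
    for path in sorted(Path(paper_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in PAPER_SUFFIXES:
            grouped[path.suffix.lower()].append(path.name)
    return grouped


def check_paper_dir(paper_dir):
    """Raise ValueError if the folder has no PDF; otherwise return the grouping."""
    grouped = list_paper_files(paper_dir)
    if not grouped[".pdf"]:
        raise ValueError(f"{paper_dir} contains no .pdf file")
    return grouped
```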

As a library

import asyncio
from vaxtract import extract_paper

asyncio.run(extract_paper("./my_paper_dir", "out.json"))
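For several papers, the same call can be wrapped in a bounded-concurrency loop. This is a sketch under stated assumptions: extract_many is a hypothetical helper, and it relies only on what the example above shows, namely that extract_paper(paper_dir, out_path) is awaitable. The extraction function is passed in as an argument so the helper does not import vaxtract itself.

```python
import asyncio
from pathlib import Path


async def extract_many(paper_dirs, out_dir, extract, limit=2):
    """Run `extract(paper_dir, out_path)` per folder, at most `limit` at once.

    Returns a list of (paper_dir, error) pairs; error is None on success.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    gate = asyncio.Semaphore(limit)

    async def run_one(paper_dir):
        # One output file per input folder, named after the folder.
        out_path = out_dir / f"{Path(paper_dir).name}.json"
        async with gate:
            try:
                await extract(str(paper_dir), str(out_path))
                return str(paper_dir), None
            except Exception as exc:  # report the failure and keep going
                return str(paper_dir), exc

    return await asyncio.gather(*(run_one(d) for d in paper_dirs))
```

With vaxtract installed, a call would look like asyncio.run(extract_many(["./paper_a", "./paper_b"], "./out", extract_paper)). Each paper is billed separately, so a low limit keeps spend predictable.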

The data contract is importable without the SDK:

from vaxtract.schema import ExtractedPaper, SCHEMA_VERSION
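A curator-side script might re-validate an output file against the contract before review. The sketch below assumes that vaxtract's models are pydantic v2 classes (so that model_validate_json exists) and that ExtractedPaper is the top-level model for out.json; neither is confirmed here, and load_extraction is a hypothetical helper. The model class is passed in as an argument.

```python
from pathlib import Path


def load_extraction(path, model):
    """Parse and validate the JSON file at `path` with a pydantic v2 model class."""
    text = Path(path).read_text(encoding="utf-8")
    # model_validate_json is the pydantic v2 classmethod; it raises
    # pydantic.ValidationError when the data does not match the schema.
    return model.model_validate_json(text)
```

Under those assumptions, load_extraction("out.json", ExtractedPaper) would return a validated record or raise on a schema mismatch.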

What it extracts

Per paper: studies, patients, immunizing peptides, minimal epitopes, pools, immunogenicity evidence (assay/outcome/magnitude), neoantigen mutations, survival outcomes, clinical-benefit signals, safety, and vaccine-delivery covariates — all validated against a versioned Pydantic schema (SCHEMA_VERSION).

Notes

  • The agent is restricted to a curated toolset and is headless-safe (no host shell access; it cannot read or write outside the files you give it and the output path).
  • Cost/turn backstops (max_turns, max_budget_usd) guard against runaway runs.
  • Figure reading is optional: install the [figures] extra and make sure PyMuPDF works (it bundles its own libraries, so no system Poppler is needed).
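To illustrate what a cost/turn backstop does, here is a toy guard that stops a loop once either limit is reached. It borrows the names max_turns and max_budget_usd from the note above, but the RunGuard class is hypothetical and says nothing about how vaxtract or the Claude Agent SDK enforce these limits internally.

```python
class RunGuard:
    """Toy backstop: stop once a turn limit or a spend limit is reached."""

    def __init__(self, max_turns, max_budget_usd):
        self.max_turns = max_turns
        self.max_budget_usd = max_budget_usd
        self.turns = 0
        self.spent_usd = 0.0

    def record_turn(self, cost_usd):
        """Account for one finished turn and its cost in USD."""
        self.turns += 1
        self.spent_usd += cost_usd

    def stop_reason(self):
        """Return the name of the limit that was hit, or None to continue."""
        if self.turns >= self.max_turns:
            return "max_turns"
        if self.spent_usd >= self.max_budget_usd:
            return "max_budget_usd"
        return None
```

A driver loop would call record_turn after each step and exit as soon as stop_reason() returns something other than None.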

License

MIT © 2026 Samuel Ahuno

Download files

Source Distribution

vaxtract-0.1.0.tar.gz (204.0 kB)

Uploaded Source

Built Distribution

vaxtract-0.1.0-py3-none-any.whl (147.5 kB)

Uploaded Python 3

File details

Details for the file vaxtract-0.1.0.tar.gz.

File metadata

  • Download URL: vaxtract-0.1.0.tar.gz
  • Upload date:
  • Size: 204.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.6

File hashes

Hashes for vaxtract-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e4ebaead7d41d5653a4d733e0e9f36179c552ff34e6b8670fe5e2851e741a3c6
MD5 e8d90d801c1a133f3bca9a80d485da71
BLAKE2b-256 ef55fc763e72340a280a59e684756038ad509b4d8132545bd2f93c57208da356

File details

Details for the file vaxtract-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: vaxtract-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 147.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.6

File hashes

Hashes for vaxtract-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b572672770a544c4607b646bc2ec3d6042bd1c843e8a7bcf58fd9c21dba4c395
MD5 de78a33f76f552693413a8d305e08c85
BLAKE2b-256 8629268ddc0309603877df2b7fb185af5c31aef23fbbf800614f80f2be8d1e61
