Skip to main content

Agent-friendly CLI for PDF annotation writeback (highlight, note, extract, apply) with stable IDs and idempotent operations.

Project description

pdfanno

Python 3.12+ License: AGPL-3.0-or-later

Agent-friendly CLI for PDF annotation writeback.

pdfanno is the missing piece between "extract PDF text for an LLM" (many tools) and "highlight matches back into the PDF" (almost none). Search, add highlights and sticky notes, and write them back into a PDF — deterministically, idempotently, and with structured I/O an agent can round-trip.

Why yet another PDF tool

At the time of 0.1.0 (April 2026) the open-source CLI/MCP ecosystem for PDFs is read-heavy: pdfannots, pdf-reader-mcp, pdf-agent-mcp, pymupdf4llm-mcp, docling-mcp all extract. None write annotations back. pdfanno is the reference implementation for the writeback side, with engineering properties tuned for agents:

  • Stable annotation_id across runs, library upgrades, and save→reopen round-trips. Formula: sha256(doc_id + kind + page + normalized_quads + normalized_matched_text + rule_hash) with quads rounded to 2 decimal places in PDF points.
  • Idempotent writes. Re-running the same command never creates duplicate annotations unless you pass --allow-duplicates.
  • Sidecar-first safety. --sidecar stores drafts in a local SQLite store; export commits them to a PDF copy; the original PDF is never modified unless you say --in-place.
  • --in-place pre-checks. Refuses encrypted / signed / permission- restricted / XFA / JavaScript PDFs before touching bytes.
  • Shared schema for --dry-run and apply. Preview output is a valid input to apply; no drift between what you plan and what you execute.
  • schema_version: 1 JSON contract. Fields are stable; breaking changes bump the version.

Install

pip install pdfanno

Requires Python 3.12 or newer. Runtime deps: PyMuPDF, Typer, Pydantic.

Quick tour

# Highlight a word, write to a new file (original never touched).
pdfanno highlight paper.pdf "transformer" -o paper.annotated.pdf

# Preview what would happen, as structured JSON.
pdfanno highlight paper.pdf "transformer" -o out.pdf --dry-run --json

# Add a sticky note on page 3.
pdfanno note paper.pdf --page 3 --text "revisit this claim" -o paper.noted.pdf

# List existing annotations.
pdfanno list paper.annotated.pdf --json

# Extract to JSON / Markdown.
pdfanno extract paper.annotated.pdf --format json > annotations.json
pdfanno extract paper.annotated.pdf --format markdown

# Apply a plan (the JSON from --dry-run or a hand-edited version).
pdfanno apply paper.pdf plan.json -o paper.applied.pdf --dry-run --json
pdfanno apply paper.pdf plan.json -o paper.applied.pdf

# Sidecar workflow: draft now, commit later.
pdfanno highlight paper.pdf "ATP synthase" --sidecar
pdfanno note     paper.pdf --page 2 --text "key result" --sidecar
pdfanno status   paper.pdf --json
pdfanno export   paper.pdf -o paper.annotated.pdf

# Imported a PDF from another reader? Pull its annotations into the sidecar.
pdfanno import paper.with_external_highlights.pdf

# Renamed or moved the PDF? Rebind the sidecar records to the new path.
pdfanno rebind old/path/paper.pdf new/path/paper.pdf

Exit codes

Code Meaning
0 Success. Also returned when zero matches or zero new annotations after dedup.
2 Usage error (bad flags, unknown color, page out of range, invalid plan JSON).
3 Input/file error (missing file, can't open, encrypted without password).
4 Processing error (save failed, in-place refused, partial write failure).

AnnotationPlan schema

--dry-run and apply share the same JSON contract (see plan.md §8.3):

{
  "schema_version": 1,
  "doc_id": "id:6a43c29b17151dc2821dc38706681260",
  "rules": [
    {
      "rule_id": "rule-001",
      "kind": "highlight",
      "query": "transformer",
      "mode": "literal",
      "color": [1.0, 1.0, 0.0],
      "page_range": null
    }
  ],
  "annotations": [
    {
      "annotation_id": "2d71c6c4b24ca04985546051c6e295330280e8e91b214664ab755605b805ffc4",
      "rule_id": "rule-001",
      "kind": "highlight",
      "page": 0,
      "matched_text": "transformer",
      "quads": [[72.12, 144.34, 130.55, 144.34, 72.12, 158.02, 130.55, 158.02]],
      "color": [1.0, 1.0, 0.0],
      "contents": "",
      "source": "plan"
    }
  ]
}

Consumers should treat unknown keys as forward-compatible additions — models are pydantic extra="allow".

Document identity

pdfanno never uses whole-file hashing for document identity (incremental saves change the bytes). It uses:

  1. Primary: PDF trailer /ID[0], prefixed id:.
  2. Fallback: fb:<page_count>:<first_page_text_hash>:<file_size>.

If you move or rename a PDF, the identity is preserved; if the PDF's content is edited elsewhere and the /ID regenerates, run pdfanno rebind.

Non-goals (v1)

pdfanno intentionally stops before:

  • Regex / sentence / section-scoped matching (slated for v1.5).
  • Terminal UI (TUI) — Textual app is v0.2.0 (Phase 2).
  • Kitty/Sixel image rendering — v0.3.0 (Phase 3).
  • OCR on scanned PDFs.
  • Automatic merge between sidecar drafts and externally-edited PDF annotations.
  • Multi-document knowledge bases.

See plan.md for the full product spec.

Safety defaults

  • pdfanno never overwrites the input PDF unless you pass --in-place.
  • --in-place refuses encrypted, signed, permission-restricted, XFA-form, and JavaScript-bearing PDFs (exit code 4 with a human-readable reason).
  • Repeated runs of the same command dedupe on annotation_id and produce annotations_created: 0 on subsequent runs.
  • External annotations on the PDF are preserved — pdfanno only manages the annotations it created (identified via the /NM field).

License

pdfanno depends on PyMuPDF, which is distributed under AGPL-3.0 by Artifex. pdfanno is therefore released under AGPL-3.0-or-later.

Commercial or closed-source distribution requires either AGPL-3.0 compliance across the combined work, or a MuPDF commercial license from Artifex.

See LICENSE. SPDX identifier: AGPL-3.0-or-later.

Development

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

ruff format --check .
ruff check .
pytest -v

Contributor conventions live in AGENTS.md. Design rationale and phase plan in plan.md.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdfanno-0.1.0.tar.gz (58.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pdfanno-0.1.0-py3-none-any.whl (45.4 kB view details)

Uploaded Python 3

File details

Details for the file pdfanno-0.1.0.tar.gz.

File metadata

  • Download URL: pdfanno-0.1.0.tar.gz
  • Upload date:
  • Size: 58.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pdfanno-0.1.0.tar.gz
Algorithm Hash digest
SHA256 17de1cb6fe9144a61c1d15c92008fc2b3961e34b3c0fb787901f5505d85afc49
MD5 2de3dbe8699359e7f2aa7ee87571a9b6
BLAKE2b-256 1fdc106d592753b914475e3eed4fdb729211ff719d23a0d5a8e09c22af99098e

See more details on using hashes here.

File details

Details for the file pdfanno-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pdfanno-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 45.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for pdfanno-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3bcc4caaa5f916e94188fd30ab1de86f874932ea74065ee851a74ae8f76c6b1f
MD5 6dfb40b91cac80ba6f48056e863c548c
BLAKE2b-256 b1daeb4428f0d3e502c118bcb47ce6ca94586b1e3a0933489526a8938ef76aba

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page