Agent-friendly CLI for PDF annotation writeback (highlight, note, extract, apply) with stable IDs and idempotent operations.

These details have not been verified by PyPI

Project description

pdfanno

Agent-friendly CLI for PDF annotation writeback.

pdfanno is the missing piece between "extract PDF text for an LLM" (many tools) and "highlight matches back into the PDF" (almost none). Search, add highlights and sticky notes, and write them back into a PDF — deterministically, idempotently, and with structured I/O an agent can round-trip.

Why yet another PDF tool

At the time of 0.1.0 (April 2026) the open-source CLI/MCP ecosystem for PDFs is read-heavy: pdfannots, pdf-reader-mcp, pdf-agent-mcp, pymupdf4llm-mcp, docling-mcp all extract. None write annotations back. pdfanno is the reference implementation for the writeback side, with engineering properties tuned for agents:

Stable annotation_id across runs, library upgrades, and save→reopen round-trips. Formula: sha256(doc_id + kind + page + normalized_quads + normalized_matched_text + rule_hash) with quads rounded to 2 decimal places in PDF points.
Idempotent writes. Re-running the same command never creates duplicate annotations unless you pass --allow-duplicates.
Sidecar-first safety. --sidecar stores drafts in a local SQLite store; export commits them to a PDF copy; the original PDF is never modified unless you say --in-place.
--in-place pre-checks. Refuses encrypted / signed / permission- restricted / XFA / JavaScript PDFs before touching bytes.
Shared schema for --dry-run and apply. Preview output is a valid input to apply; no drift between what you plan and what you execute.
schema_version: 1 JSON contract. Fields are stable; breaking changes bump the version.

Install

pip install pdfanno

Requires Python 3.12 or newer. Runtime deps: PyMuPDF, Typer, Pydantic.

Quick tour

# Highlight a word, write to a new file (original never touched).
pdfanno highlight paper.pdf "transformer" -o paper.annotated.pdf

# Preview what would happen, as structured JSON.
pdfanno highlight paper.pdf "transformer" -o out.pdf --dry-run --json

# Add a sticky note on page 3.
pdfanno note paper.pdf --page 3 --text "revisit this claim" -o paper.noted.pdf

# List existing annotations.
pdfanno list paper.annotated.pdf --json

# Extract to JSON / Markdown.
pdfanno extract paper.annotated.pdf --format json > annotations.json
pdfanno extract paper.annotated.pdf --format markdown

# Apply a plan (the JSON from --dry-run or a hand-edited version).
pdfanno apply paper.pdf plan.json -o paper.applied.pdf --dry-run --json
pdfanno apply paper.pdf plan.json -o paper.applied.pdf

# Sidecar workflow: draft now, commit later.
pdfanno highlight paper.pdf "ATP synthase" --sidecar
pdfanno note     paper.pdf --page 2 --text "key result" --sidecar
pdfanno status   paper.pdf --json
pdfanno export   paper.pdf -o paper.annotated.pdf

# Imported a PDF from another reader? Pull its annotations into the sidecar.
pdfanno import paper.with_external_highlights.pdf

# Renamed or moved the PDF? Rebind the sidecar records to the new path.
pdfanno rebind old/path/paper.pdf new/path/paper.pdf

Migrating annotations across PDF versions (`diff`)

pdfanno diff OLD.pdf NEW.pdf compares the annotations already stored in OLD.pdf against NEW.pdf and classifies each one as:

Status	Meaning
`preserved`	Same page, same location (text still there, centers within ~15 pt).
`relocated`	Same text found, but moved to another page or position.
`changed`	Text around the annotation is recognizably edited.
`ambiguous`	Multiple candidates with close scores — flagged for review.
`broken`	Text no longer found in the new version (or only unrecognizable candidates).

Each result carries a confidence in [0, 1] decomposed into five signals (text / context / layout / page proximity / length). Agents can filter by status + confidence and pipe the rest to human review.

# Emit a diff report (JSON) for a paper that got revised.
pdfanno diff paper_v1.pdf paper_v2.pdf --json > diff.json

# Or write directly to a file, with a human-readable summary on stderr.
pdfanno diff paper_v1.pdf paper_v2.pdf --diff-out diff.json

See docs/diff.md for the full migration workflow, and docs/examples/arxiv_attention_v1_to_v5.md for a 3-minute walkthrough on a real 39-highlight paper.

Exit codes

Code	Meaning
0	Success. Also returned when zero matches or zero new annotations after dedup.
2	Usage error (bad flags, unknown color, page out of range, invalid plan JSON).
3	Input/file error (missing file, can't open, encrypted without password).
4	Processing error (save failed, in-place refused, partial write failure).

AnnotationPlan schema

--dry-run and apply share the same JSON contract (see plan.md §8.3):

{
  "schema_version": 1,
  "doc_id": "id:6a43c29b17151dc2821dc38706681260",
  "rules": [
    {
      "rule_id": "rule-001",
      "kind": "highlight",
      "query": "transformer",
      "mode": "literal",
      "color": [1.0, 1.0, 0.0],
      "page_range": null
    }
  ],
  "annotations": [
    {
      "annotation_id": "2d71c6c4b24ca04985546051c6e295330280e8e91b214664ab755605b805ffc4",
      "rule_id": "rule-001",
      "kind": "highlight",
      "page": 0,
      "matched_text": "transformer",
      "quads": [[72.12, 144.34, 130.55, 144.34, 72.12, 158.02, 130.55, 158.02]],
      "color": [1.0, 1.0, 0.0],
      "contents": "",
      "source": "plan"
    }
  ]
}

Consumers should treat unknown keys as forward-compatible additions — models are pydantic extra="allow".

Document identity

pdfanno never uses whole-file hashing for document identity (incremental saves change the bytes). It uses:

Primary: PDF trailer /ID[0], prefixed id:.
Fallback: fb:<page_count>:<first_page_text_hash>:<file_size>.

If you move or rename a PDF, the identity is preserved; if the PDF's content is edited elsewhere and the /ID regenerates, run pdfanno rebind.

Non-goals (v1)

pdfanno intentionally stops before:

Regex / sentence / section-scoped matching (slated for v1.5).
Terminal UI (TUI) for the PDF reader — Phase 2. A narrower pdfanno review diff.json TUI (just for reviewing diff output) is on deck.
Kitty/Sixel image rendering — v0.3.0 (Phase 3).
OCR on scanned PDFs.
Automatic merge between sidecar drafts and externally-edited PDF annotations.
Multi-document knowledge bases.

See plan.md for the full product spec.

Safety defaults

pdfanno never overwrites the input PDF unless you pass --in-place.
--in-place refuses encrypted, signed, permission-restricted, XFA-form, and JavaScript-bearing PDFs (exit code 4 with a human-readable reason).
Repeated runs of the same command dedupe on annotation_id and produce annotations_created: 0 on subsequent runs.
External annotations on the PDF are preserved — pdfanno only manages the annotations it created (identified via the /NM field).

License

pdfanno depends on PyMuPDF, which is distributed under AGPL-3.0 by Artifex. pdfanno is therefore released under AGPL-3.0-or-later.

Commercial or closed-source distribution requires either AGPL-3.0 compliance across the combined work, or a MuPDF commercial license from Artifex.

See LICENSE. SPDX identifier: AGPL-3.0-or-later.

Development

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

ruff format --check .
ruff check .
pytest -v

Contributor conventions live in AGENTS.md. Design rationale and phase plan in plan.md.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.2.2

Apr 23, 2026

0.2.1

Apr 22, 2026

0.1.0

Apr 21, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdfanno-0.2.2.tar.gz (185.6 kB view details)

Uploaded Apr 23, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pdfanno-0.2.2-py3-none-any.whl (73.9 kB view details)

Uploaded Apr 23, 2026 Python 3

File details

Details for the file pdfanno-0.2.2.tar.gz.

File metadata

Download URL: pdfanno-0.2.2.tar.gz
Upload date: Apr 23, 2026
Size: 185.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for pdfanno-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`d752a4b908d3eb3fb57da5b3a97d2f58cc45abf8375660981a00530ce4b5aaae`
MD5	`f7cb3bce17bbd1880612a760e02b87e1`
BLAKE2b-256	`c6e3359718b474a72c4def96bcba53a072946aaa897aebc14061ba218e022717`

See more details on using hashes here.

File details

Details for the file pdfanno-0.2.2-py3-none-any.whl.

File metadata

Download URL: pdfanno-0.2.2-py3-none-any.whl
Upload date: Apr 23, 2026
Size: 73.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for pdfanno-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`57f724f63b5e43f597660ace83a8ed264b4cbf6b73f9651c256ae89a2f15d89d`
MD5	`03d9ef52239af5ef825be6d764e0e79a`
BLAKE2b-256	`5a575acc90f958909305ca76deacb12a84cb849527b1376c0cd12694d3b98263`

See more details on using hashes here.

pdfanno 0.2.2

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

pdfanno

Why yet another PDF tool

Install

Quick tour

Migrating annotations across PDF versions (`diff`)

Exit codes

AnnotationPlan schema

Document identity

Non-goals (v1)

Safety defaults

License

Development

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

pdfanno 0.2.2

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

pdfanno

Why yet another PDF tool

Install

Quick tour

Migrating annotations across PDF versions (diff)

Exit codes

AnnotationPlan schema

Document identity

Non-goals (v1)

Safety defaults

License

Development

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Migrating annotations across PDF versions (`diff`)