Apache-2.0-licensed, pure-Rust reimplementation of PyMuPDF (fitz) with PyO3 bindings.

pdfspine

An Apache-2.0-licensed, pure-Rust reimplementation of PyMuPDF (fitz), with PyO3 Python bindings.

🦴 Part of the spine family — framework-free backend engines, each the spine of a domain: zero framework lock-in, Protocol-ized seams, offline-capable. pdfspine is the PDF spine (this repo); ragspine is the RAG spine (deterministic dual-channel retrieval + agent orchestration).

Status: alpha / pre-1.0, but the core is feature-complete. pdfspine can already parse/repair/decrypt PDFs, extract text & tables, search, edit / merge / split / save (incl. byte-exact incremental), encrypt, annotate, fill & flatten forms, redact (destructively), open image files as documents, render pages to images, and OCR (Tesseract + a pure-Rust PaddleOCR engine, stronger on CJK). 88.7% (682 / 769) of the PyMuPDF 1.24 public API is implemented and tested (climbing), with 1349+ Rust tests + 593+ Python tests green. Text extraction is at fitz parity (and beats fitz on Arabic / RTL), rendering is near-parity and ~1.74× faster, and the pure-Rust PaddleOCR engine beats fitz on CJK scans (see Accuracy). Not yet on PyPI — build from source for now.

Why pdfspine?

PyMuPDF is excellent, but it is AGPL-3.0 (or a commercial license from Artifex) — a non-starter for many closed-source products, SaaS backends, and permissively-licensed open-source projects.

pdfspine is a drop-in-shaped, permissively-licensed (Apache-2.0) alternative:

Apache-2.0 throughout — permissive, with an explicit patent grant. The dependency graph is gated by cargo-deny to exclude GPL / AGPL / LGPL / MPL / SSPL from the shipped wheel. License cleanliness is CI-enforced, not a promise.
Pure Rust, no C blob. Self-contained wheels, no system zlib/C linkage, no bundled prebuilt engine (the differentiator vs pdfium-based wrappers).
import fitz compatible (opt-in). A compatibility shim lets much existing PyMuPDF code run unmodified — available as import pdfspine.fitz as fitz, or registered under the global fitz / pymupdf names with one call to pdfspine.install_fitz_shim(). A default install is collision-safe: it does not claim those global names, so it coexists with a real PyMuPDF in the same environment. A machine-readable COMPAT.toml documents every symbol's status.
Memory-safe by construction. #![forbid(unsafe_code)] in every first-party crate except the single audited PyO3 FFI chokepoint.
Clean-room. No code, tests, or fixtures derived from MuPDF / PyMuPDF / any AGPL source.

What works today

| Area | Capabilities |
|---|---|
| Read | open (file/bytes), malformed-PDF repair, encrypted PDFs (RC4 / AES-128 / AES-256, R2–R6) |
| Text | `get_text` (`text/words/blocks/dict/rawdict/json/html/xhtml/xml`), `search_for`, `TextPage`, fonts/images inventory |
| Tables | `find_tables` with merged-cell detection → `extract()` / `to_markdown()` / `to_html()` |
| Edit & save | full + byte-exact incremental save, garbage collection, page insert/delete/copy/move/select, `insert_pdf` merge, metadata/XMP, TOC, links, encryption write |
| Annotate | all common annotation types with `/AP` appearance streams; AcroForm read / fill / flatten + `Widget`; destructive redaction (verified content removal) |
| Render | `get_pixmap` (vector + text + image + shadings via a tiny-skia rasterizer), `Pixmap` (buffer-protocol/numpy), `DisplayList`, `get_svg_image` |
| Images | open PNG/JPEG/TIFF/GIF/BMP/WEBP as documents, `convert_to_pdf`, image-XObject decode (DCT/CCITT/JBIG2/JPX), `extract_image` |
| Layers | Optional Content Groups read/write (`get_ocgs` / `add_ocg` / `set_layer`) |
| OCR | pluggable engine: Tesseract adapter and a pure-Rust PaddleOCR engine (PP-OCRv4, embedded models, stronger on CJK) → searchable-sandwich PDF |
| CLI | `pdfspine info / text / render / merge / split / pages / images / toc` |

Planned next: reading-order accuracy improvements, Type1/Type3 glyph rendering, broader CJK coverage. See PRD.md / docs/ROADMAP.md. Out of scope: digital-signature creation.
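A PDF incremental update appends new objects, a new cross-reference section and a new trailer after the existing file content, so the original bytes survive as a prefix of the updated file. One reading of "byte-exact incremental" is exactly that property, and it can be checked without any PDF library. The helper name below is hypothetical and the byte strings are synthetic stand-ins, not valid PDFs.

```python
def is_incremental_update(original: bytes, updated: bytes) -> bool:
    """True when `updated` is `original` plus appended data ending in %%EOF."""
    return (
        len(updated) > len(original)
        and updated.startswith(original)
        and updated.rstrip().endswith(b"%%EOF")
    )


# Synthetic byte layouts standing in for real files (not parseable PDFs).
base = b"%PDF-1.7\n...objects...\nxref\n...\ntrailer\n...\nstartxref\n123\n%%EOF\n"
appended = base + b"7 0 obj\n<<>>\nendobj\nxref\n...\ntrailer\n...\nstartxref\n456\n%%EOF\n"
rewritten = base.replace(b"objects", b"OBJECTS")

print(is_incremental_update(base, appended))   # True: original bytes kept as a prefix
print(is_incremental_update(base, rewritten))  # False: earlier bytes were modified
```

In practice the two arguments would be the file contents read before and after an incremental save.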

Quick start

import pdfspine

doc = pdfspine.open("input.pdf")
print(len(doc), "pages", doc.metadata)

page = doc[0]
print(page.get_text())                       # plain text
print(page.search_for("invoice"))            # list[Rect]
page.get_pixmap(dpi=150).save("page1.png")   # render to image

tables = page.find_tables()
for t in tables.tables:
    print(t.to_markdown())                    # or t.to_html() for merged cells

doc.save("output.pdf", garbage=4, deflate=True)

Existing PyMuPDF code often runs unchanged via the opt-in compat shim:

import pdfspine.fitz as fitz                  # the shim, no global-name collision
doc = fitz.open("input.pdf")
text = doc[0].get_text("dict")

# Or make the literal `import fitz` resolve to the shim (one-time opt-in):
import pdfspine
pdfspine.install_fitz_shim()
import fitz                                    # now -> pdfspine's fitz shim

A default install does not claim the global fitz / pymupdf names, so it is safe alongside a real PyMuPDF; install_fitz_shim() uses setdefault and never clobbers a PyMuPDF you imported first.
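The non-clobbering behaviour described above follows from how `dict.setdefault` works on `sys.modules`. The sketch below demonstrates the general mechanism with stand-in modules; `install_alias` and the `demo_*` names are hypothetical, and this is not pdfspine's actual implementation.

```python
import importlib
import sys
import types


def install_alias(alias: str, module: types.ModuleType) -> types.ModuleType:
    """Register `module` under `alias` unless that name is already taken."""
    # dict.setdefault only inserts when the key is absent, so a module
    # already imported under the same name is left untouched.
    return sys.modules.setdefault(alias, module)


shim = types.ModuleType("demo_shim")          # stand-in for a compat shim
real = types.ModuleType("demo_real_library")  # stand-in for a real library

# Case 1: the name is free, so the shim is registered.
assert install_alias("demo_alias_free", shim) is shim

# Case 2: the name is already taken, so the existing module wins.
sys.modules["demo_alias_taken"] = real
assert install_alias("demo_alias_taken", shim) is real

# Imports consult sys.modules first, so the alias now resolves to the shim.
print(importlib.import_module("demo_alias_free") is shim)  # True
```

Because the import system checks `sys.modules` before searching for a file, registering the alias is enough for a later plain `import` of that name to return the registered module.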

Command line:

pdfspine info report.pdf
pdfspine text report.pdf --pages 1-3 --format json -o out.json
pdfspine render report.pdf --dpi 200 -o images/
pdfspine merge a.pdf b.pdf -o merged.pdf

Accuracy

Validated against an objective ground-truth harness and with PyMuPDF (fitz) as the differential oracle (clean-room: the AGPL oracle is run locally only and never committed). See docs/BENCHMARKS.md and the conformance/gt/ reports for the dated, reproducible evidence.

Text extraction is at fitz parity on born-digital corpora, and beats fitz on Arabic / RTL (correct bidi reordering).
Rendering is near-parity with fitz (page-image SSIM ~0.945) and ~1.74× faster after a font-cache fix.
OCR beats fitz on CJK scans: the pure-Rust PaddleOCR engine (PP-OCRv4, with models embedded in the wheel) outperforms fitz's OCR path on Chinese/Japanese/Korean documents.
Real-corpus robustness: open rate 100%, 0 panics/hangs, re-saved files 100% qpdf --check-clean across the public-domain US-government corpus.

Remaining accuracy work (multi-column reading order, Type1/Type3 glyph rendering, broader CJK) is tracked in docs/PRD-NEXT.md.
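For readers unfamiliar with the SSIM figure quoted for rendering, the sketch below computes a simplified, single-window SSIM over two grayscale pixel sequences. Standard SSIM is usually averaged over local windows, and which variant the project's harness uses is not stated here, so treat this only as an illustration of the metric. The function name is hypothetical.

```python
from statistics import fmean


def global_ssim(a: list[float], b: list[float], dynamic_range: float = 255.0) -> float:
    """Single-window SSIM over two equal-length grayscale pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    # Stabilising constants from the usual SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean((x - mu_a) ** 2 for x in a)
    var_b = fmean((y - mu_b) ** 2 for y in b)
    cov = fmean((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


reference = [0.0, 64.0, 128.0, 255.0]
print(global_ssim(reference, reference))        # identical inputs score 1.0 (up to rounding)
print(global_ssim(reference, reference[::-1]))  # reversed pixels score well below 1
```

A score of 1.0 means the two images are structurally identical, and lower values indicate increasing structural difference.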

Build & install

Requirements: Rust (pinned to 1.96.0 by rust-toolchain.toml), Python ≥ 3.11, maturin ≥ 1.7. uv recommended.

uv venv .venv && source .venv/bin/activate
maturin develop                 # build + install the extension in-place
python -c "import pdfspine; print(pdfspine.__version__)"
# redistributable wheel:
maturin build --release         # -> target/wheels/

Building from source needs a C/asm compiler. The bundled pure-Rust PaddleOCR engine depends on tract, which compiles target-specific assembly kernels at build time: a C compiler (cc/clang) on Linux/macOS, or the MSVC Build Tools (incl. ml64.exe) on Windows. Prebuilt wheels (once published) need none of this. To build a fully C-free library, compile the Rust crates with --no-default-features (drops the paddle-ocr feature). Wheels are large (~15–25 MB) because the OCR models (~16 MB) are embedded.

Architecture

A Cargo workspace with a strict dependency DAG; the Python bindings touch exactly one façade crate, and core logic is split into independently testable units.

                  py-bindings   (PyO3 cdylib -> pdfspine._core, abi3-py311)
                       │
                       ▼
                    pdf-api      facade / re-exports
        ┌──────────┬───┴────┬──────────┐
        ▼          ▼        ▼          ▼
    pdf-text   pdf-edit  pdf-image  pdf-render
        │          │        │          │
        └────┬─────┘        │     (fonts, text)
             ▼              │
         pdf-fonts ◄────────┘
             ▼
         pdf-core   ◄────────  pdf-crypto

| Crate | Responsibility |
|---|---|
| `pdf-core` | object model, lexer/parser, xref, repair, filters, writer, geometry |
| `pdf-crypto` | Standard security handler (RC4 / AES-128 / AES-256) |
| `pdf-fonts` | font mapping (encodings / ToUnicode / CMap / widths) |
| `pdf-text` | content-stream interpreter, `get_text`, search, `find_tables` |
| `pdf-edit` | page ops, merge, annotations / forms, metadata / TOC, redaction, OCG |
| `pdf-image` | image documents, image-XObject codecs, `Pixmap` |
| `pdf-render` | tiny-skia rasterizer → `Pixmap`, `DisplayList`, SVG |
| `pdf-api` | unified ergonomic façade |
| `py-bindings` | PyO3 wrappers → the `_core` extension module |
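The "strict dependency DAG" claim can be made concrete by encoding the diagram as a graph and asking for a topological order. The edge list below is one reading of the ASCII diagram (arrows taken to point from a crate to what it depends on) and may not match the workspace's actual Cargo manifests.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# crate -> crates it depends on, as read from the diagram (an interpretation).
DEPENDS_ON = {
    "py-bindings": {"pdf-api"},
    "pdf-api": {"pdf-text", "pdf-edit", "pdf-image", "pdf-render"},
    "pdf-text": {"pdf-fonts"},
    "pdf-edit": {"pdf-fonts"},
    "pdf-image": {"pdf-fonts"},
    "pdf-render": {"pdf-fonts", "pdf-text"},
    "pdf-fonts": {"pdf-core"},
    "pdf-crypto": {"pdf-core"},
    "pdf-core": set(),
}

# static_order() raises graphlib.CycleError if the graph contains a cycle,
# so obtaining an order at all confirms this edge list is a DAG.
build_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(build_order)

# The bindings crate should depend on the facade crate and nothing else.
print(DEPENDS_ON["py-bindings"] == {"pdf-api"})  # True
```

Dependencies always appear before the crates that use them, so `pdf-core` comes first and `pdf-api` precedes `py-bindings`.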

Develop / test

cargo fmt --all --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace
maturin develop && pytest python/tests       # Python tests
python conformance/run_validation.py …       # real-corpus accuracy harness

pdfspine is built strictly test-first (red → green → refactor → harden); the per-function test plan is in docs/test-case-catalog.md.

Documentation

Guide + API reference + PyMuPDF migration guide: build the docs site with mkdocs serve (see mkdocs.yml / docs/). The authoritative design lives in PRD.md.

License

Apache-2.0 — see LICENSE and NOTICE. All third-party dependencies are permissive (MIT / Apache-2.0 / BSD / Zlib / …); the shipped graph is CI-verified free of copyleft.

Release history

0.0.4: Jun 23, 2026
0.0.3: Jun 22, 2026
0.0.2: Jun 22, 2026
0.0.1 (this version): Jun 21, 2026

