Apache-2.0-licensed, pure-Rust reimplementation of PyMuPDF (fitz) with PyO3 bindings.
pdfspine
An Apache-2.0-licensed, pure-Rust reimplementation of PyMuPDF (fitz), with PyO3 Python bindings.
🦴 Part of the spine family of framework-free backend engines, each the spine of a domain: zero framework lock-in, Protocol-ized seams, offline-capable. pdfspine is the PDF spine (this repo); ragspine is the RAG spine (deterministic dual-channel retrieval + agent orchestration).
Status: alpha / pre-1.0, but the core is feature-complete. pdfspine can already parse/repair/decrypt PDFs, extract text & tables, search, edit / merge / split / save (incl. byte-exact incremental), encrypt, annotate, fill & flatten forms, redact (destructively), open image files as documents, render pages to images, and OCR (Tesseract + a pure-Rust PaddleOCR engine, stronger on CJK). 88.7% (682 / 769) of the PyMuPDF 1.24 public API is implemented and tested (climbing), with 1349+ Rust tests + 593+ Python tests green. Text extraction is at fitz parity (and beats fitz on Arabic / RTL), rendering is near-parity and ~1.74× faster, and the pure-Rust PaddleOCR engine beats fitz on CJK scans (see Accuracy). Not yet on PyPI — build from source for now.
Why pdfspine?
PyMuPDF is excellent, but it is AGPL-3.0 (or a commercial license from Artifex) — a non-starter for many closed-source products, SaaS backends, and permissively-licensed open-source projects.
pdfspine is a drop-in-shaped, permissively-licensed (Apache-2.0) alternative:
- Apache-2.0 throughout: permissive, with an explicit patent grant. The dependency graph is gated by `cargo-deny` to exclude GPL / AGPL / LGPL / MPL / SSPL from the shipped wheel. License cleanliness is CI-enforced, not a promise.
- Pure Rust, no C blob. Self-contained wheels, no system zlib/C linkage, no bundled prebuilt engine (the differentiator vs pdfium-based wrappers).
- `import fitz` compatible (opt-in). A compatibility shim lets much existing PyMuPDF code run unmodified, available as `import pdfspine.fitz as fitz`, or registered under the global `fitz` / `pymupdf` names with one call to `pdfspine.install_fitz_shim()`. A default install is collision-safe: it does not claim those global names, so it coexists with a real PyMuPDF in the same environment. A machine-readable `COMPAT.toml` documents every symbol's status.
- Memory-safe by construction. `#![forbid(unsafe_code)]` in every first-party crate except the single audited PyO3 FFI chokepoint.
- Clean-room. No code, tests, or fixtures derived from MuPDF / PyMuPDF / any AGPL source.
What works today
| Area | Capabilities |
|---|---|
| Read | open (file/bytes), malformed-PDF repair, encrypted PDFs (RC4 / AES-128 / AES-256, R2–R6) |
| Text | get_text (text/words/blocks/dict/rawdict/json/html/xhtml/xml), search_for, TextPage, fonts/images inventory |
| Tables | find_tables with merged-cell detection → extract() / to_markdown() / to_html() |
| Edit & save | full + byte-exact incremental save, garbage collection, page insert/delete/copy/move/select, insert_pdf merge, metadata/XMP, TOC, links, encryption write |
| Annotate | all common annotation types with /AP appearance streams; AcroForm read / fill / flatten + Widget; destructive redaction (verified content removal) |
| Render | get_pixmap (vector + text + image + shadings via a tiny-skia rasterizer), Pixmap (buffer-protocol/numpy), DisplayList, get_svg_image |
| Images | open PNG/JPEG/TIFF/GIF/BMP/WEBP as documents, convert_to_pdf, image-XObject decode (DCT/CCITT/JBIG2/JPX), extract_image |
| Layers | Optional Content Groups read/write (get_ocgs / add_ocg / set_layer) |
| OCR | pluggable engine: Tesseract adapter and a pure-Rust PaddleOCR engine (PP-OCRv4, embedded models, stronger on CJK) → searchable-sandwich PDF |
| CLI | pdfspine info / text / render / merge / split / pages / images / toc |
Planned next: reading-order accuracy improvements, Type1/Type3 glyph rendering,
broader CJK coverage. See PRD.md / docs/ROADMAP.md.
Out of scope: digital-signature creation.
Quick start
```python
import pdfspine

doc = pdfspine.open("input.pdf")
print(len(doc), "pages", doc.metadata)

page = doc[0]
print(page.get_text())                        # plain text
print(page.search_for("invoice"))             # list[Rect]
page.get_pixmap(dpi=150).save("page1.png")    # render to image

tables = page.find_tables()
for t in tables.tables:
    print(t.to_markdown())                    # or t.to_html() for merged cells

doc.save("output.pdf", garbage=4, deflate=True)
```
Existing PyMuPDF code often runs unchanged via the opt-in compat shim:
```python
import pdfspine.fitz as fitz   # the shim, no global-name collision
doc = fitz.open("input.pdf")
text = doc[0].get_text("dict")

# Or make the literal `import fitz` resolve to the shim (one-time opt-in):
import pdfspine
pdfspine.install_fitz_shim()
import fitz                    # now -> pdfspine's fitz shim
```
A default install does not claim the global fitz / pymupdf names, so it
is safe alongside a real PyMuPDF; install_fitz_shim() uses setdefault and
never clobbers a PyMuPDF you imported first.
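The `setdefault` behaviour described above can be illustrated with the standard `sys.modules` registry. This is a minimal sketch of the general technique, not pdfspine's actual implementation: the helper `install_shim` and the `demo_*` module names are hypothetical, chosen so the example does not touch any real package.

```python
import sys
import types


def install_shim(shim: types.ModuleType, names: tuple[str, ...]) -> list[str]:
    """Register `shim` under each name unless a module already owns it.

    Returns the names that now resolve to the shim.
    """
    claimed = []
    for name in names:
        # setdefault only inserts when the key is absent, so a module that
        # was imported first keeps its slot.
        if sys.modules.setdefault(name, shim) is shim:
            claimed.append(name)
    return claimed


# Hypothetical stand-ins: a shim module, and a "real" module imported first.
shim = types.ModuleType("demo_shim")
existing = types.ModuleType("demo_pymupdf")
sys.modules["demo_pymupdf"] = existing

claimed = install_shim(shim, ("demo_fitz", "demo_pymupdf"))

import demo_fitz  # resolved from sys.modules, so this is the shim

print(claimed)
```

Because the import system consults `sys.modules` before searching the filesystem, a later `import demo_fitz` picks up the shim, while the pre-existing `demo_pymupdf` entry is left alone.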
Command line:
```sh
pdfspine info report.pdf
pdfspine text report.pdf --pages 1-3 --format json -o out.json
pdfspine render report.pdf --dpi 200 -o images/
pdfspine merge a.pdf b.pdf -o merged.pdf
```
Accuracy
Validated against an objective ground-truth harness and with PyMuPDF (fitz) as
the differential oracle (clean-room: the AGPL oracle is run locally only and never
committed). See docs/BENCHMARKS.md and the
conformance/gt/ reports for the dated, reproducible evidence.
- Text extraction is at fitz parity on born-digital corpora, and beats fitz on Arabic / RTL (correct bidi reordering).
- Rendering is near-parity with fitz (page-image SSIM ~0.945) and ~1.74× faster after a font-cache fix.
- OCR beats fitz on CJK scans: the pure-Rust PaddleOCR engine (PP-OCRv4, with models embedded in the wheel) outperforms fitz's OCR path on Chinese/Japanese/Korean documents.
- Real-corpus robustness: open rate 100%, 0 panics/hangs, re-saved files 100% `qpdf --check`-clean across the public-domain US-government corpus.
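For readers unfamiliar with the SSIM figure quoted in the rendering bullet, the sketch below computes a simplified, single-window SSIM over two grayscale pixel sequences. It is an illustration of the metric only: the harness's exact variant (window size, colour handling) is not stated here, and standard SSIM averages the same formula over local windows rather than the whole image. The function name `global_ssim` is an assumption.

```python
def global_ssim(a: list[float], b: list[float], max_value: float = 255.0) -> float:
    """Single-window SSIM between two equal-length grayscale pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    # Stabilising constants from the usual SSIM definition (K1=0.01, K2=0.03).
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


reference = [0.0, 64.0, 128.0, 255.0]
inverted = [255.0 - p for p in reference]

print(global_ssim(reference, reference))  # identical images score (about) 1
print(global_ssim(reference, inverted))   # an inverted image scores far lower
```

A score near 1 means the two renderings agree closely in luminance, contrast, and structure; values fall as the images diverge.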
Remaining accuracy work (multi-column reading order, Type1/Type3 glyph rendering,
broader CJK) is tracked in docs/PRD-NEXT.md.
Build & install
Requirements: Rust (pinned to 1.96.0 by rust-toolchain.toml), Python ≥ 3.11, maturin ≥ 1.7. uv recommended.
```sh
uv venv .venv && source .venv/bin/activate
maturin develop            # build + install the extension in-place
python -c "import pdfspine; print(pdfspine.__version__)"

# redistributable wheel:
maturin build --release    # -> target/wheels/
```
Building from source needs a C/asm compiler. The bundled pure-Rust PaddleOCR engine depends on `tract`, which compiles target-specific assembly kernels at build time: a C compiler (`cc`/`clang`) on Linux/macOS, or the MSVC Build Tools (incl. `ml64.exe`) on Windows. Prebuilt wheels (once published) need none of this. To build a fully C-free library, compile the Rust crates with `--no-default-features` (drops the `paddle-ocr` feature). Wheels are large (~15–25 MB) because the OCR models (~16 MB) are embedded.
Architecture
A Cargo workspace with a strict dependency DAG; the Python bindings touch exactly one façade crate, and core logic is split into independently testable units.
```text
                 py-bindings  (PyO3 cdylib -> pdfspine._core, abi3-py311)
                      │
                      ▼
                   pdf-api   facade / re-exports
       ┌──────────┬───┴────┬──────────┐
       ▼          ▼        ▼          ▼
   pdf-text   pdf-edit pdf-image pdf-render
       │          │        │          │
       └────┬─────┘        │    (fonts, text)
            ▼              │
        pdf-fonts ◄────────┘
            ▼
        pdf-core ◄──────── pdf-crypto
```
| Crate | Responsibility |
|---|---|
| `pdf-core` | object model, lexer/parser, xref, repair, filters, writer, geometry |
| `pdf-crypto` | Standard security handler (RC4 / AES-128 / AES-256) |
| `pdf-fonts` | font mapping (encodings / ToUnicode / CMap / widths) |
| `pdf-text` | content-stream interpreter, get_text, search, find_tables |
| `pdf-edit` | page ops, merge, annotations / forms, metadata / TOC, redaction, OCG |
| `pdf-image` | image documents, image-XObject codecs, Pixmap |
| `pdf-render` | tiny-skia rasterizer → Pixmap, DisplayList, SVG |
| `pdf-api` | unified ergonomic façade |
| `py-bindings` | PyO3 wrappers → the _core extension module |
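A strict dependency DAG means the crates can always be ordered so that each one builds after everything it depends on. The sketch below checks that property with the standard-library `graphlib` module. The edge list is an approximate reading of the diagram, not taken from the actual Cargo manifests: in particular, the `pdf-image` and `pdf-render` edges are assumptions inferred from the drawn line and the "(fonts, text)" annotation.

```python
from graphlib import CycleError, TopologicalSorter

# crate -> crates it depends on. An approximate reading of the diagram;
# the workspace's Cargo.toml files are the authoritative source.
DEPENDS_ON = {
    "py-bindings": {"pdf-api"},
    "pdf-api": {"pdf-text", "pdf-edit", "pdf-image", "pdf-render"},
    "pdf-text": {"pdf-fonts"},
    "pdf-edit": {"pdf-fonts"},
    "pdf-image": {"pdf-fonts"},               # assumed from the drawn line
    "pdf-render": {"pdf-fonts", "pdf-text"},  # assumed from "(fonts, text)"
    "pdf-fonts": {"pdf-core"},
    "pdf-crypto": {"pdf-core"},
    "pdf-core": set(),
}


def build_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return the crates with every dependency listed before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def is_acyclic(depends_on: dict[str, set[str]]) -> bool:
    """True when the graph is a DAG, i.e. a valid build order exists."""
    try:
        build_order(depends_on)
    except CycleError:
        return False
    return True


order = build_order(DEPENDS_ON)
print(" -> ".join(order))
```

Adding a back-edge, for example making `pdf-core` depend on `pdf-api`, makes `is_acyclic` return `False`, which is the kind of regression a layering check in CI would catch.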
Develop / test
```sh
cargo fmt --all --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace
maturin develop && pytest python/tests     # Python tests
python conformance/run_validation.py …     # real-corpus accuracy harness
```
pdfspine is built strictly test-first (red → green → refactor → harden); the
per-function test plan is in docs/test-case-catalog.md.
Documentation
Guide + API reference + PyMuPDF migration guide: build the docs site with
mkdocs serve (see mkdocs.yml / docs/). The
authoritative design lives in PRD.md.
License
Apache-2.0 — see LICENSE and NOTICE. All third-party
dependencies are permissive (MIT / Apache-2.0 / BSD / Zlib / …); the shipped graph
is CI-verified free of copyleft.
Download files
Details for the file pdfspine-0.0.2.tar.gz.
File metadata
- Download URL: pdfspine-0.0.2.tar.gz
- Upload date:
- Size: 19.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 554df6078b0b36ad24e84c4b1712e788213e13eb6c658d63c35553b4b9936694 |
| MD5 | 1290dac10fbd534d01a7c6333ffb8d6f |
| BLAKE2b-256 | 855de467046a3ea90514edd7c187ea604a2ba6f016975ea10fb11f1f9e9d7136 |
File details
Details for the file pdfspine-0.0.2-cp311-abi3-win_amd64.whl.
File metadata
- Download URL: pdfspine-0.0.2-cp311-abi3-win_amd64.whl
- Upload date:
- Size: 11.8 MB
- Tags: CPython 3.11+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f5e94608e22b2942ea62bc2f55d4d4b14b92908c41187e7c2d41032cb46d38b4 |
| MD5 | 15ae394140846410407547821b6be4d9 |
| BLAKE2b-256 | 6e231a5f2d8ca10bc2d8f5eb031066fbfa630607258b21e4bb89227713ced2fb |
File details
Details for the file pdfspine-0.0.2-cp311-abi3-manylinux_2_28_aarch64.whl.
File metadata
- Download URL: pdfspine-0.0.2-cp311-abi3-manylinux_2_28_aarch64.whl
- Upload date:
- Size: 10.6 MB
- Tags: CPython 3.11+, manylinux: glibc 2.28+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9cb86bd43bb80a986f3245a8c9ef2af3a3bac33148b7c3935f071493a034767e |
| MD5 | 1efbc4edb18aa3ecbc15e003bad2b16a |
| BLAKE2b-256 | eb861850f6e58de0af6adcf68b2bfaa238af6481197b7977efd1d7031677cfae |
File details
Details for the file pdfspine-0.0.2-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: pdfspine-0.0.2-cp311-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 11.7 MB
- Tags: CPython 3.11+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | daf0b47916bfbb77288e98fd206d7d51934284e0f0843b20725ee45a87374027 |
| MD5 | b7031c4fddf012020261e72056d15c70 |
| BLAKE2b-256 | 0844a43193aed248c2380be21921fbef28e29e38c8a7eeab1d0f982a9fdfcfed |
File details
Details for the file pdfspine-0.0.2-cp311-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: pdfspine-0.0.2-cp311-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 10.4 MB
- Tags: CPython 3.11+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9137c6486f27200e2cb9b8cef186fec1ebbe525869db9ca27a192e68443e976e |
| MD5 | 3bd450fe93c5cc7b8deef22c541c0f68 |
| BLAKE2b-256 | 8dcefb6f870b51399e3a0d1186a549bd0be4006b3f67d702c226a830e17eb43e |