Apache-2.0-licensed, pure-Rust reimplementation of PyMuPDF (fitz) with PyO3 bindings.

pdfspine

An Apache-2.0-licensed, pure-Rust reimplementation of PyMuPDF (fitz), with PyO3 Python bindings.

🦴 Part of the spine family — framework-free backend engines, each the spine of a domain: zero framework lock-in, Protocol-ized seams, offline-capable. pdfspine is the PDF spine (this repo); ragspine is the RAG spine (deterministic dual-channel retrieval + agent orchestration).

Status: alpha / pre-1.0, but the core is feature-complete. pdfspine can already parse, repair, and decrypt PDFs; extract text and tables; search; edit, merge, split, and save (including byte-exact incremental saves); encrypt; annotate; fill and flatten forms; redact destructively; open image files as documents; render pages to images; and run OCR (Tesseract plus a pure-Rust PaddleOCR engine that is stronger on CJK).

88.7% (682 / 769) of the PyMuPDF 1.24 public API is implemented and tested, and the figure is still climbing; 1349+ Rust tests and 593+ Python tests are green. Text extraction is at fitz parity (and beats fitz on Arabic / RTL), rendering is near-parity and ~1.74× faster, and the pure-Rust PaddleOCR engine beats fitz on CJK scans (see Accuracy).

Release 0.0.1 on PyPI provides a source distribution and a single macOS arm64 wheel (CPython 3.11+); on other platforms, build from source for now.


Why pdfspine?

PyMuPDF is excellent, but it is AGPL-3.0 (or a commercial license from Artifex) — a non-starter for many closed-source products, SaaS backends, and permissively-licensed open-source projects.

pdfspine is a drop-in-shaped, permissively-licensed (Apache-2.0) alternative:

  • Apache-2.0 throughout — permissive, with an explicit patent grant. The dependency graph is gated by cargo-deny to exclude GPL / AGPL / LGPL / MPL / SSPL from the shipped wheel. License cleanliness is CI-enforced, not a promise.
  • Pure Rust, no C blob. Self-contained wheels, no system zlib/C linkage, no bundled prebuilt engine (the differentiator vs pdfium-based wrappers).
  • import fitz compatible (opt-in). A compatibility shim lets much existing PyMuPDF code run unmodified — available as import pdfspine.fitz as fitz, or registered under the global fitz / pymupdf names with one call to pdfspine.install_fitz_shim(). A default install is collision-safe: it does not claim those global names, so it coexists with a real PyMuPDF in the same environment. A machine-readable COMPAT.toml documents every symbol's status.
  • Memory-safe by construction. #![forbid(unsafe_code)] in every first-party crate except the single audited PyO3 FFI chokepoint.
  • Clean-room. No code, tests, or fixtures derived from MuPDF / PyMuPDF / any AGPL source.

What works today

| Area | Capabilities |
| --- | --- |
| Read | open (file/bytes), malformed-PDF repair, encrypted PDFs (RC4 / AES-128 / AES-256, R2–R6) |
| Text | get_text (text/words/blocks/dict/rawdict/json/html/xhtml/xml), search_for, TextPage, fonts/images inventory |
| Tables | find_tables with merged-cell detection → extract() / to_markdown() / to_html() |
| Edit & save | full + byte-exact incremental save, garbage collection, page insert/delete/copy/move/select, insert_pdf merge, metadata/XMP, TOC, links, encryption write |
| Annotate | all common annotation types with /AP appearance streams; AcroForm read / fill / flatten + Widget; destructive redaction (verified content removal) |
| Render | get_pixmap (vector + text + image + shadings via a tiny-skia rasterizer), Pixmap (buffer-protocol/numpy), DisplayList, get_svg_image |
| Images | open PNG/JPEG/TIFF/GIF/BMP/WEBP as documents, convert_to_pdf, image-XObject decode (DCT/CCITT/JBIG2/JPX), extract_image |
| Layers | Optional Content Groups read/write (get_ocgs / add_ocg / set_layer) |
| OCR | pluggable engine: Tesseract adapter and a pure-Rust PaddleOCR engine (PP-OCRv4, embedded models, stronger on CJK) → searchable-sandwich PDF |
| CLI | pdfspine info / text / render / merge / split / pages / images / toc |
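
To make the incremental save in the Edit & save row concrete: a PDF incremental update appends new objects, a new cross-reference section, and a new trailer after the existing file contents, leaving the original bytes untouched. The check below is a minimal sketch of that property on raw bytes; the function name is illustrative and not part of pdfspine.

```python
def is_incremental_update(original: bytes, updated: bytes) -> bool:
    """True if `updated` keeps `original` byte-for-byte and only appends to it."""
    return len(updated) > len(original) and updated.startswith(original)


# Toy byte strings standing in for real files.
original = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n%%EOF\n"
appended = original + b"2 0 obj\n<<>>\nendobj\n%%EOF\n"
rewritten = b"%PDF-1.7\n2 0 obj\n<<>>\nendobj\n%%EOF\n"

print(is_incremental_update(original, appended))   # True
print(is_incremental_update(original, rewritten))  # False
```

With real files, the two arguments would typically come from pathlib.Path(...).read_bytes() on the input and the saved output.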

Planned next: reading-order accuracy improvements, Type1/Type3 glyph rendering, broader CJK coverage. See PRD.md / docs/ROADMAP.md. Out of scope: digital-signature creation.

Quick start

import pdfspine

doc = pdfspine.open("input.pdf")
print(len(doc), "pages", doc.metadata)

page = doc[0]
print(page.get_text())                       # plain text
print(page.search_for("invoice"))            # list[Rect]
page.get_pixmap(dpi=150).save("page1.png")   # render to image

tables = page.find_tables()
for t in tables.tables:
    print(t.to_markdown())                    # or t.to_html() for merged cells

doc.save("output.pdf", garbage=4, deflate=True)

Existing PyMuPDF code often runs unchanged via the opt-in compat shim:

import pdfspine.fitz as fitz                  # the shim, no global-name collision
doc = fitz.open("input.pdf")
text = doc[0].get_text("dict")

# Or make the literal `import fitz` resolve to the shim (one-time opt-in):
import pdfspine
pdfspine.install_fitz_shim()
import fitz                                    # now -> pdfspine's fitz shim

A default install does not claim the global fitz / pymupdf names, so it is safe alongside a real PyMuPDF; install_fitz_shim() uses setdefault and never clobbers a PyMuPDF you imported first.
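
The non-clobbering behaviour described above can be sketched with sys.modules.setdefault. This illustrates the idea only and is not pdfspine's actual implementation; register_alias and the demo_fitz module name are hypothetical.

```python
import sys
import types


def register_alias(name: str, module: types.ModuleType) -> types.ModuleType:
    """Register `module` under `name` unless that name is already imported.

    Returns whichever module ends up owning the name.
    """
    return sys.modules.setdefault(name, module)


# Stand-ins for a real PyMuPDF and for the shim (hypothetical names).
real = types.ModuleType("demo_fitz")
shim = types.ModuleType("demo_fitz")

sys.modules["demo_fitz"] = real            # the "real" package was imported first
owner = register_alias("demo_fitz", shim)  # the shim backs off
print(owner is real)                       # True

del sys.modules["demo_fitz"]               # now nothing owns the name
owner = register_alias("demo_fitz", shim)  # the shim claims it
import demo_fitz                           # resolves through sys.modules

print(demo_fitz is shim)                   # True
```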

Command line:

pdfspine info report.pdf
pdfspine text report.pdf --pages 1-3 --format json -o out.json
pdfspine render report.pdf --dpi 200 -o images/
pdfspine merge a.pdf b.pdf -o merged.pdf

Accuracy

Accuracy is validated against an objective ground-truth harness, with PyMuPDF (fitz) as a differential oracle. To keep the project clean-room, the AGPL oracle runs locally only and is never committed. See docs/BENCHMARKS.md and the conformance/gt/ reports for dated, reproducible evidence.

  • Text extraction is at fitz parity on born-digital corpora, and beats fitz on Arabic / RTL (correct bidi reordering).
  • Rendering is near-parity with fitz (page-image SSIM ~0.945) and ~1.74× faster after a font-cache fix.
  • OCR beats fitz on CJK scans: the pure-Rust PaddleOCR engine (PP-OCRv4, with models embedded in the wheel) outperforms fitz's OCR path on Chinese/Japanese/Korean documents.
  • Real-corpus robustness: open rate 100%, 0 panics/hangs, re-saved files 100% qpdf --check-clean across the public-domain US-government corpus.
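
For readers unfamiliar with the SSIM figure quoted above: structural similarity compares two images by their means, variances, and covariance, and scores 1.0 for identical images. The sketch below computes a simplified single-window (global) SSIM over 8-bit grayscale pixel values. It is an illustration under stated assumptions; the project's harness may use a windowed SSIM or different constants.

```python
from statistics import fmean


def global_ssim(x: list[int], y: list[int], max_value: int = 255) -> float:
    """Single-window SSIM over two equal-length grayscale pixel sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and the same length")
    # Conventional stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    mx, my = fmean(x), fmean(y)
    var_x = fmean([(a - mx) * (a - mx) for a in x])
    var_y = fmean([(b - my) * (b - my) for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (var_x + var_y + c2)
    return numerator / denominator


pixels = [0, 64, 128, 255]
print(global_ssim(pixels, pixels))        # 1.0 for identical inputs
print(global_ssim(pixels, pixels[::-1]))  # lower for a mirrored gradient
```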

Remaining accuracy work (multi-column reading order, Type1/Type3 glyph rendering, broader CJK) is tracked in docs/PRD-NEXT.md.

Build & install

Requirements: Rust (pinned to 1.96.0 by rust-toolchain.toml), Python ≥ 3.11, maturin ≥ 1.7. uv recommended.

uv venv .venv && source .venv/bin/activate
maturin develop                 # build + install the extension in-place
python -c "import pdfspine; print(pdfspine.__version__)"
# redistributable wheel:
maturin build --release         # -> target/wheels/

Building from source needs a C/asm compiler. The bundled pure-Rust PaddleOCR engine depends on tract, which compiles target-specific assembly kernels at build time: a C compiler (cc/clang) on Linux/macOS, or the MSVC Build Tools (incl. ml64.exe) on Windows. Prebuilt wheels (once published) need none of this. To build a fully C-free library, compile the Rust crates with --no-default-features (drops the paddle-ocr feature). Wheels are large (~15–25 MB) because the OCR models (~16 MB) are embedded.

Architecture

A Cargo workspace with a strict dependency DAG; the Python bindings touch exactly one façade crate, and core logic is split into independently testable units.

                  py-bindings   (PyO3 cdylib -> pdfspine._core, abi3-py311)
                       │
                       ▼
                    pdf-api      facade / re-exports
        ┌──────────┬───┴────┬──────────┐
        ▼          ▼        ▼          ▼
    pdf-text   pdf-edit  pdf-image  pdf-render
        │          │        │          │
        └────┬─────┘        │     (fonts, text)
             ▼              │
         pdf-fonts ◄────────┘
             ▼
         pdf-core   ◄────────  pdf-crypto

| Crate | Responsibility |
| --- | --- |
| pdf-core | object model, lexer/parser, xref, repair, filters, writer, geometry |
| pdf-crypto | Standard security handler (RC4 / AES-128 / AES-256) |
| pdf-fonts | font mapping (encodings / ToUnicode / CMap / widths) |
| pdf-text | content-stream interpreter, get_text, search, find_tables |
| pdf-edit | page ops, merge, annotations / forms, metadata / TOC, redaction, OCG |
| pdf-image | image documents, image-XObject codecs, Pixmap |
| pdf-render | tiny-skia rasterizer → Pixmap, DisplayList, SVG |
| pdf-api | unified ergonomic façade |
| py-bindings | PyO3 wrappers → the _core extension module |
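
One way to sanity-check a layering like this is to feed the crate edges to a topological sorter, which fails on any cycle. The edges below are transcribed from the diagram as read here (each crate maps to the crates it depends on) and are an assumption: they may not match the actual Cargo manifests, in particular the placement of pdf-crypto and the pdf-image / pdf-render links.

```python
from graphlib import CycleError, TopologicalSorter

# crate -> crates it depends on (one reading of the diagram; an assumption)
DEPS: dict[str, set[str]] = {
    "py-bindings": {"pdf-api"},
    "pdf-api": {"pdf-text", "pdf-edit", "pdf-image", "pdf-render"},
    "pdf-text": {"pdf-fonts"},
    "pdf-edit": {"pdf-fonts"},
    "pdf-image": {"pdf-fonts"},
    "pdf-render": {"pdf-fonts", "pdf-text"},
    "pdf-fonts": {"pdf-core"},
    "pdf-crypto": {"pdf-core"},
    "pdf-core": set(),
}


def build_order(deps: dict[str, set[str]]) -> list[str]:
    """Return a dependencies-first order; raises CycleError if deps has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = build_order(DEPS)
print(order[0], "...", order[-1])  # pdf-core ... py-bindings

# A back-edge from the bottom layer to the facade breaks the DAG.
cyclic = {**DEPS, "pdf-core": {"pdf-api"}}
try:
    build_order(cyclic)
except CycleError:
    print("cycle detected")
```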

Develop / test

cargo fmt --all --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace
maturin develop && pytest python/tests       # Python tests
python conformance/run_validation.py        # real-corpus accuracy harness

pdfspine is built strictly test-first (red → green → refactor → harden); the per-function test plan is in docs/test-case-catalog.md.

Documentation

Guide + API reference + PyMuPDF migration guide: build the docs site with mkdocs serve (see mkdocs.yml / docs/). The authoritative design lives in PRD.md.

License

Apache-2.0 — see LICENSE and NOTICE. All third-party dependencies are permissive (MIT / Apache-2.0 / BSD / Zlib / …); the shipped graph is CI-verified free of copyleft.
