pdfspine

An Apache-2.0-licensed, pure-Rust reimplementation of PyMuPDF (fitz), with PyO3 Python bindings.

🦴 Part of the spine family — framework-free backend engines, each the spine of a domain: zero framework lock-in, Protocol-ized seams, offline-capable. pdfspine is the PDF spine (this repo); ragspine is the RAG spine (deterministic dual-channel retrieval + agent orchestration).

🤖 For AI agents / LLMs: before using this library, read llms.txt (concise index) and python/pdfspine/_llms/docs/ (full API / recipes / gotchas); after pip install they ship at site-packages/pdfspine/_llms/.

Status: alpha / pre-1.0, but the core is feature-complete. pdfspine can already parse/repair/decrypt PDFs, extract text & tables, search, edit / merge / split / save (incl. byte-exact incremental), encrypt, annotate, fill & flatten forms, redact (destructively), open image files as documents, render pages to images, and OCR (Tesseract + a pure-Rust PaddleOCR engine, stronger on CJK).

Coverage: 88.7% (682 / 769) of the PyMuPDF 1.24 public API is implemented and tested, and that share is still rising; 1349+ Rust tests and 593+ Python tests are green.

Accuracy: text extraction is at fitz parity (and beats fitz on Arabic / RTL), rendering is near-parity and ~1.74× faster, and the pure-Rust PaddleOCR engine beats fitz on CJK scans (see Accuracy).

Not yet on PyPI; build from source for now.


Why pdfspine?

PyMuPDF is excellent, but it is AGPL-3.0 (or a commercial license from Artifex) — a non-starter for many closed-source products, SaaS backends, and permissively-licensed open-source projects.

pdfspine is a drop-in-shaped, permissively-licensed (Apache-2.0) alternative:

  • Apache-2.0 throughout — permissive, with an explicit patent grant. The dependency graph is gated by cargo-deny to exclude GPL / AGPL / LGPL / MPL / SSPL from the shipped wheel. License cleanliness is CI-enforced, not a promise.
  • Pure Rust, no C blob. Self-contained wheels, no system zlib/C linkage, no bundled prebuilt engine (the differentiator vs pdfium-based wrappers).
  • import fitz compatible (opt-in). A compatibility shim lets much existing PyMuPDF code run unmodified — available as import pdfspine.fitz as fitz, or registered under the global fitz / pymupdf names with one call to pdfspine.install_fitz_shim(). A default install is collision-safe: it does not claim those global names, so it coexists with a real PyMuPDF in the same environment. A machine-readable COMPAT.toml documents every symbol's status.
  • Memory-safe by construction. #![forbid(unsafe_code)] in every first-party crate except the single audited PyO3 FFI chokepoint.
  • Clean-room. No code, tests, or fixtures derived from MuPDF / PyMuPDF / any AGPL source.

What works today

| Area | Capabilities |
| --- | --- |
| Read | open (file/bytes), malformed-PDF repair, encrypted PDFs (RC4 / AES-128 / AES-256, R2–R6) |
| Text | get_text (text/words/blocks/dict/rawdict/json/html/xhtml/xml), search_for, TextPage, fonts/images inventory |
| Tables | find_tables with merged-cell detection → extract() / to_markdown() / to_html() |
| Edit & save | full + byte-exact incremental save, garbage collection, page insert/delete/copy/move/select, insert_pdf merge, metadata/XMP, TOC, links, encryption write |
| Annotate | all common annotation types with /AP appearance streams; AcroForm read / fill / flatten + Widget; destructive redaction (verified content removal) |
| Render | get_pixmap (vector + text + image + shadings via a tiny-skia rasterizer), Pixmap (buffer-protocol/numpy), DisplayList, get_svg_image |
| Images | open PNG/JPEG/TIFF/GIF/BMP/WEBP as documents, convert_to_pdf, image-XObject decode (DCT/CCITT/JBIG2/JPX), extract_image |
| Layers | Optional Content Groups read/write (get_ocgs / add_ocg / set_layer) |
| OCR | pluggable engine: Tesseract adapter and a pure-Rust PaddleOCR engine (PP-OCRv4, embedded models, stronger on CJK) → searchable-sandwich PDF |
| CLI | pdfspine info / text / render / merge / split / pages / images / toc |

Planned next: reading-order accuracy improvements, Type1/Type3 glyph rendering, broader CJK coverage. See PRD.md / docs/ROADMAP.md. Out of scope: digital-signature creation.

Quick start

import pdfspine

doc = pdfspine.open("input.pdf")
print(len(doc), "pages", doc.metadata)

page = doc[0]
print(page.get_text())                       # plain text
print(page.search_for("invoice"))            # list[Rect]
page.get_pixmap(dpi=150).save("page1.png")   # render to image

tables = page.find_tables()
for t in tables.tables:
    print(t.to_markdown())                    # or t.to_html() for merged cells

doc.save("output.pdf", garbage=4, deflate=True)
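Beyond plain text, get_text also has structured modes such as "dict". The sketch below walks a dict-shaped result and joins span text per line. It assumes pdfspine's output follows the PyMuPDF layout (blocks → lines → spans, with block type 0 for text and 1 for images). The sample data is hand-written so the snippet runs without a PDF, and `iter_line_text` is a helper name invented for this example.

```python
from typing import Iterator


def iter_line_text(page_dict: dict) -> Iterator[str]:
    """Yield the text of each line, joining its spans in order."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:      # skip image blocks (type 1)
            continue
        for line in block.get("lines", []):
            yield "".join(span["text"] for span in line.get("spans", []))


# Hand-written stand-in for page.get_text("dict"); values are illustrative.
SAMPLE = {
    "width": 612.0,
    "height": 792.0,
    "blocks": [
        {"type": 0, "bbox": (72, 72, 300, 100), "lines": [
            {"bbox": (72, 72, 300, 86), "spans": [
                {"text": "Invoice ", "font": "Helvetica", "size": 12.0},
                {"text": "#42", "font": "Helvetica-Bold", "size": 12.0},
            ]},
            {"bbox": (72, 86, 300, 100), "spans": [
                {"text": "Total due: 10.00", "font": "Helvetica", "size": 12.0},
            ]},
        ]},
        {"type": 1, "bbox": (72, 120, 200, 220)},   # image block, no lines
    ],
}

lines = list(iter_line_text(SAMPLE))
print(lines)
```

With a real document, the same helper would be called as `iter_line_text(page.get_text("dict"))`, provided the layout assumption holds.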

Existing PyMuPDF code often runs unchanged via the opt-in compat shim:

import pdfspine.fitz as fitz                  # the shim, no global-name collision
doc = fitz.open("input.pdf")
text = doc[0].get_text("dict")

# Or make the literal `import fitz` resolve to the shim (one-time opt-in):
import pdfspine
pdfspine.install_fitz_shim()
import fitz                                    # now -> pdfspine's fitz shim

A default install does not claim the global fitz / pymupdf names, so it is safe alongside a real PyMuPDF; install_fitz_shim() uses setdefault and never clobbers a PyMuPDF you imported first.
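To show why a setdefault-based registration cannot clobber an existing module, here is a minimal sketch of the pattern using only the standard library. It is not pdfspine's implementation: `install_shim` and the `demo_*` module names are made up for the example, so nothing real gets shadowed.

```python
import sys
import types


def install_shim(shim: types.ModuleType, names: tuple[str, ...]) -> dict[str, bool]:
    """Register `shim` under each name, leaving any existing entry untouched.

    Returns, per name, whether the shim ended up owning that name.
    """
    return {name: sys.modules.setdefault(name, shim) is shim for name in names}


# Stand-in modules with made-up names.
real = types.ModuleType("demo_fitz")
shim = types.ModuleType("demo_shim")
sys.modules["demo_fitz"] = real          # simulates "PyMuPDF was imported first"

result = install_shim(shim, ("demo_fitz", "demo_pymupdf"))
print(result)   # {'demo_fitz': False, 'demo_pymupdf': True}
```

The name that was already taken keeps its original module, and only the free name is bound to the shim.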

Command line:

pdfspine info report.pdf
pdfspine text report.pdf --pages 1-3 --format json -o out.json
pdfspine render report.pdf --dpi 200 -o images/
pdfspine merge a.pdf b.pdf -o merged.pdf
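The `--pages 1-3` argument above is a 1-based range, while page access in Python (`doc[0]`) is zero-based. The following sketch shows one way to turn such a spec into indices. `parse_pages` is a hypothetical helper, not part of the pdfspine CLI, and support for comma-separated lists such as "1-3,5" is an assumption that goes beyond what the example commands show.

```python
def parse_pages(spec: str, page_count: int) -> list[int]:
    """Expand a 1-based spec like "1-3,5" into zero-based page indices."""
    indices: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start    # a bare number is a one-page range
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"page range {part!r} outside 1..{page_count}")
        indices.extend(range(start - 1, end))
    return indices


print(parse_pages("1-3", page_count=10))     # [0, 1, 2]
print(parse_pages("1-3,5", page_count=10))   # [0, 1, 2, 4]
```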

Accuracy

Validated against an objective ground-truth harness and with PyMuPDF (fitz) as the differential oracle (clean-room: the AGPL oracle is run locally only and never committed). See docs/BENCHMARKS.md and the conformance/gt/ reports for the dated, reproducible evidence.
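For readers unfamiliar with differential testing, the idea is to run the same input through two implementations and score how closely the outputs agree. The sketch below does this for extracted text with a token-level similarity from the standard library. It is illustrative only: the metrics the project actually uses are described in docs/BENCHMARKS.md, and `text_similarity` is a name invented here.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Split on whitespace so spacing and line-break differences do not count."""
    return text.split()


def text_similarity(candidate: str, oracle: str) -> float:
    """Return a 0..1 token-level similarity between two extractions."""
    matcher = SequenceMatcher(None, normalize(candidate), normalize(oracle), autojunk=False)
    return matcher.ratio()


# Same tokens, different whitespace: treated as identical.
print(text_similarity("Invoice 42\ntotal due", "Invoice  42 total due"))   # 1.0
# One token out of four differs.
print(text_similarity("a b c d", "a b x d"))                               # 0.75
```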

  • Text extraction is at fitz parity on born-digital corpora, and beats fitz on Arabic / RTL (correct bidi reordering).
  • Rendering is near-parity with fitz (page-image SSIM ~0.945) and ~1.74× faster after a font-cache fix.
  • OCR beats fitz on CJK scans: the pure-Rust PaddleOCR engine (PP-OCRv4, with models embedded in the wheel) outperforms fitz's OCR path on Chinese/Japanese/Korean documents.
  • Real-corpus robustness: open rate 100%, 0 panics/hangs, re-saved files 100% qpdf --check-clean across the public-domain US-government corpus.
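The rendering bullet quotes a page-image SSIM without saying how it is computed. As background, here is a simplified sketch of the structural similarity index over two grayscale pixel sequences. It evaluates the formula once over the whole image (a single window), whereas typical SSIM implementations average over local windows, so its numbers are not comparable with the ~0.945 reported above. `global_ssim` is a hypothetical helper, and the 0.01 / 0.03 constants are the conventional defaults from the SSIM definition.

```python
def global_ssim(a: list[float], b: list[float], dynamic_range: float = 255.0) -> float:
    """Single-window SSIM of two equally sized grayscale pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) * (x - mu_a) for x in a) / n
    var_b = sum((y - mu_b) * (y - mu_b) for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2     # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2     # stabilizes the contrast/structure term
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


page = [0.0, 64.0, 128.0, 255.0]
inverted = [255.0, 128.0, 64.0, 0.0]
print(global_ssim(page, page))        # identical images score 1.0
print(global_ssim(page, inverted))    # structurally opposite images score well below 1
```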

Remaining accuracy work (multi-column reading order, Type1/Type3 glyph rendering, broader CJK) is tracked in docs/PRD-NEXT.md.

Build & install

Requirements: Rust (pinned to 1.96.0 by rust-toolchain.toml), Python ≥ 3.11, maturin ≥ 1.7. uv recommended.

uv venv .venv && source .venv/bin/activate
maturin develop                 # build + install the extension in-place
python -c "import pdfspine; print(pdfspine.__version__)"
# redistributable wheel:
maturin build --release         # -> target/wheels/

Building from source needs a C/asm compiler. The bundled pure-Rust PaddleOCR engine depends on tract, which compiles target-specific assembly kernels at build time: a C compiler (cc/clang) on Linux/macOS, or the MSVC Build Tools (incl. ml64.exe) on Windows. Prebuilt wheels (once published) need none of this. To build a fully C-free library, compile the Rust crates with --no-default-features (drops the paddle-ocr feature). Wheels are large (~15–25 MB) because the OCR models (~16 MB) are embedded.

Architecture

A Cargo workspace with a strict dependency DAG; the Python bindings touch exactly one façade crate, and core logic is split into independently testable units.

                  py-bindings   (PyO3 cdylib -> pdfspine._core, abi3-py311)
                       │
                       ▼
                    pdf-api      facade / re-exports
        ┌──────────┬───┴────┬──────────┐
        ▼          ▼        ▼          ▼
    pdf-text   pdf-edit  pdf-image  pdf-render
        │          │        │          │
        └────┬─────┘        │     (fonts, text)
             ▼              │
         pdf-fonts ◄────────┘
             ▼
         pdf-core   ◄────────  pdf-crypto

| Crate | Responsibility |
| --- | --- |
| pdf-core | object model, lexer/parser, xref, repair, filters, writer, geometry |
| pdf-crypto | Standard security handler (RC4 / AES-128 / AES-256) |
| pdf-fonts | font mapping (encodings / ToUnicode / CMap / widths) |
| pdf-text | content-stream interpreter, get_text, search, find_tables |
| pdf-edit | page ops, merge, annotations / forms, metadata / TOC, redaction, OCG |
| pdf-image | image documents, image-XObject codecs, Pixmap |
| pdf-render | tiny-skia rasterizer → Pixmap, DisplayList, SVG |
| pdf-api | unified ergonomic façade |
| py-bindings | PyO3 wrappers → the _core extension module |
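A "strict dependency DAG" means the crate graph has no cycles, so a valid build order always exists. The sketch below encodes the diagram as a mapping and checks that property with the standard library. The edges are one reading of the ASCII diagram, not extracted from the Cargo manifests; in particular, the direction of the pdf-crypto edge and the pdf-image → pdf-fonts edge are assumptions.

```python
from graphlib import CycleError, TopologicalSorter

# crate -> crates it depends on (assumed from the diagram, not from Cargo.toml)
DEPS: dict[str, set[str]] = {
    "py-bindings": {"pdf-api"},
    "pdf-api": {"pdf-text", "pdf-edit", "pdf-image", "pdf-render"},
    "pdf-text": {"pdf-fonts"},
    "pdf-edit": {"pdf-fonts"},
    "pdf-image": {"pdf-fonts"},
    "pdf-render": {"pdf-fonts", "pdf-text"},
    "pdf-fonts": {"pdf-core"},
    "pdf-crypto": {"pdf-core"},
    "pdf-core": set(),
}


def build_order(deps: dict[str, set[str]]) -> list[str]:
    """Return crates with dependencies first; raises CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = build_order(DEPS)
print(order)

# Introducing a back-edge from the core to the facade breaks the DAG.
cyclic = {**DEPS, "pdf-core": {"pdf-api"}}
try:
    build_order(cyclic)
except CycleError as exc:
    print("cycle detected:", exc.args[1])
```

Under this reading, pdf-core is the only crate with no dependencies, and py-bindings depends on pdf-api alone, matching the "exactly one façade crate" statement.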

Develop / test

cargo fmt --all --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace
maturin develop && pytest python/tests       # Python tests
python conformance/run_validation.py        # real-corpus accuracy harness

pdfspine is built strictly test-first (red → green → refactor → harden); the per-function test plan is in docs/test-case-catalog.md.

Documentation

The guide, API reference, and PyMuPDF migration guide are published as a docs site; build and preview it locally with mkdocs serve (see mkdocs.yml / docs/). The authoritative design lives in PRD.md.

License

Apache-2.0 — see LICENSE and NOTICE. All third-party dependencies are permissive (MIT / Apache-2.0 / BSD / Zlib / …); the shipped graph is CI-verified free of copyleft.
