
Apache-2.0-licensed, pure-Rust reimplementation of PyMuPDF (fitz) with PyO3 bindings.


pdfspine

An Apache-2.0-licensed, pure-Rust reimplementation of PyMuPDF (fitz), with PyO3 Python bindings.

🦴 Part of the spine family — framework-free backend engines, each the spine of a domain: zero framework lock-in, Protocol-ized seams, offline-capable. pdfspine is the PDF spine (this repo); ragspine is the RAG spine (deterministic dual-channel retrieval + agent orchestration).

🤖 For AI agents / LLMs: before using this library, read llms.txt (concise index) and python/pdfspine/_llms/docs/ (full API / recipes / gotchas); after pip install they ship at site-packages/pdfspine/_llms/.

Status: alpha / pre-1.0, but the core is feature-complete. pdfspine can already parse/repair/decrypt PDFs, extract text & tables, search, edit / merge / split / save (incl. byte-exact incremental), encrypt, annotate, fill & flatten forms, redact (destructively), open image files as documents, render pages to images, and OCR (Tesseract + a pure-Rust PaddleOCR engine, stronger on CJK). 88.7% (682 / 769) of the PyMuPDF 1.24 public API is implemented and tested (climbing), with 1349+ Rust tests + 593+ Python tests green. Text extraction is at fitz parity (and beats fitz on Arabic / RTL), rendering is near-parity and ~1.74× faster, and the pure-Rust PaddleOCR engine beats fitz on CJK scans (see Accuracy). Not yet on PyPI — build from source for now.


Why pdfspine?

PyMuPDF is excellent, but it is AGPL-3.0 (or a commercial license from Artifex) — a non-starter for many closed-source products, SaaS backends, and permissively-licensed open-source projects.

pdfspine is a drop-in-shaped, permissively-licensed (Apache-2.0) alternative:

  • Apache-2.0 throughout — permissive, with an explicit patent grant. The dependency graph is gated by cargo-deny to exclude GPL / AGPL / LGPL / MPL / SSPL from the shipped wheel. License cleanliness is CI-enforced, not a promise.
  • Pure Rust, no C blob. Self-contained wheels, no system zlib/C linkage, no bundled prebuilt engine (the differentiator vs pdfium-based wrappers).
  • import fitz compatible (opt-in). A compatibility shim lets much existing PyMuPDF code run unmodified — available as import pdfspine.fitz as fitz, or registered under the global fitz / pymupdf names with one call to pdfspine.install_fitz_shim(). A default install is collision-safe: it does not claim those global names, so it coexists with a real PyMuPDF in the same environment. A machine-readable COMPAT.toml documents every symbol's status.
  • Memory-safe by construction. #![forbid(unsafe_code)] in every first-party crate except the single audited PyO3 FFI chokepoint.
  • Clean-room. No code, tests, or fixtures derived from MuPDF / PyMuPDF / any AGPL source.

What works today

| Area | Capabilities |
| --- | --- |
| Read | open (file/bytes), malformed-PDF repair, encrypted PDFs (RC4 / AES-128 / AES-256, R2–R6) |
| Text | get_text (text/words/blocks/dict/rawdict/json/html/xhtml/xml), search_for, TextPage, fonts/images inventory |
| Tables | find_tables with merged-cell detection → extract() / to_markdown() / to_html() |
| Edit & save | full + byte-exact incremental save, garbage collection, page insert/delete/copy/move/select, insert_pdf merge, metadata/XMP, TOC, links, encryption write |
| Annotate | all common annotation types with /AP appearance streams; AcroForm read / fill / flatten + Widget; destructive redaction (verified content removal) |
| Render | get_pixmap (vector + text + image + shadings via a tiny-skia rasterizer), Pixmap (buffer-protocol/numpy), DisplayList, get_svg_image |
| Images | open PNG/JPEG/TIFF/GIF/BMP/WEBP as documents, convert_to_pdf, image-XObject decode (DCT/CCITT/JBIG2/JPX), extract_image |
| Layers | Optional Content Groups read/write (get_ocgs / add_ocg / set_layer) |
| OCR | pluggable engine: Tesseract adapter and a pure-Rust PaddleOCR engine (PP-OCRv4, embedded models, stronger on CJK) → searchable-sandwich PDF |
| CLI | pdfspine info / text / render / merge / split / pages / images / toc |
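In PDF terms, an incremental save appends new objects, a new cross-reference section, and a new trailer after the existing data, so the original file survives as an unmodified prefix of the result. The sketch below checks that property on raw bytes. It assumes this is what "byte-exact incremental" refers to here; `is_incremental_update` is a hypothetical helper, not part of the pdfspine API, and the byte strings are toy stand-ins rather than valid PDFs.

```python
def is_incremental_update(original: bytes, updated: bytes) -> bool:
    """True if `updated` is `original` followed by appended data."""
    return len(updated) > len(original) and updated.startswith(original)


# Toy byte strings standing in for real files (not valid PDFs).
before = b"%PDF-1.7\n...original objects...\n%%EOF\n"
after = before + b"5 0 obj\n<<>>\nendobj\n...new xref and trailer...\n%%EOF\n"
rewritten = b"%PDF-1.7\n...objects renumbered...\n%%EOF\n"

print(is_incremental_update(before, after))      # appended update
print(is_incremental_update(before, rewritten))  # full rewrite
```

The same check can be run on real files by reading both with `pathlib.Path.read_bytes()` before and after a save.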

Planned next: reading-order accuracy improvements, Type1/Type3 glyph rendering, broader CJK coverage. See PRD.md / docs/ROADMAP.md. Out of scope: digital-signature creation.

Quick start

import pdfspine

doc = pdfspine.open("input.pdf")
print(len(doc), "pages", doc.metadata)

page = doc[0]
print(page.get_text())                       # plain text
print(page.search_for("invoice"))            # list[Rect]
page.get_pixmap(dpi=150).save("page1.png")   # render to image

tables = page.find_tables()
for t in tables.tables:
    print(t.to_markdown())                    # or t.to_html() for merged cells

doc.save("output.pdf", garbage=4, deflate=True)

Existing PyMuPDF code often runs unchanged via the opt-in compat shim:

import pdfspine.fitz as fitz                  # the shim, no global-name collision
doc = fitz.open("input.pdf")
text = doc[0].get_text("dict")

# Or make the literal `import fitz` resolve to the shim (one-time opt-in):
import pdfspine
pdfspine.install_fitz_shim()
import fitz                                    # now -> pdfspine's fitz shim

A default install does not claim the global fitz / pymupdf names, so it is safe alongside a real PyMuPDF; install_fitz_shim() uses setdefault and never clobbers a PyMuPDF you imported first.
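The non-clobbering behaviour comes from how `dict.setdefault` works on `sys.modules`: an entry is only written when the name is still free. The sketch below demonstrates the mechanism with the standard library alone. It is not pdfspine's actual implementation, which this README does not show; `install_shim` and the `demo_*` module names are hypothetical and chosen so the demo cannot touch a real package.

```python
import sys
import types


def install_shim(name: str, shim: types.ModuleType) -> types.ModuleType:
    """Register `shim` under `name` unless a module already owns that name."""
    return sys.modules.setdefault(name, shim)


# Hypothetical stand-ins for the shim and for a "real" library.
shim = types.ModuleType("demo_shim")
real = types.ModuleType("demo_real")

# Case 1: the name is free, so the shim is registered.
sys.modules.pop("demo_fitz_free", None)
assert install_shim("demo_fitz_free", shim) is shim

# Case 2: the "real" module was imported first, so it is left untouched.
sys.modules["demo_fitz_taken"] = real
assert install_shim("demo_fitz_taken", shim) is real

import demo_fitz_free  # the import system finds the entry in sys.modules

print(demo_fitz_free is shim)
```

Because the import system consults `sys.modules` before searching the filesystem, the order of imports decides which module wins, which matches the "PyMuPDF you imported first" wording above.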

Command line:

pdfspine info report.pdf
pdfspine text report.pdf --pages 1-3 --format json -o out.json
pdfspine render report.pdf --dpi 200 -o images/
pdfspine merge a.pdf b.pdf -o merged.pdf

Accuracy

Validated against an objective ground-truth harness and with PyMuPDF (fitz) as the differential oracle (clean-room: the AGPL oracle is run locally only and never committed). See docs/BENCHMARKS.md and the conformance/gt/ reports for the dated, reproducible evidence.

  • Text extraction is at fitz parity on born-digital corpora, and beats fitz on Arabic / RTL (correct bidi reordering).
  • Rendering is near-parity with fitz (page-image SSIM ~0.945) and ~1.74× faster after a font-cache fix.
  • OCR beats fitz on CJK scans: the pure-Rust PaddleOCR engine (PP-OCRv4, with models embedded in the wheel) outperforms fitz's OCR path on Chinese/Japanese/Korean documents.
  • Real-corpus robustness: open rate 100%, 0 panics/hangs, re-saved files 100% qpdf --check-clean across the public-domain US-government corpus.
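For readers unfamiliar with the SSIM score quoted for rendering, the sketch below computes a simplified, single-window (global) SSIM over two equally sized grayscale pixel lists. The harness's actual variant (window size, colour handling, any downscaling) is not described in this README, so treat this as an illustration of the formula rather than a reproduction of the benchmark; `global_ssim` is a hypothetical helper.

```python
from statistics import fmean


def global_ssim(x: list[float], y: list[float], dynamic_range: float = 255.0) -> float:
    """Single-window SSIM between two equally sized grayscale pixel lists."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    # Standard stabilising constants: C1 = (0.01 L)^2, C2 = (0.03 L)^2.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mx, my = fmean(x), fmean(y)
    var_x = fmean((a - mx) * (a - mx) for a in x)
    var_y = fmean((b - my) * (b - my) for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [0.0, 64.0, 128.0, 255.0]
inverted = [255.0, 128.0, 64.0, 0.0]

print(global_ssim(reference, reference))  # identical images score 1
print(global_ssim(reference, inverted))   # dissimilar images score lower
```

Production SSIM implementations usually average the score over many small sliding windows, which makes the metric sensitive to local structure rather than whole-image statistics.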

Remaining accuracy work (multi-column reading order, Type1/Type3 glyph rendering, broader CJK) is tracked in docs/PRD-NEXT.md.

Build & install

Requirements: Rust (pinned to 1.96.0 by rust-toolchain.toml), Python ≥ 3.11, maturin ≥ 1.7. uv recommended.

uv venv .venv && source .venv/bin/activate
maturin develop                 # build + install the extension in-place
python -c "import pdfspine; print(pdfspine.__version__)"
# redistributable wheel:
maturin build --release         # -> target/wheels/

Building from source needs a C/asm compiler. The bundled pure-Rust PaddleOCR engine depends on tract, which compiles target-specific assembly kernels at build time: a C compiler (cc/clang) on Linux/macOS, or the MSVC Build Tools (incl. ml64.exe) on Windows. Prebuilt wheels (once published) need none of this. To build a fully C-free library, compile the Rust crates with --no-default-features (drops the paddle-ocr feature). Wheels are large (~15–25 MB) because the OCR models (~16 MB) are embedded.

Architecture

A Cargo workspace with a strict dependency DAG; the Python bindings touch exactly one façade crate, and core logic is split into independently testable units.

                  py-bindings   (PyO3 cdylib -> pdfspine._core, abi3-py311)
                       │
                       ▼
                    pdf-api      facade / re-exports
        ┌──────────┬───┴────┬──────────┐
        ▼          ▼        ▼          ▼
    pdf-text   pdf-edit  pdf-image  pdf-render
        │          │        │          │
        └────┬─────┘        │     (fonts, text)
             ▼              │
         pdf-fonts ◄────────┘
             ▼
         pdf-core   ◄────────  pdf-crypto

| Crate | Responsibility |
| --- | --- |
| pdf-core | object model, lexer/parser, xref, repair, filters, writer, geometry |
| pdf-crypto | Standard security handler (RC4 / AES-128 / AES-256) |
| pdf-fonts | font mapping (encodings / ToUnicode / CMap / widths) |
| pdf-text | content-stream interpreter, get_text, search, find_tables |
| pdf-edit | page ops, merge, annotations / forms, metadata / TOC, redaction, OCG |
| pdf-image | image documents, image-XObject codecs, Pixmap |
| pdf-render | tiny-skia rasterizer → Pixmap, DisplayList, SVG |
| pdf-api | unified ergonomic façade |
| py-bindings | PyO3 wrappers → the _core extension module |
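A strict dependency DAG means the crates can always be ordered so that each one comes after everything it depends on. The sketch below encodes one reading of the diagram above and derives such an order with the standard library. The edge list is an assumption: in particular, the directions of the `pdf-crypto` and `pdf-image` arrows and the "(fonts, text)" note on `pdf-render` are interpreted here, not confirmed by the workspace manifests.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# crate -> crates it depends on (a reading of the diagram; an assumption).
DEPENDS_ON = {
    "py-bindings": {"pdf-api"},
    "pdf-api": {"pdf-text", "pdf-edit", "pdf-image", "pdf-render"},
    "pdf-text": {"pdf-fonts"},
    "pdf-edit": {"pdf-fonts"},
    "pdf-image": {"pdf-fonts"},
    "pdf-render": {"pdf-fonts", "pdf-text"},
    "pdf-fonts": {"pdf-core"},
    "pdf-crypto": {"pdf-core"},
    "pdf-core": set(),
}


def build_order(graph: dict[str, set[str]]) -> list[str]:
    """Return a dependencies-first order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = build_order(DEPENDS_ON)
print(order)
```

Several valid orders exist, so the printed list may vary; what matters is that `pdf-core` precedes its dependents and `py-bindings` comes last. Adding a back-edge (for example, making `pdf-core` depend on `pdf-api`) would make `build_order` raise `CycleError`, which is the property a strict DAG rules out.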

Develop / test

cargo fmt --all --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace
maturin develop && pytest python/tests       # Python tests
python conformance/run_validation.py        # real-corpus accuracy harness

pdfspine is built strictly test-first (red → green → refactor → harden); the per-function test plan is in docs/test-case-catalog.md.

Documentation

Guide + API reference + PyMuPDF migration guide: build the docs site with mkdocs serve (see mkdocs.yml / docs/). The authoritative design lives in PRD.md.

License

Apache-2.0 — see LICENSE and NOTICE. All third-party dependencies are permissive (MIT / Apache-2.0 / BSD / Zlib / …); the shipped graph is CI-verified free of copyleft.
