A high-performance PDF processing library with a permissive license.

SoPDF

The PDF processing library that belongs to everyone.

pip install sopdf

Why SoPDF?

1. 🚀 High Performance

With parallel processing and other optimizations, SoPDF outperforms PyMuPDF in the benchmarks below: rendering is up to 2.78× faster, plain-text extraction 2.7× faster, and full-text search 3× faster, while the extracted text agrees with PyMuPDF's at the 99% word level. See the Benchmarks section, or run the suite yourself.

2. ✨ Feature-Rich

Built on pypdfium2 (Google PDFium, for rendering and text) and pikepdf (libqpdf, for structure and writing). SoPDF covers the entire workflow from rendering and text extraction to structural editing.

3. 🎯 Clean API

Intuition as documentation. You would have designed it the same way.

4. 🔓 Permissive License

In PDF processing, libraries that are both feature-rich and open source often come with restrictive licenses that sit poorly with the rest of the open-source ecosystem. SoPDF delivers equivalent core capabilities under the Apache 2.0 License, with no obligations beyond the standard Apache 2.0 notice. Embed it, ship it, fork it. It's yours.

If you find SoPDF helpful, please consider giving it a ⭐ Star — it really means a lot to us. Every star fuels our motivation to keep improving.

Benchmarks

Measured on Apple M-series (arm64, 10-core), Python 3.10, against a 50-page PDF fixture. Run the suite yourself: python tests/benchmark/run_benchmarks.py

Rendering vs PyMuPDF

Scenario	SoPDF	PyMuPDF	Speedup
Open document	0.1 ms	0.2 ms	1.39× faster
Render 1 page @ 72 DPI	4.1 ms	9.0 ms	2.20× faster
Render 1 page @ 150 DPI	11.7 ms	30.2 ms	2.58× faster
Render 1 page @ 300 DPI	34.2 ms	95.2 ms	2.78× faster
50 pages sequential @ 150 DPI	543 ms	1468 ms	2.70× faster
50 pages parallel @ 150 DPI	418 ms	446 ms	1.07× faster

SoPDF wins at every DPI, and the margin widens at higher resolutions (2.20× at 72 DPI, 2.78× at 300 DPI). In parallel mode, SoPDF achieves a 1.30× speedup over its own sequential baseline (543 ms down to 418 ms); the gap is narrow because sequential rendering is already fast. PyMuPDF's thread-parallel path, on the other hand, regresses to 1510 ms, slower than its 1468 ms sequential run, because MuPDF serialises concurrent renders behind a global lock.
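The Speedup column is the PyMuPDF time divided by the SoPDF time. As a quick sanity check, the sketch below recomputes a few ratios from the table; the helper name `speedup` is ours and not part of SoPDF.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# (SoPDF ms, PyMuPDF ms) pairs copied from the rendering table.
rows = {
    "Render 1 page @ 300 DPI": (34.2, 95.2),
    "50 pages sequential @ 150 DPI": (543, 1468),
    "50 pages parallel @ 150 DPI": (418, 446),
}

ratios = {
    name: round(speedup(pymupdf_ms, sopdf_ms), 2)
    for name, (sopdf_ms, pymupdf_ms) in rows.items()
}

# SoPDF parallel compared with its own sequential run.
self_speedup = round(speedup(543, 418), 2)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
print(f"SoPDF parallel vs sequential: {self_speedup:.2f}x")
```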

Text Extraction vs PyMuPDF

Scenario	SoPDF	PyMuPDF	Speedup
Plain text — 50 pages	26.0 ms	70.0 ms	2.70× faster
Text blocks — 50 pages	63.6 ms	70.4 ms	1.11× faster
Search 'benchmark' — 50 pages	30.2 ms	91.0 ms	3.01× faster
Region extract — 50 pages	27.6 ms	39.6 ms	1.43× faster

Text search is the standout at 3× faster than PyMuPDF, followed by plain-text extraction at 2.7×. The output was cross-checked: SoPDF and PyMuPDF produce 99% word-level overlap on the same document, so the speed advantage comes with near-identical extracted text.
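How the word-level overlap is measured is not spelled out here. One plausible definition, shown purely as an assumption, is the share of words (counted with multiplicity) that both extractions have in common. `word_overlap` is a hypothetical helper, not a SoPDF function.

```python
from collections import Counter


def word_overlap(text_a: str, text_b: str) -> float:
    """Multiset word overlap: shared words / words in the larger extraction."""
    words_a = Counter(text_a.split())
    words_b = Counter(text_b.split())
    if not words_a and not words_b:
        return 1.0  # two empty extractions agree trivially
    # Counter & Counter keeps the minimum count of each shared word.
    shared = sum((words_a & words_b).values())
    return shared / max(sum(words_a.values()), sum(words_b.values()))


a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over the lazy cat"
score = word_overlap(a, b)
print(f"{score:.2%}")
```

In practice the two inputs would be the text returned by each library for the same page or document.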

Architecture

SoPDF runs two best-in-class C/C++ engines in tandem:

┌──────────────────────────────────────────┐
│               SoPDF Python API           │
├───────────────────┬──────────────────────┤
│   pypdfium2       │   pikepdf            │
│   (Google PDFium) │   (libqpdf)          │
│                   │                      │
│   • Rendering     │   • Structure reads  │
│   • Text extract  │   • All writes       │
│   • Search        │   • Save / compress  │
└───────────────────┴──────────────────────┘

A dirty-flag + hot-reload mechanism keeps the two engines in sync: when you write via pikepdf (e.g. rotate a page), the next read operation (e.g. render) automatically reserialises the document into pypdfium2 — zero manual sync required.
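The dirty-flag idea can be sketched in a few lines. Everything here is a stand-in: `DualEngineDoc`, `_serialise` and the dict/bytes "engines" are hypothetical and only illustrate the pattern, not SoPDF's internals.

```python
class DualEngineDoc:
    """Toy model: a write engine (a dict) and a read engine (a cached snapshot)."""

    def __init__(self) -> None:
        self._write_state = {"rotation": 0}        # stands in for the pikepdf side
        self._read_snapshot = self._serialise()    # stands in for the pypdfium2 side
        self._dirty = False
        self.reload_count = 0

    def _serialise(self) -> bytes:
        # Pretend to serialise the document so the read engine can load it.
        return repr(sorted(self._write_state.items())).encode()

    def set_rotation(self, degrees: int) -> None:
        # Writes only touch the write engine and mark the snapshot stale.
        self._write_state["rotation"] = degrees
        self._dirty = True

    def render(self) -> bytes:
        # Reads refresh the snapshot lazily, and only when something changed.
        if self._dirty:
            self._read_snapshot = self._serialise()
            self._dirty = False
            self.reload_count += 1
        return self._read_snapshot


doc = DualEngineDoc()
doc.set_rotation(90)
doc.set_rotation(180)     # two writes, still just one pending reload
first = doc.render()
second = doc.render()     # no write in between, so no reload
print(doc.reload_count, first == second)
```

Deferring the sync until a read happens means several writes in a row cost a single reload.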

Files are opened with lazy loading / mmap — a 500 MB PDF opens in milliseconds and only the pages you actually access are loaded.
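To illustrate why memory-mapped opening is cheap, here is a standard-library sketch that is independent of SoPDF: mapping a file does not read it, and only the slices you touch are paged in. The temporary file and its size are made up for the demo.

```python
import mmap
import os
import tempfile

# Create a throwaway 1 MiB file with a PDF-style header at the start.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"%PDF-1.7\n" + b"\0" * (1024 * 1024 - 9))
    path = tmp.name

try:
    with open(path, "rb") as f:
        # Length 0 maps the whole file; no bytes are copied at this point.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            header = view[:8]   # only this slice is actually read
            size = len(view)
finally:
    os.remove(path)

print(header, size)
```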

For image encoding, SoPDF uses OpenCV (opencv-python) rather than Pillow. OpenCV's zero-copy NumPy bridge with pypdfium2 delivers 1.6×–1.9× faster encoding than the Pillow path — see docs/en/RuntimeDependencyCompare.md for the full analysis and benchmark breakdown.

Quick Start

pip install sopdf

Requires Python 3.10+. All native dependencies (pypdfium2, pikepdf, opencv-python) ship pre-built wheels for macOS, Linux, and Windows — no compiler needed.

import sopdf

# --- Open ---
# from a file path (near-instant thanks to lazy loading & mmap)
with sopdf.open("document.pdf") as doc:

    # --- Render ---
    img_bytes = doc[0].render(dpi=150)            # PNG bytes
    doc[0].render_to_file("page0.png", dpi=300)   # write to disk

    # parallel rendering across all pages
    images = sopdf.render_pages(doc.pages, dpi=150, parallel=True)

    # --- Extract text ---
    text = doc[0].get_text()
    blocks = doc[0].get_text_blocks()             # list[TextBlock] with bounding boxes

    # --- Search ---
    hits = doc[0].search("invoice", match_case=False)   # list[Rect]

    # --- Split & merge ---
    new_doc = doc.split(pages=[0, 1, 2], output="chapter1.pdf")
    doc.split_each(output_dir="pages/")
    sopdf.merge(["intro.pdf", "body.pdf"], output="book.pdf")

    # --- Save ---
    doc.append(new_doc)
    doc.save("out.pdf", compress=True, garbage=True)
    raw = doc.to_bytes()                          # no disk write

    # --- Rotate ---
    doc[0].rotation = 90

    # --- Metadata ---
    print(doc.metadata.title)                  # read
    doc.metadata.title = "Updated Title"       # write
    print(doc.metadata.creation_datetime)      # parsed Python datetime

    # --- Outline (table of contents) ---
    for item in doc.outline.items:
        print(f"[p{item.page + 1}] {item.title}")
    flat = doc.outline.to_list()               # PyMuPDF-compatible flat list

# --- Encrypted PDFs ---
with sopdf.open("protected.pdf", password="hunter2") as doc:
    doc.save("unlocked.pdf")                      # encryption stripped on save

# --- Open from bytes / stream ---
with open("document.pdf", "rb") as f:
    with sopdf.open(stream=f.read()) as doc:
        print(doc.page_count)

# --- Auto-repair corrupted PDFs ---
with sopdf.open("corrupted.pdf") as doc:
    doc.save("repaired.pdf")
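The flat outline list mentioned in the Quick Start follows the shape PyMuPDF uses for its table of contents: rows of `[level, title, page]` with 1-based pages. The sketch below shows one way a nested outline could be flattened into that shape. The `Item` class and `flatten` function are hypothetical and may differ from what `doc.outline.to_list()` actually does.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical outline node; `page` is 0-based, as in the Quick Start loop."""
    title: str
    page: int
    children: list["Item"] = field(default_factory=list)


def flatten(items: list[Item], level: int = 1) -> list[list]:
    """Depth-first walk producing [level, title, 1-based page] rows."""
    rows: list[list] = []
    for item in items:
        rows.append([level, item.title, item.page + 1])
        rows.extend(flatten(item.children, level + 1))
    return rows


outline = [
    Item("Introduction", 0),
    Item("Methods", 2, [Item("Setup", 2), Item("Procedure", 4)]),
]
flat = flatten(outline)
print(flat)
```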

Features

Capability	Examples
Open from path / bytes / stream	01_open
Render pages to PNG / JPEG	02_render
Batch & parallel rendering	02_render
Extract plain text	03_extract_text
Extract text with bounding boxes	03_extract_text
Full-text search with hit rects	04_search_text
Split pages into new document	05_split
Merge multiple PDFs	06_merge
Save with compression	07_save_compress
Serialise to bytes (no disk write)	07_save_compress
Rotate pages	08_rotate
Open & save encrypted PDFs	09_decrypt
Auto-repair corrupted PDFs	10_repair
Read & write document metadata	11_metadata
Read document outline (TOC)	12_outline

License

Apache 2.0 — see LICENSE.

SoPDF is free to use in personal projects, commercial products, and open-source libraries. No licensing fees, no attribution requirements beyond the standard Apache 2.0 notice.

Release history

0.2.0 (Apr 14, 2026)
0.1.1 (Apr 12, 2026)
0.1.0 (Apr 11, 2026)
