
A high-performance PDF processing library with a permissive license.


SoPDF

The PDF processing library that belongs to everyone.


pip install sopdf



Why SoPDF?

1. 🚀 High Performance

With parallel processing and other optimizations, SoPDF outperforms PyMuPDF in its benchmarks: rendering is up to 1.56x faster, plain-text extraction 2.7x faster, and full-text search 3x faster, with 99% word-level overlap against PyMuPDF's output. See the Benchmarks section, or run the suite yourself.

2. ✨ Feature-Rich

Built on pypdfium2 (Google PDFium, for rendering and text) and pikepdf (libqpdf, for structure and writing), SoPDF covers the entire workflow from rendering and text extraction to structural editing.

3. 🎯 Clean API

Intuition as documentation. You would have designed it the same way.

4. 🔓 Permissive License

In PDF processing, libraries that are both feature-rich and open source often come with a license that is unfriendly to the wider open-source ecosystem. SoPDF delivers equivalent core capabilities under the Apache 2.0 License: no strings attached, no license audit, zero friction. Embed it, ship it, fork it. It's yours.

If you find SoPDF helpful, please consider giving it a ⭐ Star. It really means a lot to us, and every star fuels our motivation to keep improving.


Benchmarks

Measured on Apple M-series (arm64, 10-core), Python 3.10, against a 50-page PDF fixture. Run the suite yourself: python tests/benchmark/run_benchmarks.py

Rendering vs PyMuPDF

| Scenario | SoPDF | PyMuPDF | Speedup |
| --- | --- | --- | --- |
| Open document | 0.1 ms | 0.2 ms | 1.39× faster |
| Render 1 page @ 72 DPI | 6.6 ms | 9.1 ms | 1.38× faster |
| Render 1 page @ 150 DPI | 20.0 ms | 30.3 ms | 1.51× faster |
| Render 1 page @ 300 DPI | 64.6 ms | 101.1 ms | 1.56× faster |
| 50 pages sequential @ 150 DPI | 966.9 ms | 1470.3 ms | 1.52× faster |
| 50 pages parallel @ 150 DPI | 410.7 ms | 447.2 ms | 1.09× faster |

SoPDF wins at every DPI, and the margin widens at higher resolutions. In parallel mode, SoPDF achieves a genuine 2.35× speedup over its own sequential baseline. PyMuPDF's thread-parallel path, on the other hand, actually regresses to 1548.9 ms (slower than sequential) because MuPDF serialises concurrent renders behind a global lock.
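The speedup figures are plain ratios of the two timings, so they can be recomputed from the published numbers. The sketch below does that for the rendering rows; the helper name `speedup` is only illustrative. Because the table's timings are rounded to 0.1 ms, a recomputed ratio can differ from the printed one in the last digit, and the "Open document" row is left out because its rounded timings are too coarse to reproduce the ratio.

```python
# Timings in milliseconds, copied from the rendering table: (SoPDF, PyMuPDF).
timings_ms = {
    "Render 1 page @ 72 DPI": (6.6, 9.1),
    "Render 1 page @ 150 DPI": (20.0, 30.3),
    "Render 1 page @ 300 DPI": (64.6, 101.1),
    "50 pages sequential @ 150 DPI": (966.9, 1470.3),
    "50 pages parallel @ 150 DPI": (410.7, 447.2),
}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


for scenario, (sopdf_ms, pymupdf_ms) in timings_ms.items():
    print(f"{scenario}: {speedup(pymupdf_ms, sopdf_ms):.2f}x faster")

# SoPDF's parallel run measured against its own sequential run.
self_speedup = speedup(966.9, 410.7)
print(f"SoPDF parallel vs sequential: {self_speedup:.2f}x")
```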

Text Extraction vs PyMuPDF

| Scenario | SoPDF | PyMuPDF | Speedup |
| --- | --- | --- | --- |
| Plain text, 50 pages | 26.0 ms | 70.0 ms | 2.70× faster |
| Text blocks, 50 pages | 63.6 ms | 70.4 ms | 1.11× faster |
| Search 'benchmark', 50 pages | 30.2 ms | 91.0 ms | 3.01× faster |
| Region extract, 50 pages | 27.6 ms | 39.6 ms | 1.43× faster |

Text search is the standout: 3× faster than PyMuPDF. Plain-text extraction follows at 2.7×. Correctness is checked as well: SoPDF and PyMuPDF produce 99% word-level overlap on the same document, so the speed advantage comes with essentially no accuracy trade-off.
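The text above does not say how the word-level overlap is computed. As a rough illustration only, one plausible definition is the share of words in a reference extraction that also appear in the candidate extraction, counted as a multiset. The function names `words` and `word_overlap` below are hypothetical and are not part of SoPDF or its benchmark suite.

```python
import re
from collections import Counter


def words(text: str) -> Counter:
    """Lower-case the text and count each word occurrence."""
    return Counter(re.findall(r"\w+", text.lower()))


def word_overlap(reference: str, candidate: str) -> float:
    """Fraction of reference words (with multiplicity) also found in candidate."""
    ref, cand = words(reference), words(candidate)
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to match, so treat as full agreement
    shared = sum((ref & cand).values())  # multiset intersection
    return shared / total


a = "The quick brown fox jumps over the lazy dog"
b = "The quick brown fox jumps over the lazy cat"
print(f"{word_overlap(a, b):.2%}")  # 8 of 9 reference words match
```

A multiset comparison ignores word order, so it measures whether the same words were extracted rather than whether they came out in the same sequence.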


Architecture

SoPDF runs two best-in-class C/C++ engines in tandem:

┌──────────────────────────────────────────┐
│               SoPDF Python API           │
├───────────────────┬──────────────────────┤
│   pypdfium2       │   pikepdf            │
│   (Google PDFium) │   (libqpdf)          │
│                   │                      │
│   • Rendering     │   • Structure reads  │
│   • Text extract  │   • All writes       │
│   • Search        │   • Save / compress  │
└───────────────────┴──────────────────────┘

A dirty-flag + hot-reload mechanism keeps the two engines in sync: when you write via pikepdf (e.g. rotate a page), the next read operation (e.g. render) automatically reserialises the document into pypdfium2, with no manual sync required.
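The dirty-flag idea can be sketched without either engine. The toy class below is not SoPDF's implementation: `DualEngineDocument`, its attributes, and the list and tuple standing in for the pikepdf and pypdfium2 sides are all assumptions made for illustration. It shows only the pattern: writes mark the document dirty, and the next read refreshes the reader's copy once.

```python
class DualEngineDocument:
    """Toy model: writer-side state plus a lazily refreshed reader snapshot."""

    def __init__(self, rotations: list[int]) -> None:
        self._rotations = list(rotations)   # stand-in for the write engine's state
        self._snapshot = tuple(rotations)   # stand-in for the read engine's copy
        self._dirty = False
        self.reload_count = 0               # how many times the reader was refreshed

    def set_rotation(self, index: int, degrees: int) -> None:
        """Write path: change the structure and mark the reader copy stale."""
        self._rotations[index] = degrees % 360
        self._dirty = True

    def _sync(self) -> None:
        """Refresh the reader copy only if a write happened since the last read."""
        if self._dirty:
            self._snapshot = tuple(self._rotations)
            self._dirty = False
            self.reload_count += 1

    def render(self, index: int) -> str:
        """Read path: always sync first, then read from the reader copy."""
        self._sync()
        return f"page {index} rendered at {self._snapshot[index]} degrees"


doc = DualEngineDocument([0, 0, 0])
doc.set_rotation(0, 90)
doc.set_rotation(1, 180)
print(doc.render(0))
print(doc.render(1))
print(doc.reload_count)  # two writes, but only one refresh
```

Deferring the refresh to the next read means a batch of writes costs a single reserialisation instead of one per write.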

Files are opened with lazy loading / mmap: a 500 MB PDF opens in milliseconds, and only the pages you actually access are loaded.
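How SoPDF wires this up internally is not shown here. As a general illustration of why memory mapping makes opening cheap, the sketch below uses only Python's standard `mmap` module on a throwaway file: mapping the file does not read it, and slicing touches only the requested byte ranges. The file contents are placeholder bytes, not a valid PDF.

```python
import mmap
import os
import tempfile

# Build a throwaway 8 MiB file with a PDF-like header and trailer.
size = 8 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"%PDF-1.7\n")
    tmp.seek(size - 6)          # skip ahead; the gap is filled with zero bytes
    tmp.write(b"%%EOF\n")
    path = tmp.name

try:
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        # Mapping is cheap: no file data is read until a slice is accessed.
        header = view[:8]       # touches only the first bytes
        trailer = view[-6:]     # touches only the last bytes
        mapped_size = len(view)
finally:
    os.remove(path)

print(header, trailer, mapped_size)
```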


Quick Start

pip install sopdf

Requires Python 3.10+. The two native dependencies (pypdfium2, pikepdf) ship pre-built wheels for macOS, Linux, and Windows, so no compiler is needed.

import sopdf

# --- Open ---
# from a file path (near-instant thanks to lazy loading & mmap)
with sopdf.open("document.pdf") as doc:

    # --- Render ---
    img_bytes = doc[0].render(dpi=150)            # PNG bytes
    doc[0].render_to_file("page0.png", dpi=300)   # write to disk

    # parallel rendering across all pages
    images = sopdf.render_pages(doc.pages, dpi=150, parallel=True)

    # --- Extract text ---
    text = doc[0].get_text()
    blocks = doc[0].get_text_blocks()             # list[TextBlock] with bounding boxes

    # --- Search ---
    hits = doc[0].search("invoice", match_case=False)   # list[Rect]

    # --- Split & merge ---
    new_doc = doc.split(pages=[0, 1, 2], output="chapter1.pdf")
    doc.split_each(output_dir="pages/")
    sopdf.merge(["intro.pdf", "body.pdf"], output="book.pdf")

    # --- Save ---
    doc.append(new_doc)
    doc.save("out.pdf", compress=True, garbage=True)
    raw = doc.to_bytes()                          # no disk write

    # --- Rotate ---
    doc[0].rotation = 90

# --- Encrypted PDFs ---
with sopdf.open("protected.pdf", password="hunter2") as doc:
    doc.save("unlocked.pdf")                      # encryption stripped on save

# --- Open from bytes / stream ---
with open("document.pdf", "rb") as f:
    with sopdf.open(stream=f.read()) as doc:
        print(doc.page_count)

# --- Auto-repair corrupted PDFs ---
with sopdf.open("corrupted.pdf") as doc:
    doc.save("repaired.pdf")

Features

| Capability | Example |
| --- | --- |
| Open from path / bytes / stream | 01_open |
| Render pages to PNG / JPEG | 02_render |
| Batch & parallel rendering | 02_render |
| Extract plain text | 03_extract_text |
| Extract text with bounding boxes | 03_extract_text |
| Full-text search with hit rects | 04_search_text |
| Split pages into new document | 05_split |
| Merge multiple PDFs | 06_merge |
| Save with compression | 07_save_compress |
| Serialise to bytes (no disk write) | 07_save_compress |
| Rotate pages | 08_rotate |
| Open & save encrypted PDFs | 09_decrypt |
| Auto-repair corrupted PDFs | 10_repair |

License

Apache 2.0; see LICENSE.

SoPDF is free to use in personal projects, commercial products, and open-source libraries. No licensing fees, no attribution requirements beyond the standard Apache 2.0 notice.
