A high-performance PDF processing library with a permissive license.
Why SoPDF?
1. High Performance
With parallel processing and other optimizations, SoPDF outperforms PyMuPDF in the benchmarks below: rendering is up to 1.56x faster, plain-text extraction 2.7x faster, and full-text search 3x faster, while maintaining 99% word-level consistency with PyMuPDF. See the Benchmarks section, or run the suite yourself.
2. Feature-Rich
Built on pypdfium2 (Google PDFium, for rendering and text) and pikepdf (libqpdf, for structure and writing). SoPDF covers the entire workflow from rendering and text extraction to structural editing.
3. Clean API
Intuition as documentation. You would have designed it the same way.
4. Permissive License
In PDF processing, feature-rich open-source libraries often come with a license that is unfriendly to the wider open-source ecosystem. SoPDF delivers equivalent core capabilities under the Apache 2.0 License: no strings attached, no license audit, zero friction. Embed it, ship it, fork it. It's yours.
If you find SoPDF helpful, please consider giving it a star. It really means a lot to us, and every star fuels our motivation to keep improving.
Benchmarks
Measured on Apple M-series (arm64, 10-core), Python 3.10, against a 50-page PDF fixture. Run the suite yourself:
    python tests/benchmark/run_benchmarks.py
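To time a workload of your own rather than the bundled fixture, a minimal harness might look like the sketch below. It is not the project's benchmark script: `time_ms` and `workload` are hypothetical names, and reporting the median of several runs is an assumption about what makes a fair measurement, not a description of how the figures in this section were produced.

```python
import statistics
import time


def time_ms(func, repeats=5):
    """Return the median wall-clock time of func() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def workload():
    # Stand-in for a real call such as rendering a page or extracting its text.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"median: {time_ms(workload):.3f} ms")
```

Swapping `workload` for a function that opens a PDF and renders one page gives a figure comparable in spirit to the per-page rows below.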
Rendering vs PyMuPDF
| Scenario | SoPDF | PyMuPDF | Speedup |
|---|---|---|---|
| Open document | 0.1 ms | 0.2 ms | 1.39× faster |
| Render 1 page @ 72 DPI | 6.6 ms | 9.1 ms | 1.38× faster |
| Render 1 page @ 150 DPI | 20.0 ms | 30.3 ms | 1.51× faster |
| Render 1 page @ 300 DPI | 64.6 ms | 101.1 ms | 1.56× faster |
| 50 pages sequential @ 150 DPI | 966.9 ms | 1470.3 ms | 1.52× faster |
| 50 pages parallel @ 150 DPI | 410.7 ms | 447.2 ms | 1.09× faster |
SoPDF wins at every DPI, and the margin widens at higher resolutions. In parallel mode, SoPDF achieves a genuine 2.35× speedup over its own sequential baseline. PyMuPDF's thread-parallel path, by contrast, regresses to 1548.9 ms (slower than its 1470.3 ms sequential run) because MuPDF serialises concurrent renders behind a global lock.
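The parallel-versus-sequential ratios quoted above follow directly from the reported timings. The short check below reproduces them; `speedup` is a helper name introduced only for this illustration, and the inputs are the millisecond figures from the table and the paragraph above.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Timings in milliseconds, copied from the rendering results above.
sopdf_sequential = 966.9
sopdf_parallel = 410.7
pymupdf_sequential = 1470.3
pymupdf_threaded = 1548.9

sopdf_gain = speedup(sopdf_sequential, sopdf_parallel)
pymupdf_gain = speedup(pymupdf_sequential, pymupdf_threaded)

print(f"SoPDF parallel vs sequential:   {sopdf_gain:.2f}x")    # 2.35x
print(f"PyMuPDF threaded vs sequential: {pymupdf_gain:.2f}x")  # 0.95x
```

A ratio below 1.0 means the threaded run took longer than the sequential one.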
Text Extraction vs PyMuPDF
| Scenario | SoPDF | PyMuPDF | Speedup |
|---|---|---|---|
| Plain text (50 pages) | 26.0 ms | 70.0 ms | 2.70× faster |
| Text blocks (50 pages) | 63.6 ms | 70.4 ms | 1.11× faster |
| Search 'benchmark' (50 pages) | 30.2 ms | 91.0 ms | 3.01× faster |
| Region extract (50 pages) | 27.6 ms | 39.6 ms | 1.43× faster |
Text search is the standout at 3× faster than PyMuPDF, followed by plain-text extraction at 2.7×. Correctness is checked as well: sopdf and PyMuPDF produce 99% word-level overlap on the same document, so the speed advantage does not come at a meaningful cost in accuracy.
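The exact definition behind the word-level overlap figure is not spelled out here. One plausible way to compute such a metric, shown purely as an assumption, is to split both extractions on whitespace and count how many of the reference's words the candidate also produced. `word_overlap` and the sample strings are hypothetical.

```python
from collections import Counter


def word_overlap(candidate: str, reference: str) -> float:
    """Fraction of the reference's words that also appear in the candidate.

    Words are compared as a multiset, so a repeated word must be matched
    the same number of times to count fully.
    """
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    total = sum(ref.values())
    if total == 0:
        return 1.0  # an empty reference has no words to miss
    shared = sum((cand & ref).values())  # multiset intersection
    return shared / total


a = "Total due: 42 USD by Friday"
b = "Total due: 42 USD by  Friday."
print(f"{word_overlap(a, b):.2f}")  # 5 of 6 words match -> 0.83
```

Under this definition, differences in whitespace do not matter, while a single differing punctuation mark costs one word.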
Architecture
SoPDF runs two best-in-class C/C++ engines in tandem:

    ┌─────────────────────────────────────────────┐
    │              SoPDF Python API               │
    ├──────────────────────┬──────────────────────┤
    │      pypdfium2       │       pikepdf        │
    │   (Google PDFium)    │      (libqpdf)       │
    │                      │                      │
    │  • Rendering         │  • Structure reads   │
    │  • Text extract      │  • All writes        │
    │  • Search            │  • Save / compress   │
    └──────────────────────┴──────────────────────┘

A dirty-flag + hot-reload mechanism keeps the two engines in sync: when you write via pikepdf (e.g. rotate a page), the next read operation (e.g. render) automatically reserialises the document into pypdfium2, so no manual sync is required.
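The dirty-flag idea can be illustrated with a toy model. The sketch below is not SoPDF's internal code: `SyncedDocument` and its attributes are hypothetical, and plain Python lists stand in for the write-side and read-side engines. It only shows the pattern of marking state stale on write and reloading lazily on the next read.

```python
class SyncedDocument:
    """Toy model of a write engine and a read engine kept in sync by a dirty flag."""

    def __init__(self, rotations):
        self._write_state = list(rotations)    # stands in for the write-side engine
        self._read_snapshot = list(rotations)  # stands in for the read-side engine
        self._dirty = False
        self.reload_count = 0

    def set_rotation(self, index, degrees):
        # Writes touch only the write-side state and mark the snapshot stale.
        self._write_state[index] = degrees % 360
        self._dirty = True

    def _ensure_fresh(self):
        # Reload lazily: only when a read needs it, once per batch of writes.
        if self._dirty:
            self._read_snapshot = list(self._write_state)  # the "reserialise" step
            self._dirty = False
            self.reload_count += 1

    def render(self, index):
        self._ensure_fresh()
        return f"page {index} rendered at {self._read_snapshot[index]} degrees"


doc = SyncedDocument([0, 0, 0])
doc.set_rotation(0, 90)
doc.set_rotation(1, 180)
print(doc.render(0))     # page 0 rendered at 90 degrees
print(doc.render(1))     # page 1 rendered at 180 degrees
print(doc.reload_count)  # 1
```

Two writes followed by two reads trigger a single reload, which is the main appeal of the pattern: the cost of resynchronising is paid once per batch of edits rather than once per edit.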
Files are opened with lazy loading / mmap: a 500 MB PDF opens in milliseconds, and only the pages you actually access are loaded.
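To see why memory mapping makes opening cheap, the sketch below uses Python's standard `mmap` module on a small stand-in file. Whether SoPDF relies on this module or on the native engines' own mapping is not stated here, so treat it as an illustration of the general technique; `read_slice` is a hypothetical helper.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Read `length` bytes at `offset` through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # Slicing copies only the requested bytes out of the mapping.
            return view[offset:offset + length]


# Build a small stand-in file; a real PDF would simply be much larger.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"%PDF-1.7\n" + b"\0" * 4096 + b"%%EOF\n")
    path = tmp.name

try:
    header = read_slice(path, 0, 8)
    trailer = read_slice(path, os.path.getsize(path) - 6, 6)
    print(header, trailer)
finally:
    os.remove(path)
```

Mapping the file does not read it up front; the operating system pages in only the regions that are touched, here the first and last few bytes.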
Quick Start
    pip install sopdf

Requires Python 3.10+. The two native dependencies (pypdfium2, pikepdf) ship pre-built wheels for macOS, Linux, and Windows, so no compiler is needed.
    import sopdf

    # --- Open ---
    # from a file path (near-instant thanks to lazy loading & mmap)
    with sopdf.open("document.pdf") as doc:

        # --- Render ---
        img_bytes = doc[0].render(dpi=150)           # PNG bytes
        doc[0].render_to_file("page0.png", dpi=300)  # write to disk

        # parallel rendering across all pages
        images = sopdf.render_pages(doc.pages, dpi=150, parallel=True)

        # --- Extract text ---
        text = doc[0].get_text()
        blocks = doc[0].get_text_blocks()  # list[TextBlock] with bounding boxes

        # --- Search ---
        hits = doc[0].search("invoice", match_case=False)  # list[Rect]

        # --- Split & merge ---
        new_doc = doc.split(pages=[0, 1, 2], output="chapter1.pdf")
        doc.split_each(output_dir="pages/")
        sopdf.merge(["intro.pdf", "body.pdf"], output="book.pdf")

        # --- Save ---
        doc.append(new_doc)
        doc.save("out.pdf", compress=True, garbage=True)
        raw = doc.to_bytes()  # no disk write

        # --- Rotate ---
        doc[0].rotation = 90

    # --- Encrypted PDFs ---
    with sopdf.open("protected.pdf", password="hunter2") as doc:
        doc.save("unlocked.pdf")  # encryption stripped on save

    # --- Open from bytes / stream ---
    with open("document.pdf", "rb") as f:
        with sopdf.open(stream=f.read()) as doc:
            print(doc.page_count)

    # --- Auto-repair corrupted PDFs ---
    with sopdf.open("corrupted.pdf") as doc:
        doc.save("repaired.pdf")

Features
| Capability | Examples |
|---|---|
| Open from path / bytes / stream | 01_open |
| Render pages to PNG / JPEG | 02_render |
| Batch & parallel rendering | 02_render |
| Extract plain text | 03_extract_text |
| Extract text with bounding boxes | 03_extract_text |
| Full-text search with hit rects | 04_search_text |
| Split pages into new document | 05_split |
| Merge multiple PDFs | 06_merge |
| Save with compression | 07_save_compress |
| Serialise to bytes (no disk write) | 07_save_compress |
| Rotate pages | 08_rotate |
| Open & save encrypted PDFs | 09_decrypt |
| Auto-repair corrupted PDFs | 10_repair |
License
Apache 2.0. See LICENSE.
SoPDF is free to use in personal projects, commercial products, and open-source libraries. No licensing fees, no attribution requirements beyond the standard Apache 2.0 notice.