A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.
Project description
WinnerZ Python Library Documentation
Overview
The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.
The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).
Architecture
The system is divided into several conceptual layers:
- Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
- Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
- Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
- Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
- Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.
Core Loading Mechanism
The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:
- Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
- Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
- Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
- Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
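The thread-safety and retry guarantees above can be illustrated with a minimal sketch. This is not the library's actual _load_core(); the helper name load_core_once, the retry delay, and the use of importlib are assumptions made for illustration, and the standard-library json module stands in for winnerz_core so the snippet runs anywhere.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3        # value quoted in the documentation above
_core_lock = threading.Lock()   # serializes the one-time initialization
_core_module = None             # cached module once loading has succeeded


def load_core_once(module_name, retry_delay=0.05):
    """Hypothetical stand-in for _load_core(): import a module at most once."""
    global _core_module
    with _core_lock:
        if _core_module is not None:
            # Already initialized by an earlier call (possibly another thread).
            return _core_module
        last_error = None
        for attempt in range(1, _CORE_IMPORT_RETRIES + 1):
            try:
                _core_module = importlib.import_module(module_name)
                return _core_module
            except ImportError as exc:
                last_error = exc
                if attempt < _CORE_IMPORT_RETRIES:
                    time.sleep(retry_delay)  # give transient issues time to clear
        raise ImportError(
            f"could not load {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts"
        ) from last_error


# "json" is only a placeholder for the real core module name.
core = load_core_once("json")
```

Holding the lock for the whole retry loop keeps the sketch simple: concurrent callers wait and then receive the cached module instead of racing to import it.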
Environment Variables
- WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
  - Valid values: auto (default), pdfium, playwright.
  - Resolution order for auto: Falls back from PDFium to Playwright based on availability.
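As a rough illustration of the resolution order described above, the sketch below reads the variable and picks a backend. The function name resolve_preview_backend, the injectable is_available probe, and the choice to return None when nothing is installed are all assumptions; the library's real resolver may behave differently.

```python
import importlib.util
import os


def _package_importable(name):
    """Default probe: a backend is available if its package can be found."""
    return importlib.util.find_spec(name) is not None


def resolve_preview_backend(env=None, is_available=_package_importable):
    """Hypothetical resolver for WINNERZ_PREVIEW_BACKEND."""
    env = os.environ if env is None else env
    choice = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    packages = {"pdfium": "pypdfium2", "playwright": "playwright"}

    if choice in packages:
        return choice  # explicit selection wins
    if choice != "auto":
        raise ValueError(f"unsupported preview backend: {choice!r}")
    # "auto": prefer PDFium, then fall back to Playwright.
    for backend in ("pdfium", "playwright"):
        if is_available(packages[backend]):
            return backend
    return None  # assumption: no backend installed means no preview


print(resolve_preview_backend({"WINNERZ_PREVIEW_BACKEND": "playwright"}))
```

Passing the environment and the availability probe as arguments makes the fallback order easy to exercise without installing either backend.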
Class Reference
Document
Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.
Constructor:
Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.
Methods:
- __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
- __len__(): Returns the total number of pages in the document.
- get_all_text(): Extracts text from all pages using C++ multi-threading (extract_all_text_concurrent). Pages are processed in batches sized to the hardware concurrency, so the work scales automatically with the number of CPU cores. Extraction runs outside the Python GIL, and batching prevents thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
- save(path): Saves the current document state. Editing operations (such as redactions and image insertions) and page-merging operations (show_pdf_page) are written into PDFium memory as they occur, and this method flushes them to disk.
- close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
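The batching idea behind get_all_text() can be sketched in plain Python. This is only an illustration of chunking work by hardware concurrency: the real implementation is C++ (std::async), and Python threads like the ones below do not bypass the GIL. The names page_batches and extract_all, and the extract_page callback, are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, workers=None):
    """Yield ranges of page indices, at most `workers` pages per batch."""
    workers = workers or os.cpu_count() or 1
    for start in range(0, page_count, workers):
        yield range(start, min(start + workers, page_count))


def extract_all(page_count, extract_page, workers=None):
    """Run extract_page over every page, one bounded batch at a time."""
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in page_batches(page_count, workers):
            # pool.map preserves input order, so texts stay in page order.
            texts.extend(pool.map(extract_page, batch))
    return texts


# Placeholder extractor; a real one would return the text of page i.
print(extract_all(5, lambda i: f"page {i}", workers=2))
```

Bounding each batch by the worker count means the number of in-flight tasks never grows with the page count, which is the property that avoids thread exhaustion on very large documents.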
Page
Represents a single page within a Document.
Methods:
- get_text(mode="dict", sort=False): Extracts text content.
  - mode: Can be dict, rawdict, blocks, or text.
- get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
- get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
- redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
- add_redact_annot(rect, fill=None): Queues a redaction annotation for a specific area (supports tuple or Rect).
- apply_redactions(images=0, graphics=0): Applies pending redactions using the In-Memory engine. Set images=1 to delete images intersecting the redaction box, or keep it 0 to strictly redact text.
- clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
- insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
- show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
- rect (Property): Retrieves the bounding box of the page as a Rect.
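The scaling step that show_pdf_page performs with keep_proportion can be worked through with a small geometry sketch. The helper fit_rect is hypothetical, and centering the scaled page inside the target rectangle is an assumption; the documentation above does not say how leftover space is distributed.

```python
def fit_rect(target, src_width, src_height, keep_proportion=True):
    """Return (x0, y0, x1, y1) where a source page would land inside target.

    target is an (x0, y0, x1, y1) tuple; src_width and src_height are the
    dimensions of the page being placed.
    """
    x0, y0, x1, y1 = target
    box_w, box_h = x1 - x0, y1 - y0
    if src_width <= 0 or src_height <= 0:
        raise ValueError("source page must have positive dimensions")
    if not keep_proportion:
        # Stretch to fill the whole target rectangle.
        return (x0, y0, x1, y1)
    # A single uniform scale factor preserves the aspect ratio.
    scale = min(box_w / src_width, box_h / src_height)
    out_w, out_h = src_width * scale, src_height * scale
    # Assumption: center the result in the unused space.
    left = x0 + (box_w - out_w) / 2
    top = y0 + (box_h - out_h) / 2
    return (left, top, left + out_w, top + out_h)


# A 100x50 landscape page placed into a 200x200 square.
print(fit_rect((0, 0, 200, 200), 100, 50))
```

Taking the smaller of the two axis ratios guarantees the scaled page fits on both axes without distortion.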
Pixmap
Represents an uncompressed image buffer containing pixel data.
Properties:
- width, height: Dimensions in pixels.
- n: Number of channels (e.g., 4 for RGBA).
- stride: Number of bytes per row.
- samples: Raw byte array of pixel data.
Methods:
- pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
- tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
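The relationship between width, n, stride, and samples can be made concrete with a short sketch of a pixel lookup. The free function pixel_at is hypothetical and only mirrors what pixel(x, y) is described as returning; it assumes rows are stored top to bottom with n bytes per pixel.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of the pixel at (x, y) as a tuple."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Each row occupies `stride` bytes; each pixel occupies `n` bytes.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA buffer: 4 channels, 8 bytes per row, 16 bytes total.
samples = bytes(range(16))
print(pixel_at(samples, width=2, height=2, n=4, stride=8, x=1, y=1))
```

Using stride rather than width * n for the row offset keeps the lookup correct when rows carry padding bytes.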
Geometry Classes
- Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
- Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
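The behavior described for these types can be sketched with two small dataclasses. They reuse the documented names for readability but are independent stand-ins, not the library's classes; the apply helper on Matrix and the exact definition of is_empty are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Illustrative stand-in for the documented Rect type."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def is_empty(self):
        # Assumption: a rectangle with no positive area counts as empty.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the innermost edge on every side.
        return Rect(max(self.x0, other.x0), max(self.y0, other.y0),
                    min(self.x1, other.x1), min(self.y1, other.y1))


@dataclass(frozen=True)
class Matrix:
    """Illustrative stand-in for the documented scaling Matrix."""
    sx: float = 1.0
    sy: float = 1.0

    def apply(self, rect):
        # Hypothetical helper: scale a rectangle about the origin.
        return Rect(rect.x0 * self.sx, rect.y0 * self.sy,
                    rect.x1 * self.sx, rect.y1 * self.sy)


a = Rect(0, 0, 10, 10)
b = Rect(5, 5, 20, 20)
print(a & b, (a & Rect(20, 20, 30, 30)).is_empty)
```

Disjoint rectangles produce an intersection whose edges cross, which is why is_empty is checked on the result rather than raising an error.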
Caching Strategy
The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.
- Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
- Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
- C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
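The signature check used by the global document cache can be sketched as follows. The names open_cached and file_signature and the loader callback are hypothetical; only the idea of validating a hit by file size and nanosecond modification time comes from the description above.

```python
import os
import tempfile

_cache = {}  # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, loader):
    """Return the cached object for path, reloading if the file changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]  # same size and mtime: reuse the cached object
    obj = loader(key)
    _cache[key] = (signature, obj)
    return obj


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")
    with open(path, "wb") as fh:
        fh.write(b"%PDF-1.7\n")
    first = open_cached(path, loader=lambda p: object())
    second = open_cached(path, loader=lambda p: object())
    with open(path, "ab") as fh:
        fh.write(b"more")  # changes the file size, so the signature differs
    third = open_cached(path, loader=lambda p: object())
```

Comparing a cheap stat-based signature avoids rereading the file on every call while still invalidating the entry when the file is rewritten.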
Performance Benchmark
In the bulk text extraction test below, the native C++ multi-threading pipeline and persistent object caching let WinnerZ finish in less than half the time taken by PyMuPDF (fitz).
Tested on a 185-page PDF document (2024-annual-report.pdf):
- ⏱️ PyMuPDF (fitz): ~0.44s
- 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster by these rounded timings)
Dependencies
- pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
- Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
File details
Details for the file winnerz-1.1.6.tar.gz.
File metadata
- Download URL: winnerz-1.1.6.tar.gz
- Upload date:
- Size: 74.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 053bfba55c936517325a3a497909310727294b1a5bb3047f7f5bc22e373071f0 |
| MD5 | dce4333c8d4490939ec7cbac8c4be497 |
| BLAKE2b-256 | f93c9a1a2f0cca1f2ffa685e153e1953a4b9e9a5931ee955e896a0eff907b8b5 |
File details
Details for the file winnerz-1.1.6-cp312-cp312-win_amd64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp312-cp312-win_amd64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.12, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f8be2c0ea012112dad2f6263dae7245912bcc7fa00dd4d567e67989131140fc2 |
| MD5 | 36038ad7ff7af99b44abf3616bd95dfc |
| BLAKE2b-256 | 824848e4ed3e67191cf3046838c594606c2b1035e03267bc0ecd1f8e54373d72 |
File details
Details for the file winnerz-1.1.6-cp312-cp312-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp312-cp312-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 7.2 MB
- Tags: CPython 3.12, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 18f4057d3e7d089100e44f63cf608a4eeb3078353a32bd46d3f80533166e3c7f |
| MD5 | 75c291bcb4623d492db425a9a55bfa2c |
| BLAKE2b-256 | 95bb1273b4cdf784718c8d7fc2ff4b3576103edee576aa0d71c514d05ef23e47 |
File details
Details for the file winnerz-1.1.6-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8d3f5fba7480dd2e74bb5a8c993ca292cbd1f69e8754461c0d7305b0dba3d61f |
| MD5 | f4954a9b1680ec24fded914ad9a669bf |
| BLAKE2b-256 | 4550054da73a95de5a5fd7408ff15ec08cf0945bcdc41cfe10fa839e20207b2d |
File details
Details for the file winnerz-1.1.6-cp312-cp312-macosx_10_9_x86_64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp312-cp312-macosx_10_9_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.12, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b9659ebf611fbabe8bd726675ce1b884c3ccf6f758d3c6a114dda19d02c5d5cf |
| MD5 | 8eeba400b0eb15603e63d9ae72228237 |
| BLAKE2b-256 | f4ce705733b45f83f4dbe54bf4f58cd82f591a6f30177358d9374fff70a1b29f |
File details
Details for the file winnerz-1.1.6-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dfa76d813533aa148861346a0b29497dcf5cf74d78b6bab4d19c2f8a8d334a9e |
| MD5 | 14aaa0ff26e7a5aa8a4819c4ae04e67f |
| BLAKE2b-256 | f2ad7a08c0ab221340529f7898edab771f10f154eee2348874ffa70e5373b9d5 |
File details
Details for the file winnerz-1.1.6-cp311-cp311-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp311-cp311-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 6.3 MB
- Tags: CPython 3.11, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dbdf1d91768addfbdb9a405e71119e6c264877e9d87270a6a05b7514252eea28 |
| MD5 | b700ceccfc16eacb8498073683b1568b |
| BLAKE2b-256 | 362adc6a3f7ff1216d00221e3c6d6431cf0b4de1a3deaec563c86ea910bad068 |
File details
Details for the file winnerz-1.1.6-cp311-cp311-macosx_11_0_arm64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp311-cp311-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.11, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ce82dc8f0efba244939f4dd378a0c80a9196105842a8f5b56d11506fa3da4c16 |
| MD5 | 2de0ff50fefca44253c8878634e516d3 |
| BLAKE2b-256 | e8cdffc16678a925523ab59840c785b1c4f0b93144738419a2e1d963b3574016 |
File details
Details for the file winnerz-1.1.6-cp311-cp311-macosx_10_9_x86_64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp311-cp311-macosx_10_9_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.11, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a913b3d5354bf95a100180e6d4c78eeac90f15882b3ceefc22807efde9985ef5 |
| MD5 | ea99f6657df2afbf62d53aeb52734c8d |
| BLAKE2b-256 | d95c7e40c534e6e5aae94926f71756017ac1edaf488900e33a602b8259a7016c |
File details
Details for the file winnerz-1.1.6-cp310-cp310-win_amd64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7e40459a0f9a7e59390947918d3e0cb7582782ec2f3080ef540fe9e53e6646e7 |
| MD5 | c8b52e8058231d1dc0b3406b1ee1344c |
| BLAKE2b-256 | 35589f411b0a932b8be3ba0194305df6c982f96d5e7b5688b0694d604472e8d0 |
File details
Details for the file winnerz-1.1.6-cp310-cp310-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp310-cp310-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 5.3 MB
- Tags: CPython 3.10, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 03dcee8edb6812b3b360868333d9dfa575b826b2f893f9a4ed527db2395cb277 |
| MD5 | ccc375c3c74bd92e8ca7b6aa2ab92e6c |
| BLAKE2b-256 | 8a38edf7bda71ed7d07c0e12d3174d3a8623d695c8d3785c0184cff98bd9e345 |
File details
Details for the file winnerz-1.1.6-cp310-cp310-macosx_11_0_arm64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp310-cp310-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.10, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e02a0db90b963c176e74fe6f9e8fe3cee597a56a6c0daca70b9576cbf3dcf5ec |
| MD5 | 5ff643ce9639b43a01b144553d70cf78 |
| BLAKE2b-256 | 1e35c1673ecd645cc5d2aa4b4eb2f0b2924f7b6137885c2553490864c55aa21d |
File details
Details for the file winnerz-1.1.6-cp310-cp310-macosx_10_9_x86_64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp310-cp310-macosx_10_9_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.10, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3fce2c7e261414dae4fea69ffc4e8c073cec865f107171ac15032b5d3b08c30c |
| MD5 | d44808c5a732d2c6cdbff1e49322e42f |
| BLAKE2b-256 | 7c791d4d934e84fdad71ba0a51ac20bb38a9259a6a37968085411b520465dde2 |
File details
Details for the file winnerz-1.1.6-cp39-cp39-win_amd64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp39-cp39-win_amd64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.9, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3d0125a82f55ae9fa7972a57fbcc2295ce9497d41e1e9a262c69283c2299251b |
| MD5 | ecb897487a596d8b113d2148be15e509 |
| BLAKE2b-256 | b9a1fb800e668c746ad962d1bee5b526f529c790c956df680cfbb609fc2f11cc |
File details
Details for the file winnerz-1.1.6-cp39-cp39-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp39-cp39-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 4.4 MB
- Tags: CPython 3.9, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 00befb431d9c7eed56134d36a478295dc82684ba7c62854e42db02be4a233d66 |
| MD5 | 328eabb5ba61dbea67bc22250e0ad0bd |
| BLAKE2b-256 | 0035399f0022951e5dce32cba3d573fb71e93f2f526aa585fd60d03d25da435c |
File details
Details for the file winnerz-1.1.6-cp39-cp39-macosx_11_0_arm64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp39-cp39-macosx_11_0_arm64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.9, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f9e226498e2949cb01a8285239f4e02e7556103632db835704471ec089a3232b |
| MD5 | 1c67d85a9773520cdb5347524f4d9ca0 |
| BLAKE2b-256 | bea988aa78ed3bd6306c778e2756a91dae2e19cdc80cad678a9e94491e3e1fe5 |
File details
Details for the file winnerz-1.1.6-cp39-cp39-macosx_10_9_x86_64.whl.
File metadata
- Download URL: winnerz-1.1.6-cp39-cp39-macosx_10_9_x86_64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.9, macOS 10.9+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7cc9320589b704737f5fc9435e72c989841a223163909768d638d42e32b21d61 |
| MD5 | a33aec95cc08e25c1c28ab494ec2f8c1 |
| BLAKE2b-256 | 52afc0f53c6383ae76203970632a4e5a4a7710c752330781afe851c080773ca4 |