A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.
WinnerZ Python Library Documentation
Overview
The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.
The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.
Architecture
The system is divided into several conceptual layers:
- Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
- Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
- Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
- Micro-OCR Fallback Engine: A pure C++ built-in OCR engine that activates automatically when encountering corrupted or missing ToUnicode tables. It uses 64-bit bitwise packing and hardware POPCOUNT for fast template matching without external dependencies like Tesseract.
- Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
- Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.
Core Loading Mechanism
The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:
- Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
- Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
- Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
- Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
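The lock-plus-retry behavior described above can be illustrated with a minimal sketch. This is not the library's actual _load_core(): the function name load_core_sketch, the importer parameter, and the delay argument are assumptions made for illustration. Only the lock, the one-time initialization, and the retry count of 3 come from the text.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3          # retry count quoted in the text above
_core_lock = threading.Lock()     # serializes first-time initialization
_core_module = None               # cached module once loading succeeds


def load_core_sketch(name, importer=importlib.import_module, delay=0.0):
    """Hypothetical loader: import `name` once, retrying transient failures."""
    global _core_module
    with _core_lock:
        if _core_module is not None:
            return _core_module   # already initialized by an earlier call
        last_error = None
        for _attempt in range(_CORE_IMPORT_RETRIES):
            try:
                _core_module = importer(name)
                return _core_module
            except (ImportError, OSError) as exc:
                last_error = exc
                time.sleep(delay)  # brief pause before the next attempt
        raise ImportError(
            f"failed to load {name!r} after {_CORE_IMPORT_RETRIES} attempts"
        ) from last_error
```

Holding the lock for the whole retry loop means concurrent callers wait for the first attempt to finish rather than racing to import the binary themselves.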
Environment Variables
- WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
  - Valid values: auto (default), pdfium.
  - Resolution order for auto: Uses PDFium when available.
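As a small sketch of how this variable might be set and read: the helper read_preview_backend is hypothetical, and its case-insensitive matching and fallback to auto for unknown values are assumptions, not documented behavior. Setting the variable before importing the library is the cautious choice, since the text does not say when it is read.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")  # values listed in the text above


def read_preview_backend(env=None):
    """Hypothetical helper: return the requested backend, defaulting to auto."""
    env = os.environ if env is None else env
    value = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if value not in VALID_BACKENDS:
        # Assumption: unrecognized values fall back to the default.
        return "auto"
    return value


# Request PDFium explicitly for the current process.
os.environ["WINNERZ_PREVIEW_BACKEND"] = "pdfium"
```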
Advanced Features
Micro-OCR Anti-Obfuscation
WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and performing Intersection-over-Union (IoU) template matching.
- Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
- Hardware Accelerated: Uses 64-bit bitwise packing and CPU __popcnt64 instructions to evaluate millions of pixel comparisons in milliseconds.
- Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
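The idea behind bit-packed IoU matching can be sketched in Python. The real engine is C++ and its template format is not documented here, so the 8x8 glyph size, the helper names, and the two toy templates are all assumptions for illustration. Each glyph is packed into one 64-bit integer, and IoU is the population count of the AND divided by the population count of the OR.

```python
def pack_bitmap(rows):
    """Pack an 8x8 grid of 0/1 pixels into one 64-bit integer, row by row."""
    word = 0
    for row in rows:
        for pixel in row:
            word = (word << 1) | (1 if pixel else 0)
    return word


def popcount(word):
    """Count set bits (the job a hardware POPCNT instruction does in C++)."""
    return bin(word).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def best_match(glyph, templates):
    """Return the name of the template with the highest IoU."""
    return max(templates, key=lambda name: iou(glyph, templates[name]))


# Two toy templates: a vertical bar and a horizontal bar.
vertical = pack_bitmap([[0, 0, 0, 1, 1, 0, 0, 0]] * 8)
horizontal = pack_bitmap([[1] * 8 if r in (3, 4) else [0] * 8 for r in range(8)])
TEMPLATES = {"vertical": vertical, "horizontal": horizontal}

# A vertical bar with one pixel missing still matches "vertical".
noisy = pack_bitmap([[0, 0, 0, 0, 1, 0, 0, 0]] + [[0, 0, 0, 1, 1, 0, 0, 0]] * 7)
match = best_match(noisy, TEMPLATES)
```

Because a whole glyph fits in one machine word, each comparison costs two bitwise operations and two population counts, which is why this approach scales to thousands of templates.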
Class Reference
Document
Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.
Constructor:
- Document(path_or_bytes): Resolves the file path or raw in-memory bytes (Zero-Disk mode) and instantly initializes the C++ core. If encryption is detected (e.g., RC4/AES), it falls back to an automatic decryption routine seamlessly in RAM or via a temporary file.
Methods:
- __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
- __len__(): Returns the total number of pages in the document.
- get_all_text(): A highly optimized utility that utilizes C++ multi-threading to extract text from all pages. It uses a dynamic hardware-concurrency batching mechanism to process pages in chunks (scaling automatically with the number of CPU cores). This entirely bypasses the Python GIL and prevents thread-exhaustion (EAGAIN) on massive 5000+ page PDFs.
- tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
- redact_pages_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
- close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
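The batching idea behind get_all_text() can be sketched in pure Python. This is only an illustration of processing pages in chunks sized by the CPU count so that the number of in-flight tasks stays bounded. The real implementation is C++ with std::async, and unlike it, this sketch does not bypass the GIL. The names iter_batches, extract_all_text_sketch, and the extract_page callback are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def iter_batches(page_count, batch_size):
    """Yield consecutive ranges of page indices, at most batch_size long."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def extract_all_text_sketch(page_count, extract_page, workers=None):
    """Run extract_page(index) for every page, one bounded batch at a time."""
    workers = workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in iter_batches(page_count, workers):
            # map() preserves input order, so pages stay in document order.
            results.extend(pool.map(extract_page, batch))
    return results
```

Capping each batch at the worker count is what keeps a 5000-page document from spawning 5000 concurrent tasks at once.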
Page
Represents a single page within a Document.
Methods:
- get_text(mode="dict", sort=False): Extracts text content.
  - mode: Can be dict, rawdict, blocks, or text.
- get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
- get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
- redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
- clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
- insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
- show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
- rect (Property): Retrieves the bounding box of the page as a Rect.
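The scale-to-fit arithmetic that keep_proportion implies in show_pdf_page can be worked through in a short sketch. The helper names are hypothetical, and centering the scaled page inside the target rectangle is an assumption; the text only says the page is scaled to fit rect.

```python
def fit_scale(src_w, src_h, dst_w, dst_h, keep_proportion=True):
    """Return (sx, sy) factors that scale a source page into a target box."""
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source dimensions must be positive")
    sx = dst_w / src_w
    sy = dst_h / src_h
    if keep_proportion:
        # Use the smaller factor on both axes so nothing overflows the box.
        sx = sy = min(sx, sy)
    return sx, sy


def place_centered(src_w, src_h, rect, keep_proportion=True):
    """Return the (x0, y0, x1, y1) area the scaled page would occupy."""
    x0, y0, x1, y1 = rect
    sx, sy = fit_scale(src_w, src_h, x1 - x0, y1 - y0, keep_proportion)
    out_w, out_h = src_w * sx, src_h * sy
    left = x0 + ((x1 - x0) - out_w) / 2   # assumption: center horizontally
    top = y0 + ((y1 - y0) - out_h) / 2    # assumption: center vertically
    return (left, top, left + out_w, top + out_h)
```

For example, a 200 x 100 source page placed into a 100 x 100 rectangle is scaled by 0.5 on both axes when proportions are kept, and stretched to fill the whole rectangle when they are not.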
Pixmap
Represents an uncompressed image buffer containing pixel data.
Properties:
- width, height: Dimensions in pixels.
- n: Number of channels (e.g., 4 for RGBA).
- stride: Number of bytes per row.
- samples: Raw byte array of pixel data.
Methods:
- pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
- tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
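The relationship between stride, n, and samples can be illustrated with a short sketch of how a pixel lookup might index the buffer. It assumes row-major storage with interleaved channels, which is the usual layout for such buffers but is not confirmed by the text. The function pixel_at is hypothetical and stands in for Pixmap.pixel.

```python
def pixel_at(samples, width, height, stride, n, x, y):
    """Return the n channel values of pixel (x, y) from a raw sample buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError("pixel coordinates out of range")
    # Rows are stride bytes apart; pixels within a row are n bytes apart.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA image with tightly packed rows: stride = width * n = 8.
rgba = bytes(range(16))
corner = pixel_at(rgba, width=2, height=2, stride=8, n=4, x=1, y=1)

# A 2x2 RGB image whose rows are padded to 8 bytes: stride > width * n.
padded = bytes(range(16))
padded_corner = pixel_at(padded, width=2, height=2, stride=8, n=3, x=1, y=1)
```

Using stride rather than width * n for the row step is what keeps the lookup correct when rows carry padding bytes.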
Geometry Classes
- Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
- Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
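A minimal sketch of a rectangle type with the properties and the & overload described above follows. RectSketch is a stand-in written for illustration, not the library's Rect; in particular, how the real class represents an empty intersection is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectSketch:
    """Illustrative rectangle with corners (x0, y0) and (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the larger of the mins and the smaller of the maxes.
        return RectSketch(
            max(self.x0, other.x0), max(self.y0, other.y0),
            min(self.x1, other.x1), min(self.y1, other.y1),
        )


page = RectSketch(0, 0, 10, 10)
clip = RectSketch(5, 5, 20, 20)
overlap = page & clip                           # RectSketch(5, 5, 10, 10)
disjoint = page & RectSketch(20, 20, 30, 30)    # empty: no shared area
```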
Caching Strategy
The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.
- Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().
- Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
- C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
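The signature-based validation used by the global document cache can be sketched as follows. This is an illustration under stated assumptions, not the code behind winnerz.open(): the names file_signature, open_cached, and the factory callback are hypothetical, and only the choice of file size plus nanosecond modification time comes from the text.

```python
import os

_cache = {}  # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds) for path."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached object for path, rebuilding it if the file changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]              # file unchanged: reuse the cached object
    obj = factory(key)             # first open, or file was modified
    _cache[key] = (signature, obj)
    return obj
```

Because every caller of the same unchanged path receives the same object, code that needs independent instances has to skip the cache, which is what the tip above about constructing Document directly addresses.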
Logging
WinnerZ uses standard Python logging under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure levels and handlers much as you would with pymupdf.
import logging

logging.basicConfig()  # attach a default handler so debug records are actually emitted
logging.getLogger("winnerz").setLevel(logging.DEBUG)
Performance Benchmark
Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ outperforms established libraries such as PyMuPDF (fitz) in the bulk text extraction benchmark below.
Tested on a standard 185-page PDF file:
- ⏱️ PyMuPDF (fitz): ~0.44s
- 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
C++ Micro-OCR Benchmark
Tested on a 100% text-obfuscated PDF file (forcing the system to run Micro-OCR on every character):
- 🐢 Traditional OCR (Tesseract): ~3-5 seconds / page
- ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds / page (roughly 9x to 15x faster)
Dependencies
- pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all in-memory editing/redaction operations (including high-speed C-level XObject merging).