A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A pure C++ built-in OCR engine that activates automatically when encountering corrupted or missing ToUnicode tables. It uses 64-bit bitwise packing and the hardware POPCNT (population count) instruction for fast template matching, without external dependencies like Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.
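The batched parallel extraction described for the interpreter pipeline can be pictured with a small Python sketch. This is only an illustration of bounded batching, not the library's C++ code: `extract_page` is a hypothetical per-page callable, and Python threads (unlike the native `std::async` pipeline) do not escape the GIL.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_in_batches(page_indices, extract_page, workers=None):
    """Run `extract_page` over every index, one bounded batch at a time.

    `extract_page` is a hypothetical stand-in for a native per-page
    extractor. Results are returned in page order.
    """
    workers = workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in batched(list(page_indices), workers):
            # At most `workers` tasks are in flight, so the thread count
            # stays bounded no matter how many pages the document has.
            results.extend(pool.map(extract_page, chunk))
    return results
```

Sizing each batch to the core count is what keeps a 5000-page document from spawning thousands of threads at once.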

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
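As a rough illustration of the first two guarantees (initialize once, retry on transient failure), a loader could look like the sketch below. It is not the actual `_load_core()` implementation: `load_core`, `importer`, and `delay` are hypothetical names, and only the retry count of 3 comes from the text above.

```python
import threading
import time

_CORE_IMPORT_RETRIES = 3  # value quoted in the documentation above
_lock = threading.Lock()
_core = None


def load_core(importer, delay=0.0):
    """Return the core module, importing it at most once per process.

    `importer` is a hypothetical zero-argument callable (for example a
    wrapper around importlib.import_module) that raises ImportError.
    """
    global _core
    with _lock:
        if _core is not None:
            return _core  # already initialized by an earlier call
        last_error = None
        for _attempt in range(_CORE_IMPORT_RETRIES):
            try:
                _core = importer()
                return _core
            except ImportError as exc:
                last_error = exc
                time.sleep(delay)  # give the filesystem/OS a moment
        raise ImportError(
            f"core failed to load after {_CORE_IMPORT_RETRIES} attempts: {last_error}"
        ) from last_error
```

Holding the lock for the whole retry loop means concurrent callers wait for the first attempt to finish instead of racing to import the binary.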

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
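A minimal sketch of how such a setting might be read and resolved is shown below. The function name `resolve_preview_backend` is hypothetical, and two behaviours are assumptions of this sketch rather than documented facts: unrecognized values are treated as auto, and an explicit pdfium request without PDFium installed raises an error.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")


def resolve_preview_backend(pdfium_available, environ=None):
    """Return "pdfium", or None when no preview backend can be used."""
    environ = os.environ if environ is None else environ
    raw = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    # Assumption: unknown values behave like the default.
    choice = raw if raw in VALID_BACKENDS else "auto"
    if pdfium_available:
        return "pdfium"
    if choice == "pdfium":
        # Assumption: an explicit request that cannot be met is an error.
        raise RuntimeError("WINNERZ_PREVIEW_BACKEND=pdfium but PDFium is unavailable")
    return None
```

In practice the variable would be set in the shell or via `os.environ` before rendering; setting it before the library is imported is the conservative choice, since the source does not say when it is read.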

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and performing Intersection-over-Union (IoU) template matching.

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Packs pixels into 64-bit words and counts matching bits with the CPU's population-count instruction (via the __popcnt64 intrinsic), so millions of pixel comparisons take milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
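The idea behind bit-packed IoU matching can be sketched in a few lines of Python. This is a toy model, not the C++ engine: rows are packed into plain integers, `bin(...).count("1")` plays the role of the hardware popcount, and all function names are made up for illustration.

```python
def pack_rows(rows):
    """Pack a bitmap given as strings of '0'/'1' into one int per row."""
    return [int(row, 2) for row in rows]


def popcount(value):
    """Count set bits; int.bit_count() does the same on Python 3.10+."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps of equal height."""
    inter = sum(popcount(x & y) for x, y in zip(a, b))
    union = sum(popcount(x | y) for x, y in zip(a, b))
    return inter / union if union else 1.0


def best_match(glyph, templates):
    """Return (label, score) for the template with the highest IoU."""
    scored = ((label, iou(glyph, bits)) for label, bits in templates.items())
    return max(scored, key=lambda pair: pair[1])
```

A single AND or OR compares up to 64 pixels at once, which is why packing plus popcount is so much cheaper than comparing pixels one by one.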

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Accepts a file path or raw in-memory bytes (Zero-Disk mode) and initializes the C++ core immediately. If encryption is detected (e.g., RC4/AES), it falls back to an automatic decryption routine that runs in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • extract_all_text_concurrent(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches sized to the hardware concurrency (scaling automatically with the number of CPU cores). The work runs outside the Python GIL, and the bounded batch size prevents thread exhaustion (EAGAIN) on very large PDFs of 5000+ pages.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
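A typical open, extract, close sequence might be wrapped as follows. The helper `extract_document_text` is hypothetical; it only uses the constructor and methods listed above, and it returns whatever `extract_all_text_concurrent()` yields, since the return type is not specified here. The class is passed in as a parameter so the helper can be exercised without the library installed.

```python
def extract_document_text(document_cls, source):
    """Open `source` (a path or raw bytes), extract all text, then clean up.

    `document_cls` is expected to behave like winnerz.Document as
    described above.
    """
    doc = document_cls(source)
    try:
        page_count = len(doc)
        text = doc.extract_all_text_concurrent()
        return page_count, text
    finally:
        # close() releases temporary decrypted files and editing buffers.
        doc.close()


# Typical call, assuming winnerz is installed:
#   import winnerz
#   pages, text = extract_document_text(winnerz.Document, "report.pdf")
```

Putting `close()` in a `finally` block ensures temporary files are removed even when extraction raises.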

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
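The effect of keep_proportion in show_pdf_page can be understood through the underlying arithmetic: scale the source page by the smaller of the two axis ratios so it fits inside rect without distortion. The sketch below shows that calculation only; `fit_rect` is a hypothetical helper, and centering the scaled page within the target is an assumption, not documented behaviour.

```python
def fit_rect(target, src_width, src_height, keep_proportion=True):
    """Return (x0, y0, x1, y1) where a source page lands inside `target`.

    `target` is an (x0, y0, x1, y1) tuple. Centering is an assumption
    of this sketch.
    """
    x0, y0, x1, y1 = target
    width, height = x1 - x0, y1 - y0
    if not keep_proportion:
        return target  # stretch to fill the whole rectangle
    # The smaller ratio is the largest scale that fits on both axes.
    scale = min(width / src_width, height / src_height)
    new_w, new_h = src_width * scale, src_height * scale
    left = x0 + (width - new_w) / 2
    top = y0 + (height - new_h) / 2
    return (left, top, left + new_w, top + new_h)
```

For example, a square 100 x 100 page placed into a 200 x 100 rectangle keeps its size and sits in the middle, leaving 50 units free on each side.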

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
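The relationship between stride, n, and samples can be shown with a small lookup function. It assumes the usual row-major layout with interleaved channels, where stride may include padding bytes at the end of each row; the source does not spell out the layout, and `pixel_at` is a hypothetical helper rather than the library's pixel() method.

```python
def pixel_at(samples, stride, n, x, y):
    """Return the n channel values of pixel (x, y) as a tuple.

    Assumes row-major rows of `stride` bytes with n interleaved channels.
    """
    offset = y * stride + x * n  # skip y full rows, then x pixels
    return tuple(samples[offset:offset + n])


# A 2 x 2 RGBA buffer: 4 channels, 8 bytes per row, 16 bytes in total.
demo_samples = bytes(range(16))
bottom_right = pixel_at(demo_samples, stride=8, n=4, x=1, y=1)
```

Using stride instead of `width * n` for the row step keeps the lookup correct when rows are padded.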

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
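To make the intersection semantics of the & operator concrete, here is a stand-in class with the same constructor arguments and properties. `DemoRect` is not the library's Rect; it is a minimal sketch of how such an overload is commonly written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        """Intersection: the overlap of both rectangles (may be empty)."""
        return DemoRect(max(self.x0, other.x0), max(self.y0, other.y0),
                        min(self.x1, other.x1), min(self.y1, other.y1))
```

Two rectangles that do not overlap produce a result whose is_empty is true, which is a convenient test before clipping or redacting.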

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
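The signature-based validation used by the global document cache can be sketched as below. This is an illustrative model, not the code behind winnerz.open(): `cached_open`, `file_signature`, and `opener` are hypothetical names, and the sketch ignores locking and eviction.

```python
import os

_cache = {}


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    info = os.stat(path)
    return (info.st_size, info.st_mtime_ns)


def cached_open(path, opener):
    """Return a cached object for `path`, reopening it if the file changed.

    `opener` is a hypothetical callable that builds the document object.
    """
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]  # same size and mtime: reuse the cached instance
    doc = opener(key)
    _cache[key] = (signature, doc)
    return doc
```

Because every caller of the cached path receives the same instance, code that needs independent copies should construct the object directly, as the tip above suggests.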

Logging

WinnerZ uses standard Python logging under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure levels and handlers with the standard logging API, similar to pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)
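Building on the two lines above, a slightly fuller setup might attach a dedicated handler and formatter. Only the logger name winnerz comes from this documentation; the format string and the choice to disable propagation are illustrative.

```python
import logging

logger = logging.getLogger("winnerz")
logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)

# Avoid duplicate lines if the root logger also has handlers attached.
logger.propagate = False
```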

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ extracted text faster than PyMuPDF (fitz) in the bulk text extraction test below.

Tested on a standard 185-page PDF:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (extract_all_text_concurrent()): ~0.18s (about 2.4x faster)
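Timings like these depend on the hardware and the input file, and the measurement method is not described here. A simple harness for reproducing such a comparison could look like the sketch below; `best_of` and `speedup` are hypothetical helpers, and the callables passed in would wrap the two extraction calls being compared.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time, in seconds, over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Taking the minimum over several runs reduces noise from caches and background processes; with the figures quoted above, `speedup(0.44, 0.18)` is roughly 2.4.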

C++ Micro-OCR Benchmark

Tested on a PDF whose text is 100% obfuscated (forcing Micro-OCR to scan every character):

  • 🐢 Traditional OCR (Tesseract): ~3-5 seconds / page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds / page (up to ~15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
  • Pillow: Optional. Required by Pixmap.tobytes() for output formats other than raw.
