A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A built-in OCR engine written in pure C++ that activates automatically when it encounters corrupted or missing ToUnicode tables. It uses 64-bit bitwise packing and hardware POPCOUNT for fast template matching, with no external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
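
The thread-safety and retry guarantees above can be sketched as follows. This is a minimal illustration, not the library's actual _load_core(): the CoreLoader class, the importer callable, and the delay parameter are hypothetical names introduced here, and only the retry count of 3 comes from the text.

```python
import threading
import time


class CoreLoader:
    """Load a native module at most once, retrying transient failures."""

    def __init__(self, importer, retries=3, delay=0.0):
        # `importer` is a hypothetical zero-argument callable that imports
        # and returns the native module, raising ImportError or OSError.
        self._importer = importer
        self._retries = retries  # mirrors _CORE_IMPORT_RETRIES = 3
        self._delay = delay
        self._lock = threading.Lock()
        self._core = None

    def load(self):
        # The lock serializes callers, so only one thread ever runs the
        # import; later callers get the already-loaded module.
        with self._lock:
            if self._core is not None:
                return self._core
            last_error = None
            for attempt in range(1, self._retries + 1):
                try:
                    self._core = self._importer()
                    return self._core
                except (ImportError, OSError) as exc:
                    last_error = exc
                    if attempt < self._retries:
                        time.sleep(self._delay)  # pause before the next try
            raise ImportError(
                f"core failed to load after {self._retries} attempts: {last_error}"
            ) from last_error
```

Holding the lock for the whole retry loop keeps the logic simple; other threads block until the first caller either succeeds or exhausts its attempts.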

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
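
As a rough sketch of how such a variable might be resolved: the function below reads WINNERZ_PREVIEW_BACKEND and checks whether pypdfium2 is importable. The function name, the None return value for "no preview backend", and the ValueError on unknown values are assumptions made for illustration; the documentation above does not specify those behaviors.

```python
import importlib.util
import os


def resolve_preview_backend(environ=None, pdfium_available=None):
    """Return "pdfium", or None when auto mode finds no backend (assumed)."""
    env = os.environ if environ is None else environ
    value = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if value not in ("auto", "pdfium"):
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unsupported preview backend: {value!r}")
    if value == "pdfium":
        return "pdfium"
    if pdfium_available is None:
        # find_spec reports whether pypdfium2 is installed without importing it.
        pdfium_available = importlib.util.find_spec("pypdfium2") is not None
    return "pdfium" if pdfium_available else None
```

Passing environ and pdfium_available explicitly makes the resolution easy to test without touching the real process environment.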

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and matching them against templates using Intersection-over-Union (IoU).

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Uses 64-bit Bitwise Packing and CPU __popcnt64 instructions to evaluate millions of pixel comparisons in milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
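
To make the bit-packing idea concrete, here is a small Python sketch of IoU template matching on packed bitmaps. It is an illustration of the technique only, not the C++ engine: Python's arbitrary-precision integers stand in for 64-bit words, bin(x).count("1") stands in for the hardware popcount, and the helper names and tiny 3x3 templates are made up for the example.

```python
def pack_bitmap(rows):
    """Pack rows of '0'/'1' characters into a single integer bitmask."""
    bits = 0
    for row in rows:
        for ch in row:
            bits = (bits << 1) | int(ch == "1")
    return bits


def popcount(x):
    """Count set bits (the role POPCOUNT plays in native code)."""
    return bin(x).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps of equal dimensions."""
    union = popcount(a | b)
    if union == 0:
        return 0.0
    return popcount(a & b) / union


def best_match(glyph, templates):
    """Return the template name with the highest IoU against `glyph`."""
    return max(templates, key=lambda name: iou(glyph, templates[name]))


# Hypothetical 3x3 templates, far smaller than real glyph bitmaps.
TEMPLATES = {
    "I": pack_bitmap(["010", "010", "010"]),
    "L": pack_bitmap(["100", "100", "111"]),
}

# A noisy rendering of "I" with one extra pixel set.
noisy = pack_bitmap(["010", "010", "011"])
print(best_match(noisy, TEMPLATES))  # prints "I"
```

Because the AND and OR operate on whole words at once, each comparison costs a handful of bitwise operations plus two popcounts rather than a per-pixel loop.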

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Accepts a file path or raw bytes in memory (Zero-Disk mode) and initializes the C++ core immediately. If encryption is detected (e.g., RC4 or AES), it falls back to an automatic decryption routine that runs in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches whose size scales with the number of CPU cores (hardware concurrency). The work runs outside the Python GIL, and batching prevents thread exhaustion (EAGAIN) on very large PDFs of 5000+ pages.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_pages_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
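
The batching idea behind get_all_text() can be illustrated in pure Python. This is only an analogue: the real work happens in C++ threads outside the GIL, whereas Python threads here merely demonstrate how bounding each batch by the core count caps the number of concurrent tasks. The names page_batches, extract_all, and extract_page are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, batch_size):
    """Split range(page_count) into consecutive chunks of at most batch_size."""
    return [
        range(start, min(start + batch_size, page_count))
        for start in range(0, page_count, batch_size)
    ]


def extract_all(page_count, extract_page, workers=None):
    """Run extract_page(index) for every page, one bounded batch at a time."""
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in page_batches(page_count, workers):
            # At most `workers` tasks are in flight, so thread usage stays
            # bounded no matter how many pages the document has.
            texts.extend(pool.map(extract_page, batch))
    return texts
```

For example, extract_all(5000, some_function) never schedules more than one batch of os.cpu_count() pages at once, which is the property that avoids spawning thousands of threads.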

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
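
The relationship between samples, stride, and n can be shown with a short sketch. It assumes a row-major buffer with interleaved channels, which is the usual layout for such pixmaps but is not stated explicitly above; pixel_at is a hypothetical helper, not the library's pixel() method.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Read one pixel from a row-major, channel-interleaved byte buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Each row starts at y * stride; each pixel occupies n bytes.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA buffer: 4 channels per pixel, 8 bytes per row.
demo = bytes(range(16))
print(pixel_at(demo, 2, 2, 4, 8, 1, 1))  # prints (12, 13, 14, 15)
```

Using stride rather than width * n for the row offset matters when rows are padded for alignment.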

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
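
For readers unfamiliar with rectangle intersection via the & operator, the following stand-alone sketch shows one way it can work. SketchRect is a hypothetical stand-in written for this example, not the library's Rect; details such as how empty rectangles are represented are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def is_empty(self):
        # Assumption: a rectangle with no positive area counts as empty.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # The overlap starts at the larger of the two minimum corners
        # and ends at the smaller of the two maximum corners.
        return SketchRect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )


overlap = SketchRect(0, 0, 10, 10) & SketchRect(5, 5, 20, 20)
print(overlap)  # prints SketchRect(x0=5, y0=5, x1=10, y1=10)
```

When the inputs do not overlap, the resulting corners cross over, so is_empty reports True.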

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
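
The signature check used by the global document cache (file size plus modification time in nanoseconds) can be sketched like this. SignatureCache and its opener callable are hypothetical names for illustration; the library's own cache internals are not documented here.

```python
import os


class SignatureCache:
    """Cache opened objects per path, invalidated when the file changes."""

    def __init__(self, opener):
        # `opener` is a hypothetical callable that turns a path into a
        # document-like object.
        self._opener = opener
        self._entries = {}

    @staticmethod
    def signature(path):
        st = os.stat(path)
        return (st.st_size, st.st_mtime_ns)

    def open(self, path):
        key = os.path.abspath(path)
        sig = self.signature(key)
        cached = self._entries.get(key)
        if cached is not None and cached[0] == sig:
            return cached[1]  # cache hit: file is unchanged
        obj = self._opener(key)  # miss or stale entry: reopen
        self._entries[key] = (sig, obj)
        return obj
```

Comparing both size and nanosecond mtime is cheap (a single stat call) and catches most in-place modifications without hashing the file contents.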

Logging

WinnerZ uses standard Python logging under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can attach handlers, levels, and formatters much as you would with PyMuPDF.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)
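
Because the logger is a standard one, the usual handler configuration applies. The sketch below attaches a stream handler with a custom format; the format string and the choice of stderr are arbitrary examples, and nothing here is specific to WinnerZ beyond the "winnerz" logger name.

```python
import logging
import sys

logger = logging.getLogger("winnerz")
logger.setLevel(logging.INFO)

# Send winnerz messages to stderr with a timestamped format.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)

# Stop records from also reaching the root logger's handlers.
logger.propagate = False
```

Setting propagate to False is optional; leave it at the default if the application already configures a root handler and you want library messages to flow through it.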

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ was substantially faster than PyMuPDF (fitz) in the bulk text extraction test below.

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)

C++ Micro-OCR Benchmark

Tested on a 100% text-obfuscated PDF file (forcing the system to run Micro-OCR on every character):

  • 🐢 Traditional OCR (Tesseract): ~3 - 5 seconds / page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds / page (roughly 9x to 15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, preview rendering, and all in-memory editing and redaction operations (including high-speed C-level XObject merging).

Download files

Source Distribution

  • winnerz-1.2.10.tar.gz (9.0 MB): Source

Built Distributions

  • winnerz-1.2.10-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.2.10-cp312-cp312-manylinux_2_28_x86_64.whl (7.4 MB): CPython 3.12, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.10-cp312-cp312-macosx_11_0_arm64.whl (1.4 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.2.10-cp312-cp312-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.2.10-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.2.10-cp311-cp311-manylinux_2_28_x86_64.whl (6.4 MB): CPython 3.11, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.10-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.2.10-cp311-cp311-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.2.10-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.2.10-cp310-cp310-manylinux_2_28_x86_64.whl (5.4 MB): CPython 3.10, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.10-cp310-cp310-macosx_11_0_arm64.whl (1.4 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.2.10-cp310-cp310-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.2.10-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.2.10-cp39-cp39-manylinux_2_28_x86_64.whl (4.5 MB): CPython 3.9, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.10-cp39-cp39-macosx_11_0_arm64.whl (1.4 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.2.10-cp39-cp39-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.9, macOS 10.9+ x86-64