A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, dynamic loading of the core library, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A built-in OCR engine written in pure C++ that activates automatically when it encounters corrupted or missing ToUnicode tables. It uses 64-bit bitwise packing and the hardware POPCOUNT instruction for fast template matching, with no external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
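The thread-safety and retry guarantees listed above can be sketched in pure Python. This is only an illustration under stated assumptions, not the actual body of _load_core(): the load_once helper, its importer parameter, and the delay value are hypothetical, and only the lock and the retry count of 3 come from the description.

```python
import threading
import time

_CORE_IMPORT_RETRIES = 3       # retry count named in the text
_core_lock = threading.Lock()
_core = None                   # result of the first successful load


def load_once(importer, delay=0.01):
    """Call importer() until it succeeds, at most _CORE_IMPORT_RETRIES times.

    `importer` is a hypothetical zero-argument callable standing in for the
    real binary import. Holding the lock for the whole attempt means that
    concurrent callers initialize the core only once.
    """
    global _core
    with _core_lock:
        if _core is not None:
            return _core
        last_error = None
        for attempt in range(_CORE_IMPORT_RETRIES):
            try:
                _core = importer()
                return _core
            except (ImportError, OSError) as exc:
                last_error = exc
                if attempt < _CORE_IMPORT_RETRIES - 1:
                    time.sleep(delay)  # let a transient filesystem issue clear
        raise ImportError(
            f"core failed to load after {_CORE_IMPORT_RETRIES} attempts"
        ) from last_error
```

Chaining the last exception with `from` keeps the original loader error visible in the traceback, which is in the spirit of the diagnostic reporting described above.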

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
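As a rough sketch of how such a setting could be resolved, the helper below reads the variable and applies the auto rule. The function name, the pdfium_available flag, the case normalization, and the choice to treat unknown values as auto are all assumptions made for illustration; the library's own handling may differ.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")   # values listed in the text


def resolve_preview_backend(environ=None, pdfium_available=True):
    """Return the preview backend name, or None when no backend is usable.

    `environ` defaults to os.environ; passing a dict makes the logic easy
    to exercise without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    requested = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if requested not in VALID_BACKENDS:
        requested = "auto"            # assumption: unknown values mean "auto"
    if requested == "pdfium":
        return "pdfium"               # explicit request, availability not checked
    return "pdfium" if pdfium_available else None
```

In a shell, the variable would be set before starting Python, for example with `export WINNERZ_PREVIEW_BACKEND=pdfium`.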

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and performing Intersection-over-Union (IoU) template matching.

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Packs glyph bitmaps into 64-bit words and uses the CPU population-count instruction (via the __popcnt64 intrinsic) to evaluate millions of pixel comparisons in milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
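The bit-packed IoU matching described above can be illustrated in a few lines of Python. This is a sketch of the general technique, not the engine's code: it uses arbitrary-size Python integers in place of 64-bit words, counts bits in software rather than with hardware POPCNT, and the 3x3 templates and all function names are invented for the example.

```python
def pack_bitmap(rows):
    """Pack rows of '#' (ink) and '.' (blank) into one integer, 1 bit per pixel."""
    bits = 0
    for row in rows:
        for ch in row:
            bits = (bits << 1) | (ch == "#")
    return bits


def popcount(value):
    """Count set bits (a software stand-in for the hardware instruction)."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps of equal dimensions."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def best_match(glyph, templates):
    """Return (label, score) for the template with the highest IoU."""
    scored = ((label, iou(glyph, bits)) for label, bits in templates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical 3x3 templates and a glyph missing one pixel of the "L".
TEMPLATES = {
    "L": pack_bitmap(["#..", "#..", "###"]),
    "T": pack_bitmap(["###", ".#.", ".#."]),
}
NOISY_L = pack_bitmap(["#..", "#..", "##."])

print(best_match(NOISY_L, TEMPLATES))  # ('L', 0.8)
```

The intersection and union each reduce to a single bitwise operation followed by a bit count, which is why the approach maps well onto 64-bit words and a population-count instruction.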

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw memory bytes (Zero-Disk mode) and instantly initializes the C++ core. If encryption is detected (e.g. RC4/AES), it falls back to an automatic decryption routine seamlessly in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): A highly optimized utility that uses C++ multi-threading to extract text from all pages. It processes pages in batches sized from the detected hardware concurrency, so the batch size scales automatically with the number of CPU cores. Because the work runs in native threads, it is not serialized by the Python GIL, and the bounded batch size prevents thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_pages_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
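The batching idea behind get_all_text() can be sketched as follows. This shows only the shape of the scheduling: the real work is described as running in C++ via std::async, whereas Python threads like the ones here still contend for the GIL. The extract_page callable and both function names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunked(indices, size):
    """Yield consecutive slices of `indices` holding at most `size` items."""
    for start in range(0, len(indices), size):
        yield indices[start:start + size]


def extract_all(page_count, extract_page, workers=None):
    """Extract every page in batches no larger than the worker count.

    `extract_page` is a hypothetical callable mapping a page index to text.
    Capping each batch at `workers` bounds the number of tasks in flight,
    which is the property the text credits with avoiding EAGAIN.
    """
    workers = workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in chunked(list(range(page_count)), workers):
            # map() returns results in input order, so page order is kept
            results.extend(pool.map(extract_page, batch))
    return results
```

Sizing batches from os.cpu_count() mirrors the "scaling automatically with the number of CPU cores" behaviour described above without spawning one thread per page.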

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
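To make the keep_proportion option of show_pdf_page concrete, the sketch below computes where a source page would land inside a target rectangle. It is an illustration of aspect-ratio fitting in general; the function name is hypothetical, and centering the scaled page inside the rectangle is an assumption rather than documented behaviour.

```python
def fit_rect(target, src_width, src_height, keep_proportion=True):
    """Return (x0, y0, x1, y1) for a source page placed inside `target`.

    `target` is an (x0, y0, x1, y1) tuple. With keep_proportion the page is
    scaled uniformly by the smaller of the two axis ratios, so it fits
    entirely inside the rectangle.
    """
    x0, y0, x1, y1 = target
    width, height = x1 - x0, y1 - y0
    if not keep_proportion:
        return (x0, y0, x1, y1)       # stretch to fill the whole rectangle
    scale = min(width / src_width, height / src_height)
    new_w, new_h = src_width * scale, src_height * scale
    left = x0 + (width - new_w) / 2   # assumption: center horizontally
    top = y0 + (height - new_h) / 2   # assumption: center vertically
    return (left, top, left + new_w, top + new_h)


print(fit_rect((0, 0, 200, 100), 100, 100))  # (50.0, 0.0, 150.0, 100.0)
```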

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
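The relationship between samples, stride, and n can be shown with a small lookup helper. This sketch assumes a row-major buffer with interleaved channels, which is the usual layout for such pixmaps but is not stated explicitly in the text; pixel_at is a hypothetical stand-in, not the library's pixel() method.

```python
def pixel_at(samples, stride, n, x, y):
    """Return the n channel values of pixel (x, y) as a tuple.

    Each row occupies `stride` bytes (which may include padding beyond
    width * n), and each pixel occupies `n` consecutive bytes.
    """
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA image: red, green on the first row; blue, white on the second.
SAMPLES = bytes([
    255, 0, 0, 255,     0, 255, 0, 255,
    0, 0, 255, 255,     255, 255, 255, 255,
])

print(pixel_at(SAMPLES, stride=8, n=4, x=0, y=1))  # (0, 0, 255, 255)
```

Using stride rather than width * n for the row offset keeps the lookup correct when rows carry trailing padding bytes.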

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
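The & overload on Rect can be pictured with a small stand-in class. SketchRect below is hypothetical and written only to show how an intersection operator and the width, height, and is_empty properties fit together; clamping negative extents to zero is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)   # assumption: never negative

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)   # assumption: never negative

    @property
    def is_empty(self):
        return self.width <= 0 or self.height <= 0

    def __and__(self, other):
        """Intersection: the region covered by both rectangles."""
        return SketchRect(
            max(self.x0, other.x0), max(self.y0, other.y0),
            min(self.x1, other.x1), min(self.y1, other.y1),
        )


a = SketchRect(0, 0, 10, 10)
b = SketchRect(5, 5, 20, 20)
print(a & b)  # SketchRect(x0=5, y0=5, x1=10, y1=10)
```

Two rectangles that do not overlap produce a result whose is_empty property is true, which gives callers a cheap overlap test.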

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
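The signature check used by the global document cache can be sketched with os.stat. The cache dictionary, the function names, and the loader callable are hypothetical; only the idea of validating a hit by file size and modification time in nanoseconds comes from the description above.

```python
import os

_cache = {}   # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    info = os.stat(path)
    return (info.st_size, info.st_mtime_ns)


def cached_open(path, loader):
    """Return the cached object for `path`, reloading if the file changed.

    `loader` is a hypothetical callable that builds the object from a path.
    """
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]                 # file unchanged: reuse the cached object
    obj = loader(key)
    _cache[key] = (signature, obj)    # new or changed file: replace the entry
    return obj
```

A signature of this kind is cheap to compute because it needs only a stat call, at the cost of missing edits that preserve both the size and the timestamp.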

Logging

WinnerZ uses the standard Python logging module under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure handlers and levels much as you would for pymupdf.

import logging
logging.basicConfig()  # attach a handler so DEBUG records are actually emitted
logging.getLogger("winnerz").setLevel(logging.DEBUG)

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ was noticeably faster than PyMuPDF (fitz) in the bulk text extraction test below.

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)

C++ Micro-OCR Benchmark

Tested on a fully text-obfuscated PDF file (forcing the system to Micro-OCR every character):

  • 🐢 Traditional OCR (Tesseract): ~3 to 5 seconds / page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds / page (roughly 9x to 15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
