A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic loading of the core library, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A pure C++ built-in OCR engine that activates automatically when it encounters corrupted or missing ToUnicode tables. It uses 64-bit bitwise packing and the hardware population-count (POPCNT) instruction for fast template matching, without external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
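
The thread-safety and retry guarantees above can be sketched in plain Python. This is only an illustration of the lock-plus-retry pattern, not the code inside _load_core(): the CoreLoader class, its importer and delay parameters, and the back-off between attempts are assumptions made for the example. Only the module name winnerz_core and the retry count of 3 come from the text.

```python
import importlib
import threading
import time


class CoreLoader:
    """Hypothetical once-only loader with a bounded retry loop."""

    def __init__(self, module_name="winnerz_core",
                 importer=importlib.import_module, retries=3, delay=0.05):
        self._module_name = module_name
        self._importer = importer      # injectable so the sketch can be tested
        self._retries = retries        # mirrors _CORE_IMPORT_RETRIES = 3
        self._delay = delay            # assumed back-off, not from the source
        self._lock = threading.Lock()
        self._module = None

    def load(self):
        # The lock guarantees that concurrent callers trigger one import only.
        with self._lock:
            if self._module is not None:
                return self._module
            last_error = None
            for attempt in range(1, self._retries + 1):
                try:
                    self._module = self._importer(self._module_name)
                    return self._module
                except (ImportError, OSError) as exc:
                    last_error = exc
                    if attempt < self._retries:
                        time.sleep(self._delay * attempt)
            raise ImportError(
                f"could not load {self._module_name!r} "
                f"after {self._retries} attempts"
            ) from last_error
```

Passing a custom importer makes it easy to simulate a transient failure in a test without touching the real binary.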

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
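
A minimal sketch of how such a setting could be resolved follows. The function name resolve_preview_backend, its parameters, the None return value for "no preview backend", and the error handling for invalid values are all assumptions for illustration; the source only states the variable name, its two valid values, and that auto uses PDFium when available.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium")


def resolve_preview_backend(environ=None, pdfium_available=None):
    """Return "pdfium" or None (hypothetical helper, not the library API)."""
    env = os.environ if environ is None else environ
    choice = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower() or "auto"
    if choice not in VALID_BACKENDS:
        # Assumed behaviour: reject unknown values instead of guessing.
        raise ValueError(f"unsupported WINNERZ_PREVIEW_BACKEND: {choice!r}")

    if pdfium_available is None:
        # pypdfium2 is optional, so probe for it without importing it.
        pdfium_available = importlib.util.find_spec("pypdfium2") is not None

    if choice == "pdfium" and not pdfium_available:
        raise RuntimeError("pdfium backend requested but pypdfium2 is missing")
    return "pdfium" if pdfium_available else None
```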

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and matching them against templates by Intersection-over-Union (IoU).

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Packs glyph bitmaps into 64-bit words and counts set bits with the CPU's population-count instruction (for example via the __popcnt64 intrinsic), so millions of pixel comparisons take milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
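
To make the matching idea concrete, here is a small Python sketch of bit-packed IoU template matching. It is not the C++ engine: a Python integer stands in for the array of 64-bit words, bin(...).count("1") stands in for the hardware popcount, and the 5x5 templates and helper names are made up for the example.

```python
def pack_bitmap(rows):
    """Pack rows of '#' (ink) and '.' (blank) into one integer, one bit per pixel."""
    bits = 0
    for row in rows:
        for ch in row:
            bits = (bits << 1) | (ch == "#")
    return bits


def popcount(value):
    """Count set bits; the native engine would use a POPCNT instruction here."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps of equal size."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def best_match(glyph, templates):
    """Return the name of the template with the highest IoU against glyph."""
    return max(templates, key=lambda name: iou(glyph, templates[name]))


# Illustrative 5x5 templates (hypothetical, far smaller than real glyph data).
TEMPLATES = {
    "I": pack_bitmap(["..#..", "..#..", "..#..", "..#..", "..#.."]),
    "L": pack_bitmap(["#....", "#....", "#....", "#....", "#####"]),
    "T": pack_bitmap(["#####", "..#..", "..#..", "..#..", "..#.."]),
}

# A "T" whose last pixel row was lost during rendering.
NOISY_T = pack_bitmap(["#####", "..#..", "..#..", "..#..", "....."])
```

Because intersection and union reduce to a bitwise AND and OR followed by a bit count, one comparison touches each packed word only twice, which is what makes the approach cheap in native code.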

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw memory bytes (Zero-Disk mode) and instantly initializes the C++ core. If encryption is detected (e.g. RC4/AES), it falls back to an automatic decryption routine seamlessly in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): A highly optimized utility that utilizes C++ multi-threading to extract text from all pages. It uses a dynamic hardware-concurrency batching mechanism to process pages in chunks (scaling automatically with the number of CPU cores). This entirely bypasses the Python GIL and prevents thread-exhaustion (EAGAIN) on massive 5000+ page PDFs.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_pages_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
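
The batching idea behind get_all_text() can be illustrated as follows. This sketch only shows how pages might be grouped into chunks sized by the CPU count so that the number of live workers stays bounded. It runs in Python threads, so unlike the C++ pipeline it does not bypass the GIL, and the names page_batches, extract_all, and extract_page are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, workers=None):
    """Yield ranges of page indices, at most `workers` pages per batch."""
    workers = workers or os.cpu_count() or 1
    for start in range(0, page_count, workers):
        yield range(start, min(start + workers, page_count))


def extract_all(page_count, extract_page, workers=None):
    """Run extract_page(index) batch by batch and keep results in page order."""
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in page_batches(page_count, workers):
            # At most `workers` tasks are in flight, however long the document.
            texts.extend(pool.map(extract_page, batch))
    return texts
```

For example, extract_all(5000, extract_page) on an 8-core machine never schedules more than 8 pages at once, which is the property that avoids thread exhaustion on very long documents.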

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
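
The scaling that keep_proportion implies in show_pdf_page() can be worked through with a short sketch. The helper fit_rect and its tuple-based rectangles are hypothetical, and centring the scaled page inside the target is an assumption; the source only says the page is scaled to fit rect while optionally keeping its aspect ratio.

```python
def fit_rect(target, src_width, src_height, keep_proportion=True):
    """Return the (x0, y0, x1, y1) area a source page would occupy in target."""
    x0, y0, x1, y1 = target
    target_w, target_h = x1 - x0, y1 - y0
    if src_width <= 0 or src_height <= 0:
        raise ValueError("source page must have positive dimensions")
    if not keep_proportion:
        # Stretch independently on both axes to fill the whole target.
        return (x0, y0, x1, y1)

    # One uniform scale factor: the largest that still fits on both axes.
    scale = min(target_w / src_width, target_h / src_height)
    width, height = src_width * scale, src_height * scale

    # Assumed placement: centre the scaled page inside the target rectangle.
    left = x0 + (target_w - width) / 2
    top = y0 + (target_h - height) / 2
    return (left, top, left + width, top + height)
```

A square 100 x 100 source placed into a 200 x 100 target keeps a scale of 1 and is centred horizontally, giving (50, 0, 150, 100).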

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
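
The relationship between width, n, stride, and samples is easiest to see in code. The PixmapSketch class below is a stand-in written for this example, not the library's Pixmap; it assumes tightly packed, row-major samples with n bytes per pixel.

```python
from dataclasses import dataclass


@dataclass
class PixmapSketch:
    """Hypothetical stand-in showing how pixel lookup follows from stride and n."""
    width: int
    height: int
    n: int            # channels per pixel, e.g. 4 for RGBA
    samples: bytes    # raw pixel data, row after row
    stride: int = 0   # bytes per row; 0 means "tightly packed"

    def __post_init__(self):
        if self.stride == 0:
            self.stride = self.width * self.n
        if len(self.samples) < self.stride * self.height:
            raise ValueError("samples buffer is smaller than stride * height")

    def pixel(self, x, y):
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise IndexError("pixel coordinates out of range")
        offset = y * self.stride + x * self.n
        return tuple(self.samples[offset:offset + self.n])


# A 2 x 2 RGBA image: red, green on the first row; blue, transparent white below.
DEMO = PixmapSketch(
    width=2, height=2, n=4,
    samples=bytes([255, 0, 0, 255, 0, 255, 0, 255,
                   0, 0, 255, 255, 255, 255, 255, 0]),
)
```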

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
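
As an illustration of the & overload, the sketch below defines a small stand-in class. RectSketch is hypothetical and only mirrors the properties named above; in particular, how the real Rect represents a non-overlapping intersection is not documented here, so returning a rectangle whose is_empty is true is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectSketch:
    """Hypothetical stand-in for Rect(x0, y0, x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the larger of the two minima, the smaller of the maxima.
        return RectSketch(
            max(self.x0, other.x0), max(self.y0, other.y0),
            min(self.x1, other.x1), min(self.y1, other.y1),
        )
```

With these definitions, RectSketch(0, 0, 100, 100) & RectSketch(50, 50, 150, 150) yields RectSketch(50, 50, 100, 100), and two disjoint rectangles yield a result whose is_empty is true.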

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
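
The signature check used by the global document cache can be sketched like this. The SignatureCache class and its factory parameter are hypothetical; only the idea of validating a hit by file size and nanosecond modification time comes from the text.

```python
import os
import threading


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds) for path."""
    stat = os.stat(path)
    return (stat.st_size, stat.st_mtime_ns)


class SignatureCache:
    """Hypothetical cache that reuses an object while its file is unchanged."""

    def __init__(self):
        self._entries = {}
        self._lock = threading.Lock()

    def open(self, path, factory):
        key = os.path.abspath(path)
        signature = file_signature(key)
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] == signature:
                return entry[1]          # cache hit: same size and mtime
            document = factory(key)      # miss or stale entry: rebuild
            self._entries[key] = (signature, document)
            return document
```

Comparing both size and mtime_ns catches rewrites that keep the length the same as well as rewrites that happen within the same second.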

Logging

WinnerZ uses standard Python logging under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure handlers and levels as you would for any other library logger, similar to PyMuPDF.

import logging
logging.basicConfig()  # attach a handler so DEBUG records are actually emitted
logging.getLogger("winnerz").setLevel(logging.DEBUG)

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ was faster than PyMuPDF (fitz) in the bulk text extraction test below.

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)

C++ Micro-OCR Benchmark

Tested on a 100% text-obfuscated PDF file (forcing Micro-OCR on every character):

  • 🐢 Traditional OCR (Tesseract): ~3 - 5 seconds / page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds / page (roughly 9x to 15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
