A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A built-in OCR engine written in pure C++ that activates automatically when it encounters corrupted or missing ToUnicode tables. It uses 64-bit bitwise packing and the hardware POPCNT instruction for fast template matching, with no external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
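
The lock-plus-retry behaviour described above can be sketched as follows. This is a minimal illustration and not the library's actual _load_core(): the function name load_core, the retry_delay parameter, and the injectable importer argument are assumptions made for the example. Only the retry count of 3 and the use of threading.Lock() come from the text.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3      # value quoted in the documentation above
_core_lock = threading.Lock()
_core_module = None           # cached once loading succeeds


def load_core(module_name="winnerz_core", retry_delay=0.05,
              importer=importlib.import_module):
    """Import the core module at most once, retrying transient failures.

    Hypothetical sketch: `importer` is injectable only so the retry path
    can be exercised without the real binary.
    """
    global _core_module
    with _core_lock:                      # only one thread initializes
        if _core_module is not None:
            return _core_module
        last_error = None
        for attempt in range(1, _CORE_IMPORT_RETRIES + 1):
            try:
                _core_module = importer(module_name)
                return _core_module
            except (ImportError, OSError) as exc:
                last_error = exc
                if attempt < _CORE_IMPORT_RETRIES:
                    time.sleep(retry_delay)   # brief pause before retrying
        raise ImportError(
            f"could not load {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts: {last_error}"
        ) from last_error
```

Holding the lock for the whole retry loop keeps concurrent callers from racing to import the same binary; they simply wait and then receive the cached module.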

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
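
As a rough illustration of how such a setting might be resolved, the sketch below reads WINNERZ_PREVIEW_BACKEND and applies the auto rule. The helper name resolve_preview_backend, the None return when PDFium is unavailable, and the ValueError on unknown values are assumptions for this example, not documented library behaviour.

```python
import importlib.util
import os


def resolve_preview_backend(environ=os.environ):
    """Return the backend name implied by WINNERZ_PREVIEW_BACKEND.

    Hypothetical helper: returns "pdfium", or None when "auto" is
    requested and pypdfium2 cannot be found.
    """
    value = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if value not in ("auto", "pdfium"):
        raise ValueError(f"unsupported WINNERZ_PREVIEW_BACKEND: {value!r}")
    if value == "pdfium":
        return "pdfium"
    # "auto": use PDFium only when the pypdfium2 package is importable.
    has_pdfium = importlib.util.find_spec("pypdfium2") is not None
    return "pdfium" if has_pdfium else None
```

In practice the variable would typically be set before the library renders its first preview, for example with os.environ["WINNERZ_PREVIEW_BACKEND"] = "pdfium" at program start.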

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and performing Intersection-over-Union (IoU) template matching.

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Packs glyph bitmaps into 64-bit words and counts matching pixels with the CPU POPCNT instruction (via the __popcnt64 intrinsic), so millions of pixel comparisons complete in milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
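
The idea behind bit-packed IoU matching can be sketched in a few lines of Python. This is an illustration of the technique only, not the C++ engine: Python integers stand in for the 64-bit words, bin(...).count("1") stands in for the hardware population count, and the tiny 3x3 templates and all function names are made up for the example.

```python
def pack_bitmap(rows):
    """Pack equal-length '0'/'1' strings into a single integer, one bit per pixel."""
    bits = 0
    for row in rows:
        for ch in row:
            bits = (bits << 1) | (ch == "1")
    return bits


def popcount(value):
    """Count set bits (the C++ core is described as using hardware POPCNT)."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps of the same size."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def best_match(glyph, templates):
    """Return (label, score) for the template with the highest IoU."""
    return max(((label, iou(glyph, bits)) for label, bits in templates.items()),
               key=lambda item: item[1])


# Illustrative 3x3 templates; real templates would be far larger.
TEMPLATES = {
    "I": pack_bitmap(["010", "010", "010"]),
    "L": pack_bitmap(["100", "100", "111"]),
}
glyph = pack_bitmap(["010", "010", "011"])   # an "I" with one stray pixel
print(best_match(glyph, TEMPLATES))          # ('I', 0.75)
```

Because intersection and union reduce to one AND, one OR, and two bit counts per word, comparing a glyph against a template costs a handful of instructions per 64 pixels.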

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw memory bytes (Zero-Disk mode) and instantly initializes the C++ core. If encryption is detected (e.g. RC4/AES), it falls back to an automatic decryption routine seamlessly in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches whose size is derived from the hardware concurrency, so it scales with the number of CPU cores. This bypasses the Python GIL and avoids thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
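
The batching idea behind get_all_text() can be illustrated in plain Python. The sketch below is an analogy only: the real work is described as happening in C++ outside the GIL, whereas Python threads here remain GIL-bound. The function name extract_in_batches and the extract_page callback are assumptions for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def extract_in_batches(page_count, extract_page, batch_size=None):
    """Run extract_page(i) for every page, at most batch_size at a time.

    Hypothetical sketch: batch_size defaults to the CPU count, mirroring
    the "hardware-concurrency" batching described above.
    """
    if batch_size is None:
        batch_size = os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, page_count, batch_size):
            chunk = range(start, min(start + batch_size, page_count))
            # map() yields results in page order within each chunk.
            results.extend(pool.map(extract_page, chunk))
    return results
```

Capping the number of in-flight tasks per chunk is what keeps a 5000-page document from spawning thousands of threads at once.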

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Removes all vector graphics and the text layer from the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
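
A short usage sketch of the Page methods listed above. It relies only on the documented calls (indexing a document, get_text, get_pixmap, rect, and the Pixmap width and height properties) and takes the document as a parameter, so it makes no assumption about how the document was opened. The helper name page_report is hypothetical.

```python
def page_report(doc, index=0, zoom_matrix=None):
    """Collect basic facts about one page of a Document-like object."""
    page = doc[index]                       # 0-based; negative indexes allowed
    text = page.get_text(mode="text")       # other modes: dict, rawdict, blocks
    pix = page.get_pixmap(matrix=zoom_matrix)
    return {
        "pages": len(doc),
        "text": text,
        "page_size": (page.rect.width, page.rect.height),
        "pixmap_size": (pix.width, pix.height),
    }
```

With the real library this would presumably be called with a document returned by winnerz.open(path) or winnerz.Document(path), followed by doc.close() when finished.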

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
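
The relationship between samples, stride, and n can be made concrete with a small lookup function. This is a sketch of the usual row-major layout for such buffers, assuming one byte per channel; it is not taken from the library's source, and the name pixel_at is made up for the example.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of pixel (x, y) from a raw sample buffer.

    Assumes row-major layout with `stride` bytes per row and one byte per
    channel; stride may exceed width * n when rows are padded.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) outside {width}x{height} image")
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```

Using stride rather than width * n for the row offset is what keeps the lookup correct when rows carry padding bytes.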

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
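
For readers unfamiliar with operator overloading on geometry types, the following sketch shows one way an intersection via & could behave. The class is deliberately named SketchRect to make clear it is an illustration and not the library's Rect; the clamping of width and height to zero is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    """Illustrative rectangle; not the library's Rect implementation."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the innermost edge on each side.
        return SketchRect(max(self.x0, other.x0), max(self.y0, other.y0),
                          min(self.x1, other.x1), min(self.y1, other.y1))
```

Two rectangles that do not overlap yield a result whose is_empty is true, which gives callers a simple overlap test: not (a & b).is_empty.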

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    [!TIP] If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
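
The signature-based validation used by the global document cache can be sketched as below. The function names, the module-level dictionary, and the factory callback are assumptions for the example; only the use of file size and nanosecond modification time as the signature comes from the text.

```python
import os

_document_cache = {}   # absolute path -> (signature, document)


def file_signature(path):
    """Return (size, mtime in nanoseconds) for cache validation."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached document unless the file changed on disk.

    Hypothetical sketch: `factory` builds a new document from a path.
    """
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _document_cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]                       # cache hit: file unchanged
    document = factory(key)                 # miss or stale: rebuild
    _document_cache[key] = (signature, document)
    return document
```

Comparing both size and modification time catches most in-place edits cheaply, without hashing the file contents on every open.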

Logging

WinnerZ uses the standard Python logging module under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure levels and handlers in a manner similar to pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)
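
Setting the level alone does not produce output unless a handler is attached somewhere in the logger hierarchy. The sketch below attaches a formatted stream handler to the winnerz logger using only the standard library; the helper name configure_winnerz_logging and the marker attribute used to avoid duplicate handlers are choices made for this example.

```python
import logging
import sys


def configure_winnerz_logging(level=logging.INFO, stream=sys.stderr):
    """Attach one formatted stream handler to the "winnerz" logger."""
    logger = logging.getLogger("winnerz")
    logger.setLevel(level)
    # Avoid stacking duplicate handlers when called more than once.
    if not any(getattr(h, "_winnerz_sketch", False) for h in logger.handlers):
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        handler._winnerz_sketch = True
        logger.addHandler(handler)
    return logger
```

Calling configure_winnerz_logging(logging.DEBUG) once at start-up is enough; later calls only adjust the level.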

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ outperforms PyMuPDF (fitz) in bulk text extraction in the benchmark below.

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)

C++ Micro-OCR Benchmark

Tested on a PDF whose text is 100% obfuscated (forcing Micro-OCR to scan every character):

  • 🐢 Traditional OCR (Tesseract): ~3 to 5 seconds / page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds / page (roughly 9x to 15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all in-memory editing and redaction operations (including high-speed C-level XObject merging).
  • Pillow: Optional. Required by Pixmap.tobytes() for output formats other than raw.

Download files

Source Distribution

  • winnerz-1.2.5.tar.gz (84.9 MB): Source

Built Distributions

  • winnerz-1.2.5-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.2.5-cp312-cp312-manylinux_2_28_x86_64.whl (7.3 MB): CPython 3.12, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.5-cp312-cp312-macosx_11_0_arm64.whl (1.4 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.2.5-cp312-cp312-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.2.5-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.2.5-cp311-cp311-manylinux_2_28_x86_64.whl (6.4 MB): CPython 3.11, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.5-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.2.5-cp311-cp311-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.2.5-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.2.5-cp310-cp310-manylinux_2_28_x86_64.whl (5.4 MB): CPython 3.10, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.5-cp310-cp310-macosx_11_0_arm64.whl (1.4 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.2.5-cp310-cp310-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.2.5-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.2.5-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.5-cp39-cp39-macosx_11_0_arm64.whl (1.4 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.2.5-cp39-cp39-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.9, macOS 10.9+ x86-64
