A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A pure C++ built-in OCR engine that activates automatically when it encounters corrupted or missing ToUnicode tables. It uses 64-bit bitwise packing and hardware POPCOUNT for fast template matching, with no external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
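
The locking and retry behavior described above can be sketched in a few lines. This is a minimal illustration only, not the library's actual _load_core(): the CoreLoader class, its attempts counter, and the delay parameter are hypothetical names introduced here, and the truncation repair and diagnostic steps are left out.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3  # retry count quoted in the list above


class CoreLoader:
    """Hypothetical helper: import one module at most once, with retries."""

    def __init__(self, module_name, retries=_CORE_IMPORT_RETRIES, delay=0.0):
        self.module_name = module_name
        self.retries = retries
        self.delay = delay          # pause between attempts, in seconds
        self.attempts = 0           # number of import attempts made so far
        self._module = None         # cached module after the first success
        self._lock = threading.Lock()

    def load(self):
        # The lock ensures that concurrent callers never import twice.
        with self._lock:
            if self._module is not None:
                return self._module
            last_error = None
            for _ in range(self.retries):
                self.attempts += 1
                try:
                    self._module = importlib.import_module(self.module_name)
                    return self._module
                except ImportError as exc:
                    last_error = exc
                    time.sleep(self.delay)
            raise ImportError(
                f"could not import {self.module_name!r} "
                f"after {self.retries} attempts: {last_error}"
            ) from last_error
```

Calling CoreLoader("json").load() twice returns the same module object and performs a single import; a module that does not exist raises ImportError only after all three attempts have failed.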

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
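
As a rough sketch of how such a setting might be resolved, the function below reads the variable and applies the auto rule. The function name, the pdfium_available flag, and the handling of unknown values and of a missing PDFium install are assumptions made for illustration; the source only states the two valid values and that auto uses PDFium when available.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")  # values listed above


def resolve_preview_backend(environ=None, pdfium_available=True):
    """Hypothetical resolver for WINNERZ_PREVIEW_BACKEND.

    Returns "pdfium" or None (no preview backend usable).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    # Assumption: unrecognized values are treated like the default, "auto".
    choice = raw if raw in VALID_BACKENDS else "auto"
    if choice == "pdfium":
        # An explicit request is passed through unchanged.
        return "pdfium"
    # "auto": use PDFium when available, otherwise report no backend.
    return "pdfium" if pdfium_available else None
```

Passing a plain dictionary instead of os.environ keeps the function easy to exercise in isolation.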

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and performing Intersection-over-Union (IoU) template matching.

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Uses 64-bit bitwise packing and the CPU POPCNT instruction (via the __popcnt64 intrinsic) to evaluate millions of pixel comparisons in milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
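
To make the matching idea concrete, here is a small Python sketch of IoU scoring on bit-packed bitmaps. It is an illustration of the general technique, not the C++ engine: Python integers stand in for the 64-bit words, bin(...).count("1") stands in for the hardware popcount, and the helper names and tiny 3x3 templates are made up for the example.

```python
def pack_bitmap(rows):
    """Pack rows of '0'/'1' characters into one integer, one bit per pixel."""
    return int("".join(rows), 2)


def popcount(value):
    """Count set bits (software stand-in for a hardware popcount)."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection over Union of two packed bitmaps of equal size."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # two empty bitmaps are treated as identical
    return popcount(a & b) / union


def best_match(glyph, templates):
    """Return the template key whose bitmap has the highest IoU with glyph."""
    return max(templates, key=lambda key: iou(glyph, templates[key]))


# Tiny illustrative 3x3 templates (hypothetical, far smaller than real glyphs).
TEMPLATES = {
    "I": pack_bitmap(["010", "010", "010"]),
    "L": pack_bitmap(["100", "100", "111"]),
}

# A slightly noisy "I": one extra pixel in the bottom-right corner.
noisy_glyph = pack_bitmap(["010", "010", "011"])
matched = best_match(noisy_glyph, TEMPLATES)
```

The AND and OR of two packed bitmaps give the intersection and union masks directly, so each comparison costs two bitwise operations and two bit counts regardless of how many pixels are set.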

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw in-memory bytes (Zero-Disk mode) and initializes the C++ core immediately. If encryption is detected (e.g., RC4/AES), it falls back to an automatic decryption routine that runs in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches sized to the hardware concurrency, so the batch size scales with the number of CPU cores. This bypasses the Python GIL and avoids thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
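
The batching idea behind get_all_text() can be sketched in Python as follows. This only shows the shape of the scheduling, under the assumption that pages are handed out in chunks no larger than the worker count; the real work happens in C++ with std::async, and a Python thread pool like this one does not bypass the GIL. The names batched, extract_all, and extract_page are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def batched(indices, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(indices), size):
        yield indices[start:start + size]


def extract_all(page_count, extract_page, workers=None):
    """Run extract_page(index) for every page, one batch at a time.

    Only `workers` pages are in flight at once, so the number of live
    tasks stays bounded even for documents with thousands of pages.
    """
    workers = workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(list(range(page_count)), workers):
            # pool.map preserves input order, so results stay in page order.
            results.extend(pool.map(extract_page, batch))
    return results
```

For example, extract_all(10, lambda i: f"page {i}", workers=3) processes the pages in four batches (3, 3, 3, 1) and returns the ten strings in page order.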

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
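
The effect of keep_proportion in show_pdf_page can be illustrated with a little geometry. The sketch below computes where a source page would land inside a target rectangle; the function name is hypothetical, and centering the scaled page inside the rectangle is an assumption, since the source does not say how leftover space is distributed.

```python
def fit_rect(target, src_width, src_height, keep_proportion=True):
    """Return the (x0, y0, x1, y1) box a source page would occupy.

    target is an (x0, y0, x1, y1) tuple; src_width and src_height are
    the dimensions of the page being placed.
    """
    x0, y0, x1, y1 = target
    if not keep_proportion:
        # Stretch to fill the whole target rectangle.
        return (x0, y0, x1, y1)
    target_w, target_h = x1 - x0, y1 - y0
    # One uniform scale factor: the largest that still fits both axes.
    scale = min(target_w / src_width, target_h / src_height)
    w, h = src_width * scale, src_height * scale
    # Assumption: center the scaled page within the target rectangle.
    left = x0 + (target_w - w) / 2
    top = y0 + (target_h - h) / 2
    return (left, top, left + w, top + h)
```

Placing a square 100 x 100 page into a 200 x 100 rectangle yields a 100 x 100 box centered horizontally, while keep_proportion=False returns the target rectangle unchanged.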

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
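
The relationship between width, n, stride, and samples is easiest to see in a lookup. The following sketch shows how a pixel could be read from a raw buffer laid out row by row; pixel_at is a hypothetical stand-alone function, not the library's pixel() method, and it assumes one byte per channel.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of pixel (x, y) from a raw buffer.

    Each row occupies `stride` bytes and each pixel `n` bytes, so the
    pixel starts at byte offset y * stride + x * n.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA image: 4 channels, 8 bytes per row, 16 bytes in total.
demo_samples = bytes(range(16))
bottom_right = pixel_at(demo_samples, 2, 2, 4, 8, 1, 1)
```

Using stride rather than width * n for the row offset keeps the lookup correct for buffers whose rows are padded.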

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
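
A rectangle type with these properties can be sketched with a dataclass. SketchRect below is a hypothetical stand-in written for illustration, not the library's Rect; in particular, clamping width and height to zero for inverted rectangles is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    """Hypothetical rectangle with corner coordinates (x0, y0) and (x1, y1)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # Empty when it has no area in at least one dimension.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the larger of the mins and the smaller of the maxes.
        return SketchRect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )
```

With this definition, SketchRect(0, 0, 10, 10) & SketchRect(5, 5, 20, 20) gives SketchRect(5, 5, 10, 10), and intersecting two disjoint rectangles produces a result whose is_empty is True.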

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
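
The signature check used by the global document cache can be sketched with os.stat. This is an illustrative approximation rather than the code behind winnerz.open(): the open_cached function, the factory callback, and the module-level dictionary are hypothetical, and no locking or eviction is shown.

```python
import os
import tempfile

_document_cache = {}  # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    stat = os.stat(path)
    return (stat.st_size, stat.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached object for path, rebuilding it if the file changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _document_cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]                      # cache hit: file is unchanged
    obj = factory(key)                     # miss or stale entry: rebuild
    _document_cache[key] = (signature, obj)
    return obj


def demo():
    """Show one cache hit and one invalidation on a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as handle:
            handle.write(b"first")
        first = open_cached(path, factory=lambda p: object())
        second = open_cached(path, factory=lambda p: object())
        with open(path, "ab") as handle:
            handle.write(b" and more")     # size changes, so signature changes
        third = open_cached(path, factory=lambda p: object())
        return (first is second, third is not first)
```

Comparing both size and nanosecond modification time catches most rewrites cheaply, without hashing the file contents.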

Logging

WinnerZ uses the standard Python logging module under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure handlers and levels in the usual way, much as with pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)

Performance Benchmark

In the bulk text extraction benchmark below, WinnerZ's native C++ multi-threading pipeline and persistent object caching gave it a clear lead over PyMuPDF (fitz).

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)

C++ Micro-OCR Benchmark

Tested on a PDF whose text is 100% obfuscated (forcing the system to run Micro-OCR on every character):

  • 🐢 Traditional OCR (Tesseract): ~3-5 seconds per page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds per page (roughly 9x to 15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).

Download files

Source Distribution

  • winnerz-1.2.6.tar.gz (84.9 MB)

Built Distributions

  • winnerz-1.2.6-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.2.6-cp312-cp312-manylinux_2_28_x86_64.whl (7.4 MB): CPython 3.12, manylinux glibc 2.28+ x86-64
  • winnerz-1.2.6-cp312-cp312-macosx_11_0_arm64.whl (1.4 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.2.6-cp312-cp312-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.2.6-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.2.6-cp311-cp311-manylinux_2_28_x86_64.whl (6.4 MB): CPython 3.11, manylinux glibc 2.28+ x86-64
  • winnerz-1.2.6-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.2.6-cp311-cp311-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.2.6-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.2.6-cp310-cp310-manylinux_2_28_x86_64.whl (5.4 MB): CPython 3.10, manylinux glibc 2.28+ x86-64
  • winnerz-1.2.6-cp310-cp310-macosx_11_0_arm64.whl (1.4 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.2.6-cp310-cp310-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.2.6-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.2.6-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux glibc 2.28+ x86-64
  • winnerz-1.2.6-cp39-cp39-macosx_11_0_arm64.whl (1.4 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.2.6-cp39-cp39-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.9, macOS 10.9+ x86-64
