
A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.


WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture prioritizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A built-in OCR engine written in pure C++ that activates automatically when a document's ToUnicode tables are corrupted or missing. It packs glyph bitmaps into 64-bit words and uses the hardware population-count (POPCNT) instruction for fast template matching, with no external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
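
The lock-plus-retry pattern described above can be sketched in plain Python. This is only an illustration of the technique, not the library's actual _load_core(): the names load_once and loader are hypothetical, and only the retry count of 3 comes from the text above.

```python
import threading
import time

_CORE_IMPORT_RETRIES = 3          # value quoted in the list above
_core_lock = threading.Lock()     # serializes initialization across threads
_core = None                      # cached result of the first successful load


def load_once(loader, retries=_CORE_IMPORT_RETRIES, delay=0.0):
    """Call loader() until it succeeds (at most `retries` times) and cache the result."""
    global _core
    with _core_lock:
        if _core is not None:
            return _core          # already initialized, nothing to do
        last_error = None
        for _attempt in range(retries):
            try:
                _core = loader()
                return _core
            except ImportError as exc:
                last_error = exc  # keep the cause for the final error message
                time.sleep(delay)
        raise ImportError(
            f"core failed to load after {retries} attempts"
        ) from last_error
```

Holding the lock for the whole retry loop means a second thread waits for the first attempt to finish instead of starting a competing load.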

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
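
As a minimal sketch of how such a setting might be read, the helper below looks up WINNERZ_PREVIEW_BACKEND and falls back to auto. The function name preview_backend, the case-insensitive comparison, and the ValueError on unknown values are assumptions made for illustration; the library's own handling may differ.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")   # the two values listed above


def preview_backend(environ=os.environ):
    """Return the requested preview backend, defaulting to "auto"."""
    value = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if value not in VALID_BACKENDS:
        # Assumption: reject unknown values rather than silently ignoring them.
        raise ValueError(f"unsupported WINNERZ_PREVIEW_BACKEND: {value!r}")
    return value
```

To force PDFium from a script, setting os.environ["WINNERZ_PREVIEW_BACKEND"] = "pdfium" before importing the library is the safest ordering, since it is not stated when the variable is read.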

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and performing Intersection-over-Union (IoU) template matching.

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Packs bitmaps into 64-bit words and uses the CPU's POPCNT instruction (via intrinsics such as __popcnt64) to evaluate millions of pixel comparisons in milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
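
The matching idea can be illustrated in a few lines of Python. This is a toy sketch, not the C++ engine: bitmaps are packed into arbitrary-size Python integers instead of 64-bit words, popcount is done in software, and the helper names and the 3x3 templates are invented for the example.

```python
def pack_bitmap(rows):
    """Pack rows of '0'/'1' characters into one integer bitmask."""
    return int("".join(rows), 2)


def popcount(value):
    """Count set bits; native code would use a hardware POPCNT intrinsic here."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps of equal dimensions."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def best_match(glyph, templates):
    """Return (label, score) for the template with the highest IoU."""
    scored = ((label, iou(glyph, bits)) for label, bits in templates.items())
    return max(scored, key=lambda pair: pair[1])


templates = {
    "+": pack_bitmap(["010", "111", "010"]),
    "-": pack_bitmap(["000", "111", "000"]),
}
damaged_plus = pack_bitmap(["010", "111", "000"])  # bottom stroke missing
print(best_match(damaged_plus, templates))         # ('+', 0.8)
```

Because intersection and union reduce to one AND, one OR, and two bit counts per word, the cost per template is proportional to the number of words rather than the number of pixels.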

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw memory bytes (Zero-Disk mode) and instantly initializes the C++ core. If encryption is detected (e.g. RC4/AES), it falls back to an automatic decryption routine seamlessly in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): A highly optimized utility that utilizes C++ multi-threading to extract text from all pages. It uses a dynamic hardware-concurrency batching mechanism to process pages in chunks (scaling automatically with the number of CPU cores). This entirely bypasses the Python GIL and prevents thread-exhaustion (EAGAIN) on massive 5000+ page PDFs.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_pages_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
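
The batching idea behind get_all_text() can be sketched as follows: pages are processed in chunks sized to the CPU count, so the number of in-flight tasks stays bounded no matter how long the document is. This is a Python analogue for illustration only. The names extract_in_batches and extract_page are hypothetical, and unlike the C++ pipeline, Python threads running pure-Python work do not escape the GIL.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def extract_in_batches(page_count, extract_page, workers=None):
    """Run extract_page(i) for every page index, one bounded batch at a time."""
    workers = workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, page_count, workers):
            batch = range(start, min(start + workers, page_count))
            # map() yields results in page order within the batch
            results.extend(pool.map(extract_page, batch))
    return results
```

Submitting one batch at a time caps the number of pending tasks at the worker count, which is the property that avoids thread exhaustion on very large documents.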

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
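
The scaling step of show_pdf_page can be reasoned about with a small helper. The sketch below computes a scale and offset that fit a source page into a target rectangle; the function name fit_rect, the tuple layout, and the choice to centre the page are assumptions for illustration, not the library's documented behavior.

```python
def fit_rect(src_w, src_h, dst, keep_proportion=True):
    """Return (sx, sy, tx, ty) placing a src_w x src_h page inside dst.

    dst is (x0, y0, x1, y1); sx/sy are scale factors, tx/ty the offset
    of the scaled page's top-left corner.
    """
    x0, y0, x1, y1 = dst
    dst_w, dst_h = x1 - x0, y1 - y0
    sx = dst_w / src_w
    sy = dst_h / src_h
    if keep_proportion:
        sx = sy = min(sx, sy)     # uniform scale, so the page is not distorted
    # Assumption: centre the scaled page inside the target rectangle.
    tx = x0 + (dst_w - src_w * sx) / 2
    ty = y0 + (dst_h - src_h * sy) / 2
    return sx, sy, tx, ty


print(fit_rect(200, 100, (0, 0, 100, 100)))                         # (0.5, 0.5, 0.0, 25.0)
print(fit_rect(200, 100, (0, 0, 100, 100), keep_proportion=False))  # (0.5, 1.0, 0.0, 0.0)
```

With keep_proportion enabled, the smaller of the two scale factors wins, which leaves empty space along the other axis rather than stretching the page.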

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
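
The relationship between stride, n, and samples can be shown with a short lookup function. This is a sketch of the usual row-major layout, assuming one byte per channel; pixel_at is a hypothetical helper, not the Pixmap.pixel implementation.

```python
def pixel_at(samples, stride, n, x, y):
    """Return the n channel values of pixel (x, y) from a row-major buffer."""
    offset = y * stride + x * n   # skip y full rows, then x pixels of n bytes
    return tuple(samples[offset:offset + n])


# 2x2 RGBA image: 4 channels, 8 bytes per row, no padding
rgba = bytes(range(16))
print(pixel_at(rgba, stride=8, n=4, x=1, y=1))   # (12, 13, 14, 15)
```

Using stride rather than width * n keeps the lookup correct when rows are padded for alignment.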

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
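
A rectangle type with the behavior described above can be sketched in a few lines. SketchRect is a stand-in written for this example (it is not the library's Rect class), and clamping width and height to zero for empty rectangles is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        """Intersection: the largest rectangle contained in both operands."""
        return SketchRect(
            max(self.x0, other.x0), max(self.y0, other.y0),
            min(self.x1, other.x1), min(self.y1, other.y1),
        )


overlap = SketchRect(0, 0, 10, 10) & SketchRect(5, 5, 20, 20)
print(overlap)            # SketchRect(x0=5, y0=5, x1=10, y1=10)
```

When the operands do not overlap, the result has x1 <= x0 or y1 <= y0, so is_empty reports True without needing a special case.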

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
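
The signature check used by the global document cache can be sketched with os.stat. The helper names file_signature and open_cached, the factory callback, and the module-level dictionary are assumptions for illustration; only the choice of file size plus nanosecond modification time comes from the description above.

```python
import os

_cache = {}   # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds) for path."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return the cached object for path, rebuilding it if the file changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]             # file unchanged: reuse the cached object
    obj = factory(key)            # first open, or file was modified
    _cache[key] = (signature, obj)
    return obj
```

A signature comparison costs a single stat call, which is far cheaper than re-parsing the file, at the price of missing edits that preserve both size and timestamp.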

Logging

WinnerZ uses standard Python logging under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure handlers and levels the same way you would for other libraries such as PyMuPDF.

```python
import logging

# Show debug-level messages from the winnerz logger
logging.getLogger("winnerz").setLevel(logging.DEBUG)
```

Performance Benchmark

With its native C++ multi-threading pipeline and persistent object caching, WinnerZ outperformed PyMuPDF (fitz) in the bulk text extraction test below.

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)

C++ Micro-OCR Benchmark

Tested on a 100% text-obfuscated PDF file (forcing the system to run Micro-OCR on every character):

  • 🐢 Traditional OCR (Tesseract): ~3 - 5 seconds / page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds / page (roughly 9x to 15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
