
A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture prioritizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, loading the core library dynamically, and rendering previews through PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A built-in OCR engine written in pure C++ that activates automatically when it encounters corrupted or missing ToUnicode tables. It packs glyph bitmaps into 64-bit words and uses the hardware POPCNT instruction for fast template matching, with no external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
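
The lock-plus-retry pattern described above can be sketched in pure Python. This is only an illustration under stated assumptions: `CoreLoader`, its `importer` parameter, and the `delay` setting are hypothetical names invented here, and the real `_load_core()` may be structured differently. Only the retry count of 3 comes from the text.

```python
import importlib
import threading
import time


class CoreLoader:
    """Hypothetical helper: import a module once, with retries, under a lock."""

    def __init__(self, module_name, importer=importlib.import_module,
                 retries=3, delay=0.0):
        self._module_name = module_name
        self._importer = importer          # injectable so it can be tested
        self._retries = retries            # mirrors _CORE_IMPORT_RETRIES = 3
        self._delay = delay                # pause between attempts (assumed)
        self._lock = threading.Lock()
        self._module = None

    def load(self):
        # The lock guarantees that only one thread performs initialization.
        with self._lock:
            if self._module is not None:   # already initialized: reuse it
                return self._module
            last_error = None
            for attempt in range(1, self._retries + 1):
                try:
                    self._module = self._importer(self._module_name)
                    return self._module
                except (ImportError, OSError) as exc:
                    last_error = exc
                    if attempt < self._retries:
                        time.sleep(self._delay)
            raise ImportError(
                f"could not load {self._module_name!r} "
                f"after {self._retries} attempts"
            ) from last_error
```

Passing the importer as a parameter keeps the sketch testable without the real binary; a production loader would also run the size checks and repair step listed above before giving up.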

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
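
As a rough sketch of how such a setting could be resolved, the function below reads the variable and checks whether `pypdfium2` is importable. The function name, the `pdfium_available` override, and the behaviour for unknown values or a missing PDFium (returning `None`) are assumptions made for illustration, not documented WinnerZ behaviour.

```python
import importlib.util
import os


def resolve_preview_backend(environ=None, pdfium_available=None):
    """Return the preview backend name, or None if no backend is usable."""
    environ = os.environ if environ is None else environ
    requested = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if requested not in {"auto", "pdfium"}:
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unsupported preview backend: {requested!r}")
    if requested == "pdfium":
        return "pdfium"                    # explicit request, no probing
    if pdfium_available is None:
        # Probe for the optional dependency without importing it.
        pdfium_available = importlib.util.find_spec("pypdfium2") is not None
    # "auto": use PDFium when available (assumed: None otherwise).
    return "pdfium" if pdfium_available else None
```

Setting `WINNERZ_PREVIEW_BACKEND=pdfium` in the environment before starting Python would then select PDFium explicitly.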

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and performing Intersection-over-Union (IoU) template matching.

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Packs pixels into 64-bit words and counts matching bits with the CPU's POPCNT instruction (through intrinsics such as __popcnt64), so millions of pixel comparisons are evaluated in milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
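
To make the bit-packed IoU idea concrete, here is a toy Python sketch. It is not the C++ engine: the 3x3 glyphs, the template names, and every function name are invented for illustration, and Python's arbitrary-precision integers stand in for the 64-bit words a native implementation would use.

```python
def pack_bitmap(rows):
    """Pack equal-length strings of '0'/'1' into a single integer bitmask."""
    bits = 0
    for row in rows:
        for ch in row:
            bits = (bits << 1) | (1 if ch == "1" else 0)
    return bits


def popcount(value):
    """Count set bits (a software stand-in for the POPCNT instruction)."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps."""
    union = popcount(a | b)
    if union == 0:
        return 0.0                         # two empty bitmaps: no overlap
    return popcount(a & b) / union


def best_match(glyph, templates):
    """Return the template name with the highest IoU against `glyph`."""
    return max(templates, key=lambda name: iou(glyph, templates[name]))


# Toy 3x3 templates (hypothetical, far smaller than real glyph bitmaps).
TEMPLATES = {
    "I": pack_bitmap(["010", "010", "010"]),
    "L": pack_bitmap(["100", "100", "111"]),
    "T": pack_bitmap(["111", "010", "010"]),
}

# A "T" with one stray pixel in the bottom-right corner.
noisy_t = pack_bitmap(["111", "010", "011"])
print(best_match(noisy_t, TEMPLATES))      # prints "T"
```

Because intersection and union reduce to a bitwise AND and OR followed by a bit count, each comparison costs only a few machine instructions per 64 pixels in a native implementation.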

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw memory bytes (Zero-Disk mode) and instantly initializes the C++ core. If encryption is detected (e.g. RC4/AES), it falls back to an automatic decryption routine seamlessly in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using the C++ multi-threaded pipeline. Pages are processed in batches sized to the hardware concurrency (scaling automatically with the number of CPU cores), which bypasses the Python GIL and avoids thread exhaustion (EAGAIN) on very large PDFs of 5000+ pages.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
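
The batching idea behind get_all_text() can be illustrated with a small Python analogue. This sketch only shows how capping each batch at the worker count bounds the number of tasks in flight; the function names are hypothetical, and unlike the C++ core (which uses std::async), Python threads remain subject to the GIL.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, batch_size):
    """Yield consecutive ranges of page indices, at most batch_size long."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def extract_all(page_count, extract_page, workers=None):
    """Run extract_page(index) for every page, one bounded batch at a time."""
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in page_batches(page_count, workers):
            # map() preserves page order; a batch never exceeds `workers`
            # tasks, so the number of live tasks stays bounded.
            texts.extend(pool.map(extract_page, batch))
    return texts
```

For a 5000-page document on an 8-core machine, this schedule would run 625 batches of 8 pages instead of launching 5000 tasks at once.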

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
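
The scaling step that show_pdf_page performs when fitting a source page into rect can be sketched as follows. This is a generic fit computation, assuming that keep_proportion means "use the same scale on both axes"; the function name and signature are hypothetical and the library's internal matrix handling may differ.

```python
def fit_scale(src_w, src_h, dst_w, dst_h, keep_proportion=True):
    """Return (sx, sy) scale factors that fit a source page into a target."""
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source page must have a positive size")
    sx = dst_w / src_w
    sy = dst_h / src_h
    if keep_proportion:
        # Use the smaller factor on both axes so the page fits entirely
        # inside the target rectangle without distortion.
        sx = sy = min(sx, sy)
    return sx, sy


print(fit_scale(200, 100, 100, 100))                         # (0.5, 0.5)
print(fit_scale(200, 100, 100, 100, keep_proportion=False))  # (0.5, 1.0)
```

With keep_proportion enabled, the scaled page may leave unused space along one axis of the target rectangle; how that space is distributed is not specified here.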

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
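
The relationship between samples, stride, and n can be shown with a short lookup function. It assumes a row-major buffer with interleaved channels, which is the usual layout for such buffers but is not confirmed by the text above; `pixel_at` is a hypothetical stand-in for the pixel(x, y) method.

```python
def pixel_at(samples, width, height, stride, n, x, y):
    """Return the n channel values of pixel (x, y) from a raw buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError("pixel coordinates out of range")
    # Each row occupies `stride` bytes; each pixel occupies `n` bytes.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA image: red, green / blue, transparent white.
samples = bytes([
    255, 0, 0, 255,     0, 255, 0, 255,
    0, 0, 255, 255,     255, 255, 255, 0,
])
print(pixel_at(samples, 2, 2, stride=8, n=4, x=1, y=0))  # (0, 255, 0, 255)
```

Using stride rather than width * n for the row offset matters whenever rows are padded for alignment.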

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
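
A minimal stand-in for the rectangle type shows how an & overload can compute an intersection. `SketchRect` is a hypothetical class written for this example; the real Rect may define emptiness or normalization differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    """Illustrative rectangle with corners (x0, y0) and (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # Assumed definition: no positive area.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the largest top-left and smallest bottom-right.
        return SketchRect(
            max(self.x0, other.x0), max(self.y0, other.y0),
            min(self.x1, other.x1), min(self.y1, other.y1),
        )


overlap = SketchRect(0, 0, 10, 10) & SketchRect(5, 5, 20, 20)
print(overlap.width, overlap.height)   # 5 5
```

Two rectangles that do not overlap produce a result whose is_empty property is true, which gives callers a simple overlap test.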

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
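
The signature check used by the global document cache can be sketched like this. The class and parameter names are hypothetical; only the idea of validating a hit by file size and nanosecond modification time comes from the description above.

```python
import os


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    stat = os.stat(path)
    return (stat.st_size, stat.st_mtime_ns)


class SignatureCache:
    """Illustrative cache that reuses an object while its file is unchanged."""

    def __init__(self, factory, signature_of=file_signature):
        self._factory = factory            # e.g. a document constructor
        self._signature_of = signature_of  # injectable for testing
        self._entries = {}                 # path -> (signature, object)

    def open(self, path):
        key = os.path.abspath(path)
        signature = self._signature_of(key)
        cached = self._entries.get(key)
        if cached is not None and cached[0] == signature:
            return cached[1]               # cache hit: file is unchanged
        obj = self._factory(key)           # miss or stale entry: rebuild
        self._entries[key] = (signature, obj)
        return obj
```

A cache like this returns the same object to every caller, which is why opening independent copies requires constructing the document directly instead.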

Logging

WinnerZ uses the standard Python logging module under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure levels and handlers much as you would with pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)
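
Setting the level alone does not display anything unless a handler is attached somewhere in the logger hierarchy. The sketch below attaches one using only the standard logging module; the helper name, the format string, and the `winnerz.demo` child logger are hypothetical choices for illustration.

```python
import io
import logging


def configure_winnerz_logging(stream, level=logging.DEBUG):
    """Attach a formatted stream handler to the "winnerz" logger."""
    logger = logging.getLogger("winnerz")
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(name)s %(levelname)s: %(message)s")
    )
    logger.addHandler(handler)
    return handler


# Capture output in memory to show what a record looks like.
buffer = io.StringIO()
configure_winnerz_logging(buffer)
# Child loggers propagate to the "winnerz" handler by default.
logging.getLogger("winnerz.demo").debug("core loaded")
print(buffer.getvalue(), end="")
```

In an application you would typically pass sys.stderr or a file object instead of an in-memory buffer.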

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ outperforms PyMuPDF (fitz) in the bulk text extraction benchmark below.

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)

C++ Micro-OCR Benchmark

Tested on a fully text-obfuscated PDF file (forcing the system to run Micro-OCR on every character):

  • 🐢 Traditional OCR (Tesseract): ~3 - 5 seconds / page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds / page (roughly 9x to 15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
