A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A pure C++ built-in OCR engine that activates automatically when it encounters corrupted or missing ToUnicode tables. It uses 64-bit bitwise packing and hardware popcount instructions for fast template matching, without external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
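
The lock-plus-retry pattern described above can be sketched in a few lines. This is only an illustration of the technique, not the library's actual _load_core(): the function name load_core, the delay parameter, and the error message are assumptions, and the truncation repair step is omitted.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3          # value quoted in the list above
_core_lock = threading.Lock()
_core_module = None


def load_core(module_name="winnerz_core", delay=0.05):
    """Import the extension module once, retrying transient failures.

    Hypothetical helper: the name, signature, and delay are assumptions.
    """
    global _core_module
    with _core_lock:                      # only one thread may initialize
        if _core_module is not None:      # already loaded: reuse it
            return _core_module
        last_error = None
        for attempt in range(1, _CORE_IMPORT_RETRIES + 1):
            try:
                _core_module = importlib.import_module(module_name)
                return _core_module
            except ImportError as exc:
                last_error = exc
                if attempt < _CORE_IMPORT_RETRIES:
                    time.sleep(delay)     # brief pause before the next try
        raise ImportError(
            f"could not import {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts"
        ) from last_error
```

Holding the lock for the whole retry loop keeps the sketch simple: a second thread that arrives during initialization waits and then receives the already-imported module.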

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
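
As a rough sketch of how such a setting might be resolved, the helper below reads the variable and picks a backend. The function name resolve_preview_backend, the handling of unrecognized values, and the error raised for an unsatisfiable explicit request are all assumptions; the text above only specifies the variable name, the two valid values, and that auto uses PDFium when available.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")  # values listed in the text above


def resolve_preview_backend(environ=None, pdfium_available=True):
    """Return the backend name to use, or None when no preview backend exists."""
    env = os.environ if environ is None else environ
    raw = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    # Assumption: unrecognized values are treated like "auto".
    choice = raw if raw in VALID_BACKENDS else "auto"
    if pdfium_available:
        return "pdfium"
    if choice == "pdfium":
        # Assumption: an explicit request that cannot be met is an error.
        raise RuntimeError(
            "WINNERZ_PREVIEW_BACKEND=pdfium but PDFium is unavailable"
        )
    return None
```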

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and performing Intersection-over-Union (IoU) template matching.

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Packs glyph bitmaps into 64-bit words and counts matching bits with population-count intrinsics such as __popcnt64, so millions of pixel comparisons are evaluated in milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
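
To make the matching step concrete, here is a minimal Python sketch of bit-packed IoU template matching. It is not the engine's C++ code: Python integers stand in for the 64-bit words, bin(...).count("1") stands in for the hardware popcount, and the helper names and the tiny 3x3 templates are invented for illustration.

```python
def pack_bitmap(rows):
    """Pack rows of '#'/'.' characters into one integer, one bit per pixel."""
    bits = 0
    for row in rows:
        for ch in row:
            bits = (bits << 1) | (1 if ch == "#" else 0)
    return bits


def popcount(value):
    """Count set bits (stand-in for a hardware popcount instruction)."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps of equal size."""
    union = popcount(a | b)
    if union == 0:
        return 0.0                      # both bitmaps are blank
    return popcount(a & b) / union


def best_match(glyph, templates):
    """Return the template name with the highest IoU against the glyph."""
    return max(templates, key=lambda name: iou(glyph, templates[name]))


# Hypothetical 3x3 templates, far smaller than real glyph bitmaps.
TEMPLATES = {
    "I": pack_bitmap([".#.", ".#.", ".#."]),
    "L": pack_bitmap(["#..", "#..", "###"]),
    "-": pack_bitmap(["...", "###", "..."]),
}

noisy_i = pack_bitmap([".#.", ".#.", "##."])   # an "I" with one stray pixel
print(best_match(noisy_i, TEMPLATES))
```

Because intersection and union reduce to a bitwise AND and OR followed by a bit count, each comparison touches whole words rather than individual pixels, which is where the speed comes from.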

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw in-memory bytes (Zero-Disk mode) and immediately initializes the C++ core. If encryption is detected (e.g., RC4/AES), it falls back to an automatic decryption routine that runs either in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches whose size scales with the number of CPU cores (hardware concurrency), which bypasses the Python GIL and prevents thread exhaustion (EAGAIN) on very large PDFs (5,000+ pages).
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_pages_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
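
The batching idea behind get_all_text() can be illustrated in pure Python. This sketch only shows how capping the batch size at the core count bounds the number of live threads; unlike the C++ implementation it does not bypass the GIL, and extract_in_batches and its parameters are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def extract_in_batches(page_count, extract_page, batch_size=None):
    """Run extract_page(i) for every page, at most batch_size at a time."""
    if batch_size is None:
        batch_size = os.cpu_count() or 1   # scale with the number of cores
    batch_size = max(1, batch_size)
    results = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, page_count, batch_size):
            batch = range(start, min(start + batch_size, page_count))
            # map() yields results in input order, so pages stay in
            # document order; extend() waits for the whole batch.
            results.extend(pool.map(extract_page, batch))
    return results
```

Since the pool never holds more than batch_size workers, a 5,000-page document creates no more threads than a 50-page one.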

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
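
The scale-to-fit behavior of show_pdf_page can be sketched as plain arithmetic. The helper below is an assumption-laden illustration, not the library's code: fit_rect is a hypothetical name, and centering the scaled page inside the target rectangle is an assumed placement policy.

```python
def fit_rect(src_w, src_h, x0, y0, x1, y1, keep_proportion=True):
    """Return (sx, sy, tx, ty) placing a src_w x src_h page into a rect.

    sx, sy are scale factors; tx, ty is the offset of the placed page.
    """
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source page must have positive dimensions")
    dst_w, dst_h = x1 - x0, y1 - y0
    sx, sy = dst_w / src_w, dst_h / src_h
    if keep_proportion:
        sx = sy = min(sx, sy)   # uniform scale; the tighter axis decides
    # Assumption: center the scaled page within the target rectangle.
    tx = x0 + (dst_w - src_w * sx) / 2
    ty = y0 + (dst_h - src_h * sy) / 2
    return sx, sy, tx, ty


# A 200x100 source page placed into a 100x100 square:
print(fit_rect(200, 100, 0, 0, 100, 100))                         # (0.5, 0.5, 0.0, 25.0)
print(fit_rect(200, 100, 0, 0, 100, 100, keep_proportion=False))  # (0.5, 1.0, 0.0, 0.0)
```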

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
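
The relationship between samples, stride, and n is easiest to see in code. The function below is a stand-alone sketch of how a pixel lookup could index the raw buffer, assuming row-major layout with n bytes per pixel; pixel_at is a hypothetical helper, not the Pixmap method itself.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of pixel (x, y) from a raw buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError("pixel coordinates out of range")
    # Rows are stride bytes apart; stride may exceed width * n if rows
    # are padded, so it must be used instead of width * n.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA image with no row padding (stride = 2 * 4 = 8).
buf = bytes(range(16))
print(pixel_at(buf, 2, 2, 4, 8, 1, 1))   # (12, 13, 14, 15)
```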

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
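
For readers unfamiliar with operator overloading on geometry types, the following stand-in shows one way the & intersection could behave. DemoRect is a hypothetical class written for this illustration; the real Rect may differ, in particular in how it represents an empty result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoRect:
    """Illustrative stand-in for a rectangle type (not the library's Rect)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def is_empty(self):
        # No area when either extent is zero or negative.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # The intersection keeps the innermost edge on every side.
        return DemoRect(
            max(self.x0, other.x0), max(self.y0, other.y0),
            min(self.x1, other.x1), min(self.y1, other.y1),
        )


a = DemoRect(0, 0, 10, 10)
b = DemoRect(5, 5, 20, 20)
print(a & b)                                  # DemoRect(x0=5, y0=5, x1=10, y1=10)
print((a & DemoRect(30, 30, 40, 40)).is_empty)  # True
```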

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    [!TIP] If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
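
The signature check used by the global document cache can be sketched with the standard library alone. In this illustration, open_cached, file_signature, and the factory argument are hypothetical names; only the idea of validating a hit against file size and nanosecond modification time comes from the description above.

```python
import os

_cache = {}   # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size, mtime in nanoseconds) for the file at path."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached object for path, rebuilding it if the file changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]                 # file unchanged: reuse cached object
    obj = factory(key)                # stale or missing: build a fresh one
    _cache[key] = (signature, obj)
    return obj
```

A signature of this kind is cheap to compute but not cryptographic: a rewrite that preserves both size and modification time would go unnoticed.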

Logging

WinnerZ uses the standard Python logging module under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can attach handlers, formatters, and levels to it as you would for any other library, similar to pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)
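
Setting the level alone does not attach an output handler. A slightly fuller setup, using only the standard logging module, might look like the sketch below; configure_winnerz_logging is a hypothetical helper name and the format string is just one reasonable choice.

```python
import logging


def configure_winnerz_logging(level=logging.DEBUG):
    """Attach a stream handler with timestamps to the "winnerz" logger."""
    logger = logging.getLogger("winnerz")
    logger.setLevel(level)
    if not logger.handlers:           # avoid duplicate handlers on re-run
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```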

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ is significantly faster than PyMuPDF (fitz) at bulk text extraction in the test below.

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (~2.4x faster)
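
To reproduce a comparison like this on your own files, a small timing harness is enough. The sketch below is generic: best_of and speedup are hypothetical helpers, and the callables you pass in (for example, one wrapping get_all_text() and one wrapping the PyMuPDF equivalent) are up to you.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time of func() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Ratio of the two timings reported above (0.44 s vs 0.18 s).
print(round(speedup(0.44, 0.18), 1))   # 2.4
```

Taking the best of several runs reduces noise from disk caching and other processes, which matters when the measured times are well under a second.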

C++ Micro-OCR Benchmark

Tested on a fully text-obfuscated PDF file (forcing the system to run Micro-OCR on every character):

  • 🐢 Traditional OCR (Tesseract): ~3-5 seconds / page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds / page (~9-15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • winnerz-1.2.9-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.2.9-cp312-cp312-manylinux_2_28_x86_64.whl (7.4 MB): CPython 3.12, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.9-cp312-cp312-macosx_11_0_arm64.whl (1.4 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.2.9-cp312-cp312-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.2.9-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.2.9-cp311-cp311-manylinux_2_28_x86_64.whl (6.4 MB): CPython 3.11, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.9-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.2.9-cp311-cp311-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.2.9-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.2.9-cp310-cp310-manylinux_2_28_x86_64.whl (5.4 MB): CPython 3.10, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.9-cp310-cp310-macosx_11_0_arm64.whl (1.4 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.2.9-cp310-cp310-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.2.9-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.2.9-cp39-cp39-manylinux_2_28_x86_64.whl (4.5 MB): CPython 3.9, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.9-cp39-cp39-macosx_11_0_arm64.whl (1.4 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.2.9-cp39-cp39-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.9, macOS 10.9+ x86-64
