A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, loading the core library dynamically, and rendering previews through PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A built-in OCR engine written in pure C++ that activates automatically when a PDF's ToUnicode tables are corrupted or missing. It uses 64-bit bit packing and the hardware population-count (POPCNT) instruction for fast template matching, with no external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
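
The lock-plus-retry pattern described above can be sketched in a few lines. This is a minimal illustration, not the library's _load_core(): the class name OnceLoader and the loader callable are hypothetical stand-ins for the real extension import, and only the retry count of 3 comes from the text.

```python
import threading
import time


class OnceLoader:
    """Load a resource at most once, retrying transient failures.

    `loader` is a hypothetical zero-argument callable standing in for the
    import of the C++ extension; it is not part of the winnerz API.
    """

    def __init__(self, loader, retries=3, delay=0.0):
        self._loader = loader
        self._retries = retries
        self._delay = delay
        self._lock = threading.Lock()
        self._loaded = False
        self._value = None

    def get(self):
        with self._lock:                      # one thread initializes at a time
            if self._loaded:                  # later callers reuse the result
                return self._value
            last_error = None
            for attempt in range(1, self._retries + 1):
                try:
                    self._value = self._loader()
                    self._loaded = True
                    return self._value
                except (ImportError, OSError) as exc:
                    last_error = exc
                    if attempt < self._retries:
                        time.sleep(self._delay)   # pause before the next try
            raise ImportError(
                f"core failed to load after {self._retries} attempts: {last_error}"
            ) from last_error
```

Holding the lock for the whole retry loop keeps the sketch simple: a second thread waits until the first one either succeeds or gives up, so the loader never runs concurrently.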

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
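
As a rough sketch of how such a setting could be read and validated, the helper below accepts only the two documented values. The function name resolve_preview_backend is hypothetical, and raising on an unknown value is this sketch's choice; the text does not say how the library treats invalid input.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")   # the values listed above


def resolve_preview_backend(environ=None):
    """Return the requested backend name, defaulting to "auto".

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to exercise without touching the real environment.
    """
    env = os.environ if environ is None else environ
    raw = env.get("WINNERZ_PREVIEW_BACKEND", "auto")
    value = raw.strip().lower() or "auto"     # treat an empty value as unset
    if value not in VALID_BACKENDS:
        # Assumption: reject unknown names instead of silently ignoring them.
        raise ValueError(
            f"WINNERZ_PREVIEW_BACKEND must be one of {VALID_BACKENDS}, got {raw!r}"
        )
    return value
```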

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and matching them against templates by Intersection-over-Union (IoU).

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Packs pixels into 64-bit words and counts matching bits with the CPU's POPCNT instruction (through the __popcnt64 intrinsic), so millions of pixel comparisons complete in milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
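
To make the matching step concrete, here is a small Python sketch of IoU template matching on bit-packed bitmaps. It is not the C++ engine: Python integers stand in for 64-bit words, bin(...).count("1") stands in for the hardware population count, and the 3x3 templates are invented for illustration.

```python
def pack_bitmap(rows):
    """Pack a binary bitmap (rows of 0/1) into one integer, one bit per pixel."""
    bits = 0
    for row in rows:
        for pixel in row:
            bits = (bits << 1) | (1 if pixel else 0)
    return bits


def popcount(value):
    """Count set bits; a stand-in for a hardware POPCNT on a machine word."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps of the same size."""
    union = popcount(a | b)
    if union == 0:                 # both bitmaps are blank
        return 0.0
    return popcount(a & b) / union


def best_match(glyph, templates):
    """Return (label, score) for the template with the highest IoU."""
    scored = ((label, iou(glyph, bits)) for label, bits in templates.items())
    return max(scored, key=lambda item: item[1])


# Hypothetical 3x3 templates and a noisy rendering of "I".
templates = {
    "I": pack_bitmap([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "L": pack_bitmap([[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
}
noisy_i = pack_bitmap([[0, 1, 0], [0, 1, 0], [0, 1, 1]])
label, score = best_match(noisy_i, templates)
print(label, score)   # I 0.75
```

The AND of two packed bitmaps keeps the pixels set in both, the OR keeps the pixels set in either, so two bit counts give the whole score without touching individual pixels.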

Class Reference

Document

Represents a PDF document instance and manages the lifecycle of the underlying file. Loading is lazy with a fallback: the PDF is opened natively in C++ by default, and PDFium is invoked for decryption or repair only if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Accepts a file path or raw bytes in memory (Zero-Disk mode) and initializes the C++ core immediately. If encryption is detected (e.g., RC4 or AES), it falls back to an automatic decryption routine that runs in RAM or through a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches sized to the hardware concurrency (scaling with the number of CPU cores), which bypasses the Python GIL and prevents thread exhaustion (EAGAIN) on very large PDFs of 5000+ pages.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_pages_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
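
The batching idea behind get_all_text() can be illustrated in pure Python. This sketch only shows how bounding each batch to the core count keeps the number of in-flight tasks fixed; Python threads do not bypass the GIL, and the names page_batches, extract_all, and extract_page are hypothetical, not part of the library.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, batch_size):
    """Yield consecutive ranges of page indices, each at most batch_size long."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def extract_all(page_count, extract_page, workers=None):
    """Call extract_page(i) for every page index, one batch at a time.

    `extract_page` is a hypothetical per-page callable. Results come back
    in page order because Executor.map preserves input order.
    """
    workers = workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in page_batches(page_count, workers):
            # Consuming the map iterator here finishes this batch before
            # the next one is submitted, so at most `workers` tasks exist.
            results.extend(pool.map(extract_page, batch))
    return results
```

For example, extract_all(7, lambda i: i * 2, workers=3) processes the index ranges 0-2, 3-5, and 6 in turn and returns the seven results in page order.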

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Removes all vector graphics and the text layer from the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
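
The effect of keep_proportion in show_pdf_page can be pictured with a short geometry sketch. The helper fit_rect is hypothetical, and centering the scaled page inside the target rectangle is an assumption; the text only states that the source page is scaled to fit rect.

```python
def fit_rect(src_width, src_height, dst, keep_proportion=True):
    """Return the (x0, y0, x1, y1) area a source page would occupy in dst.

    `dst` is an (x0, y0, x1, y1) tuple. With keep_proportion=False the page
    is stretched to fill dst exactly.
    """
    x0, y0, x1, y1 = dst
    if not keep_proportion:
        return (x0, y0, x1, y1)
    dst_w, dst_h = x1 - x0, y1 - y0
    # A single uniform scale: the largest one that still fits both axes.
    scale = min(dst_w / src_width, dst_h / src_height)
    width, height = src_width * scale, src_height * scale
    # Assumption: center the scaled page in the leftover space.
    left = x0 + (dst_w - width) / 2
    top = y0 + (dst_h - height) / 2
    return (left, top, left + width, top + height)


# A 200x100 page placed in a 100x100 square is halved and centered vertically.
print(fit_rect(200, 100, (0, 0, 100, 100)))   # (0.0, 25.0, 100.0, 75.0)
```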

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
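
The properties above are enough to locate any pixel in the raw buffer. The sketch below shows the usual row-major addressing, assuming samples stores rows top to bottom with stride bytes per row and n bytes per pixel; pixel_at is a hypothetical free function, not the library's pixel() method.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of pixel (x, y) from a raw sample buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Rows may be padded, so step by stride (bytes per row), not width * n.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA buffer with 2 padding bytes per row (stride 10 instead of 8).
samples = bytes(range(20))
print(pixel_at(samples, 2, 2, 4, 10, 0, 1))   # (10, 11, 12, 13)
```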

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
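
Rectangle intersection through the & operator can be sketched as follows. SketchRect is a stand-in written for this example, not the library's Rect; in particular, how winnerz represents an empty intersection is not stated in the text, so the is_empty rule here is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    """Illustrative rectangle with corners (x0, y0) and (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # Assumption: a rectangle with no positive area counts as empty.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # The overlap starts at the larger of the two minimum corners
        # and ends at the smaller of the two maximum corners.
        return SketchRect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )


overlap = SketchRect(0, 0, 10, 10) & SketchRect(5, 5, 20, 20)
print(overlap)   # SketchRect(x0=5, y0=5, x1=10, y1=10)
```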

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
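
Signature-based cache validation of the kind described for the global document cache can be sketched with os.stat. The names open_cached and factory are hypothetical, and a plain object() stands in for a document; only the signature fields (file size and modification time in nanoseconds) come from the text.

```python
import os
import tempfile

_cache = {}   # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached object for path, rebuilding it if the file changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]                      # file unchanged: reuse
    obj = factory(key)                     # new or modified file: rebuild
    _cache[key] = (signature, obj)
    return obj


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")
    with open(path, "wb") as fh:
        fh.write(b"%PDF-1.7\n")
    first = open_cached(path, factory=lambda p: object())
    second = open_cached(path, factory=lambda p: object())
    reused = first is second               # unchanged file: cache hit
    with open(path, "ab") as fh:
        fh.write(b"% edited\n")            # size changes, so does the signature
    third = open_cached(path, factory=lambda p: object())
    reopened = third is not first          # changed file: cache miss
```

Comparing size and mtime is cheap but not a content hash: an edit that preserves both would go unnoticed, which is the usual trade-off of this approach.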

Logging

WinnerZ uses the standard Python logging module under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure log levels and handlers in a way similar to pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ outperforms PyMuPDF (fitz) in the bulk text extraction test below.

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
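
For readers who want to repeat a comparison like this on their own files, a minimal timing harness might look like the sketch below. The helpers best_of and speedup are hypothetical; the callables passed in would wrap whichever extraction calls are being compared.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time, in seconds, of repeated calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)   # the minimum is the run least disturbed by noise


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Applying the ratio to the figures quoted above.
print(round(speedup(0.44, 0.18), 2))   # 2.44
```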

C++ Micro-OCR Benchmark

Tested on a fully text-obfuscated PDF file, which forces Micro-OCR on every character:

  • 🐢 Traditional OCR (Tesseract): ~3 to 5 seconds per page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds per page (roughly 9x to 15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
