A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native C++, thread-safe PDF token interpreter that uses std::async for parallel multi-page text extraction, so extraction does not contend for the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
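
The locking and retry behavior described above can be sketched in plain Python. This is a minimal illustration, not the library's actual _load_core(): the names load_core_once, importer, and flaky_importer are hypothetical, and only the retry count of 3 comes from the text.

```python
import threading

_CORE_IMPORT_RETRIES = 3  # value quoted in the list above
_core_lock = threading.Lock()
_core = None


def load_core_once(importer, retries=_CORE_IMPORT_RETRIES):
    """Return the cached core, importing it at most once across threads."""
    global _core
    with _core_lock:
        if _core is not None:
            return _core  # already initialized by an earlier call
        last_error = None
        for _attempt in range(retries):
            try:
                _core = importer()
                return _core
            except (ImportError, OSError) as exc:
                last_error = exc  # possibly transient: try again
        raise ImportError(
            f"core failed to load after {retries} attempts"
        ) from last_error


# Demo with a hypothetical importer that fails twice, then succeeds.
calls = {"n": 0}


def flaky_importer():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated transient load failure")
    return object()


core_a = load_core_once(flaky_importer)
core_b = load_core_once(flaky_importer)  # served from the cache
```

Holding the lock for the whole retry loop means a second thread waits for the first attempt to finish instead of starting a competing import.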

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
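
As a rough sketch of how such a setting could be resolved, the function below reads the variable and applies the rules listed above. The function name, the handling of unknown values, and the error raised when pdfium is requested but missing are all assumptions made for illustration; the library's real behavior may differ.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")  # values listed above


def resolve_preview_backend(environ=None, pdfium_available=True):
    """Return "pdfium", or None when no preview backend can be used."""
    env = os.environ if environ is None else environ
    choice = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if choice not in VALID_BACKENDS:
        choice = "auto"  # assumption: unknown values fall back to the default
    if choice == "pdfium" and not pdfium_available:
        # Assumption: an explicit request that cannot be met is an error.
        raise RuntimeError("pdfium backend requested but not available")
    # "auto" uses PDFium when it is available, otherwise nothing.
    return "pdfium" if pdfium_available else None
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.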

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw in-memory bytes (Zero-Disk mode) and initializes the C++ core immediately. If encryption is detected (e.g., RC4/AES), it falls back to an automatic decryption routine that runs either in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • extract_all_text_concurrent(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches sized from the hardware concurrency (scaling with the number of CPU cores), so the work runs outside the Python GIL and avoids thread exhaustion (EAGAIN) on very large PDFs of 5000+ pages.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages(output_path, page_rects_map): (Native C++) Safely performs parallel Block Redaction across multiple pages and saves the cleaned output directly to a file. This is the recommended and most stable approach.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Similar to the above, but returns the cleaned PDF as bytes. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
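
The batching idea behind extract_all_text_concurrent() can be illustrated in pure Python. The sketch below is only an analogy: the real work happens in C++ threads, whereas Python threads still share the GIL. It shows how capping each batch at the core count bounds the number of live threads. The names page_batches and extract_in_batches are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, batch_size=None):
    """Yield ranges of page indices, at most batch_size pages each."""
    size = batch_size or os.cpu_count() or 1
    for start in range(0, page_count, size):
        yield range(start, min(start + size, page_count))


def extract_in_batches(page_count, extract_page, batch_size=None):
    """Call extract_page(i) for every page, one bounded batch at a time."""
    size = batch_size or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=size) as pool:
        for batch in page_batches(page_count, size):
            # pool.map preserves input order, so results stay page-ordered.
            results.extend(pool.map(extract_page, batch))
    return results
```

Because no more than one batch is in flight at a time, a 5000-page document never needs more worker threads than the batch size.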

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
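
The scaling that keep_proportion implies for show_pdf_page can be worked through with a small helper. This is a sketch under stated assumptions: fit_rect is a hypothetical function, and centering the scaled page inside the target rectangle is an assumption, since the text does not say how leftover space is distributed.

```python
def fit_rect(src_w, src_h, x0, y0, x1, y1, keep_proportion=True):
    """Return (x0, y0, x1, y1) of a source page placed in a target rect."""
    box_w, box_h = x1 - x0, y1 - y0
    if not keep_proportion:
        return (x0, y0, x1, y1)  # stretch to fill the whole target
    # A single uniform scale factor preserves the aspect ratio.
    scale = min(box_w / src_w, box_h / src_h)
    out_w, out_h = src_w * scale, src_h * scale
    # Assumption: the scaled page is centered in the leftover space.
    left = x0 + (box_w - out_w) / 2
    top = y0 + (box_h - out_h) / 2
    return (left, top, left + out_w, top + out_h)
```

For example, a 200 x 100 source page placed into a 100 x 100 target is scaled by 0.5 and occupies the middle band of the target, while keep_proportion=False stretches it over the full rectangle.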

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
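
The relationship between width, n, stride, and samples can be shown with a tiny hand-built buffer. The helper pixel_at below is a hypothetical stand-in for what a method like pixel(x, y) has to compute; it assumes interleaved channels and row-major layout.

```python
def pixel_at(samples, stride, n, x, y):
    """Return the n channel values of pixel (x, y) from a raw buffer."""
    offset = y * stride + x * n  # skip y full rows, then x pixels
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA image: each row holds 2 pixels * 4 channels = 8 bytes.
width, height, n = 2, 2, 4
stride = width * n
samples = bytes([
    255, 0, 0, 255,   0, 255, 0, 255,    # row 0: red, green
    0, 0, 255, 255,   255, 255, 255, 0,  # row 1: blue, transparent white
])
```

Indexing through stride rather than width * n keeps the lookup correct for buffers whose rows carry padding bytes.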

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
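
To make the geometry semantics concrete, here is a minimal stand-in for the two types. It mirrors the constructor signatures and the & operator described above, but it is not the library's implementation; in particular, Matrix.apply is a hypothetical helper added only to show what a scaling matrix does to a rectangle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.width == 0 or self.height == 0

    def __and__(self, other):
        # Intersection: the largest rectangle contained in both operands.
        return Rect(max(self.x0, other.x0), max(self.y0, other.y0),
                    min(self.x1, other.x1), min(self.y1, other.y1))


@dataclass(frozen=True)
class Matrix:
    sx: float = 1.0
    sy: float = 1.0

    def apply(self, rect):
        # Hypothetical helper: scale a rectangle about the origin.
        return Rect(rect.x0 * self.sx, rect.y0 * self.sy,
                    rect.x1 * self.sx, rect.y1 * self.sy)
```

With these definitions, two disjoint rectangles intersect to a rectangle whose is_empty is true, which gives callers a simple overlap test.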

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.

Logging

WinnerZ uses the standard Python logging module under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure levels and handlers the same way you would for pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ was significantly faster than PyMuPDF (fitz) at bulk text extraction in the test below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (extract_all_text_concurrent()): ~0.18s (about 2.4x faster)
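
A comparison like this can be reproduced with a small timing harness. The sketch below is generic and does not reflect how the figures above were measured; best_of and speedup are hypothetical helpers, and the callables passed in would wrap whichever extraction calls are being compared.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time of func() over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Taking the minimum over several runs reduces noise from other processes. Applied to the two timings quoted above, speedup(0.44, 0.18) evaluates to roughly 2.44.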

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.

Download files

Source Distribution

  • winnerz-1.2.0.tar.gz (83.4 MB)

Built Distributions

  • winnerz-1.2.0-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.2.0-cp312-cp312-manylinux_2_28_x86_64.whl (7.2 MB): CPython 3.12, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.0-cp312-cp312-macosx_11_0_arm64.whl (1.4 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.2.0-cp312-cp312-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.2.0-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.2.0-cp311-cp311-manylinux_2_28_x86_64.whl (6.3 MB): CPython 3.11, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.0-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.2.0-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.2.0-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.2.0-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.0-cp310-cp310-macosx_11_0_arm64.whl (1.4 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.2.0-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.2.0-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.2.0-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.0-cp39-cp39-macosx_11_0_arm64.whl (1.4 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.2.0-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+ x86-64
