A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, loading the core library dynamically, and rendering previews through PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
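
The locking and retry behaviour listed above can be sketched in plain Python. This is a minimal illustration and not the library's actual _load_core(): the names load_core_once, _core, and the importer argument are hypothetical, and only the retry count of 3 is taken from the text.

```python
import threading
import time

_CORE_IMPORT_RETRIES = 3      # value quoted in the documentation
_lock = threading.Lock()
_core = None                  # hypothetical module-level cache


def load_core_once(importer, delay=0.0):
    """Return the cached core, importing it at most once across threads."""
    global _core
    with _lock:
        # A second caller blocks on the lock, then sees the cached result.
        if _core is not None:
            return _core
        last_error = None
        for _attempt in range(_CORE_IMPORT_RETRIES):
            try:
                _core = importer()
                return _core
            except (ImportError, OSError) as exc:
                # Treat the failure as transient and try again.
                last_error = exc
                time.sleep(delay)
        raise ImportError(
            f"core failed to load after {_CORE_IMPORT_RETRIES} attempts"
        ) from last_error
```

Holding the lock for the whole retry loop keeps concurrent callers from starting their own import attempts; chaining the last exception preserves the underlying cause for diagnostics.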

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
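
As a rough sketch of how such a setting might be read and validated, the helper below looks up WINNERZ_PREVIEW_BACKEND and rejects unknown values. The function resolve_preview_backend is hypothetical, and the whitespace stripping and case folding are assumptions; the text only states the variable name, the two valid values, and the default.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")   # values listed in the documentation


def resolve_preview_backend(environ=None):
    """Hypothetical helper: read and validate WINNERZ_PREVIEW_BACKEND."""
    env = os.environ if environ is None else environ
    # Assumption: surrounding whitespace and letter case are ignored.
    value = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if value not in VALID_BACKENDS:
        raise ValueError(f"unsupported preview backend: {value!r}")
    return value
```

Since the text does not say when the variable is read, setting it with os.environ["WINNERZ_PREVIEW_BACKEND"] = "pdfium" before importing winnerz is the conservative choice.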

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw memory bytes (Zero-Disk mode) and instantly initializes the C++ core. If encryption is detected (e.g. RC4/AES), it falls back to an automatic decryption routine seamlessly in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • extract_all_text_concurrent(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches sized from the hardware concurrency, so the batch size scales automatically with the number of CPU cores. The work runs outside the Python GIL, and batching prevents thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages(output_path, page_rects_map): (Native C++) Safely performs parallel Block Redaction across multiple pages and saves the cleaned output directly to a file. This is the recommended and most stable approach.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Similar to the above, but returns the cleaned PDF as bytes. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
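
The batching idea behind extract_all_text_concurrent() can be illustrated with a Python analogue. This is only a sketch of the chunking pattern: the real work is described as happening in C++ via std::async, and Python threads like the ones below do not bypass the GIL. The names batched, extract_in_batches, and extract_page are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def batched(indices, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(indices), size):
        yield indices[start:start + size]


def extract_in_batches(page_count, extract_page, workers=None):
    """Run extract_page(i) for every page, at most `workers` at a time."""
    workers = workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in batched(list(range(page_count)), workers):
            # pool.map preserves input order, so pages stay in sequence.
            results.extend(pool.map(extract_page, chunk))
    return results
```

Capping the number of in-flight tasks at the core count is what keeps a 5000-page document from requesting thousands of threads at once.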

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
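
To make the keep_proportion behaviour of show_pdf_page() concrete, the sketch below computes where a source page would land inside a target rectangle. It is an assumption-laden illustration: the function fit_rect is hypothetical, rectangles are plain (x0, y0, x1, y1) tuples, and centring the scaled page inside the target is an assumed placement that the text does not confirm.

```python
def fit_rect(target, src_w, src_h, keep_proportion=True):
    """Return (x0, y0, x1, y1) where a src_w x src_h page lands in target."""
    x0, y0, x1, y1 = target
    tw, th = x1 - x0, y1 - y0
    if not keep_proportion:
        # Stretch to fill the whole target rectangle.
        return (x0, y0, x1, y1)
    # Use the smaller scale factor so the page fits on both axes.
    scale = min(tw / src_w, th / src_h)
    w, h = src_w * scale, src_h * scale
    # Assumption: centre the scaled page inside the target rectangle.
    ox = x0 + (tw - w) / 2
    oy = y0 + (th - h) / 2
    return (ox, oy, ox + w, oy + h)
```

For example, a square 100 x 100 page placed into a 200 x 100 target keeps its size and is shifted 50 units to the right under these assumptions.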

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
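
The relationship between samples, stride, and n can be shown with a small lookup function. This sketch assumes the conventional row-major layout, where each row occupies stride bytes and each pixel occupies n consecutive bytes; the function pixel_at is hypothetical and is not the library's pixel() implementation.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of the pixel at (x, y)."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Rows may be padded, so step by stride rather than width * n.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```

Using stride instead of width * n matters whenever rows carry padding bytes for alignment.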

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
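
The intersection semantics of the & operator can be sketched with a stand-in class. The Rect below is an illustrative dataclass and not the library's own type; in particular, treating a rectangle with zero or negative extent as empty is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Stand-in for illustration only, not winnerz.Rect."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # Assumption: zero or negative extent on either axis means empty.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: largest lower bounds, smallest upper bounds.
        return Rect(max(self.x0, other.x0), max(self.y0, other.y0),
                    min(self.x1, other.x1), min(self.y1, other.y1))
```

With this definition, two rectangles that do not overlap produce a result whose is_empty is true, which gives a convenient overlap test for clip or redaction regions.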

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    [!TIP] If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
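
The file-signature check used by the global document cache can be sketched as follows. This is an assumption-based illustration: open_cached, file_signature, _cache, and the factory argument are hypothetical names, and only the choice of file size plus modification time in nanoseconds comes from the text.

```python
import os

_cache = {}   # hypothetical: absolute path -> (signature, document)


def file_signature(path):
    """Return (size, mtime in nanoseconds) for the file at path."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached document unless the file changed on disk."""
    key = os.path.abspath(path)
    sig = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == sig:
        return hit[1]
    # Cache miss or stale entry: build a fresh document and store it.
    doc = factory(key)
    _cache[key] = (sig, doc)
    return doc
```

A signature check like this is cheap because it needs one stat call and never reads the file contents, at the cost of missing edits that preserve both size and timestamp.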

Logging

WinnerZ uses the standard Python logging module under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure handlers and levels as you would for any other library, such as PyMuPDF.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)
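
A slightly fuller configuration, using only the standard logging module, might attach a dedicated handler and formatter to the same logger. The format string and the decision to disable propagation are illustrative choices, not requirements of the library.

```python
import logging

logger = logging.getLogger("winnerz")
logger.setLevel(logging.DEBUG)

# Send winnerz messages to stderr with a timestamped format.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)

# Avoid duplicate lines if the root logger also has handlers.
logger.propagate = False
```

Any child logger under the winnerz namespace inherits this level unless it sets its own.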

Performance Benchmark

In the bulk text extraction benchmark below, WinnerZ ran faster than PyMuPDF (fitz). The project attributes the difference to the native C++ multi-threading pipeline and persistent object caching.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (extract_all_text_concurrent()): ~0.18s (2.5x Faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.

Download files

Source Distribution

  • winnerz-1.2.2.tar.gz (83.3 MB): Source

Built Distributions

  • winnerz-1.2.2-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.2.2-cp312-cp312-manylinux_2_28_x86_64.whl (7.2 MB): CPython 3.12, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.2-cp312-cp312-macosx_11_0_arm64.whl (1.4 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.2.2-cp312-cp312-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.2.2-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.2.2-cp311-cp311-manylinux_2_28_x86_64.whl (6.3 MB): CPython 3.11, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.2-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.2.2-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.2.2-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.2.2-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.2-cp310-cp310-macosx_11_0_arm64.whl (1.4 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.2.2-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.2.2-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.2.2-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux: glibc 2.28+ x86-64
  • winnerz-1.2.2-cp39-cp39-macosx_11_0_arm64.whl (1.4 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.2.2-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+ x86-64
