
A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.


WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native C++, thread-safe PDF token interpreter that uses std::async for parallel multi-page text extraction, so extraction is not serialized by the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
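
The lock-plus-retry behaviour described above can be sketched roughly as follows. This is only an illustration of the pattern, not the library's actual _load_core() implementation: the CoreLoader class, its importer argument, and the delay parameter are hypothetical names introduced for this example.

```python
import threading
import time


class CoreLoader:
    """Hypothetical helper: load a core object once, retrying transient errors."""

    def __init__(self, importer, retries=3, delay=0.0):
        self._importer = importer  # zero-argument callable standing in for the real import
        self._retries = retries    # mirrors _CORE_IMPORT_RETRIES = 3 from the text
        self._delay = delay        # pause between attempts (an assumption)
        self._lock = threading.Lock()
        self._core = None

    def load(self):
        # The lock serializes callers, so only the first one runs the import.
        with self._lock:
            if self._core is not None:
                return self._core
            last_error = None
            for attempt in range(1, self._retries + 1):
                try:
                    self._core = self._importer()
                    return self._core
                except (ImportError, OSError) as exc:
                    last_error = exc
                    if attempt < self._retries:
                        time.sleep(self._delay)
            raise ImportError(
                f"core failed to load after {self._retries} attempts: {last_error}"
            ) from last_error
```

Later calls to load() return the cached object without invoking the importer again.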

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
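
As a small illustration, the variable can be read and validated like this. The preview_backend() helper is hypothetical; how the library itself treats unrecognized or differently cased values is not documented here, so rejecting unknown values and normalizing case are choices made by this sketch.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")  # the two values listed above


def preview_backend(environ=None):
    """Return the requested preview backend, defaulting to "auto"."""
    env = os.environ if environ is None else environ
    # Trimming and lower-casing are assumptions of this sketch.
    value = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if value not in VALID_BACKENDS:
        raise ValueError(
            f"WINNERZ_PREVIEW_BACKEND must be one of {VALID_BACKENDS}, got {value!r}"
        )
    return value
```

To force PDFium for a process, setting os.environ["WINNERZ_PREVIEW_BACKEND"] = "pdfium" before the first render is the likely usage, though the exact point at which the library reads the variable is not stated.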

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw in-memory bytes (Zero-Disk mode) and initializes the C++ core immediately. If encryption is detected (e.g., RC4/AES), it falls back to an automatic decryption routine that runs in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • extract_all_text_concurrent(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches sized by the machine's hardware concurrency, so the batch size scales with the number of CPU cores. The work runs outside the Python GIL, and batching avoids thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages(output_path, page_rects_map): (Native C++) Safely performs parallel Block Redaction across multiple pages and saves the cleaned output directly to a file. This is the recommended and most stable approach.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Similar to the above, but returns the cleaned PDF as bytes. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
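
The batching idea behind extract_all_text_concurrent() can be illustrated with plain Python arithmetic. The real batching happens inside the C++ core; page_batches() below is a hypothetical helper that only shows how a page range might be split into chunks no larger than the CPU count, which bounds the number of threads alive at once.

```python
import os


def page_batches(page_count, workers=None):
    """Split page indices 0..page_count-1 into chunks of at most `workers` pages."""
    if workers is None:
        workers = os.cpu_count() or 1  # cpu_count() may return None
    if page_count < 0 or workers < 1:
        raise ValueError("page_count must be >= 0 and workers >= 1")
    return [
        range(start, min(start + workers, page_count))
        for start in range(0, page_count, workers)
    ]
```

With 8 workers, a 5000-page document yields 625 batches of 8 pages each, instead of 5000 threads started at once.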

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
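
The effect of keep_proportion in show_pdf_page() can be pictured with the following geometry sketch. The fit_rect() function is hypothetical, and centering the scaled page inside the target rectangle is an assumption; the library may align it differently.

```python
def fit_rect(src_w, src_h, x0, y0, x1, y1, keep_proportion=True):
    """Return the rectangle a src_w x src_h page occupies inside (x0, y0, x1, y1)."""
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source dimensions must be positive")
    if not keep_proportion:
        # Stretch to fill the whole target rectangle.
        return (x0, y0, x1, y1)
    box_w, box_h = x1 - x0, y1 - y0
    # A uniform scale that fits both dimensions preserves the aspect ratio.
    scale = min(box_w / src_w, box_h / src_h)
    w, h = src_w * scale, src_h * scale
    # Centering is an assumption of this sketch.
    left = x0 + (box_w - w) / 2
    top = y0 + (box_h - h) / 2
    return (left, top, left + w, top + h)
```

For example, a 200 x 100 source placed into a 100 x 100 target is scaled by 0.5 and occupies a 100 x 50 band in the middle of the target.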

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
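
The relationship between stride, n, and samples can be shown with a standalone lookup function. This sketch assumes a row-major buffer with interleaved channels, which is the usual layout for such buffers but is not confirmed by the text above; pixel_at() is a hypothetical stand-in, not the library's pixel() method.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of pixel (x, y) from a raw sample buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # stride may exceed width * n when rows are padded, so rows are
    # located with stride rather than width * n.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```

Using stride for the row offset keeps the lookup correct for buffers whose rows carry padding bytes.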

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
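
A minimal sketch of a rectangle type with the properties and & overload described above might look like this. SketchRect is a hypothetical stand-in written for illustration; the library's own Rect may differ in details such as how it treats inverted coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # A rectangle with no positive area counts as empty here.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the largest top-left corner and the smallest bottom-right corner.
        return SketchRect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )
```

Two disjoint rectangles intersect to a rectangle whose is_empty property is true, which gives callers a simple overlap test.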

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    [!TIP] If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
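
The signature-based validation used by the global document cache can be sketched as below. SignatureCache and its opener argument are hypothetical names; the sketch only shows the idea of reusing a cached object while the file size and nanosecond modification time are unchanged.

```python
import os


class SignatureCache:
    """Hypothetical cache keyed by path and validated by (size, mtime_ns)."""

    def __init__(self, opener):
        self._opener = opener  # callable that builds the object for a path
        self._entries = {}     # absolute path -> (signature, object)

    def open(self, path):
        key = os.path.abspath(path)
        stat = os.stat(key)
        signature = (stat.st_size, stat.st_mtime_ns)
        cached = self._entries.get(key)
        if cached is not None and cached[0] == signature:
            return cached[1]  # cache hit: the file is unchanged
        obj = self._opener(key)
        self._entries[key] = (signature, obj)
        return obj
```

Because a hit returns the same object to every caller, code that needs independent instances of one file should construct them directly, which matches the tip above.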

Logging

WinnerZ uses the standard Python logging module under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure handlers and levels much as you would for pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)
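
Setting the level alone produces no output unless a handler is attached somewhere in the logger hierarchy. A possible setup is sketched here; the enable_winnerz_logging() helper and the format string are illustrative choices, and only the "winnerz" logger name comes from the text above.

```python
import logging


def enable_winnerz_logging(level=logging.DEBUG, stream=None):
    """Attach a stream handler to the "winnerz" logger and return it."""
    logger = logging.getLogger("winnerz")
    handler = logging.StreamHandler(stream)  # defaults to sys.stderr
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    return handler
```

Keeping the returned handler makes it possible to remove it again with logging.getLogger("winnerz").removeHandler(handler).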

Performance Benchmark

Because of the native C++ multi-threading pipeline and persistent object caching, WinnerZ was significantly faster than PyMuPDF (fitz) at bulk text extraction in the test below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (extract_all_text_concurrent()): ~0.18s (about 2.4x faster)
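
The benchmark script itself is not shown, so the following is only a generic timing sketch for running a comparable measurement. best_of() and speedup() are hypothetical helpers; taking the minimum of several runs is one common way to reduce noise, not necessarily the method used for the figures above.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time, in seconds, of `repeats` calls to func."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Applied to the reported timings, speedup(0.44, 0.18) is roughly 2.44.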

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all in-memory editing and redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.

Download files


Source Distribution

  • winnerz-1.1.9.tar.gz (83.4 MB): Source

Built Distributions

  • winnerz-1.1.9-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.1.9-cp312-cp312-manylinux_2_28_x86_64.whl (7.2 MB): CPython 3.12, manylinux: glibc 2.28+ x86-64
  • winnerz-1.1.9-cp312-cp312-macosx_11_0_arm64.whl (1.4 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.1.9-cp312-cp312-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.1.9-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.1.9-cp311-cp311-manylinux_2_28_x86_64.whl (6.3 MB): CPython 3.11, manylinux: glibc 2.28+ x86-64
  • winnerz-1.1.9-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.1.9-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.1.9-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.1.9-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux: glibc 2.28+ x86-64
  • winnerz-1.1.9-cp310-cp310-macosx_11_0_arm64.whl (1.4 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.1.9-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.1.9-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.1.9-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux: glibc 2.28+ x86-64
  • winnerz-1.1.9-cp39-cp39-macosx_11_0_arm64.whl (1.4 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.1.9-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+ x86-64

File details

Details for the file winnerz-1.1.9.tar.gz.

File metadata

  • Download URL: winnerz-1.1.9.tar.gz
  • Upload date:
  • Size: 83.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.1.9.tar.gz
Algorithm Hash digest
SHA256 b5b1cc416cd1e2bdaec095809ede353ab56c55e7c8296e8c5f64e3efcc409ea1
MD5 1d9b29cb88e1417725bd3483092025f1
BLAKE2b-256 9cebeb854ce8b2caf1ebf4462fbfd07c5cf30b1975c02d73bc4551f69c6c016e


File details

Details for the file winnerz-1.1.9-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: winnerz-1.1.9-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.1.9-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 f61d8acfc31ebbf91099d6e91bc0afb98a3de3e0c60234efe09659568eaeaa92
MD5 1308f44d76caa25241329943cddb597a
BLAKE2b-256 f04afce4047ed690853d1281a12eb93ac42fc88f775f6d1b0509f3f35c959d13


File details

Details for the file winnerz-1.1.9-cp312-cp312-manylinux_2_28_x86_64.whl.

File hashes

Hashes for winnerz-1.1.9-cp312-cp312-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 d52f1975f299277496642d4ff291455d952231f528fe53b8ea15ea88fd515c26
MD5 188b721a0632d32524adacbff9004fea
BLAKE2b-256 2091d39caa87e7f01282ec320313d6b08bd69aeb6bb821d9922513923a50da92


File details

Details for the file winnerz-1.1.9-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for winnerz-1.1.9-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d5599a6e2680273de44836b090cc33be566c98829b468555428407ff1cce0230
MD5 893ce95e9641c272504175ba44a22700
BLAKE2b-256 5b472e6ceb454f8658d642fe78b3ffa84103a77c30f81c9d8bf9b3ecfa4f0bde


File details

Details for the file winnerz-1.1.9-cp312-cp312-macosx_10_9_x86_64.whl.

File hashes

Hashes for winnerz-1.1.9-cp312-cp312-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 e3415787c1136845c12a4309125d8b1adc98a72218e4d41d6ea4d48e5fb65be6
MD5 188619caa40d515be629564ed653fde2
BLAKE2b-256 72a4346b19a5f507d232df01efc3a6d0bb3acbd7af4fe98ee2bcc7d8e19e05ba


File details

Details for the file winnerz-1.1.9-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: winnerz-1.1.9-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.1.9-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 e0426a3026f34a1f82e3496e69ba227333e9e1f1e5cfe608d6c404e2a337e4f5
MD5 ee6ab00ff06c2263ee5bd8389570f246
BLAKE2b-256 4a2d8c7badf5559683f4ca95bdf6b0b5fa952a06f13a276b32a4e6c5be76052a


File details

Details for the file winnerz-1.1.9-cp311-cp311-manylinux_2_28_x86_64.whl.

File hashes

Hashes for winnerz-1.1.9-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 8a279f46e3de5eab6f424150e696582acd7ae4e7adaccfda52b18e2c6da306fb
MD5 8f23e90ca1b28ff9b5f31cc3cc23e8e0
BLAKE2b-256 8cfe8cbe56c6e76528f43fe041e72fac842718af9798e919f8876f08d383310f


File details

Details for the file winnerz-1.1.9-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for winnerz-1.1.9-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 cf807fa3fc836a4b239fdce64204763df076b2598c110bd0d1fa37d39f46725b
MD5 aaac6c9886de307e4d1c2ec957849933
BLAKE2b-256 ee5e543bc87161651ec31fd38d4f2dea156e5cd759a9797cdd12e9a6f8ee716a


File details

Details for the file winnerz-1.1.9-cp311-cp311-macosx_10_9_x86_64.whl.

File hashes

Hashes for winnerz-1.1.9-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 eac61e7a42390acf0e4ef5c0944999f880f7fadc526481b7fa2ed758570126e6
MD5 fc86ee17a750ced204b334bd071140c5
BLAKE2b-256 5f39c38da65ac32ef6f6c4ab3b9848fb8d75b70223040b3e82ece92c45e91fe1


File details

Details for the file winnerz-1.1.9-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: winnerz-1.1.9-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.1.9-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 8e501aa0ad1c9152d9d5e91c8a2ac2773c58f19e7ac56f2e8438ae156340ae2c
MD5 f5aa597c7d8310593e39c28972ef02a6
BLAKE2b-256 d2e0c2257096a1634103bc766b87a4d40e5abe1ebd0cb9b282e414f8a03fe56d


File details

Details for the file winnerz-1.1.9-cp310-cp310-manylinux_2_28_x86_64.whl.

File hashes

Hashes for winnerz-1.1.9-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 d0c7cb77d260b5a6585cecb866bc429d03f8af40bfa2c132ae9d5530fc653b01
MD5 bccc8f27b601da13dec977664586cde4
BLAKE2b-256 7f164b4a09798f82f4df9a4a71a5b5c648e4dde7e62a7ee163b1649035b699d9


File details

Details for the file winnerz-1.1.9-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for winnerz-1.1.9-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 1d5e2afab8a953a3e84274998b76258522c78741a28477b0e71ac93fe6e2cc50
MD5 b0b765f2b9ab0e2dee38b72c2ef4411b
BLAKE2b-256 ca5001f56d3ddd4333236dbda623c55c74e8d2382bc5e2331cd8769894d4649b


File details

Details for the file winnerz-1.1.9-cp310-cp310-macosx_10_9_x86_64.whl.

File hashes

Hashes for winnerz-1.1.9-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 65b615f5add8132711dec6c9b722b5b843f0842b6ea9cc9794e560830b847a27
MD5 e78a9c8ab141a5476f692896c546ddd0
BLAKE2b-256 75b2e302ef3a0aa7b1d2fe9f07dde195ad32d474431499ee7623a141bb7187e6


File details

Details for the file winnerz-1.1.9-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: winnerz-1.1.9-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.1.9-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 4f790bff18c92bde701037c978c33597cf193c1c7c821aa15ee14f86026924c5
MD5 332ceb48c5f5dea4f4290758fb0c010f
BLAKE2b-256 0d660233b93d44142d90332312135f84f8c5f2b52f0a67e051c7fbaa9f1d891c


File details

Details for the file winnerz-1.1.9-cp39-cp39-manylinux_2_28_x86_64.whl.

File hashes

Hashes for winnerz-1.1.9-cp39-cp39-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b926a1f04087fc52dfc8d8420c78ba1d614feff65b240148d9c0e8d1443b94b5
MD5 5bf7b0e786ab1827e6f14a43d9afae0a
BLAKE2b-256 59d080c83a70c11c65a72c29bb4977879a3bf6473f70705ddf04e96449f56fed


File details

Details for the file winnerz-1.1.9-cp39-cp39-macosx_11_0_arm64.whl.

File hashes

Hashes for winnerz-1.1.9-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6e2bf04fd20fdf56bd2ad6230fb81eaf983e0e878e397de47526385bb613312a
MD5 61e2d4eed9798d8d741fdd954cc0b820
BLAKE2b-256 14d1ea5598db6f85d4a0b9cd8a6ea61310f11364cfb9e355f054993e3332f9ef


File details

Details for the file winnerz-1.1.9-cp39-cp39-macosx_10_9_x86_64.whl.

File hashes

Hashes for winnerz-1.1.9-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 0ee9270b9d713af4d77066b97218fe307bb319dedd5b1ca91225e78bfd218f39
MD5 47478a95555d4e4a61b2059c9a8c0100
BLAKE2b-256 f1c91f924613f002b1f5c49347e7bdd4632356cbcb38ed7591b8f4290d17336f

