
A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.


WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
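
The load-once and retry behaviour described above can be sketched in plain Python. This is only an illustration, not the library's actual _load_core(): the function name load_core_once, the importer parameter, and the delay parameter are hypothetical, and the truncation repair and diagnostic steps are left out. Only the retry count (_CORE_IMPORT_RETRIES = 3) and the use of threading.Lock() come from the text.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3          # value quoted in the documentation above
_core_lock = threading.Lock()     # guards the one-time initialization
_core_module = None               # cached result of the first successful import


def load_core_once(module_name, importer=importlib.import_module, delay=0.0):
    """Import `module_name` at most once, retrying transient failures.

    `importer` and `delay` are hypothetical parameters added for this sketch.
    """
    global _core_module
    with _core_lock:
        if _core_module is not None:
            return _core_module  # already initialized by an earlier call
        last_error = None
        for attempt in range(1, _CORE_IMPORT_RETRIES + 1):
            try:
                _core_module = importer(module_name)
                return _core_module
            except (ImportError, OSError) as exc:
                last_error = exc
                if attempt < _CORE_IMPORT_RETRIES:
                    time.sleep(delay)  # brief pause before the next attempt
        raise ImportError(
            f"could not load {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts: {last_error}"
        ) from last_error
```

Holding the lock for the whole retry loop means concurrent callers wait for the first caller's result instead of starting their own import attempts.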

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
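
As a rough sketch of how such a setting might be resolved, the snippet below reads WINNERZ_PREVIEW_BACKEND and maps it to a backend name. The function resolve_preview_backend and its parameters are hypothetical. Returning None when PDFium is unavailable and raising on unknown values are assumptions of this sketch; the documentation does not say what the library does in either case.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")  # values listed in the documentation


def resolve_preview_backend(environ=os.environ, pdfium_available=True):
    """Return the preview backend name, or None if no backend can be used.

    Hypothetical helper: the real resolution logic may differ.
    """
    raw = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if raw not in VALID_BACKENDS:
        # Assumption: reject unknown values instead of silently ignoring them.
        raise ValueError(f"unsupported WINNERZ_PREVIEW_BACKEND: {raw!r}")
    if raw == "auto":
        # "auto" uses PDFium when it is available.
        return "pdfium" if pdfium_available else None
    return raw
```

The variable would normally be set before the process starts, for example with export WINNERZ_PREVIEW_BACKEND=pdfium in a shell.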

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Accepts a file path or raw in-memory bytes (Zero-Disk mode) and immediately initializes the C++ core. If encryption is detected (e.g., RC4/AES), it falls back to an automatic decryption routine that runs in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • extract_all_text_concurrent(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches sized from the hardware concurrency, so the batch size scales with the number of CPU cores. The work runs outside the Python GIL, and batching avoids thread exhaustion (EAGAIN) on very large PDFs of 5000+ pages.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages(output_path, page_rects_map): (Native C++) Performs parallel block redaction across multiple pages and saves the cleaned output directly to a file. This is the recommended and most stable approach.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Same as redact_text_multiple_pages, but returns the cleaned PDF as bytes. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
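
The batching idea behind extract_all_text_concurrent() can be illustrated in pure Python. This is a sketch of the chunking pattern only: iter_batches and extract_all are hypothetical names, the per-page function is supplied by the caller, and Python threads (unlike the C++ std::async pipeline described above) do not bypass the GIL.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def iter_batches(page_count, batch_size):
    """Yield consecutive ranges of page indices, each at most `batch_size` long."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def extract_all(page_count, extract_page, workers=None):
    """Run `extract_page(index)` for every page, one batch at a time.

    The batch size equals the worker count, so no more than `workers`
    tasks are in flight at once, however many pages the document has.
    """
    workers = workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in iter_batches(page_count, workers):
            # pool.map preserves input order, so results stay in page order.
            results.extend(pool.map(extract_page, batch))
    return results
```

Bounding the number of in-flight tasks by the core count is what keeps a 5000-page document from spawning thousands of threads at once.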

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
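
The effect of keep_proportion in show_pdf_page can be pictured as a choice between two scale computations. The helper below is a hypothetical illustration, not part of the winnerz API, and it only computes scale factors; how the library positions the scaled page inside rect is not specified here.

```python
def fit_scale(src_w, src_h, dst_w, dst_h, keep_proportion=True):
    """Return (sx, sy) that scales a src_w x src_h page into a dst_w x dst_h box.

    With keep_proportion=True both axes share the smaller factor, so the
    page fits entirely inside the box without distortion.
    """
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source dimensions must be positive")
    sx = dst_w / src_w
    sy = dst_h / src_h
    if keep_proportion:
        sx = sy = min(sx, sy)
    return sx, sy
```

For a 200 x 100 source page and a 100 x 100 target, the proportional result is (0.5, 0.5), while the stretched result is (0.5, 1.0).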

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
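
The relationship between samples, stride, and n can be shown with a small standalone function. It assumes a row-major buffer with interleaved channels, which is the usual layout for such buffers but is not stated explicitly above; the function name read_pixel is hypothetical.

```python
def read_pixel(samples, width, height, stride, n, x, y):
    """Return the n channel values of pixel (x, y) from a raw sample buffer.

    Assumes rows are stored top to bottom, `stride` bytes apart, with the
    n channels of each pixel stored next to each other.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```

Using stride instead of width * n for the row offset keeps the lookup correct even if rows are padded.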

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
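
A minimal stand-in for these two types, written as dataclasses, shows how the & operator and a scaling matrix might behave. These are illustrative re-implementations rather than the library's classes; in particular, Matrix.apply is a hypothetical method added for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the overlap of both rectangles (may be empty).
        return Rect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )


@dataclass(frozen=True)
class Matrix:
    sx: float = 1.0
    sy: float = 1.0

    def apply(self, rect):
        # Hypothetical helper: scale a rectangle about the origin.
        return Rect(rect.x0 * self.sx, rect.y0 * self.sy,
                    rect.x1 * self.sx, rect.y1 * self.sy)
```

Two rectangles that do not overlap produce an intersection whose is_empty property is True.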

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    [!TIP] If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
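
The signature check used by the global document cache can be sketched as follows. The names file_signature and open_cached, the factory parameter, and the dictionary-based store are hypothetical; only the idea of validating hits by file size and nanosecond modification time comes from the description above.

```python
import os

_document_cache = {}  # absolute path -> (signature, document)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds) for `path`."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached document for `path`, rebuilding it if the file changed.

    `factory` is any callable that builds a document from a path.
    """
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _document_cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]  # file unchanged since it was cached
    document = factory(key)
    _document_cache[key] = (signature, document)
    return document
```

Because every caller of open_cached receives the same object for an unchanged file, code that needs an independent instance should call the factory directly, which mirrors the tip about using winnerz.Document(path).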

Logging

WinnerZ uses standard Python logging under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure levels and handlers with the standard logging API, much as you would for pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)
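
For a slightly fuller setup, a handler and formatter can be attached to the same logger using only the standard library. The function name configure_winnerz_logging and the format string are choices made for this sketch.

```python
import logging


def configure_winnerz_logging(level=logging.DEBUG):
    """Attach a stream handler with a timestamped format to the winnerz logger."""
    logger = logging.getLogger("winnerz")
    logger.setLevel(level)
    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

Calling the function more than once adds one handler per call, so it is best invoked once at application start-up.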

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ completed the bulk text extraction test below in less than half the time PyMuPDF (fitz) needed.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (extract_all_text_concurrent()): ~0.18s (about 2.4x faster)
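
To reproduce this kind of comparison on your own documents, a small timing helper is enough. The helper names best_of and speedup are hypothetical, and the callables passed to best_of are whatever extraction functions you want to compare.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time, in seconds, over `repeats` runs of func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

With the figures quoted above, speedup(0.44, 0.18) evaluates to roughly 2.44.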

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all in-memory editing and redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
