A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native C++, thread-safe PDF token interpreter that uses std::async for parallel multi-page text extraction, so extraction work is not serialized by the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.
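
The fan-out pattern behind layer 3 can be illustrated in Python. This is only an analogy: the library does this in C++ with std::async, and a pure-Python thread pool stays subject to the GIL for CPU-bound work. The names extract_all_pages and extract_page are hypothetical and not part of winnerz.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_all_pages(pages, extract_page, max_workers=None):
    """Run extract_page over every page in parallel, keeping page order.

    `pages` is any iterable of page handles and `extract_page` is a callable
    that returns the text of one page. Both are placeholders for illustration.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so page N's text
        # stays at position N even if pages finish out of order.
        return list(pool.map(extract_page, pages))
```

The design point is that each page is an independent unit of work, so results can be gathered in page order regardless of which worker finishes first.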

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
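
A minimal sketch of the first two guarantees (lock-guarded one-time initialization plus a bounded retry loop) might look like the following. It is not the library's actual _load_core(): the function name load_core, the importer and delay parameters, and the error message are assumptions made for illustration. Only the retry count of 3 and the use of threading.Lock() come from the text above.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3  # value stated in the documentation above
_core_lock = threading.Lock()
_core_module = None


def load_core(importer=importlib.import_module, name="winnerz_core", delay=0.05):
    """Import the native module once, retrying transient failures.

    `importer` and `delay` are hypothetical parameters that make the
    sketch easy to exercise without the real binary.
    """
    global _core_module
    with _core_lock:  # only one thread may run the initialization
        if _core_module is not None:
            return _core_module  # already initialized, reuse it
        last_error = None
        for attempt in range(1, _CORE_IMPORT_RETRIES + 1):
            try:
                _core_module = importer(name)
                return _core_module
            except (ImportError, OSError) as exc:
                last_error = exc
                if attempt < _CORE_IMPORT_RETRIES:
                    time.sleep(delay)  # brief pause before the next try
        raise ImportError(
            f"could not load {name!r} after {_CORE_IMPORT_RETRIES} attempts: "
            f"{last_error}"
        ) from last_error
```

Holding the lock for the whole retry loop keeps the logic simple: a second thread that arrives during loading waits, then takes the early-return path.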

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: Tries PDFium first and falls back to Playwright if PDFium is unavailable.
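
The resolution rule can be sketched as a small function. This is an illustration under assumptions, not the library's code: resolve_preview_backend and detect_available are hypothetical names, and the handling of invalid values (raising ValueError) and of an explicitly requested but missing backend (returning None) is a guess, since the documentation does not specify it.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")
# Import names of the optional packages listed under Dependencies.
_BACKEND_MODULES = {"pdfium": "pypdfium2", "playwright": "playwright"}


def detect_available():
    """Return the set of backends whose Python package can be found."""
    return {
        backend
        for backend, module in _BACKEND_MODULES.items()
        if importlib.util.find_spec(module) is not None
    }


def resolve_preview_backend(environ=None, available=None):
    """Pick a preview backend from WINNERZ_PREVIEW_BACKEND, or None."""
    environ = os.environ if environ is None else environ
    available = detect_available() if available is None else set(available)
    choice = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower() or "auto"
    if choice not in VALID_BACKENDS:
        # Assumption: reject unknown values instead of silently using auto.
        raise ValueError(f"unsupported WINNERZ_PREVIEW_BACKEND: {choice!r}")
    if choice == "auto":
        for backend in ("pdfium", "playwright"):  # PDFium is preferred
            if backend in available:
                return backend
        return None
    return choice if choice in available else None
```

Passing environ and available explicitly keeps the function easy to test. Called with no arguments, it reads the real environment and probes for installed packages.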

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages in parallel using C++ multi-threading (extract_all_text_concurrent). The work runs in native code, so it is not serialized by the Python Global Interpreter Lock (GIL).
  • close(): Cleans up temporary resources, such as decrypted temporary files.
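
A hedged usage sketch for the methods above follows. The helper read_document is hypothetical. It takes the document class as a parameter so that the example does not assume a particular import path, and it returns whatever get_all_text() produces because the return type is not documented here.

```python
def read_document(path, open_document):
    """Open a document, collect its page count and full text, then close it.

    `open_document` is any callable that accepts a path and returns an object
    with __len__, get_all_text() and close(), as described for Document.
    """
    doc = open_document(path)
    try:
        page_count = len(doc)       # Document.__len__()
        text = doc.get_all_text()   # bulk extraction across all pages
        return page_count, text
    finally:
        # close() removes temporary resources such as decrypted files,
        # so it runs even if extraction raises.
        doc.close()
```

With the library installed, this could presumably be called as read_document("report.pdf", winnerz.Document), assuming Document is exposed at the top level of the package. Individual pages would be reached with doc[0] or doc[-1] while the document is still open.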

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): Applies redaction to the specified rectangles and saves the output to a new PDF file.
  • rect (Property): Retrieves the bounding box of the page as a Rect.
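
The fallback behaviour described for get_pixmap() can be sketched generically. This is not the library's code: render_with_fallback, its parameters, and the way a placeholder is detected (a caller-supplied is_placeholder predicate) are assumptions for illustration.

```python
def render_with_fallback(render_core, fallbacks, is_placeholder):
    """Try the core renderer first, then each fallback renderer in order.

    render_core:    zero-argument callable for the primary (C++) renderer.
    fallbacks:      list of (name, zero-argument callable) pairs.
    is_placeholder: predicate telling whether a core result is a placeholder.
    Returns a (backend_name, result) pair.
    """
    errors = []
    try:
        result = render_core()
        if not is_placeholder(result):
            return "core", result
        errors.append("core: returned placeholder data")
    except Exception as exc:  # a failing core must not abort the chain
        errors.append(f"core: {exc}")

    for name, render in fallbacks:
        try:
            return name, render()
        except Exception as exc:
            errors.append(f"{name}: {exc}")

    # Every renderer failed, so report all collected reasons at once.
    raise RuntimeError("all renderers failed: " + "; ".join(errors))
```

Collecting the error of each stage means the final exception explains why every backend was rejected, which is useful when the fallback chain is exhausted.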

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
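
To show how width, n, stride, and samples relate, here is a minimal stand-in for a pixel lookup. PixmapSketch is a hypothetical class written for this example, not the library's Pixmap. It assumes row-major storage with interleaved channels, which is the usual layout for such buffers but is not confirmed by the documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PixmapSketch:
    width: int      # pixels per row
    height: int     # number of rows
    n: int          # channels per pixel, e.g. 4 for RGBA
    stride: int     # bytes per row (may exceed width * n if rows are padded)
    samples: bytes  # raw pixel data

    def pixel(self, x, y):
        """Return the channel values at (x, y) as a tuple of ints."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise IndexError(f"pixel ({x}, {y}) is outside the image")
        # Skip y full rows, then x pixels within the row.
        offset = y * self.stride + x * self.n
        return tuple(self.samples[offset:offset + self.n])
```

Using stride rather than width * n for the row offset keeps the lookup correct when rows carry padding bytes.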

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
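
The geometry types can be approximated with two small dataclasses. RectSketch and MatrixSketch are illustrative stand-ins, not the library's classes. In particular, the apply() method and the definition of is_empty (zero or negative extent on either axis) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectSketch:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def is_empty(self):
        # Assumption: a rectangle with no positive area counts as empty.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        """Intersection: the overlap of the two rectangles (may be empty)."""
        return RectSketch(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )


@dataclass(frozen=True)
class MatrixSketch:
    sx: float = 1.0
    sy: float = 1.0

    def apply(self, rect):
        """Scale a rectangle about the origin (hypothetical helper)."""
        return RectSketch(rect.x0 * self.sx, rect.y0 * self.sy,
                          rect.x1 * self.sx, rect.y1 * self.sy)
```

For example, intersecting (0, 0, 10, 10) with (5, 5, 20, 20) gives (5, 5, 10, 10), while two disjoint rectangles produce a result whose is_empty is True.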

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
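
The signature check used by the global document cache can be sketched as below. This is a simplified illustration, not the library's open(): open_cached, file_signature, and the factory parameter are hypothetical names, and the sketch leaves out locking and the cleanup of replaced entries.

```python
import os

# Maps absolute path -> (signature, document object).
_document_cache = {}


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached document for `path`, rebuilding it if the file changed.

    `factory` is any callable that builds a document from a path.
    """
    key = os.path.abspath(path)
    signature = file_signature(key)
    cached = _document_cache.get(key)
    if cached is not None and cached[0] == signature:
        return cached[1]  # cache hit: the file is unchanged on disk
    document = factory(key)  # miss or stale entry: build a fresh document
    _document_cache[key] = (signature, document)
    return document
```

Comparing size and nanosecond mtime is cheap because it needs only one stat call. The trade-off is that an edit preserving both values would go unnoticed.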

Performance Benchmark

With the native C++ multi-threading pipeline and persistent object caching, WinnerZ was roughly 2.4x faster than PyMuPDF (fitz) in the bulk text extraction test below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
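
For readers who want to repeat a comparison like this on their own files, a minimal timing harness is sketched below. It does not reproduce how the figures above were measured. The helpers best_of and speedup are hypothetical, and the callables passed in would wrap whichever extraction calls are being compared.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time of `repeats` calls to func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# With the approximate timings listed above: 0.44 / 0.18 is about 2.44.
ratio = speedup(0.44, 0.18)
```

Taking the best of several runs reduces noise from disk caching and background load. Because the library caches documents, first-run and warm-run timings may differ and are worth reporting separately.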

Dependencies

  • pypdfium2: Optional. Used for decryption and as the primary preview rendering backend.
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
  • playwright: Optional. Used as a secondary headless browser rendering backend if PDFium is unavailable.

Download files

Source Distribution

  • winnerz-1.0.9.tar.gz (41.4 MB): Source

Built Distributions

  • winnerz-1.0.9-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.0.9-cp312-cp312-manylinux_2_28_x86_64.whl (7.1 MB): CPython 3.12, manylinux: glibc 2.28+ x86-64
  • winnerz-1.0.9-cp312-cp312-macosx_11_0_arm64.whl (1.3 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.0.9-cp312-cp312-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.0.9-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.0.9-cp311-cp311-manylinux_2_28_x86_64.whl (6.2 MB): CPython 3.11, manylinux: glibc 2.28+ x86-64
  • winnerz-1.0.9-cp311-cp311-macosx_11_0_arm64.whl (1.3 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.0.9-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.0.9-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.0.9-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux: glibc 2.28+ x86-64
  • winnerz-1.0.9-cp310-cp310-macosx_11_0_arm64.whl (1.3 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.0.9-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.0.9-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.0.9-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux: glibc 2.28+ x86-64
  • winnerz-1.0.9-cp39-cp39-macosx_11_0_arm64.whl (1.3 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.0.9-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+ x86-64

