A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native, thread-safe C++ PDF token interpreter that uses std::async to extract text from multiple pages in parallel, so the extraction work runs outside the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
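
The lock-plus-retry pattern described above can be sketched as follows. This is a minimal illustration, not the library's _load_core(): the CoreLoader class, its delay parameter, and the injectable importer argument are hypothetical names introduced here, and the truncation repair and diagnostic steps are left out.

```python
import importlib
import threading
import time


class CoreLoader:
    """Load a module once, retrying transient import failures (sketch)."""

    def __init__(self, module_name, retries=3, delay=0.05,
                 importer=importlib.import_module):
        self._module_name = module_name
        self._retries = retries      # mirrors _CORE_IMPORT_RETRIES = 3
        self._delay = delay          # hypothetical pause between attempts
        self._importer = importer    # injectable so the sketch can be tested
        self._lock = threading.Lock()
        self._module = None

    def load(self):
        # The lock serializes callers, so only one thread performs the import.
        with self._lock:
            if self._module is not None:
                return self._module
            last_error = None
            for attempt in range(1, self._retries + 1):
                try:
                    self._module = self._importer(self._module_name)
                    return self._module
                except (ImportError, OSError) as exc:
                    last_error = exc
                    if attempt < self._retries:
                        time.sleep(self._delay)
            raise ImportError(
                f"could not import {self._module_name!r} after "
                f"{self._retries} attempts: {last_error}"
            ) from last_error
```

Holding the lock for the whole retry loop keeps the sketch simple: a second thread that calls load() during a slow import waits and then receives the already-loaded module.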

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: PDFium is tried first; Playwright is used only if PDFium is unavailable.
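
One way to picture the resolution of WINNERZ_PREVIEW_BACKEND is the sketch below. It is an assumption-laden illustration rather than the library's code: the function name resolve_preview_backend, the case-insensitive parsing, the ValueError for unknown values, and the None return when nothing is installed are all choices made for this example.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")
# Import names of the optional packages listed under Dependencies.
_MODULE_FOR_BACKEND = {"pdfium": "pypdfium2", "playwright": "playwright"}


def _module_installed(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_preview_backend(environ=None, is_available=_module_installed):
    """Pick a preview backend name, or None if no candidate is usable."""
    environ = os.environ if environ is None else environ
    raw = environ.get("WINNERZ_PREVIEW_BACKEND", "auto")
    choice = raw.strip().lower() or "auto"   # assumed: case-insensitive
    if choice not in VALID_BACKENDS:
        raise ValueError(f"unsupported preview backend: {raw!r}")
    # "auto" tries PDFium first, then Playwright.
    candidates = ("pdfium", "playwright") if choice == "auto" else (choice,)
    for backend in candidates:
        if is_available(_MODULE_FOR_BACKEND[backend]):
            return backend
    return None
```

Passing the environment and the availability check as arguments keeps the function easy to exercise without installing either backend.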

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately opens the document with the C++ core. If encryption is detected, it falls back to a decryption routine that works on a temporary decrypted file (removed by close()).

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages in parallel using the C++ core's multi-threaded extract_all_text_concurrent routine. Because the work runs in native threads, it is not serialized by the Python Global Interpreter Lock (GIL).
  • close(): Cleans up temporary resources, such as decrypted temporary files.
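
The indexing and cleanup contract above can be illustrated with a stand-in class. SketchDocument is hypothetical and holds plain strings instead of Page objects; it only shows how 0-based and negative indexes map to pages and how contextlib.closing guarantees that close() runs. Whether the real Document supports the with statement directly is not stated here, so the sketch does not assume it.

```python
from contextlib import closing


class SketchDocument:
    """Stand-in for a document; pages are plain strings in this sketch."""

    def __init__(self, pages):
        self._pages = list(pages)
        self.closed = False

    def __len__(self):
        return len(self._pages)

    def __getitem__(self, index):
        count = len(self._pages)
        if index < 0:
            index += count            # -1 refers to the last page
        if not 0 <= index < count:
            raise IndexError("page index out of range")
        return self._pages[index]

    def close(self):
        # The real class would delete temporary decrypted files here.
        self.closed = True


# closing() calls doc.close() even if the body raises.
with closing(SketchDocument(["first", "second", "third"])) as doc:
    page_count = len(doc)
    last_page = doc[-1]
```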

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): Applies redaction to the specified rectangles and saves the output to a new PDF file.
  • rect (Property): Retrieves the bounding box of the page as a Rect.
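
The fallback order described for get_pixmap can be sketched as a small helper. The function and its three callables are hypothetical; in particular, how the library recognizes a placeholder result is not documented here, so the check is passed in as is_placeholder.

```python
def render_with_fallback(render_core, render_preview, is_placeholder):
    """Try the core renderer first; use the preview backend if it fails
    or only produces a placeholder (sketch)."""
    try:
        pixmap = render_core()
    except Exception:
        pixmap = None                 # treat any core failure as "no result"
    if pixmap is None or is_placeholder(pixmap):
        return render_preview()
    return pixmap
```

Catching a broad Exception is a deliberate simplification for the sketch; production code would normally narrow this to the errors the core is known to raise.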

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
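
The relationship between stride, n, and samples can be made concrete with a short sketch. SketchPixmap is a hypothetical stand-in that assumes row-major storage with one byte per channel, which is the usual layout for such buffers but is not confirmed here for winnerz.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchPixmap:
    width: int      # pixels per row
    height: int     # number of rows
    n: int          # channels per pixel, e.g. 4 for RGBA
    stride: int     # bytes per row, may include padding beyond width * n
    samples: bytes  # raw pixel data

    def pixel(self, x, y):
        """Return the channel values of the pixel at (x, y) as a tuple."""
        if not (0 <= x < self.width and 0 <= y < self.height):
            raise IndexError("pixel coordinates out of range")
        offset = y * self.stride + x * self.n
        return tuple(self.samples[offset:offset + self.n])


# A 2x2 RGBA image whose bytes are simply 0..15.
pix = SketchPixmap(width=2, height=2, n=4, stride=8, samples=bytes(range(16)))
bottom_right = pix.pixel(1, 1)   # offset 1 * 8 + 1 * 4 = 12
```

Using stride rather than width * n for the row offset matters when rows are padded for alignment.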

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
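
A rectangle type with an & intersection operator might look like the sketch below. SketchRect is hypothetical; treating a rectangle with non-positive width or height as empty is an assumption of this example, not documented behavior of Rect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def is_empty(self):
        # Assumed convention: no area means empty.
        return self.width <= 0 or self.height <= 0

    def __and__(self, other):
        """Intersection: the largest rectangle contained in both operands."""
        return SketchRect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )


overlap = SketchRect(0, 0, 10, 10) & SketchRect(5, 5, 20, 20)
disjoint = SketchRect(0, 0, 1, 1) & SketchRect(2, 2, 3, 3)
```

Here overlap covers (5, 5) to (10, 10), while disjoint has negative width and height and therefore reports is_empty as True.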

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
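
Validating a cache hit with a file signature can be sketched as follows. The helper names file_signature and open_cached and the factory argument are hypothetical; only the use of file size and nanosecond modification time comes from the description above.

```python
import os

_document_cache = {}   # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    stat = os.stat(path)
    return (stat.st_size, stat.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached object for path, rebuilding it if the file changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _document_cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]                       # file unchanged: reuse
    document = factory(key)                 # new or modified file: rebuild
    _document_cache[key] = (signature, document)
    return document
```

A size-plus-mtime signature is cheap to compute but not a content hash: an edit that preserves both values would go unnoticed, which is the usual trade-off of this approach.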

Performance Benchmark

In the bulk text extraction benchmark below, WinnerZ was faster than PyMuPDF (fitz), a widely used PDF library. The difference is attributed to the native C++ multi-threading pipeline and persistent object caching.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
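
To reproduce this kind of comparison, a small timing harness is enough. The sketch below is not the script used for the numbers above; best_of and speedup are hypothetical helpers, and the callables passed in would wrap whichever extraction calls are being compared.

```python
import time


def best_of(func, repeats=5):
    """Run func several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Using the reported timings: 0.44 s for PyMuPDF, 0.18 s for WinnerZ.
reported_ratio = speedup(0.44, 0.18)
```

With the reported timings the ratio is roughly 2.44. Taking the minimum over several runs reduces noise from caching and background load, and single-run figures can vary noticeably between machines.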

Dependencies

  • pypdfium2: Optional. Used for decryption and as the primary preview rendering backend.
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
  • playwright: Optional. Used as a secondary headless browser rendering backend if PDFium is unavailable.

Download files

Source Distribution

  • winnerz-1.0.7.tar.gz (37.1 MB): Source

Built Distributions

  • winnerz-1.0.7-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.0.7-cp312-cp312-manylinux_2_28_x86_64.whl (7.1 MB): CPython 3.12, manylinux (glibc 2.28+), x86-64
  • winnerz-1.0.7-cp312-cp312-macosx_11_0_arm64.whl (1.3 MB): CPython 3.12, macOS 11.0+, ARM64
  • winnerz-1.0.7-cp312-cp312-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.12, macOS 10.9+, x86-64
  • winnerz-1.0.7-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.0.7-cp311-cp311-manylinux_2_28_x86_64.whl (6.2 MB): CPython 3.11, manylinux (glibc 2.28+), x86-64
  • winnerz-1.0.7-cp311-cp311-macosx_11_0_arm64.whl (1.3 MB): CPython 3.11, macOS 11.0+, ARM64
  • winnerz-1.0.7-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+, x86-64
  • winnerz-1.0.7-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.0.7-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux (glibc 2.28+), x86-64
  • winnerz-1.0.7-cp310-cp310-macosx_11_0_arm64.whl (1.3 MB): CPython 3.10, macOS 11.0+, ARM64
  • winnerz-1.0.7-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+, x86-64
  • winnerz-1.0.7-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.0.7-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux (glibc 2.28+), x86-64
  • winnerz-1.0.7-cp39-cp39-macosx_11_0_arm64.whl (1.3 MB): CPython 3.9, macOS 11.0+, ARM64
  • winnerz-1.0.7-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+, x86-64

