
A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, loading the core library dynamically, and switching between preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native C++, thread-safe PDF token interpreter that uses std::async to extract text from multiple pages in parallel, so extraction is not serialized by the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
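The locking and retry behavior described above can be sketched in a few lines. This is only an illustration, not the library's actual _load_core(): the name load_core_once, the per-name cache, and the delay parameter are assumptions; only the retry count of 3 and the use of threading.Lock() come from the description.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3        # retry count taken from the description above
_core_lock = threading.Lock()   # serializes initialization across threads
_loaded = {}                    # hypothetical cache: module name -> module


def load_core_once(module_name, delay=0.0):
    """Import module_name at most once, retrying failed imports."""
    with _core_lock:
        if module_name in _loaded:
            return _loaded[module_name]
        last_error = None
        for _ in range(_CORE_IMPORT_RETRIES):
            try:
                module = importlib.import_module(module_name)
            except ImportError as exc:
                last_error = exc
                time.sleep(delay)   # give a transient problem time to clear
            else:
                _loaded[module_name] = module
                return module
        raise ImportError(
            f"could not import {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts"
        ) from last_error
```

Holding the lock for the whole import keeps a second thread from starting its own import while the first is still retrying.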

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: Tries PDFium first, then falls back to Playwright if PDFium is unavailable.
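A minimal sketch of how such a setting could be resolved is shown below. The function name resolve_preview_backend, the ValueError on unknown values, and the None result when no backend is installed are assumptions made for illustration; the variable name, its valid values, and the PDFium-before-Playwright order come from the list above.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")


def _is_installed(module_name):
    # find_spec returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None


def resolve_preview_backend(environ=os.environ, is_installed=_is_installed):
    """Return 'pdfium', 'playwright', or None if no backend is usable."""
    raw = environ.get("WINNERZ_PREVIEW_BACKEND", "")
    choice = raw.strip().lower() or "auto"   # unset or empty means auto
    if choice not in VALID_BACKENDS:
        # Assumption: the real library may handle bad values differently.
        raise ValueError(f"unsupported WINNERZ_PREVIEW_BACKEND: {choice!r}")
    if choice != "auto":
        return choice
    if is_installed("pypdfium2"):
        return "pdfium"
    if is_installed("playwright"):
        return "playwright"
    return None
```

Passing environ and is_installed as parameters keeps the sketch easy to exercise without touching the real environment.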

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): An optimized utility that uses C++ multi-threading (extract_all_text_concurrent) to extract text from all pages in parallel. The work runs in native threads, so it is not serialized by the Python Global Interpreter Lock (GIL).
  • close(): Cleans up temporary resources, such as decrypted temporary files.
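The methods above suggest a simple open, read, close pattern. The helper below is a hedged sketch: read_document and its document_factory parameter are hypothetical names, and the return type of get_all_text() is not specified here, so the value is passed through unchanged.

```python
from contextlib import closing


def read_document(document_factory, path):
    """Open a document, collect basic facts, and always call close()."""
    # closing() calls doc.close() even if one of the reads raises.
    with closing(document_factory(path)) as doc:
        page_count = len(doc)
        last_page = doc[-1] if page_count else None   # negative index: last page
        return page_count, last_page, doc.get_all_text()
```

With the library installed, this would presumably be called as read_document(winnerz.Document, "report.pdf"), assuming Document is exported at the top level of the package.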

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): Applies redaction to the specified rectangles and saves the output to a new PDF file.
  • rect (Property): Retrieves the bounding box of the page as a Rect.
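The fallback order described for get_pixmap can be sketched independently of any rendering engine. Everything named here (render_with_fallback and its three callables) is hypothetical; how the library actually detects a placeholder result is not documented in this section.

```python
def render_with_fallback(render_core, render_preview, is_placeholder):
    """Try the core renderer first; use the preview backend if it fails
    or returns placeholder data."""
    try:
        pixmap = render_core()
    except Exception:
        pixmap = None               # a failed core render triggers the fallback
    if pixmap is None or is_placeholder(pixmap):
        return render_preview()
    return pixmap
```

The preview callable is only invoked when needed, so an optional backend such as pypdfium2 or Playwright is never touched on the fast path.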

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
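The relationship between width, n, stride, and samples can be made concrete with a small lookup function. This sketch assumes a row-major buffer with interleaved channels, which is the usual layout for such buffers but is not confirmed here; pixel_at is a hypothetical stand-in for Pixmap.pixel.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of the pixel at (x, y) as a tuple."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Each row starts stride bytes after the previous one; stride may be
    # larger than width * n when rows are padded.
    start = y * stride + x * n
    return tuple(samples[start:start + n])
```

For a 2x2 RGBA buffer with no padding, stride is 8, so the pixel at (1, 1) starts at byte offset 1 * 8 + 1 * 4 = 12.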

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
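As an illustration of the geometry types, here is a minimal stand-in written with dataclasses. It is not the library's implementation: the exact definition of is_empty and the Matrix.scale helper are assumptions added for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # Assumed definition: no positive area.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the largest lower bounds and smallest upper bounds.
        return Rect(max(self.x0, other.x0), max(self.y0, other.y0),
                    min(self.x1, other.x1), min(self.y1, other.y1))


@dataclass(frozen=True)
class Matrix:
    sx: float = 1.0
    sy: float = 1.0

    def scale(self, rect):
        # Hypothetical helper, not a documented method.
        return Rect(rect.x0 * self.sx, rect.y0 * self.sy,
                    rect.x1 * self.sx, rect.y1 * self.sy)
```

Intersecting two rectangles that do not overlap yields a rectangle whose is_empty property is true, which is a convenient test before clipping or redacting.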

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
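The signature check used by the global document cache can be sketched as follows. The names open_cached, file_signature, and _document_cache are hypothetical; only the idea of validating a hit by file size and nanosecond modification time comes from the description above.

```python
import os

_document_cache = {}   # hypothetical: resolved path -> (signature, document)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached document unless the file on disk has changed."""
    key = os.path.realpath(path)
    signature = file_signature(key)
    cached = _document_cache.get(key)
    if cached is not None and cached[0] == signature:
        return cached[1]                    # cache hit: file is unchanged
    document = factory(key)                 # miss or stale entry: reopen
    _document_cache[key] = (signature, document)
    return document
```

Comparing size and mtime is much cheaper than hashing the file, at the cost of missing an edit that preserves both values.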

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ is faster than PyMuPDF (fitz) at bulk text extraction in the project's own benchmark.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
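To reproduce a comparison like this on another machine, a small timing helper is enough. The sketch below is an assumption about how such a measurement might be taken, not the project's benchmark script; best_of and speedup are hypothetical names.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time of `repeats` calls to func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Applied to the timings quoted above, speedup(0.44, 0.18) is roughly 2.44. Taking the minimum over several runs reduces noise from disk caching and other processes.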

Dependencies

  • pypdfium2: Optional. Used for decryption and as the primary preview rendering backend.
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
  • playwright: Optional. Used as a secondary headless browser rendering backend if PDFium is unavailable.

Download files

Source Distribution

  • winnerz-1.1.0.tar.gz (41.4 MB): Source

Built Distributions

  • winnerz-1.1.0-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (7.1 MB): CPython 3.12, manylinux: glibc 2.28+ x86-64
  • winnerz-1.1.0-cp312-cp312-macosx_11_0_arm64.whl (1.3 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.1.0-cp312-cp312-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.1.0-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.1.0-cp311-cp311-manylinux_2_28_x86_64.whl (6.2 MB): CPython 3.11, manylinux: glibc 2.28+ x86-64
  • winnerz-1.1.0-cp311-cp311-macosx_11_0_arm64.whl (1.3 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.1.0-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.1.0-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux: glibc 2.28+ x86-64
  • winnerz-1.1.0-cp310-cp310-macosx_11_0_arm64.whl (1.3 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.1.0-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.1.0-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.1.0-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux: glibc 2.28+ x86-64
  • winnerz-1.1.0-cp39-cp39-macosx_11_0_arm64.whl (1.3 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.1.0-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+ x86-64

