A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native C++, thread-safe PDF token interpreter that uses std::async for parallel multi-page text extraction, so extraction is not serialized by the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
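
The lock-and-retry behavior described above can be sketched in plain Python. This is a minimal illustration, not the library's actual _load_core(): the CoreLoader class, its parameters, and the injected importer are hypothetical names introduced here, and only the retry count of 3 comes from the text.

```python
import importlib
import threading
import time


class CoreLoader:
    """Hypothetical helper: import a module once, retrying transient errors."""

    def __init__(self, module_name, importer=importlib.import_module,
                 retries=3, delay=0.0):
        self._module_name = module_name
        self._importer = importer      # injectable so the logic can be tested
        self._retries = retries        # mirrors _CORE_IMPORT_RETRIES = 3
        self._delay = delay            # pause between attempts (assumption)
        self._lock = threading.Lock()
        self._module = None

    def load(self):
        # The lock ensures only one thread performs the import; later
        # callers see the cached module and return immediately.
        with self._lock:
            if self._module is not None:
                return self._module
            last_error = None
            for attempt in range(1, self._retries + 1):
                try:
                    self._module = self._importer(self._module_name)
                    return self._module
                except (ImportError, OSError) as exc:
                    last_error = exc
                    if attempt < self._retries:
                        time.sleep(self._delay)
            raise ImportError(
                f"could not import {self._module_name!r} after "
                f"{self._retries} attempts: {last_error}"
            ) from last_error
```

Passing the importer as a parameter keeps the sketch independent of any real binary; the truncation repair and diagnostic steps are not modeled here.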

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: Tries PDFium first and falls back to Playwright if PDFium is unavailable.
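
As an illustration of the resolution order, the sketch below picks a backend from the environment variable. The function name, the availability check via importlib.util.find_spec, the ValueError on unknown values, and the None result when nothing is installed are all assumptions made for this example; the library's real handling may differ.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")


def _installed(module_name):
    # True if the top-level module can be found without importing it.
    return importlib.util.find_spec(module_name) is not None


def resolve_preview_backend(environ=os.environ, is_installed=_installed):
    """Hypothetical resolver for WINNERZ_PREVIEW_BACKEND."""
    requested = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if requested not in VALID_BACKENDS:
        # Assumption: reject unknown values rather than guess.
        raise ValueError(f"unsupported preview backend: {requested!r}")
    if requested != "auto":
        return requested
    # auto: prefer PDFium, then Playwright, based on availability.
    if is_installed("pypdfium2"):
        return "pdfium"
    if is_installed("playwright"):
        return "playwright"
    return None  # no preview backend available
```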

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using the C++ multi-threaded routine (extract_all_text_concurrent). Pages are processed in batches sized to the machine's hardware concurrency (scaling with the number of CPU cores), so the work runs outside the Python GIL and avoids thread exhaustion (EAGAIN) on very large PDFs of 5000+ pages.
  • save(path): Saves the current document state. Edits such as redactions and image insertions are applied to the in-memory PDFium document, page merges queued by show_pdf_page are executed at this point, and the result is flushed to disk.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
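
The batching idea behind get_all_text() can be illustrated in Python. This is only a sketch of the chunking pattern: page_batches, extract_all, and the extract_page callback are hypothetical names, and Python threads here do not bypass the GIL the way the native C++ pipeline is described as doing. The point shown is that capping each batch at the worker count bounds the number of tasks in flight, whatever the page count.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, workers=None):
    """Split page indices into consecutive batches of at most `workers`."""
    workers = workers or os.cpu_count() or 1
    return [
        range(start, min(start + workers, page_count))
        for start in range(0, page_count, workers)
    ]


def extract_all(page_count, extract_page, workers=None):
    """Run extract_page(index) for every page, one batch at a time."""
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in page_batches(page_count, workers):
            # pool.map preserves input order, so page order is kept.
            texts.extend(pool.map(extract_page, batch))
    return texts
```

For example, page_batches(10, workers=4) yields three batches covering pages 0-3, 4-7, and 8-9.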

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • add_redact_annot(rect, fill=None): Queues a redaction annotation for a specific area (supports tuple or Rect).
  • apply_redactions(images=0, graphics=0): Applies pending redactions using the in-memory engine. Set images=1 to delete images intersecting the redaction box, or leave it at 0 to redact text only.
  • clean_contents(): Removes all vector graphics and the text layer from the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
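
A typical in-memory redaction flow, using only the methods listed above, might look like the following sketch. The helper name redact_regions and its remove_images flag are hypothetical; the function accepts any object exposing the documented Document and Page methods, so it makes no assumption about how the document was opened.

```python
def redact_regions(doc, page_index, rects, output_path, remove_images=False):
    """Queue redactions on one page, apply them, and save the document.

    `doc` is expected to behave like the Document described above
    (indexing returns a Page; save(path) writes the result).
    """
    page = doc[page_index]
    for rect in rects:
        # Each rect may be a tuple (x0, y0, x1, y1) or a Rect.
        page.add_redact_annot(rect)
    # images=1 also deletes images that intersect a redaction box.
    page.apply_redactions(images=1 if remove_images else 0)
    doc.save(output_path)
```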

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
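
The relationship between samples, stride, and n can be shown with a small lookup function. This sketch assumes the conventional row-major layout in which each row occupies stride bytes and each pixel occupies n consecutive bytes; the source does not state the layout explicitly, and pixel_at is a hypothetical stand-in, not the library's pixel() implementation.

```python
def pixel_at(samples, width, height, stride, n, x, y):
    """Return the n channel values of pixel (x, y) from a raw buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) outside {width}x{height} image")
    # Rows start every `stride` bytes; stride may exceed width * n
    # when rows are padded.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```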

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
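
To make the intersection semantics concrete, here is a minimal stand-in for a rectangle type with the properties named above. SketchRect is a hypothetical class written for this example; it is not the library's Rect, and the rule that a rectangle is empty when its width or height is not positive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def is_empty(self):
        # Assumption: zero or negative extent means "no area".
        return self.width <= 0 or self.height <= 0

    def __and__(self, other):
        # Intersection: the largest top-left and smallest bottom-right.
        return SketchRect(
            max(self.x0, other.x0), max(self.y0, other.y0),
            min(self.x1, other.x1), min(self.y1, other.y1),
        )
```

Two disjoint rectangles produce a result whose is_empty is True, which gives callers a simple overlap test.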

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
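
The signature check used for cache validation can be sketched as follows. DocumentCache, file_signature, and the injectable opener are hypothetical names for this example; only the use of file size and nanosecond modification time comes from the description above.

```python
import os


def file_signature(path):
    """Return (size in bytes, mtime in nanoseconds) for a file."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


class DocumentCache:
    """Hypothetical cache keyed by absolute path, validated by signature."""

    def __init__(self, opener, signature=file_signature):
        self._opener = opener          # callable that opens a document
        self._signature = signature    # injectable for testing
        self._entries = {}

    def open(self, path):
        key = os.path.abspath(path)
        sig = self._signature(key)
        cached = self._entries.get(key)
        if cached is not None and cached[0] == sig:
            return cached[1]           # file unchanged: reuse the instance
        doc = self._opener(key)        # new or modified file: reopen
        self._entries[key] = (sig, doc)
        return doc
```

A changed size or modification time invalidates the entry, so a file rewritten on disk is reopened on the next call.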

Performance Benchmark

In the bulk text extraction benchmark below, WinnerZ ran faster than PyMuPDF (fitz), a result attributed to the native C++ multi-threading pipeline and persistent object caching.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (roughly 2.4x faster, based on these rounded timings)
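
For readers who want to reproduce a comparison like this on their own documents, a small timing harness is sketched below. The helper names best_of and speedup are hypothetical, and the source does not describe how the original measurement was taken; taking the best of several runs is simply one common way to reduce noise.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time (seconds) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

With the rounded timings listed above, speedup(0.44, 0.18) evaluates to about 2.4.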

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all in-memory editing and redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
  • playwright: Optional. Used as a secondary headless browser rendering backend if PDFium is unavailable.
