A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, dynamic core library loading, and interchangeable preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native, thread-safe C++ PDF token interpreter that uses std::async to extract text from multiple pages in parallel, so extraction is not serialized by the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
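The lock-plus-retry pattern described above can be sketched roughly as follows. This is an illustration, not the library's actual _load_core(): the function name load_core, the injectable importer parameter, the delay between attempts, and the _state holder are all assumptions made for the example. Only the retry count of 3 and the use of threading.Lock() come from the text.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3            # value quoted in the text above
_core_lock = threading.Lock()
_state = {"module": None}           # hypothetical holder for the loaded module


def load_core(module_name="winnerz_core", importer=importlib.import_module, delay=0.05):
    """Import the extension once, retrying transient failures (illustrative only)."""
    with _core_lock:                            # one thread initializes at a time
        if _state["module"] is not None:        # already loaded: reuse it
            return _state["module"]
        last_error = None
        for attempt in range(1, _CORE_IMPORT_RETRIES + 1):
            try:
                _state["module"] = importer(module_name)
                return _state["module"]
            except (ImportError, OSError) as exc:
                last_error = exc
                if attempt < _CORE_IMPORT_RETRIES:
                    time.sleep(delay)           # brief pause before the next attempt
        raise ImportError(
            f"could not import {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts: {last_error}"
        ) from last_error
```

Holding the lock for the whole attempt loop means a second thread simply waits and then receives the already-loaded module, which is one way to get the "exactly once" behaviour.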

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: PDFium is tried first; Playwright is used if PDFium is unavailable.
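A minimal sketch of how such a resolution could work is shown below. The function resolve_preview_backend and its parameters are hypothetical, and the handling of unknown values or of an explicitly requested but missing backend is an assumption; the library may behave differently in those cases.

```python
import importlib.util
import os

# Backend name -> importable module that provides it.
_BACKEND_MODULES = {"pdfium": "pypdfium2", "playwright": "playwright"}


def resolve_preview_backend(env=None, is_available=None):
    """Pick a preview backend name, or None if nothing usable is installed."""
    env = os.environ if env is None else env
    if is_available is None:
        # Default check: can the module be found without importing it?
        is_available = lambda name: importlib.util.find_spec(name) is not None

    choice = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if choice in _BACKEND_MODULES:
        # Explicit request: honour it only if the module is installed (assumption).
        return choice if is_available(_BACKEND_MODULES[choice]) else None
    if choice != "auto":
        raise ValueError(f"unsupported WINNERZ_PREVIEW_BACKEND value: {choice!r}")

    # auto: PDFium first, then Playwright.
    for backend in ("pdfium", "playwright"):
        if is_available(_BACKEND_MODULES[backend]):
            return backend
    return None
```

Passing env and is_available explicitly keeps the function easy to exercise without touching the real environment.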

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages in parallel using the C++ multi-threaded routine extract_all_text_concurrent. The work runs in native threads, so it is not limited by the Python Global Interpreter Lock (GIL).
  • close(): Cleans up temporary resources, such as decrypted temporary files.

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): Applies redaction to the specified rectangles and saves the output to a new PDF file.
  • rect (Property): Retrieves the bounding box of the page as a Rect.
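As a usage sketch, the helper below works with any object that exposes the Document and Page methods listed above (__len__, __getitem__ with negative indexing, close, and get_text). The helper name edge_page_text is hypothetical, and it assumes that get_text(mode="text") returns a plain string, which the text above does not state explicitly.

```python
from contextlib import closing


def edge_page_text(doc):
    """Return (first_page_text, last_page_text) for a Document-like object.

    Returns None for an empty document. The document is closed afterwards.
    """
    with closing(doc):                  # calls doc.close() even if extraction fails
        if len(doc) == 0:
            return None
        first = doc[0].get_text(mode="text")    # 0-based index
        last = doc[-1].get_text(mode="text")    # negative index: last page
    return first, last
```

With the real library, this would presumably be called as edge_page_text(Document(path)); that call is not verified here.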

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
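The relationship between samples, stride, and n can be illustrated with a small stand-alone function. It assumes a common layout (rows stored top to bottom, channels interleaved, one byte per channel); the source does not confirm that Pixmap uses exactly this layout, and pixel_at is a hypothetical name.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of pixel (x, y) from a raw sample buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Rows are `stride` bytes apart; stride may exceed width * n if rows are padded.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```

Using stride rather than width * n for the row offset is what keeps the lookup correct when rows carry padding bytes.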

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
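The intersection behaviour of the & operator can be sketched with a small dataclass. SketchRect is a stand-in written for this example, not the library's Rect; in particular, treating zero or negative extent as empty is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def is_empty(self):
        # Assumption: a rectangle with no positive area counts as empty.
        return self.width <= 0 or self.height <= 0

    def __and__(self, other):
        # Intersection: the larger of the minima and the smaller of the maxima.
        return SketchRect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )
```

For two rectangles that do not overlap, the result has a negative width or height, so is_empty reports True.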

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
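The signature-based validation used by the document cache can be illustrated as follows. This is a sketch under stated assumptions: open_cached, file_signature, and the factory parameter are hypothetical names, and the real open(path) may key or evict entries differently. Only the use of file size and nanosecond modification time comes from the text.

```python
import os

_cache = {}     # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached object for path, rebuilding it if the file changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]                   # file unchanged: reuse the cached object
    obj = factory(key)                  # new or modified file: build a fresh one
    _cache[key] = (signature, obj)
    return obj
```

Comparing both size and modification time catches most rewrites cheaply, without reading the file contents.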

Performance Benchmark

Because of the native C++ multi-threading pipeline and persistent object caching, WinnerZ was faster than PyMuPDF (fitz) at bulk text extraction in the test below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (2.5x Faster)

Dependencies

  • pypdfium2: Optional. Used for decryption and as the primary preview rendering backend.
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
  • playwright: Optional. Used as a secondary headless browser rendering backend if PDFium is unavailable.

Download files

Source Distribution

  • winnerz-1.0.8.tar.gz (41.4 MB): Source

Built Distributions

  • winnerz-1.0.8-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.0.8-cp312-cp312-manylinux_2_28_x86_64.whl (7.1 MB): CPython 3.12, manylinux: glibc 2.28+ x86-64
  • winnerz-1.0.8-cp312-cp312-macosx_11_0_arm64.whl (1.3 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.0.8-cp312-cp312-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.0.8-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.0.8-cp311-cp311-manylinux_2_28_x86_64.whl (6.2 MB): CPython 3.11, manylinux: glibc 2.28+ x86-64
  • winnerz-1.0.8-cp311-cp311-macosx_11_0_arm64.whl (1.3 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.0.8-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.0.8-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.0.8-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux: glibc 2.28+ x86-64
  • winnerz-1.0.8-cp310-cp310-macosx_11_0_arm64.whl (1.3 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.0.8-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.0.8-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.0.8-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux: glibc 2.28+ x86-64
  • winnerz-1.0.8-cp39-cp39-macosx_11_0_arm64.whl (1.3 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.0.8-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+ x86-64

File details

Details for the file winnerz-1.0.8.tar.gz.

File metadata

  • Download URL: winnerz-1.0.8.tar.gz
  • Upload date:
  • Size: 41.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.0.8.tar.gz
Algorithm Hash digest
SHA256 bfbece2fea67fcfafbcc826133323e0fd22ebee4a7061eba1c4e0f5426b36649
MD5 1f61fa0060fb0f7ac2b289f99170a5ce
BLAKE2b-256 5a30abce9a74e9e772afc74e5f348540df1e49d2f6feb7ba1309d27d5424f8d7

See more details on using hashes here.
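A downloaded file can be checked against its published SHA256 digest with the standard library alone. The sketch below is generic; the function names are chosen for this example, and the expected digest would be copied from the hash listing for the specific file.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat even for large archives such as the source distribution.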

File details

Details for the file winnerz-1.0.8-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: winnerz-1.0.8-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.0.8-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 5bfe8c71b0482bc33e8e2dbe85636b0fc0dabac64098a7c95d4e056346e93b9e
MD5 422cfea13425be27014f1e15ee4342af
BLAKE2b-256 142c174b0609c156ea58129b3f6a84d6dd2c7924ce94a85310cf24b1ea195967

File details

Details for the file winnerz-1.0.8-cp312-cp312-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp312-cp312-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 cdf27413964accef8e1642f831782b70c97c67252cfd5ccaa434c56bb5467407
MD5 eebdc6a95e808b8001c591bfa33de7e0
BLAKE2b-256 489f109f0c6215ccfd1e51dbf7c8f3dee2a6a0b46ca42191bcbdf8b786ed815f

File details

Details for the file winnerz-1.0.8-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 41786965e27d6ab1e821f15d5e26f80eb577ec11b58cf26548109bdf750cdd03
MD5 a4840c71cc724479fd227b69cb0cebe7
BLAKE2b-256 3bfd2bf79e2704186ccce53838e8e807f9d8fcf1aa3639f9d990358e7ea5c4c4

File details

Details for the file winnerz-1.0.8-cp312-cp312-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp312-cp312-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 915d36c7c9de6f104304df4dec1de7e924d73a1a16403e5fc96e071fed3cff8a
MD5 aee94399811fad812bbcbd2468990ffc
BLAKE2b-256 34ba3ad773c128ce294da5a7d82b5cb9d3ec08f0986ba9bc6c1d573c46af0175

File details

Details for the file winnerz-1.0.8-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: winnerz-1.0.8-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.0.8-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 1f6559aa3ecd139283e08b86d872d1c1ed11ea13617bd06d79b9104a20d6a330
MD5 349abde5c63de20950327b8ab6a46081
BLAKE2b-256 49f737182a8c925a037d30c0ad32965ddc1e660b1e59cc3715e9be3df1d9a6c6

File details

Details for the file winnerz-1.0.8-cp311-cp311-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 6b41dc4f08a7cac4a6855b601e01c5e401289cb44f6b8e8fb364c758bfdbe0c9
MD5 774370ae56ac9e62c4c7f02ffd7f2abb
BLAKE2b-256 50b43f0ddddd9e549dda921620be07bb76d6f9d96a783ec1de5319beceda4517

File details

Details for the file winnerz-1.0.8-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 05dd5109212ee21053e6b59f7d5d8c8f6ba8159ce673e507f954f98f5d1ff672
MD5 e30ee13054ed10252cc96e4e77a5ddc6
BLAKE2b-256 df7cf862ee7ff671220428da5a4aec4d3eb355bfdd2179bddaa1bea39a6c32e9

File details

Details for the file winnerz-1.0.8-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 b76d127e4c9a1d66687ee2e0c1585a89f15953c6de8c3e75df397ece5356319a
MD5 26d7c2b1860b65c73b95b4d4808fc341
BLAKE2b-256 eae9628e8ace48cbb0c170940bcfcd33fcd1a5dec87eb8182ba36e9b1e56b07d

File details

Details for the file winnerz-1.0.8-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: winnerz-1.0.8-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.0.8-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 af916fbffd2c9f9b832b6f465a4aa8e0005ca81e623043d382af7c25bd9ab9a3
MD5 0a657637677b8a1f87f85f293682f17d
BLAKE2b-256 f1635cb7dc0eab81b5e6d93c6279ad042511ec9b45b7e91063543921453192bb

File details

Details for the file winnerz-1.0.8-cp310-cp310-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp310-cp310-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 e121c63e7c373515cdebdc8c20c6c09f959b5c4eac8d9b714176bba5bc58b13c
MD5 93a1777d3f46d0ee0c0cd7c676a69049
BLAKE2b-256 7b236fa22ca7c9016528026ecce032b82ffa0b776f4908d11b5edba97296410a

File details

Details for the file winnerz-1.0.8-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 bc33d60804175740415c8e75214e97b1cbe639b9fe346d7be9f952f1f0ecfb14
MD5 0b6677e3759b19ad36c565f2924c114a
BLAKE2b-256 80bd2855a72aaf8ffd8fe1c68fc3907b2628b187e76de7395720482c4ac6db03

File details

Details for the file winnerz-1.0.8-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 82dbd15724ca794934647900dc98a1f60b41658f0457278b18da787a9dd0fde8
MD5 309fc321190f71946f51129ee9771385
BLAKE2b-256 76ce9ca708da0b1a246b274f025a4cde8ec6f1b2a8faa03b982cda11b5392a01

File details

Details for the file winnerz-1.0.8-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: winnerz-1.0.8-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for winnerz-1.0.8-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 9cecd66758941eb896c975ff5af7fd2a23b36b13327aa99f1c402c1ff5e770be
MD5 e0f082d88cd3f1b7d2e51fe2df803597
BLAKE2b-256 befc6a4bd356cc7edbdb0afdf03ba95eb912aaf768a185a1b08dc541e898f116

File details

Details for the file winnerz-1.0.8-cp39-cp39-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp39-cp39-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 31491e6591089510af06fc6a038ca70aed5f999bd31f102fe217bdd8fa6e4bd5
MD5 71224f05c993b385314b60a03354fa06
BLAKE2b-256 6c519aef54eec9b3aca1a1454fefd1c37a862f08530ca363cfd75c708b43cd4a

File details

Details for the file winnerz-1.0.8-cp39-cp39-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 20c561443c8ffc8761835124a381b7110df32ab96332a3baabebbf804c5b143b
MD5 62285717cc1bcf1ba2077f3743254150
BLAKE2b-256 bfdf8f04be9cb532ed737291767c135aa6eac8418bd91bc6f79f3ef1c2cb653b

File details

Details for the file winnerz-1.0.8-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for winnerz-1.0.8-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 ea2fe51a18b7e97f45bcaf6212f7b92185412ab9e39f3115435a80ee27593605
MD5 2d655d12fc6448a1032046f13ba052d1
BLAKE2b-256 aee5092fccfcf206dcdab04f79b8cdb2b9e127f51f6b31b03ef6e4f190d63afc
