A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
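
The lock-plus-retry behaviour described above can be sketched as follows. This is a minimal illustration, not the library's actual _load_core(): the function name load_core, the importer parameter, and the back-off delay are assumptions made for the example; only the retry count of 3 and the use of threading.Lock() come from the text.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3          # value quoted in the text above
_core_lock = threading.Lock()     # guards one-time initialization
_core_module = None               # cached module after the first success


def load_core(module_name="winnerz_core", importer=importlib.import_module, delay=0.05):
    """Import the native module once, retrying transient failures.

    `importer` and `delay` are hypothetical knobs added for this sketch.
    """
    global _core_module
    with _core_lock:
        if _core_module is not None:
            return _core_module  # already initialized by an earlier call
        last_error = None
        for attempt in range(1, _CORE_IMPORT_RETRIES + 1):
            try:
                _core_module = importer(module_name)
                return _core_module
            except (ImportError, OSError) as exc:
                last_error = exc
                if attempt < _CORE_IMPORT_RETRIES:
                    time.sleep(delay * attempt)  # brief back-off before retrying
        raise ImportError(
            f"could not load {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts: {last_error}"
        ) from last_error
```

Holding the lock for the whole retry loop keeps concurrent callers from starting competing imports; they simply wait and then receive the cached module.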

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: PDFium is tried first; if it is not available, Playwright is used.
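
One way to picture the resolution order is the sketch below. It is an illustration under assumptions, not the library's code: resolve_preview_backend and its parameters are hypothetical names, availability is approximated by checking whether the pypdfium2 or playwright package can be found, and raising ValueError for an unknown value is a choice made for this example.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")


def _is_installed(module_name):
    # find_spec returns None when a top-level package cannot be located
    return importlib.util.find_spec(module_name) is not None


def resolve_preview_backend(environ=os.environ, is_installed=_is_installed):
    """Return 'pdfium', 'playwright', or None if no backend is usable."""
    requested = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower() or "auto"
    if requested not in VALID_BACKENDS:
        # Assumption: the real library may handle invalid values differently.
        raise ValueError(f"unsupported preview backend: {requested!r}")
    if requested != "auto":
        return requested  # explicit choice wins
    if is_installed("pypdfium2"):
        return "pdfium"  # preferred when present
    if is_installed("playwright"):
        return "playwright"  # fallback
    return None
```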

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately opens the file with the C++ core. If encryption is detected, it falls back to decrypting the file into a temporary copy.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using the C++ multi-threaded routine extract_all_text_concurrent. Pages are processed in batches whose size scales with the number of CPU cores (hardware concurrency), so the work runs outside the Python GIL and avoids thread exhaustion (EAGAIN) on very large PDFs of 5000+ pages.
  • save(path): Saves the current document state to path. Edits such as redactions and image insertions are applied to the in-memory PDFium document as they are made, queued page merges (show_pdf_page) are executed at this point, and the result is written to disk.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
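
The batching idea behind get_all_text() can be illustrated in plain Python. This is only a sketch of the chunking pattern: the real work happens in C++ via std::async, whereas this example uses a ThreadPoolExecutor (which does not bypass the GIL for Python callables). The names extract_in_batches and extract_page are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def extract_in_batches(page_count, extract_page, workers=None):
    """Run extract_page(index) for every page, at most `workers` at a time."""
    workers = workers or os.cpu_count() or 1  # scale with available cores
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, page_count, workers):
            batch = range(start, min(start + workers, page_count))
            # map() preserves page order; extend() drains the batch before
            # the next one is submitted, so in-flight tasks stay bounded.
            texts.extend(pool.map(extract_page, batch))
    return texts
```

Bounding the number of in-flight tasks per batch is what keeps a very long document from spawning one thread per page.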

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • add_redact_annot(rect, fill=None): Queues a redaction annotation for a specific area (supports tuple or Rect).
  • apply_redactions(images=0, graphics=0): Applies pending redactions using the in-memory engine. Set images=1 to delete images that intersect the redaction box, or leave it at 0 to redact text only.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
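
The scaling performed by show_pdf_page with keep_proportion can be worked through with a small helper. This is a sketch of the arithmetic only; fit_rect is a hypothetical function, and centring the scaled page inside rect is an assumption, since the text does not say how leftover space is distributed.

```python
def fit_rect(target, src_width, src_height, keep_proportion=True):
    """Return (placed_rect, (scale_x, scale_y)) for a source page in target.

    target is an (x0, y0, x1, y1) tuple.
    """
    if src_width <= 0 or src_height <= 0:
        raise ValueError("source page must have positive dimensions")
    x0, y0, x1, y1 = target
    box_w, box_h = x1 - x0, y1 - y0
    if not keep_proportion:
        # Stretch independently on each axis to fill the whole box.
        return (x0, y0, x1, y1), (box_w / src_width, box_h / src_height)
    # Uniform scale: the tighter axis decides, so the page never overflows.
    scale = min(box_w / src_width, box_h / src_height)
    new_w, new_h = src_width * scale, src_height * scale
    # Assumption for this sketch: centre the result inside the box.
    left = x0 + (box_w - new_w) / 2
    top = y0 + (box_h - new_h) / 2
    return (left, top, left + new_w, top + new_h), (scale, scale)
```

For a 100 x 100 source placed in a 200 x 100 box, the uniform scale is 1.0 and the page occupies the middle 100 units of the box's width.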

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
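
How width, n, stride, and samples relate can be shown with a standalone lookup. This is a sketch of the usual row-major layout, assumed here rather than confirmed for the library; pixel_at is a hypothetical helper, not the Pixmap.pixel method itself.

```python
def pixel_at(samples, stride, n, x, y, width, height):
    """Return the n channel values of pixel (x, y) from a raw buffer."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Each row starts at y * stride; stride can exceed width * n when rows
    # are padded, which is why it is stored separately.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```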

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
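
The geometry types can be approximated with two small dataclasses. These are stand-ins written for illustration, not the library's classes: the names mirror the documentation, but the apply method on Matrix and the exact emptiness rule are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def is_empty(self):
        # Assumption: a rectangle with no positive area counts as empty.
        return self.width <= 0 or self.height <= 0

    def __and__(self, other):
        # Intersection: the overlap of both rectangles (may be empty).
        return Rect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )


@dataclass(frozen=True)
class Matrix:
    sx: float = 1.0
    sy: float = 1.0

    def apply(self, rect):
        # Hypothetical helper: scale a rectangle about the origin.
        return Rect(rect.x0 * self.sx, rect.y0 * self.sy,
                    rect.x1 * self.sx, rect.y1 * self.sy)
```

Two disjoint rectangles intersect to a Rect whose is_empty is True, which gives callers a cheap overlap test.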

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
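
The signature check used by the global document cache can be sketched like this. It is a simplified illustration under assumptions: open_cached, file_signature, and the loader parameter are hypothetical names, and the real cache may key or evict entries differently. Only the use of file size and nanosecond modification time comes from the text.

```python
import os

_cache = {}  # absolute path -> (signature, document object)


def file_signature(path):
    """Cheap change detector: file size plus mtime in nanoseconds."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, loader):
    """Return a cached document unless the file changed on disk."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]  # cache hit: file is unchanged
    document = loader(key)  # miss or stale entry: load again
    _cache[key] = (signature, document)
    return document
```

Comparing size and modification time avoids re-reading or hashing the file on every open, at the cost of missing edits that preserve both values.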

Performance Benchmark

With the native C++ multi-threaded pipeline and persistent object caching, WinnerZ was faster than PyMuPDF (fitz) at bulk text extraction in the following single-document measurement.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
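
A comparison like this can be reproduced with a small timing harness. The sketch below is generic and makes no claim about how the figures above were measured: best_of and speedup are hypothetical helpers, and the callables passed in would wrap whichever extraction calls are being compared.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time of `repeats` calls, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Applied to the quoted timings, speedup(0.44, 0.18) is roughly 2.44.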

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.

Download files

Source Distribution

  • winnerz-1.1.6.tar.gz (74.6 MB)

Built Distributions

  • winnerz-1.1.6-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.1.6-cp312-cp312-manylinux_2_28_x86_64.whl (7.2 MB): CPython 3.12, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.6-cp312-cp312-macosx_11_0_arm64.whl (1.3 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.1.6-cp312-cp312-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.1.6-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.1.6-cp311-cp311-manylinux_2_28_x86_64.whl (6.3 MB): CPython 3.11, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.6-cp311-cp311-macosx_11_0_arm64.whl (1.3 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.1.6-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.1.6-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.1.6-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.6-cp310-cp310-macosx_11_0_arm64.whl (1.3 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.1.6-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.1.6-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.1.6-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.6-cp39-cp39-macosx_11_0_arm64.whl (1.3 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.1.6-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+ x86-64
