
A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.


WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic loading of the core library, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native C++, thread-safe PDF token interpreter that uses std::async to extract text from multiple pages in parallel, outside the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
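
The lock-and-retry part of this behavior can be sketched as follows. This is a minimal illustration, not the library's actual _load_core(): the names load_core and _cache are hypothetical, the delay between attempts is an assumption, and the truncation repair and diagnostic steps are omitted. Only the retry count of 3 comes from the text above.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3  # retry count stated in the documentation above
_lock = threading.Lock()
_cache = {}  # hypothetical: module name -> imported module


def load_core(module_name, retries=_CORE_IMPORT_RETRIES, delay=0.05):
    """Import module_name at most once, retrying on ImportError."""
    with _lock:  # only one thread may run the import sequence at a time
        if module_name in _cache:
            return _cache[module_name]
        last_error = None
        for _ in range(retries):
            try:
                module = importlib.import_module(module_name)
            except ImportError as exc:
                last_error = exc
                time.sleep(delay)  # assumed pause before the next attempt
            else:
                _cache[module_name] = module
                return module
        raise ImportError(
            f"could not import {module_name!r} after {retries} attempts"
        ) from last_error
```

Holding the lock for the whole import sequence means a second thread waits for the first attempt to finish and then reuses the cached module instead of starting its own import.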

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: PDFium is tried first; Playwright is used if PDFium is unavailable.
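
As a rough sketch of how such a setting could be resolved, the function below reads WINNERZ_PREVIEW_BACKEND and applies the PDFium-then-Playwright order for auto. The function name, the availability check via importlib.util.find_spec, the case-insensitive parsing, and the ValueError on unknown values are all assumptions made for illustration; the library's own handling may differ.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")
# Import names of the packages behind each backend.
_BACKEND_MODULES = {"pdfium": "pypdfium2", "playwright": "playwright"}


def _is_installed(backend):
    """Return True if the package for this backend can be found."""
    return importlib.util.find_spec(_BACKEND_MODULES[backend]) is not None


def resolve_preview_backend(environ=os.environ, is_installed=_is_installed):
    """Pick a preview backend name, or None if none is available."""
    raw = environ.get("WINNERZ_PREVIEW_BACKEND", "auto")
    value = raw.strip().lower() or "auto"
    if value not in VALID_BACKENDS:
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unsupported preview backend: {raw!r}")
    if value != "auto":
        return value
    for backend in ("pdfium", "playwright"):  # auto: PDFium first
        if is_installed(backend):
            return backend
    return None
```

Passing environ and is_installed as parameters keeps the sketch easy to exercise without changing the real process environment.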

Class Reference

Document

Represents a PDF document and manages the lifecycle of the underlying file. By default the PDF is opened natively by the C++ core; PDFium is invoked lazily, for decryption or file repair, only if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using the C++ multi-threaded routine extract_all_text_concurrent. Pages are processed in batches sized from the hardware concurrency, so the work scales with the number of CPU cores. Extraction runs outside the Python GIL, and batching avoids thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • save(path): Saves the current document state. Edits such as redactions and image insertions are applied to the in-memory PDFium document as they are made; save() writes the result to disk and executes any page merges queued via show_pdf_page.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
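
The batching idea behind get_all_text() can be illustrated in plain Python. This is only a sketch of the chunking pattern: the real work happens in C++ via extract_all_text_concurrent, whereas the Python threads below remain subject to the GIL. The names page_batches and extract_all, and the choice of one batch per worker count, are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, batch_size):
    """Split page indices 0..page_count-1 into consecutive chunks."""
    return [
        list(range(start, min(start + batch_size, page_count)))
        for start in range(0, page_count, batch_size)
    ]


def extract_all(page_count, extract_page, workers=None):
    """Run extract_page(index) over all pages, one batch at a time."""
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # At most `workers` pages are in flight at once, so the number
        # of threads stays bounded no matter how long the document is.
        for batch in page_batches(page_count, workers):
            texts.extend(pool.map(extract_page, batch))
    return texts
```

Bounding the batch size by the core count is what keeps a 5000-page document from requesting thousands of threads at once.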

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • add_redact_annot(rect, fill=None): Queues a redaction annotation for a specific area (supports tuple or Rect).
  • apply_redactions(images=0, graphics=0): Applies pending redactions using the In-Memory engine. Set images=1 to delete images intersecting the redaction box, or leave it at 0 to redact text only.
  • clean_contents(): Removes all vector graphics and the text layer from the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues an overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect and, if keep_proportion is set, preserving its aspect ratio. The merge itself is performed during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
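
The scale-to-fit behavior described for show_pdf_page can be sketched as a small geometry helper. The function name fit_rect is hypothetical, and centering the scaled page inside rect is an assumption; the documentation above only states that the page is scaled to fit and that keep_proportion preserves the aspect ratio.

```python
def fit_rect(src_width, src_height, target, keep_proportion=True):
    """Return (x0, y0, x1, y1) of a source page placed inside target."""
    if src_width <= 0 or src_height <= 0:
        raise ValueError("source page must have positive dimensions")
    x0, y0, x1, y1 = target
    box_w, box_h = x1 - x0, y1 - y0
    if not keep_proportion:
        # Stretch independently on each axis to fill the whole box.
        return (x0, y0, x1, y1)
    # A single uniform scale: the largest one that still fits both axes.
    scale = min(box_w / src_width, box_h / src_height)
    out_w, out_h = src_width * scale, src_height * scale
    # Assumption: center the scaled page within the target rectangle.
    left = x0 + (box_w - out_w) / 2
    top = y0 + (box_h - out_h) / 2
    return (left, top, left + out_w, top + out_h)
```

For example, a 200 x 100 source placed into the square (0, 0, 100, 100) is scaled by 0.5 and ends up at (0, 25, 100, 75).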

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
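
The relationship between width, n, stride, and samples can be shown with a small lookup helper. This is a sketch of the usual row-major layout, assuming each row starts at a multiple of stride and each pixel occupies n consecutive bytes; pixel_at is a hypothetical stand-alone function, not the library's pixel() method.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of pixel (x, y) as a tuple."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Rows are `stride` bytes apart; stride may exceed width * n
    # when rows are padded.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```

Using stride rather than width * n for the row offset is what keeps the lookup correct when rows carry padding bytes.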

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
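
A rectangle type with these properties can be sketched as below. SimpleRect is a hypothetical stand-in, not the library's Rect; in particular, treating a rectangle as empty when it has no positive area and clamping width and height at zero are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleRect:
    """Illustrative stand-in for a Rect(x0, y0, x1, y1) type."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # Assumption: no positive area means empty.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        """Intersection: the overlap of both rectangles (may be empty)."""
        return SimpleRect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )
```

With this definition, SimpleRect(0, 0, 10, 10) & SimpleRect(5, 5, 20, 20) gives SimpleRect(5, 5, 10, 10), and intersecting two disjoint rectangles yields a result whose is_empty is True.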

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
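
The signature check used by the document cache can be sketched as follows. The names file_signature and open_cached, the dictionary layout, and the loader callback are hypothetical; only the idea of validating a hit by file size and modification time in nanoseconds comes from the text above.

```python
import os

_cache = {}  # hypothetical: absolute path -> (signature, document)


def file_signature(path):
    """Return (size in bytes, mtime in nanoseconds) for the file."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, loader):
    """Return a cached document for path, reloading if the file changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    entry = _cache.get(key)
    if entry is not None and entry[0] == signature:
        return entry[1]  # cache hit: same size and mtime as when loaded
    document = loader(key)  # cache miss or stale entry: load again
    _cache[key] = (signature, document)
    return document
```

Comparing size and nanosecond mtime is cheap compared with hashing file contents, at the cost of missing a rewrite that preserves both values.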

Performance Benchmark

Thanks to the native C++ multi-threaded pipeline and persistent object caching, WinnerZ was faster than PyMuPDF (fitz) at bulk text extraction in the test below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
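
A comparison like this can be reproduced with a small timing harness. The helpers below are a generic sketch, not the script used for the figures above: best_of and speedup are hypothetical names, and taking the best of several runs is one common convention among many.

```python
import time


def best_of(func, repeats=5):
    """Return the shortest wall-clock time of func() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Applied to the two timings listed above, speedup(0.44, 0.18) evaluates to roughly 2.4.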

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.

