A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

Project description

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native, thread-safe C++ PDF token interpreter that uses std::async to extract text from multiple pages in parallel, outside the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
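
The thread-safety and retry guarantees above can be pictured with a small sketch. It is not the library's _load_core() implementation: the function name load_core, the delay parameter, and the injectable importer argument are assumptions made for illustration, and the self-healing and diagnostic steps are left out.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3      # value quoted in the text above
_core_lock = threading.Lock()
_core_module = None           # cached module once the import succeeds


def load_core(module_name="winnerz_core", delay=0.05,
              importer=importlib.import_module):
    """Import module_name at most once, retrying transient failures.

    Hypothetical helper: `delay` and `importer` are illustration-only
    parameters, not part of the documented API.
    """
    global _core_module
    with _core_lock:                      # only one thread initializes
        if _core_module is not None:      # already loaded: reuse it
            return _core_module
        last_error = None
        for attempt in range(1, _CORE_IMPORT_RETRIES + 1):
            try:
                _core_module = importer(module_name)
                return _core_module
            except (ImportError, OSError) as exc:
                last_error = exc
                if attempt < _CORE_IMPORT_RETRIES:
                    time.sleep(delay * attempt)   # short linear backoff
        raise ImportError(
            f"could not load {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts: {last_error}"
        ) from last_error
```

Holding the lock for the whole retry loop keeps the sketch simple: concurrent callers wait for the first attempt to finish rather than racing to import the binary themselves.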

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: Tries PDFium first and falls back to Playwright if PDFium is unavailable.
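
As a rough sketch of how such a setting could be resolved, the function below reads WINNERZ_PREVIEW_BACKEND and checks which backend package is importable. The function name, the environ and is_available parameters, and the choice to return None when nothing is installed are assumptions for illustration, not the library's actual behavior.

```python
import importlib.util
import os

# Backend name -> importable package name (package names from Dependencies).
_BACKEND_MODULES = {"pdfium": "pypdfium2", "playwright": "playwright"}


def resolve_preview_backend(environ=None, is_available=None):
    """Return 'pdfium', 'playwright', or None if no backend is usable.

    Hypothetical helper; `environ` and `is_available` exist only so the
    sketch can be exercised without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    if is_available is None:
        def is_available(name):
            return importlib.util.find_spec(name) is not None

    choice = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if choice == "auto":
        # Documented order: PDFium first, then Playwright.
        for backend in ("pdfium", "playwright"):
            if is_available(_BACKEND_MODULES[backend]):
                return backend
        return None
    if choice not in _BACKEND_MODULES:
        raise ValueError(f"unsupported WINNERZ_PREVIEW_BACKEND: {choice!r}")
    return choice
```

In a shell, the variable would be set before starting Python, for example WINNERZ_PREVIEW_BACKEND=pdfium.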

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading (extract_all_text_concurrent). Pages are processed in batches sized from the hardware concurrency, so the batch size scales with the number of CPU cores. The work runs outside the Python GIL, and batching prevents thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • save(path): Saves the current document state. Edits (such as redactions and image insertions) are held in the in-memory PDFium document, and queued page merging operations (show_pdf_page) are executed at this point; the method then writes the result to disk.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
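
The batching idea behind get_all_text() can be illustrated in pure Python. This is only an analogy: the real work happens in C++ via std::async, whereas the sketch uses a Python thread pool (which does not bypass the GIL) simply to show how bounding each batch by the core count caps the number of live threads. The names extract_in_batches and extract_page are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def extract_in_batches(page_count, extract_page, workers=None):
    """Call extract_page(i) for every page index, `workers` pages at a time.

    Results are returned in page order. `workers` defaults to the CPU
    count, mirroring the hardware-concurrency batching described above.
    """
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, page_count, workers):
            batch = range(start, min(start + workers, page_count))
            # map() preserves input order, so page order is kept.
            texts.extend(pool.map(extract_page, batch))
    return texts
```

Because no more than `workers` tasks are in flight at once, a 5000-page document never requests thousands of threads, which is the failure mode the EAGAIN remark refers to.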

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • add_redact_annot(rect, fill=None): Queues a redaction annotation for a specific area (supports tuple or Rect).
  • apply_redactions(images=0, graphics=0): Applies pending redactions using the in-memory engine. Set images=1 to also delete images that intersect the redaction box, or leave it at 0 to redact text only.
  • clean_contents(): Removes all vector graphics and the text layer from the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues an overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect and, when keep_proportion is set, preserving the aspect ratio. The merge itself is executed during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
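
To make the keep_proportion option of show_pdf_page concrete, here is a minimal sketch of the geometry involved in fitting a source page into a target rectangle. The helper name fit_rect is hypothetical, and centering the scaled page inside the target is an assumption; the library may align it differently.

```python
def fit_rect(target, src_width, src_height, keep_proportion=True):
    """Return the (x0, y0, x1, y1) box a source page would occupy.

    target is an (x0, y0, x1, y1) tuple. With keep_proportion=False the
    page is stretched to fill the target exactly.
    """
    x0, y0, x1, y1 = target
    if not keep_proportion:
        return (x0, y0, x1, y1)
    width, height = x1 - x0, y1 - y0
    # A single uniform scale factor preserves the aspect ratio.
    scale = min(width / src_width, height / src_height)
    new_w, new_h = src_width * scale, src_height * scale
    # Assumption: center the scaled page inside the target box.
    left = x0 + (width - new_w) / 2
    top = y0 + (height - new_h) / 2
    return (left, top, left + new_w, top + new_h)
```

For example, a square 100 x 100 page placed into a 200 x 100 target keeps its size and is centered horizontally.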

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
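
The relationship between samples, stride, and n can be shown with a short lookup function. This is a sketch of how a row-major pixel buffer is typically addressed, assuming samples is laid out row by row with n bytes per pixel; pixel_at is a hypothetical stand-in, not the library's pixel() method.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of the pixel at (x, y) as a tuple."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Each row starts stride bytes after the previous one; each pixel
    # within a row occupies n consecutive bytes.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```

Using stride rather than width * n for the row offset matters when rows are padded to an alignment boundary.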

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
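
A minimal sketch of these geometry types follows. The class is named SketchRect to make clear that it is an illustration, not the library's Rect; treating a rectangle as empty when it has no positive area, and rounding scaled dimensions to whole pixels, are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # Assumption: no positive area means empty.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the innermost edge on every side.
        return SketchRect(max(self.x0, other.x0), max(self.y0, other.y0),
                          min(self.x1, other.x1), min(self.y1, other.y1))


def scaled_size(rect, sx=1.0, sy=1.0):
    """Pixel dimensions of rect after applying a scaling matrix (sx, sy)."""
    return (round(rect.width * sx), round(rect.height * sy))
```

Two disjoint rectangles intersect to a rectangle whose is_empty is True, which lets callers test for overlap without a separate method.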

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
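
The signature-based validation used by the global document cache can be sketched as follows. The names file_signature and open_cached and the factory argument are hypothetical; only the use of file size and nanosecond modification time comes from the description above.

```python
import os

_cache = {}  # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached object for path, rebuilding it if the file changed.

    factory(path) stands in for whatever constructs a document instance.
    """
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]                 # file unchanged: reuse the instance
    obj = factory(key)                # new or modified file: rebuild
    _cache[key] = (signature, obj)
    return obj
```

A size-plus-mtime signature is cheap to compute, at the cost of missing a same-size rewrite that preserves the timestamp.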

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ was faster than PyMuPDF (fitz) at bulk text extraction in the benchmark below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (roughly 2.4x faster)
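
The benchmark script itself is not shown, so the harness below is only a sketch of how such timings could be collected and compared. The helper names best_of and speedup are hypothetical, and taking the best of several runs is one common choice, not necessarily the method used for the figures above.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Applied to the figures quoted above, speedup(0.44, 0.18) comes out at about 2.44.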

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all in-memory editing and redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.

Download files

Source Distribution

  • winnerz-1.1.5.tar.gz (74.6 MB)

Built Distributions

  • winnerz-1.1.5-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.1.5-cp312-cp312-manylinux_2_28_x86_64.whl (7.1 MB): CPython 3.12, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.5-cp312-cp312-macosx_11_0_arm64.whl (1.3 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.1.5-cp312-cp312-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.1.5-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.1.5-cp311-cp311-manylinux_2_28_x86_64.whl (6.2 MB): CPython 3.11, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.5-cp311-cp311-macosx_11_0_arm64.whl (1.3 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.1.5-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.1.5-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.1.5-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.5-cp310-cp310-macosx_11_0_arm64.whl (1.3 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.1.5-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.1.5-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.1.5-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.5-cp39-cp39-macosx_11_0_arm64.whl (1.3 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.1.5-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+ x86-64
