A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
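
The lock-plus-retry pattern described above can be sketched in plain Python. This is an illustrative sketch, not the library's actual _load_core(): the CoreLoader class, its importer parameter, and the back-off delay are hypothetical names introduced here; only the retry count of 3 and the use of threading.Lock() come from the text.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3  # value quoted in the documentation above


class CoreLoader:
    """Hypothetical helper: import a module once, retrying transient failures."""

    def __init__(self, module_name, importer=importlib.import_module,
                 retries=_CORE_IMPORT_RETRIES, delay=0.05):
        self._module_name = module_name
        self._importer = importer      # injectable so the sketch can be tested
        self._retries = retries
        self._delay = delay            # assumed back-off; not from the source
        self._lock = threading.Lock()
        self._module = None

    def load(self):
        # The lock serializes callers, so only one thread performs the import.
        with self._lock:
            if self._module is not None:
                return self._module
            last_error = None
            for attempt in range(1, self._retries + 1):
                try:
                    self._module = self._importer(self._module_name)
                    return self._module
                except (ImportError, OSError) as exc:
                    last_error = exc
                    if attempt < self._retries:
                        time.sleep(self._delay * attempt)
            raise ImportError(
                f"could not load {self._module_name!r} after "
                f"{self._retries} attempts: {last_error}"
            ) from last_error
```

Holding the lock for the whole retry loop keeps the logic simple: concurrent callers wait and then receive the already-loaded module instead of starting their own import attempts.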

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: PDFium is tried first; Playwright is used if PDFium is not available.
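
As a rough sketch of how this resolution might work, the function below reads the variable and picks a backend. The function name, the is_installed hook, the case-insensitive parsing, the ValueError on unknown values, and the None result when no backend is installed are all assumptions made for illustration; the library's own handling may differ.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")


def _is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_preview_backend(environ=os.environ, is_installed=_is_installed):
    """Hypothetical resolver for WINNERZ_PREVIEW_BACKEND."""
    requested = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if requested not in VALID_BACKENDS:
        # Assumption: reject unknown values rather than silently ignoring them.
        raise ValueError(f"unsupported preview backend: {requested!r}")
    if requested != "auto":
        return requested
    # "auto": prefer PDFium, then fall back to Playwright.
    if is_installed("pypdfium2"):
        return "pdfium"
    if is_installed("playwright"):
        return "playwright"
    return None  # assumption: no preview backend available
```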

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Accepts a file path or raw bytes held in memory (Zero-Disk mode) and initializes the C++ core. If encryption is detected (e.g., RC4 or AES), it falls back to an automatic decryption routine that runs in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading (extract_all_text_concurrent). Pages are processed in batches whose size scales with the number of CPU cores. Because the work runs in native code, it is not serialized by the Python GIL, and the batching avoids thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • save(path): Saves the current document state to disk. Editing operations (such as redactions and image insertions) and page merging operations (show_pdf_page) are applied to the in-memory PDFium document; this method writes the result to disk.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Performs parallel block redaction across multiple pages directly at the MuPDF C++ layer and returns the cleaned PDF as bytes. This is considerably faster than looping over pages and redacting them one by one from Python.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
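
The batching idea behind get_all_text() can be illustrated with a pure-Python sketch. This is not the library's implementation: the real work is described as happening in C++ with std::async, and Python threads do not escape the GIL for Python-level work. The sketch only shows how capping each batch at the CPU count bounds the number of in-flight tasks while preserving page order. The names page_batches, extract_all, and extract_page are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, batch_size):
    """Yield consecutive ranges of page indices, at most batch_size each."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def extract_all(page_count, extract_page, workers=None):
    """Run extract_page(index) for every page, one bounded batch at a time."""
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in page_batches(page_count, workers):
            # pool.map returns results in input order, so page order is kept.
            texts.extend(pool.map(extract_page, batch))
    return texts
```

Because each batch finishes before the next one starts, the number of concurrent tasks never exceeds the worker count, however many pages the document has.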

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • add_redact_annot(rect, fill=None): Queues a redaction annotation for a specific area (supports tuple or Rect).
  • apply_redactions(images=0, graphics=0): Applies pending redactions using the In-Memory engine. Set images=1 to also delete images that intersect the redaction box; leave it at 0 to redact text only.
  • clean_contents(): Removes all vector graphics and the text layer from the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
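
To make the keep_proportion behavior of show_pdf_page concrete, the sketch below computes where a source page would land inside a target rectangle. It is an assumption-laden illustration: the fit_rect helper is hypothetical, and centering the scaled page inside the target is a choice made here, not something the documentation specifies.

```python
def fit_rect(target, src_width, src_height, keep_proportion=True):
    """Return (x0, y0, x1, y1) for a source page placed inside target.

    target is an (x0, y0, x1, y1) tuple. With keep_proportion=False the
    source is stretched to fill the target exactly.
    """
    if src_width <= 0 or src_height <= 0:
        raise ValueError("source page must have positive dimensions")
    x0, y0, x1, y1 = target
    if not keep_proportion:
        return (x0, y0, x1, y1)
    box_w, box_h = x1 - x0, y1 - y0
    # Use the smaller scale factor so the page fits in both directions.
    scale = min(box_w / src_width, box_h / src_height)
    width, height = src_width * scale, src_height * scale
    # Assumption: center the scaled page in the leftover space.
    left = x0 + (box_w - width) / 2
    top = y0 + (box_h - height) / 2
    return (left, top, left + width, top + height)
```

For example, a square source page placed into a 200 x 100 target keeps its 1:1 aspect ratio and occupies the middle 100 x 100 region.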

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
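
The relationship between width, n, stride, and samples can be shown with a small sketch of a pixel lookup. It assumes the usual row-major layout (rows stored top to bottom, n bytes per pixel, stride bytes per row); the standalone pixel_at function is hypothetical and is not the library's Pixmap.pixel method.

```python
def pixel_at(samples, stride, n, width, height, x, y):
    """Return the n channel values of the pixel at column x, row y."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Skip y full rows, then x pixels of n bytes each.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```

Note that stride can be larger than width * n when rows are padded, which is why the row offset uses stride rather than the pixel width.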

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
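
A minimal sketch of a rectangle type with the properties and & overload described above is shown here. SketchRect is a hypothetical stand-in written for illustration; the library's Rect may differ in details such as how empty or inverted rectangles are reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    """Illustrative rectangle; not the library's Rect class."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # Intersection: the tightest box contained in both rectangles.
        return SketchRect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )
```

When two rectangles do not overlap, the result has x1 <= x0 or y1 <= y0, so is_empty reports True and width or height clamps to zero.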

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
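
The signature-based validation used by the global document cache can be sketched as follows. The DocumentCache class and its opener callback are hypothetical names; only the idea of keying cache validity on file size and nanosecond modification time comes from the text.

```python
import os


def file_signature(path):
    """Return (size in bytes, mtime in nanoseconds) for path."""
    info = os.stat(path)
    return (info.st_size, info.st_mtime_ns)


class DocumentCache:
    """Illustrative cache: reuse a document until its file changes on disk."""

    def __init__(self, opener):
        self._opener = opener   # callable that builds a document from a path
        self._entries = {}      # absolute path -> (signature, document)

    def open(self, path):
        key = os.path.abspath(path)
        signature = file_signature(key)
        entry = self._entries.get(key)
        if entry is not None and entry[0] == signature:
            return entry[1]     # cache hit: file unchanged since last open
        document = self._opener(key)
        self._entries[key] = (signature, document)
        return document
```

Comparing both size and modification time is cheap (a single stat call) and catches most on-disk changes without hashing the file contents.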

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ is faster than PyMuPDF (fitz) in the bulk text extraction test below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster, based on these rounded timings)
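
For readers who want to reproduce a comparison like this on their own files, a small timing harness along the following lines may help. It is a generic sketch: best_of and speedup are hypothetical helpers, and the callables passed in are left to the reader, since the benchmark script used for the figures above is not part of this documentation.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time, in seconds, over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Taking the minimum over several runs reduces noise from other processes and from cold file-system caches on the first run.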

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
