A robust Python wrapper for high-performance PDF rendering and text extraction, built on a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A thread-safe PDF token interpreter, written in native C++, that uses std::async to extract text from multiple pages in parallel outside the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
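
The lock and retry behavior described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: load_core, its importer and delay parameters, and the module-level names are hypothetical, and only the retry count of 3 comes from the text.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3          # value quoted in the text above
_core_lock = threading.Lock()
_core_module = None


def load_core(module_name, importer=importlib.import_module, delay=0.0):
    """Import `module_name` once, retrying transient failures.

    Hypothetical stand-in for _load_core(); not the library's actual code.
    """
    global _core_module
    with _core_lock:                      # only one thread initializes
        if _core_module is not None:      # already loaded: reuse it
            return _core_module
        last_error = None
        for _attempt in range(_CORE_IMPORT_RETRIES):
            try:
                _core_module = importer(module_name)
                return _core_module
            except (ImportError, OSError) as exc:
                last_error = exc
                time.sleep(delay)         # brief pause before the next try
        raise ImportError(
            f"could not load {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts: {last_error}"
        ) from last_error
```

Holding the lock for the whole initialization means a second thread waits for the first to finish and then receives the cached module instead of importing again.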

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: Tries PDFium first, then falls back to Playwright if PDFium is unavailable.
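
One way the backend selection could work is sketched below. The function name resolve_preview_backend and its injectable parameters are hypothetical, and treating an unrecognized value as auto is an assumption; only the variable name, its three valid values, and the PDFium-then-Playwright order come from the text.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")


def _is_installed(module_name):
    """True if the module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_preview_backend(environ=os.environ, is_installed=_is_installed):
    """Return 'pdfium', 'playwright', or None if no backend is usable."""
    choice = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if choice not in VALID_BACKENDS:
        choice = "auto"   # assumption: unknown values are treated as auto
    if choice == "pdfium":
        return "pdfium" if is_installed("pypdfium2") else None
    if choice == "playwright":
        return "playwright" if is_installed("playwright") else None
    # auto: prefer PDFium, then fall back to Playwright
    if is_installed("pypdfium2"):
        return "pdfium"
    if is_installed("playwright"):
        return "playwright"
    return None
```

Passing environ and is_installed as parameters keeps the decision logic testable without installing either backend.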

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading (extract_all_text_concurrent). Pages are processed in batches whose size scales with the number of CPU cores, so the work runs outside the Python GIL and avoids thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • save(path): Saves the current document state. Edits such as redactions and image insertions are held in PDFium memory; this method flushes them, along with any page merges queued by show_pdf_page, to disk.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
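
The batching idea behind get_all_text() can be illustrated with a Python analogue. This is only a sketch of the scheduling pattern: the real work is described as happening in C++ via std::async, Python threads do not bypass the GIL, and extract_all_text_batched and its extract_page callback are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def extract_all_text_batched(page_count, extract_page, max_workers=None):
    """Extract text page by page in batches sized to the CPU count.

    `extract_page(index)` is a caller-supplied function (hypothetical here).
    """
    workers = max_workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, page_count, workers):
            batch = range(start, min(start + workers, page_count))
            # At most `workers` tasks are in flight; map() keeps page order.
            texts.extend(pool.map(extract_page, batch))
    return texts
```

A thread pool already caps the number of threads; the explicit batch loop is kept to mirror the chunked processing the text describes, where launching one task per page on a very large document could exhaust threads.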

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • add_redact_annot(rect, fill=None): Queues a redaction annotation for a specific area (supports tuple or Rect).
  • apply_redactions(images=0, graphics=0): Applies pending redactions using the in-memory engine. Set images=1 to also delete images that intersect the redaction box, or leave it at 0 to redact text only.
  • clean_contents(): Removes the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True): Queues an overlay operation that places a page from another document (doc_src) onto the current page. The actual merge is executed during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
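
The queue-then-apply redaction flow might be used as in the sketch below. The helper redact_first_page is hypothetical; it only calls methods named in this reference (indexing, add_redact_annot, apply_redactions, save) and accepts any object that behaves like a Document.

```python
def redact_first_page(doc, rects, output_path, remove_images=False):
    """Queue redactions on page 0, apply them, and save the result.

    `doc` is expected to behave like the Document described above.
    Returns the number of redaction areas that were queued.
    """
    page = doc[0]
    for rect in rects:
        page.add_redact_annot(rect)          # tuple or Rect, per the text
    # images=1 also deletes intersecting images; 0 redacts text only.
    page.apply_redactions(images=1 if remove_images else 0)
    doc.save(output_path)
    return len(rects)
```

With the library installed, doc would presumably come from Document(path), followed by doc.close() once the file has been saved.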

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
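
How width, n, stride, and samples relate can be shown with a small lookup function. This sketch assumes a row-major buffer with interleaved channels and one byte per channel, which the text does not state explicitly; pixel_at is a hypothetical helper, not the library's pixel() method.

```python
def pixel_at(samples, width, height, stride, n, x, y):
    """Return the n channel values of pixel (x, y) as a tuple."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) outside {width}x{height} image")
    offset = y * stride + x * n      # start of row y, then column x
    return tuple(samples[offset:offset + n])


samples = bytes(range(16))           # 2x2 RGBA image, 8 bytes per row
print(pixel_at(samples, 2, 2, 8, 4, 1, 1))   # (12, 13, 14, 15)
```

Using stride rather than width * n for the row offset keeps the lookup correct when rows are padded.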

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
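
The & operator on rectangles can be sketched as follows. SketchRect is a hypothetical stand-in written for illustration; the library's Rect may differ in field names beyond x0, y0, x1, y1 and in how it represents an empty result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.width == 0 or self.height == 0

    def __and__(self, other):
        """Intersection; the result is empty if the two do not overlap."""
        return SketchRect(max(self.x0, other.x0), max(self.y0, other.y0),
                          min(self.x1, other.x1), min(self.y1, other.y1))
```

The intersection takes the larger of the two lower bounds and the smaller of the two upper bounds on each axis; when the rectangles are disjoint the bounds cross, and width or height clamps to zero.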

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
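
Signature-based cache validation of the kind described for the global document cache could look like this. The names open_cached, _signature, and factory are hypothetical; the sketch only assumes that a hit is valid when file size and nanosecond modification time are unchanged.

```python
import os

_cache = {}   # absolute path -> (signature, document)


def _signature(path):
    """File size and modification time in nanoseconds."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached document for `path`, rebuilding it if the file changed.

    `factory(path)` creates a new document object (hypothetical callback).
    """
    key = os.path.abspath(path)
    sig = _signature(key)
    hit = _cache.get(key)
    if hit is not None and hit[0] == sig:
        return hit[1]                 # file unchanged: reuse the instance
    doc = factory(key)
    _cache[key] = (sig, doc)          # new or stale entry: replace it
    return doc
```

Comparing size and mtime costs one stat call per lookup, which is far cheaper than hashing the file, at the price of missing a same-size rewrite that preserves the timestamp.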

Performance Benchmark

Thanks to the native C++ multi-threading pipeline and persistent object caching, WinnerZ was faster than PyMuPDF (fitz) at bulk text extraction in the benchmark below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
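
A comparison like this can be reproduced with a small timing harness. The helpers best_of and speedup are hypothetical and are not the script used to obtain the figures above; the final line only checks the ratio of the two quoted timings.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time of `repeats` calls, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


print(round(speedup(0.44, 0.18), 2))   # 2.44
```

Taking the minimum of several runs reduces noise from caching and background load; each library's extraction call would be wrapped in a zero-argument function and passed to best_of.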

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all in-memory editing and redaction operations (including C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
  • playwright: Optional. Used as a secondary headless browser rendering backend if PDFium is unavailable.
