A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a Python wrapper for processing, rendering, and manipulating PDF documents. It relies on a multi-threaded C++ core extension (winnerz_core) for intensive operations and provides fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering via PDFium.

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Micro-OCR Fallback Engine: A built-in OCR engine written in pure C++ that activates automatically when it encounters corrupted or missing ToUnicode tables. It uses 64-bit bitwise packing and hardware POPCOUNT for fast template matching, without external dependencies such as Tesseract.
  5. Rendering Pipeline: Integrates the C++ rendering engine with a fallback Python-based preview engine using pypdfium2.
  6. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
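
The once-only, retrying import described above can be sketched in a few lines. This is a minimal illustration, not the library's actual _load_core(): the names load_core, importer, and delay are hypothetical, and the binary size check and truncation repair are left out.

```python
import importlib
import threading
import time

CORE_IMPORT_RETRIES = 3            # mirrors _CORE_IMPORT_RETRIES from the text
_lock = threading.Lock()
_state = {"core": None}            # holds the loaded module once import succeeds


def load_core(module_name, importer=importlib.import_module, delay=0.05):
    """Import module_name at most once, retrying transient failures.

    Hypothetical helper for illustration; `importer` is injectable so the
    retry logic can be exercised without a real binary extension.
    """
    with _lock:                                # one thread initializes at a time
        if _state["core"] is not None:         # already loaded: reuse it
            return _state["core"]
        last_error = None
        for attempt in range(1, CORE_IMPORT_RETRIES + 1):
            try:
                _state["core"] = importer(module_name)
                return _state["core"]
            except (ImportError, OSError) as exc:
                last_error = exc
                if attempt < CORE_IMPORT_RETRIES:
                    time.sleep(delay)          # brief pause before retrying
        raise ImportError(
            f"could not load {module_name!r} after "
            f"{CORE_IMPORT_RETRIES} attempts: {last_error}"
        ) from last_error
```

Holding the lock for the whole attempt means a second thread waits until the first one finishes, and then takes the cached module instead of importing again.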

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium.
    • Resolution order for auto: Uses PDFium when available.
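
As a rough sketch of how such a setting could be resolved, the helper below reads the variable and applies the "auto" rule described above. The function resolve_preview_backend and its pdfium_available flag are hypothetical. Treating unknown values as auto, and honoring an explicit pdfium request even when PDFium is missing, are assumptions rather than documented behavior.

```python
import os

VALID_BACKENDS = ("auto", "pdfium")   # values listed in the text above


def resolve_preview_backend(environ=os.environ, pdfium_available=True):
    """Return "pdfium", or None when no preview backend should be used.

    Hypothetical helper: the real library may resolve this differently.
    """
    raw = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if raw not in VALID_BACKENDS:
        raw = "auto"                  # assumption: unknown values act like auto
    if raw == "pdfium":
        return "pdfium"               # assumption: an explicit request is honored
    # "auto": use PDFium only when it is available
    return "pdfium" if pdfium_available else None
```

To select a backend explicitly, set the variable in the process environment, for example os.environ["WINNERZ_PREVIEW_BACKEND"] = "pdfium". Setting it before importing winnerz is the safe choice, since the point at which the library reads it is not stated here.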

Advanced Features

Micro-OCR Anti-Obfuscation

WinnerZ includes a built-in, lightweight Micro-OCR engine written entirely in C++. When a PDF intentionally hides its text by removing the ToUnicode table or scrambling encodings, the engine automatically falls back to rendering the vector glyphs and matching them against templates using Intersection-over-Union (IoU).

  • Broad Language Support: Contains 2170+ built-in templates covering English, Vietnamese, Latin Extended, Cyrillic, Greek, and Thai.
  • Hardware Accelerated: Uses 64-bit Bitwise Packing and CPU __popcnt64 instructions to evaluate millions of pixel comparisons in milliseconds.
  • Zero Dependencies: Does not require Tesseract, ONNX, or any heavy AI models.
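
The bit-packed IoU matching idea can be illustrated in Python. This is only a sketch of the technique, not the C++ engine: the function names are hypothetical, Python integers stand in for the 64-bit words, and bin(...).count("1") stands in for the hardware POPCOUNT instruction.

```python
def pack_bitmap(rows):
    """Pack a binary bitmap (equal-length strings of '0'/'1') into one int."""
    bits = 0
    for row in rows:
        for ch in row:
            bits = (bits << 1) | (1 if ch == "1" else 0)
    return bits


def popcount(value):
    """Count set bits; a C++ core would use a hardware instruction here."""
    return bin(value).count("1")


def iou(a, b):
    """Intersection-over-Union of two packed bitmaps of the same size."""
    union = popcount(a | b)
    if union == 0:
        return 0.0                     # two blank bitmaps: define IoU as 0
    return popcount(a & b) / union


def best_match(glyph, templates):
    """Return (label, score) for the template with the highest IoU."""
    scored = ((label, iou(glyph, bits)) for label, bits in templates.items())
    return max(scored, key=lambda item: item[1])
```

Because each comparison is one AND, one OR, and two bit counts per word, scoring a glyph against thousands of templates needs no per-pixel loop.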

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Resolves the file path or raw in-memory bytes (Zero-Disk mode) and initializes the C++ core immediately. If encryption is detected (e.g., RC4/AES), it falls back to an automatic decryption routine that runs in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading. Pages are processed in batches whose size scales with the hardware concurrency (the number of CPU cores). The work runs outside the Python GIL, and the batching prevents thread exhaustion (EAGAIN) on very large PDFs of 5000+ pages.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Performs parallel Block Redaction across multiple pages and returns the cleaned PDF as bytes directly in RAM. Use with caution on very large files to avoid memory pressure.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
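
The batching idea behind get_all_text() can be sketched in Python with a thread pool. This illustrates the chunking pattern only: the real work happens in C++ via std::async, while Python threads like these remain subject to the GIL. The names chunk_ranges, extract_all, and extract_page are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(page_count, chunk_size):
    """Split range(page_count) into consecutive (start, stop) chunks."""
    return [(start, min(start + chunk_size, page_count))
            for start in range(0, page_count, chunk_size)]


def extract_all(page_count, extract_page, workers=None):
    """Run extract_page(index) for every page, one batch at a time."""
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start, stop in chunk_ranges(page_count, workers):
            # At most `workers` tasks are in flight, so a 5000-page file
            # never asks the OS for thousands of threads at once.
            texts.extend(pool.map(extract_page, range(start, stop)))
    return texts
```

Sizing each batch to the core count keeps every core busy while bounding the number of live threads, which is what avoids EAGAIN-style failures on very large documents.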

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the PDFium preview backend.
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • clean_contents(): Completely wipes out the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
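
A short usage sketch of the Document and Page methods listed above follows. The helpers accept any object exposing len(doc), doc[i], get_text, and rect, so they do not import winnerz themselves. It is an assumption that get_text(mode="text") returns a plain string, and the file name in the comment is hypothetical.

```python
def page_texts(doc):
    """Yield (page_number, text) using only len(doc), doc[i] and get_text."""
    for index in range(len(doc)):
        page = doc[index]                       # 0-based page access
        yield index + 1, page.get_text(mode="text")


def last_page_size(doc):
    """Return (width, height) of the final page via negative indexing."""
    rect = doc[-1].rect
    return rect.width, rect.height


# Intended use with the library (not executed here):
#
#   import winnerz
#   doc = winnerz.Document("report.pdf")        # hypothetical file name
#   try:
#       for number, text in page_texts(doc):
#           print(number, len(text))
#   finally:
#       doc.close()                             # release temporary resources
```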

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
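
The relationship between width, n, stride, and samples can be shown with a small sketch. It assumes the common row-major layout with interleaved channels, which is not confirmed for this library, and pixel_at is a hypothetical stand-in for Pixmap.pixel.

```python
def pixel_at(samples, stride, n, x, y):
    """Read one pixel from a row-major buffer with n interleaved channels.

    Assumption: row y starts at byte y * stride, and pixel x within that
    row starts n * x bytes later.
    """
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA image: n = 4 channels, stride = 2 pixels * 4 bytes = 8.
samples = bytes([
    255, 0, 0, 255,    0, 255, 0, 255,     # row 0: red, green
    0, 0, 255, 255,    10, 20, 30, 40,     # row 1: blue, custom
])
bottom_right = pixel_at(samples, stride=8, n=4, x=1, y=1)
```

Note that stride can exceed width * n when rows are padded, which is why the offset uses stride rather than the width.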

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
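
The intersection behavior of the & operator can be sketched with a small stand-in class. RectSketch is hypothetical and is not the library's Rect. In particular, treating a rectangle as empty when its width or height is not positive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectSketch:
    """Illustrative rectangle with corners (x0, y0) and (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0

    @property
    def is_empty(self):
        # Assumption: no positive area means the rectangle is empty.
        return self.width <= 0 or self.height <= 0

    def __and__(self, other):
        # The overlap starts at the larger of the two lower corners
        # and ends at the smaller of the two upper corners.
        return RectSketch(
            max(self.x0, other.x0), max(self.y0, other.y0),
            min(self.x1, other.x1), min(self.y1, other.y1),
        )
```

With this definition, two disjoint rectangles intersect to a result whose is_empty is true, which gives callers a simple overlap test.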

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via winnerz.open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).

    Tip: If you need to open multiple copies of the same file concurrently or bypass this cache (e.g., in background workers), initialize the document directly using winnerz.Document(path) instead of winnerz.open().

  • Preview Document Cache: A separate caching layer strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
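
The signature-based validation used by the global document cache can be sketched as follows. This is a simplified illustration, not the code behind winnerz.open(): open_cached, file_signature, and factory are hypothetical names, and the sketch omits locking and eviction.

```python
import os

_cache = {}   # absolute path -> (signature, document)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached document unless the file changed on disk."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    entry = _cache.get(key)
    if entry is not None and entry[0] == signature:
        return entry[1]                   # cache hit: file is unchanged
    document = factory(key)               # miss or stale entry: reopen
    _cache[key] = (signature, document)
    return document
```

Comparing size and nanosecond mtime is much cheaper than hashing the file contents, at the cost of missing a rewrite that preserves both values.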

Logging

WinnerZ uses standard Python logging under the winnerz logger namespace. Error and debug messages are routed to this logger, so you can configure handlers and levels the same way you would for any other library, similar to pymupdf.

import logging
logging.getLogger("winnerz").setLevel(logging.DEBUG)

Performance Benchmark

Thanks to its native C++ multi-threading pipeline and persistent object caching, WinnerZ ran faster than PyMuPDF (fitz) in the bulk text extraction test below.

Tested on a standard 185-page PDF file:

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (2.5x Faster)
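
A comparison like this can be reproduced with a small timing harness. The sketch below is not the script used for the figures above, and best_of and speedup are hypothetical helpers. Taking the best of several runs reduces noise from caches and background load.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time of `repeats` calls to func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Ratio of the rounded timings quoted above (0.44 s vs 0.18 s).
quoted_ratio = speedup(0.44, 0.18)
```

The rounded timings give a ratio of about 2.4, which is in line with the quoted figure given that both measurements are approximate.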

C++ Micro-OCR Benchmark

Tested on a PDF whose text is 100% obfuscated (forcing Micro-OCR to scan every character):

  • 🐢 Traditional OCR (Tesseract): ~3 to 5 seconds per page
  • ⚡ WinnerZ Micro-OCR (Bitwise Optimized): ~0.33 seconds per page (up to ~15x faster)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).

Download files

Source Distribution

  • winnerz-1.2.4.tar.gz (84.9 MB)

Built Distributions

  • winnerz-1.2.4-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.2.4-cp312-cp312-manylinux_2_28_x86_64.whl (7.3 MB): CPython 3.12, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.4-cp312-cp312-macosx_11_0_arm64.whl (1.4 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.2.4-cp312-cp312-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.2.4-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.2.4-cp311-cp311-manylinux_2_28_x86_64.whl (6.4 MB): CPython 3.11, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.4-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.2.4-cp311-cp311-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.2.4-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.2.4-cp310-cp310-manylinux_2_28_x86_64.whl (5.4 MB): CPython 3.10, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.4-cp310-cp310-macosx_11_0_arm64.whl (1.4 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.2.4-cp310-cp310-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.2.4-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.2.4-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux (glibc 2.28+) x86-64
  • winnerz-1.2.4-cp39-cp39-macosx_11_0_arm64.whl (1.4 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.2.4-cp39-cp39-macosx_10_9_x86_64.whl (1.3 MB): CPython 3.9, macOS 10.9+ x86-64
