A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native C++, thread-safe PDF token interpreter that uses std::async for parallel multi-page text extraction, so the extraction work runs outside the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
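
The lock-and-retry behavior described above can be sketched as follows. This is a minimal illustration, not the library's actual _load_core(): the function name load_core_once, the delay parameter, and the module-level cache variable are assumptions made for the example, and truncation repair and diagnostics are left out.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3       # value quoted in the text above
_core_lock = threading.Lock()  # serializes initialization across threads
_core_module = None            # cached module once loading has succeeded


def load_core_once(module_name, retries=_CORE_IMPORT_RETRIES, delay=0.05):
    """Import module_name at most once, retrying failed imports.

    Hypothetical helper: after the first success, the cached module is
    returned for every later call, whatever name is passed.
    """
    global _core_module
    with _core_lock:
        if _core_module is not None:
            return _core_module
        last_error = None
        for attempt in range(1, retries + 1):
            try:
                _core_module = importlib.import_module(module_name)
                return _core_module
            except ImportError as exc:
                last_error = exc
                if attempt < retries:
                    time.sleep(delay)  # brief pause before the next attempt
        raise ImportError(
            f"could not import {module_name!r} after {retries} attempts"
        ) from last_error
```

Holding the lock for the whole retry loop means concurrent callers wait for the first attempt to finish instead of racing to import the binary.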

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: Tries PDFium first and falls back to Playwright if PDFium is unavailable.
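
One way to picture the resolution order is the sketch below. It is an illustration under stated assumptions rather than the library's code: the helper names, the import names used for the availability check, and the treatment of unknown values as auto are all hypothetical.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")
# Assumed import names for the two optional packages.
BACKEND_MODULES = {"pdfium": "pypdfium2", "playwright": "playwright"}


def is_installed(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_preview_backend(environ=os.environ, available=is_installed):
    """Pick a preview backend name, or None if nothing usable is installed."""
    raw = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    # Assumption: unrecognized values behave like "auto".
    choice = raw if raw in VALID_BACKENDS else "auto"
    if choice != "auto":
        return choice if available(BACKEND_MODULES[choice]) else None
    for name in ("pdfium", "playwright"):  # PDFium first, then Playwright
        if available(BACKEND_MODULES[name]):
            return name
    return None
```

Passing environ and available as parameters keeps the function easy to exercise without touching the real environment or installed packages.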

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): An optimized utility that uses C++ multi-threading (extract_all_text_concurrent) to extract text from all pages in parallel. The work runs in native threads, so it is not serialized by the Python Global Interpreter Lock (GIL).
  • close(): Cleans up temporary resources, such as decrypted temporary files.
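
The methods above combine as in the following sketch. The helper summarize_document is hypothetical and is written against the documented interface only (len, indexing, get_all_text, close), so it accepts any object that provides those methods; the return type of get_all_text() is not stated here, so the value is passed through unchanged.

```python
def summarize_document(doc):
    """Gather basic facts from an object exposing the Document interface.

    Hypothetical helper: it always calls close() so temporary resources
    are released even if extraction raises.
    """
    try:
        page_count = len(doc)                         # __len__
        first_page = doc[0] if page_count else None   # 0-based index
        last_page = doc[-1] if page_count else None   # negative index
        return {
            "page_count": page_count,
            "first_page": first_page,
            "last_page": last_page,
            "text": doc.get_all_text(),               # all pages in one call
        }
    finally:
        doc.close()
```

With the library installed, the helper would presumably be called as summarize_document(Document(path)); the import location of Document is not specified in this description.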

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): Applies redaction to the specified rectangles and saves the output to a new PDF file.
  • rect (Property): Retrieves the bounding box of the page as a Rect.
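
The fallback order described for get_pixmap can be sketched as a small control-flow helper. All three callables are hypothetical stand-ins (the core renderer, the preview backend, and a placeholder check); how the library actually detects a placeholder is not documented here.

```python
def render_with_fallback(render_core, render_preview, is_placeholder):
    """Try the core renderer first, then the preview backend.

    Returns a (result, source) pair where source is "core" or "preview".
    """
    try:
        result = render_core()
    except Exception:
        result = None  # treat a core failure like a missing result
    if result is not None and not is_placeholder(result):
        return result, "core"
    # Core failed or produced placeholder data: use the preview backend.
    return render_preview(), "preview"
```

Returning the source alongside the result makes it easy to log which path produced a given image.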

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
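
The relationship between width, height, n, stride, and samples can be illustrated with a standalone pixel lookup. This is a sketch of the usual row-major layout, assuming one byte per channel; the function pixel_at is hypothetical and is not the library's pixel() implementation.

```python
def pixel_at(samples, width, height, n, stride, x, y):
    """Return the n channel values of pixel (x, y) from a raw buffer.

    Assumes row-major storage with one byte per channel, where each row
    occupies `stride` bytes (which may exceed width * n due to padding).
    """
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    offset = y * stride + x * n  # start of the pixel within the buffer
    return tuple(samples[offset:offset + n])
```

For a 2x2 RGBA buffer with stride 8, the pixel at (1, 1) starts at byte 1 * 8 + 1 * 4 = 12.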

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
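
The intersection behavior of Rect can be sketched with a small dataclass. This is an illustrative stand-in, not the library's class: clamping width and height to zero and treating zero-area rectangles as empty are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle with corners (x0, y0) and (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0, self.x1 - self.x0)  # assumption: never negative

    @property
    def height(self):
        return max(0, self.y1 - self.y0)

    @property
    def is_empty(self):
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        """Intersection: the overlap of the two rectangles (may be empty)."""
        return Rect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )
```

Disjoint rectangles produce a result whose corners are inverted, which is_empty reports as empty.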

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
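
The signature-based validation used by the global document cache can be sketched as follows. The names open_cached, file_signature, and loader are hypothetical; the sketch only shows how a (size, modification time in nanoseconds) pair can decide whether a cached object is still valid.

```python
import os

_cache = {}  # absolute path -> (signature, cached object)


def file_signature(path):
    """Return (size in bytes, mtime in nanoseconds) for the file."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, loader):
    """Return a cached object for path, reloading if the file changed.

    `loader` is a hypothetical callable that builds the object from a path.
    """
    key = os.path.abspath(path)
    signature = file_signature(key)
    entry = _cache.get(key)
    if entry is not None and entry[0] == signature:
        return entry[1]            # cache hit: file unchanged
    obj = loader(key)              # miss or stale entry: rebuild
    _cache[key] = (signature, obj)
    return obj
```

A size and timestamp check is cheap compared with hashing file contents, at the cost of missing edits that preserve both values.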

Performance Benchmark

With the native C++ multi-threading pipeline and persistent object caching, WinnerZ ran faster than PyMuPDF (fitz) in the bulk text extraction benchmark below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
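
A comparison like this can be reproduced with a small timing harness. The sketch below is generic and makes no claims about how the figures above were measured; best_time and speedup are hypothetical helper names, and the callables passed in would wrap whichever extraction calls are being compared.

```python
import time


def best_time(func, repeats=5):
    """Run func several times and return the fastest wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

Applied to the numbers reported above, speedup(0.44, 0.18) is roughly 2.4. Because the library caches document instances, repeated runs on the same file may measure cached behavior rather than a cold start.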

Dependencies

  • pypdfium2: Optional. Used for decryption and as the primary preview rendering backend.
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
  • playwright: Optional. Used as a secondary headless browser rendering backend if PDFium is unavailable.
