A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly in handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A native C++, thread-safe PDF token interpreter that uses std::async to extract text from multiple pages in parallel, so extraction is not serialized by the Python GIL.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
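
The once-only initialization and retry behaviour described above can be sketched in plain Python. This is a minimal illustration, not the library's actual _load_core(): the names load_core_once and flaky_importer are hypothetical, and only threading.Lock() and the retry count of 3 come from the text.

```python
import threading
import time

_CORE_IMPORT_RETRIES = 3  # value quoted in the documentation above
_core_lock = threading.Lock()
_core = None  # cached module-like object once loading succeeds


def load_core_once(importer, retries=_CORE_IMPORT_RETRIES, delay=0.0):
    """Return the cached core, running `importer` at most until it succeeds.

    `importer` is a hypothetical callable standing in for the real binary
    import; it is expected to raise ImportError on failure.
    """
    global _core
    with _core_lock:  # only one thread may initialize at a time
        if _core is not None:
            return _core
        last_error = None
        for _attempt in range(retries):
            try:
                _core = importer()
                return _core
            except ImportError as exc:
                last_error = exc
                time.sleep(delay)  # give transient OS/filesystem issues time to clear
        raise ImportError(
            f"core import failed after {retries} attempts: {last_error}"
        ) from last_error


# Hypothetical importer that fails twice before succeeding.
calls = {"n": 0}


def flaky_importer():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ImportError("transient failure")
    return object()


core = load_core_once(flaky_importer)
```

Holding the lock for the whole retry loop keeps the logic simple: a second thread waits and then receives the cached object instead of starting its own import.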

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: PDFium is tried first; Playwright is used if PDFium is unavailable.
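
As a rough sketch of how such a setting might be resolved, the function below reads WINNERZ_PREVIEW_BACKEND and picks a backend from a set of available ones. The function name, the `available` argument, and the handling of invalid or unavailable values are assumptions made for illustration; the documentation only specifies the valid values and the PDFium-then-Playwright order for auto.

```python
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")


def resolve_preview_backend(available, env=None):
    """Pick a preview backend name, or None if nothing usable is available.

    `available` is a collection of backend names that can be imported on
    this machine (hypothetical input; the real library detects this itself).
    """
    env = os.environ if env is None else env
    requested = env.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if requested not in VALID_BACKENDS:
        # Assumption: reject unknown values rather than silently ignoring them.
        raise ValueError(f"unsupported preview backend: {requested!r}")
    if requested != "auto":
        # Assumption: an explicit choice is not replaced by another backend.
        return requested if requested in available else None
    for name in ("pdfium", "playwright"):  # order stated in the documentation
        if name in available:
            return name
    return None
```

For example, resolve_preview_backend({"playwright"}, env={}) selects Playwright because PDFium is not in the available set.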

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path_or_bytes): Accepts a file path or raw in-memory bytes (Zero-Disk mode) and initializes the C++ core immediately. If encryption is detected (e.g., RC4/AES), it falls back to an automatic decryption routine that runs in RAM or via a temporary file.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading (extract_all_text_concurrent). Pages are processed in batches sized from the hardware concurrency, so the batch size scales with the number of CPU cores. The work runs outside the Python GIL, and batching avoids thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • save(path): Saves the current document state to disk. Edits such as redactions and image insertions are applied to the in-memory PDFium document as they are made, and merges queued by show_pdf_page are executed at this point; save() then writes the result to disk.
  • tobytes(): (Zero-Disk) Returns the finalized PDF as a raw byte array directly from RAM, avoiding any disk I/O.
  • redact_text_multiple_pages_to_bytes(page_rects_map): (Native C++) Performs parallel block redaction across multiple pages at the MuPDF C++ layer and returns the cleaned PDF as bytes. This avoids the overhead of a Python-level redaction loop.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
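
The batching idea behind get_all_text() can be illustrated in Python. This is only a sketch of the technique (bounding the number of in-flight tasks by the core count); the real work happens in C++ with std::async, and the names page_batches, extract_all, and extract_page are hypothetical. The batch size of one page per worker is an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, workers=None):
    """Split page indices 0..page_count-1 into consecutive batches."""
    workers = workers or os.cpu_count() or 1
    # Assumption: each batch holds one page per available worker.
    return [
        range(start, min(start + workers, page_count))
        for start in range(0, page_count, workers)
    ]


def extract_all(page_count, extract_page, workers=None):
    """Run `extract_page(index)` for every page, one batch at a time.

    Only `workers` tasks are in flight at once, so a very large document
    never spawns thousands of threads.
    """
    workers = workers or os.cpu_count() or 1
    texts = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in page_batches(page_count, workers):
            # pool.map preserves input order, so pages stay in sequence.
            texts.extend(pool.map(extract_page, batch))
    return texts
```

Unlike the C++ implementation, Python threads in this sketch still share the GIL; the point here is the bounded batching, not the parallel speedup.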

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • add_redact_annot(rect, fill=None): Queues a redaction annotation for a specific area (supports tuple or Rect).
  • apply_redactions(images=0, graphics=0): Applies pending redactions using the In-Memory engine. Set images=1 to delete images intersecting the redaction box, or leave it at 0 to redact text only.
  • clean_contents(): Removes all vector graphics and the text layer from the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True, keep_proportion=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page, scaling it to fit rect while optionally keeping aspect ratio via keep_proportion. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
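
The scaling step that show_pdf_page describes (fit a source page into rect, optionally keeping the aspect ratio) comes down to a small calculation. The sketch below is an assumption about how such a fit could be computed, not the library's code: the function name fit_rect is hypothetical, and centering the scaled page inside the target is a choice made here for illustration.

```python
def fit_rect(target, src_width, src_height, keep_proportion=True):
    """Return (x0, y0, x1, y1) where a source page would be placed.

    `target` is an (x0, y0, x1, y1) tuple; `src_width` and `src_height`
    are the dimensions of the page being overlaid.
    """
    if src_width <= 0 or src_height <= 0:
        raise ValueError("source page dimensions must be positive")
    x0, y0, x1, y1 = target
    target_w, target_h = x1 - x0, y1 - y0
    if not keep_proportion:
        # Stretch independently on both axes to fill the whole target.
        return (x0, y0, x1, y1)
    # Use the smaller scale factor so the page fits on both axes.
    scale = min(target_w / src_width, target_h / src_height)
    width, height = src_width * scale, src_height * scale
    # Assumption: center the scaled page within the target rectangle.
    left = x0 + (target_w - width) / 2
    top = y0 + (target_h - height) / 2
    return (left, top, left + width, top + height)
```

For a 100 x 100 source page and a 200 x 100 target, the limiting axis is the height, so the page keeps its size and is centered horizontally.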

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
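
The relationship between samples, stride, and n can be shown with a short lookup function. This sketch assumes a row-major buffer with interleaved channels, where stride may include row padding; the function name pixel_at is hypothetical and the library's own pixel(x, y) may differ in its details.

```python
def pixel_at(samples, width, height, stride, n, x, y):
    """Return the n channel values of the pixel at (x, y) as a tuple."""
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Rows start every `stride` bytes; each pixel occupies `n` bytes.
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])


# A 2x2 RGBA image: red, green on the first row; blue, white on the second.
samples = bytes([
    255, 0, 0, 255,    0, 255, 0, 255,
    0, 0, 255, 255,    255, 255, 255, 255,
])
blue = pixel_at(samples, width=2, height=2, stride=8, n=4, x=0, y=1)
```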

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
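
To make the intersection semantics concrete, here is a minimal stand-in for a rectangle type with an overloaded & operator. SketchRect is a hypothetical class written for this example; treating a rectangle with non-positive width or height as empty is an assumption about how is_empty behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRect:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # Assumption: zero or negative extent on either axis means empty.
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        # The overlap starts at the larger origin and ends at the smaller corner.
        return SketchRect(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )


overlap = SketchRect(0, 0, 10, 10) & SketchRect(5, 5, 20, 20)
disjoint = SketchRect(0, 0, 1, 1) & SketchRect(2, 2, 3, 3)
```

When two rectangles do not overlap, the result has inverted corners, which is why the emptiness check matters before using the intersection.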

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
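
Validating a cache hit by file size and modification time can be sketched as follows. The names open_cached, file_signature, and loader are hypothetical and the real open(path) may differ; the only details taken from the text are the two signature metrics.

```python
import os
import tempfile

_document_cache = {}  # absolute path -> (signature, document object)


def file_signature(path):
    """Return (size in bytes, modification time in nanoseconds)."""
    info = os.stat(path)
    return (info.st_size, info.st_mtime_ns)


def open_cached(path, loader):
    """Return a cached document unless the file on disk has changed."""
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _document_cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]  # same size and mtime: reuse the existing object
    document = loader(key)  # `loader` stands in for real document creation
    _document_cache[key] = (signature, document)
    return document


# Demonstration with a temporary file and a loader that records its calls.
loads = []


def loader(p):
    loads.append(p)
    return object()


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.pdf")
    with open(sample, "wb") as fh:
        fh.write(b"%PDF-1.7 first")
    first = open_cached(sample, loader)
    second = open_cached(sample, loader)
    with open(sample, "wb") as fh:
        fh.write(b"%PDF-1.7 second, longer version")  # size changes
    third = open_cached(sample, loader)

same_on_hit = first is second
reloaded_after_change = third is not first
```

Comparing both metrics is cheap because it needs only one stat call, at the cost of missing edits that preserve size and timestamp exactly.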

Performance Benchmark

In the bulk text extraction test below, WinnerZ ran faster than PyMuPDF (fitz), which the project attributes to its native C++ multi-threading pipeline and persistent object caching.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (about 2.4x faster)
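
For readers who want to reproduce a comparison like this on their own files, a small timing helper is sketched below. It is a generic harness, not the benchmark script used for the figures above; best_of and speedup are hypothetical names, and the callables passed in would wrap whichever extraction calls are being compared.

```python
import time


def best_of(fn, repeats=5):
    """Run `fn` several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    # The minimum is less sensitive to background noise than the mean.
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Ratio of the two timings reported above.
reported = speedup(0.44, 0.18)
```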

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
