A robust Python wrapper for high-performance PDF rendering and text extraction using a C++ core.

WinnerZ Python Library Documentation

Overview

The winnerz library is a robust Python wrapper designed for processing, rendering, and manipulating PDF documents. It relies on a high-performance, multi-threaded C++ core extension (winnerz_core) for intensive operations while providing seamless fallback mechanisms and caching strategies in Python.

The architecture emphasizes fast text extraction, reliability, and fault tolerance, particularly when handling binary dependencies, dynamic core library loading, and flexible preview rendering backends (PDFium and Playwright).

Architecture

The system is divided into several conceptual layers:

  1. Core Loader & Diagnostics: Handles the dynamic importing of the C++ binary (winnerz_core), including binary size verification, truncation repair, and Windows DLL directory management.
  2. Document Object Model: Provides Pythonic abstractions (Document, Page) to interact with PDF files, managing resources and state safely.
  3. Thread-Safe Interpreter Pipeline: A C++ native, thread-safe PDF token interpreter that leverages std::async for parallel multi-page text extraction, eliminating GIL bottlenecks.
  4. Rendering Pipeline: Integrates the C++ rendering engine with fallback Python-based preview engines using pypdfium2 or playwright.
  5. Geometry & Data Structures: Implements domain-specific types (Rect, Matrix, Pixmap) to standardize data flow between the C++ layer and Python runtime.

Core Loading Mechanism

The library initializes the C++ binary through _load_core(). This system provides the following safety guarantees:

  • Thread Safety: Uses threading.Lock() to ensure the core is initialized exactly once.
  • Retry Logic: Implements a retry loop (_CORE_IMPORT_RETRIES = 3) to mitigate transient filesystem or OS-level loading issues.
  • Self-Healing: If a truncated binary is detected (e.g., due to an interrupted build or copy), _try_repair_truncated_core_binary() attempts to restore it from other valid candidate binaries in the directory.
  • Diagnostic Reporting: Generates detailed error messages specifying binary ABI mismatches (e.g., GLIBC mismatches) or binary sizes to accelerate debugging.
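
The lock-and-retry behavior listed above can be sketched roughly as follows. This is an illustration only, not the library's actual _load_core(): the helper name load_core_once, the retry delay, and the _loaded cache are assumptions, and the size verification and truncation repair steps are omitted. Only the retry count of 3 and the use of threading.Lock() come from the description.

```python
import importlib
import threading
import time

_CORE_IMPORT_RETRIES = 3        # value taken from the description above
_core_lock = threading.Lock()   # serializes initialization across threads
_loaded = {}                    # hypothetical cache: module name -> module


def load_core_once(module_name, retry_delay=0.05):
    """Import module_name at most once, retrying transient failures.

    Hypothetical helper for illustration; it does not check binary
    size or attempt any repair.
    """
    with _core_lock:
        if module_name in _loaded:
            return _loaded[module_name]
        last_error = None
        for attempt in range(1, _CORE_IMPORT_RETRIES + 1):
            try:
                module = importlib.import_module(module_name)
            except ImportError as exc:
                last_error = exc
                # Pause briefly before the next attempt, but not after the last
                if attempt < _CORE_IMPORT_RETRIES:
                    time.sleep(retry_delay)
            else:
                _loaded[module_name] = module
                return module
        raise ImportError(
            f"could not import {module_name!r} after "
            f"{_CORE_IMPORT_RETRIES} attempts"
        ) from last_error
```

Holding the lock for the whole retry loop means a second thread waits for the first attempt to finish instead of starting a competing import.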

Environment Variables

  • WINNERZ_PREVIEW_BACKEND: Controls the backend used for rendering preview data when the C++ core returns placeholder data.
    • Valid values: auto (default), pdfium, playwright.
    • Resolution order for auto: PDFium is tried first; Playwright is used if PDFium is unavailable.
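
One way to picture this resolution logic is the sketch below. It is not the library's code: resolve_preview_backend and its parameters are hypothetical, the import names pypdfium2 and playwright are assumed to match the packages listed under Dependencies, and treating an unrecognized value as auto is also an assumption.

```python
import importlib.util
import os

VALID_BACKENDS = ("auto", "pdfium", "playwright")

# Import names for each backend (assumed to match the PyPI package names).
_BACKEND_MODULES = {"pdfium": "pypdfium2", "playwright": "playwright"}


def _is_installed(module_name):
    """Return True if module_name can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_preview_backend(environ=None, is_installed=_is_installed):
    """Pick a preview backend based on WINNERZ_PREVIEW_BACKEND.

    Returns "pdfium", "playwright", or None when nothing usable is found.
    """
    environ = os.environ if environ is None else environ
    choice = environ.get("WINNERZ_PREVIEW_BACKEND", "auto").strip().lower()
    if choice not in VALID_BACKENDS:
        choice = "auto"  # assumption: unknown values behave like the default
    if choice == "auto":
        candidates = ("pdfium", "playwright")  # PDFium first, then Playwright
    else:
        candidates = (choice,)
    for name in candidates:
        if is_installed(_BACKEND_MODULES[name]):
            return name
    return None
```

Passing environ and is_installed as arguments keeps the function easy to exercise without touching the real environment.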

Class Reference

Document

Represents a PDF document instance. It manages the lifecycle of the underlying file, featuring a lazy-loading fallback architecture: it natively opens the PDF via C++ by default and only invokes PDFium for decryption or file repair if the document is encrypted or structurally corrupted.

Constructor:

  • Document(path): Resolves the path and immediately initializes the C++ core. If encryption is detected, it falls back to a temporary decryption routine.

Methods:

  • __getitem__(index): Retrieves a Page object at the specified 0-based index. Supports negative indexing.
  • __len__(): Returns the total number of pages in the document.
  • get_all_text(): Extracts text from all pages using C++ multi-threading (extract_all_text_concurrent). Pages are processed in batches whose size is derived from the hardware concurrency, so it scales with the number of CPU cores. The work runs outside the Python GIL, and batching prevents thread exhaustion (EAGAIN) on very large PDFs (5000+ pages).
  • save(path): Saves the current document state. Editing operations (such as redactions and image insertions) and page merging operations (show_pdf_page) are applied to the in-memory PDFium document, and this method flushes them to disk.
  • close(): Cleans up temporary resources, such as decrypted temporary files and in-memory editing buffers.
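
The batching idea behind get_all_text() can be illustrated in Python. The real work happens in C++ with std::async, so the sketch below only mirrors the bounded-batch structure: page_batches and extract_all are hypothetical names, extract_page stands in for a per-page extractor, and Python threads do not bypass the GIL the way the native pipeline does.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def page_batches(page_count, batch_size):
    """Yield ranges of page indices, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def extract_all(page_count, extract_page, workers=None):
    """Call extract_page(i) for every page, one bounded batch at a time.

    At most `workers` tasks are in flight at once, so a very long
    document never spawns one thread per page.
    """
    workers = workers or os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in page_batches(page_count, workers):
            # map() returns results in page order within the batch
            results.extend(pool.map(extract_page, batch))
    return results
```

Sizing each batch to the worker count keeps the number of live threads constant regardless of document length, which is the property that avoids EAGAIN-style failures.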

Page

Represents a single page within a Document.

Methods:

  • get_text(mode="dict", sort=False): Extracts text content.
    • mode: Can be dict, rawdict, blocks, or text.
  • get_drawings(): Extracts vector drawings and graphics, mapping them to structured dictionaries containing rect, fill, and stroke properties.
  • get_pixmap(matrix=None, clip=None): Renders the page to a bitmap image (Pixmap). It attempts to render using the C++ core; if that fails or returns a placeholder, it falls back to the configured Python preview backend (PDFium or Playwright).
  • redact_text(rects, output_path, min_overlap_ratio=0.0): (Legacy C++ Core) Applies text-only redaction to the specified rectangles and saves the output to a new PDF file.
  • add_redact_annot(rect, fill=None): Queues a redaction annotation for a specific area (supports tuple or Rect).
  • apply_redactions(images=0, graphics=0): Applies pending redactions using the In-Memory engine. Set images=1 to also delete images intersecting the redaction box, or leave it at 0 to redact text only.
  • clean_contents(): Removes the vector graphics and text layer of the current page.
  • insert_image(rect, stream=None): Inserts an image (from bytes) into the specified rectangle. It handles internal PDF matrix transformations automatically.
  • show_pdf_page(rect, doc_src, page_idx, overlay=True): Queues a complex overlay operation. It places a page from another document (doc_src) onto the current page. The actual merge is executed efficiently during doc.save().
  • rect (Property): Retrieves the bounding box of the page as a Rect.
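
As a rough usage sketch, the in-memory redaction methods above might be combined as shown below. The helper redact_regions is hypothetical and relies only on the method names and parameters documented here; it accepts any object that behaves like a Document, so it makes no claim about how the library is imported.

```python
def redact_regions(doc, page_index, rects, out_path, remove_images=False):
    """Queue redactions on one page, apply them, and save the document.

    doc is expected to behave like the Document described above;
    rects is an iterable of (x0, y0, x1, y1) tuples or Rect objects.
    """
    page = doc[page_index]
    for rect in rects:
        # Each call only queues an annotation; nothing is removed yet
        page.add_redact_annot(rect)
    # images=1 also deletes images that intersect a redaction box
    page.apply_redactions(images=1 if remove_images else 0)
    doc.save(out_path)
```

With the library installed, a caller would open a Document, pass it to this function with a page index and a list of rectangles, and call close() on the document afterwards.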

Pixmap

Represents an uncompressed image buffer containing pixel data.

Properties:

  • width, height: Dimensions in pixels.
  • n: Number of channels (e.g., 4 for RGBA).
  • stride: Number of bytes per row.
  • samples: Raw byte array of pixel data.

Methods:

  • pixel(x, y): Returns a tuple representing the pixel color at the specified coordinates.
  • tobytes(fmt="raw"): Encodes the pixmap to the requested format. Supported formats include raw, rgba, png, jpg, and jpeg. Output formats other than raw require the Pillow library.
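
The relationship between stride, n, and samples can be illustrated with a small lookup function. The sketch assumes row-major storage with interleaved channels, which is a common layout for such buffers but is not confirmed by the text above; pixel_at is a hypothetical stand-in, not the library's Pixmap.pixel().

```python
def pixel_at(samples, width, height, stride, n, x, y):
    """Return the n channel values of the pixel at (x, y) as a tuple.

    samples is a bytes-like buffer, stride is the number of bytes per
    row (possibly larger than width * n if rows are padded).
    """
    if not (0 <= x < width and 0 <= y < height):
        raise IndexError(f"pixel ({x}, {y}) is outside {width}x{height}")
    # Skip y full rows, then x pixels of n bytes each
    offset = y * stride + x * n
    return tuple(samples[offset:offset + n])
```

Using stride instead of width * n for the row step is what keeps the lookup correct when rows carry padding bytes.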

Geometry Classes

  • Rect(x0, y0, x1, y1): Represents a rectangle. Provides properties for width, height, and is_empty. Overloads the & operator to compute the intersection of two rectangles.
  • Matrix(sx=1.0, sy=1.0): Represents a 2D scaling matrix.
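
To make the & operator concrete, here is a minimal stand-in for a rectangle type with intersection. RectSketch is hypothetical, and the exact rule the library uses for is_empty (here: zero or negative extent on either axis) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectSketch:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self):
        return max(0.0, self.x1 - self.x0)

    @property
    def height(self):
        return max(0.0, self.y1 - self.y0)

    @property
    def is_empty(self):
        # Assumed rule: no area when either extent is zero or negative
        return self.x1 <= self.x0 or self.y1 <= self.y0

    def __and__(self, other):
        """Intersection: the overlap of both rectangles (may be empty)."""
        return RectSketch(
            max(self.x0, other.x0),
            max(self.y0, other.y0),
            min(self.x1, other.x1),
            min(self.y1, other.y1),
        )
```

For disjoint rectangles the result has inverted corners, which is_empty reports as empty rather than raising an error.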

Caching Strategy

The module implements file-based caching for document instances to minimize redundant initialization and file I/O operations.

  • Global Document Cache: Managed via open(path). Validates cache hits using file signature metrics (file size and modification time in nanoseconds).
  • Preview Document Cache: A separate caching layer (_open_preview_pdfium_doc) strictly for the pypdfium2 rendering backend to keep the preview document context alive across multiple page renders.
  • C++ Thread-Safe Font Cache: The C++ core utilizes a lock-guarded (std::recursive_mutex) internal cache for Unicode, Width, and CodeSpace maps to prevent data races during parallel text extraction.
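
The signature check used by the global document cache can be sketched as below. The names open_cached, file_signature, and the factory argument are hypothetical; only the idea of validating a hit by file size and nanosecond modification time comes from the description above.

```python
import os

_doc_cache = {}  # absolute path -> (signature, document)


def file_signature(path):
    """Return (size in bytes, mtime in nanoseconds) for path."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def open_cached(path, factory):
    """Return a cached document for path, rebuilding it if the file changed.

    factory is any callable that builds a document from a path.
    """
    key = os.path.abspath(path)
    signature = file_signature(key)
    hit = _doc_cache.get(key)
    if hit is not None and hit[0] == signature:
        return hit[1]              # file unchanged: reuse the instance
    doc = factory(key)             # new or modified file: rebuild
    _doc_cache[key] = (signature, doc)
    return doc
```

Comparing size and mtime is much cheaper than hashing the file, at the cost of missing an edit that preserves both values.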

Performance Benchmark

With the native C++ multi-threading pipeline and persistent object caching, WinnerZ was faster than PyMuPDF (fitz) at bulk text extraction in the test below.

Tested on a 185-page PDF document (2024-annual-report.pdf):

  • ⏱️ PyMuPDF (fitz): ~0.44s
  • 🚀 WinnerZ (get_all_text()): ~0.18s (roughly 2.4x faster, based on these rounded timings)

Dependencies

  • pypdfium2: Optional but highly recommended. Used for decryption, primary preview rendering, and all In-Memory editing/redaction operations (including high-speed C-level XObject merging).
  • Pillow (PIL): Optional. Required for encoding Pixmap instances to PNG/JPEG and manipulating preview images.
  • playwright: Optional. Used as a secondary headless browser rendering backend if PDFium is unavailable.

Download files

Source Distribution

  • winnerz-1.1.3.tar.gz (58.0 MB): Source

Built Distributions

  • winnerz-1.1.3-cp312-cp312-win_amd64.whl (1.2 MB): CPython 3.12, Windows x86-64
  • winnerz-1.1.3-cp312-cp312-manylinux_2_28_x86_64.whl (7.1 MB): CPython 3.12, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.3-cp312-cp312-macosx_11_0_arm64.whl (1.3 MB): CPython 3.12, macOS 11.0+ ARM64
  • winnerz-1.1.3-cp312-cp312-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.12, macOS 10.9+ x86-64
  • winnerz-1.1.3-cp311-cp311-win_amd64.whl (1.2 MB): CPython 3.11, Windows x86-64
  • winnerz-1.1.3-cp311-cp311-manylinux_2_28_x86_64.whl (6.2 MB): CPython 3.11, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.3-cp311-cp311-macosx_11_0_arm64.whl (1.3 MB): CPython 3.11, macOS 11.0+ ARM64
  • winnerz-1.1.3-cp311-cp311-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.11, macOS 10.9+ x86-64
  • winnerz-1.1.3-cp310-cp310-win_amd64.whl (1.2 MB): CPython 3.10, Windows x86-64
  • winnerz-1.1.3-cp310-cp310-manylinux_2_28_x86_64.whl (5.3 MB): CPython 3.10, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.3-cp310-cp310-macosx_11_0_arm64.whl (1.3 MB): CPython 3.10, macOS 11.0+ ARM64
  • winnerz-1.1.3-cp310-cp310-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.10, macOS 10.9+ x86-64
  • winnerz-1.1.3-cp39-cp39-win_amd64.whl (1.2 MB): CPython 3.9, Windows x86-64
  • winnerz-1.1.3-cp39-cp39-manylinux_2_28_x86_64.whl (4.4 MB): CPython 3.9, manylinux (glibc 2.28+) x86-64
  • winnerz-1.1.3-cp39-cp39-macosx_11_0_arm64.whl (1.3 MB): CPython 3.9, macOS 11.0+ ARM64
  • winnerz-1.1.3-cp39-cp39-macosx_10_9_x86_64.whl (1.2 MB): CPython 3.9, macOS 10.9+ x86-64
