High-performance PDF text extraction powered by Zig

zpdf

High-performance PDF text extraction powered by Zig. In the benchmark below, zpdf is 1.8x to 6.3x faster than MuPDF on large documents.

Install

pip install zpdf

Usage

from zpdf import Document

with Document("paper.pdf") as doc:
    print(doc.page_count)

    # Extract all text (reading order)
    text = doc.extract_all()

    # Extract single page
    page_text = doc.extract_page(0)

    # Extract as markdown
    md = doc.extract_all_markdown()

    # Get text with bounding boxes
    spans = doc.extract_bounds(0)
    for span in spans:
        print(f"{span.text} at ({span.x0}, {span.y0})")
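
The calls above can be combined into small helpers. The following is a minimal sketch, not part of zpdf itself: `iter_page_texts`, `find_pages_containing`, and `main` are hypothetical names, and the code assumes that page indices are zero-based and contiguous (the example above uses `extract_page(0)`) and that `extract_page` returns a `str`.

```python
from typing import Iterator, List, Tuple


def iter_page_texts(doc) -> Iterator[Tuple[int, str]]:
    """Yield (page_index, text) for every page of an open document.

    `doc` is any object exposing `page_count` and `extract_page(index)`,
    as Document does in the usage example. Zero-based, contiguous page
    indices are an assumption, not something the README states.
    """
    for index in range(doc.page_count):
        yield index, doc.extract_page(index)


def find_pages_containing(doc, needle: str) -> List[int]:
    """Return indices of pages whose text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [i for i, text in iter_page_texts(doc) if needle in text.lower()]


def main(path: str, needle: str) -> None:
    # Imported lazily so the helpers above work without zpdf installed.
    from zpdf import Document

    with Document(path) as doc:
        print(find_pages_containing(doc, needle))
```

Calling `main("paper.pdf", "zig")` would print the indices of the pages that mention the search term. Extracting page by page keeps only one page of text in memory at a time, which may matter for documents with thousands of pages.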

From bytes

with open("doc.pdf", "rb") as f:
    data = f.read()

with Document(data) as doc:
    text = doc.extract_all()
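
When the bytes come from an untrusted source (an upload or a download), a cheap sanity check before parsing can give a clearer error. The sketch below is an illustration only: `looks_like_pdf` and `extract_text_from_bytes` are hypothetical helpers, and how zpdf itself reacts to malformed input is not documented here.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if `data` starts like a PDF file.

    PDF files begin with a header such as `%PDF-1.7`. Some writers put a
    few stray bytes in front of it, so this scans the first 1024 bytes
    instead of requiring the marker at offset 0.
    """
    return b"%PDF-" in data[:1024]


def extract_text_from_bytes(data: bytes) -> str:
    """Extract all text from in-memory PDF data, rejecting obvious non-PDFs."""
    if not looks_like_pdf(data):
        raise ValueError("input does not look like a PDF (no %PDF- header)")

    # Imported lazily so the header check works without zpdf installed.
    from zpdf import Document

    with Document(data) as doc:
        return doc.extract_all()
```

The header check only rules out obviously wrong input such as HTML error pages; it does not prove that the rest of the file is a valid PDF.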

Benchmark

Text extraction on Apple M4 Pro:

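| Document      | Pages | zpdf (ms) | MuPDF (ms) | Speedup |
|---------------|-------|-----------|------------|---------|
| Intel SDM     | 5,252 | 582       | 2,152      | 3.7x    |
| Pandas Docs   | 3,743 | 640       | 1,130      | 1.8x    |
| C++ Standard  | 2,134 | 438       | 1,007      | 2.3x    |
| PDF Reference | 1,310 | 236       | 1,481      | 6.3x    |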
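
The speedup column is the MuPDF time divided by the zpdf time. As a worked check, the short script below recomputes it from the published timings and also derives zpdf's throughput in pages per second. The function names are illustrative and the numbers are copied from the benchmark; nothing here measures anything new.

```python
# (document, pages, zpdf_ms, mupdf_ms), copied from the benchmark above.
RESULTS = [
    ("Intel SDM", 5252, 582, 2152),
    ("Pandas Docs", 3743, 640, 1130),
    ("C++ Standard", 2134, 438, 1007),
    ("PDF Reference", 1310, 236, 1481),
]


def speedup(zpdf_ms: float, mupdf_ms: float) -> float:
    """How many times faster zpdf is than MuPDF for one document."""
    return mupdf_ms / zpdf_ms


def pages_per_second(pages: int, elapsed_ms: float) -> float:
    """Throughput implied by a page count and a wall-clock time in ms."""
    return pages / (elapsed_ms / 1000)


for name, pages, zpdf_ms, mupdf_ms in RESULTS:
    print(
        f"{name:<14} {speedup(zpdf_ms, mupdf_ms):.1f}x  "
        f"{pages_per_second(pages, zpdf_ms):,.0f} pages/s"
    )
```

Rounded to one decimal place, the computed ratios match the published speedups of 3.7x, 1.8x, 2.3x, and 6.3x.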

License

CC0-1.0

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

zpdf-0.1.0-py3-none-macosx_11_0_arm64.whl (168.2 kB)

Uploaded for: Python 3, macOS 11.0+ ARM64

File details

Details for the file zpdf-0.1.0-py3-none-macosx_11_0_arm64.whl.

File metadata

  • Download URL: zpdf-0.1.0-py3-none-macosx_11_0_arm64.whl
  • Upload date:
  • Size: 168.2 kB
  • Tags: Python 3, macOS 11.0+ ARM64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for zpdf-0.1.0-py3-none-macosx_11_0_arm64.whl
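| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 4845caa1a0f51a05ec8b9fd12dffe5af3ad10982d90b220226fc58bbbe72fad1 |
| MD5         | c7738bc4087dda61b0381be360f2ce6f                                 |
| BLAKE2b-256 | 919c6710b30d6969f74d46d2ff57bdb33f7cf10080a9acac4bffed71bef02ecd |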
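
A downloaded wheel can be checked against the published SHA256 digest with Python's standard `hashlib` module. This is a minimal sketch: `sha256_of_file` and `verify_wheel` are hypothetical helper names, and the local file path is whatever location the wheel was saved to.

```python
import hashlib

# SHA256 digest published for zpdf-0.1.0-py3-none-macosx_11_0_arm64.whl.
EXPECTED_SHA256 = "4845caa1a0f51a05ec8b9fd12dffe5af3ad10982d90b220226fc58bbbe72fad1"


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file at `path` matches the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_wheel("zpdf-0.1.0-py3-none-macosx_11_0_arm64.whl")` should return `True` for an intact download and `False` if the file was truncated or altered.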
