Native fast PDF text/table/image extraction (PyO3 binding over the turbo-parsepdf Rust core).

Project description

turbo-parsepdf

Fast native PDF text, table, and image extraction for Python, built on a pure-Rust core (PyO3, stable-ABI wheels). Imports as turbo_parsepdf. Returns output as a dict, or as HTML, Markdown, or JSON strings.

pip install turbo-parsepdf

Benchmark vs the Python PDF stack

Wall-clock time to extract the text of every page, best of N runs (Apple M-series, release build). To reproduce, first run python3 benches/gen-corpus.py, then python benches/competitive-py/bench.py.

| document | turbo-parsepdf | pypdf | PyMuPDF (MuPDF, C) | pdfminer.six |
|---|---|---|---|---|
| 100 pages | 6.2 ms | 237 ms · 38× | 389 ms · 62× | 1920 ms · 307× |
| 20 pages | 1.1 ms | 80 ms | 103 ms | 419 ms |
| 2 pages | 0.06 ms | 2.6 ms | 4.0 ms | 18 ms |

Even including the Python FFI and dict-marshaling overhead, turbo-parsepdf is 38–307× faster than the other libraries on the 100-page document, and its extracted text is byte-identical to PyMuPDF's (100% word recall).
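For readers who want to spot-check the numbers on their own files, a best-of-N measurement can be sketched as below. This is not the project's benches/competitive-py/bench.py; the names best_of, speedup, and repeats are made up for this illustration.

```python
import time


def best_of(fn, *args, repeats=5):
    """Return the fastest wall-clock time, in milliseconds, over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms
```

With the package installed, best_of(turbo_parsepdf.parse, data) would time the dict-returning entry point. Taking the minimum rather than the mean reduces the influence of scheduler noise and cold caches on short runs.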

import turbo_parsepdf

# Read the whole file as bytes; the context manager closes the handle.
with open("doc.pdf", "rb") as f:
    data = f.read()

doc = turbo_parsepdf.parse(data)
# {"version": "1.7", "pages": [{"width": ..., "height": ..., "needs_ocr": False,
#   "lines": [{"text": ..., "x": ..., "y": ...}],
#   "tables": [{"rows": ..., "cols": ..., "cells": [[...]]}],
#   "images": [{"name": ..., "format": "Jpeg", "width": ..., ...}]}]}

turbo_parsepdf.parse_to_markdown(data)  # str
turbo_parsepdf.parse_to_html(data)      # str
turbo_parsepdf.parse_to_json(data)      # str

# Encrypted PDFs: pass the user or owner password.
turbo_parsepdf.parse(open("locked.pdf", "rb").read(), password="secret")
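The dict returned by parse can be walked with ordinary Python. The sketch below joins each page's lines into one string; it relies only on the keys shown in the comment above ("pages", "lines", "text"). The helper name page_texts and the sample dict are hand-written for illustration (not real library output), and the sketch assumes lines arrive in reading order, which the description does not state.

```python
def page_texts(doc):
    """Join each page's line texts into one newline-separated string per page."""
    return [
        "\n".join(line["text"] for line in page.get("lines", []))
        for page in doc.get("pages", [])
    ]


# Hand-written stand-in that mirrors the documented shape.
sample = {
    "version": "1.7",
    "pages": [
        {
            "width": 612.0,
            "height": 792.0,
            "needs_ocr": False,
            "lines": [
                {"text": "Hello", "x": 72.0, "y": 720.0},
                {"text": "World", "x": 72.0, "y": 700.0},
            ],
            "tables": [],
            "images": [],
        }
    ],
}

print(page_texts(sample))  # ['Hello\nWorld']
```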

A fatal parse fault raises ValueError with a stable code (InvalidHeader, BadStream, …). Scanned/image-only pages come back with needs_ocr=True (OCR is out of scope).
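A defensive wrapper around these two behaviours might look like the following sketch. The parse function is passed in as an argument so the helpers stay independent of the package. How the stable code appears on the exception (in the message or as an attribute) is not specified here, so the sketch only keeps str(exc); try_parse and pages_needing_ocr are hypothetical names.

```python
def try_parse(parse, data, **kwargs):
    """Call parse(data, **kwargs); return (doc, None) or (None, error_message)."""
    try:
        return parse(data, **kwargs), None
    except ValueError as exc:
        # Assumption: the stable code (e.g. InvalidHeader) is visible in str(exc).
        return None, str(exc)


def pages_needing_ocr(doc):
    """Return zero-based indexes of pages flagged needs_ocr=True."""
    return [
        index
        for index, page in enumerate(doc.get("pages", []))
        if page.get("needs_ocr")
    ]
```

Typical use would be doc, err = try_parse(turbo_parsepdf.parse, data), followed by routing the pages listed by pages_needing_ocr(doc) to a separate OCR tool.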

Supports cross-reference streams + object streams (PDF 1.5+), all standard stream filters + predictors, /ToUnicode & encoding/AGL & CID font decoding, ruled tables, image XObject extraction, and standard-handler decryption (RC4 + AES-128/256, R2–R6).

Part of the turbo-parsepdf workspace. Licensed under MIT.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turbo_parsepdf-0.1.1.tar.gz (93.3 kB)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

turbo_parsepdf-0.1.1-cp38-abi3-win_amd64.whl (313.4 kB)

Uploaded CPython 3.8+, Windows x86-64

turbo_parsepdf-0.1.1-cp38-abi3-manylinux_2_34_x86_64.whl (438.8 kB)

Uploaded CPython 3.8+, manylinux (glibc 2.34+), x86-64

turbo_parsepdf-0.1.1-cp38-abi3-macosx_11_0_arm64.whl (382.1 kB)

Uploaded CPython 3.8+, macOS 11.0+ ARM64
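The wheel names above follow the standard dash-separated layout, so the tags can be read off mechanically. The sketch below assumes the five-field form without the optional build tag; parse_wheel_name is a hypothetical helper, not part of pip or of this package.

```python
def parse_wheel_name(filename):
    """Split a wheel file name into its five dash-separated fields.

    Assumes the form without a build tag:
    {distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl
    """
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError("not a wheel file name: " + filename)
    parts = filename[: -len(suffix)].split("-")
    if len(parts) != 5:
        raise ValueError("expected 5 dash-separated fields, got %d" % len(parts))
    keys = ("distribution", "version", "python", "abi", "platform")
    return dict(zip(keys, parts))


tags = parse_wheel_name("turbo_parsepdf-0.1.1-cp38-abi3-win_amd64.whl")
print(tags["python"], tags["abi"], tags["platform"])  # cp38 abi3 win_amd64
```

The cp38 and abi3 tags together are what allow one wheel per platform to serve CPython 3.8 and later.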

File details

Details for the file turbo_parsepdf-0.1.1.tar.gz.

File metadata

  • Download URL: turbo_parsepdf-0.1.1.tar.gz
  • Upload date:
  • Size: 93.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for turbo_parsepdf-0.1.1.tar.gz
Algorithm Hash digest
SHA256 d6e5b62e74566a25b4c7a52c46441ddc3f0f4d1da85d2a6889d84b273581f4d2
MD5 98c5fd739727c031dc65ac93081b974c
BLAKE2b-256 3f9ea1220684b86cfd83bab0363478df27c536da138807ae8380f79cb3c6992e
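A downloaded file can be checked against the SHA256 digest listed above with the standard library alone. This is a minimal sketch; sha256_of and matches are illustrative names, and the local path is whatever the file was saved as.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 digest equals the published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be matches("turbo_parsepdf-0.1.1.tar.gz", "d6e5b62e74566a25b4c7a52c46441ddc3f0f4d1da85d2a6889d84b273581f4d2").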

See more details on using hashes here.

File details

Details for the file turbo_parsepdf-0.1.1-cp38-abi3-win_amd64.whl.

File hashes

Hashes for turbo_parsepdf-0.1.1-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 b3d85853e585b10d39cdaf7bd9530d37a636acd9c90a05d8b19968b12e2f289b
MD5 55754fe135e19234518c51569c6112d9
BLAKE2b-256 543b491a732eb29daf7057aa4788a5dc44cef273e5f800e3441c1824dbcee695

File details

Details for the file turbo_parsepdf-0.1.1-cp38-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for turbo_parsepdf-0.1.1-cp38-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 2b8d45e311008405dd7c198d37765953125f85efff8370a09af0d2f185249d30
MD5 a3d1d3030a9cb24e7ed9eea3cd70dd72
BLAKE2b-256 15bcd3ad100881646ef9ad40f0cc1cd1e8ca4a5daf1229e719bfa98389bb13f5

File details

Details for the file turbo_parsepdf-0.1.1-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for turbo_parsepdf-0.1.1-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ce40d4fdd976af87cdf9e0c7c4953056f09627b543d328924ef1743f8dae7fa5
MD5 09117d4c8071ff0260d29cd22905f710
BLAKE2b-256 ed1c1f73673fd3e2b2625aec1f0a84ca645011847035091c6a4f9a5d7ee7c8bd
