Native fast PDF text/table/image extraction (PyO3 binding over the turbo-parsepdf Rust core).

Project description

turbo-parsepdf

Fast native PDF text / table / image extraction for Python — a pure-Rust core (PyO3, stable-ABI wheels). Imports as turbo_parsepdf. Output as a dict, or HTML / Markdown / JSON strings. 38× faster than pypdf, 62× faster than PyMuPDF, 307× faster than pdfminer, with text byte-identical to PyMuPDF.

pip install turbo-parsepdf

import turbo_parsepdf

# Read the whole PDF into memory; parse() takes the raw bytes.
with open("doc.pdf", "rb") as f:
    data = f.read()

doc = turbo_parsepdf.parse(data)
# {"version": "1.7", "pages": [{"width": ..., "height": ..., "needs_ocr": False,
#   "lines": [{"text": ..., "x": ..., "y": ...}],
#   "tables": [{"rows": ..., "cols": ..., "cells": [[...]]}],
#   "images": [{"name": ..., "format": "Jpeg", "width": ..., ...}]}]}

turbo_parsepdf.parse_to_markdown(data)  # str
turbo_parsepdf.parse_to_html(data)      # str
turbo_parsepdf.parse_to_json(data)      # str

# Encrypted PDFs: pass the user or owner password.
turbo_parsepdf.parse(open("locked.pdf", "rb").read(), password="secret")
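
The dict shape shown in the comment above can be consumed with plain Python. What follows is a minimal sketch, assuming that each entry in "cells" is one row given as a list of values (the comment only shows `[[...]]`) and that "lines" arrive in a usable reading order. `page_text` and `table_to_csv` are hypothetical helper names, not part of turbo_parsepdf, and the sample dict is hand-written rather than real parser output.

```python
import csv
import io


def page_text(page):
    """Join the text of every line on one page, in the order given."""
    return "\n".join(line["text"] for line in page["lines"])


def table_to_csv(table):
    """Serialize one table as CSV (assumes "cells" is a list of rows)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table["cells"]:
        # Normalize missing or non-string cells to text.
        writer.writerow("" if cell is None else str(cell) for cell in row)
    return buf.getvalue()


# Hand-written sample mirroring the documented shape (not real output).
sample_doc = {
    "version": "1.7",
    "pages": [{
        "width": 612.0, "height": 792.0, "needs_ocr": False,
        "lines": [{"text": "Invoice 42", "x": 72.0, "y": 720.0},
                  {"text": "Total due", "x": 72.0, "y": 700.0}],
        "tables": [{"rows": 2, "cols": 2,
                    "cells": [["Item", "Price"], ["Widget", "9.50"]]}],
        "images": [],
    }],
}

first_page = sample_doc["pages"][0]
print(page_text(first_page))
print(table_to_csv(first_page["tables"][0]))
```

With the package installed, `sample_doc` would be replaced by the result of `turbo_parsepdf.parse(data)`.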

A fatal parse fault raises ValueError with a stable code (InvalidHeader, BadStream, …). Scanned/image-only pages come back with needs_ocr=True (OCR is out of scope).
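
The error and OCR behaviour described above could be handled along these lines. This is a sketch under stated assumptions: `parse_fn` stands in for `turbo_parsepdf.parse` so the snippet runs without the package, `fake_parse` is a hypothetical stub, and the text does not say how the stable code is exposed on the exception, so the sketch simply assumes it appears in `str(exc)`.

```python
def parse_or_report(parse_fn, data, **kwargs):
    """Call parse_fn(data, **kwargs); return (doc, None) or (None, error_text)."""
    try:
        return parse_fn(data, **kwargs), None
    except ValueError as exc:
        # Assumption: the stable code (e.g. InvalidHeader) is part of str(exc).
        return None, str(exc)


def pages_needing_ocr(doc):
    """Return zero-based indexes of pages flagged needs_ocr=True."""
    return [i for i, page in enumerate(doc["pages"]) if page.get("needs_ocr")]


# Hypothetical stand-in for turbo_parsepdf.parse, only to keep this runnable.
def fake_parse(data, password=None):
    if not data.startswith(b"%PDF-"):
        raise ValueError("InvalidHeader")
    return {"version": "1.7",
            "pages": [{"needs_ocr": False, "lines": []},
                      {"needs_ocr": True, "lines": []}]}


doc, err = parse_or_report(fake_parse, b"%PDF-1.7 ...")
print(pages_needing_ocr(doc))

bad, err2 = parse_or_report(fake_parse, b"not a pdf")
print(err2)
```

In real use, `turbo_parsepdf.parse` would be passed as `parse_fn`, and the pages returned by `pages_needing_ocr` would be routed to a separate OCR tool, since OCR is out of scope for this package.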

Supports cross-reference streams + object streams (PDF 1.5+), all standard stream filters + predictors, /ToUnicode & encoding/AGL & CID font decoding, ruled tables, image XObject extraction, and standard-handler decryption (RC4 + AES-128/256, R2–R6).

Part of the turbo-parsepdf workspace. Licensed under MIT.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turbo_parsepdf-0.1.0.tar.gz (92.2 kB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

turbo_parsepdf-0.1.0-cp38-abi3-win_amd64.whl (310.4 kB)

Uploaded: CPython 3.8+, Windows x86-64

turbo_parsepdf-0.1.0-cp38-abi3-manylinux_2_34_x86_64.whl (435.8 kB)

Uploaded: CPython 3.8+, manylinux (glibc 2.34+), x86-64

turbo_parsepdf-0.1.0-cp38-abi3-macosx_11_0_arm64.whl (381.1 kB)

Uploaded: CPython 3.8+, macOS 11.0+, ARM64

File details

Details for the file turbo_parsepdf-0.1.0.tar.gz.

File metadata

  • Download URL: turbo_parsepdf-0.1.0.tar.gz
  • Upload date:
  • Size: 92.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for turbo_parsepdf-0.1.0.tar.gz
Algorithm Hash digest
SHA256 730eef3d1a0f1b0e30e0c589571e1b6a8e155f7eef89a62b5d7a08032c0ef3a2
MD5 ba942681303c55ceddc842b3979d4af0
BLAKE2b-256 923054d639ed9ae66cd96cce53a9b075e2bea57c7237e9902f61fcbdc803eb84

See more details on using hashes here.
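
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. The following is a minimal sketch; `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the local path passed to them is whatever location the file was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 listed above for turbo_parsepdf-0.1.0.tar.gz.
SDIST_SHA256 = "730eef3d1a0f1b0e30e0c589571e1b6a8e155f7eef89a62b5d7a08032c0ef3a2"
```

For example, `matches_published_hash("turbo_parsepdf-0.1.0.tar.gz", SDIST_SHA256)` returns True only if the local copy is byte-identical to the file the digest was computed from.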

File details

Details for the file turbo_parsepdf-0.1.0-cp38-abi3-win_amd64.whl.

File hashes

Hashes for turbo_parsepdf-0.1.0-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 59a22c3d8fa093f304b9345d6b9f5c9c2d8522d3dcd72d003995f1981a64cca2
MD5 cea5923005ccd2d2cab2c0c2223430d5
BLAKE2b-256 d98b33d9e6c419afa70f76496497ead3651f4c7846088d9b467bb7a49108c434

File details

Details for the file turbo_parsepdf-0.1.0-cp38-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for turbo_parsepdf-0.1.0-cp38-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 6a85689a33b304ad5ca9a1a4ce8e7b1d1f9e53c5ed80b6f1d201936784afbc48
MD5 79870b89d407143f048cc3e10283b318
BLAKE2b-256 e860a97235dcb1137c8ba6370fdfae46afe94d560311d65be5588b162674845a

File details

Details for the file turbo_parsepdf-0.1.0-cp38-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for turbo_parsepdf-0.1.0-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6273bf82b92ac18db8457f4b3d23630a55d919e20dcfd341295633d471d91a92
MD5 877a9ddf94117bea1f4fb305a9407c3b
BLAKE2b-256 16494d820e44d46bf5ec6a4eb0096e0df80c28392211236aa31e7e3ee1f714fe
