
Rust PDF parser with Python bindings

Project description

pdf-rs

pdf-rs is a Rust PDF parser with Python bindings. It focuses on fast structural inspection, text extraction, Markdown output, image/font metadata, incremental metadata and annotation updates, and LLM-friendly page chunks with OCR handoff requests.

Features

  • Parse classic xref tables, xref streams, and compressed object streams.
  • Decode common stream filters including Flate, ASCIIHex, ASCII85, RunLength, and PNG predictors.
  • Extract document metadata, page text, outlines, names, page labels, links, annotations, forms, embedded files, fonts, and image XObjects.
  • Export Markdown and page chunks for RAG/LLM ingestion.
  • Produce OCR request placeholders for scanned pages or image regions so an external OCR engine can be plugged in without coupling the parser to one OCR backend.
  • Provide Python bindings through maturin / PyO3.
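To make the stream-filter item above concrete, here is a minimal Python sketch of two of the simpler filters named in the list, ASCIIHex and RunLength, written from the rules in the PDF specification. The function names `ascii_hex_decode` and `run_length_decode` are illustrative assumptions, not part of the pdf-rs API, and this is not how the library itself implements them.

```python
# Illustrative decoders for two PDF stream filters (not pdf-rs code).

PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digit pairs, whitespace ignored, '>' ends the data."""
    digits = bytearray()
    for byte in data:
        if byte == ord(">"):  # end-of-data marker
            break
        if byte in PDF_WHITESPACE:
            continue
        digits.append(byte)
    if len(digits) % 2:  # an odd final digit is treated as if followed by 0
        digits.append(ord("0"))
    return bytes.fromhex(digits.decode("ascii"))


def run_length_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data: a length byte followed by a literal or repeated run."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # end-of-data marker
            break
        if length < 128:  # copy the next length + 1 bytes literally
            out += data[i:i + length + 1]
            i += length + 1
        else:  # repeat the next single byte 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)
```

A real parser would also validate truncated input and chain filters in the order given by the stream's filter array; the sketch omits both.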

Rust Usage

use pdf_rs::{Document, LlmParseOptions};

fn main() -> Result<(), pdf_rs::PdfError> {
    // Parse the file and print the text of the whole document.
    let document = Document::parse_file("paper.pdf")?;
    println!("{}", document.text()?);

    // Build the LLM-oriented view and print the text of each page.
    let llm = document.to_llm_document_with_options(LlmParseOptions::default())?;
    for page in llm.pages {
        println!("page {}: {}", page.page_number, page.text);
    }
    Ok(())
}

Python Usage

import pdf_rs

doc = pdf_rs.Document.open_mmap("paper.pdf")
print(doc.to_markdown())

for chunk in doc.llm_chunks():
    print(chunk["page_number"], chunk["text"])

for request in doc.ocr_requests(ocr="auto"):
    print(request)
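If several chunks can belong to the same page, it may be handy to regroup them before sending text to a model. The sketch below relies only on the `page_number` and `text` keys shown above; the helper name `group_text_by_page` is hypothetical, and whether `llm_chunks()` ever yields more than one chunk per page is an assumption.

```python
from collections import defaultdict


def group_text_by_page(chunks):
    """Join chunk texts per page, returning a dict ordered by page number."""
    pages = defaultdict(list)
    for chunk in chunks:
        pages[chunk["page_number"]].append(chunk["text"])
    return {page: "\n".join(parts) for page, parts in sorted(pages.items())}


# Stand-in data with the same keys as the chunks printed above.
sample = [
    {"page_number": 2, "text": "Results"},
    {"page_number": 1, "text": "Abstract"},
    {"page_number": 2, "text": "Discussion"},
]
print(group_text_by_page(sample))
```

With a real document, `group_text_by_page(doc.llm_chunks())` would be the equivalent call.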

CLI

cargo run --bin pdf-rs-cli -- paper.pdf --mmap --summary
cargo run --bin pdf-rs-cli -- paper.pdf --text
cargo run --bin pdf-rs-cli -- paper.pdf --ocr-requests

Local Checks

cargo fmt --check
cargo test
cargo clippy --all-targets -- -D warnings
cargo check --features mimalloc
cargo clippy --features python --all-targets -- -D warnings
uvx maturin build --features python --out target\wheels

To run the Python smoke test after building a wheel (the paths here, like the maturin output path above, use Windows separators):

uv venv target\py-smoke --seed
uv pip install --python target\py-smoke\Scripts\python.exe --force-reinstall target\wheels\pdf_rs-*.whl
target\py-smoke\Scripts\python.exe tests\python_api.py

Release

GitHub Actions builds wheels on Linux, Windows, and macOS when a v*.*.* tag is pushed. The release workflow uploads distributions to PyPI using the PYPI_API_TOKEN repository secret and creates a GitHub release from the same artifacts.

git tag v0.1.0
git push origin v0.1.0
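Before pushing, a tag name can be checked locally against the `v*.*.*` pattern. The sketch below uses Python's `fnmatch` as a rough approximation; GitHub Actions applies its own glob rules, so treat this as a sanity check rather than an exact reproduction of the workflow filter. The helper name `tag_triggers_release` is hypothetical.

```python
from fnmatch import fnmatchcase


def tag_triggers_release(tag: str, pattern: str = "v*.*.*") -> bool:
    """Return True if the tag matches the release glob (approximated with fnmatch)."""
    return fnmatchcase(tag, pattern)


for candidate in ("v0.1.0", "v0.1", "0.1.0"):
    print(candidate, tag_triggers_release(candidate))
```

The glob only requires a leading `v` and two dots, so a tag such as `v1.2.3.4` also matches; it does not enforce numeric components.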



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdf_rs-0.1.0.tar.gz (669.4 kB)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pdf_rs-0.1.0-cp39-abi3-win_amd64.whl (447.3 kB)

Uploaded CPython 3.9+, Windows x86-64

pdf_rs-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl (603.7 kB)

Uploaded CPython 3.9+, manylinux: glibc 2.34+ x86-64

pdf_rs-0.1.0-cp39-abi3-macosx_11_0_arm64.whl (535.9 kB)

Uploaded CPython 3.9+, macOS 11.0+ ARM64
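The wheel names above follow the standard layout `{distribution}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl`. As a rough illustration, the following sketch splits a file name into those fields; `WheelName` and `parse_wheel_filename` are names made up for this example, and installers use more thorough validation than this.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:  # no optional build tag
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:  # optional build tag present
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)


print(parse_wheel_filename("pdf_rs-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl"))
```

Here `cp39` together with `abi3` indicates a stable-ABI build targeting CPython 3.9 and later, which is why one wheel per platform is listed.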

File details

Details for the file pdf_rs-0.1.0.tar.gz.

File metadata

  • Download URL: pdf_rs-0.1.0.tar.gz
  • Upload date:
  • Size: 669.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf_rs-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7929b7925032ca0c23fb3482c6246e8b7eb3cea8490a6982af48cfe57398fc0e
MD5 1b634c95b7e5579f0d0583958c6b6d15
BLAKE2b-256 e75efc156d1194587641406f2dbdccd2323fd10431437fd9d0c4e39c283ce399

See more details on using hashes here.
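A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib`. This is a minimal sketch; the helper names `sha256_of_file` and `matches_published_hash` are illustrative, and the file path is whatever location the archive was saved to.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with a published hex digest, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be `matches_published_hash("pdf_rs-0.1.0.tar.gz", "7929b7925032ca0c23fb3482c6246e8b7eb3cea8490a6982af48cfe57398fc0e")`, using the SHA256 value listed above.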

File details

Details for the file pdf_rs-0.1.0-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: pdf_rs-0.1.0-cp39-abi3-win_amd64.whl
  • Upload date:
  • Size: 447.3 kB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pdf_rs-0.1.0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 81881545d4d9fd85a2ff008731a23de302748eb411798a5071cf0306d7d0c589
MD5 9ced4ad3cc09ab4df43fc006adf4f7ce
BLAKE2b-256 6199769c14d7acd43ca1cf94412c421a30703515e49340cb465c1159167b479e


File details

Details for the file pdf_rs-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for pdf_rs-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 de288ffabd436a6799812cfad671cad7340862cbcef1e85e09bc100a505a59fb
MD5 8663968df051a112ad661f97c0912ba8
BLAKE2b-256 412d60f9b0bfe76932b1463810ea4783f3b32409a3b948a2258c0e5a2edd1ff9


File details

Details for the file pdf_rs-0.1.0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for pdf_rs-0.1.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 502e129282af0f00adbd494b895db71237377865f033097bf72953c8dcb012bf
MD5 7788cfe409fd95015d8a58b3ee54409b
BLAKE2b-256 ce64bbaeda0e945a30cbe0be6e7f4a71046248b990357593d4a45597c8be9951

