Rust PDF parser with Python bindings
pdf-rs
pdf-rs is a Rust PDF parser with Python bindings. It focuses on fast structural
inspection, text extraction, Markdown output, image/font metadata, incremental
metadata and annotation updates, and LLM-friendly page chunks with OCR handoff
requests.
Features
- Parse classic xref tables, xref streams, and compressed object streams.
- Decode common stream filters including Flate, ASCIIHex, ASCII85, RunLength, and PNG predictors.
- Extract document metadata, page text, outlines, names, page labels, links, annotations, forms, embedded files, fonts, and image XObjects.
- Export Markdown and page chunks for RAG/LLM ingestion.
- Produce OCR request placeholders for scanned pages or image regions so an external OCR engine can be plugged in without coupling the parser to one OCR backend.
- Provide Python bindings through maturin/PyO3.
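To make the filter names in the list above concrete, here is a minimal Python sketch of four of them using only the standard library. It illustrates what each filter does according to the PDF specification and is not pdf-rs's Rust implementation. The helper names (`run_length_decode`, `ascii_hex_decode`, `ascii85_decode`, `decode_stream`) are hypothetical, and PNG predictors are left out for brevity.

```python
import base64
import binascii
import zlib


def run_length_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data (hypothetical helper, not a pdf-rs API)."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # end-of-data marker
            break
        if length < 128:  # copy the next length + 1 bytes literally
            out += data[i:i + length + 1]
            i += length + 1
        else:  # repeat the next byte 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digits, optional whitespace, '>' terminator."""
    digits = b"".join(data.split(b">", 1)[0].split())
    if len(digits) % 2:  # an odd trailing digit is treated as if followed by 0
        digits += b"0"
    return binascii.unhexlify(digits)


def ascii85_decode(data: bytes) -> bytes:
    """Decode ASCII85Decode data, tolerating the optional '<~' and the '~>' marker."""
    body = data.strip()
    if body.startswith(b"<~"):
        body = body[2:]
    if body.endswith(b"~>"):
        body = body[:-2]
    return base64.a85decode(body)


# Filter names as they appear in a stream's /Filter entry.
DECODERS = {
    "FlateDecode": zlib.decompress,
    "ASCIIHexDecode": ascii_hex_decode,
    "ASCII85Decode": ascii85_decode,
    "RunLengthDecode": run_length_decode,
}


def decode_stream(data: bytes, filters: list[str]) -> bytes:
    """Apply a chain of filters in the order they are listed."""
    for name in filters:
        data = DECODERS[name](data)
    return data
```

Filters are applied in listed order, so a stream declared with `ASCIIHexDecode` followed by `FlateDecode` is hex-decoded first and then inflated.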
Rust Usage
use pdf_rs::{Document, LlmParseOptions};
let document = Document::parse_file("paper.pdf")?;
println!("{}", document.text()?);
let llm = document.to_llm_document_with_options(LlmParseOptions::default())?;
for page in llm.pages {
    println!("page {}: {}", page.page_number, page.text);
}
# Ok::<(), pdf_rs::PdfError>(())
Python Usage
import pdf_rs
doc = pdf_rs.Document.open_mmap("paper.pdf")
print(doc.to_markdown())
for chunk in doc.llm_chunks():
    print(chunk["page_number"], chunk["text"])
for request in doc.ocr_requests(ocr="auto"):
    print(request)
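The loop over `doc.llm_chunks()` above reads `page_number` and `text` from each chunk. For RAG ingestion, a common follow-up step is to merge short pages into batches under a size budget. The sketch below assumes only those two keys. `pack_chunks` and `max_chars` are hypothetical names, not part of pdf-rs, and the sample list stands in for real parser output.

```python
def pack_chunks(chunks, max_chars=2000):
    """Group page chunks into batches of at most max_chars characters.

    Hypothetical helper: it relies only on the "page_number" and "text" keys.
    A single page longer than max_chars is kept whole as its own batch.
    """
    batches = []
    current = {"pages": [], "text": ""}
    for chunk in chunks:
        text = chunk["text"].strip()
        if not text:
            continue  # skip pages with no extractable text
        candidate = f"{current['text']}\n\n{text}" if current["text"] else text
        if current["pages"] and len(candidate) > max_chars:
            # The batch would overflow: close it and start a new one.
            batches.append(current)
            current = {"pages": [], "text": ""}
            candidate = text
        current["pages"].append(chunk["page_number"])
        current["text"] = candidate
    if current["pages"]:
        batches.append(current)
    return batches


# Stand-in data shaped like the chunks printed above.
sample = [
    {"page_number": 1, "text": "Introduction."},
    {"page_number": 2, "text": ""},
    {"page_number": 3, "text": "Methods and results."},
]
for batch in pack_chunks(sample, max_chars=200):
    print(batch["pages"], len(batch["text"]))
```

Pages with empty text are skipped here on the assumption that they would be handled separately, for example through the OCR requests.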
CLI
cargo run --bin pdf-rs-cli -- paper.pdf --mmap --summary
cargo run --bin pdf-rs-cli -- paper.pdf --text
cargo run --bin pdf-rs-cli -- paper.pdf --ocr-requests
Local Checks
cargo fmt --check
cargo test
cargo clippy --all-targets -- -D warnings
cargo check --features mimalloc
cargo clippy --features python --all-targets -- -D warnings
uvx maturin build --features python --out target\wheels
To run the Python smoke test after building a wheel:
uv venv target\py-smoke --seed
uv pip install --python target\py-smoke\Scripts\python.exe --force-reinstall target\wheels\pdf_rs-*.whl
target\py-smoke\Scripts\python.exe tests\python_api.py
Release
GitHub Actions builds wheels on Linux, Windows, and macOS when a v*.*.* tag is
pushed. The release workflow uploads distributions to PyPI using the
PYPI_API_TOKEN repository secret and creates a GitHub release from the same
artifacts.
git tag v0.1.0
git push origin v0.1.0
File details
Details for the file pdf_rs-0.1.0.tar.gz.
File metadata
- Download URL: pdf_rs-0.1.0.tar.gz
- Upload date:
- Size: 669.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7929b7925032ca0c23fb3482c6246e8b7eb3cea8490a6982af48cfe57398fc0e |
| MD5 | 1b634c95b7e5579f0d0583958c6b6d15 |
| BLAKE2b-256 | e75efc156d1194587641406f2dbdccd2323fd10431437fd9d0c4e39c283ce399 |
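The digests above can be checked against a downloaded file before installing it. This is a minimal sketch using Python's standard `hashlib`. The helper names `file_digest` and `file_digest_matches` are assumptions for illustration, and BLAKE2b-256 is taken to mean BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in blocks to bound memory use."""
    if algorithm == "blake2b-256":
        # Assumption: BLAKE2b-256 is BLAKE2b truncated to a 32-byte digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)  # e.g. "sha256" or "md5"
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


def file_digest_matches(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published hex digest, ignoring case and spaces."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For an intact download of the source distribution, `file_digest_matches("pdf_rs-0.1.0.tar.gz", "7929b7925032ca0c23fb3482c6246e8b7eb3cea8490a6982af48cfe57398fc0e")` should return `True`. The local file path is whatever location the archive was saved to.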
File details
Details for the file pdf_rs-0.1.0-cp39-abi3-win_amd64.whl.
File metadata
- Download URL: pdf_rs-0.1.0-cp39-abi3-win_amd64.whl
- Upload date:
- Size: 447.3 kB
- Tags: CPython 3.9+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 81881545d4d9fd85a2ff008731a23de302748eb411798a5071cf0306d7d0c589 |
| MD5 | 9ced4ad3cc09ab4df43fc006adf4f7ce |
| BLAKE2b-256 | 6199769c14d7acd43ca1cf94412c421a30703515e49340cb465c1159167b479e |
File details
Details for the file pdf_rs-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: pdf_rs-0.1.0-cp39-abi3-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 603.7 kB
- Tags: CPython 3.9+, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | de288ffabd436a6799812cfad671cad7340862cbcef1e85e09bc100a505a59fb |
| MD5 | 8663968df051a112ad661f97c0912ba8 |
| BLAKE2b-256 | 412d60f9b0bfe76932b1463810ea4783f3b32409a3b948a2258c0e5a2edd1ff9 |
File details
Details for the file pdf_rs-0.1.0-cp39-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: pdf_rs-0.1.0-cp39-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 535.9 kB
- Tags: CPython 3.9+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 502e129282af0f00adbd494b895db71237377865f033097bf72953c8dcb012bf |
| MD5 | 7788cfe409fd95015d8a58b3ee54409b |
| BLAKE2b-256 | ce64bbaeda0e945a30cbe0be6e7f4a71046248b990357593d4a45597c8be9951 |