Pure-Rust PDF extraction, built on lopdf, that distills documents into clean HTML for LLMs and RAG pipelines
distillPDF
Turn any PDF into clean, LLM-ready HTML — structure-aware, pure-Rust, MIT-licensed.
distillpdf reads a PDF and reconstructs its structure — reading order, headings,
paragraphs, lists, tables, and figures — then emits compact, semantic HTML (or plain
text) ready to feed to an LLM or a RAG pipeline. No styling noise, no layout junk: just the
content a model needs.
It's built on lopdf and shipped to Python via PyO3 + maturin as a small, self-contained
wheel with no system dependencies and no Python runtime deps. It aims to be a lightweight,
permissively licensed alternative to extractors that are AGPL-licensed (PyMuPDF), not
structure-aware (pdfminer.six), or dependency-heavy (Unstructured).
🧪 Early release (0.0.2), testers wanted. The API is small and may still change. If you have PDFs that come out wrong, please open an issue with the file (or a description); real-world documents are exactly what this needs to get better.
Install
pip install distillpdf
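Prebuilt wheels; no compiler or system libraries required. The 0.0.2 wheels listed on PyPI target CPython 3.8+ (abi3) on Windows x86-64, Linux x86-64 (glibc 2.34+), and macOS 11.0+ ARM64.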
Quickstart
import distillpdf
doc = distillpdf.open("paper.pdf") # or distillpdf.from_bytes(data)
html = doc.to_html() # clean, semantic HTML for an LLM
text = doc.extract_text() # plain text, in reading order
toc = doc.toc() # [(level, title, page, anchor_id), ...]
abstract = doc.section("abstract") # targeted section extraction
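The `toc()` entries above are described as `(level, title, page, anchor_id)` tuples. As a minimal sketch, assuming that shape holds and that levels start at 1, an indented outline with fragment links could be built as follows. `render_outline` and the sample entries are hypothetical and not part of distillpdf.

```python
from typing import Iterable, Tuple

# Assumed shape of one toc() entry: (level, title, page, anchor_id)
TocEntry = Tuple[int, str, int, str]


def render_outline(entries: Iterable[TocEntry]) -> str:
    """Render TOC tuples as an indented plain-text outline with #anchor links."""
    lines = []
    for level, title, page, anchor_id in entries:
        indent = "  " * max(level - 1, 0)  # assumes heading levels start at 1
        lines.append(f"{indent}- {title} (p. {page}) -> #{anchor_id}")
    return "\n".join(lines)


# Hypothetical sample data; with a real document this would be doc.toc().
sample = [
    (1, "Introduction", 1, "introduction"),
    (2, "Background", 2, "background"),
    (1, "Methods", 3, "methods"),
]
print(render_outline(sample))
```

An outline like this can be placed ahead of the HTML in a prompt so a model can refer to sections by anchor.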
Need the raw pieces instead of HTML?
doc.extract_tables() # cell grids (handles multi-level / colspan headers)
doc.extract_images() # embedded images, with raw bytes
doc.extract_links() # hyperlinks with targets
doc.extract_fonts() # font inventory
doc.page_count() # number of pages
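The return shape of `extract_tables()` is not spelled out here, so the sketch below assumes each table is a plain cell grid: a list of rows, each a list of cell strings, with `None` for empty cells. Under that assumption, a grid can be serialized to CSV with the standard library alone. `grid_to_csv` and the sample grid are hypothetical.

```python
import csv
import io
from typing import Optional, Sequence


def grid_to_csv(grid: Sequence[Sequence[Optional[str]]]) -> str:
    """Serialize a cell grid (rows of cell strings) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in grid:
        # Treat missing cells as empty strings so every row stays aligned.
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buf.getvalue()


# Hypothetical grid standing in for one entry of doc.extract_tables().
sample_grid = [
    ["Model", "Accuracy", "Notes"],
    ["baseline", "0.81", "trained on 1,000 docs"],
    ["tuned", None, "pending"],
]
print(grid_to_csv(sample_grid))
```

If the real return type differs (for example, objects with header metadata), only the row iteration would need adapting.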
Why distillPDF
- Structure, not just text. Two-column reading order, multi-level table headers mapped onto a single grid (colspan), vector figures transcoded to inline SVG (including rotated axis labels), an auto-generated table of contents, and named section extraction (doc.section("methods")).
- LLM-ready output. Lean, class-free HTML: semantic markup a model can read directly, with anchor ids so toc() entries link straight into the document.
- Small & permissive. Pure Rust on lopdf, MIT-licensed, no system dependencies, no Python runtime dependencies. Drops into any pipeline without license headaches.
- Fast. Native Rust extraction with a release build tuned for speed (LTO, single codegen unit).
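Anchor ids on semantic HTML make heading-based chunking for RAG straightforward. The sketch below uses only the standard library and assumes the anchors sit on `h1` to `h6` elements as `id` attributes, which the text above does not confirm. `HeadingChunker`, `chunk_html`, and the sample HTML are hypothetical; in practice the input would be the string returned by `doc.to_html()`.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
Chunk = Tuple[Optional[str], str, str]  # (anchor_id, heading, body_text)


class HeadingChunker(HTMLParser):
    """Split HTML into one chunk per heading, keeping the heading's id."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: List[Chunk] = []
        self._anchor: Optional[str] = None
        self._heading: List[str] = []
        self._body: List[str] = []
        self._in_heading = False

    def _flush(self) -> None:
        # Collapse whitespace and emit the chunk collected so far, if any.
        heading = " ".join("".join(self._heading).split())
        body = " ".join("".join(self._body).split())
        if heading or body:
            self.chunks.append((self._anchor, heading, body))
        self._anchor, self._heading, self._body = None, [], []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._flush()  # a new heading closes the previous chunk
            self._anchor = dict(attrs).get("id")
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False
        else:
            self._body.append(" ")  # crude separator between adjacent elements

    def handle_data(self, data):
        (self._heading if self._in_heading else self._body).append(data)

    def close(self):
        super().close()
        self._flush()  # emit the final chunk


def chunk_html(html: str) -> List[Chunk]:
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample_html = (
    '<h1 id="intro">Introduction</h1><p>PDFs are hard.</p>'
    '<h2 id="methods">Methods</h2><p>We parse them.</p><p>Then we emit HTML.</p>'
)
for anchor, heading, body in chunk_html(sample_html):
    print(anchor, "|", heading, "|", body)
```

Keeping the anchor id with each chunk lets a retrieval result point back to the exact section of the source document.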
Scope
In scope: text, table, image, and font extraction, plus an HTML/markdown output layer for RAG and LLM ingestion.
Out of scope (for now): page rendering, PDF generation, OCR.
Comparison
| | distillPDF | PyMuPDF | pdfminer.six | Unstructured |
|---|---|---|---|---|
| License | MIT | AGPL / commercial | MIT | Apache (heavy deps) |
| Structure-aware HTML | ✅ | partial | ❌ | ✅ |
| System deps | none | none | none | many |
| Implementation | Rust | C | Python | Python |
Contributing & feedback
This is a young project and feedback is the fastest way to improve it. The most useful things you can do:
- Try it on your PDFs and tell me where the output is wrong — open an issue.
- Star the repo if it's useful, so others can find it.
- PRs welcome — see the development notes below.
Development
The test suite lives in tests/ (pytest) and runs on CI. It needs only
distillpdf installed. CI runs entirely on data we own — a self-contained demo PDF
(tests/demo/, end-to-end structure check) and a synthetic table corpus
(tests/corpus_tables/). The third-party PDF corpora (tests/corpus*/) are gitignored, so
their tests self-skip on a fresh clone and run only when the corpora are present locally for
deeper coverage.
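The self-skip behavior described above can be illustrated with a small pytest pattern. This is a sketch of the general technique, not the project's actual test code: `corpus_pdfs`, the test name, and the directory `tests/corpus_external` are hypothetical.

```python
from pathlib import Path
from typing import List


def corpus_pdfs(root: Path) -> List[Path]:
    """Return the PDFs under a corpus directory, or [] if it is missing."""
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.pdf"))


def test_third_party_corpus():
    import pytest  # imported here so the helper above has no dependencies

    # Hypothetical directory name; gitignored corpora are absent on a fresh clone.
    pdfs = corpus_pdfs(Path("tests/corpus_external"))
    if not pdfs:
        pytest.skip("third-party corpus not present on this checkout")
    for pdf in pdfs:
        assert pdf.stat().st_size > 0  # placeholder for a real extraction check
```

With this shape, a fresh clone reports the test as skipped rather than failed, and the same test runs in full once the corpus directory exists locally.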
Build from source with maturin:
git clone https://github.com/kkollsga/distillpdf
cd distillpdf
maturin develop --release # build + install into the current venv
bash tests/run.sh # build distillpdf + run pytest
pytest tests/ -q # or just run the tests against an installed build
License
MIT — see LICENSE. Use it anywhere, including commercial and closed-source projects.