Pure-Rust PDF extraction, built on lopdf, that distills documents into clean HTML for LLM and RAG pipelines

distillPDF

Turn any PDF into clean, LLM-ready HTML — structure-aware, pure-Rust, MIT-licensed.

distillpdf reads a PDF and reconstructs its structure — reading order, headings, paragraphs, lists, tables, and figures — then emits compact, semantic HTML (or plain text) ready to feed to an LLM or a RAG pipeline. No styling noise, no layout junk: just the content a model needs.

It's built on lopdf and shipped to Python via PyO3 + maturin as a small, self-contained wheel — a lightweight, permissively licensed alternative to AGPL/heavyweight extractors (PyMuPDF, pdfminer, Unstructured), with no system dependencies and no Python runtime deps.

🧪 Early release (0.0.2) — testers wanted. The API is small and may still change. If you have PDFs that come out wrong, please open an issue with the file (or a description) — real-world documents are exactly what this needs to get better.

Install

pip install distillpdf

Prebuilt wheels; no compiler or system libraries required.

Quickstart

import distillpdf

doc = distillpdf.open("paper.pdf")        # or distillpdf.from_bytes(data)

html     = doc.to_html()                  # clean, semantic HTML for an LLM
text     = doc.extract_text()             # plain text, in reading order
toc      = doc.toc()                      # [(level, title, page, anchor_id), ...]
abstract = doc.section("abstract")        # targeted section extraction
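
The comment on toc() above documents each entry as a (level, title, page, anchor_id) tuple. As a minimal sketch, assuming that shape holds and that level starts at 1 (the numbering base is an assumption, it is not stated here), an indented outline could be built like this. The helper format_outline and the sample entries are hypothetical and not part of distillpdf:

```python
def format_outline(entries, indent="  "):
    """Render toc-style tuples as an indented, plain-text outline.

    `entries` is assumed to look like the documented toc() result:
    [(level, title, page, anchor_id), ...], with level starting at 1.
    """
    lines = []
    for level, title, page, anchor_id in entries:
        depth = max(level - 1, 0)  # tolerate level 0 if the base differs
        lines.append(f"{indent * depth}{title} (p. {page}) #{anchor_id}")
    return "\n".join(lines)


# Hypothetical sample data standing in for doc.toc().
sample_toc = [
    (1, "Introduction", 1, "introduction"),
    (2, "Background", 2, "background"),
    (1, "Methods", 3, "methods"),
]
outline = format_outline(sample_toc)
print(outline)
```

With a real document, the same call would presumably be format_outline(doc.toc()).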

Need the raw pieces instead of HTML?

doc.extract_tables()   # cell grids (handles multi-level / colspan headers)
doc.extract_images()   # embedded images, with raw bytes
doc.extract_links()    # hyperlinks with targets
doc.extract_fonts()    # font inventory
doc.page_count()       # number of pages
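
The exact return type of extract_tables() is not spelled out here beyond "cell grids". As a rough sketch, assuming each table arrives as a list of rows and each row as a list of cell values (an assumption, so check the real return value first), a grid could be written out as CSV with the standard library. The helper grid_to_csv and the sample grid are hypothetical:

```python
import csv
import io


def grid_to_csv(grid):
    """Serialize a table grid (rows of cell values) to CSV text.

    Assumes `grid` is a list of rows, each a list of cell values.
    None cells are written as empty strings.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in grid:
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buffer.getvalue()


# Hypothetical grid standing in for one table from doc.extract_tables().
sample_grid = [
    ["Model", "Accuracy"],
    ["baseline", "0.81"],
    ["ours, tuned", "0.87"],
]
csv_text = grid_to_csv(sample_grid)
print(csv_text)
```

The csv module quotes cells that contain commas, so a value such as "ours, tuned" survives a round trip.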

Why distillPDF

  • Structure, not just text. Two-column reading order, multi-level table headers mapped onto a single grid (colspan), vector figures transcoded to inline SVG (including rotated axis labels), an auto-generated table of contents, and named section extraction (doc.section("methods")).
  • LLM-ready output. Lean, class-free HTML — semantic markup a model can read directly, with anchor ids so toc() entries link straight into the document.
  • Small & permissive. Pure Rust on lopdf, MIT-licensed, no system dependencies, no Python runtime dependencies. Drops into any pipeline without license headaches.
  • Fast. Native Rust extraction with a release build tuned for speed (LTO, single codegen unit).
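
To illustrate why heading anchors matter for RAG, here is a minimal sketch that splits HTML into one chunk per heading using only the standard library. It assumes the output uses h1 to h6 tags carrying id attributes; that shape is an assumption about to_html() output, not something confirmed above. HeadingChunker, chunk_by_heading, and the sample markup are hypothetical:

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Closing one of these inserts a space so adjacent blocks do not fuse.
BLOCK_TAGS = {"p", "li", "tr", "td", "th", "div", "section", "ul", "ol", "table"}


class HeadingChunker(HTMLParser):
    """Split HTML into chunks, starting a new chunk at every heading."""

    def __init__(self):
        super().__init__()
        self.chunks = []        # finished {"anchor", "title", "text"} dicts
        self._current = None    # chunk currently being filled
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._flush()
            anchor = dict(attrs).get("id")
            self._current = {"anchor": anchor, "title": "", "text": ""}
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False
        elif tag in BLOCK_TAGS and self._current is not None:
            self._current["text"] += " "

    def handle_data(self, data):
        if self._current is None:
            if not data.strip():
                return  # ignore whitespace before any content
            # Text before the first heading gets an untitled chunk.
            self._current = {"anchor": None, "title": "", "text": ""}
        key = "title" if self._in_heading else "text"
        self._current[key] += data

    def _flush(self):
        if self._current is not None:
            # Collapse runs of whitespace left over from the markup.
            for key in ("title", "text"):
                self._current[key] = " ".join(self._current[key].split())
            self.chunks.append(self._current)
            self._current = None

    def close(self):
        super().close()
        self._flush()


def chunk_by_heading(html):
    """Return a list of chunks, one per heading in `html`."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical markup standing in for doc.to_html().
sample_html = (
    '<h1 id="intro">Introduction</h1><p>PDFs are hard.</p>'
    '<h2 id="methods">Methods</h2><p>We parse</p><p>the layout.</p>'
)
chunks = chunk_by_heading(sample_html)
for chunk in chunks:
    print(chunk["anchor"], "|", chunk["title"], "|", chunk["text"])
```

Each chunk keeps its anchor, so a retrieved passage can be traced back to the matching toc() entry, assuming the ids line up as described above.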

Scope

In scope: text, table, image, and font extraction, plus an HTML/markdown output layer for RAG and LLM ingestion.

Out of scope (for now): page rendering, PDF generation, OCR.

Comparison

| | distillPDF | PyMuPDF | pdfminer.six | Unstructured |
|---|---|---|---|---|
| License | MIT | AGPL / commercial | MIT | Apache (heavy deps) |
| System deps | none | none | none | many |
| Implementation | Rust | C | Python | Python |

Structure-aware HTML partial

Contributing & feedback

This is a young project and feedback is the fastest way to improve it. The most useful things you can do:

  1. Try it on your PDFs and tell me where the output is wrong — open an issue.
  2. Star the repo if it's useful, so others can find it.
  3. PRs welcome — see the development notes below.

Development

The test suite lives in tests/ (pytest) and runs on CI. It needs only distillpdf and pytest installed. CI runs entirely on data we own: a self-contained demo PDF (tests/demo/, an end-to-end structure check) and a synthetic table corpus (tests/corpus_tables/). The third-party PDF corpora (tests/corpus*/) are gitignored, so their tests self-skip on a fresh clone and run only when the corpora are present locally, which gives deeper coverage.
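
The self-skip behaviour described above can be sketched as follows. This is not the project's actual test code: corpus_pdfs and require_corpus are hypothetical helpers, and the sketch raises unittest.SkipTest from the standard library where a pytest suite would more commonly call pytest.skip(...).

```python
import unittest
from pathlib import Path


def corpus_pdfs(root):
    """Return the PDFs under `root`, or an empty list if the directory is absent."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.pdf"))


def require_corpus(root):
    """Return the corpus PDFs, or skip the calling test when none are present."""
    pdfs = corpus_pdfs(root)
    if not pdfs:
        # In a pytest suite, pytest.skip(...) would be the usual call here.
        raise unittest.SkipTest(f"corpus not available: {root}")
    return pdfs
```

A corpus test would call require_corpus on its directory as its first step, so a fresh clone without the gitignored files reports a skip instead of a failure.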

Build from source with maturin:

git clone https://github.com/kkollsga/distillpdf
cd distillpdf
maturin develop --release    # build + install into the current venv
bash tests/run.sh            # build distillpdf + run pytest
pytest tests/ -q             # or just run the tests against an installed build

License

MIT — see LICENSE. Use it anywhere, including commercial and closed-source projects.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • distillpdf-0.0.2-cp38-abi3-win_amd64.whl (1.2 MB): CPython 3.8+, Windows x86-64
  • distillpdf-0.0.2-cp38-abi3-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.8+, manylinux (glibc 2.34+), x86-64
  • distillpdf-0.0.2-cp38-abi3-macosx_11_0_arm64.whl (1.2 MB): CPython 3.8+, macOS 11.0+, ARM64
