Fast Rust-native PDF and document extraction for Python, with Markdown, LaTeX, and JSON output.
Dongler
Turn PDFs and documents into clean Markdown, LaTeX, or structured JSON.
Rust-native and runs locally: no hosted service, API key, LLM, or OCR is needed for digitally born PDFs.
Documentation · Quick start · API reference · LLM context
Dongler is built around a path-first workflow: load a file, inspect the document object when you need to, then render the output format your pipeline wants. One Rust core powers the CLI, Python, TypeScript, and Rust APIs, so the extraction model is identical everywhere.
Install

    cargo install dongler               # CLI + Rust
    pip install dongler                 # Python
    npm install @cristianexer/dongler   # Node / TypeScript

For the Rust library, depend on dongler-core. The public dongler crate is the CLI package.
To run extraction in the browser or another WebAssembly host, build the
dongler-wasm crate (make build-wasm). It exposes the same engine over an
in-memory byte API, so files can be parsed client-side with no server. See
crates/dongler-wasm/README.md.
Parse a PDF
Python

    import dongler

    doc = dongler.load("report.pdf")
    markdown = doc.to_markdown()
    latex = doc.to_latex()
    data = doc.to_dict()

TypeScript

    import { load } from "@cristianexer/dongler";

    const doc = load("report.pdf");
    const markdown = doc.toMarkdown();
    const latex = doc.toLatex();
    const data = doc.toObject();

Rust

    use dongler_core::load_path;

    let doc = load_path("report.pdf")?;
    println!("{}", doc.to_markdown()?);

What you get
- 📄 Markdown · LaTeX · JSON
- ⚡ Native speed, local runtime
- 🧱 Structured document model
- 🧩 One API across stacks
- 📦 Pipeline-friendly batches
- 🔌 Beyond PDF
Why Dongler
Use Dongler when the job starts with a document path and the next step needs useful text quickly:
- Convert PDFs to Markdown for indexing, review, or RAG ingestion.
- Keep page/block/table/image metadata available through JSON.
- Run locally in scripts, services, queues, notebooks, and shell workflows.
- Use the same extraction model across Python, Node.js, Rust, and the CLI.
Supported inputs
Dongler focuses on digitally born PDFs and also supports native extraction for DOCX, XLSX, PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, image metadata including TIFF, and plain text/Markdown/TeX. It also reads gzip-compressed text/JSON/XML/CSV corpus files, bare gzip source files, and zip/tar/tar.gz source packages. Legacy binary Office and Outlook containers are detected and return explicit planned-format errors until their engines land.
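When a queue receives mixed files, a cheap pre-flight check can sort them before any parsing starts. The following is a minimal sketch, not Dongler's own detection logic: the `classify` helper and the extension sets are assumptions inferred from the format list above (in particular, the legacy extensions `.doc`, `.xls`, `.ppt`, and `.msg` are a guess at what "legacy binary Office and Outlook containers" covers).

```python
from pathlib import Path

# Extensions inferred from the format list above. Illustrative only:
# they are not read from Dongler and may not match its real detection.
NATIVE = {
    ".pdf", ".docx", ".xlsx", ".pptx", ".odt", ".ods", ".odp",
    ".html", ".xml", ".eml", ".json", ".jsonl", ".csv", ".tsv",
    ".tif", ".tiff", ".txt", ".md", ".tex",
}
ARCHIVE = {".gz", ".zip", ".tar"}
# Assumed legacy binary Office / Outlook extensions.
LEGACY = {".doc", ".xls", ".ppt", ".msg"}


def classify(path: str) -> str:
    """Return "native", "archive", "planned", or "unknown" for a path."""
    suffix = Path(path).suffix.lower()  # last suffix only, e.g. ".gz"
    if suffix in LEGACY:
        return "planned"
    if suffix in ARCHIVE:
        return "archive"
    if suffix in NATIVE:
        return "native"
    return "unknown"
```

An extension check is only a hint. The library reports unsupported inputs itself, so this kind of filter is mainly useful for routing or for logging before a batch runs.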
Batch processing
One result per file — a bad or unsupported document does not stop the batch.

    import dongler

    for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
        if result["ok"]:
            print(result["document"].to_markdown())
        else:
            print(f"{result['path']}: {result['error']}")

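For longer batches it can help to collect outcomes instead of printing them. The sketch below separates successes from failures using the result keys from the loop above. `split_results` and `_StubDoc` are hypothetical names, and the sketch assumes that `"path"` is present on successful results as well as failed ones, which the example above does not confirm.

```python
def split_results(results):
    """Split batch results into rendered Markdown and error messages.

    Each result is assumed to be a mapping with the keys used above:
    "ok", "path", and either "document" or "error".
    """
    rendered, failures = {}, {}
    for result in results:
        if result["ok"]:
            # Assumption: successful results also carry "path".
            rendered[result["path"]] = result["document"].to_markdown()
        else:
            failures[result["path"]] = str(result["error"])
    return rendered, failures


class _StubDoc:
    """Stand-in document so the sketch runs without dongler installed."""

    def to_markdown(self):
        return "hello"


# Hand-built results in the shape shown above, for demonstration only.
demo = [
    {"ok": True, "path": "notes.txt", "document": _StubDoc()},
    {"ok": False, "path": "invoice.pdf", "error": "unsupported"},
]
rendered, failures = split_results(demo)
```

With the real package, the input would be `dongler.load_many([...])` rather than the hand-built `demo` list.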
CLI

    dongler --version
    dongler inspect invoice.pdf
    dongler extract report.docx --format markdown
    dongler extract book.xlsx --format json
    dongler extract notes.txt --format latex

PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
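Where the Python bindings are not available but the CLI is on `PATH`, the `extract` command can be driven from a script. This is a rough sketch using only the subcommand and `--format` values shown above. The helper names are hypothetical, and it assumes the CLI writes the rendered document to standard output, which this page does not state.

```python
import subprocess

# The three --format values that appear in the CLI examples above.
FORMATS = ("markdown", "json", "latex")


def extract_command(path: str, fmt: str = "markdown") -> list[str]:
    """Build the argv for `dongler extract <path> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["dongler", "extract", path, "--format", fmt]


def extract(path: str, fmt: str = "markdown") -> str:
    """Run the CLI and return its standard output as text.

    Assumption: the rendered document is written to stdout.
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    completed = subprocess.run(
        extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.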
Benchmarks
Dongler sustains ~90 born-digital pages/second on a single host with no GPU, and on olmOCR-Bench (1,403 real PDFs, 7,019 unit checks) its table-structure pass rate improved +9.7% relative over the previous release on the identical harness. Full results, methodology, and before/after examples are on the benchmarks page.
Generated by scripts/run-benchmarks.py on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. All discovered files per dataset. olmOCR-Bench re-measured 2026-06-11 after the table-extraction work; the full table regenerates on the next release run.
Coverage is parse / bbox / anchors. Ground-truth accuracy is token-F1, olmOCR unit-check pass rate, or full-image IoU; n/a means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in eval/out/benchmarks/latest.json.
| Dataset | Status | Local data | Docs eval | Coverage | Pages/sec | GT accuracy |
|---|---|---|---|---|---|---|
| DocLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| DocBank | ok | 735.6 MB | 200 | 100.0% / 100.0% / 100.0% | 81.94 | 89.5% |
| PubTabNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubTables-1M | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| TableBank | ok | 1.6 MB | 10 | 100.0% / 100.0% / 100.0% | 193.45 | 100.0% |
| FUNSD | ok | 42.6 MB | 200 | 100.0% / 48.9% / 100.0% | 96.09 | 100.0% |
| SROIE | ok | 627.3 MB | 1264 | 100.0% / 92.7% / 100.0% | 231.85 | 100.0% |
| RVL-CDIP | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| READoc | ok | 39.9 MB | 959 | 100.0% / n/a / n/a | 96.86 | 100.0% |
| OmniDocBench | ok | 40.3 MB | 1 | 100.0% / 100.0% / 100.0% | 1030.96 | 88.5% |
| olmOCR-Bench | ok | 340.5 MB | 1403 | 100.0% / 100.0% / 100.0% | 21.15 | 22.7% |
| ckorzen benchmark | ok | 67.1 MB | 192 | 100.0% / 15.4% / 100.0% | 100.37 | 88.4% |
| S2ORC | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PMC OA | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| arXiv source/PDF | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
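To make the token-F1 column easier to interpret, here is a minimal sketch of a bag-of-tokens F1 score. It is a common formulation rather than the benchmark script's exact metric: whitespace tokenization and the handling of empty inputs are assumptions, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Bag-of-tokens F1 between extracted text and ground truth.

    Assumption: tokens are whitespace-separated and case-sensitive.
    """
    pred = predicted.split()
    ref = reference.split()
    if not pred or not ref:
        # Both empty counts as a perfect match; one empty scores zero.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a fused word such as `Netincome` matches neither `Net` nor `income`, so word-segmentation errors lower both precision and recall.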
Extraction-quality improvements
A controlled A/B comparison of the current parser against the previous release baseline, run on the full olmOCR-Bench corpus (1,403 real PDFs, 7,019 unit checks, identical harness and release build), isolates the gains from the recent text-spacing and table-structure work:
| Signal | Before | After |
|---|---|---|
| olmOCR table-structure checks passed | 59.7% | 65.5% (+9.7% relative) |
| olmOCR reading-order checks passed | 30.7% | 32.0% |
| Overall olmOCR checks passed | 1562 / 7019 | 1595 / 7019 |
| Throughput (born-digital) | ~90 pages/sec | ~90 pages/sec |
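The "+9.7% relative" figure is a relative change in pass rate, not a percentage-point difference. The short sketch below reproduces the arithmetic from the numbers in the table; the function names are illustrative only.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


def pass_rate(passed: int, total: int) -> float:
    """Share of checks passed, in percent."""
    return passed / total * 100


# Table-structure checks: 59.7% before, 65.5% after.
table_points = 65.5 - 59.7                      # percentage-point gain
table_relative = relative_gain(59.7, 65.5)      # relative gain in percent

# Overall checks: 1562 / 7019 before, 1595 / 7019 after.
overall_before = pass_rate(1562, 7019)
overall_after = pass_rate(1595, 7019)
```

Rounded to one decimal, the table-structure gain is 5.8 percentage points or 9.7% relative, and the overall pass rate moves from 22.3% to 22.7%, which matches the olmOCR-Bench row in the dataset table.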
Born-digital word segmentation is fixed end to end (UNITEDSTATES → UNITED STATES,
Netincome → Net income, fi scal → fiscal), and multi-section financial statements now
extract as a single aligned table — in Markdown, LaTeX, and JSON — instead of a label column
followed by a detached block of numbers. See the
benchmarks page for the full breakdown.
License
Dongler is MIT licensed. Copyright (c) 2026 Daniel Fat. See LICENSE and NOTICE
for the full notice text.