Created by Daniel Fat. Python bindings for a Rust-native document extraction engine focused on Markdown and LaTeX output.
Dongler
Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It is built for the workflow developers actually need: load a document path, extract structure once, then render clean Markdown or LaTeX from the same document object.
Install
cargo install dongler
pip install dongler
npm install @cristianexer/dongler
For Rust library usage, depend on dongler-core. The public dongler crate is the CLI package.
How to Use
Dongler currently supports native extraction for PDF, DOCX, XLSX, PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, image metadata (including TIFF), and plain text/Markdown/TeX. It also reads gzip-compressed text/JSON/XML/CSV corpus files, bare gzip source files, and zip/tar/tar.gz source packages. Legacy binary Office and Outlook containers are detected and return explicit planned-format errors until their engines land. The same API works across all supported formats, so the same code can extract Markdown from a PDF invoice, a spreadsheet, a web page, an email, a dataset annotation, or a plain-text note.
Python:
import dongler
doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
TypeScript:
import { load } from "@cristianexer/dongler";
const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
Rust:
use dongler_core::load_path;
fn main() -> dongler_core::Result<()> {
    let doc = load_path("invoice.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}
Batch Processing
Batch processing returns one result per file. One bad or unsupported document does not stop the batch.
Python:
import dongler
for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")
TypeScript:
import { loadMany } from "@cristianexer/dongler";
for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}
Rust:
use dongler_core::load_many;
fn main() {
    // One result per input path; a failed file does not abort the batch.
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}
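Because a batch yields one result per file, it is often convenient to separate successes from failures before reporting. The following is a minimal sketch, assuming each result is a dict with the `ok`, `document`, `path`, and `error` keys used in the Python example above. `split_results` and `_FakeDoc` are hypothetical names introduced here for illustration and are not part of Dongler.

```python
from typing import Any, Dict, Iterable, List, Tuple


def split_results(
    results: Iterable[Dict[str, Any]],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Partition batch results into rendered Markdown and (path, error) pairs.

    Assumes the result shape shown in the README example:
    result["ok"], result["document"], result["path"], result["error"].
    """
    rendered: List[str] = []
    failures: List[Tuple[str, str]] = []
    for result in results:
        if result["ok"]:
            rendered.append(result["document"].to_markdown())
        else:
            failures.append((result["path"], str(result["error"])))
    return rendered, failures


class _FakeDoc:
    """Stand-in for a loaded document so the sketch runs without Dongler."""

    def __init__(self, text: str) -> None:
        self._text = text

    def to_markdown(self) -> str:
        return self._text


# Hypothetical sample data that mimics one success and one failure.
sample = [
    {"ok": True, "path": "notes.txt", "document": _FakeDoc("# Notes"), "error": None},
    {"ok": False, "path": "invoice.pdf", "document": None, "error": "unsupported"},
]
rendered, failures = split_results(sample)
```

With the real package installed, the same function could presumably be fed `dongler.load_many(paths)` directly in place of `sample`.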
CLI
dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract report.docx --format markdown
dongler extract book.xlsx --format json
dongler extract deck.pptx --format markdown
dongler extract notes.odt --format markdown
dongler extract annotations.json --format markdown
dongler extract boxes.csv --format json
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json
PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
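The CLI can also be driven from a script. The sketch below builds the `dongler extract <path> --format <fmt>` command shown above and runs it with `subprocess`. It assumes the CLI writes its output to stdout and exits with a non-zero status on failure, neither of which is stated here. `build_extract_command` and `extract` are hypothetical helper names.

```python
import subprocess
from typing import List

# Output formats that appear in the CLI examples above.
FORMATS = ("markdown", "latex", "json")


def build_extract_command(path: str, fmt: str = "markdown") -> List[str]:
    """Return the argument list for `dongler extract <path> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["dongler", "extract", path, "--format", fmt]


def extract(path: str, fmt: str = "markdown") -> str:
    """Run the CLI and return its stdout.

    Assumption: output goes to stdout and failures produce a non-zero exit
    code, so check=True raises CalledProcessError on error.
    """
    completed = subprocess.run(
        build_extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.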
Benchmarks
Generated by scripts/run-benchmarks.py on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. All discovered files per dataset.
Coverage is parse / bbox / anchors. Ground-truth accuracy is token-F1, olmOCR unit-check pass rate, or full-image IoU; n/a means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in eval/out/benchmarks/latest.json.
| Dataset | Status | Local data | Docs eval | Coverage | Pages/sec | GT accuracy |
|---|---|---|---|---|---|---|
| DocLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| DocBank | ok | 735.6 MB | 200 | 100.0% / 100.0% / 100.0% | 81.94 | 89.5% |
| PubTabNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubTables-1M | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| TableBank | ok | 1.6 MB | 10 | 100.0% / 100.0% / 100.0% | 193.45 | 100.0% |
| FUNSD | ok | 42.6 MB | 200 | 100.0% / 48.9% / 100.0% | 96.09 | 100.0% |
| SROIE | ok | 627.3 MB | 1264 | 100.0% / 92.7% / 100.0% | 231.85 | 100.0% |
| RVL-CDIP | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| READoc | ok | 39.9 MB | 959 | 100.0% / n/a / n/a | 96.86 | 100.0% |
| OmniDocBench | ok | 40.3 MB | 1 | 100.0% / 100.0% / 100.0% | 1030.96 | 88.5% |
| olmOCR-Bench | ok | 340.5 MB | 1403 | 100.0% / 100.0% / 100.0% | 20.97 | 20.3% |
| ckorzen benchmark | ok | 67.1 MB | 192 | 100.0% / 15.4% / 100.0% | 100.37 | 88.4% |
| S2ORC | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PMC OA | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| arXiv source/PDF | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
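For readers unfamiliar with token-F1, the sketch below shows one common definition: the harmonic mean of token-level precision and recall over a bag of tokens. The whitespace tokenization and the handling of empty inputs are assumptions. The scoring actually used by scripts/run-benchmarks.py is not shown here and may differ.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Bag-of-tokens F1 between extracted text and ground-truth text.

    Assumption: tokens are whitespace-separated and compared case-sensitively.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens and not ref_tokens:
        return 1.0  # two empty texts are treated as a perfect match
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, comparing `"total due 42"` against `"total due 42 usd"` gives precision 1.0 and recall 0.75, so the score is 6/7.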
License
Dongler is licensed under the MIT License. See LICENSE and NOTICE.
File details
Details for the file dongler-0.3.1.tar.gz.
File metadata
- Download URL: dongler-0.3.1.tar.gz
- Upload date:
- Size: 106.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 343c96e168ddb96951c358312cd7d9443e0001d7283124ea347dcdd5b8f66e43 |
| MD5 | fd7fb2b7984bf24f62e021deb99d3aa5 |
| BLAKE2b-256 | d0399e0d2011d0421940ca09987605f77d918e0aaeb7ed9a3006debb15e4a404 |
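A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: `sha256_of_file` and `matches_expected` are hypothetical helper names, and the expected value is the SHA256 digest listed for dongler-0.3.1.tar.gz.

```python
import hashlib

# SHA256 digest listed for dongler-0.3.1.tar.gz.
EXPECTED_SHA256 = "343c96e168ddb96951c358312cd7d9443e0001d7283124ea347dcdd5b8f66e43"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's digest with the expected hex string, ignoring case."""
    return sha256_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use flat regardless of file size. Calling `matches_expected("dongler-0.3.1.tar.gz")` on the downloaded archive returns `True` only when the digests agree.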
Provenance
The following attestation bundles were made for dongler-0.3.1.tar.gz:
Publisher: workflow.yml on cristianexer/dongler
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dongler-0.3.1.tar.gz
- Subject digest: 343c96e168ddb96951c358312cd7d9443e0001d7283124ea347dcdd5b8f66e43
- Sigstore transparency entry: 1666989992
- Sigstore integration time:
- Permalink: cristianexer/dongler@6d6c821a807a4d612bcce902245598bc36fda6ef
- Branch / Tag: refs/heads/main
- Owner: https://github.com/cristianexer
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@6d6c821a807a4d612bcce902245598bc36fda6ef
- Trigger Event: push
File details
Details for the file dongler-0.3.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: dongler-0.3.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 994.9 kB
- Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0dbba00755bdd6809e128cfa1cd13fbdbbcc3af33c6b795df4a90cbb90e8f4b6 |
| MD5 | 45726b295e16981b2b0e6542eea0d3f2 |
| BLAKE2b-256 | b6e31ce1b00af4cd782bd92f782a9b5516b8d9fc13dfa50e7eaedfa97fe3d39c |
Provenance
The following attestation bundles were made for dongler-0.3.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:
Publisher: workflow.yml on cristianexer/dongler
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dongler-0.3.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Subject digest: 0dbba00755bdd6809e128cfa1cd13fbdbbcc3af33c6b795df4a90cbb90e8f4b6
- Sigstore transparency entry: 1666990162
- Sigstore integration time:
- Permalink: cristianexer/dongler@6d6c821a807a4d612bcce902245598bc36fda6ef
- Branch / Tag: refs/heads/main
- Owner: https://github.com/cristianexer
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@6d6c821a807a4d612bcce902245598bc36fda6ef
- Trigger Event: push