Fast Rust-native PDF and document extraction for Python, with Markdown, LaTeX, and JSON output.

Dongler

Turn PDFs and documents into clean Markdown, LaTeX, or structured JSON.
Rust-native and runs locally: no hosted service, API key, LLM, or OCR is needed for born-digital PDFs.


Dongler is built around a path-first workflow: load a file, inspect the document object when you need to, then render the output format your pipeline wants. One Rust core powers the CLI, Python, TypeScript, and Rust APIs, so the extraction model is identical everywhere.

Install

cargo install dongler                  # CLI
pip install dongler                    # Python
npm install @cristianexer/dongler      # Node / TypeScript

To use Dongler as a Rust library, depend on dongler-core. The public dongler crate is the CLI package.

Parse a PDF

Python

import dongler

doc = dongler.load("report.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
data = doc.to_dict()
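
If a pipeline needs all three renderings on disk, the calls above can be wrapped in a small helper. This is a minimal sketch, not part of Dongler: `export_all` is a hypothetical name, and it assumes that `to_dict()` returns data that `json.dumps` can serialize (with `default=str` as a fallback for anything that is not).

```python
import json
from pathlib import Path


def export_all(doc, out_dir, stem):
    """Write Markdown, LaTeX, and JSON renderings of one document.

    `doc` is any object exposing to_markdown(), to_latex(), and to_dict(),
    such as the value returned by dongler.load(). Hypothetical helper.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {
        "markdown": out / f"{stem}.md",
        "latex": out / f"{stem}.tex",
        "json": out / f"{stem}.json",
    }
    paths["markdown"].write_text(doc.to_markdown(), encoding="utf-8")
    paths["latex"].write_text(doc.to_latex(), encoding="utf-8")
    # Assumption: to_dict() returns JSON-serializable data; default=str is a
    # fallback for any value that is not.
    paths["json"].write_text(
        json.dumps(doc.to_dict(), ensure_ascii=False, indent=2, default=str),
        encoding="utf-8",
    )
    return paths
```

A call such as `export_all(dongler.load("report.pdf"), "out", "report")` would then produce `out/report.md`, `out/report.tex`, and `out/report.json`.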

TypeScript

import { load } from "@cristianexer/dongler";

const doc = load("report.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
const data = doc.toObject();

Rust

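use dongler_core::load_path;

// Assumes dongler-core's error types implement std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = load_path("report.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}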

What you get

📄 Markdown · LaTeX · JSON
Three renderers from one document object — headings, tables, lists, figures, and emphasis.

⚡ Native speed, local runtime
A custom Rust PDF parser with rayon page-parallelism. No hosted service, API key, LLM, or OCR for born-digital PDFs.

🧱 Structured document model
Page, block, table, image, span, warning, and metadata fields — with source anchors back to PDF objects.

🧩 One API across stacks
The same extraction model in Python, Node.js, Rust, and the CLI.

📦 Pipeline-friendly batches
Batch APIs return one result per file — a single bad document never stops the job.

🔌 Beyond PDF
Native extraction for DOCX/XLSX/PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, images, and archives.

Why Dongler

Use Dongler when the job starts with a document path and the next step needs useful text quickly:

  • Convert PDFs to Markdown for indexing, review, or RAG ingestion.
  • Keep page/block/table/image metadata available through JSON.
  • Run locally in scripts, services, queues, notebooks, and shell workflows.
  • Use the same extraction model across Python, Node.js, Rust, and the CLI.
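
For the indexing and RAG case, the Markdown output usually needs to be split into chunks before it is embedded or stored. The sketch below is one simple way to do that with the standard library only. It is not a Dongler API: `split_by_heading` is a hypothetical helper that splits at ATX headings (`#` to `######`) and does not try to skip `#` lines inside fenced code blocks.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(markdown):
    """Split Markdown text into (heading, body) chunks at ATX headings.

    Text before the first heading is returned with heading None.
    """
    chunks = []
    title, body = None, []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous chunk before starting a new one.
            if title is not None or any(l.strip() for l in body):
                chunks.append((title, "\n".join(body).strip()))
            title, body = match.group(2), []
        else:
            body.append(line)
    # Flush the final chunk, if there is one.
    if title is not None or any(l.strip() for l in body):
        chunks.append((title, "\n".join(body).strip()))
    return chunks
```

In use, `split_by_heading(doc.to_markdown())` would give one chunk per section, which can then be indexed with the section title as metadata.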

Supported inputs

Dongler focuses on born-digital PDFs and also supports native extraction for DOCX, XLSX, PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, image metadata (including TIFF), and plain text/Markdown/TeX. It also reads gzip-compressed text/JSON/XML/CSV corpus files, bare gzip source files, and zip/tar/tar.gz source packages. Legacy binary Office and Outlook containers are detected and return explicit planned-format errors until their engines land.

Batch processing

One result per file — a bad or unsupported document does not stop the batch.

import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")
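
When a job needs a summary at the end rather than a printout per file, the results can be partitioned first. This is a minimal sketch built only on the `ok`, `path`, `document`, and `error` keys shown above. `partition_results` is a hypothetical helper, and it assumes that successful results carry a `path` key as well, which the example above does not confirm.

```python
def partition_results(results):
    """Split batch results into (documents_by_path, errors_by_path).

    `results` is any iterable of dict-like results with the keys used in
    the loop above, for example the value of dongler.load_many(paths).
    """
    documents, errors = {}, {}
    for result in results:
        if result["ok"]:
            # Assumption: successful results also include "path".
            documents[result["path"]] = result["document"]
        else:
            errors[result["path"]] = str(result["error"])
    return documents, errors
```

A caller could then write `documents, errors = partition_results(dongler.load_many(paths))`, process `documents`, and log or retry whatever ended up in `errors`.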

CLI

dongler --version
dongler inspect invoice.pdf
dongler extract report.docx --format markdown
dongler extract book.xlsx   --format json
dongler extract notes.txt   --format latex

PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
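
The CLI can also be driven from a script in an environment where only the binary is installed. The sketch below builds the `dongler extract <path> --format <format>` invocation shown above, using only the three format names that appear there. `extract_command` and `run_extract` are hypothetical helpers, and `run_extract` assumes the rendered output is written to standard output, which this README does not state.

```python
import subprocess

# Format names taken from the CLI examples above.
FORMATS = ("markdown", "json", "latex")


def extract_command(path, fmt="markdown"):
    """Build the argument list for `dongler extract`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["dongler", "extract", str(path), "--format", fmt]


def run_extract(path, fmt="markdown"):
    """Run the CLI and return its standard output as text.

    Assumption: the CLI prints the rendered document to stdout.
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    completed = subprocess.run(
        extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.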

Benchmarks

Generated by scripts/run-benchmarks.py on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. All discovered files per dataset.

Coverage is parse / bbox / anchors. Ground-truth accuracy is token-F1, olmOCR unit-check pass rate, or full-image IoU; n/a means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in eval/out/benchmarks/latest.json.
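
For readers unfamiliar with token-F1, the metric is the harmonic mean of token precision and token recall between extracted and reference text. The sketch below shows the standard definition with whitespace tokenization and multiset overlap. It is an illustration only: the tokenization and normalization used by scripts/run-benchmarks.py are not described here and may differ.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Token-level F1 between two strings, using whitespace tokens."""
    pred = Counter(predicted.split())
    ref = Counter(reference.split())
    if not pred and not ref:
        # Two empty texts agree completely.
        return 1.0
    # Multiset intersection: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, extracting "a b c d" against the reference "a b" gives precision 0.5 and recall 1.0, so the score is 2/3.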

| Dataset | Status | Local data | Docs eval | Coverage (parse / bbox / anchors) | Pages/sec | GT accuracy |
|---|---|---|---|---|---|---|
| DocLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| DocBank | ok | 735.6 MB | 200 | 100.0% / 100.0% / 100.0% | 81.94 | 89.5% |
| PubTabNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubTables-1M | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| TableBank | ok | 1.6 MB | 10 | 100.0% / 100.0% / 100.0% | 193.45 | 100.0% |
| FUNSD | ok | 42.6 MB | 200 | 100.0% / 48.9% / 100.0% | 96.09 | 100.0% |
| SROIE | ok | 627.3 MB | 1264 | 100.0% / 92.7% / 100.0% | 231.85 | 100.0% |
| RVL-CDIP | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| READoc | ok | 39.9 MB | 959 | 100.0% / n/a / n/a | 96.86 | 100.0% |
| OmniDocBench | ok | 40.3 MB | 1 | 100.0% / 100.0% / 100.0% | 1030.96 | 88.5% |
| olmOCR-Bench | ok | 340.5 MB | 1403 | 100.0% / 100.0% / 100.0% | 20.97 | 20.3% |
| ckorzen benchmark | ok | 67.1 MB | 192 | 100.0% / 15.4% / 100.0% | 100.37 | 88.4% |
| S2ORC | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PMC OA | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| arXiv source/PDF | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |

License

Dongler is MIT licensed. Copyright (c) 2026 Daniel Fat. See LICENSE and NOTICE for the full notice text.
