Fast Rust-native PDF and document extraction for Python, with Markdown, LaTeX, and JSON output.

Dongler

Dongler is a fast, Rust-native PDF extraction package for developers who need clean Markdown, LaTeX, or structured JSON without wiring together a stack of document tools.

It is designed around the practical path-first workflow: load a file, inspect the document object when needed, then render the output format your pipeline needs. The same core engine powers the CLI, Python package, TypeScript package, and Rust API.

Install

cargo install dongler
pip install dongler
npm install @cristianexer/dongler

For Rust library usage, depend on dongler-core. The public dongler crate is the CLI package.

Parse a PDF

Python:

import dongler

doc = dongler.load("report.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
data = doc.to_dict()

TypeScript:

import { load } from "@cristianexer/dongler";

const doc = load("report.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
const data = doc.toObject();

Rust:

use dongler_core::load_path;

fn main() -> dongler_core::Result<()> {
    let doc = load_path("report.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}
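
As a small illustration of the load-then-render flow, the sketch below writes all three renderings of one document to disk. It relies only on the `to_markdown()`, `to_latex()`, and `to_dict()` methods shown in the Python example above. The helper name `write_outputs`, the file suffixes, and the `StubDocument` stand-in are hypothetical, and it assumes `to_dict()` returns JSON-serializable data.

```python
import json
from pathlib import Path


def write_outputs(doc, out_dir, stem):
    """Write Markdown, LaTeX, and JSON renderings of `doc` into `out_dir`.

    `doc` is any object exposing to_markdown(), to_latex(), and to_dict().
    Returns a dict mapping format name to the written path.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {
        "markdown": out / f"{stem}.md",
        "latex": out / f"{stem}.tex",
        "json": out / f"{stem}.json",
    }
    paths["markdown"].write_text(doc.to_markdown(), encoding="utf-8")
    paths["latex"].write_text(doc.to_latex(), encoding="utf-8")
    # Assumption: to_dict() returns data that json.dumps can serialize.
    paths["json"].write_text(
        json.dumps(doc.to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return paths


class StubDocument:
    """Hypothetical stand-in so the sketch runs without Dongler installed."""

    def to_markdown(self):
        return "# Report\n"

    def to_latex(self):
        return "\\section{Report}\n"

    def to_dict(self):
        return {"title": "Report"}
```

With the package installed, the same helper should accept a real document, for example `write_outputs(dongler.load("report.pdf"), "out", "report")`.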

What You Get

  • Markdown, LaTeX, and JSON renderers from the same document object.
  • Page, block, table, image, warning, and metadata fields for downstream code.
  • Rust-native PDF extraction with no hosted service, API key, LLM, or OCR dependency for digitally born PDFs.
  • Python, TypeScript, Rust, and CLI entrypoints over the same core.
  • Batch APIs that return one result per file, so one unsupported document does not stop a job.

Why Dongler

Use Dongler when the job starts with a document path and the next step needs useful text quickly:

  • Convert PDFs to Markdown for indexing, review, or RAG ingestion.
  • Keep page/block/table/image metadata available through JSON.
  • Run locally in scripts, services, queues, notebooks, and shell workflows.
  • Use the same extraction model across Python, Node.js, Rust, and the CLI.

Supported Inputs

Dongler focuses on digitally born PDFs. It also supports native extraction for DOCX, XLSX, PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, image metadata (including TIFF), and plain text, Markdown, and TeX. Compressed and packaged inputs are supported as well: gzip-compressed text/JSON/XML/CSV corpus files, bare gzip source files, and zip/tar/tar.gz source packages.

Legacy binary Office and Outlook containers are detected but not yet extracted. They return an explicit planned-format error until their engines land.

More Examples

Plain text, Markdown, office files, and data files use the same API:

import dongler

doc = dongler.load("invoice.docx")
markdown = doc.to_markdown()
latex = doc.to_latex()

Batch Processing

Batch processing returns one result per file. One bad or unsupported document does not stop the batch.

Python:

import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")

TypeScript:

import { loadMany } from "@cristianexer/dongler";

for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}

Rust:

use dongler_core::load_many;

fn main() {
    // Each result carries its own status, so a failed file does not abort the loop.
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}
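
Because every file yields its own result, a job can collect successes and failures in one pass and report the failures at the end. The sketch below assumes only the `ok`, `document`, `path`, and `error` keys used in the Python example above. The helper name `split_results` and the hand-built `sample` list are hypothetical.

```python
def split_results(results):
    """Separate batch results into loaded documents and (path, error) pairs."""
    documents, failures = [], []
    for result in results:
        if result["ok"]:
            documents.append(result["document"])
        else:
            # str() keeps the pair printable whatever type the error has.
            failures.append((result["path"], str(result["error"])))
    return documents, failures


# Hand-built results using the same keys as the Python example above;
# the string "<document>" stands in for a real document object.
sample = [
    {"ok": True, "path": "notes.txt", "document": "<document>", "error": None},
    {"ok": False, "path": "old.doc", "document": None, "error": "planned format"},
]
documents, failures = split_results(sample)
```

In a real job the argument would be `dongler.load_many(paths)` rather than `sample`.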

CLI

dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract report.docx --format markdown
dongler extract book.xlsx --format json
dongler extract deck.pptx --format markdown
dongler extract notes.odt --format markdown
dongler extract annotations.json --format markdown
dongler extract boxes.csv --format json
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json

PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
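
For pipelines that shell out to the CLI rather than importing the package, the commands above can be assembled programmatically. This is a minimal sketch using only the `extract` subcommand and the `--format` values shown above. The helper names are hypothetical, and `run_extract` assumes the CLI writes the rendered output to standard output, which this page does not confirm.

```python
import subprocess

# Format names taken from the CLI examples above.
FORMATS = ("markdown", "latex", "json")


def extract_command(path, fmt="markdown"):
    """Build the argument list for `dongler extract <path> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["dongler", "extract", str(path), "--format", fmt]


def run_extract(path, fmt="markdown"):
    """Run the CLI and return its captured stdout.

    Assumption: the rendered document is printed to stdout.
    Requires the `dongler` binary to be on PATH.
    """
    completed = subprocess.run(
        extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.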

Benchmarks

Generated by scripts/run-benchmarks.py on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. The run covers all discovered files in each dataset.

The Coverage column reports three rates in the order parse / bbox / anchors. Ground-truth (GT) accuracy is token-F1, olmOCR unit-check pass rate, or full-image IoU, depending on the dataset; n/a means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in eval/out/benchmarks/latest.json.

| Dataset | Status | Local data | Docs eval | Coverage (parse / bbox / anchors) | Pages/sec | GT accuracy |
|---|---|---|---|---|---|---|
| DocLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| DocBank | ok | 735.6 MB | 200 | 100.0% / 100.0% / 100.0% | 81.94 | 89.5% |
| PubTabNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubTables-1M | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| TableBank | ok | 1.6 MB | 10 | 100.0% / 100.0% / 100.0% | 193.45 | 100.0% |
| FUNSD | ok | 42.6 MB | 200 | 100.0% / 48.9% / 100.0% | 96.09 | 100.0% |
| SROIE | ok | 627.3 MB | 1264 | 100.0% / 92.7% / 100.0% | 231.85 | 100.0% |
| RVL-CDIP | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| READoc | ok | 39.9 MB | 959 | 100.0% / n/a / n/a | 96.86 | 100.0% |
| OmniDocBench | ok | 40.3 MB | 1 | 100.0% / 100.0% / 100.0% | 1030.96 | 88.5% |
| olmOCR-Bench | ok | 340.5 MB | 1403 | 100.0% / 100.0% / 100.0% | 20.97 | 20.3% |
| ckorzen benchmark | ok | 67.1 MB | 192 | 100.0% / 15.4% / 100.0% | 100.37 | 88.4% |
| S2ORC | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PMC OA | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| arXiv source/PDF | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
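
To make the token-F1 figures easier to interpret, here is a minimal sketch of one common way to compute token-F1 between extracted text and a reference. It assumes whitespace tokenization and multiset overlap. The tokenization and normalization used by scripts/run-benchmarks.py are not described here and may differ, so this will not necessarily reproduce the numbers in the table.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Token-level F1 between two strings.

    Assumptions: tokens are whitespace-separated and repeated tokens
    count as many times as they appear in both strings.
    """
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens and not ref_tokens:
        return 1.0  # two empty texts agree trivially
    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring "the cat sat" against the reference "the cat" gives precision 2/3 and recall 1, so the F1 is 0.8.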

License

Dongler is MIT licensed. Copyright (c) 2026 Daniel Fat. See LICENSE and NOTICE for the full notice text.
