Created by Daniel Fat. Python bindings for a Rust-native document extraction engine focused on Markdown and LaTeX output.

Dongler

Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It is built for the workflow developers actually need: load a document path, extract structure once, then render clean Markdown or LaTeX from the same document object.

Install

cargo install dongler
pip install dongler
npm install @cristianexer/dongler

For Rust library usage, depend on dongler-core. The public dongler crate is the CLI package.

How to Use

Dongler currently extracts the following formats natively: PDF, DOCX, XLSX, PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, image metadata (including TIFF), and plain text, Markdown, and TeX. It also reads gzip-compressed text, JSON, XML, and CSV corpus files, bare gzip source files, and zip, tar, and tar.gz source packages. Legacy binary Office and Outlook containers are detected and return explicit planned-format errors until their engines land. The API is the same across all supported formats, so the same code extracts Markdown from a PDF invoice, a spreadsheet, a web page, an email, a dataset annotation, or a plain text note.

Python:

import dongler

doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
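
Because both renderers work from the same document object, a small helper can save the two outputs side by side. The following is a minimal sketch, not part of Dongler: write_renderings, stem, and out_dir are hypothetical names, and it assumes to_markdown() and to_latex() return strings.

```python
from pathlib import Path


def write_renderings(doc, stem, out_dir="."):
    """Save the Markdown and LaTeX renderings of one loaded document.

    `doc` is any object exposing to_markdown() and to_latex(); both are
    assumed to return str. Returns the two paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    md_path = out / f"{stem}.md"
    tex_path = out / f"{stem}.tex"
    # Both renderers are called on the same already-loaded document object.
    md_path.write_text(doc.to_markdown(), encoding="utf-8")
    tex_path.write_text(doc.to_latex(), encoding="utf-8")
    return md_path, tex_path
```

With the package installed, write_renderings(dongler.load("invoice.pdf"), "invoice") would produce invoice.md and invoice.tex in the current directory.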

TypeScript:

import { load } from "@cristianexer/dongler";

const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();

Rust:

use dongler_core::load_path;

fn main() -> dongler_core::Result<()> {
    let doc = load_path("invoice.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}

Batch Processing

Batch processing returns one result per file. One bad or unsupported document does not stop the batch.

Python:

import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")
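
When the results need to be kept rather than printed, the same loop can sort them into successes and failures. This is a sketch under assumptions: split_results is a hypothetical helper, and it assumes successful results also carry a "path" key, which the example above only shows for failures.

```python
def split_results(results):
    """Separate batch results into rendered Markdown and per-file errors.

    `results` is an iterable of dict-like items with the keys used in the
    loop above: "ok", "document", "path", and "error".
    Assumption: "path" is present on successful results as well.
    """
    rendered = {}
    failures = {}
    for result in results:
        if result["ok"]:
            rendered[result["path"]] = result["document"].to_markdown()
        else:
            # Store the error as text so it can be logged or serialized.
            failures[result["path"]] = str(result["error"])
    return rendered, failures
```

A call such as split_results(dongler.load_many(paths)) then yields one dictionary of Markdown keyed by path and one dictionary of error messages, so a failed file can be retried or reported without rerunning the batch.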

TypeScript:

import { loadMany } from "@cristianexer/dongler";

for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}

Rust:

use dongler_core::load_many;

fn main() {
    // Each result reports success or failure for one input path.
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}

CLI

dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract report.docx --format markdown
dongler extract book.xlsx --format json
dongler extract deck.pptx --format markdown
dongler extract notes.odt --format markdown
dongler extract annotations.json --format markdown
dongler extract boxes.csv --format json
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json

PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
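
The extract command can also be driven from a script. The sketch below only uses the invocation shown above (dongler extract PATH --format FORMAT); build_extract_command and run_extract are hypothetical helpers, and it is an assumption that the CLI writes the rendered output to stdout and exits with a non-zero status on failure.

```python
import subprocess

# Output formats that appear in the CLI examples above.
KNOWN_FORMATS = ("markdown", "latex", "json")


def build_extract_command(path, fmt="markdown"):
    """Build the argument list for `dongler extract PATH --format FMT`."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["dongler", "extract", str(path), "--format", fmt]


def run_extract(path, fmt="markdown"):
    """Run the CLI and return whatever it printed.

    Assumption: output goes to stdout and failures produce a non-zero
    exit code, in which case subprocess raises CalledProcessError.
    """
    completed = subprocess.run(
        build_extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.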

Benchmarks

Generated by scripts/run-benchmarks.py on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. Every discovered file in each dataset is evaluated.

The Coverage column reports parse / bbox / anchors rates. The GT accuracy column reports token-F1, the olmOCR unit-check pass rate, or full-image IoU, depending on the dataset; n/a means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in eval/out/benchmarks/latest.json.

| Dataset | Status | Local data | Docs eval | Coverage (parse / bbox / anchors) | Pages/sec | GT accuracy |
|---|---|---|---|---|---|---|
| DocLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubLayNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| DocBank | ok | 735.6 MB | 200 | 100.0% / 100.0% / 100.0% | 81.94 | 89.5% |
| PubTabNet | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PubTables-1M | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| TableBank | ok | 1.6 MB | 10 | 100.0% / 100.0% / 100.0% | 193.45 | 100.0% |
| FUNSD | ok | 42.6 MB | 200 | 100.0% / 48.9% / 100.0% | 96.09 | 100.0% |
| SROIE | ok | 627.3 MB | 1264 | 100.0% / 92.7% / 100.0% | 231.85 | 100.0% |
| RVL-CDIP | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| READoc | ok | 39.9 MB | 959 | 100.0% / n/a / n/a | 96.86 | 100.0% |
| OmniDocBench | ok | 40.3 MB | 1 | 100.0% / 100.0% / 100.0% | 1030.96 | 88.5% |
| olmOCR-Bench | ok | 340.5 MB | 1403 | 100.0% / 100.0% / 100.0% | 20.97 | 20.3% |
| ckorzen benchmark | ok | 67.1 MB | 192 | 100.0% / 15.4% / 100.0% | 100.37 | 88.4% |
| S2ORC | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| PMC OA | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
| arXiv source/PDF | missing | 0.0 MB | 0 | n/a / n/a / n/a | n/a | n/a |
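
For readers unfamiliar with two of the accuracy metrics named above, here is a generic sketch of how token-F1 and a box IoU are commonly computed. It is not taken from scripts/run-benchmarks.py: whitespace tokenization, the empty-input convention, and the (x0, y0, x1, y1) box layout are all assumptions, and the benchmark script may define these metrics differently.

```python
from collections import Counter


def token_f1(predicted, reference):
    """F1 over whitespace tokens, counting repeated tokens (multiset overlap)."""
    pred = predicted.split()
    ref = reference.split()
    if not pred or not ref:
        # Convention chosen here: two empty texts match, one empty text does not.
        return 1.0 if pred == ref else 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes have an empty intersection.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Both functions return a value between 0 and 1, which corresponds to the percentages in the GT accuracy column after multiplying by 100.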

License

Dongler is licensed under the MIT License. See LICENSE and NOTICE.
