Fast Rust-native PDF and document extraction for Python, with Markdown, LaTeX, and JSON output.

Maintainers

cristianexer

Project description

Dongler

Dongler is a fast, Rust-native PDF extraction package for developers who need clean Markdown, LaTeX, or structured JSON without wiring together a stack of document tools.

It is designed around a path-first workflow: load a file, inspect the document object when needed, then render the output format your pipeline needs. The same core engine powers the CLI, the Python package, the TypeScript package, and the Rust API.

Install

cargo install dongler
pip install dongler
npm install @cristianexer/dongler

For Rust library usage, depend on dongler-core. The public dongler crate is the CLI package.

Parse a PDF

Python:

import dongler

doc = dongler.load("report.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
data = doc.to_dict()

TypeScript:

import { load } from "@cristianexer/dongler";

const doc = load("report.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
const data = doc.toObject();

Rust:

use dongler_core::load_path;

fn main() -> dongler_core::Result<()> {
    let doc = load_path("report.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}

What You Get

- Markdown, LaTeX, and JSON renderers from the same document object.
- Page, block, table, image, warning, and metadata fields for downstream code.
- Rust-native PDF extraction with no hosted service, API key, LLM, or OCR dependency for digitally born PDFs.
- Python, TypeScript, Rust, and CLI entrypoints over the same core.
- Batch APIs that return one result per file, so one unsupported document does not stop a job.
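
The exact shape of the structured output is not spelled out here, so a schema-agnostic helper can be a useful first step when exploring it. The sketch below summarizes the top-level keys of any dictionary, such as the one returned by `doc.to_dict()`. The name `summarize_fields` is hypothetical and not part of Dongler, and the sample keys are stand-ins for illustration only.

```python
from typing import Any


def summarize_fields(data: dict[str, Any]) -> dict[str, str]:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, tuple)):
            summary[key] = f"{type(value).__name__} of {len(value)} item(s)"
        elif isinstance(value, dict):
            summary[key] = f"dict with {len(value)} key(s)"
        else:
            summary[key] = type(value).__name__
    return summary


# Stand-in data: the real key names come from doc.to_dict() and may differ.
sample = {"pages": [{}, {}], "metadata": {"title": "Report"}, "warnings": []}
for key, description in summarize_fields(sample).items():
    print(f"{key}: {description}")
```

With Dongler installed, the same helper could presumably be called as `summarize_fields(dongler.load("report.pdf").to_dict())` to see which fields a given document exposes.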

Why Dongler

Use Dongler when the job starts with a document path and the next step needs useful text quickly:

- Convert PDFs to Markdown for indexing, review, or RAG ingestion.
- Keep page/block/table/image metadata available through JSON.
- Run locally in scripts, services, queues, notebooks, and shell workflows.
- Use the same extraction model across Python, Node.js, Rust, and the CLI.
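
For the indexing and RAG use case above, one common next step is to split the rendered Markdown into heading-delimited sections before embedding. The following is a minimal sketch that assumes the Markdown uses ATX headings (`#`, `##`, ...); whether Dongler emits that heading style is an assumption, and `split_markdown_sections` is a hypothetical helper, not part of the package.

```python
import re


def split_markdown_sections(markdown: str) -> list[dict[str, str]]:
    """Split Markdown into sections, one per ATX heading line."""
    sections = []
    title, body = "", []

    def flush() -> None:
        # Emit the section collected so far, skipping fully empty ones.
        text = "\n".join(body).strip()
        if title or text:
            sections.append({"title": title, "text": text})

    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()
            title, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return sections


example = "intro\n# Summary\nalpha\n## Details\nbeta\n"
for section in split_markdown_sections(example):
    print(repr(section["title"]), "->", repr(section["text"]))
```

Text before the first heading is kept as a section with an empty title. The sketch does not track fenced code blocks, so a `#` comment line inside one would be treated as a heading.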

Supported Inputs

Dongler focuses on digitally born PDFs. It also natively extracts DOCX, XLSX, PPTX, ODT/ODS/ODP, HTML/XML, EML, JSON/JSONL, CSV/TSV, plain text/Markdown/TeX, and image metadata (including TIFF). Gzip-compressed text/JSON/XML/CSV corpus files, bare gzip source files, and zip/tar/tar.gz source packages are supported as well.

Legacy binary Office and Outlook containers are detected but not yet extracted; they return an explicit planned-format error until their engines land.

More Examples

Plain text, Markdown, office files, and data files use the same API:

import dongler

doc = dongler.load("invoice.docx")
markdown = doc.to_markdown()
latex = doc.to_latex()

Batch Processing

Batch processing returns one result per file. One bad or unsupported document does not stop the batch.

Python:

import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")

TypeScript:

import { loadMany } from "@cristianexer/dongler";

for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}

Rust:

use dongler_core::load_many;

fn main() {
    // Each result carries its own status, so one failure does not abort the loop.
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}
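
In a longer job it is often handy to separate successes from failures once and report the failures at the end. The sketch below relies only on the `ok`, `path`, `error`, and `document` keys shown in the Python batch example; `partition_results` and `failure_report` are hypothetical helper names, and the demo uses stand-in dictionaries rather than real `dongler.load_many` output.

```python
from typing import Any, Iterable


def partition_results(results: Iterable[dict[str, Any]]):
    """Split batch results into (successes, failures) by their "ok" flag."""
    successes, failures = [], []
    for result in results:
        (successes if result["ok"] else failures).append(result)
    return successes, failures


def failure_report(failures: list[dict[str, Any]]) -> str:
    """Format one 'path: error' line per failed file."""
    return "\n".join(f"{r['path']}: {r['error']}" for r in failures)


# Stand-in results shaped like the entries in the batch example.
demo = [
    {"ok": True, "path": "notes.txt", "document": object(), "error": None},
    {"ok": False, "path": "old.doc", "document": None, "error": "planned format"},
]
done, failed = partition_results(demo)
print(f"{len(done)} succeeded, {len(failed)} failed")
print(failure_report(failed))
```

With Dongler installed, `partition_results(dongler.load_many(paths))` should work the same way, assuming the results keep the shape shown above.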

CLI

dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract report.docx --format markdown
dongler extract book.xlsx --format json
dongler extract deck.pptx --format markdown
dongler extract notes.odt --format markdown
dongler extract annotations.json --format markdown
dongler extract boxes.csv --format json
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json

PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
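
When the Python package is not available but the CLI is on `PATH`, the commands above can be driven from a script. This is a rough sketch: it builds only the `dongler extract <path> --format <fmt>` invocation shown in the examples, and it assumes the CLI writes the rendered output to stdout and exits non-zero on failure, neither of which is stated here. The helper names are hypothetical.

```python
import subprocess

# Formats that appear in the CLI examples above.
FORMATS = ("markdown", "latex", "json")


def build_extract_command(path: str, fmt: str = "markdown") -> list[str]:
    """Return the argument list for `dongler extract <path> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; expected one of {FORMATS}")
    return ["dongler", "extract", path, "--format", fmt]


def run_extract(path: str, fmt: str = "markdown") -> str:
    """Run the CLI and return its stdout.

    Assumption: output goes to stdout and failures produce a non-zero exit
    code, in which case subprocess raises CalledProcessError.
    """
    completed = subprocess.run(
        build_extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


print(build_extract_command("report.docx", "markdown"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.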

Developer Docs

Benchmarks

Generated by scripts/run-benchmarks.py on 2026-05-28 19:56:50 BST. Local cache: 1894.9 MB. Every discovered file in each dataset is evaluated.

The Coverage column lists three rates: parse / bbox / anchors. GT (ground-truth) accuracy is token-F1, olmOCR unit-check pass rate, or full-image IoU, depending on the dataset; n/a means no local target signal. Detailed task names, discovery counts, native scores, and notes are recorded in eval/out/benchmarks/latest.json.

Dataset	Status	Local data	Docs eval	Coverage	Pages/sec	GT accuracy
DocLayNet	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
PubLayNet	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
DocBank	ok	735.6 MB	200	100.0% / 100.0% / 100.0%	81.94	89.5%
PubTabNet	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
PubTables-1M	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
TableBank	ok	1.6 MB	10	100.0% / 100.0% / 100.0%	193.45	100.0%
FUNSD	ok	42.6 MB	200	100.0% / 48.9% / 100.0%	96.09	100.0%
SROIE	ok	627.3 MB	1264	100.0% / 92.7% / 100.0%	231.85	100.0%
RVL-CDIP	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
READoc	ok	39.9 MB	959	100.0% / n/a / n/a	96.86	100.0%
OmniDocBench	ok	40.3 MB	1	100.0% / 100.0% / 100.0%	1030.96	88.5%
olmOCR-Bench	ok	340.5 MB	1403	100.0% / 100.0% / 100.0%	20.97	20.3%
ckorzen benchmark	ok	67.1 MB	192	100.0% / 15.4% / 100.0%	100.37	88.4%
S2ORC	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
PMC OA	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
arXiv source/PDF	missing	0.0 MB	0	n/a / n/a / n/a	n/a	n/a
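
For readers unfamiliar with the token-F1 metric named above, the following sketch shows the usual bag-of-tokens definition: the harmonic mean of token precision and recall between extracted and reference text. Whitespace tokenization is an assumption made for illustration; the benchmark script may tokenize or normalize differently.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Bag-of-tokens F1 between two strings, using whitespace tokens."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens and not ref_tokens:
        return 1.0  # two empty texts are treated as a perfect match
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("total due 42 USD", "total due 42 USD"))
print(token_f1("total due", "total paid"))
```

Because the comparison ignores token order, this metric rewards recovering the right words even when reading order differs from the reference.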

License

Release history

Version	Released
0.3.17	Jun 14, 2026
0.3.16	Jun 13, 2026
0.3.15	Jun 13, 2026
0.3.14	Jun 13, 2026
0.3.13	Jun 12, 2026
0.3.12	Jun 11, 2026
0.3.11	Jun 11, 2026
0.3.10	Jun 10, 2026
0.3.9	Jun 10, 2026
0.3.8	Jun 10, 2026
0.3.7	Jun 10, 2026
0.3.6	Jun 10, 2026
0.3.5	Jun 10, 2026
0.3.4 (this version)	May 29, 2026
0.3.3	May 29, 2026
0.3.2	May 29, 2026
0.3.1	May 29, 2026
0.3.0	May 28, 2026
0.2.0	May 27, 2026
0.1.0	May 27, 2026

Download files

Source Distribution

dongler-0.3.4.tar.gz (109.1 kB)

Uploaded May 29, 2026. Tags: Source

Built Distribution

dongler-0.3.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (996.4 kB)

Uploaded May 29, 2026. Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64

File details

Details for the file dongler-0.3.4.tar.gz.

File metadata

Download URL: dongler-0.3.4.tar.gz
Upload date: May 29, 2026
Size: 109.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dongler-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`7accfcf74ab2a5f707fcc4d1737e6f9abdc7be4aa1fe15cdcb8a3d8df998ec45`
MD5	`5ea84e74c3c255d8bb39546870de576d`
BLAKE2b-256	`4ea47f833ce56e9e5137df766ace25793cb0a091eace7b37543f234ddce78332`
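
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a small sketch; the helper names are illustrative, and the file is read in chunks so large archives do not need to fit in memory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 with an expected hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be `matches_digest("dongler-0.3.4.tar.gz", "7accfcf74ab2a5f707fcc4d1737e6f9abdc7be4aa1fe15cdcb8a3d8df998ec45")`, which should return `True` for an unmodified download.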

Provenance

The following attestation bundles were made for dongler-0.3.4.tar.gz:

Publisher: workflow.yml on cristianexer/dongler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dongler-0.3.4.tar.gz
- Subject digest: 7accfcf74ab2a5f707fcc4d1737e6f9abdc7be4aa1fe15cdcb8a3d8df998ec45
- Sigstore transparency entry: 1668265658
- Sigstore integration time: May 29, 2026
Source repository:
- Permalink: cristianexer/dongler@736c68a8461a794a58fe1b5925613927786ab570
- Branch / Tag: refs/heads/main
- Owner: https://github.com/cristianexer
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@736c68a8461a794a58fe1b5925613927786ab570
- Trigger Event: push

File details

Details for the file dongler-0.3.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: dongler-0.3.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: May 29, 2026
Size: 996.4 kB
Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dongler-0.3.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`bf30f85148d9235df07da640bc8da82c21b8823ca3dcd4de72e146d7884bbf64`
MD5	`33a83775609d963153d1d88f16213d9a`
BLAKE2b-256	`f2c73ba9fbd0ec15d2f98f1ee966dd3ed108aebdc019851f15732ab44616625a`

Provenance

The following attestation bundles were made for dongler-0.3.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: workflow.yml on cristianexer/dongler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dongler-0.3.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Subject digest: bf30f85148d9235df07da640bc8da82c21b8823ca3dcd4de72e146d7884bbf64
- Sigstore transparency entry: 1668265809
- Sigstore integration time: May 29, 2026
Source repository:
- Permalink: cristianexer/dongler@736c68a8461a794a58fe1b5925613927786ab570
- Branch / Tag: refs/heads/main
- Owner: https://github.com/cristianexer
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@736c68a8461a794a58fe1b5925613927786ab570
- Trigger Event: push
