Created by Daniel Fat. Python bindings for a Rust-native document extraction engine focused on Markdown and LaTeX output.
Dongler
Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It is built for the workflow developers actually need: load a document path, extract structure once, then render clean Markdown or LaTeX from the same document object.
Status
Dongler ships .txt extraction and a native Rust PDF extraction path with page
geometry, text source anchors, basic table reconstruction, and image object
positions.
| Format | Detection | Extraction |
|---|---|---|
| .txt, .text | yes | supported |
| .pdf | yes | supported |
| Word, Excel, HTML, images, email | yes | planned |
Current outputs:
- Markdown
- LaTeX
- JSON
- Dongler's typed document IR
Install
cargo install dongler
pip install dongler
npm install @cristianexer/dongler
For Rust library usage, depend on dongler-core. The public dongler crate is
the CLI package.
API
Document extraction is a two-step process: load a document path, then render the extracted structure in the format you need. The same document object can be rendered to Markdown, LaTeX, or JSON without re-extracting the document.
Python:
import dongler
doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
TypeScript:
import { load } from "@cristianexer/dongler";
const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
Rust:
use dongler_core::load_path;
fn main() -> dongler_core::Result<()> {
    let doc = load_path("invoice.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}
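Because extraction happens once, a single loaded document can feed several output files. The sketch below is illustrative only: render_outputs is a hypothetical helper, not part of Dongler, and it assumes that to_markdown() and to_latex() return plain strings, as the Python example above suggests.

```python
from pathlib import Path


def render_outputs(doc, source_path, out_dir):
    """Write Markdown and LaTeX renderings of one already-loaded document.

    Assumption: `doc` exposes to_markdown() and to_latex(), each returning str.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(source_path).stem  # "invoice.pdf" -> "invoice"

    # Both renderings come from the same document object; nothing is re-extracted.
    targets = {
        out_dir / f"{stem}.md": doc.to_markdown(),
        out_dir / f"{stem}.tex": doc.to_latex(),
    }
    for path, text in targets.items():
        path.write_text(text, encoding="utf-8")
    return sorted(targets)
```

With the package installed, render_outputs(dongler.load("invoice.pdf"), "invoice.pdf", "out") would be expected to write out/invoice.md and out/invoice.tex.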
Batch Processing
Batch processing returns one result per file. One bad or unsupported document does not stop the batch.
Python:
import dongler
for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")
TypeScript:
import { loadMany } from "@cristianexer/dongler";
for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}
Rust:
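use dongler_core::load_many;
fn main() {
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            // A successful result carries the extracted document.
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            // A failed result carries the error; the loop continues with the next file.
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}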
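Since each file gets its own result, a caller can collect successes and failures separately and report the failures at the end. A minimal Python sketch follows. split_results is a hypothetical helper, and it assumes the "path" key is present on successful results too; the example above only shows it on failures.

```python
def split_results(results):
    """Partition batch results into rendered Markdown and error messages.

    Assumption: every result is a mapping with "ok" and "path", plus
    "document" on success or "error" on failure.
    """
    rendered = {}
    failed = {}
    for result in results:
        if result["ok"]:
            rendered[result["path"]] = result["document"].to_markdown()
        else:
            failed[result["path"]] = result["error"]
    return rendered, failed
```

With the package installed this would be called as split_results(dongler.load_many(paths)); a non-empty failed mapping lists the files that need attention while the rest of the batch remains usable.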
CLI
dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json
PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
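The CLI can also be driven from a script when the bindings are not installed. The sketch below builds the extract invocation shown above using only the subcommand and --format values listed in this section. extract_command and run_extract are hypothetical helpers, and run_extract assumes the CLI writes the rendered document to standard output, which this README does not state.

```python
import subprocess

# Format names taken from the CLI examples above.
FORMATS = ("markdown", "latex", "json")


def extract_command(path, fmt="markdown"):
    """Build the argv list for `dongler extract <path> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["dongler", "extract", str(path), "--format", fmt]


def run_extract(path, fmt="markdown"):
    """Run the CLI and return its standard output.

    Assumption: the rendered document is printed to stdout.
    Raises subprocess.CalledProcessError on a non-zero exit status.
    """
    completed = subprocess.run(
        extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.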
API Surface
The high-level object API:
- Rust: load_path(path), load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
- Python: dongler.load(path), dongler.load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
- TypeScript: load(path), loadMany(paths), doc.toMarkdown(), doc.toLatex(), doc.toJson()
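When the output format is chosen at runtime, for example from a command-line flag, the three Python render methods listed above can be selected by name. This is a sketch under assumptions: render is a hypothetical helper, and the format names simply mirror the CLI's --format values.

```python
def render(doc, fmt):
    """Render an already-loaded document in the format named by `fmt`."""
    # Map format names to the Python method names from the API surface list.
    renderers = {
        "markdown": "to_markdown",
        "latex": "to_latex",
        "json": "to_json",
    }
    try:
        method_name = renderers[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return getattr(doc, method_name)()
```

With the package installed, render(dongler.load("invoice.pdf"), "latex") would be equivalent to calling doc.to_latex() directly.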
Compatibility functions remain available:
- parse_text
- to_markdown
- to_latex
- to_json
- detect_format
Documentation
The Docusaurus documentation site lives in website/ and builds from docs/.
cd website
npm install
npm run start
npm run build
Development
make test
make build
Focused commands:
make test-rust
make test-python
make test-js
make build-docs
License
Dongler is licensed under the MIT License. See LICENSE and NOTICE.