Created by Daniel Fat. Python bindings for a Rust-native document extraction engine focused on Markdown and LaTeX output.
Dongler
Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It is built for the workflow developers actually need: load a document path, extract structure once, then render clean Markdown or LaTeX from the same document object.
Status
Dongler 0.1.0 ships the stable package shape and a real .txt extraction
path. PDF is the primary product target and the public API is designed for that
workflow, but PDF extraction is not implemented yet.
| Format | Detection | Extraction |
|---|---|---|
| .txt, .text | yes | supported |
| .pdf | yes | planned |
| Word, Excel, HTML, images, email | yes | planned |
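The difference between detection and extraction in the table can be sketched as a small extension lookup. This is a hypothetical illustration, not Dongler's implementation: `FORMAT_STATUS` and `format_status` are assumed names, and only a few of the listed formats are included.

```python
from pathlib import Path

# Hypothetical status table mirroring the Status section; not Dongler's internals.
FORMAT_STATUS = {
    ".txt": "supported",
    ".text": "supported",
    ".pdf": "planned",
    ".html": "planned",
}


def format_status(path: str) -> str:
    """Return 'supported', 'planned', or 'unknown' for a path's extension."""
    suffix = Path(path).suffix.lower()
    return FORMAT_STATUS.get(suffix, "unknown")


if __name__ == "__main__":
    for name in ["notes.txt", "invoice.pdf", "archive.xyz"]:
        print(f"{name}: {format_status(name)}")
```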
Current outputs:
- Markdown
- LaTeX
- JSON
- Dongler's typed document IR
Install
cargo install dongler
pip install dongler
npm install dongler
For Rust library usage, depend on dongler-core. The public dongler crate is
the CLI package.
Planned PDF Workflow
This is the API Dongler is building toward. Today, the same calls detect PDFs and return a clear planned-format error until the PDF engine lands.
Python:
import dongler
doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
TypeScript:
import { load } from "dongler";
const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();
Rust:
use dongler_core::load_path;
fn main() -> dongler_core::Result<()> {
    let doc = load_path("invoice.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}
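Because PDF paths currently produce a planned-format error, calling code may want to treat that as a recoverable condition. The sketch below assumes the error surfaces in Python as an exception; the exact exception type is not documented here, so it catches `Exception`. `safe_load` and `fake_load` are hypothetical helpers, and in real use `dongler.load` would be passed as the loader.

```python
from typing import Any, Callable, Optional, Tuple


def safe_load(loader: Callable[[str], Any], path: str) -> Tuple[Optional[Any], Optional[str]]:
    """Call loader(path) and return (document, None) or (None, error message)."""
    try:
        return loader(path), None
    except Exception as exc:  # assumption: the planned-format error is an exception
        return None, str(exc)


def fake_load(path: str) -> str:
    """Stand-in for dongler.load so the sketch runs without the package."""
    if path.endswith(".pdf"):
        raise ValueError("PDF extraction is planned but not implemented")
    return f"document for {path}"


if __name__ == "__main__":
    for path in ["notes.txt", "invoice.pdf"]:
        doc, error = safe_load(fake_load, path)
        print(path, "->", doc if error is None else f"skipped: {error}")
```

Passing the loader as an argument keeps the guard independent of the package, so the same wrapper keeps working once PDF extraction is available.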
Works Today
The same object API works today for text files.
Python:
import dongler
doc = dongler.load("notes.txt")
print(doc.metadata["block_count"])
print(doc.to_markdown())
print(doc.to_latex())
TypeScript:
import { load } from "dongler";
const doc = load("notes.txt");
console.log(doc.metadata.block_count);
console.log(doc.toMarkdown());
console.log(doc.toLatex());
Rust:
use dongler_core::load_path;
fn main() -> dongler_core::Result<()> {
    let doc = load_path("notes.txt")?;
    println!("blocks: {}", doc.metadata.block_count);
    println!("{}", doc.to_latex()?);
    Ok(())
}
Batch Processing
Batch processing returns one result per file. One bad or unsupported document does not stop the batch.
Python:
import dongler
for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")
TypeScript:
import { loadMany } from "dongler";
for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
    if (result.ok) {
        console.log(result.document!.toMarkdown());
    } else {
        console.error(`${result.path}: ${result.error}`);
    }
}
Rust:
use dongler_core::load_many;
fn main() {
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}
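The per-file result shape makes it straightforward to gather successes and failures separately. A minimal Python sketch, assuming each result is a dict with the `ok`, `path`, `document`, and `error` keys used in the Python example above; `split_results` is a hypothetical helper, and the sample data is hand-written rather than produced by Dongler.

```python
from typing import Any, Dict, Iterable, List, Tuple


def split_results(
    results: Iterable[Dict[str, Any]],
) -> Tuple[List[Any], List[Tuple[str, str]]]:
    """Separate batch results into documents and (path, error) pairs."""
    documents: List[Any] = []
    failures: List[Tuple[str, str]] = []
    for result in results:
        if result["ok"]:
            documents.append(result["document"])
        else:
            failures.append((result["path"], str(result["error"])))
    return documents, failures


if __name__ == "__main__":
    # Hand-written stand-ins shaped like the batch results; not real output.
    sample = [
        {"ok": True, "path": "notes.txt", "document": "<doc>", "error": None},
        {"ok": False, "path": "invoice.pdf", "document": None, "error": "planned format"},
    ]
    docs, failed = split_results(sample)
    print(len(docs), "loaded;", len(failed), "failed")
```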
CLI
dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json
PDF extraction through the CLI will use the same engine as the Rust, Python, and TypeScript packages once it is implemented.
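For scripts that prefer to shell out to the CLI, the commands above can be wrapped from Python. This is a sketch under assumptions: `extract_command` and `run_extract` are hypothetical helpers, and it assumes `dongler extract` writes its result to standard output and exits non-zero on failure, which is not confirmed here.

```python
import subprocess
from typing import List

# Output formats shown in the CLI examples above.
FORMATS = ("markdown", "latex", "json")


def extract_command(path: str, fmt: str = "markdown") -> List[str]:
    """Build the argument list for `dongler extract <path> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return ["dongler", "extract", path, "--format", fmt]


def run_extract(path: str, fmt: str = "markdown") -> str:
    """Run the CLI and return stdout (assumes output goes to stdout)."""
    completed = subprocess.run(
        extract_command(path, fmt), capture_output=True, text=True, check=True
    )
    return completed.stdout
```

Building the argument list separately from running it keeps the command easy to inspect or log before execution.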
API Surface
The high-level object API:
- Rust: load_path(path), load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
- Python: dongler.load(path), dongler.load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
- TypeScript: load(path), loadMany(paths), doc.toMarkdown(), doc.toLatex(), doc.toJson()
Compatibility functions remain available:
- parse_text
- to_markdown
- to_latex
- to_json
- detect_format
Architecture
Rust is the source of truth. Python and TypeScript are thin native bindings over the Rust core.
flowchart LR
    Path["Document path"] --> Format["Format detection"]
    Format --> Loader["Source loader"]
    Loader --> Engine["Extraction engine"]
    Engine --> IR["Document IR"]
    IR --> Markdown["Markdown"]
    IR --> Latex["LaTeX"]
    IR --> Json["JSON"]
    IR --> Python["Python object API"]
    IR --> TypeScript["TypeScript object API"]
    IR --> CLI["CLI"]
The current text engine proves the pipeline. The PDF engine will plug into the same loader, engine, IR, and renderer boundaries.
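The boundaries in the diagram can be illustrated with a toy pipeline in Python. Everything here is a hypothetical stand-in rather than Dongler's code: the `Document` class, `guess_format`, and `extract_text` are assumed names, the IR is reduced to a flat list of text blocks, and the LaTeX renderer escapes only a handful of characters.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class Document:
    """Hypothetical stand-in for the document IR: a flat list of text blocks."""
    blocks: List[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        return "\n\n".join(self.blocks)

    def to_latex(self) -> str:
        # Only a few special characters are escaped to keep the sketch short.
        def escape(text: str) -> str:
            for ch in "&%$#_":
                text = text.replace(ch, "\\" + ch)
            return text
        return "\n\n".join(escape(block) for block in self.blocks)

    def to_json(self) -> str:
        return json.dumps({"blocks": self.blocks})


def guess_format(path: str) -> str:
    """Format detection by extension (assumed; real detection may differ)."""
    return Path(path).suffix.lower().lstrip(".") or "unknown"


def extract_text(source: str) -> Document:
    """Toy text engine: paragraphs separated by blank lines become blocks."""
    blocks = [part.strip() for part in source.split("\n\n") if part.strip()]
    return Document(blocks)


if __name__ == "__main__":
    doc = extract_text("Total: 50% off\n\nSee item_1")
    print(guess_format("notes.txt"))
    print(doc.to_markdown())
    print(doc.to_latex())
    print(doc.to_json())
```

Because every renderer reads from the same `Document` value, a new engine only has to produce that value to gain all three outputs, which is the property the diagram describes.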
Documentation
The Docusaurus documentation site lives in website/ and builds from docs/.
cd website
npm install
npm run start
npm run build
Development
make test
make build
Focused commands:
make test-rust
make test-python
make test-js
make build-docs
License
Dongler is licensed under the MIT License. See LICENSE and NOTICE.