Created by Daniel Fat. Python bindings for a Rust-native document extraction engine focused on Markdown and LaTeX output.

Dongler

Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It is built for the workflow developers actually need: load a document path, extract structure once, then render clean Markdown or LaTeX from the same document object.

Status

Dongler 0.1.0 ships the stable package layout and public API, together with working extraction for plain-text (.txt, .text) files. PDF is the primary product target, and the public API is designed around that workflow, but PDF extraction is not implemented yet.

Format                              Detection   Extraction
.txt, .text                         yes         supported
.pdf                                yes         planned
Word, Excel, HTML, images, email    yes         planned
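
The table separates two questions: whether a format is recognized, and whether it can be extracted yet. The Python sketch below mirrors that distinction with a lookup table. It is an illustration only, not Dongler's detection logic: `FormatStatus`, `SUPPORT`, and `status_for` are hypothetical names, and only the .txt, .text, and .pdf rows are reproduced.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class FormatStatus(NamedTuple):
    detected: bool      # the format is recognized from the path
    extraction: str     # "supported" or "planned", matching the status table


# Hypothetical lookup keyed by lowercase file extension (illustration only).
SUPPORT = {
    ".txt": FormatStatus(True, "supported"),
    ".text": FormatStatus(True, "supported"),
    ".pdf": FormatStatus(True, "planned"),
}


def status_for(path: str) -> Optional[FormatStatus]:
    """Return the table row for a path, or None for extensions not listed here."""
    return SUPPORT.get(Path(path).suffix.lower())


print(status_for("notes.txt"))
print(status_for("invoice.PDF"))
```

Keeping detection and extraction as separate fields lets a caller tell "unknown file type" apart from "known type, engine not ready yet".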

Current outputs:

  • Markdown
  • LaTeX
  • JSON
  • Dongler's typed document IR

Install

cargo install dongler
pip install dongler
npm install dongler

For Rust library usage, depend on dongler-core. The public dongler crate is the CLI package.

Planned PDF Workflow

This is the API Dongler is building toward. Until the PDF engine lands, these same calls detect the PDF format and fail with a clear planned-format error.

Python:

import dongler

doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()

TypeScript:

import { load } from "dongler";

const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();

Rust:

use dongler_core::load_path;

fn main() -> dongler_core::Result<()> {
    let doc = load_path("invoice.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}
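
Because PDF paths currently fail with a planned-format error, calling code may want a fallback path. The Python sketch below shows one hedged way to do that. The README does not name the exception class the bindings raise, so the helper catches `Exception` broadly. `try_markdown`, `_FakeDoc`, and `fake_load` are hypothetical names; the loader is passed in as a parameter so the example runs without the package installed, and in real use you would pass `dongler.load`.

```python
from typing import Any, Callable, Optional, Tuple


def try_markdown(
    path: str, load: Callable[[str], Any]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (markdown, None) on success or (None, error message) on failure.

    `load` is expected to behave like dongler.load: return an object with
    to_markdown(), or raise for unsupported or planned formats.
    """
    try:
        return load(path).to_markdown(), None
    except Exception as exc:  # exact exception type is an assumption; narrow it once known
        return None, str(exc)


# Stand-in document and loader, used only to make the sketch runnable.
class _FakeDoc:
    def __init__(self, text: str) -> None:
        self._text = text

    def to_markdown(self) -> str:
        return self._text


def fake_load(path: str) -> _FakeDoc:
    if path.endswith(".pdf"):
        raise RuntimeError("pdf extraction is planned but not implemented")
    return _FakeDoc("hello")


print(try_markdown("notes.txt", fake_load))
print(try_markdown("invoice.pdf", fake_load))
```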

Works Today

The same object API works today for text files.

Python:

import dongler

doc = dongler.load("notes.txt")
print(doc.metadata["block_count"])
print(doc.to_markdown())
print(doc.to_latex())

TypeScript:

import { load } from "dongler";

const doc = load("notes.txt");
console.log(doc.metadata.block_count);
console.log(doc.toMarkdown());
console.log(doc.toLatex());

Rust:

use dongler_core::load_path;

fn main() -> dongler_core::Result<()> {
    let doc = load_path("notes.txt")?;
    println!("blocks: {}", doc.metadata.block_count);
    println!("{}", doc.to_latex()?);
    Ok(())
}

Batch Processing

Batch processing returns one result per file. One bad or unsupported document does not stop the batch.

Python:

import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")

TypeScript:

import { loadMany } from "dongler";

for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}

Rust:

use dongler_core::load_many;

fn main() {
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}
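
Since every input yields a result, a caller can partition the batch after the fact. The Python sketch below assumes each result is a mapping with the `ok`, `path`, `document`, and `error` keys used in the Python example above (that `path` is also present on successful results is an assumption). `partition_results` is a hypothetical helper, and the hand-written sample stands in for `dongler.load_many(...)` output so the snippet runs on its own.

```python
from typing import Any, Dict, Iterable, List, Tuple


def partition_results(
    results: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], Dict[str, str]]:
    """Split batch results into successes and a {path: error} map of failures."""
    succeeded: List[Dict[str, Any]] = []
    failed: Dict[str, str] = {}
    for result in results:
        if result["ok"]:
            succeeded.append(result)
        else:
            failed[result["path"]] = str(result["error"])
    return succeeded, failed


# Hand-written sample shaped like the results in the batch example (not real output).
sample = [
    {"ok": True, "path": "notes.txt", "document": object(), "error": None},
    {"ok": False, "path": "invoice.pdf", "document": None, "error": "planned format"},
]

succeeded, failed = partition_results(sample)
print(len(succeeded), "succeeded;", len(failed), "failed")
for path, error in failed.items():
    print(f"{path}: {error}")
```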

CLI

dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json

PDF extraction through the CLI will use the same engine as the Rust, Python, and TypeScript packages once it is implemented.

API Surface

The high-level object API:

  • Rust: load_path(path), load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
  • Python: dongler.load(path), dongler.load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
  • TypeScript: load(path), loadMany(paths), doc.toMarkdown(), doc.toLatex(), doc.toJson()

Compatibility functions remain available:

  • parse_text
  • to_markdown
  • to_latex
  • to_json
  • detect_format

Architecture

Rust is the source of truth. Python and TypeScript are thin native bindings over the Rust core.

flowchart LR
    Path["Document path"] --> Format["Format detection"]
    Format --> Loader["Source loader"]
    Loader --> Engine["Extraction engine"]
    Engine --> IR["Document IR"]
    IR --> Markdown["Markdown"]
    IR --> Latex["LaTeX"]
    IR --> Json["JSON"]
    IR --> Python["Python object API"]
    IR --> TypeScript["TypeScript object API"]
    IR --> CLI["CLI"]

The current text engine proves the pipeline. The PDF engine will plug into the same loader, engine, IR, and renderer boundaries.
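
The boundary described here, one document IR feeding several renderers, can be illustrated with a toy pipeline. This Python sketch is not Dongler's implementation: the `Block` and `Document` classes, the blank-line paragraph splitting in `extract_text`, and the small LaTeX escaping table are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Minimal LaTeX escapes for this sketch; a real renderer would cover more cases.
_LATEX_ESCAPES = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_"}


@dataclass
class Block:
    text: str  # one paragraph of extracted text


@dataclass
class Document:
    blocks: List[Block]

    def to_markdown(self) -> str:
        # Paragraphs are separated by a blank line.
        return "\n\n".join(block.text for block in self.blocks)

    def to_latex(self) -> str:
        def escape(text: str) -> str:
            return "".join(_LATEX_ESCAPES.get(ch, ch) for ch in text)

        return "\n\n".join(escape(block.text) for block in self.blocks)


def extract_text(source: str) -> Document:
    """Toy 'engine': treat blank-line-separated chunks as paragraph blocks."""
    chunks = [chunk.strip() for chunk in source.split("\n\n")]
    return Document([Block(chunk) for chunk in chunks if chunk])


doc = extract_text("Total: 100% of $5\n\nSecond paragraph")
print(doc.to_markdown())
print(doc.to_latex())
```

The point of the shape is that a new engine only has to produce a `Document`; the renderers do not change.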

Documentation

The Docusaurus documentation site lives in website/ and builds from docs/.

cd website
npm install
npm run start
npm run build

Development

make test
make build

Focused commands:

make test-rust
make test-python
make test-js
make build-docs

License

Dongler is licensed under the MIT License. See LICENSE and NOTICE.

Download files

Source Distribution

dongler-0.1.0.tar.gz (16.7 kB)

Uploaded Source

Built Distribution

dongler-0.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (342.2 kB)

Uploaded CPython 3.9+, manylinux: glibc 2.17+ x86-64

File details

Details for the file dongler-0.1.0.tar.gz.

File metadata

  • Download URL: dongler-0.1.0.tar.gz
  • Size: 16.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dongler-0.1.0.tar.gz
Algorithm     Hash digest
SHA256        00e34eef5b761ca9d9adb1b2aba8e4818adf9e6ae0efa31ee759104437c6ebfa
MD5           2bd47f2ad531f0090273e854a5138e9f
BLAKE2b-256   930a7e3482cf10c23e0361ee1b44a940d1ef2d0085512aa7a49bf637b7c6ae00

Provenance

The following attestation bundles were made for dongler-0.1.0.tar.gz:

Publisher: workflow.yml on cristianexer/dongler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file dongler-0.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for dongler-0.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm     Hash digest
SHA256        67bf23e27afeb8c75a3d1e467b24414f23710afae786c2d8295da3c41a4eae29
MD5           056d835958f6f61981b9fe2fb70a8bbf
BLAKE2b-256   adb163cc8990ad63dcde640bb8997adcddf948633b0d7abb9ca553eea32a3b8e

Provenance

The following attestation bundles were made for dongler-0.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: workflow.yml on cristianexer/dongler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
