Created by Daniel Fat. Python bindings for a Rust-native document extraction engine focused on Markdown and LaTeX output.

Dongler

Dongler is a Rust-native document extraction engine with Python and TypeScript bindings. It is built around a single workflow: load a document path, extract its structure once, then render Markdown, LaTeX, or JSON from the same document object.

Status

Dongler ships .txt extraction and a native Rust PDF extraction path with page geometry, text source anchors, basic table reconstruction, and image object positions.

Format                              Detection  Extraction
.txt, .text                         yes        supported
.pdf                                yes        supported
Word, Excel, HTML, images, email    yes        planned
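
The table above can be mirrored client-side to skip files before handing them to Dongler. This is a minimal sketch, not part of the Dongler API: `EXTRACTION_STATUS` and `extraction_status` are hypothetical names, only the extensions listed in the table are included, and lower-casing the suffix is an assumption about how you want to match file names.

```python
from pathlib import Path

# Extraction status per file extension, copied from the table above.
# Formats listed as "planned" have no extensions given, so they are omitted.
EXTRACTION_STATUS = {
    ".txt": "supported",
    ".text": "supported",
    ".pdf": "supported",
}


def extraction_status(path):
    """Return "supported" for extensions in the table, otherwise None.

    Hypothetical helper: the match is on the lower-cased file suffix only
    and does not inspect file contents.
    """
    suffix = Path(path).suffix.lower()
    return EXTRACTION_STATUS.get(suffix)
```

A list comprehension such as `[p for p in paths if extraction_status(p)]` would then keep only the paths whose extensions the table marks as supported.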

Current outputs:

  • Markdown
  • LaTeX
  • JSON
  • Dongler's typed document IR

Install

cargo install dongler
pip install dongler
npm install @cristianexer/dongler

For Rust library usage, depend on dongler-core. The public dongler crate is the CLI package.

API

Document extraction is a two-step process: load a document path, then render the extracted structure in the format you need. The same document object can be rendered to Markdown, LaTeX, or JSON without re-extracting the document.

Python:

import dongler

doc = dongler.load("invoice.pdf")
markdown = doc.to_markdown()
latex = doc.to_latex()
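
Because rendering is separate from extraction, one loaded document can feed several output files. The sketch below is an illustration under assumptions: `render_all` is a hypothetical helper name, and it assumes each render method returns a string, which the text above does not state. It accepts any object that exposes `to_markdown()`, `to_latex()`, and `to_json()`.

```python
from pathlib import Path


def render_all(doc, out_dir, stem):
    """Write every rendering of an already-loaded document into out_dir.

    `doc` is any object with to_markdown(), to_latex() and to_json(),
    such as the object returned by dongler.load().
    Assumption: each of those methods returns a str.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Map each output suffix to the render method that produces it.
    renderers = {
        ".md": doc.to_markdown,
        ".tex": doc.to_latex,
        ".json": doc.to_json,
    }

    written = []
    for suffix, render in renderers.items():
        target = out / f"{stem}{suffix}"
        target.write_text(render(), encoding="utf-8")
        written.append(target)
    return written
```

With Dongler installed, `render_all(dongler.load("invoice.pdf"), "out", "invoice")` would extract once and write three files without re-extracting the document.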

TypeScript:

import { load } from "@cristianexer/dongler";

const doc = load("invoice.pdf");
const markdown = doc.toMarkdown();
const latex = doc.toLatex();

Rust:

use dongler_core::load_path;

fn main() -> dongler_core::Result<()> {
    let doc = load_path("invoice.pdf")?;
    println!("{}", doc.to_markdown()?);
    Ok(())
}

Batch Processing

Batch processing returns one result per file. One bad or unsupported document does not stop the batch.

Python:

import dongler

for result in dongler.load_many(["notes.txt", "invoice.pdf"]):
    if result["ok"]:
        print(result["document"].to_markdown())
    else:
        print(f"{result['path']}: {result['error']}")
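
Since every file yields its own result, a batch can be split into rendered output and a failure report. The following is a minimal sketch with a hypothetical helper name, `split_results`. It relies on the `ok`, `document`, `path`, and `error` keys shown above, and additionally assumes that successful results also carry `path`, which the example above does not confirm.

```python
def split_results(results):
    """Partition batch results into rendered successes and failures.

    Each result is expected to be a mapping with the keys used in the
    batch example: "ok", "path", "document" and "error".
    Assumption: successful results include "path" as well.
    """
    rendered = {}  # path -> Markdown text
    failures = {}  # path -> error message
    for result in results:
        if result["ok"]:
            rendered[result["path"]] = result["document"].to_markdown()
        else:
            failures[result["path"]] = str(result["error"])
    return rendered, failures
```

Calling `split_results(dongler.load_many(paths))` would return both dictionaries in one pass, so failures can be logged or retried after the successful documents have been written.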

TypeScript:

import { loadMany } from "@cristianexer/dongler";

for (const result of loadMany(["notes.txt", "invoice.pdf"])) {
  if (result.ok) {
    console.log(result.document!.toMarkdown());
  } else {
    console.error(`${result.path}: ${result.error}`);
  }
}

Rust:

use dongler_core::load_many;

fn main() {
    // Each entry is an independent result; a failure does not abort the loop.
    for result in load_many(["notes.txt", "invoice.pdf"]) {
        if result.ok {
            println!("{}", result.document.unwrap().to_markdown().unwrap());
        } else {
            eprintln!("{}: {}", result.path, result.error.unwrap());
        }
    }
}

CLI

dongler --version
dongler inspect notes.txt
dongler inspect invoice.pdf
dongler extract notes.txt --format markdown
dongler extract notes.txt --format latex
dongler extract notes.txt --format json

PDF extraction through the CLI uses the same Rust-native engine as the Rust, Python, and TypeScript packages.
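
Where the Python package is not installed but the CLI is on `PATH`, the `extract` command can be driven from a script. This is a sketch under assumptions: `build_extract_command` and `extract_via_cli` are hypothetical helper names, only the `extract` subcommand and `--format` values listed above are used, and the rendered output is assumed to be written to standard output.

```python
import subprocess

# Output formats shown in the CLI examples above.
FORMATS = ("markdown", "latex", "json")


def build_extract_command(path, fmt="markdown"):
    """Build the argument list for `dongler extract <path> --format <fmt>`."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["dongler", "extract", str(path), "--format", fmt]


def extract_via_cli(path, fmt="markdown"):
    """Run the CLI and return what it printed.

    Assumption: the CLI writes the rendered document to stdout.
    Raises subprocess.CalledProcessError on a non-zero exit status.
    """
    completed = subprocess.run(
        build_extract_command(path, fmt),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Keeping the command builder separate from the call that runs it makes the argument list easy to inspect or log before anything is executed.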

API Surface

The high-level object API:

  • Rust: load_path(path), load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
  • Python: dongler.load(path), dongler.load_many(paths), doc.to_markdown(), doc.to_latex(), doc.to_json()
  • TypeScript: load(path), loadMany(paths), doc.toMarkdown(), doc.toLatex(), doc.toJson()

Compatibility functions remain available:

  • parse_text
  • to_markdown
  • to_latex
  • to_json
  • detect_format

Documentation

The Docusaurus documentation site lives in website/ and builds from docs/.

cd website
npm install
npm run start   # local preview server
npm run build   # static production build

Development

make test
make build

Focused commands:

make test-rust
make test-python
make test-js
make build-docs

License

Dongler is licensed under the MIT License. See LICENSE and NOTICE.

