DOCX to Markdown converter written in Rust

undocx

Fast, accurate DOCX to Markdown converter written in Rust with Python bindings.

Conversion Demo

[Figure: a DOCX input document alongside its converted Markdown output.]

Headings, bold/italic/underline, tables, nested lists, footnotes, code blocks, and track changes are all converted automatically.

Install

pip install undocx          # Python
cargo install undocx        # CLI

# Rust library (Cargo.toml)
[dependencies]
undocx = "0.3"

Quick Start

CLI

undocx report.docx output.md              # convert to file
undocx report.docx                         # print to stdout
undocx report.docx -o out.md --images-dir ./img  # extract images

Python

import undocx

markdown = undocx.convert_docx("report.docx")   # from path

with open("report.docx", "rb") as f:             # from bytes; the file is closed on exit
    markdown = undocx.convert_docx(f.read())
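
For converting a whole folder, the path form of `convert_docx` shown above can be wrapped in a small loop. The following is a minimal sketch, not part of undocx: `convert_tree` and its `convert` parameter are hypothetical names introduced here, and the only undocx call it relies on is `undocx.convert_docx(path)` returning a Markdown string.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_tree(
    src_dir: str,
    dst_dir: str,
    convert: Optional[Callable[[str], str]] = None,
) -> List[Path]:
    """Convert every .docx under src_dir to a .md file under dst_dir.

    `convert` is a hypothetical hook (path -> Markdown string); it defaults
    to undocx.convert_docx and exists so the helper can be tried with a stub.
    """
    if convert is None:
        import undocx  # imported lazily so a stub can be used without undocx installed
        convert = undocx.convert_docx

    src, dst = Path(src_dir), Path(dst_dir)
    written = []
    for docx_path in sorted(src.rglob("*.docx")):
        if docx_path.name.startswith("~$"):
            continue  # skip the temporary owner files Word creates while a document is open
        # Mirror the source folder layout, swapping the extension for .md.
        out_path = dst / docx_path.relative_to(src).with_suffix(".md")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(convert(str(docx_path)), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `convert_tree("reports", "markdown")` would write one `.md` file per `.docx`, keeping the subfolder structure. Errors from individual files are not caught here, so one unreadable document stops the run; wrap the `convert` call in a `try` block if that is not what you want.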

Rust

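use undocx::{ConvertOptions, DocxToMarkdown, ImageHandling};

// Assumes undocx's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Save extracted images to ./images instead of the default inline embedding.
    let options = ConvertOptions {
        image_handling: ImageHandling::SaveToDir("./images".into()),
        ..Default::default()
    };
    let converter = DocxToMarkdown::new(options);
    let markdown = converter.convert("report.docx")?;
    println!("{markdown}");
    Ok(())
}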

Supported Features

| Category | Elements |
|---|---|
| Text | Bold, italic, underline, strikethrough, superscript/subscript |
| Structure | Heading 1-9, Title, Subtitle, alignment (center/right) |
| Lists | Ordered (decimal, letter, roman, Korean, circled), unordered, nested |
| Tables | Colspan, rowspan, nested tables, multi-paragraph cells |
| Links | External, internal bookmarks, TOC anchors |
| Images | Inline, floating, VML legacy; base64 embed, save to dir, or skip |
| Notes | Footnotes, endnotes, comments (as Markdown `[^ref]`) |
| Track changes | Insertions (`<ins>`), deletions (`~~strikethrough~~`) |
| Other | Page/column/line breaks, SDT, field codes, bookmarks, symbols |
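
Because tracked insertions come out as `<ins>` and deletions as `~~strikethrough~~`, the converted Markdown can be post-processed to accept or reject all changes. The sketch below is an illustration, not an undocx feature: `accept_changes` and `reject_changes` are hypothetical helpers, and they assume the output uses the literal forms `<ins>text</ins>` and `~~text~~`.

```python
import re

# Assumed output shapes: <ins>inserted</ins> and ~~deleted~~.
_INS = re.compile(r"<ins>(.*?)</ins>", re.DOTALL)
_DEL = re.compile(r"~~(.+?)~~", re.DOTALL)


def accept_changes(markdown: str) -> str:
    """Keep inserted text and drop deleted text."""
    text = _INS.sub(r"\1", markdown)  # unwrap insertions
    return _DEL.sub("", text)         # remove deletions entirely


def reject_changes(markdown: str) -> str:
    """Drop inserted text and restore deleted text."""
    text = _INS.sub("", markdown)     # remove insertions entirely
    return _DEL.sub(r"\1", text)      # unwrap deletions
```

One caveat: ordinary strikethrough formatting may also be rendered as `~~`, in which case these helpers cannot tell it apart from a tracked deletion. They also do not skip code blocks, so a literal `~~` inside code would be affected.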

Options

| Field | Default | Description |
|---|---|---|
| image_handling | Inline | Inline / SaveToDir(path) / Skip |
| preserve_whitespace | false | Keep original spacing |
| html_underline | true | `<u>` tags for underline |
| html_strikethrough | false | `<s>` tags instead of `~~` |
| strict_reference_validation | false | Fail on broken note/comment refs |
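
With `strict_reference_validation` left at its default of `false`, broken note references do not fail the conversion, so it can be useful to check the output afterwards. The following sketch is an independent post-check, not the library's own validation: `check_note_refs` is a hypothetical helper, and it assumes notes appear in the common Markdown footnote form, `[^id]` in the text and `[^id]: ...` at the start of a line for the definition.

```python
import re
from typing import Set, Tuple

# A definition starts a line: "[^id]: note text".
_DEF = re.compile(r"^\[\^([^\]\s]+)\]:", re.MULTILINE)
# A reference is "[^id]" that is not immediately followed by a colon.
_REF = re.compile(r"\[\^([^\]\s]+)\](?!:)")


def check_note_refs(markdown: str) -> Tuple[Set[str], Set[str]]:
    """Return (referenced but undefined ids, defined but unreferenced ids)."""
    defined = set(_DEF.findall(markdown))
    referenced = set(_REF.findall(markdown))
    return referenced - defined, defined - referenced
```

A non-empty first set points at references whose note text is missing. The heuristic misses a reference that is directly followed by a colon in running text, so treat an empty result as a good sign rather than a guarantee.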

Advanced: Custom Pipeline

Replace the default extractor or renderer:

let converter = DocxToMarkdown::with_components(
    ConvertOptions::default(),
    MyExtractor,    // impl AstExtractor
    MyRenderer,     // impl Renderer
);

See docs/API_POLICY.md for stability guarantees on these traits.

Development

cargo test --all-features                                  # test
cargo clippy --all-features --tests -- -D warnings         # lint
./scripts/run_perf_benchmark.sh                            # bench

License

MIT
