DOCX to Markdown converter written in Rust

undocx

Fast, accurate DOCX to Markdown converter built for LLM/RAG pipelines. Written in Rust with Python bindings.

  • 16.5x faster than pandoc — 3.3ms per file average
  • LLM-optimized — Clean Markdown output ready for embeddings, chunking, and retrieval
  • Full fidelity — Tables, footnotes, track changes, images, nested lists, and more

Conversion Demo

[Figure: side-by-side images of a DOCX input document and the converted Markdown output.]

Benchmarks

Measured on 39 DOCX files × 10 iterations (to reproduce, run the benchmark command listed under Development):

| Tool | Avg (ms) | Median (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| undocx | 3.34 | 3.22 | 2.89 | 5.46 |
| markitdown | 18.25 | 17.45 | 14.63 | 41.81 |
| pandoc | 55.08 | 54.11 | 40.31 | 69.51 |

undocx is 16.5x faster than pandoc and 5.5x faster than markitdown.
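
The speedup figures follow directly from the average column of the table. The sketch below is not the project's benchmark script; it is a minimal illustration, assuming a generic `convert` callable that takes a file path, of how such statistics could be collected and how the ratios are derived. The helper names `time_converter` and `speedup` are hypothetical.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def time_converter(convert: Callable[[str], str],
                   paths: Iterable[str],
                   iterations: int = 10) -> Dict[str, float]:
    """Run `convert` on every path `iterations` times and summarize in ms.

    Assumption: statistics are taken over all individual runs; the real
    benchmark may aggregate differently. Requires at least one path.
    """
    samples_ms = []
    for path in paths:
        for _ in range(iterations):
            start = time.perf_counter()
            convert(path)
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "avg": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
    }


def speedup(baseline_avg_ms: float, candidate_avg_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return round(baseline_avg_ms / candidate_avg_ms, 1)


# Averages copied from the benchmark table.
print(speedup(55.08, 3.34))  # pandoc vs undocx
print(speedup(18.25, 3.34))  # markitdown vs undocx
```

With the table's averages, the two calls give 16.5 and 5.5, matching the figures quoted above.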

| Feature | undocx | pandoc | markitdown |
|---|---|---|---|
| Language | Rust | Haskell | Python |
| Speed (avg) | 3.3ms/file | 55ms/file | 18ms/file |
| Tables (colspan/rowspan) | Yes | Partial | Yes |
| Track changes | Yes | Yes | No |
| Footnotes/Endnotes | Yes | Yes | No |
| Comments | Yes | No | No |
| VML legacy images | Yes | No | No |
| Korean numbering | Yes | No | No |
| Python API | Yes | CLI only | Yes |
| Rust API | Yes | No | No |

For Humans

Install and convert — that's it.

pip install undocx          # Python
cargo install undocx        # CLI

CLI

undocx report.docx output.md              # convert to file
undocx report.docx                         # print to stdout
undocx report.docx -o out.md --images-dir ./img  # extract images

Python

import undocx

markdown = undocx.convert_docx("report.docx")
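
For more than one file, the same call can be wrapped in a small batch helper. This is a sketch rather than part of the library: `convert_tree` is a hypothetical name, and the converter is passed in as a callable so the helper does not depend on how undocx is installed.

```python
from pathlib import Path
from typing import Callable, List


def convert_tree(src_dir: str, out_dir: str,
                 convert: Callable[[str], str]) -> List[Path]:
    """Convert every .docx under src_dir, mirroring the layout under out_dir.

    `convert` takes a file path and returns Markdown text.
    """
    src, out = Path(src_dir), Path(out_dir)
    written = []
    for docx in sorted(src.rglob("*.docx")):
        # Skip the "~$name.docx" lock files Word leaves next to open documents.
        if docx.name.startswith("~$"):
            continue
        target = out / docx.relative_to(src).with_suffix(".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(str(docx)), encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_tree("docs", "out", undocx.convert_docx)` would then write one `.md` file per document and return the list of written paths.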

For Agents

Designed for document preprocessing in LLM/RAG pipelines.

Python — RAG ingestion

import undocx

# Skip images for text-only RAG ingestion
md = undocx.convert_docx("report.docx", image_handling="skip")

# Pass raw bytes instead of a path (e.g. fetched from S3 or over HTTP);
# doc_bytes is assumed to be a bytes object holding the .docx contents
md = undocx.convert_docx(doc_bytes, image_handling="skip")
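
When the bytes come from a network source, it can help to reject obviously wrong payloads (an HTML error page, for example) before conversion. The sketch below relies on the fact that a DOCX file is a ZIP container and therefore starts with the ZIP local-file signature. `convert_stream` is a hypothetical helper; the converter is passed in and called with the `image_handling="skip"` keyword shown above.

```python
import io
from typing import BinaryIO, Callable

ZIP_MAGIC = b"PK\x03\x04"  # DOCX files are ZIP containers


def convert_stream(stream: BinaryIO, convert: Callable[..., str]) -> str:
    """Read a binary stream fully and convert it, skipping images.

    Raises ValueError if the payload does not start with the ZIP signature.
    """
    data = stream.read()
    if not data.startswith(ZIP_MAGIC):
        raise ValueError("payload does not look like a DOCX (ZIP) file")
    return convert(data, image_handling="skip")
```

A call such as `convert_stream(open("report.docx", "rb"), undocx.convert_docx)` would exercise it; any file-like object opened in binary mode, including `io.BytesIO`, works the same way.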

Rust — One-liner

let md = undocx::convert("report.docx")?;
let md = undocx::convert_bytes(&bytes)?;

Rust — Builder (optimal for RAG)

let md = undocx::builder()
    .skip_images()
    .convert("report.docx")?;

Rust — Pluggable architecture

let converter = DocxToMarkdown::with_components(
    ConvertOptions::default(),
    MyExtractor,    // impl AstExtractor
    MyRenderer,     // impl Renderer
);

See docs/API_POLICY.md for stability guarantees on these traits.

# Cargo.toml
[dependencies]
undocx = "0.4"

Tips for RAG pipelines:

  • Use image_handling="skip" to reduce token count
  • Output is clean Markdown — split on ## headers for semantic chunking
  • Footnotes and comments are preserved as [^ref] for full context
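
The header-based chunking tip can be sketched in a few lines of plain Python. This is one possible approach, not something shipped with undocx; `split_on_h2` is a hypothetical helper. It treats only level-2 headings as boundaries and ignores `##` lines inside fenced code blocks.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^## +(.*)$")  # level-2 headings only, not ### or deeper


def split_on_h2(markdown: str) -> List[Tuple[str, str]]:
    """Return (heading, body) pairs; text before the first '## ' gets heading ''."""
    chunks: List[Tuple[str, str]] = []
    heading, lines = "", []
    in_fence = False

    def flush() -> None:
        # Emit the current chunk unless it is an empty preamble.
        if heading or any(l.strip() for l in lines):
            chunks.append((heading, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each body keeps its nested `###` subsections, so a chunk stays self-contained; very long sections may still need a secondary split by size before embedding.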

Supported Features

| Category | Elements |
|---|---|
| Text | Bold, italic, underline, strikethrough, superscript/subscript |
| Structure | Heading 1-9, Title, Subtitle, alignment (center/right) |
| Lists | Ordered (decimal, letter, roman, Korean, circled), unordered, nested |
| Tables | Colspan, rowspan, nested tables, multi-paragraph cells |
| Links | External, internal bookmarks, TOC anchors |
| Images | Inline, floating, VML legacy; base64 embed, save to dir, or skip |
| Notes | Footnotes, endnotes, comments (as Markdown `[^ref]`) |
| Track changes | Insertions (`<ins>`), deletions (`~~strikethrough~~`) |
| Other | Page/column/line breaks, SDT, field codes, bookmarks, symbols |
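
Because tracked changes appear in the output as `<ins>` tags and `~~` spans, a pipeline that only wants the final wording could post-process the Markdown. The sketch below is an assumption-laden illustration, not a library feature: `accept_changes` is a hypothetical helper, and it cannot tell a tracked deletion apart from ordinary strikethrough formatting, which is also rendered with `~~` by default.

```python
import re

INS = re.compile(r"<ins>(.*?)</ins>", re.DOTALL)
DEL = re.compile(r"~~(.*?)~~", re.DOTALL)
# Runs of spaces between two non-space characters; leading indentation and
# trailing hard-break spaces are deliberately left alone.
INNER_SPACES = re.compile(r"(?<=\S)[ \t]{2,}(?=\S)")


def accept_changes(markdown: str) -> str:
    """Keep inserted text, drop struck-through text, tidy leftover gaps."""
    text = INS.sub(r"\1", markdown)   # unwrap insertions
    text = DEL.sub("", text)          # remove deletions (and any other ~~ span)
    return INNER_SPACES.sub(" ", text)


print(accept_changes("The <ins>new</ins> ~~old~~ plan"))  # The new plan
```

If ordinary strikethrough must survive, this approach is not suitable as written and the two cases would need to be distinguished upstream.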

Options

| Field | Default | Description |
|---|---|---|
| image_handling | Inline | Inline / SaveToDir(path) / Skip |
| preserve_whitespace | false | Keep original spacing |
| html_underline | true | `<u>` tags for underline |
| html_strikethrough | false | `<s>` tags instead of `~~` |
| strict_reference_validation | false | Fail on broken note/comment refs |

Development

cargo test --all-features                                  # test
cargo clippy --all-features --tests -- -D warnings         # lint
python examples/benchmark_comparison.py ./tests/pandoc 10  # bench

See CONTRIBUTING.md for development setup and guidelines.

License

MIT — see LICENSE
