DOCX to Markdown converter written in Rust
undocx
Fast, accurate DOCX to Markdown converter built for LLM/RAG pipelines. Written in Rust with Python bindings.
- 16.5x faster than pandoc — 3.3ms per file average
- LLM-optimized — Clean Markdown output ready for embeddings, chunking, and retrieval
- Full fidelity — Tables, footnotes, track changes, images, nested lists, and more
Benchmarks
Measured on 39 DOCX files × 10 iterations (the benchmark command under Development reproduces it):
| Tool | Avg (ms) | Median (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| undocx | 3.34 | 3.22 | 2.89 | 5.46 |
| markitdown | 18.25 | 17.45 | 14.63 | 41.81 |
| pandoc | 55.08 | 54.11 | 40.31 | 69.51 |
undocx is 16.5x faster than pandoc and 5.5x faster than markitdown.
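The speedup figures follow directly from the average column of the table above. A minimal sketch of the arithmetic, with the averages copied from the table (the `avg_ms` and `speedup` names are illustrative and not part of undocx):

```python
# Average per-file times in milliseconds, copied from the benchmark table.
avg_ms = {"undocx": 3.34, "markitdown": 18.25, "pandoc": 55.08}


def speedup(baseline: str, tool: str = "undocx") -> float:
    """Return how many times faster `tool` is than `baseline`, by average time."""
    return avg_ms[baseline] / avg_ms[tool]


for name in ("pandoc", "markitdown"):
    print(f"undocx vs {name}: {speedup(name):.1f}x")
# prints:
# undocx vs pandoc: 16.5x
# undocx vs markitdown: 5.5x
```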
| Feature | undocx | pandoc | markitdown |
|---|---|---|---|
| Language | Rust | Haskell | Python |
| Speed (avg) | 3.3ms/file | 55ms/file | 18ms/file |
| Tables (colspan/rowspan) | Yes | Partial | Yes |
| Track changes | Yes | Yes | No |
| Footnotes/Endnotes | Yes | Yes | No |
| Comments | Yes | No | No |
| VML legacy images | Yes | No | No |
| Korean numbering | Yes | No | No |
| Python API | Yes | CLI only | Yes |
| Rust API | Yes | No | No |
For Humans
Install and convert — that's it.
pip install undocx # Python
cargo install undocx # CLI
CLI
undocx report.docx output.md # convert to file
undocx report.docx # print to stdout
undocx report.docx -o out.md --images-dir ./img # extract images
Python
import undocx
markdown = undocx.convert_docx("report.docx")
For Agents
Designed for document preprocessing in LLM/RAG pipelines.
Python — RAG ingestion
import undocx
# Skip images for text-only RAG ingestion
md = undocx.convert_docx("report.docx", image_handling="skip")
# Process bytes from S3, HTTP, or any byte stream
md = undocx.convert_docx(doc_bytes, image_handling="skip")
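For ingesting many files at once, the call above can be wrapped in a small loop that collects failures instead of stopping at the first bad document. This is a sketch under assumptions: `convert_many` and `_undocx_text_only` are hypothetical helpers, not part of undocx, and the exception types undocx raises are not documented here, so the loop catches `Exception` broadly.

```python
from typing import Callable, Dict, Iterable, Tuple


def _undocx_text_only(src: str) -> str:
    """Default converter: the text-only call shown above."""
    # Imported lazily so the helper can be exercised without undocx installed.
    import undocx

    return undocx.convert_docx(src, image_handling="skip")


def convert_many(
    paths: Iterable[str],
    convert: Callable[[str], str] = _undocx_text_only,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Convert each path to Markdown; return (results, errors) keyed by path."""
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = convert(path)
        except Exception as exc:  # assumption: undocx's error types are not specified here
            errors[path] = repr(exc)
    return results, errors
```

Passing a different `convert` callable (for example one that reads bytes from object storage first) keeps the loop unchanged.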
Rust — One-liner
let md = undocx::convert("report.docx")?;
let md = undocx::convert_bytes(&bytes)?;
Rust — Builder (optimal for RAG)
let md = undocx::builder()
.skip_images()
.convert("report.docx")?;
Rust — Pluggable architecture
let converter = DocxToMarkdown::with_components(
ConvertOptions::default(),
MyExtractor, // impl AstExtractor
MyRenderer, // impl Renderer
);
See docs/API_POLICY.md for stability guarantees on these traits.
# Cargo.toml
[dependencies]
undocx = "0.4"
Tips for RAG pipelines:
- Use image_handling="skip" to reduce token count
- Output is clean Markdown: split on ## headers for semantic chunking
- Footnotes and comments are preserved as [^ref] for full context
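The header-based chunking tip can be sketched in plain Python. This is one possible approach, not something undocx provides: `split_on_h2` is a hypothetical helper that starts a new chunk at every line beginning with `## ` and ignores such lines inside fenced code blocks.

```python
from typing import List

FENCE = "`" * 3  # Markdown code-fence marker


def split_on_h2(markdown: str) -> List[str]:
    """Split Markdown into chunks, starting a new chunk at each '## ' heading."""
    chunks: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # toggle on every opening or closing fence
        if line.startswith("## ") and not in_fence and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


sample = "# Title\nintro\n## A\ntext a\n## B\ntext b"
print(split_on_h2(sample))
# prints: ['# Title\nintro', '## A\ntext a', '## B\ntext b']
```

Any text before the first `## ` heading becomes its own chunk, so a document title and introduction are not lost.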
Supported Features
| Category | Elements |
|---|---|
| Text | Bold, italic, underline, strikethrough, superscript/subscript |
| Structure | Heading 1-9, Title, Subtitle, alignment (center/right) |
| Lists | Ordered (decimal, letter, roman, Korean, circled), unordered, nested |
| Tables | Colspan, rowspan, nested tables, multi-paragraph cells |
| Links | External, internal bookmarks, TOC anchors |
| Images | Inline, floating, VML legacy — base64 embed, save to dir, or skip |
| Notes | Footnotes, endnotes, comments (as Markdown [^ref]) |
| Track changes | Insertions (<ins>), deletions (~~strikethrough~~) |
| Other | Page/column/line breaks, SDT, field codes, bookmarks, symbols |
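Because track changes appear in the output as `<ins>` tags and `~~` spans, a pipeline that wants the "all changes accepted" text can post-process the Markdown. The sketch below is an assumption-laden illustration: `accept_changes` is a hypothetical helper, and it assumes each deletion sits on a single line. With the default `html_strikethrough` setting, ordinary strikethrough formatting is also rendered as `~~`, so this would remove that text as well.

```python
import re

# Matches opening and closing <ins> tags, with or without attributes.
INS_TAG_RE = re.compile(r"</?ins\b[^>]*>")
# Matches a ~~deleted~~ span on a single line (non-greedy).
DELETION_RE = re.compile(r"~~.+?~~")


def accept_changes(markdown: str) -> str:
    """Drop ~~deleted~~ spans and unwrap <ins> tags, keeping the inserted text."""
    without_deletions = DELETION_RE.sub("", markdown)
    return INS_TAG_RE.sub("", without_deletions)


print(accept_changes("The <ins>quick </ins>~~slow ~~fox"))
# prints: The quick fox
```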
Options
| Field | Default | Description |
|---|---|---|
| image_handling | Inline | Inline / SaveToDir(path) / Skip |
| preserve_whitespace | false | Keep original spacing |
| html_underline | true | <u> tags for underline |
| html_strikethrough | false | <s> tags instead of ~~ |
| strict_reference_validation | false | Fail on broken note/comment refs |
Development
cargo test --all-features # test
cargo clippy --all-features --tests -- -D warnings # lint
python examples/benchmark_comparison.py ./tests/pandoc 10 # bench
See CONTRIBUTING.md for development setup and guidelines.
License
MIT — see LICENSE