undocx
Fast, accurate DOCX to Markdown converter written in Rust with Python bindings.
Conversion Demo
Headings, bold/italic/underline, tables, nested lists, footnotes, code blocks, and track changes are all converted automatically.
Install
pip install undocx # Python
cargo install undocx # CLI
# Rust library
[dependencies]
undocx = "0.3"
Quick Start
CLI
undocx report.docx output.md # convert to file
undocx report.docx # print to stdout
undocx report.docx -o out.md --images-dir ./img # extract images
Python
import undocx
markdown = undocx.convert_docx("report.docx") # from path
with open("report.docx", "rb") as f: # from bytes; the file is closed afterwards
    markdown = undocx.convert_docx(f.read())
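For converting many files at once, the path-based call can be wrapped in a small loop. The following is a minimal sketch, not part of undocx: `convert_directory` and its `convert` parameter are hypothetical names, and the only undocx call it assumes is `undocx.convert_docx(path)` returning a Markdown string, as in the Quick Start.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_directory(
    src_dir: str,
    dst_dir: str,
    convert: Optional[Callable[[str], str]] = None,
) -> List[Path]:
    """Convert every .docx under src_dir to a .md file under dst_dir.

    `convert` takes a DOCX path and returns Markdown text. It defaults to
    undocx.convert_docx, imported lazily so the helper can also be run
    with a stand-in converter (hypothetical design choice).
    """
    if convert is None:
        import undocx  # assumed installed via `pip install undocx`

        convert = undocx.convert_docx

    src, dst = Path(src_dir), Path(dst_dir)
    written: List[Path] = []
    for docx_path in sorted(src.rglob("*.docx")):
        # Mirror the source layout: a/b/report.docx -> a/b/report.md
        out_path = (dst / docx_path.relative_to(src)).with_suffix(".md")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(convert(str(docx_path)), encoding="utf-8")
        written.append(out_path)
    return written
```

Calling `convert_directory("docs_in", "docs_out")` would then use undocx itself. Passing a different callable keeps the directory walk testable without any DOCX files on hand.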
Rust
use undocx::{ConvertOptions, DocxToMarkdown, ImageHandling};
let options = ConvertOptions {
image_handling: ImageHandling::SaveToDir("./images".into()),
..Default::default()
};
let converter = DocxToMarkdown::new(options);
let markdown = converter.convert("report.docx")?;
Supported Features
| Category | Elements |
|---|---|
| Text | Bold, italic, underline, strikethrough, superscript/subscript |
| Structure | Heading 1-9, Title, Subtitle, alignment (center/right) |
| Lists | Ordered (decimal, letter, roman, Korean, circled), unordered, nested |
| Tables | Colspan, rowspan, nested tables, multi-paragraph cells |
| Links | External, internal bookmarks, TOC anchors |
| Images | Inline, floating, and legacy VML images; embedded as base64, saved to a directory, or skipped |
| Notes | Footnotes, endnotes, comments (as Markdown [^ref]) |
| Track changes | Insertions (<ins>), deletions (~~strikethrough~~) |
| Other | Page/column/line breaks, SDT, field codes, bookmarks, symbols |
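Because tracked insertions come out as `<ins>` tags and tracked deletions as `~~strikethrough~~`, the converted Markdown can be post-processed into an "all changes accepted" or "all changes rejected" text. The sketch below is an illustration only: `accept_changes` and `reject_changes` are hypothetical helpers, and the exact markup (`<ins>…</ins>`, `~~…~~`) is assumed from the table above rather than verified against undocx output.

```python
import re

# Assumed output markup, per the feature table: <ins>...</ins> and ~~...~~.
_INS = re.compile(r"<ins>(.*?)</ins>", re.DOTALL)
_DEL = re.compile(r"~~(.+?)~~", re.DOTALL)


def accept_changes(markdown: str) -> str:
    """Keep inserted text and drop deleted text."""
    text = _INS.sub(r"\1", markdown)
    return _DEL.sub("", text)


def reject_changes(markdown: str) -> str:
    """Drop inserted text and keep deleted text."""
    text = _INS.sub("", markdown)
    return _DEL.sub(r"\1", text)
```

One caveat with this approach: ordinary strikethrough formatting also renders as `~~` by default, so it cannot be told apart from a tracked deletion here, and the patterns do not skip text inside code blocks.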
Options
| Field | Default | Description |
|---|---|---|
| `image_handling` | `Inline` | `Inline` / `SaveToDir(path)` / `Skip` |
| `preserve_whitespace` | `false` | Keep original spacing |
| `html_underline` | `true` | `<u>` tags for underline |
| `html_strikethrough` | `false` | `<s>` tags instead of `~~` |
| `strict_reference_validation` | `false` | Fail on broken note/comment refs |
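With `strict_reference_validation` left at its default of `false`, broken note references do not fail the conversion, so a downstream check on the Markdown may be useful. This is a rough sketch under stated assumptions: `find_dangling_refs` is a hypothetical helper, and it assumes notes appear as `[^id]` references with `[^id]:` definitions at the start of a line, which is the common Markdown footnote convention and not a confirmed detail of undocx output.

```python
import re
from typing import Set

# A definition starts a line: "[^1]: note text"
_DEFINITION = re.compile(r"^\[\^([^\]\s]+)\]:", re.MULTILINE)
# A reference is any remaining "[^1]" marker
_REFERENCE = re.compile(r"\[\^([^\]\s]+)\]")


def find_dangling_refs(markdown: str) -> Set[str]:
    """Return footnote ids that are referenced but never defined."""
    defined = set(_DEFINITION.findall(markdown))
    # Remove definition markers so they are not counted as references.
    body = _DEFINITION.sub("", markdown)
    referenced = set(_REFERENCE.findall(body))
    return referenced - defined
```

An empty result means every reference found has a matching definition under these assumptions.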
Advanced: Custom Pipeline
Replace the default extractor or renderer:
let converter = DocxToMarkdown::with_components(
ConvertOptions::default(),
MyExtractor, // impl AstExtractor
MyRenderer, // impl Renderer
);
See docs/API_POLICY.md for stability guarantees on these traits.
Development
cargo test --all-features # test
cargo clippy --all-features --tests -- -D warnings # lint
./scripts/run_perf_benchmark.sh # bench
License
MIT