DOCX to Markdown converter written in Rust
dm2xcod
DOCX to Markdown converter in Rust with Python bindings.
Table of Contents
- Why dm2xcod
- Requirements
- Installation
- Quick Start
- API Reference
- CLI Reference
- Architecture Overview
- Development
- License
Why dm2xcod
- Rust-based converter focused on predictable performance.
- Covers common DOCX structures: headings, lists, tables, notes, links, images.
- Supports image handling strategies: inline base64, save to directory, or skip.
- Exposes both CLI and Python (PyO3) entry points.
- Includes strict reference validation for footnote/comment/endnote integrity.
Requirements
- Rust 1.75+ (building from source)
- Python 3.12+ (ABI3 wheel compatibility)
Installation
Python package
pip install dm2xcod
CLI (cargo)
cargo install dm2xcod
Rust library
[dependencies]
dm2xcod = "0.3"
Quick Start
CLI
# write to file
dm2xcod input.docx output.md
# print markdown to stdout
dm2xcod input.docx
Python
import dm2xcod
# path input
markdown = dm2xcod.convert_docx("document.docx")
print(markdown)
# bytes input
with open("document.docx", "rb") as f:
    markdown = dm2xcod.convert_docx(f.read())
Rust
use dm2xcod::{ConvertOptions, DocxToMarkdown};
// anyhow::Result requires the anyhow crate in [dependencies].
fn main() -> anyhow::Result<()> {
    let converter = DocxToMarkdown::new(ConvertOptions::default());
    let markdown = converter.convert("document.docx")?;
    println!("{}", markdown);
    Ok(())
}
API Reference
ConvertOptions
| Field | Type | Default | Description |
|---|---|---|---|
| image_handling | ImageHandling | Inline | Image output strategy |
| preserve_whitespace | bool | false | Preserve original spacing more strictly |
| html_underline | bool | true | Use HTML tags for underline output |
| html_strikethrough | bool | false | Use HTML tags for strikethrough output |
| strict_reference_validation | bool | false | Fail on unresolved note/comment references |
ImageHandling variants:
- ImageHandling::Inline
- ImageHandling::SaveToDir(PathBuf)
- ImageHandling::Skip
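The three strategies differ only in what ends up where the image was. The sketch below is illustrative Python, not dm2xcod's implementation: `image_markdown`, its parameters, and the exact Markdown it emits are assumptions made for the example.

```python
import base64
from pathlib import Path
from typing import Optional


def image_markdown(
    name: str,
    data: bytes,
    mime: str = "image/png",
    strategy: str = "inline",
    out_dir: Optional[str] = None,
) -> str:
    """Return the Markdown for one image under a given strategy (hypothetical helper)."""
    if strategy == "skip":
        # Skip: the image leaves no trace in the output.
        return ""
    if strategy == "inline":
        # Inline: embed the bytes as a base64 data URI.
        payload = base64.b64encode(data).decode("ascii")
        return f"![{name}](data:{mime};base64,{payload})"
    if strategy == "save":
        # Save to directory: write the file and link to it.
        if out_dir is None:
            raise ValueError("out_dir is required for the 'save' strategy")
        target = Path(out_dir) / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        return f"![{name}]({target.as_posix()})"
    raise ValueError(f"unknown strategy: {strategy}")
```

Inline output keeps the Markdown self-contained at the cost of size; saving to a directory keeps the text small but makes the output depend on files next to it.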
Example with non-default options:
use dm2xcod::{ConvertOptions, DocxToMarkdown, ImageHandling};
fn main() -> Result<(), dm2xcod::Error> {
    let options = ConvertOptions {
        image_handling: ImageHandling::SaveToDir("./images".into()),
        preserve_whitespace: true,
        html_underline: true,
        html_strikethrough: true,
        strict_reference_validation: true,
    };
    let converter = DocxToMarkdown::new(options);
    let markdown = converter.convert("document.docx")?;
    println!("{}", markdown);
    Ok(())
}
Advanced: Custom extractor/renderer injection
DocxToMarkdown::with_components(options, extractor, renderer) lets you replace the default pipeline.
use dm2xcod::adapters::docx::AstExtractor;
use dm2xcod::converter::ConversionContext;
use dm2xcod::core::ast::{BlockNode, DocumentAst};
use dm2xcod::render::Renderer;
use dm2xcod::{ConvertOptions, DocxToMarkdown, Result};
use rs_docx::document::BodyContent;
// Extractor that ignores the DOCX body and returns a fixed one-paragraph AST.
#[derive(Debug, Default, Clone, Copy)]
struct MyExtractor;
impl AstExtractor for MyExtractor {
    fn extract<'a>(
        &self,
        _body: &[BodyContent<'a>],
        _context: &mut ConversionContext<'a>,
    ) -> Result<DocumentAst> {
        Ok(DocumentAst {
            blocks: vec![BlockNode::Paragraph("custom pipeline".to_string())],
            references: Default::default(),
        })
    }
}
// Renderer that outputs only the number of blocks in the AST.
#[derive(Debug, Default, Clone, Copy)]
struct MyRenderer;
impl Renderer for MyRenderer {
    fn render(&self, document: &DocumentAst) -> Result<String> {
        Ok(format!("blocks={}", document.blocks.len()))
    }
}
fn main() -> Result<()> {
    let converter = DocxToMarkdown::with_components(
        ConvertOptions::default(),
        MyExtractor,
        MyRenderer,
    );
    let output = converter.convert("document.docx")?;
    println!("{}", output);
    Ok(())
}
Python API
- dm2xcod.convert_docx(input: str | bytes) -> str
- Current Python entry point uses default conversion options.
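Since the entry point takes a path or bytes and returns a string, batch conversion reduces to a loop over files. A minimal sketch: `convert_tree` and its `convert` parameter are hypothetical names introduced here, not part of dm2xcod. When `convert` is omitted, the helper falls back to `dm2xcod.convert_docx`.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_tree(
    src_dir: str,
    out_dir: str,
    convert: Optional[Callable[[str], str]] = None,
) -> list[Path]:
    """Convert every .docx under src_dir and write matching .md files into out_dir."""
    if convert is None:
        # Imported lazily so the helper can be exercised with a stand-in converter.
        import dm2xcod
        convert = dm2xcod.convert_docx

    src, out = Path(src_dir), Path(out_dir)
    written = []
    for docx in sorted(src.rglob("*.docx")):
        # Mirror the source layout: in/sub/b.docx -> out/sub/b.md
        target = out / docx.relative_to(src).with_suffix(".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(str(docx)), encoding="utf-8")
        written.append(target)
    return written
```

Passing a custom `convert` callable also makes it easy to wrap the call with logging or error handling without touching the loop.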
CLI Reference
dm2xcod <INPUT> [OUTPUT] [--images-dir <DIR>] [--skip-images]
| Argument/Option | Description |
|---|---|
| <INPUT> | Input DOCX path (required) |
| [OUTPUT] | Output Markdown path (optional, otherwise stdout) |
| --images-dir <DIR> | Save extracted images to a directory |
| --skip-images | Skip image extraction/output |
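For scripting, the CLI can be driven from Python through `subprocess`. The sketch below only assembles the arguments listed in the table; `build_command` and `run_dm2xcod` are hypothetical helper names, and treating `--images-dir` and `--skip-images` as mutually exclusive is an assumption of this example, not documented CLI behavior.

```python
import subprocess
from typing import Optional


def build_command(
    input_path: str,
    output_path: Optional[str] = None,
    images_dir: Optional[str] = None,
    skip_images: bool = False,
) -> list[str]:
    """Assemble the argv for one dm2xcod invocation."""
    if images_dir is not None and skip_images:
        # Assumption: the two image options are not meant to be combined.
        raise ValueError("choose either images_dir or skip_images, not both")
    cmd = ["dm2xcod", input_path]
    if output_path is not None:
        cmd.append(output_path)
    if images_dir is not None:
        cmd += ["--images-dir", images_dir]
    if skip_images:
        cmd.append("--skip-images")
    return cmd


def run_dm2xcod(input_path: str, **kwargs) -> str:
    """Run the CLI and return its stdout (the Markdown when no output path is given)."""
    result = subprocess.run(
        build_command(input_path, **kwargs),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Building the argument list separately from running it keeps the command easy to log and to check without the binary installed.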
Architecture Overview
Conversion pipeline:
- Parse DOCX (rs_docx)
- Build conversion context (relationships, numbering, styles, references, image strategy)
- Extract AST via adapter (AstExtractor)
- Validate references (optional strict mode)
- Render final markdown via renderer (Renderer)
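The extract, validate, render stages of the pipeline can be illustrated with a toy model. This is a Python sketch under loud assumptions: `ToyAst`, the `[^id]` footnote syntax, and every function name below are invented for illustration and do not mirror dm2xcod's internal types.

```python
import re
from dataclasses import dataclass, field

# Footnote-style reference such as [^1]
REF = re.compile(r"\[\^([^\]]+)\]")


@dataclass
class ToyAst:
    blocks: list[str] = field(default_factory=list)      # paragraph text
    notes: dict[str, str] = field(default_factory=dict)  # note id -> note text


def extract(paragraphs: list[str], notes: dict[str, str]) -> ToyAst:
    """Extraction stage: turn parsed input into an AST."""
    return ToyAst(blocks=list(paragraphs), notes=dict(notes))


def unresolved_refs(ast: ToyAst) -> list[str]:
    """Validation stage: collect referenced note ids that have no definition."""
    missing = []
    for block in ast.blocks:
        for ref in REF.findall(block):
            if ref not in ast.notes and ref not in missing:
                missing.append(ref)
    return missing


def render(ast: ToyAst) -> str:
    """Render stage: paragraphs first, then note definitions."""
    body = "\n\n".join(ast.blocks)
    notes = "\n".join(f"[^{key}]: {text}" for key, text in ast.notes.items())
    return body + ("\n\n" + notes if notes else "")


def convert(paragraphs: list[str], notes: dict[str, str], strict: bool = False) -> str:
    ast = extract(paragraphs, notes)
    missing = unresolved_refs(ast)
    if strict and missing:
        # Strict mode fails instead of emitting dangling references.
        raise ValueError(f"unresolved references: {missing}")
    return render(ast)
```

Keeping validation as its own stage between extraction and rendering is what lets strict mode be an option rather than a property of either neighbor.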
Project layout:
src/
  adapters/    # Input adapters (DOCX -> AST extraction boundary)
  core/        # Shared AST/model types
  converter/   # Orchestration and conversion context
  render/      # Markdown rendering + escaping
  lib.rs       # Public API (Rust + Python bindings)
  main.rs      # CLI entrypoint
Development
Build from source
# Rust library/CLI
cargo build --release
# Python extension in local env
pip install maturin
maturin develop --features python
Test and lint
cargo test --all-features
cargo clippy --all-features --tests -- -D warnings
Performance benchmark
# default: tests/aaa, 3 iterations, max 5 files
./scripts/run_perf_benchmark.sh
# custom: input_dir iterations max_files
./scripts/run_perf_benchmark.sh ./samples 5 10
Latest benchmark record (2026-02-14):
- Command: ./scripts/run_perf_benchmark.sh ./tests/aaa 10 10
- Threshold gate: ./scripts/check_perf_threshold.sh ./output_tests/perf/latest.json 15.0 (pass)
- Environment: macOS 26.2 (Darwin arm64), rustc 1.92.0 (ded5c06cf 2025-12-08)
- Result file: output_tests/perf/latest.json
{"input_dir":"./tests/aaa","iterations":10,"files":2,"samples":20,"avg_ms":1.651,"min_ms":0.434,"max_ms":6.081,"total_ms":33.029,"overall_ms":33.034}
Performance threshold gate
# fails if avg_ms exceeds threshold
./scripts/check_perf_threshold.sh ./output_tests/perf/latest.json 15.0
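The gate's rule, as described, is a single comparison on avg_ms. The Python sketch below applies it to the benchmark record shown above; it is an illustration of the rule, not the contents of check_perf_threshold.sh, and `passes_threshold` and `recomputed_avg_ms` are hypothetical names.

```python
import json

# Benchmark record copied from the latest result above.
LATEST = (
    '{"input_dir":"./tests/aaa","iterations":10,"files":2,"samples":20,'
    '"avg_ms":1.651,"min_ms":0.434,"max_ms":6.081,'
    '"total_ms":33.029,"overall_ms":33.034}'
)


def passes_threshold(result_json: str, threshold_ms: float) -> bool:
    """Return True unless avg_ms exceeds the threshold."""
    record = json.loads(result_json)
    return record["avg_ms"] <= threshold_ms


def recomputed_avg_ms(result_json: str) -> float:
    """Cross-check avg_ms as total_ms / samples, rounded to three decimals."""
    record = json.loads(result_json)
    return round(record["total_ms"] / record["samples"], 3)


if __name__ == "__main__":
    print(passes_threshold(LATEST, 15.0))
    print(recomputed_avg_ms(LATEST))
```

With the recorded run, 33.029 ms over 20 samples gives an average of about 1.651 ms, well under the 15.0 ms threshold.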
Release notes
# auto-detect previous tag to HEAD
./scripts/generate_release_notes.sh
# explicit range and output file
./scripts/generate_release_notes.sh v0.3.9 v0.3.10 ./output_tests/release_notes.md
API stability policy
See docs/API_POLICY.md.
License
MIT
Download files
File details
Details for the file dm2xcod-0.3.14.tar.gz.
File metadata
- Download URL: dm2xcod-0.3.14.tar.gz
- Upload date:
- Size: 887.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fd8d4015966caf6424bbc4afecdf819f0677f327addd8e579aeb2c150b31c729 |
| MD5 | 2450c69dc917e071bce8783dfac49abe |
| BLAKE2b-256 | ec20c07e2e322f55000c0b0354bc5019e327195f300dff6b37947dc52f4b9ca8 |
File details
Details for the file dm2xcod-0.3.14-cp312-abi3-win_amd64.whl.
File metadata
- Download URL: dm2xcod-0.3.14-cp312-abi3-win_amd64.whl
- Upload date:
- Size: 1.2 MB
- Tags: CPython 3.12+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1b9c757203a5962d08af0f6f19eb3b0ded1b1ec4b8eef768e8a0ae2cffb840b5 |
| MD5 | 57ecfdd95196c6df47ee8b7927f43e8c |
| BLAKE2b-256 | 1d833843614b9b154c9689a629108452b5feaec9e72fa10d9a7ddd1dc95f6f47 |
File details
Details for the file dm2xcod-0.3.14-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: dm2xcod-0.3.14-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.3 MB
- Tags: CPython 3.12+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 143a24d954d43f2d812393cb8e7d6ab20356c848f6167e77da8fa9b4f6a070d2 |
| MD5 | 5ea4284e68bc5900df65cc519e5e352f |
| BLAKE2b-256 | 8d6b44f1e300c52c8e0bb65df038012c252628a63f7a93650719adafd7430c9b |
File details
Details for the file dm2xcod-0.3.14-cp312-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.
File metadata
- Download URL: dm2xcod-0.3.14-cp312-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
- Upload date:
- Size: 2.3 MB
- Tags: CPython 3.12+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a9abd4d67c4f0265812165d256a31245fd3c28f5da68ba1ab34002ff9e8f01f5 |
| MD5 | 9e8f23c9a623d182729325c8dc5e04f1 |
| BLAKE2b-256 | 4aa4e72814bf40e7aaead6ef2d8113c3fc29cde8ef0a4a6afac6a49deb4e0ca4 |