DOCX to Markdown converter written in Rust

dm2xcod

DOCX to Markdown converter in Rust with Python bindings.

Why dm2xcod

  • Rust-based converter focused on predictable performance.
  • Covers common DOCX structures: headings, lists, tables, notes, links, images.
  • Supports image handling strategies: inline base64, save to directory, or skip.
  • Exposes both CLI and Python (PyO3) entry points.
  • Includes strict reference validation for footnote/comment/endnote integrity.

Requirements

  • Rust 1.75+ (building from source)
  • Python 3.12+ (ABI3 wheel compatibility)

Installation

Python package

pip install dm2xcod

CLI (cargo)

cargo install dm2xcod

Rust library

[dependencies]
dm2xcod = "0.3"

Quick Start

CLI

# write to file
dm2xcod input.docx output.md

# print markdown to stdout
dm2xcod input.docx

Python

import dm2xcod

# path input
markdown = dm2xcod.convert_docx("document.docx")
print(markdown)

# bytes input
with open("document.docx", "rb") as f:
    markdown = dm2xcod.convert_docx(f.read())
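
For a whole folder of documents, a thin wrapper around the documented convert_docx call may be all that is needed. The sketch below is not part of dm2xcod: convert_tree and its parameters are hypothetical names, and the converter is passed in as a callable so the helper can be tried without the package installed.

```python
from pathlib import Path
from typing import Callable, List


def convert_tree(src_dir: str, dst_dir: str, convert: Callable[[str], str]) -> List[Path]:
    """Convert every .docx under src_dir and mirror the layout under dst_dir.

    `convert` is any callable that takes a path string and returns Markdown,
    for example dm2xcod.convert_docx. Returns the paths that were written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    written = []
    for docx in sorted(src.rglob("*.docx")):
        # Skip Word lock files such as "~$report.docx".
        if docx.name.startswith("~$"):
            continue
        target = dst / docx.relative_to(src).with_suffix(".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(str(docx)), encoding="utf-8")
        written.append(target)
    return written
```

With the package installed, a call such as convert_tree("docs", "out", dm2xcod.convert_docx) would write one .md file per input document.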

Rust

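use dm2xcod::{ConvertOptions, DocxToMarkdown};

fn main() -> Result<(), dm2xcod::Error> {
    // Build a converter with the default options listed under ConvertOptions.
    let converter = DocxToMarkdown::new(ConvertOptions::default());
    // Convert the DOCX at the given path into a Markdown string.
    let markdown = converter.convert("document.docx")?;
    println!("{}", markdown);
    Ok(())
}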

API Reference

ConvertOptions

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| image_handling | ImageHandling | Inline | Image output strategy |
| preserve_whitespace | bool | false | Preserve original spacing more strictly |
| html_underline | bool | true | Use HTML tags for underline output |
| html_strikethrough | bool | false | Use HTML tags for strikethrough output |
| strict_reference_validation | bool | false | Fail on unresolved note/comment references |

ImageHandling variants:

  • ImageHandling::Inline
  • ImageHandling::SaveToDir(PathBuf)
  • ImageHandling::Skip
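
The inline strategy embeds the image bytes in the Markdown itself. This page does not show the exact output dm2xcod emits, so the following is only a sketch of the general base64 data-URI technique, written in Python for brevity; inline_image_markdown is a hypothetical helper name.

```python
import base64


def inline_image_markdown(alt: str, data: bytes, mime: str = "image/png") -> str:
    """Return a Markdown image whose target is a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"![{alt}](data:{mime};base64,{encoded})"
```

Base64 grows the payload by roughly one third, which is a common reason to prefer SaveToDir for image-heavy documents and to keep Inline for single-file output.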

Example with non-default options:

use dm2xcod::{ConvertOptions, DocxToMarkdown, ImageHandling};

fn main() -> Result<(), dm2xcod::Error> {
    let options = ConvertOptions {
        image_handling: ImageHandling::SaveToDir("./images".into()),
        preserve_whitespace: true,
        html_underline: true,
        html_strikethrough: true,
        strict_reference_validation: true,
    };

    let converter = DocxToMarkdown::new(options);
    let markdown = converter.convert("document.docx")?;
    println!("{}", markdown);
    Ok(())
}

Advanced: Custom extractor/renderer injection

DocxToMarkdown::with_components(options, extractor, renderer) lets you replace the default pipeline.

use dm2xcod::adapters::docx::AstExtractor;
use dm2xcod::converter::ConversionContext;
use dm2xcod::core::ast::{BlockNode, DocumentAst};
use dm2xcod::render::Renderer;
use dm2xcod::{ConvertOptions, DocxToMarkdown, Result};
use rs_docx::document::BodyContent;

#[derive(Debug, Default, Clone, Copy)]
struct MyExtractor;

impl AstExtractor for MyExtractor {
    fn extract<'a>(
        &self,
        _body: &[BodyContent<'a>],
        _context: &mut ConversionContext<'a>,
    ) -> Result<DocumentAst> {
        Ok(DocumentAst {
            blocks: vec![BlockNode::Paragraph("custom pipeline".to_string())],
            references: Default::default(),
        })
    }
}

#[derive(Debug, Default, Clone, Copy)]
struct MyRenderer;

impl Renderer for MyRenderer {
    fn render(&self, document: &DocumentAst) -> Result<String> {
        Ok(format!("blocks={}", document.blocks.len()))
    }
}

fn main() -> Result<()> {
    let converter = DocxToMarkdown::with_components(
        ConvertOptions::default(),
        MyExtractor,
        MyRenderer,
    );
    let output = converter.convert("document.docx")?;
    println!("{}", output);
    Ok(())
}

Python API

  • dm2xcod.convert_docx(input: str | bytes) -> str
  • The Python entry point currently always uses the default conversion options.
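
Because convert_docx accepts raw bytes, input may arrive from an upload or a network response rather than a trusted file. How dm2xcod reports invalid input to Python is not documented here, so a cheap pre-check can be useful. The sketch below relies only on the fact that a DOCX file is a ZIP archive whose main part conventionally lives at word/document.xml; looks_like_docx is a hypothetical helper, not part of the package.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Cheap pre-check: is this a ZIP archive with the usual DOCX entries?"""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return False
    return "[Content_Types].xml" in names and "word/document.xml" in names
```

A caller could run this check first and only pass the bytes to dm2xcod.convert_docx when it returns True.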

CLI Reference

dm2xcod <INPUT> [OUTPUT] [--images-dir <DIR>] [--skip-images]

| Argument/Option | Description |
| --- | --- |
| <INPUT> | Input DOCX path (required) |
| [OUTPUT] | Output Markdown path (optional, otherwise stdout) |
| --images-dir <DIR> | Save extracted images to a directory |
| --skip-images | Skip image extraction/output |
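
Since the Python entry point does not take options, scripts that need --images-dir or --skip-images can shell out to the CLI instead. The following sketch uses only the arguments documented above; build_command and run_cli are hypothetical helper names, and run_cli assumes the CLI exits with a non-zero status on failure, which this page does not state.

```python
import subprocess
from typing import List, Optional


def build_command(input_path: str, output_path: Optional[str] = None,
                  images_dir: Optional[str] = None,
                  skip_images: bool = False) -> List[str]:
    """Assemble a dm2xcod argument list from the documented arguments."""
    cmd = ["dm2xcod", input_path]
    if output_path is not None:
        cmd.append(output_path)
    if images_dir is not None:
        cmd += ["--images-dir", images_dir]
    if skip_images:
        cmd.append("--skip-images")
    return cmd


def run_cli(*args, **kwargs) -> str:
    """Run the CLI and return its stdout (the Markdown when OUTPUT is omitted).

    Assumption: a failed conversion yields a non-zero exit status, which
    check=True turns into subprocess.CalledProcessError.
    """
    result = subprocess.run(build_command(*args, **kwargs),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.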

Architecture Overview

Conversion pipeline:

  1. Parse DOCX (rs_docx)
  2. Build conversion context (relationships, numbering, styles, references, image strategy)
  3. Extract AST via adapter (AstExtractor)
  4. Validate references (optional strict mode)
  5. Render final markdown via renderer (Renderer)
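
Step 4 can be pictured as a set comparison between the references used in the body and the notes or comments that are actually defined. The sketch below is written in Python for brevity and does not reflect dm2xcod's internal types: validate_references, UnresolvedReferenceError, and the id/dict shapes are all hypothetical.

```python
from typing import Dict, Iterable, List


class UnresolvedReferenceError(ValueError):
    """Raised in strict mode when a reference has no matching definition."""


def validate_references(used: Iterable[str], defined: Dict[str, str],
                        strict: bool = False) -> List[str]:
    """Return the reference ids that are used in the body but never defined.

    In strict mode any unresolved id aborts the conversion instead.
    """
    missing = sorted({ref for ref in used if ref not in defined})
    if strict and missing:
        raise UnresolvedReferenceError("unresolved references: " + ", ".join(missing))
    return missing
```

In lenient mode the caller can log the returned ids and still render; strict mode mirrors the documented behavior of strict_reference_validation, which fails on unresolved note/comment references.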

Project layout:

src/
  adapters/      # Input adapters (DOCX -> AST extraction boundary)
  core/          # Shared AST/model types
  converter/     # Orchestration and conversion context
  render/        # Markdown rendering + escaping
  lib.rs         # Public API (Rust + Python bindings)
  main.rs        # CLI entrypoint

Development

Build from source

# Rust library/CLI
cargo build --release

# Python extension in local env
pip install maturin
maturin develop --features python

Test and lint

cargo test --all-features
cargo clippy --all-features --tests -- -D warnings

Performance benchmark

# default: tests/aaa, 3 iterations, max 5 files
./scripts/run_perf_benchmark.sh

# custom: input_dir iterations max_files
./scripts/run_perf_benchmark.sh ./samples 5 10

Latest benchmark record (2026-02-14):

  • Command: ./scripts/run_perf_benchmark.sh ./tests/aaa 10 10
  • Threshold gate: ./scripts/check_perf_threshold.sh ./output_tests/perf/latest.json 15.0 (pass)
  • Environment: macOS 26.2 (Darwin arm64), rustc 1.92.0 (ded5c06cf 2025-12-08)
  • Result file: output_tests/perf/latest.json
{"input_dir":"./tests/aaa","iterations":10,"files":2,"samples":20,"avg_ms":1.651,"min_ms":0.434,"max_ms":6.081,"total_ms":33.029,"overall_ms":33.034}

Performance threshold gate

# fails if avg_ms exceeds threshold
./scripts/check_perf_threshold.sh ./output_tests/perf/latest.json 15.0
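
The gate's rule, as described in the comment above, is a single comparison on avg_ms. The sketch below restates that rule in Python against the benchmark record shown earlier; it is not the contents of check_perf_threshold.sh, and check_threshold is a hypothetical name.

```python
import json


def check_threshold(record_json: str, threshold_ms: float) -> bool:
    """Return True when the record's avg_ms does not exceed the threshold."""
    record = json.loads(record_json)
    return float(record["avg_ms"]) <= threshold_ms


# Record copied from the latest benchmark entry in this document.
record = ('{"input_dir":"./tests/aaa","iterations":10,"files":2,"samples":20,'
          '"avg_ms":1.651,"min_ms":0.434,"max_ms":6.081,'
          '"total_ms":33.029,"overall_ms":33.034}')
print(check_threshold(record, 15.0))
```

The record is internally consistent: 10 iterations over 2 files give 20 samples, and 33.029 ms divided by 20 is about 1.651 ms, well under the 15.0 ms gate.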

Release notes

# auto-detect previous tag to HEAD
./scripts/generate_release_notes.sh

# explicit range and output file
./scripts/generate_release_notes.sh v0.3.9 v0.3.10 ./output_tests/release_notes.md

API stability policy

See docs/API_POLICY.md.

License

MIT
