High-fidelity Word (.docx) to Markdown converter. Preserves tables (vMerge), footnotes, field codes, bibliography, bold/italic/underline, and numbered lists.

docx2md-cli

High-fidelity Word (.docx) to Markdown for documents where citations, tables, footnotes, and structure need to survive conversion.

Why This Exists

Most DOCX-to-Markdown tools do fine on simple prose, then fall over on the details that matter in real reports and papers. docx2md-cli exists to preserve Word-specific structure (field-code references, bibliography content controls, vertically merged tables, inline footnotes, and list numbering) so that the output needs minimal cleanup after conversion.

Feature Comparison

Feature docx2md-cli Pandoc MarkItDown mammoth
Bold / Italic / Underline
Footnotes (inline position)
Field codes ([N] refs) Partial
Bibliography (SDT)
Vertical merge (vMerge)
Split table detection
Numbered list distinction
Nested list levels
Image extraction + rename
YAML frontmatter

Quick Start

pip install docx2md-cli
docx2md input.docx
docx2md input.docx -o output.md

Optional frontmatter support:

pip install "docx2md-cli[frontmatter]"

CLI Reference

Basic usage:

docx2md input.docx
docx2md input.docx -o output.md --extract-images images
docx2md input.docx --skip-before-heading --no-frontmatter

All flags:

| Flag | Description | Example |
| --- | --- | --- |
| `-o`, `--output PATH` | Write Markdown to PATH. Use `-` for stdout. | `docx2md input.docx -o output.md` |
| `--extract-images DIR` | Extract embedded images and link them in Markdown. | `docx2md input.docx --extract-images images` |
| `--skip-before-heading` | Ignore content before the first real Word heading. | `docx2md input.docx --skip-before-heading` |
| `--frontmatter FILE` | Prepend custom YAML frontmatter from a file. | `docx2md input.docx --frontmatter meta.yaml` |
| `--no-frontmatter` | Disable both auto and custom frontmatter. | `docx2md input.docx --no-frontmatter` |
| `-q`, `--quiet` | Suppress stats output. | `docx2md input.docx -q` |
| `--json-stats` | Emit machine-readable stats JSON. | `docx2md input.docx --json-stats` |
| `-v`, `--version` | Print the installed version. | `docx2md --version` |

Streaming examples:

cat input.docx | docx2md - -o -
docx2md input.docx --json-stats
docx2md input.docx -o - --no-frontmatter

Python API

from docx2md_cli import convert

result = convert(
    "input.docx",
    output_path="output.md",
    images_dir="images",
    skip_before_heading=False,
    frontmatter_path=None,
    frontmatter_dict=None,
    no_frontmatter=False,
    print_stats=True,
    json_stats=False,
)

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| input_path | `str \| bytes \| …` | |
| output_path | `str \| None` | |
| images_dir | `str \| None` | |
| skip_before_heading | `bool` | Skip cover pages or prefatory content before Heading N. |
| frontmatter_path | `str \| None` | |
| frontmatter_dict | `dict \| None` | |
| no_frontmatter | `bool` | Disable frontmatter generation. |
| print_stats | `bool` | Emit conversion stats when writing output. |
| json_stats | `bool` | Emit stats as JSON instead of human-readable text. |
| stats_stream | `TextIO \| None` | |

Return value:

print(result.lines[:3])
print(result.stats["table_rows"])
print(result.as_json())

convert() returns ConvertResult, which is list-like for backward compatibility and also exposes .lines, .stats, and .as_json().

For AI Agents

Use stdout-friendly and machine-readable modes when chaining tools:

docx2md input.docx --json-stats
docx2md input.docx -q -o output.md
cat input.docx | docx2md - -o -

from docx2md_cli import convert

result = convert("input.docx", print_stats=False, no_frontmatter=True)
stats = result.stats
payload = result.as_json()

--quiet avoids human-oriented console output. --json-stats gives structured stats for automation. -o - writes Markdown to stdout. ConvertResult lets agents inspect lines and counters without reparsing terminal output.
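
The stdin/stdout mode can also be driven from Python when an agent already holds the document as bytes. The following is a minimal sketch, not part of the package: `build_command` and `docx_bytes_to_markdown` are hypothetical helper names, and it assumes that `-q` and `--no-frontmatter` can be combined with `- -o -` and that the Markdown on stdout is UTF-8.

```python
import subprocess


def build_command(quiet=True, no_frontmatter=True):
    """Build the argv for a stdin-to-stdout conversion (hypothetical helper)."""
    cmd = ["docx2md", "-", "-o", "-"]  # read .docx from stdin, write Markdown to stdout
    if quiet:
        cmd.append("-q")  # suppress human-oriented stats output
    if no_frontmatter:
        cmd.append("--no-frontmatter")
    return cmd


def docx_bytes_to_markdown(data, run=subprocess.run):
    """Pipe .docx bytes through the CLI and return the Markdown text.

    `run` is injectable so the call can be exercised without the CLI installed.
    Assumes the CLI writes UTF-8 Markdown to stdout.
    """
    proc = run(build_command(), input=data, capture_output=True, check=True)
    return proc.stdout.decode("utf-8")
```

With `check=True`, a non-zero exit status raises `subprocess.CalledProcessError`, so the caller does not have to parse console output to detect a failed conversion.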

Supported Languages

Caption matching currently recognizes:

  • Spanish: Figura, Tabla
  • English: Figure, Table
  • French: Tableau
  • German: Abbildung, Tabelle
  • Portuguese: Tabela
  • Italian: Tabella
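
To make the keyword list concrete, here is a rough sketch of how caption matching over these words could look. It is an illustration only, not the package's actual matcher: `parse_caption` is a hypothetical name, and the assumed caption shape (keyword, number, optional `.` or `:`, then a title) is a guess at typical Word captions.

```python
import re

# Keywords taken from the list above, grouped by what they caption.
CAPTION_WORDS = {
    "figure": ["Figura", "Figure", "Abbildung"],
    "table": ["Tabla", "Table", "Tableau", "Tabelle", "Tabela", "Tabella"],
}

# Reverse lookup: lowercase keyword -> kind.
WORD_TO_KIND = {
    word.lower(): kind for kind, words in CAPTION_WORDS.items() for word in words
}

# Longest keywords first so "Tableau" is tried before "Table".
_ALTERNATION = "|".join(sorted(WORD_TO_KIND, key=len, reverse=True))
_CAPTION = re.compile(
    r"^\s*(?P<word>" + _ALTERNATION + r")\s+(?P<num>\d+)\b[.:]?\s*(?P<title>.*)$",
    re.IGNORECASE,
)


def parse_caption(text):
    """Return (kind, number, title) for a caption line, or None if it is not one."""
    match = _CAPTION.match(text)
    if match is None:
        return None
    kind = WORD_TO_KIND[match.group("word").lower()]
    return kind, int(match.group("num")), match.group("title").strip()
```

A parsed caption such as `("figure", 12, "Mapa")` carries enough information to build a filename for an extracted image.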

Word heading detection intentionally follows the standard Heading N style names.
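
As an illustration of what following the standard names implies, the sketch below maps a paragraph style name to a Markdown heading. The function names and the exact pattern are assumptions for this example, not the package's code; the point is that a localized or custom style name does not count as a heading.

```python
import re

# Matches "Heading 1" .. "Heading 9", with or without the space, in any case.
_HEADING = re.compile(r"^heading\s*([1-9])$", re.IGNORECASE)


def heading_level(style_name):
    """Return 1-9 for a standard Heading N style name, else None."""
    match = _HEADING.match(style_name.strip())
    return int(match.group(1)) if match else None


def to_markdown_heading(style_name, text):
    """Prefix text with '#' marks when the style is a standard heading."""
    level = heading_level(style_name)
    return f"{'#' * level} {text}" if level else text
```

Under this sketch, a paragraph styled `Titre 1` would be emitted as body text, which is where `--skip-before-heading` behavior would also be affected.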

How It Works

The converter walks the Word document body in order instead of flattening everything to plain text. A field-code state machine preserves citation references, numbering.xml is read directly to distinguish ordered vs unordered lists and nested levels, and the table walker handles vMerge and split-table cases before emitting Markdown. Footnotes are collected from footnotes.xml, bibliography SDTs are extracted, and image filenames can be derived from nearby captions.
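
The field-code step can be pictured with a small state machine over one paragraph's XML. This is a simplified sketch using only the standard library, not the converter's real implementation: `walk_paragraph` is a hypothetical name, nested fields are not handled, and the sample paragraph is made up for the example. In WordprocessingML a field is bracketed by `w:fldChar` elements of type `begin`, `separate`, and `end`; the instruction sits between `begin` and `separate`, and the displayed result between `separate` and `end`.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def walk_paragraph(paragraph):
    """Return (visible_text, fields) for one w:p element.

    fields is a list of (instruction, result_text) pairs in document order.
    """
    state = "text"  # "text" -> "instr" -> "result" -> "text"
    visible, fields = [], []
    instr, result = [], []
    for node in paragraph.iter():  # document order
        if node.tag == W + "fldChar":
            kind = node.get(W + "fldCharType")
            if kind == "begin":
                state, instr, result = "instr", [], []
            elif kind == "separate":
                state = "result"
            elif kind == "end":
                fields.append(("".join(instr).strip(), "".join(result)))
                state = "text"
        elif node.tag == W + "instrText" and state == "instr":
            instr.append(node.text or "")
        elif node.tag == W + "t":
            if state == "result":
                result.append(node.text or "")
            if state != "instr":
                visible.append(node.text or "")  # keep the rendered [N] inline
    return "".join(visible), fields


# Made-up paragraph: "See [3]." where "[3]" is the result of a REF field.
SAMPLE = r"""<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t xml:space="preserve">See </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> REF _Ref123 \h </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>[3]</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t>.</w:t></w:r>
</w:p>"""

text, fields = walk_paragraph(ET.fromstring(SAMPLE))
```

Tracking the state explicitly keeps the field instruction out of the visible text while the rendered reference stays at its original position in the sentence.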

Contributing

Issues welcome. PRs welcome. Run pytest before submitting.

License

MIT
