High-fidelity Word (.docx) to Markdown converter. Preserves tables (vMerge), footnotes, field codes, bibliography, bold/italic/underline, and numbered lists.
docx2md-cli
High-fidelity Word (.docx) to Markdown for documents where citations, tables, footnotes, and structure need to survive conversion.
Why This Exists
Most DOCX-to-Markdown tools do fine on simple prose, then fall over on the details that matter in real reports and papers. docx2md-cli preserves Word-specific structure such as field-code references, bibliography content controls, vertically merged tables, inline footnotes, and list numbering, so the output needs minimal cleanup after conversion.
Feature Comparison
| Feature | docx2md-cli | Pandoc | MarkItDown | mammoth |
|---|---|---|---|---|
| Bold / Italic / Underline | ✅ | ✅ | ❌ | ✅ |
| Footnotes (inline position) | ✅ | ✅ | ❌ | ✅ |
| Field codes ([N] refs) | ✅ | Partial | ❌ | ❌ |
| Bibliography (SDT) | ✅ | ❌ | ❌ | ❌ |
| Vertical merge (vMerge) | ✅ | ❌ | ❌ | ❌ |
| Split table detection | ✅ | ❌ | ❌ | ❌ |
| Numbered list distinction | ✅ | ✅ | ❌ | ❌ |
| Nested list levels | ✅ | ✅ | ❌ | ❌ |
| Image extraction + rename | ✅ | ✅ | ❌ | ❌ |
| YAML frontmatter | ✅ | ❌ | ❌ | ❌ |
Quick Start
pip install docx2md-cli
docx2md input.docx
docx2md input.docx -o output.md
Optional frontmatter support:
pip install "docx2md-cli[frontmatter]"
CLI Reference
Basic usage:
docx2md input.docx
docx2md input.docx -o output.md --extract-images images
docx2md input.docx --skip-before-heading --no-frontmatter
All flags:
| Flag | Description | Example |
|---|---|---|
| -o, --output PATH | Write Markdown to PATH. Use - for stdout. | docx2md input.docx -o output.md |
| --extract-images DIR | Extract embedded images and link them in Markdown. | docx2md input.docx --extract-images images |
| --skip-before-heading | Ignore content before the first real Word heading. | docx2md input.docx --skip-before-heading |
| --frontmatter FILE | Prepend custom YAML frontmatter from a file. | docx2md input.docx --frontmatter meta.yaml |
| --no-frontmatter | Disable both auto and custom frontmatter. | docx2md input.docx --no-frontmatter |
| -q, --quiet | Suppress stats output. | docx2md input.docx -q |
| --json-stats | Emit machine-readable stats JSON. | docx2md input.docx --json-stats |
| -v, --version | Print the installed version. | docx2md --version |
Streaming examples:
cat input.docx | docx2md - -o -
docx2md input.docx --json-stats
docx2md input.docx -o - --no-frontmatter
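When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags documented above; the helper names `build_command` and `run_if_installed` are hypothetical and are not part of docx2md-cli.

```python
import shutil
import subprocess


def build_command(input_path, output=None, images_dir=None,
                  no_frontmatter=False, quiet=False, json_stats=False):
    """Assemble a docx2md argument list from the documented flags."""
    cmd = ["docx2md", input_path]
    if output is not None:
        cmd += ["-o", output]  # "-" selects stdout
    if images_dir is not None:
        cmd += ["--extract-images", images_dir]
    if no_frontmatter:
        cmd.append("--no-frontmatter")
    if quiet:
        cmd.append("--quiet")
    if json_stats:
        cmd.append("--json-stats")
    return cmd


def run_if_installed(cmd):
    """Run the command, or return None when docx2md is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=True)


# Equivalent to: docx2md input.docx -o - --no-frontmatter
cmd = build_command("input.docx", output="-", no_frontmatter=True)
print(cmd)
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.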
Python API
from docx2md_cli import convert
result = convert(
    "input.docx",
    output_path="output.md",
    images_dir="images",
    skip_before_heading=False,
    frontmatter_path=None,
    frontmatter_dict=None,
    no_frontmatter=False,
    print_stats=True,
    json_stats=False,
)
Parameters:
| Parameter | Type | Description |
|---|---|---|
| input_path | `str \| bytes \| …` | Word document to convert. |
| output_path | `str \| None` | Write Markdown to this path. |
| images_dir | `str \| None` | Extract embedded images into this directory and link them in Markdown. |
| skip_before_heading | bool | Skip cover pages or prefatory content before Heading N. |
| frontmatter_path | `str \| None` | Prepend custom YAML frontmatter from a file. |
| frontmatter_dict | `dict \| None` | |
| no_frontmatter | bool | Disable frontmatter generation. |
| print_stats | bool | Emit conversion stats when writing output. |
| json_stats | bool | Emit stats as JSON instead of human-readable text. |
| stats_stream | `TextIO \| None` | |
Return value:
print(result.lines[:3])
print(result.stats["table_rows"])
print(result.as_json())
convert() returns ConvertResult, which is list-like for backward compatibility and also exposes .lines, .stats, and .as_json().
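To illustrate what "list-like with extra attributes" can mean in practice, here is a minimal sketch of such a result type. It is not the library's actual ConvertResult: the class name `ResultSketch` is hypothetical, and the shape of the `as_json()` payload (a JSON object with `lines` and `stats` keys) is an assumption made for the example.

```python
import json


class ResultSketch(list):
    """Hypothetical stand-in for a list-like conversion result."""

    def __init__(self, lines, stats):
        super().__init__(lines)   # the Markdown lines are the list items
        self.stats = dict(stats)  # counters collected during conversion

    @property
    def lines(self):
        """The same lines as a plain list."""
        return list(self)

    def as_json(self):
        """Serialize lines and stats; this payload shape is assumed."""
        return json.dumps({"lines": self.lines, "stats": self.stats})


result = ResultSketch(["# Title", "", "Body text."], {"table_rows": 0})
print(result[0])                   # older code can keep indexing the result
print(result.lines[:3])
print(result.stats["table_rows"])
print(result.as_json())
```

Subclassing `list` is one way to keep older callers working while adding named attributes for newer ones.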
For AI Agents
Use stdout-friendly and machine-readable modes when chaining tools:
docx2md input.docx --json-stats
docx2md input.docx -q -o output.md
cat input.docx | docx2md - -o -
from docx2md_cli import convert
result = convert("input.docx", print_stats=False, no_frontmatter=True)
stats = result.stats
payload = result.as_json()
--quiet avoids human-oriented console output. --json-stats gives structured stats for automation. -o - writes Markdown to stdout. ConvertResult lets agents inspect lines and counters without reparsing terminal output.
Supported Languages
Caption matching currently recognizes:
- Spanish: Figura, Tabla
- English: Figure, Table
- French: Tableau
- German: Abbildung, Tabelle
- Portuguese: Tabela
- Italian: Tabella
Word heading detection intentionally follows the standard Heading N style names.
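The caption keywords listed above could be matched with a single regular expression. The sketch below is an illustration only, not the library's code: the case-sensitive match, the number format, and the slug scheme used to turn a caption into a file name are all assumptions, and `parse_caption` and `caption_slug` are hypothetical names.

```python
import re

# Caption keywords from the list above.
CAPTION_WORDS = [
    "Figura", "Tabla", "Figure", "Table", "Tableau",
    "Abbildung", "Tabelle", "Tabela", "Tabella",
]

# Longest keywords first, so "Tableau" is tried before "Table".
_alternatives = "|".join(sorted(CAPTION_WORDS, key=len, reverse=True))
CAPTION_RE = re.compile(rf"^\s*({_alternatives})\s+(\d+(?:\.\d+)*)[.:]?\s*(.*)$")


def parse_caption(text):
    """Return kind, number, and title for a caption line, else None."""
    m = CAPTION_RE.match(text)
    if m is None:
        return None
    return {"kind": m.group(1), "number": m.group(2), "title": m.group(3).strip()}


def caption_slug(caption):
    """Build a file-name-friendly slug (assumed scheme, ASCII only)."""
    raw = f"{caption['kind']} {caption['number']} {caption['title']}".lower()
    return re.sub(r"[^a-z0-9]+", "-", raw).strip("-")


cap = parse_caption("Figure 3. Pipeline overview")
print(cap)
print(caption_slug(cap))
print(parse_caption("Tables are useful."))  # not a caption
```

This slug scheme drops non-ASCII letters, so a real implementation would likely need transliteration for accented captions.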
How It Works
The converter walks the Word document body in order instead of flattening everything to plain text. A field-code state machine preserves citation references, numbering.xml is read directly to distinguish ordered vs unordered lists and nested levels, and the table walker handles vMerge and split-table cases before emitting Markdown. Footnotes are collected from footnotes.xml, bibliography SDTs are extracted, and image filenames can be derived from nearby captions.
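As an illustration of the field-code step, the sketch below walks the runs of one paragraph and tracks whether it is inside a field instruction or a field result. It relies on the standard WordprocessingML markers (`w:fldChar` with `begin`, `separate`, and `end`, plus `w:instrText`), but it is a simplified sketch rather than the converter's own state machine: nested fields are not handled, and the sample XML and the name `walk_paragraph` are made up for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample paragraph containing one REF field.
SAMPLE = r"""<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t xml:space="preserve">See </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> REF _Ref123 \h </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>[3]</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t xml:space="preserve"> for details.</w:t></w:r>
</w:p>"""


def walk_paragraph(p):
    """Return (visible_text, [(instruction, result), ...]) for a paragraph."""
    state = "text"  # cycles: text -> instr -> result -> text
    visible, fields = [], []
    instr, result = [], []
    for run in p.iter(f"{{{W}}}r"):
        for node in run:
            tag = node.tag.split("}")[-1]
            if tag == "fldChar":
                kind = node.get(f"{{{W}}}fldCharType")
                if kind == "begin":
                    state, instr, result = "instr", [], []
                elif kind == "separate":
                    state = "result"
                elif kind == "end":
                    fields.append(("".join(instr).strip(), "".join(result)))
                    state = "text"
            elif tag == "instrText" and state == "instr":
                instr.append(node.text or "")
            elif tag == "t" and state != "instr":
                if state == "result":
                    result.append(node.text or "")
                visible.append(node.text or "")  # result text stays in the output
    return "".join(visible), fields


text, fields = walk_paragraph(ET.fromstring(SAMPLE))
print(text)
print(fields)
```

Keeping the field result in the visible text is what lets a reference such as [3] survive conversion, while the captured instruction records which bookmark it pointed to.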
Contributing
Issues welcome. PRs welcome. Run pytest before submitting.
License
MIT
File details
Details for the file docx2md_cli-0.2.0.tar.gz.
File metadata
- Download URL: docx2md_cli-0.2.0.tar.gz
- Upload date:
- Size: 20.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a77ebb43c612018ee3e08570def3334658eee3f5594f0251262efd6e21fa30e |
| MD5 | 402aa16ba5e3fee80dbf937f0e63bc08 |
| BLAKE2b-256 | 921ea37bd7417378215832afa44d2c8916024914a83615b30233948684cb79d5 |
File details
Details for the file docx2md_cli-0.2.0-py3-none-any.whl.
File metadata
- Download URL: docx2md_cli-0.2.0-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a749dfa4bee35711345f80463084d2d95b1d307e7d0137dd7a8c4595b588a039 |
| MD5 | 3989cc978a00abb8d4148c177f3dd10d |
| BLAKE2b-256 | ab952ae9ed17130392a39ceec6807dcb156fe8cd8a47331eb3268f44ed9e287b |