A Python tool for converting Microsoft Word documents (.docx/.doc) to Markdown format

DOCX to Markdown Converter

A Python tool that converts Microsoft Word documents (.docx, and legacy .doc via LibreOffice) to Markdown.

Features

✅ Support for heading conversion (H1-H6)
✅ Support for paragraph text
✅ Support for bold, italic, underline formatting
✅ Support for ordered and unordered lists
✅ Support for table conversion
✅ Support for image extraction and conversion
✅ Automatic folder structure creation
✅ Automatic blank line and format cleanup
✅ Command line interface
✅ Batch conversion support
✅ Smart title handling with proper heading level adjustment
✅ Intelligent formatting merge (e.g., adjacent underline tags)
✅ Font-size based heading detection (when no heading styles are present)
✅ Legacy .doc support via LibreOffice conversion

Installation

Install from source

git clone https://github.com/HNRobert/word2md.git
cd word2md
pip install -e .

Install dependencies only

pip install -r requirements.txt

Or install directly:

pip install python-docx

Optional: legacy `.doc` support

The python-docx library cannot read .doc files directly. This project supports .doc by first converting the file to a temporary .docx with LibreOffice.

macOS: brew install --cask libreoffice
Ensure the soffice command is available in your PATH (LibreOffice installs it).
Alternatively, you can set the word2md_SOFFICE_PATH environment variable to the full path of your LibreOffice soffice executable (useful on Windows or custom installs).

Examples:

macOS / Linux (bash/zsh):

# export the path to soffice binary
export word2md_SOFFICE_PATH=/Applications/LibreOffice.app/Contents/MacOS/soffice

Windows (PowerShell):

# set environment variable for current session
$env:word2md_SOFFICE_PATH = 'C:\Program Files\LibreOffice\program\soffice.exe'
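
The lookup and conversion described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the helper names `find_soffice` and `build_convert_command` are hypothetical, and the sketch assumes the environment variable takes precedence over PATH. The `--headless`, `--convert-to`, and `--outdir` options are standard LibreOffice command line flags.

```python
import os
import shutil
from typing import List, Mapping, Optional


def find_soffice(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the soffice path from word2md_SOFFICE_PATH, else search PATH.

    Hypothetical helper: the precedence order is an assumption.
    """
    env = os.environ if env is None else env
    override = env.get("word2md_SOFFICE_PATH")
    if override:
        return override
    return shutil.which("soffice")


def build_convert_command(soffice: str, doc_path: str, out_dir: str) -> List[str]:
    """Build a headless LibreOffice command that writes a .docx into out_dir."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", out_dir,
        doc_path,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; pointing `out_dir` at a temporary directory keeps the intermediate .docx out of the working tree.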

Usage

Command Line Tool

After installation, you can use the word2md command:

# Convert single file
word2md document.docx

# Convert legacy .doc (requires LibreOffice)
word2md document.doc

# Specify output file
word2md document.docx -o output.md

# Show verbose output
word2md document.docx -v

# Batch conversion
word2md *.docx -o output_directory/

Python Script

You can also run the converter directly:

# Convert single file to auto-generated folder structure
python main.py document.docx

# Convert legacy .doc (requires LibreOffice)
python main.py document.doc

# Specify output file
python main.py document.docx -o output.md

# Show verbose output
python main.py document.docx -o output.md -v

Advanced Usage

# Batch conversion to output directory
python main.py *.docx -o output_directory/

# Output to stdout
python main.py document.docx

Project Structure

The project is organized as a modular package:

word2md/
├── main.py                    # Main entry point
├── docx_converter/            # Main package
│   ├── __init__.py           # Package initialization
│   ├── cli.py                # Command line interface
│   ├── converter.py          # Main converter class
│   ├── document_processor.py # Document processing logic
│   ├── paragraph_processor.py # Paragraph processing
│   ├── formatting.py         # Text formatting (bold, italic, etc.)
│   ├── list_processor.py     # List handling
│   ├── table_processor.py    # Table conversion
│   ├── image_processor.py    # Image processing in paragraphs
│   ├── image_extractor.py    # Image extraction from DOCX
│   └── utils.py              # Utility functions
├── assets/
│   └── sample.docx           # Sample test file
├── requirements.txt          # Dependencies
└── README.md                # Documentation

Supported Formats

Text Formatting

Bold → **Bold**
Italic → *Italic*
Underline → <u>Underline</u>
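
A minimal sketch of how this mapping, together with the merging of adjacent underline tags mentioned under Features, might look. The functions `format_run` and `format_runs` and the tuple layout are hypothetical and are not taken from `formatting.py`.

```python
from typing import Iterable, Tuple

# Hypothetical run representation: (text, bold, italic, underline)
Run = Tuple[str, bool, bool, bool]


def format_run(text: str, bold: bool, italic: bool, underline: bool) -> str:
    """Wrap one run of text in Markdown/HTML markers."""
    if not text:
        return ""
    if italic:
        text = f"*{text}*"
    if bold:
        text = f"**{text}**"
    if underline:
        text = f"<u>{text}</u>"
    return text


def format_runs(runs: Iterable[Run]) -> str:
    """Format all runs, then merge underline tags that touch each other."""
    joined = "".join(format_run(*run) for run in runs)
    # "<u>a</u><u>b</u>" becomes "<u>ab</u>"
    return joined.replace("</u><u>", "")
```

Word often splits visually uniform text into several runs, which is why merging after formatting tends to produce cleaner output than formatting each run in isolation.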

Heading Detection

The converter supports multiple methods for detecting headings:

Style-based detection: Converts Word heading styles (Heading 1-6, Title) to Markdown headings
Font-size based detection: When no heading styles are present, automatically detects headings based on font size hierarchy
- Analyses all paragraphs with uniform font sizes
- Determines the baseline font size (most common size, usually normal text)
- Assigns heading levels to larger font sizes in descending order
- Example: If baseline is 12pt, then 18pt → # (H1), 16pt → ## (H2), 14pt → ### (H3)
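
The font-size steps above can be sketched as below. This is an assumption-laden illustration rather than the converter's implementation: the function name `heading_levels_by_font_size` is hypothetical, and capping at six levels is an assumption based on Markdown's H1-H6 range.

```python
from collections import Counter
from typing import Dict, List


def heading_levels_by_font_size(sizes_pt: List[float], max_level: int = 6) -> Dict[float, int]:
    """Map each font size larger than the baseline to a heading level.

    sizes_pt holds one font size per paragraph (paragraphs with mixed
    sizes are assumed to have been filtered out already).
    """
    if not sizes_pt:
        return {}
    # The most common size is treated as normal body text.
    baseline = Counter(sizes_pt).most_common(1)[0][0]
    # Larger sizes, biggest first, become H1, H2, H3, ...
    larger = sorted({size for size in sizes_pt if size > baseline}, reverse=True)
    return {size: level for level, size in enumerate(larger[:max_level], start=1)}
```

With a 12pt baseline and paragraphs at 18pt, 16pt, and 14pt, the mapping assigns levels 1, 2, and 3 respectively, matching the example above.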

Headings

Word heading styles → Markdown headings (# ## ### etc.)
Smart title handling: When a "Title" style is present, all other headings are automatically adjusted down one level
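
One way to express the title adjustment is sketched here. The function `markdown_heading_level` is hypothetical, and clamping shifted headings at level 6 is an assumption, since the behaviour for a shifted Heading 6 is not documented.

```python
import re
from typing import Optional


def markdown_heading_level(style_name: str, has_title: bool) -> Optional[int]:
    """Return the Markdown heading level for a Word style name, or None.

    has_title says whether the document contains a "Title" paragraph.
    """
    if style_name == "Title":
        return 1
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    if match is None:
        return None  # not a heading style
    level = int(match.group(1)) + (1 if has_title else 0)
    return min(level, 6)  # assumption: never exceed H6
```

A heading line is then `"#" * level + " " + text`, so in a document with a Title, "Heading 1" renders with two `#` characters.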

Lists

Unordered lists (•, -, * etc.) → - Item
Ordered lists (1., 2., etc.) → 1. Item
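
As a rough illustration of the marker normalization, the sketch below works on plain text lines. It is an assumption that detection can be done from the leading marker alone; Word also stores list membership in numbering properties, which `list_processor.py` may use instead. The name `normalize_list_line` is hypothetical.

```python
import re

# Leading bullet characters followed by whitespace
_BULLET = re.compile(r"^\s*[•\-*]\s+(.*)$")
# Leading number followed by "." or ")" and whitespace
_NUMBERED = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")


def normalize_list_line(line: str) -> str:
    """Rewrite a line that starts with a list marker into Markdown list syntax."""
    match = _BULLET.match(line)
    if match:
        return f"- {match.group(1)}"
    match = _NUMBERED.match(line)
    if match:
        return f"{match.group(1)}. {match.group(2)}"
    return line  # not a list item; leave unchanged
```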

Tables

Word tables → Markdown table format
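
A minimal sketch of rendering table cells as a Markdown pipe table. It assumes the first row is the header and that cell text has already been read from the document; `rows_to_markdown` is a hypothetical name, not the API of `table_processor.py`.

```python
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows of cell text as a Markdown pipe table (first row = header)."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(text: str) -> str:
        # Pipes and newlines would break the table layout.
        return text.replace("|", "\\|").replace("\n", " ").strip()

    def render(row: List[str]) -> str:
        padded = list(row) + [""] * (width - len(row))
        return "| " + " | ".join(clean(cell) for cell in padded) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([render(header), separator] + [render(row) for row in body])
```

Merged cells and nested tables have no pipe-table equivalent, which is one reason complex layouts may need manual cleanup.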

Images

Images are extracted from the DOCX automatically
Extracted images are saved to an assets/ directory inside a folder named after the document
Create proper image references in Markdown: ![Image](./assets/image_001.png)
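
Because a .docx file is a ZIP archive that stores embedded images under `word/media/`, extraction can be sketched with the standard library alone. The function `extract_images` is hypothetical, and numbering images in archive-name order is an assumption; the real `image_extractor.py` may instead follow the order in which images appear in the document.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_images(docx_path: str, assets_dir: str) -> List[str]:
    """Copy embedded images into assets_dir as image_001.<ext>, image_002.<ext>, ..."""
    out = Path(assets_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        media = sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )
        for index, name in enumerate(media, start=1):
            suffix = Path(name).suffix.lower()
            target = out / f"image_{index:03d}{suffix}"
            target.write_bytes(archive.read(name))
            saved.append(target.name)
    return saved
```

Each returned file name can then be referenced from the Markdown as `![Image](./assets/<name>)`.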

Output Structure

After conversion, the following structure is created:

document_name/
├── document_name.md
└── assets/
    ├── image_001.jpg
    ├── image_002.png
    └── ...

Example

Input (DOCX)

A document with the following structure:

Title style: "TEST DOC"
Heading 1: "Title 1"
Heading 2: "Title 2"
Heading 3: "Title 3"
Various text formatting including bold, italic, and underlined text

Output (Markdown)

# TEST DOC

## Title 1

### Title 2

#### Title 3

This is a paragraph with **bold text**, _italic text_, and <u>underlined text</u>.

- Unordered list item 1
- Unordered list item 2

1. Ordered list item 1
2. Ordered list item 2

![Image](./assets/image_001.jpg)

Development

Architecture Benefits

Modular Design: Each component has a single responsibility
Easy Testing: Individual modules can be tested independently
Maintainable: Clear separation of concerns
Extensible: Easy to add new features or modify existing ones

Key Modules

DocxToMarkdownConverter: Main orchestrator class
DocumentProcessor: Handles document-level processing and title detection
ParagraphProcessor: Manages paragraph conversion and formatting
ImageExtractor: Extracts and maps images from DOCX files
ListProcessor: Handles ordered and unordered list conversion
TableProcessor: Converts Word tables to Markdown format
TextFormatter: Handles text formatting (bold, italic, underline)

Extending Functionality

The modular structure makes it easy to extend functionality:

Adding New Text Formatting

Edit docx_converter/formatting.py to add support for new text styles.

Supporting New List Types

Modify docx_converter/list_processor.py to handle different list formats.

Enhancing Image Processing

Update docx_converter/image_processor.py and docx_converter/image_extractor.py for advanced image handling.

Custom Document Elements

Add new processors in the docx_converter/ directory and integrate them via document_processor.py.

Development Workflow

Install dependencies: pip install -r requirements.txt
Run a smoke test against the sample file: python main.py assets/sample.docx
Add new features in appropriate modules
Test with various DOCX files
Update documentation

Notes

The converter primarily supports basic document formatting; complex formatting may require manual adjustment
Images are automatically extracted and saved to the assets folder
Complex table layouts may need manual optimization
Some Word-specific formats have no equivalent in Markdown and will be simplified

License

MIT License

Contributing

Issues and Pull Requests are welcome to improve this converter.

Release history

1.0.3: Mar 28, 2026
1.0.2: Feb 6, 2026
1.0.1: Feb 6, 2026
1.0.0: Feb 6, 2026 (this version)

