A Python tool for converting Microsoft Word documents (.docx/.doc) to Markdown format
DOCX to Markdown Converter
A Python-based Word to Markdown converter for Microsoft Word documents.
Features
- ✅ Support for heading conversion (H1-H6)
- ✅ Support for paragraph text
- ✅ Support for bold, italic, underline formatting
- ✅ Support for ordered and unordered lists
- ✅ Support for table conversion
- ✅ Support for image extraction and conversion
- ✅ Automatic folder structure creation
- ✅ Automatic blank line and format cleanup
- ✅ Command line interface
- ✅ Batch conversion support
- ✅ Smart title handling with proper heading level adjustment
- ✅ Intelligent formatting merge (e.g., adjacent underline tags)
- ✅ Font-size based heading detection (when no heading styles are present)
- ✅ Legacy .doc support via LibreOffice conversion
Installation
Install from source
git clone https://github.com/HNRobert/word2md.git
cd word2md
pip install -e .
Install dependencies only
pip install -r requirements.txt
Or install directly:
pip install python-docx
Optional: legacy .doc support
The python-docx library cannot read .doc files directly. This project supports .doc by first converting the file to a temporary .docx using LibreOffice.
- macOS: brew install --cask libreoffice
- Ensure the soffice command is available in your PATH (LibreOffice installs it).
- Alternatively, you can set the WORD2MD_SOFFICE_PATH environment variable to the full path of your LibreOffice soffice executable (useful on Windows or custom installs).
Examples:
- macOS / Linux (bash/zsh):
# export the path to soffice binary
export WORD2MD_SOFFICE_PATH=/Applications/LibreOffice.app/Contents/MacOS/soffice
- Windows (PowerShell):
# set environment variable for current session
$env:WORD2MD_SOFFICE_PATH = 'C:\Program Files\LibreOffice\program\soffice.exe'
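The lookup and conversion described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and the precedence (environment variable first, then PATH) is an assumption. The `--headless`, `--convert-to`, and `--outdir` options are standard LibreOffice command-line flags.

```python
import os
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def find_soffice(env=None) -> Optional[str]:
    """Return a path to soffice, or None if it cannot be located.

    Assumption: WORD2MD_SOFFICE_PATH takes precedence over PATH lookup.
    """
    env = os.environ if env is None else env
    override = env.get("WORD2MD_SOFFICE_PATH")
    if override:
        return override
    return shutil.which("soffice")


def build_convert_command(soffice: str, doc_path: Path, out_dir: Path) -> list:
    """Build the LibreOffice command line that converts a file to .docx."""
    return [soffice, "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir), str(doc_path)]


def convert_doc_to_docx(doc_path: Path) -> Path:
    """Convert a legacy .doc into a temporary .docx and return its path."""
    soffice = find_soffice()
    if soffice is None:
        raise RuntimeError(
            "LibreOffice not found: install it or set WORD2MD_SOFFICE_PATH"
        )
    out_dir = Path(tempfile.mkdtemp(prefix="word2md_"))
    subprocess.run(build_convert_command(soffice, doc_path, out_dir), check=True)
    # LibreOffice writes <stem>.docx into the output directory.
    return out_dir / (doc_path.stem + ".docx")
```

The resulting temporary .docx can then be handed to the normal conversion path and deleted afterwards.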
Usage
Command Line Tool
After installation, you can use the word2md command:
# Convert single file
word2md document.docx
# Convert legacy .doc (requires LibreOffice)
word2md document.doc
# Specify output file
word2md document.docx -o output.md
# Show verbose output
word2md document.docx -v
# Batch conversion
word2md *.docx -o output_directory/
Python Script
You can also run the converter directly:
# Convert single file to auto-generated folder structure
python main.py document.docx
# Convert legacy .doc (requires LibreOffice)
python main.py document.doc
# Specify output file
python main.py document.docx -o output.md
# Show verbose output
python main.py document.docx -o output.md -v
Advanced Usage
# Batch conversion to output directory
python main.py *.docx -o output_directory/
# Output to stdout
python main.py document.docx
Project Structure
The project is now organized as a modular package:
word2md/
├── main.py # Main entry point
├── docx_converter/ # Main package
│ ├── __init__.py # Package initialization
│ ├── cli.py # Command line interface
│ ├── converter.py # Main converter class
│ ├── document_processor.py # Document processing logic
│ ├── paragraph_processor.py # Paragraph processing
│ ├── formatting.py # Text formatting (bold, italic, etc.)
│ ├── list_processor.py # List handling
│ ├── table_processor.py # Table conversion
│ ├── image_processor.py # Image processing in paragraphs
│ ├── image_extractor.py # Image extraction from DOCX
│ └── utils.py # Utility functions
├── assets/
│ └── sample.docx # Sample test file
├── requirements.txt # Dependencies
└── README.md # Documentation
Supported Formats
Text Formatting
- Bold → **Bold**
- Italic → *Italic*
- Underline → <u>Underline</u>
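As a rough sketch of this mapping, the snippet below wraps a run of text in the corresponding markers and merges adjacent underline tags, as mentioned under Features. The function names are hypothetical and the nesting order of the markers is an assumption; the real logic lives in docx_converter/formatting.py and may differ.

```python
import re


def format_run(text, bold=False, italic=False, underline=False):
    """Wrap text in Markdown/HTML markers for the given formatting flags."""
    if not text.strip():
        # Whitespace-only runs are left alone to avoid markers like "** **".
        return text
    if italic:
        text = f"*{text}*"
    if bold:
        text = f"**{text}**"
    if underline:
        text = f"<u>{text}</u>"
    return text


def merge_adjacent_underlines(markdown):
    """Collapse '</u><u>' pairs (optionally separated by whitespace)."""
    return re.sub(r"</u>(\s*)<u>", r"\1", markdown)


line = format_run("under", underline=True) + " " + format_run("lined", underline=True)
print(merge_adjacent_underlines(line))
```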
Heading Detection
The converter supports multiple methods for detecting headings:
- Style-based detection: Converts Word heading styles (Heading 1-6, Title) to Markdown headings
- Font-size based detection: When no heading styles are present, automatically detects headings based on font size hierarchy
  - Analyses all paragraphs with uniform font sizes
  - Determines the baseline font size (most common size, usually normal text)
  - Assigns heading levels to larger font sizes in descending order
  - Example: If baseline is 12pt, then 18pt → # (H1), 16pt → ## (H2), 14pt → ### (H3)
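The font-size heuristic can be illustrated with a short sketch. It takes one font size per paragraph, treats the most common size as the baseline, and ranks the larger sizes in descending order. The function name and the cap at six levels are assumptions made for illustration, not the project's actual implementation.

```python
from collections import Counter


def heading_levels_by_font_size(sizes_pt, max_level=6):
    """Map each font size above the baseline to a heading level (1 = largest)."""
    if not sizes_pt:
        return {}
    # Baseline: the most frequent size, which is usually body text.
    baseline = Counter(sizes_pt).most_common(1)[0][0]
    # Distinct sizes above the baseline, largest first.
    larger = sorted({s for s in sizes_pt if s > baseline}, reverse=True)
    # Assumption: sizes beyond max_level are not treated as headings.
    return {size: level for level, size in enumerate(larger[:max_level], start=1)}


levels = heading_levels_by_font_size([18, 12, 12, 16, 12, 14, 12, 16, 12])
print(levels)  # {18: 1, 16: 2, 14: 3}
```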
Headings
- Word heading styles → Markdown headings (# ## ### etc.)
- Smart title handling: When a "Title" style is present, all other headings are automatically adjusted down one level
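A minimal sketch of the level adjustment, assuming the default English style names "Title" and "Heading N". The function names are hypothetical, and capping shifted headings at level 6 is an assumption, since Markdown has no deeper level.

```python
import re


def markdown_heading_level(style_name, has_title):
    """Return the Markdown heading level for a Word style, or None for body text."""
    if style_name == "Title":
        return 1
    match = re.fullmatch(r"Heading ([1-9])", style_name)
    if match is None:
        return None
    # When a Title is present it takes level 1, so every heading moves down one.
    level = int(match.group(1)) + (1 if has_title else 0)
    return min(level, 6)


def render_heading(text, style_name, has_title):
    """Prefix text with the right number of '#' characters, if it is a heading."""
    level = markdown_heading_level(style_name, has_title)
    return text if level is None else f"{'#' * level} {text}"


print(render_heading("TEST DOC", "Title", has_title=True))    # "# TEST DOC"
print(render_heading("Title 1", "Heading 1", has_title=True))  # "## Title 1"
```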
Lists
- Unordered lists (•, -, * etc.) → - Item
- Ordered lists (1., 2., etc.) → 1. Item
Tables
- Word tables → Markdown table format
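To illustrate the target format, here is a small sketch that renders rows of cell text as a Markdown pipe table. It assumes the first row is the header, pads short rows, and escapes pipe characters; the function name is hypothetical and the project's table_processor.py may handle these cases differently.

```python
def rows_to_markdown_table(rows):
    """Render a list of rows (lists of cell strings) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row):
        # Escape pipes and flatten line breaks so each row stays on one line.
        cells = [str(c).replace("|", "\\|").replace("\n", " ").strip() for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [render(row) for row in body]
    return "\n".join(lines)


print(rows_to_markdown_table([["Name", "Qty"], ["Apple", "3"], ["Pear"]]))
```

Merged cells and nested tables have no direct pipe-table equivalent, which is why complex layouts may still need manual cleanup.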
Images
- Automatic extraction of images from DOCX
- Save to assets/ directory under document name folder
- Create proper image references in Markdown
Output Structure
After conversion, the following structure is created:
document_name/
├── document_name.md
└── assets/
    ├── image_001.jpg
    ├── image_002.png
    └── ...
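The layout above can be reproduced with the standard library alone, because a .docx file is a ZIP archive that stores embedded images under word/media/. The sketch below is an illustration under stated assumptions: the function name is hypothetical, images are numbered in sorted archive-name order (which is not necessarily document order), and the project's own image_extractor.py may work differently.

```python
import zipfile
from pathlib import Path


def extract_images(docx_path, assets_dir):
    """Copy embedded images into assets_dir as image_001.ext, image_002.ext, ...

    Returns a dict mapping the archive member name to the written file path.
    """
    assets_dir = Path(assets_dir)
    assets_dir.mkdir(parents=True, exist_ok=True)
    mapping = {}
    with zipfile.ZipFile(docx_path) as archive:
        media = sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )
        for index, name in enumerate(media, start=1):
            suffix = Path(name).suffix.lower()
            target = assets_dir / f"image_{index:03d}{suffix}"
            target.write_bytes(archive.read(name))
            mapping[name] = target
    return mapping
```

For example, `extract_images("report.docx", "report/assets")` would populate `report/assets/` next to a `report/report.md` output file.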
Example
Input (DOCX)
A document with the following structure:
- Title style: "TEST DOC"
- Heading 1: "Title 1"
- Heading 2: "Title 2"
- Heading 3: "Title 3"
- Various text formatting including bold, italic, and underlined text
Output (Markdown)
# TEST DOC
## Title 1
### Title 2
#### Title 3
This is a paragraph with **bold text**, _italic text_, and <u>underlined text</u>.
- Unordered list item 1
- Unordered list item 2
1. Ordered list item 1
2. Ordered list item 2

Development
Architecture Benefits
- Modular Design: Each component has a single responsibility
- Easy Testing: Individual modules can be tested independently
- Maintainable: Clear separation of concerns
- Extensible: Easy to add new features or modify existing ones
Key Modules
- DocxToMarkdownConverter: Main orchestrator class
- DocumentProcessor: Handles document-level processing and title detection
- ParagraphProcessor: Manages paragraph conversion and formatting
- ImageExtractor: Extracts and maps images from DOCX files
- ListProcessor: Handles ordered and unordered list conversion
- TableProcessor: Converts Word tables to Markdown format
- TextFormatter: Handles text formatting (bold, italic, underline)
Extending Functionality
The modular structure makes it easy to extend functionality:
Adding New Text Formatting
Edit docx_converter/formatting.py to add support for new text styles.
Supporting New List Types
Modify docx_converter/list_processor.py to handle different list formats.
Enhancing Image Processing
Update docx_converter/image_processor.py and docx_converter/image_extractor.py for advanced image handling.
Custom Document Elements
Add new processors in the docx_converter/ directory and integrate them via document_processor.py.
Development Workflow
- Install dependencies: pip install -r requirements.txt
- Run tests: python main.py assets/sample.docx
- Add new features in appropriate modules
- Test with various DOCX files
- Update documentation
Notes
- The converter primarily supports basic document formats; complex formatting may require manual adjustment
- Images are automatically extracted and saved to the assets folder
- Complex table layouts may need manual optimization
- Some Word-specific formats have no equivalent in Markdown and will be simplified
License
MIT License
Contributing
Issues and Pull Requests are welcome to improve this converter.