A Python tool for converting Microsoft Word documents (.docx/.doc) to Markdown format

DOCX to Markdown Converter

A Python-based converter that turns Microsoft Word documents into Markdown.

Features

  • ✅ Support for heading conversion (H1-H6)
  • ✅ Support for paragraph text
  • ✅ Support for bold, italic, underline formatting
  • ✅ Support for ordered and unordered lists
  • ✅ Support for table conversion
  • ✅ Support for image extraction and conversion
  • ✅ Automatic folder structure creation
  • ✅ Automatic blank line and format cleanup
  • ✅ Command line interface
  • ✅ Batch conversion support
  • ✅ Smart title handling with proper heading level adjustment
  • ✅ Intelligent formatting merge (e.g., adjacent underline tags)
  • ✅ Font-size based heading detection (when no heading styles are present)
  • ✅ Legacy .doc support via LibreOffice conversion

Installation

Install from source

git clone https://github.com/HNRobert/word2md.git
cd word2md
pip install -e .

Install dependencies only

pip install -r requirements.txt

Or install directly:

pip install python-docx

Optional: legacy .doc support

The python-docx library cannot read .doc files directly. This project supports .doc by first converting the file to a temporary .docx with LibreOffice.

  • macOS: brew install --cask libreoffice
  • Ensure the soffice command is available in your PATH (LibreOffice installs it).
  • Alternatively, you can set the word2md_SOFFICE_PATH environment variable to the full path of your LibreOffice soffice executable (useful on Windows or custom installs).

Examples:

  • macOS / Linux (bash/zsh):
# export the path to soffice binary
export word2md_SOFFICE_PATH=/Applications/LibreOffice.app/Contents/MacOS/soffice
  • Windows (PowerShell):
# set environment variable for current session
$env:word2md_SOFFICE_PATH = 'C:\Program Files\LibreOffice\program\soffice.exe'
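
The lookup described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `find_soffice` and `build_convert_command` are hypothetical names, and giving the environment variable priority over the PATH lookup is an assumption. The `--headless`, `--convert-to`, and `--outdir` options are standard LibreOffice command-line switches.

```python
import os
import shutil
from typing import List, Mapping, Optional


def find_soffice(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a path to the soffice executable, or None if it cannot be found.

    Assumption: word2md_SOFFICE_PATH, when set, wins over the PATH lookup.
    """
    env = os.environ if env is None else env
    override = env.get("word2md_SOFFICE_PATH")
    if override:
        return override
    # Fall back to whatever "soffice" resolves to on PATH (may be None).
    return shutil.which("soffice")


def build_convert_command(soffice: str, doc_path: str, out_dir: str) -> List[str]:
    """Build the LibreOffice command line that converts a .doc file to .docx."""
    return [
        soffice,
        "--headless",          # run without opening a window
        "--convert-to", "docx",
        "--outdir", out_dir,   # directory that receives the converted file
        doc_path,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; the converted file then appears in `out_dir` with a `.docx` extension.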

Usage

Command Line Tool

After installation, you can use the word2md command:

# Convert single file
word2md document.docx

# Convert legacy .doc (requires LibreOffice)
word2md document.doc

# Specify output file
word2md document.docx -o output.md

# Show verbose output
word2md document.docx -v

# Batch conversion
word2md *.docx -o output_directory/

Python Script

You can also run the converter directly:

# Convert single file to auto-generated folder structure
python main.py document.docx

# Convert legacy .doc (requires LibreOffice)
python main.py document.doc

# Specify output file
python main.py document.docx -o output.md

# Show verbose output
python main.py document.docx -o output.md -v

Advanced Usage

# Batch conversion to output directory
python main.py *.docx -o output_directory/

# Output to stdout
python main.py document.docx

Project Structure

The project is organized as a modular package:

word2md/
├── main.py                    # Main entry point
├── docx_converter/            # Main package
│   ├── __init__.py           # Package initialization
│   ├── cli.py                # Command line interface
│   ├── converter.py          # Main converter class
│   ├── document_processor.py # Document processing logic
│   ├── paragraph_processor.py # Paragraph processing
│   ├── formatting.py         # Text formatting (bold, italic, etc.)
│   ├── list_processor.py     # List handling
│   ├── table_processor.py    # Table conversion
│   ├── image_processor.py    # Image processing in paragraphs
│   ├── image_extractor.py    # Image extraction from DOCX
│   └── utils.py              # Utility functions
├── assets/
│   └── sample.docx           # Sample test file
├── requirements.txt          # Dependencies
└── README.md                # Documentation

Supported Formats

Text Formatting

  • Bold → **Bold**
  • Italic → *Italic*
  • Underline → <u>Underline</u>
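
As a rough sketch of the mapping above, together with the "adjacent underline tags" merge mentioned in the feature list, the logic might look like this. The function names are hypothetical and this is not the code in `formatting.py`; it only illustrates the idea under the assumption that each run carries simple bold/italic/underline flags.

```python
import re


def format_run(text: str, bold: bool = False, italic: bool = False,
               underline: bool = False) -> str:
    """Wrap one text run in Markdown/HTML markers according to its flags."""
    if not text.strip():
        return text  # leave whitespace-only runs undecorated
    if italic:
        text = f"*{text}*"
    if bold:
        text = f"**{text}**"
    if underline:
        text = f"<u>{text}</u>"
    return text


def merge_adjacent_underline(markdown: str) -> str:
    """Collapse '</u><u>' seams left by consecutive underlined runs."""
    # Keep any whitespace between the two runs, drop the redundant tags.
    return re.sub(r"</u>(\s*)<u>", r"\1", markdown)
```

For example, two underlined runs rendered as `<u>under</u><u>lined</u>` would be merged into a single `<u>underlined</u>` span.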

Heading Detection

The converter supports multiple methods for detecting headings:

  1. Style-based detection: Converts Word heading styles (Heading 1-6, Title) to Markdown headings
  2. Font-size based detection: When no heading styles are present, automatically detects headings based on font size hierarchy
    • Analyses all paragraphs with uniform font sizes
    • Determines the baseline font size (most common size, usually normal text)
    • Assigns heading levels to larger font sizes in descending order
    • Example: If baseline is 12pt, then 18pt → # (H1), 16pt → ## (H2), 14pt → ### (H3)
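
The font-size steps above can be sketched as follows. This is an illustrative reconstruction, not the project's implementation: `heading_levels_by_font_size` is a hypothetical name, and capping the result at six levels is an assumption based on the H1-H6 range.

```python
from collections import Counter
from typing import Dict, Iterable


def heading_levels_by_font_size(sizes: Iterable[float],
                                max_level: int = 6) -> Dict[float, int]:
    """Map font sizes larger than the baseline to heading levels.

    `sizes` holds one font size (in points) per uniformly sized paragraph.
    """
    sizes = list(sizes)
    if not sizes:
        return {}
    # The baseline is the most common size, usually normal body text.
    baseline = Counter(sizes).most_common(1)[0][0]
    # Larger sizes, biggest first: the biggest becomes H1, the next H2, ...
    larger = sorted({s for s in sizes if s > baseline}, reverse=True)
    return {size: level for level, size in enumerate(larger[:max_level], start=1)}


levels = heading_levels_by_font_size([12, 12, 12, 12, 18, 16, 16, 14])
print(levels)  # {18: 1, 16: 2, 14: 3}
```

Sizes at or below the baseline receive no heading level and would be emitted as ordinary paragraphs.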

Headings

  • Word heading styles → Markdown headings (# ## ### etc.)
  • Smart title handling: When a "Title" style is present, all other headings are automatically adjusted down one level
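
The title adjustment can be illustrated with a small sketch. The function name and the `(style_name, text)` input shape are hypothetical, and clamping at level 6 is an assumption; the project may handle deep headings differently.

```python
from typing import List, Tuple


def to_markdown_headings(headings: List[Tuple[str, str]]) -> List[str]:
    """Convert (style_name, text) pairs to Markdown heading lines.

    When any paragraph uses the "Title" style, it takes level 1 and every
    "Heading N" style is pushed down to level N + 1.
    """
    has_title = any(style == "Title" for style, _ in headings)
    lines = []
    for style, text in headings:
        if style == "Title":
            level = 1
        else:
            # "Heading 2" -> 2, shifted by one if a Title is present
            level = int(style.split()[-1]) + (1 if has_title else 0)
        level = min(level, 6)  # assumption: Markdown stops at H6
        lines.append("#" * level + " " + text)
    return lines
```

With a "Title" paragraph followed by "Heading 1" and "Heading 2", this yields `#`, `##`, and `###` headings respectively.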

Lists

  • Unordered lists (•, -, * etc.) → - Item
  • Ordered lists (1., 2., etc.) → 1. Item
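
A minimal sketch of the marker normalization is shown below. It assumes the list markers appear as literal text at the start of a line, which is a simplification: Word usually stores list membership in numbering properties rather than in the paragraph text, and `normalize_list_line` is a hypothetical helper, not part of `list_processor.py`.

```python
import re

# Bullet characters named above: •, -, *
BULLET = re.compile(r"^\s*[•\-*]\s+(.*)$")
# "1." or "1)" style numbering
NUMBERED = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")


def normalize_list_line(line: str) -> str:
    """Rewrite a list line to '- Item' or 'N. Item'; leave other lines alone."""
    match = BULLET.match(line)
    if match:
        return f"- {match.group(1)}"
    match = NUMBERED.match(line)
    if match:
        return f"{match.group(1)}. {match.group(2)}"
    return line
```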

Tables

  • Word tables → Markdown table format
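
For illustration, a table whose cell text has already been read into rows of strings could be rendered like this. The helper name is hypothetical, and treating the first row as the header is an assumption, since Markdown tables require a header row.

```python
from typing import List


def rows_to_markdown_table(rows: List[List[str]]) -> str:
    """Render rows of cell text as a Markdown table (first row = header)."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: List[str]) -> str:
        # Escape pipes and flatten line breaks so each row stays on one line.
        cells = [c.replace("|", "\\|").replace("\n", " ").strip() for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

With python-docx, the input rows can be gathered with `[[cell.text for cell in row.cells] for row in table.rows]`.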

Images

  • Automatic extraction of images from DOCX
  • Save to assets/ directory under document name folder
  • Create proper image references in Markdown: ![Image](./assets/image_001.png)

Output Structure

After conversion, the following structure is created:

document_name/
├── document_name.md
└── assets/
    ├── image_001.jpg
    ├── image_002.png
    └── ...
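
A .docx file is a ZIP archive that stores embedded images under `word/media/`, so the extraction step can be sketched with the standard library alone. This is a simplified illustration, not the code in `image_extractor.py`: `extract_images` is a hypothetical name, and numbering files in sorted archive order is an assumption that may not match the order in which images appear in the document.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_images(docx_path: str, assets_dir: str) -> List[str]:
    """Copy every file under word/media/ into assets_dir as image_NNN.<ext>.

    Returns the new file names in the order they were written.
    """
    out = Path(assets_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        media = sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )
        for index, name in enumerate(media, start=1):
            # Keep the original extension, renumber as image_001, image_002, ...
            target = out / f"image_{index:03d}{Path(name).suffix.lower()}"
            target.write_bytes(archive.read(name))
            saved.append(target.name)
    return saved
```

Each returned name can then be referenced from the Markdown file as `![Image](./assets/<name>)`.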

Example

Input (DOCX)

A document with the following structure:

  • Title style: "TEST DOC"
  • Heading 1: "Title 1"
  • Heading 2: "Title 2"
  • Heading 3: "Title 3"
  • Various text formatting including bold, italic, and underlined text

Output (Markdown)

# TEST DOC

## Title 1

### Title 2

#### Title 3

This is a paragraph with **bold text**, _italic text_, and <u>underlined text</u>.

- Unordered list item 1
- Unordered list item 2

1. Ordered list item 1
2. Ordered list item 2

![Image](./assets/image_001.jpg)

Development

Architecture Benefits

  • Modular Design: Each component has a single responsibility
  • Easy Testing: Individual modules can be tested independently
  • Maintainable: Clear separation of concerns
  • Extensible: Easy to add new features or modify existing ones

Key Modules

  • DocxToMarkdownConverter: Main orchestrator class
  • DocumentProcessor: Handles document-level processing and title detection
  • ParagraphProcessor: Manages paragraph conversion and formatting
  • ImageExtractor: Extracts and maps images from DOCX files
  • ListProcessor: Handles ordered and unordered list conversion
  • TableProcessor: Converts Word tables to Markdown format
  • TextFormatter: Handles text formatting (bold, italic, underline)

Extending Functionality

The modular structure makes it easy to extend functionality:

Adding New Text Formatting

Edit docx_converter/formatting.py to add support for new text styles.

Supporting New List Types

Modify docx_converter/list_processor.py to handle different list formats.

Enhancing Image Processing

Update docx_converter/image_processor.py and docx_converter/image_extractor.py for advanced image handling.

Custom Document Elements

Add new processors in the docx_converter/ directory and integrate them via document_processor.py.

Development Workflow

  1. Install dependencies: pip install -r requirements.txt
  2. Run a smoke test against the sample file: python main.py assets/sample.docx
  3. Add new features in appropriate modules
  4. Test with various DOCX files
  5. Update documentation

Notes

  1. The converter primarily supports basic document formats; complex formatting may require manual adjustment
  2. Images are automatically extracted and saved to the assets folder
  3. Complex table layouts may need manual optimization
  4. Some Word-specific formats have no equivalent in Markdown and will be simplified

License

MIT License

Contributing

Issues and Pull Requests are welcome to improve this converter.
