A Python tool for converting Microsoft Word documents (.docx/.doc) to Markdown format

MS Word to Markdown Converter

A Python-based converter that turns Microsoft Word documents into Markdown.

Features

  • Support for heading conversion (H1-H6)
  • Support for paragraph text
  • Support for bold, italic, underline formatting
  • Support for ordered and unordered lists
  • Support for table conversion
  • Support for image extraction and conversion
  • Automatic folder structure creation
  • Automatic blank line and format cleanup
  • Command line interface
  • Batch conversion support
  • Smart title handling with proper heading level adjustment
  • Intelligent formatting merge (e.g., adjacent underline tags)
  • Font-size based heading detection (when no heading styles are present)
  • Legacy .doc support via LibreOffice conversion

Installation

Install from source

git clone https://github.com/HNRobert/word2md.git
cd word2md
pip install -e .

Install dependencies only

pip install -r requirements.txt

Or install directly:

pip install python-docx

Optional: legacy .doc support

The python-docx library cannot read .doc files directly. This project supports .doc by first converting the file to a temporary .docx using LibreOffice.

  • macOS: brew install --cask libreoffice
  • Ensure the soffice command is available in your PATH (LibreOffice installs it).
  • Alternatively, you can set the WORD2MD_SOFFICE_PATH environment variable to the full path of your LibreOffice soffice executable (useful on Windows or custom installs).

Examples:

  • macOS / Linux (bash/zsh):
# export the path to soffice binary
export WORD2MD_SOFFICE_PATH=/Applications/LibreOffice.app/Contents/MacOS/soffice
  • Windows (PowerShell):
# set environment variable for current session
$env:WORD2MD_SOFFICE_PATH = 'C:\Program Files\LibreOffice\program\soffice.exe'
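
The lookup described above can be sketched in Python. This is only an illustration, not the project's actual code: the function names resolve_soffice and build_convert_command are hypothetical, and giving WORD2MD_SOFFICE_PATH priority over the PATH lookup is an assumption. The --headless, --convert-to and --outdir flags are standard LibreOffice command-line options.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional


def resolve_soffice(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a path to the soffice executable, or None if none is found."""
    # Assumption: an explicit WORD2MD_SOFFICE_PATH wins over the PATH lookup.
    explicit = env.get("WORD2MD_SOFFICE_PATH", "").strip()
    if explicit:
        return explicit
    return which("soffice")


def build_convert_command(soffice: str, doc_path: str, out_dir: str) -> List[str]:
    """Build the LibreOffice command that converts a .doc file to .docx."""
    return [soffice, "--headless", "--convert-to", "docx", "--outdir", out_dir, doc_path]
```

The returned list could be passed to subprocess.run(command, check=True); the converted .docx then appears in out_dir.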

Usage

Command Line Tool

After installation, you can use the word2md command:

# Convert single file
word2md document.docx

# Convert legacy .doc (requires LibreOffice)
word2md document.doc

# Specify output file
word2md document.docx -o output.md

# Show verbose output
word2md document.docx -v

# Ignore all images and output a single Markdown file
word2md document.docx --ignore-images

# Batch conversion
word2md *.docx -o output_directory/
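
To drive the same commands from a Python script, one option is to assemble the argument list and call the CLI through subprocess. The sketch below uses only the options documented above (-o, -v, --ignore-images); the helper names build_word2md_command and run_word2md are hypothetical and not part of the package.

```python
import subprocess
from typing import List, Optional, Sequence


def build_word2md_command(
    sources: Sequence[str],
    output: Optional[str] = None,
    verbose: bool = False,
    ignore_images: bool = False,
) -> List[str]:
    """Assemble a word2md argument list from the documented options."""
    command = ["word2md", *sources]
    if output is not None:
        command += ["-o", output]
    if verbose:
        command.append("-v")
    if ignore_images:
        command.append("--ignore-images")
    return command


def run_word2md(sources: Sequence[str], **options) -> int:
    """Run word2md and return its exit code (requires word2md on PATH)."""
    return subprocess.run(build_word2md_command(sources, **options)).returncode
```

Because subprocess does not expand shell wildcards, a batch call would pass an explicit file list, for example sorted(glob.glob("*.docx")).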

Python Script

You can also run the converter directly:

# Convert single file to auto-generated folder structure
python main.py document.docx

# Convert legacy .doc (requires LibreOffice)
python main.py document.doc

# Specify output file
python main.py document.docx -o output.md

# Show verbose output
python main.py document.docx -o output.md -v

# Ignore all images and output a single Markdown file
python main.py document.docx --ignore-images

Advanced Usage

# Batch conversion to output directory
python main.py *.docx -o output_directory/

# Output to stdout
python main.py document.docx

Project Structure

The project is now organized as a modular package:

word2md/
├── main.py                    # Main entry point
├── docx_converter/            # Main package
│   ├── __init__.py           # Package initialization
│   ├── cli.py                # Command line interface
│   ├── converter.py          # Main converter class
│   ├── document_processor.py # Document processing logic
│   ├── paragraph_processor.py # Paragraph processing
│   ├── formatting.py         # Text formatting (bold, italic, etc.)
│   ├── list_processor.py     # List handling
│   ├── table_processor.py    # Table conversion
│   ├── image_processor.py    # Image processing in paragraphs
│   ├── image_extractor.py    # Image extraction from DOCX
│   └── utils.py              # Utility functions
├── assets/
│   └── sample.docx           # Sample test file
├── requirements.txt          # Dependencies
└── README.md                # Documentation

Supported Formats

Text Formatting

  • Bold → **Bold**
  • Italic → *Italic*
  • Underline → <u>Underline</u>
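
The mapping above, together with the merging of adjacent underline tags mentioned under Features, might look roughly like this. It is a minimal sketch under assumptions: format_run and merge_adjacent_underline are hypothetical names, and the real logic in docx_converter/formatting.py may differ.

```python
import re


def format_run(text: str, bold: bool = False, italic: bool = False,
               underline: bool = False) -> str:
    """Wrap a run of text in Markdown/HTML markers for its formatting flags."""
    if not text.strip():
        return text  # leave whitespace-only runs unwrapped
    if italic:
        text = f"*{text}*"
    if bold:
        text = f"**{text}**"
    if underline:
        text = f"<u>{text}</u>"
    return text


def merge_adjacent_underline(markdown: str) -> str:
    """Collapse the '</u><u>' seams left by consecutive underlined runs."""
    return re.sub(r"</u>(\s*)<u>", r"\1", markdown)
```

Merging matters because Word often splits visually continuous text into several runs, which would otherwise produce output such as <u>under</u><u>lined</u>.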

Heading Detection

The converter supports multiple methods for detecting headings:

  1. Style-based detection: Converts Word heading styles (Heading 1-6, Title) to Markdown headings
  2. Font-size based detection: When no heading styles are present, automatically detects headings based on font size hierarchy
    • Analyses all paragraphs with uniform font sizes
    • Determines the baseline font size (most common size, usually normal text)
    • Assigns heading levels to larger font sizes in descending order
    • Example: If baseline is 12pt, then 18pt → # (H1), 16pt → ## (H2), 14pt → ### (H3)
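
The font-size steps listed above can be sketched as follows. The function name heading_levels_by_font_size is hypothetical, and ignoring sizes beyond the sixth-largest is an assumption, since the README does not say how more than six distinct sizes are handled.

```python
from collections import Counter
from typing import Dict, Iterable


def heading_levels_by_font_size(sizes: Iterable[float],
                                max_level: int = 6) -> Dict[float, int]:
    """Map each font size larger than the baseline to a heading level."""
    sizes = list(sizes)
    if not sizes:
        return {}
    # The baseline is the most common size, usually the body text.
    baseline, _ = Counter(sizes).most_common(1)[0]
    # Larger sizes get levels in descending order: biggest size is H1.
    larger = sorted({size for size in sizes if size > baseline}, reverse=True)
    return {size: level for level, size in enumerate(larger[:max_level], start=1)}


# Ten body paragraphs at 12pt plus a few larger ones.
levels = heading_levels_by_font_size([12] * 10 + [18, 16, 16, 14])
print(levels)  # {18: 1, 16: 2, 14: 3}
```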

Headings

  • Word heading styles → Markdown headings (# ## ### etc.)
  • Smart title handling: When a "Title" style is present, all other headings are automatically adjusted down one level
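
The level adjustment for documents with a "Title" style can be illustrated like this. The helper names are hypothetical, and clamping the shifted level at H6 is an assumption (Markdown has no seventh heading level).

```python
import re
from typing import Optional


def markdown_heading_level(style_name: str, has_title: bool) -> Optional[int]:
    """Return the Markdown heading level for a Word style, or None."""
    if style_name == "Title":
        return 1
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    if match is None:
        return None  # not a heading style
    # When a Title is present, every other heading moves down one level.
    level = int(match.group(1)) + (1 if has_title else 0)
    return min(level, 6)  # assumption: clamp at H6


def render_heading(text: str, style_name: str, has_title: bool) -> str:
    """Prefix text with the right number of '#' characters."""
    level = markdown_heading_level(style_name, has_title)
    return f"{'#' * level} {text}" if level else text
```

With has_title=True, a "Heading 1" paragraph renders as "## ...", which matches the sample output in the Example section.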

Lists

  • Unordered lists (•, -, * etc.) → - Item
  • Ordered lists (1., 2., etc.) → 1. Item
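
As a rough illustration of the marker normalization above, the sketch below rewrites literal list markers found in paragraph text. It is an assumption that markers appear in the text at all: Word usually stores automatic list numbering in paragraph properties rather than in the text, so docx_converter/list_processor.py may well work differently. The name normalize_list_item is hypothetical.

```python
import re

# Literal bullet characters followed by whitespace, e.g. "• Item" or "* Item".
_BULLET = re.compile(r"^\s*[•\-*·◦▪]\s+(.*)$")
# A number followed by "." or ")", e.g. "1. Item" or "2) Item".
_NUMBERED = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")


def normalize_list_item(line: str) -> str:
    """Rewrite a line with a literal list marker as a Markdown list item."""
    bullet = _BULLET.match(line)
    if bullet:
        return f"- {bullet.group(1)}"
    numbered = _NUMBERED.match(line)
    if numbered:
        return f"{numbered.group(1)}. {numbered.group(2)}"
    return line  # not a list item; leave untouched
```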

Tables

  • Word tables → Markdown table format
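
A minimal sketch of the table conversion, assuming the first row is treated as the header (the README does not say how headers are chosen). The function name rows_to_markdown is hypothetical.

```python
from typing import Sequence


def rows_to_markdown(rows: Sequence[Sequence[str]]) -> str:
    """Render rows of cell text as a Markdown (pipe) table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row: Sequence[str]) -> str:
        # Escape pipes and flatten line breaks so cells stay on one line.
        cells = [str(c).replace("|", "\\|").replace("\n", " ").strip() for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [render(row) for row in rows[1:]]
    return "\n".join(lines)
```

With python-docx, the input rows could be collected as [[cell.text for cell in row.cells] for row in table.rows].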

Images

  • Automatic extraction of images from DOCX
  • Save to assets/ directory under document name folder
  • Create proper image references in Markdown: ![Image](./assets/image_001.png)
  • Optional --ignore-images / --no-images mode to skip all images
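
A DOCX file is a ZIP archive in which embedded images are stored under word/media/, so extraction can be sketched with the standard library alone. This is an illustration, not the project's implementation: extract_images is a hypothetical name, and numbering the images by sorted archive name is an assumption (archive order does not necessarily match document order).

```python
import zipfile
from pathlib import Path, PurePosixPath
from typing import List


def extract_images(docx_path: str, assets_dir: str) -> List[Path]:
    """Copy embedded images to assets_dir as image_001.<ext>, image_002.<ext>, ..."""
    target_dir = Path(assets_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        media = sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )
        for index, name in enumerate(media, start=1):
            suffix = PurePosixPath(name).suffix.lower()
            target = target_dir / f"image_{index:03d}{suffix}"
            target.write_bytes(archive.read(name))
            saved.append(target)
    return saved
```

Each saved file can then be referenced from the Markdown as ![Image](./assets/<file name>).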

Output Structure

After conversion, the following structure is created:

document_name/
├── document_name.md
└── assets/
    ├── image_001.jpg
    ├── image_002.png
    └── ...

When using --ignore-images, output is a single Markdown file (no subfolder and no assets/ directory):

document_name.md
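
The two layouts can be summarized in a small path-planning sketch. The function name plan_output is hypothetical, and placing the output next to the source file is an assumption; the README only shows the relative structure.

```python
from pathlib import Path
from typing import Optional, Tuple


def plan_output(docx_path: str, ignore_images: bool = False) -> Tuple[Path, Optional[Path]]:
    """Return (markdown_path, assets_dir); assets_dir is None when images are ignored."""
    source = Path(docx_path)
    name = source.stem
    if ignore_images:
        # Single Markdown file, no subfolder and no assets/ directory.
        return source.with_name(f"{name}.md"), None
    folder = source.parent / name
    return folder / f"{name}.md", folder / "assets"
```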

Example

Input (DOCX)

A document with the following structure:

  • Title style: "TEST DOC"
  • Heading 1: "Title 1"
  • Heading 2: "Title 2"
  • Heading 3: "Title 3"
  • Various text formatting including bold, italic, and underlined text

Output (Markdown)

# TEST DOC

## Title 1

### Title 2

#### Title 3

This is a paragraph with **bold text**, _italic text_, and <u>underlined text</u>.

- Unordered list item 1
- Unordered list item 2

1. Ordered list item 1
2. Ordered list item 2

![Image](./assets/image_001.jpg)

Development

Architecture Benefits

  • Modular Design: Each component has a single responsibility
  • Easy Testing: Individual modules can be tested independently
  • Maintainable: Clear separation of concerns
  • Extensible: Easy to add new features or modify existing ones

Key Modules

  • DocxToMarkdownConverter: Main orchestrator class
  • DocumentProcessor: Handles document-level processing and title detection
  • ParagraphProcessor: Manages paragraph conversion and formatting
  • ImageExtractor: Extracts and maps images from DOCX files
  • ListProcessor: Handles ordered and unordered list conversion
  • TableProcessor: Converts Word tables to Markdown format
  • TextFormatter: Handles text formatting (bold, italic, underline)

Extending Functionality

The modular structure makes it easy to extend functionality:

Adding New Text Formatting

Edit docx_converter/formatting.py to add support for new text styles.

Supporting New List Types

Modify docx_converter/list_processor.py to handle different list formats.

Enhancing Image Processing

Update docx_converter/image_processor.py and docx_converter/image_extractor.py for advanced image handling.

Custom Document Elements

Add new processors in the docx_converter/ directory and integrate them via document_processor.py.

Development Workflow

  1. Install dependencies: pip install -r requirements.txt
  2. Run a smoke test against the sample file: python main.py assets/sample.docx
  3. Add new features in appropriate modules
  4. Test with various DOCX files
  5. Update documentation

Manual publish to PyPI (workflow)

This repository provides a manual GitHub Action to publish the package to PyPI. The workflow is triggered via the Actions UI (Manual publish to PyPI → Run workflow).

Behaviour:

  • It requires a version input (semantic version like 1.0.1).
  • It will update docx_converter/__init__.py and setup.py with the provided version.
  • If files change, it commits & pushes the change back to the main branch and optionally creates a v<version> tag.
  • Finally it builds sdist+wheel and publishes to PyPI using the PYPI_API_TOKEN secret. If you requested a tag (the tag input), the workflow will also create a GitHub Release (tag v<version>) and upload the generated artifacts from dist/* to the release.
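
The version-update step could be approximated as below. This is a sketch under stated assumptions: it supposes the version is stored as __version__ = "..." in docx_converter/__init__.py and as version="..." in setup.py, which the README does not confirm, and set_version is a hypothetical helper, not part of the workflow.

```python
import re

SEMVER = re.compile(r"\d+\.\d+\.\d+")
# Matches e.g. `__version__ = "1.0.0"` or `version="1.0.0"` at the start of a line.
VERSION_LINE = re.compile(
    r"""^(\s*(?:__version__|version)\s*=\s*)(["'])[^"']*\2""", re.MULTILINE
)


def set_version(source: str, version: str) -> str:
    """Return source with its first version assignment set to `version`."""
    if not SEMVER.fullmatch(version):
        raise ValueError(f"not a semantic version: {version!r}")
    return VERSION_LINE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{version}{m.group(2)}", source, count=1
    )
```

If the returned text equals the input, nothing changed and no commit would be needed, which mirrors the "if files change" condition above.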

Set up:

  • Add PYPI_API_TOKEN as a repository secret (Repository Settings → Secrets and variables → Actions → New repository secret).
  • Trigger the workflow via the Actions page and supply version. To create a tag/release, check the tag checkbox.

Note: The workflow only runs on manual dispatch to avoid accidental publishes on routine pushes.

Notes

  1. The converter targets basic document formatting; complex formatting may require manual adjustment
  2. Images are automatically extracted and saved to the assets folder
  3. Complex table layouts may need manual optimization
  4. Some Word-specific formats have no equivalent in Markdown and will be simplified

License

MIT License

Contributing

Issues and Pull Requests are welcome to improve this converter.
