A library for converting DOCX files to HTML and plain text

DOCX Parser Converter - Python

Python implementation of the DOCX parser and converter. Built with Python 3.10+, Pydantic models, and lxml.

For installation and quick start, see the main README.

⚠️ Breaking Changes in v1.0.0

Version 1.0.0 introduces a completely rewritten API. If you are upgrading from an earlier version, read CHANGELOG.md for the full migration guide.

Quick Migration

Old API (deprecated):

from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path
from docx_parser_converter.docx_to_html.docx_to_html_converter import DocxToHtmlConverter

docx_content = read_binary_from_file_path("document.docx")
converter = DocxToHtmlConverter(docx_content)
html = converter.convert_to_html()

New API (recommended):

from docx_parser_converter import docx_to_html

html = docx_to_html("document.docx")

The old API still works but emits deprecation warnings. It will be removed in a future version.
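
While migrating, it can help to find every place that still triggers a deprecation warning. The following is a minimal sketch using only the standard library `warnings` module. `legacy_convert` is a hypothetical stand-in for an old-API call, and the sketch assumes the warnings use the `DeprecationWarning` category, which this document does not confirm.

```python
import warnings


def legacy_convert() -> str:
    """Hypothetical stand-in for an old-API call that warns when used."""
    warnings.warn("legacy_convert is deprecated", DeprecationWarning, stacklevel=2)
    return "<p>converted</p>"


def find_deprecations(func) -> list[str]:
    """Run func and return the messages of any DeprecationWarning it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        func()
    return [str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)]


messages = find_deprecations(legacy_convert)
```

Running a test suite with `python -W error::DeprecationWarning` is another option; it turns each such warning into an exception so remaining old-API calls fail loudly.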

Configuration

Use ConversionConfig to customize the conversion:

from docx_parser_converter import ConversionConfig, docx_to_html, docx_to_text

# One config object holds both HTML and text options
config = ConversionConfig(
    # HTML-specific options
    title="My Document",           # Document title in <title> tag
    language="en",                 # HTML lang attribute
    style_mode="inline",           # "inline", "class", or "none"
    use_semantic_tags=False,       # Use CSS spans (False) vs <strong>, <em> (True)
    fragment_only=False,           # Output just content without HTML wrapper
    custom_css="body { margin: 2em; }",  # Custom CSS to include
    responsive=True,               # Include viewport meta tag

    # Text-specific options
    text_formatting="plain",       # "plain" or "markdown"
    table_mode="auto",             # "auto", "ascii", "tabs", or "plain"
    paragraph_separator="\n\n",    # Separator between paragraphs
)

html = docx_to_html("document.docx", config=config)
text = docx_to_text("document.docx", config=config)

Configuration Options

HTML Options

Option                Type                         Default   Description
style_mode            "inline" | "class" | "none"  "inline"  How to output CSS styles
use_semantic_tags     bool                         False     Use semantic tags (<strong>, <em>) vs CSS spans
preserve_whitespace   bool                         False     Preserve whitespace in content
title                 str                          ""        Document title for HTML output
language              str                          "en"      HTML lang attribute
fragment_only         bool                         False     Output only content, no HTML wrapper
custom_css            str | None                   None      Custom CSS to include
css_files             list[str]                    []        External CSS files to reference
responsive            bool                         True      Include viewport meta tag
include_print_styles  bool                         False     Include print media query styles

Text Options

Option                     Type                                 Default  Description
text_formatting            "plain" | "markdown"                 "plain"  Output format
table_mode                 "auto" | "ascii" | "tabs" | "plain"  "auto"   Table rendering mode
paragraph_separator        str                                  "\n\n"   Separator between paragraphs
preserve_empty_paragraphs  bool                                 True     Preserve empty paragraphs

Table Rendering Modes

  • auto: Selects ASCII for tables with visible borders and tabs for all other tables
  • ascii: ASCII box drawing characters (+, -, |)
  • tabs: Tab-separated columns
  • plain: Space-separated columns

Example ASCII table output:

+----------+----------+
| Header 1 | Header 2 |
+----------+----------+
| Cell 1   | Cell 2   |
+----------+----------+
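
The three explicit modes can be sketched in a few lines. This is an illustration of the output shapes described above, not the library's renderer. `render_table` is a hypothetical helper, and it assumes a rectangular table with no merged cells.

```python
def render_table(rows: list[list[str]], mode: str = "ascii") -> str:
    """Render rows of cell text in "ascii", "tabs", or "plain" mode."""
    if mode == "tabs":
        return "\n".join("\t".join(row) for row in rows)
    if mode == "plain":
        return "\n".join(" ".join(row) for row in rows)
    if mode != "ascii":
        raise ValueError(f"unknown table mode: {mode!r}")

    # Each column is as wide as its longest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    rule = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    lines = [rule]
    for row in rows:
        cells = " | ".join(cell.ljust(w) for cell, w in zip(row, widths))
        lines.append(f"| {cells} |")
        lines.append(rule)  # horizontal rule after every row
    return "\n".join(lines)


table = render_table([["Header 1", "Header 2"], ["Cell 1", "Cell 2"]])
```

With the two rows shown, `table` has the same layout as the example ASCII output above.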

Markdown Formatting

When using text_formatting="markdown", formatting is preserved:

config = ConversionConfig(text_formatting="markdown")
text = docx_to_text("document.docx", config=config)

# Output: "This is **bold** and *italic* text."
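
To illustrate how run-level formatting can map to Markdown markers, here is a minimal sketch. The `Run` dataclass and `run_to_markdown` are hypothetical and much simpler than the library's own models; they only show the idea of wrapping text according to bold and italic flags.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical simplified run: a span of text with two formatting flags."""
    text: str
    bold: bool = False
    italic: bool = False


def run_to_markdown(run: Run) -> str:
    """Wrap the run's text in Markdown emphasis markers."""
    text = run.text
    if run.italic:
        text = f"*{text}*"
    if run.bold:
        text = f"**{text}**"
    return text


runs = [
    Run("This is "),
    Run("bold", bold=True),
    Run(" and "),
    Run("italic", italic=True),
    Run(" text."),
]
line = "".join(run_to_markdown(r) for r in runs)
```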

Input Types

The library accepts multiple input types:

from pathlib import Path
from io import BytesIO

# File path as string
html = docx_to_html("document.docx")

# File path as Path object
html = docx_to_html(Path("document.docx"))

# Bytes content
with open("document.docx", "rb") as f:
    content = f.read()
html = docx_to_html(content)

# File-like object
with open("document.docx", "rb") as f:
    html = docx_to_html(f)

# None returns empty output
html = docx_to_html(None)  # Returns empty HTML document
text = docx_to_text(None)  # Returns ""
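
One way to accept all of these input types is to normalize them to bytes before parsing. The sketch below is an assumption about how such normalization could look, not the library's code. `read_source` and `DocxSource` are hypothetical names.

```python
from io import BytesIO
from pathlib import Path
from typing import BinaryIO, Optional, Union

DocxSource = Union[str, Path, bytes, BinaryIO, None]


def read_source(source: DocxSource) -> Optional[bytes]:
    """Return the raw bytes for any accepted input type, or None for None."""
    if source is None:
        return None
    if isinstance(source, bytes):
        return source
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    if hasattr(source, "read"):
        return source.read()  # file-like object opened in binary mode
    raise TypeError(f"unsupported input type: {type(source).__name__}")


data = read_source(BytesIO(b"PK\x03\x04"))
```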

Supported DOCX Elements

Text Formatting

  • Bold, italic, underline, strikethrough
  • Subscript, superscript
  • Highlight colors
  • Font family, size, and color
  • All caps, small caps
  • Various underline styles (single, double, dotted, dashed, wave, etc.) with color support

Paragraph Formatting

  • Alignment (left, center, right, justify)
  • Indentation (left, right, first line, hanging)
  • Spacing (before, after, line spacing)
  • Borders and shading
  • Keep with next, keep lines together, page break before

Lists and Numbering

  • Bullet lists
  • Numbered lists (decimal, roman, letters, ordinal)
  • Multi-level lists with various formats
  • List restart and override support
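
The numbering formats and multi-level restarts listed above can be sketched as follows. All names here (`to_roman`, `to_letter`, `ListCounter`) are hypothetical, and the letter format assumes the common word-processor convention of repeating the letter after z (aa, bb, ...).

```python
def to_roman(number: int) -> str:
    """Format a positive integer as an upper-case Roman numeral."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in numerals:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def to_letter(number: int) -> str:
    """Format 1 as a, 26 as z, then repeat the letter: 27 as aa, 28 as bb."""
    index = (number - 1) % 26
    repeats = (number - 1) // 26 + 1
    return chr(ord("a") + index) * repeats


class ListCounter:
    """Track one counter per level; deeper levels restart when a shallower one advances."""

    def __init__(self) -> None:
        self.counts: dict[int, int] = {}

    def next(self, level: int) -> int:
        self.counts[level] = self.counts.get(level, 0) + 1
        # Drop deeper levels so they start again from 1.
        for deeper in [lvl for lvl in self.counts if lvl > level]:
            del self.counts[deeper]
        return self.counts[level]


counter = ListCounter()
labels = [
    f"{to_roman(counter.next(0))}.",
    f"{to_letter(counter.next(1))})",
    f"{to_letter(counter.next(1))})",
    f"{to_roman(counter.next(0))}.",
    f"{to_letter(counter.next(1))})",
]
```

Here the second-level counter restarts at "a)" once the first level advances from I to II.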

Tables

  • Simple and complex tables
  • Cell merging (horizontal and vertical)
  • Full border support (outer borders, inside grid lines, per-cell borders)
  • Cell-level border overrides (tcBorders override tblBorders)
  • Cell shading and backgrounds
  • Column widths and table alignment

Other Elements

  • Hyperlinks (external URLs resolved from relationships)
  • Line breaks and page breaks
  • Tab characters
  • Special characters (soft hyphen, non-breaking hyphen)

Error Handling

The library provides specific exceptions for different error cases:

from docx_parser_converter import docx_to_html

try:
    html = docx_to_html("document.docx")
except FileNotFoundError:
    print("File not found")
except ValueError as e:
    print(f"Invalid DOCX: {e}")
except Exception as e:
    print(f"Error: {e}")
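
If you want to reject obviously invalid input before conversion, a pre-check with the standard library is possible, because a DOCX file is a ZIP archive whose main part is word/document.xml. `check_docx` below is a hypothetical helper, not part of this library, and the library's own validation may differ.

```python
import zipfile
from io import BytesIO


def check_docx(data: bytes) -> None:
    """Raise ValueError if data does not look like a DOCX package."""
    buffer = BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        raise ValueError("not a ZIP archive")
    with zipfile.ZipFile(buffer) as archive:
        if "word/document.xml" not in archive.namelist():
            raise ValueError("missing word/document.xml")


# Build a tiny in-memory archive that passes the check.
sample = BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
check_docx(sample.getvalue())
```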

Image Format Support

Images are extracted from DOCX files and embedded in HTML as base64 data URLs. Browser rendering support varies by format:

Format  Extensions    Browser Support
PNG     .png          ✅ Full
JPEG    .jpg, .jpeg   ✅ Full
GIF     .gif          ✅ Full (including animation)
WebP    .webp         ✅ Full
SVG     .svg          ✅ Full
BMP     .bmp          ✅ Full
TIFF    .tif, .tiff   ⚠️ Safari only
EMF     .emf          ❌ Not supported
WMF     .wmf          ❌ Not supported

Notes:

  • TIFF images will only display in Safari; other browsers will show a broken image
  • EMF/WMF are Windows vector formats that browsers cannot render natively
  • Images in plain text output are skipped (no alt text placeholders)
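
How an image becomes a base64 data URL can be sketched with the standard library. `to_data_url` and `MIME_TYPES` are hypothetical names. Rejecting formats outside the fully supported set is a choice made for this sketch only and is not a description of the library's behavior.

```python
import base64

# MIME types for the formats listed above with full browser support.
MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".svg": "image/svg+xml",
    ".bmp": "image/bmp",
}


def to_data_url(image_bytes: bytes, extension: str) -> str:
    """Encode image bytes as a base64 data URL for use in an <img> src."""
    mime = MIME_TYPES.get(extension.lower())
    if mime is None:
        raise ValueError(f"no broadly supported MIME type for {extension!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


url = to_data_url(b"\x89PNG", ".png")
```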

Known Limitations

Not Currently Supported

  • Headers and footers: Document headers/footers are not included
  • Footnotes and endnotes: These are not extracted
  • Comments and track changes: Revision marks are not processed
  • OLE objects: Embedded Excel charts, etc. are not supported
  • Text boxes: Floating text boxes and shapes are not extracted
  • Complex field codes: Most field codes other than hyperlinks are not processed
  • RTL/BiDi text: Right-to-left text may not render correctly
  • Password-protected files: Encrypted documents cannot be opened

Partial Support

  • Styles: Style inheritance works, but support for complex conditional formatting is limited
  • Themes: Theme colors and fonts are not resolved
  • Custom XML: Custom document properties are not extracted
  • Sections: Section properties (columns, page size) affect content but aren't fully rendered

Development

Setup

# Clone the repository
git clone https://github.com/omer-go/docx-parser-converter.git
cd docx-parser-converter/docx_parser_converter_python

# Install PDM (if not already installed)
pip install pdm

# Install dependencies
pdm install

# Install dev dependencies
pdm install -G dev

Running Tests

# Run all tests
pdm run pytest

# Run with coverage
pdm run pytest --cov

# Run specific test file
pdm run pytest tests/unit/test_api.py

Type Checking

pdm run pyright

Linting

pdm run ruff check .
pdm run ruff format .

Project Structure

docx_parser_converter_python/
├── api.py              # Public API (docx_to_html, docx_to_text, ConversionConfig)
├── core/               # Core utilities
│   ├── docx_reader.py  # DOCX file opening and validation
│   ├── xml_extractor.py # XML content extraction
│   ├── constants.py    # XML namespaces and paths
│   └── exceptions.py   # Custom exceptions
├── models/             # Pydantic models
│   ├── common/         # Shared models (Color, Border, Spacing, etc.)
│   ├── document/       # Document models (Paragraph, Run, Table, etc.)
│   ├── numbering/      # Numbering definitions
│   └── styles/         # Style definitions
├── parsers/            # XML to Pydantic conversion
│   ├── common/         # Common element parsers
│   ├── document/       # Document element parsers
│   ├── numbering/      # Numbering parsers
│   └── styles/         # Style parsers
├── converters/         # Model to output conversion
│   ├── common/         # Style resolution, numbering tracking
│   ├── html/           # HTML conversion
│   └── text/           # Text conversion
└── tests/              # Test suite
    ├── unit/           # Unit tests
    ├── integration/    # Integration tests
    └── fixtures/       # Test DOCX files

Architecture

The library follows a three-phase conversion process:

  1. Parse: DOCX XML → Pydantic models
    • Open and validate DOCX file
    • Extract document.xml, styles.xml, numbering.xml
    • Parse XML to strongly typed Pydantic models
  2. Resolve: Apply style inheritance
    • Merge document defaults → style chain → direct formatting
    • Track numbering counters for lists
  3. Convert: Models → Output format
    • HTML: Generate semantic HTML with CSS
    • Text: Extract plain text with optional Markdown
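
The merge order of the Resolve phase (document defaults, then the style chain, then direct formatting) can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions, not the library's resolver. The `resolve` function and the `based_on` and `properties` keys are hypothetical.

```python
from typing import Optional

Properties = dict[str, object]


def resolve(
    defaults: Properties,
    styles: dict[str, dict],
    style_id: Optional[str],
    direct: Properties,
) -> Properties:
    """Merge defaults, then the style chain (base first), then direct formatting."""
    # Walk from the applied style up to its root ancestor.
    chain = []
    seen = set()
    while style_id is not None and style_id in styles and style_id not in seen:
        seen.add(style_id)  # guard against cyclic based_on references
        chain.append(styles[style_id])
        style_id = styles[style_id].get("based_on")

    resolved = dict(defaults)
    for style in reversed(chain):  # root ancestor first, applied style last
        resolved.update(style.get("properties", {}))
    resolved.update(direct)  # direct formatting always wins
    return resolved


styles = {
    "Normal": {"based_on": None, "properties": {"font": "Calibri", "size": 11}},
    "Heading1": {"based_on": "Normal", "properties": {"size": 16, "bold": True}},
}
result = resolve(
    {"font": "Times New Roman", "color": "000000"},
    styles,
    "Heading1",
    {"italic": True},
)
```

In this example the default font is overridden by Normal, the size from Normal is overridden by Heading1, and the italic flag comes from direct formatting.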

License

MIT License

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
