A library for converting DOCX files to HTML and plain text

DOCX Parser Converter - Python

Python implementation of the DOCX parser and converter. Built with Python 3.10+, Pydantic models, and lxml.

For installation and quick start, see the main README.

⚠️ Breaking Changes in v1.0.0

Version 1.0.0 introduces a completely rewritten API. If you are upgrading from an earlier version, read CHANGELOG.md for the full migration guide.

Quick Migration

Old API (deprecated):

from docx_parser_converter.docx_parsers.utils import read_binary_from_file_path
from docx_parser_converter.docx_to_html.docx_to_html_converter import DocxToHtmlConverter

docx_content = read_binary_from_file_path("document.docx")
converter = DocxToHtmlConverter(docx_content)
html = converter.convert_to_html()

New API (recommended):

from docx_parser_converter import docx_to_html

html = docx_to_html("document.docx")

The old API still works but emits deprecation warnings. It will be removed in a future version.
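One way to find leftover calls to the old API during migration is to record deprecation warnings around a call. The sketch below uses only the standard `warnings` module; `legacy_convert` and `collect_deprecations` are hypothetical names for illustration and are not part of this library.

```python
import warnings


def legacy_convert(path: str) -> str:
    """Hypothetical stand-in for any deprecated entry point."""
    warnings.warn(
        "legacy_convert is deprecated; use the new API instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return f"<p>{path}</p>"


def collect_deprecations(func, *args, **kwargs):
    """Call func and return (result, messages of DeprecationWarnings raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        result = func(*args, **kwargs)
    messages = [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]
    return result, messages


result, messages = collect_deprecations(legacy_convert, "document.docx")
print(messages)
```

In a test suite, running Python with `-W error::DeprecationWarning` turns the same warnings into exceptions, which makes remaining old-API calls fail loudly.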

Configuration

Use ConversionConfig to customize the conversion:

from docx_parser_converter import ConversionConfig, docx_to_html, docx_to_text

# A single config object carries both HTML and text options
config = ConversionConfig(
    # HTML-specific options
    title="My Document",           # Document title in <title> tag
    language="en",                 # HTML lang attribute
    style_mode="inline",           # "inline", "class", or "none"
    use_semantic_tags=False,       # Use CSS spans (False) vs <strong>, <em> (True)
    fragment_only=False,           # Output just content without HTML wrapper
    custom_css="body { margin: 2em; }",  # Custom CSS to include
    responsive=True,               # Include viewport meta tag

    # Text-specific options
    text_formatting="plain",       # "plain" or "markdown"
    table_mode="auto",             # "auto", "ascii", "tabs", or "plain"
    paragraph_separator="\n\n",    # Separator between paragraphs
)

html = docx_to_html("document.docx", config=config)
text = docx_to_text("document.docx", config=config)

Configuration Options

HTML Options

Option	Type	Default	Description
`style_mode`	`"inline"` \| `"class"` \| `"none"`	`"inline"`	How to output CSS styles
`use_semantic_tags`	`bool`	`False`	Use semantic tags (`<strong>`, `<em>`) vs CSS spans
`preserve_whitespace`	`bool`	`False`	Preserve whitespace in content
`title`	`str`	`""`	Document title for HTML output
`language`	`str`	`"en"`	HTML `lang` attribute
`fragment_only`	`bool`	`False`	Output only content, no HTML wrapper
`custom_css`	`str \| None`	`None`	Custom CSS to include
`css_files`	`list[str]`	`[]`	External CSS files to reference
`responsive`	`bool`	`True`	Include viewport meta tag
`include_print_styles`	`bool`	`False`	Include print media query styles

Text Options

Option	Type	Default	Description
`text_formatting`	`"plain"` \| `"markdown"`	`"plain"`	Output format
`table_mode`	`"auto"` \| `"ascii"` \| `"tabs"` \| `"plain"`	`"auto"`	Table rendering mode
`paragraph_separator`	`str`	`"\n\n"`	Separator between paragraphs
`preserve_empty_paragraphs`	`bool`	`True`	Preserve empty paragraphs

Table Rendering Modes

auto: Automatically selects ASCII for tables with visible borders, tabs for others
ascii: ASCII box drawing characters (+, -, |)
tabs: Tab-separated columns
plain: Space-separated columns

Example ASCII table output:

+----------+----------+
| Header 1 | Header 2 |
+----------+----------+
| Cell 1   | Cell 2   |
+----------+----------+
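To make the `ascii` layout concrete, here is a minimal sketch of a renderer that produces the grid shown above. It is not the library's implementation; `render_ascii_table` is a hypothetical helper that assumes every cell is a single-line string.

```python
def render_ascii_table(rows: list[list[str]]) -> str:
    """Render rows as an ASCII grid using +, -, and | characters."""
    if not rows:
        return ""
    n_cols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [row + [""] * (n_cols - len(row)) for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in padded) for i in range(n_cols)]
    rule = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    lines = [rule]
    for row in padded:
        cells = " | ".join(cell.ljust(w) for cell, w in zip(row, widths))
        lines.append(f"| {cells} |")
        lines.append(rule)
    return "\n".join(lines)


table = render_ascii_table([["Header 1", "Header 2"], ["Cell 1", "Cell 2"]])
print(table)
```

A `tabs`-style rendering of the same rows would reduce to `"\t".join(row)` per line.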

Markdown Formatting

When using text_formatting="markdown", formatting is preserved:

config = ConversionConfig(text_formatting="markdown")
text = docx_to_text("document.docx", config=config)

# Output: "This is **bold** and *italic* text."
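How formatted runs could map to Markdown markers can be sketched as follows. The `Run` dataclass and both functions are hypothetical and only illustrate the idea; the library's own models and converter may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical text run with two formatting flags."""
    text: str
    bold: bool = False
    italic: bool = False


def run_to_markdown(run: Run) -> str:
    """Wrap the run's text in Markdown emphasis markers."""
    text = run.text
    if run.italic:
        text = f"*{text}*"
    if run.bold:
        text = f"**{text}**"
    return text


def paragraph_to_markdown(runs: list[Run]) -> str:
    """Concatenate the Markdown for every run in a paragraph."""
    return "".join(run_to_markdown(run) for run in runs)


runs = [
    Run("This is "),
    Run("bold", bold=True),
    Run(" and "),
    Run("italic", italic=True),
    Run(" text."),
]
markdown = paragraph_to_markdown(runs)
print(markdown)
```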

Input Types

The library accepts multiple input types:

from pathlib import Path
from io import BytesIO
from docx_parser_converter import docx_to_html, docx_to_text

# File path as string
html = docx_to_html("document.docx")

# File path as Path object
html = docx_to_html(Path("document.docx"))

# Bytes content
with open("document.docx", "rb") as f:
    content = f.read()
html = docx_to_html(content)

# File-like object
with open("document.docx", "rb") as f:
    html = docx_to_html(f)

# None returns empty output
html = docx_to_html(None)  # Returns empty HTML document
text = docx_to_text(None)  # Returns ""
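Accepting several input types usually comes down to reducing each one to raw bytes before parsing. The following is a minimal sketch of that idea, assuming a hypothetical helper named `read_input_bytes`; it does not reflect how the library handles input internally.

```python
from io import BytesIO
from pathlib import Path
from typing import BinaryIO, Union

DocxInput = Union[str, Path, bytes, BinaryIO, None]


def read_input_bytes(source: DocxInput) -> bytes:
    """Hypothetical helper: reduce every supported input type to raw bytes."""
    if source is None:
        return b""  # nothing to read
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()  # raises FileNotFoundError if missing
    if hasattr(source, "read"):
        return source.read()  # file-like object opened in binary mode
    raise TypeError(f"Unsupported input type: {type(source).__name__}")


print(read_input_bytes(b"PK\x03\x04"))
print(read_input_bytes(BytesIO(b"PK\x03\x04")))
print(read_input_bytes(None))
```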

Supported DOCX Elements

Text Formatting

Bold, italic, underline, strikethrough
Subscript, superscript
Highlight colors
Font family, size, and color
All caps, small caps
Various underline styles (single, double, dotted, dashed, wave, etc.) with color support

Paragraph Formatting

Alignment (left, center, right, justify)
Indentation (left, right, first line, hanging)
Spacing (before, after, line spacing)
Borders and shading
Keep with next, keep lines together, page break before

Lists and Numbering

Bullet lists
Numbered lists (decimal, roman, letters, ordinal)
Multi-level lists with various formats
List restart and override support
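Multi-level numbering depends on per-level counters that restart whenever a shallower level advances. The sketch below illustrates that bookkeeping with hypothetical names (`ListCounter`, `to_letter`, `to_roman`); repeating the letter past "z" is an assumption of this sketch, and none of this is taken from the library's code.

```python
def to_roman(number: int) -> str:
    """Convert a positive integer to a lowercase Roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    parts = []
    for value, symbol in numerals:
        while number >= value:
            parts.append(symbol)
            number -= value
    return "".join(parts)


def to_letter(number: int) -> str:
    """Convert 1 -> a, 26 -> z, 27 -> aa (assumed repeat-the-letter scheme)."""
    index = (number - 1) % 26
    repeats = (number - 1) // 26 + 1
    return chr(ord("a") + index) * repeats


class ListCounter:
    """Tracks the current number at each nesting level of one list."""

    def __init__(self) -> None:
        self.counts: dict[int, int] = {}

    def advance(self, level: int) -> int:
        # Increment this level and restart every deeper level.
        self.counts[level] = self.counts.get(level, 0) + 1
        for deeper in [lvl for lvl in self.counts if lvl > level]:
            del self.counts[deeper]
        return self.counts[level]


# Level 0 is decimal, level 1 uses letters, level 2 uses Roman numerals.
FORMATTERS = {0: str, 1: to_letter, 2: to_roman}

counter = ListCounter()
labels = [
    f"{FORMATTERS[level](counter.advance(level))}."
    for level in [0, 1, 1, 2, 0, 1]
]
print(labels)
```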

Tables

Simple and complex tables
Cell merging (horizontal and vertical)
Full border support (outer borders, inside grid lines, per-cell borders)
Cell-level border overrides (tcBorders override tblBorders)
Cell shading and backgrounds
Column widths and table alignment

Other Elements

Hyperlinks (external URLs resolved from relationships)
Line breaks and page breaks
Tab characters
Special characters (soft hyphen, non-breaking hyphen)
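In a DOCX package, a hyperlink in the document body refers to a relationship ID, and the actual URL lives in `word/_rels/document.xml.rels`. The sketch below shows one way to build that ID-to-URL map with the standard library. The sample XML and the `external_hyperlinks` helper are illustrative assumptions, not the library's parser.

```python
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)
STYLES_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
)

# Made-up relationships part with one external hyperlink and one internal part.
SAMPLE_RELS = f"""<Relationships xmlns="{RELS_NS}">
  <Relationship Id="rId1" Type="{HYPERLINK_TYPE}"
                Target="https://example.com/" TargetMode="External"/>
  <Relationship Id="rId2" Type="{STYLES_TYPE}" Target="styles.xml"/>
</Relationships>"""


def external_hyperlinks(rels_xml: str) -> dict[str, str]:
    """Map relationship IDs to URLs for external hyperlink relationships."""
    root = ET.fromstring(rels_xml)
    links: dict[str, str] = {}
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        is_link = rel.get("Type") == HYPERLINK_TYPE
        is_external = rel.get("TargetMode") == "External"
        if is_link and is_external:
            links[rel.get("Id")] = rel.get("Target")
    return links


links = external_hyperlinks(SAMPLE_RELS)
print(links)
```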

Error Handling

The library provides specific exceptions for different error cases:

from docx_parser_converter import docx_to_html

try:
    html = docx_to_html("document.docx")
except FileNotFoundError:
    print("File not found")
except ValueError as e:
    print(f"Invalid DOCX: {e}")
except Exception as e:
    print(f"Error: {e}")
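If you want to reject obviously invalid uploads before calling the converter, a cheap pre-check is to confirm the data is a ZIP archive containing `word/document.xml`. This is a sketch using only the standard library; `check_docx_bytes` is a hypothetical helper and is not the library's own validation.

```python
import zipfile
from io import BytesIO


def check_docx_bytes(data: bytes) -> None:
    """Raise ValueError unless data looks like a DOCX package.

    Hypothetical pre-check; it only inspects the ZIP container.
    """
    buffer = BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        raise ValueError("not a ZIP archive")
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        if "word/document.xml" not in archive.namelist():
            raise ValueError("missing word/document.xml")


def make_minimal_package() -> bytes:
    """Build an in-memory ZIP holding only word/document.xml (demo data)."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


def looks_like_docx(data: bytes) -> bool:
    """Boolean wrapper around check_docx_bytes."""
    try:
        check_docx_bytes(data)
    except ValueError:
        return False
    return True


print(looks_like_docx(make_minimal_package()))
print(looks_like_docx(b"plain text"))
```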

Image Format Support

Images are extracted from DOCX files and embedded in HTML as base64 data URLs. Browser rendering support varies by format:

Format	Extensions	Browser Support
PNG	`.png`	✅ Full
JPEG	`.jpg`, `.jpeg`	✅ Full
GIF	`.gif`	✅ Full (including animation)
WebP	`.webp`	✅ Full
SVG	`.svg`	✅ Full
BMP	`.bmp`	✅ Full
TIFF	`.tif`, `.tiff`	⚠️ Safari only
EMF	`.emf`	❌ Not supported
WMF	`.wmf`	❌ Not supported

Notes:

TIFF images will only display in Safari; other browsers will show a broken image
EMF/WMF are Windows vector formats that browsers cannot render natively
Images in plain text output are skipped (no alt text placeholders)
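For reference, a base64 data URL is just a MIME type plus the encoded bytes. The sketch below shows the general construction for the browser-renderable formats in the table above; `to_data_url` and the `MIME_TYPES` mapping are hypothetical and not taken from the library.

```python
import base64

# MIME types for the formats browsers can render (EMF/WMF are left out).
MIME_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".svg": "image/svg+xml",
    ".bmp": "image/bmp",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def to_data_url(data: bytes, extension: str) -> str:
    """Encode image bytes as a data URL suitable for an <img src=...>."""
    mime = MIME_TYPES.get(extension.lower())
    if mime is None:
        raise ValueError(f"No MIME type known for {extension!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


png_signature = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of any PNG file
url = to_data_url(png_signature, ".PNG")
print(url)
```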

Known Limitations

Not Currently Supported

Headers and footers: Document headers/footers are not included
Footnotes and endnotes: These are not extracted
Comments and track changes: Revision marks are not processed
OLE objects: Embedded Excel charts, etc. are not supported
Text boxes: Floating text boxes and shapes are not extracted
Complex field codes: Most field codes other than hyperlinks are not processed
RTL/BiDi text: Right-to-left text may not render correctly
Password-protected files: Encrypted documents cannot be opened

Partial Support

Styles: Style inheritance works but complex conditional formatting is limited
Themes: Theme colors and fonts are not resolved
Custom XML: Custom document properties are not extracted
Sections: Section properties (columns, page size) affect content but aren't fully rendered

Development

Setup

# Clone the repository
git clone https://github.com/omer-go/docx-parser-converter.git
cd docx-parser-converter/docx_parser_converter_python

# Install PDM (if not already installed)
pip install pdm

# Install dependencies
pdm install

# Install dev dependencies
pdm install -G dev

Running Tests

# Run all tests
pdm run pytest

# Run with coverage
pdm run pytest --cov

# Run specific test file
pdm run pytest tests/unit/test_api.py

Type Checking

pdm run pyright

Linting

pdm run ruff check .
pdm run ruff format .

Project Structure

docx_parser_converter_python/
├── api.py              # Public API (docx_to_html, docx_to_text, ConversionConfig)
├── core/               # Core utilities
│   ├── docx_reader.py  # DOCX file opening and validation
│   ├── xml_extractor.py # XML content extraction
│   ├── constants.py    # XML namespaces and paths
│   └── exceptions.py   # Custom exceptions
├── models/             # Pydantic models
│   ├── common/         # Shared models (Color, Border, Spacing, etc.)
│   ├── document/       # Document models (Paragraph, Run, Table, etc.)
│   ├── numbering/      # Numbering definitions
│   └── styles/         # Style definitions
├── parsers/            # XML to Pydantic conversion
│   ├── common/         # Common element parsers
│   ├── document/       # Document element parsers
│   ├── numbering/      # Numbering parsers
│   └── styles/         # Style parsers
├── converters/         # Model to output conversion
│   ├── common/         # Style resolution, numbering tracking
│   ├── html/           # HTML conversion
│   └── text/           # Text conversion
└── tests/              # Test suite
    ├── unit/           # Unit tests
    ├── integration/    # Integration tests
    └── fixtures/       # Test DOCX files

Architecture

The library follows a three-phase conversion process:

1. Parse: DOCX XML → Pydantic models
   - Open and validate DOCX file
   - Extract document.xml, styles.xml, numbering.xml
   - Parse XML to strongly-typed Pydantic models
2. Resolve: Apply style inheritance
   - Merge document defaults → style chain → direct formatting
   - Track numbering counters for lists
3. Convert: Models → Output format
   - HTML: Generate semantic HTML with CSS
   - Text: Extract plain text with optional Markdown
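The merge order in the Resolve phase can be illustrated with plain dictionaries, where each later source overrides the earlier ones. This is a simplified sketch under that assumption; `resolve_properties` and the property names are hypothetical, and the library works with Pydantic models rather than dicts.

```python
def resolve_properties(
    defaults: dict, style_chain: list[dict], direct: dict
) -> dict:
    """Merge property dicts; later sources override earlier ones.

    style_chain is ordered from the base style to the most derived style.
    """
    resolved = dict(defaults)
    for style in style_chain:
        resolved.update(style)
    resolved.update(direct)  # direct formatting always wins
    return resolved


defaults = {"font": "Calibri", "size_pt": 11, "bold": False}
style_chain = [
    {"size_pt": 12},                # base style
    {"bold": True, "size_pt": 16},  # derived style, for example a heading
]
direct = {"italic": True}

resolved = resolve_properties(defaults, style_chain, direct)
print(resolved)
```

Here the derived style's size replaces both the default and the base style's value, while the font falls through from the document defaults.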

License

MIT License

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Release history

1.0.3 (this version): Jan 8, 2026
1.0.1: Jan 6, 2026
1.0.0: Jan 6, 2026
0.5.1.2: Aug 30, 2024
0.5.1.1: Aug 30, 2024
0.5.1: Aug 29, 2024
0.5: Jul 1, 2024
0.4: Jul 1, 2024
0.3.1: Jul 1, 2024

Download files

Source Distribution

docx_parser_converter-1.0.3.tar.gz (206.3 kB)

Uploaded Jan 8, 2026 Source

Built Distribution

docx_parser_converter-1.0.3-py3-none-any.whl (142.0 kB)

Uploaded Jan 8, 2026 Python 3

File details

Details for the file docx_parser_converter-1.0.3.tar.gz.

File metadata

Download URL: docx_parser_converter-1.0.3.tar.gz
Upload date: Jan 8, 2026
Size: 206.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: pdm/2.26.2 CPython/3.14.0 Darwin/25.1.0

File hashes

Hashes for docx_parser_converter-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`593429caab2e5eb1a0e43dfeff5b9db8ed17b8ebb376ba780f8a0850054fcf83`
MD5	`dbb1c1db6da546a24d014b828b4b706d`
BLAKE2b-256	`bbc23bb0c70002cff8e26e28559749c3255735d7eefd1368a3d530b0aa28c3b5`

File details

Details for the file docx_parser_converter-1.0.3-py3-none-any.whl.

File metadata

Download URL: docx_parser_converter-1.0.3-py3-none-any.whl
Upload date: Jan 8, 2026
Size: 142.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: pdm/2.26.2 CPython/3.14.0 Darwin/25.1.0

File hashes

Hashes for docx_parser_converter-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ca3198e4a35565131039bf08b6cd009a70e19f6d4eb96cb8091c5e34a6c162af`
MD5	`491c8caaf8126ed34b625a1ebdced6fa`
BLAKE2b-256	`1a7bd9d9ae69030fae6478757bcf1e5ac558441dcefd03a82a442175cd8e8daa`
