docwow

Pure Python Word (DOCX) ↔ HTML conversion with guaranteed round-trip fidelity.

docwow converts Word documents to a self-contained HTML representation and back again — without losing a single paragraph indent, table merge, list level, or inline image.

Why docwow?

Existing libraries solve half the problem:

Library	DOCX → HTML	HTML → DOCX	Round-trip
mammoth	good	—	—
python-docx	—	basic	—
docwow	yes	yes	guaranteed

The key insight: docwow embeds every piece of Word metadata into data-dw-* HTML attributes alongside the visual CSS. The browser renders the CSS; when you convert back to DOCX, docwow reads the data attributes and reconstructs the original Word XML exactly.
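
To make the idea concrete, here is a minimal sketch of the read-back half using only the standard library. The attribute names `data-dw-indent-left` and `data-dw-align` and the helper `collect_dw_metadata` are hypothetical, chosen for illustration; docwow's actual attribute vocabulary may differ.

```python
from __future__ import annotations

from html.parser import HTMLParser

PREFIX = "data-dw-"


class DwAttributeCollector(HTMLParser):
    """Collect data-dw-* attributes from every start tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[tuple[str, dict[str, str]]] = []

    def handle_starttag(self, tag, attrs):
        # Keep only the metadata attributes; the CSS in `style` is for the browser.
        meta = {k[len(PREFIX):]: v or "" for k, v in attrs if k.startswith(PREFIX)}
        if meta:
            self.records.append((tag, meta))


def collect_dw_metadata(html: str) -> list[tuple[str, dict[str, str]]]:
    parser = DwAttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical markup: the visual CSS and the Word metadata sit side by side.
sample = (
    '<p style="margin-left:0.5in;text-align:center" '
    'data-dw-indent-left="720" data-dw-align="center">Hello</p>'
)
records = collect_dw_metadata(sample)
print(records)
```

The point of the design is that the two representations never need to be reconciled: the browser ignores the data attributes, and the HTML → DOCX leg can ignore the CSS.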

Install

pip install docwow

Quick Start

import docwow

# DOCX → HTML
html = docwow.to_html("document.docx")

# HTML → DOCX (round-trip)
docwow.to_docx(html, "output.docx")

# Or use the Document object for programmatic editing
doc = docwow.open("document.docx")
para = doc.paragraphs.add_paragraph()
para.runs.add_text("Hello world", bold=True)
doc.to_docx("output.docx")

Feature Support

✅ Supported

Feature	Notes
Paragraphs	Text, alignment, indentation, spacing, keep-together/with-next, page-break-before
Run formatting	Bold, italic, underline, strikethrough, small caps, all caps, font name/size, colour, highlight, superscript/subscript
Tab stops	Custom paragraph tab stops (`w:tabs`), tab character runs (`w:tab`), `set_tab_stops()` API, full round-trip
Cross-references	REF fields linking to named bookmarks; renders as `<a class="dw-xref">`, `MutableCrossRef` API, full round-trip
Multiple sections	Multiple `w:sectPr` with independent page size, margins, and break type; `MutableSectionBreak` API, full round-trip
Inline images	PNG, JPEG, GIF, BMP, TIFF, WebP, SVG, EMF, WMF
Tables	Column spans, row spans (vMerge), column/row widths, table-level styles; fully editable via programmatic API
Lists	Bullet and numbered, up to 9 nesting levels, decimal/lowerLetter/upperLetter/lowerRoman/upperRoman formats
Hyperlinks	External URLs, mailto links
Paragraph styles	Style ID round-trip, Heading 1–9 and custom styles
Page geometry	Page size, margins
Headers & footers	Text content, page number fields, default/first/even slots — see limitations below
Page breaks	Explicit page breaks parsed, written, and round-tripped
Footnotes & endnotes	Parse, render to HTML, HTML → DOCX round-trip, and programmatic API
Bookmarks	Parse `w:bookmarkStart`, render as `<a id="…">` anchors, full round-trip, `MutableBookmark` API
Table of Contents	Parse `w:sdt` TOC blocks, render as `<nav class="dw-toc">`, full round-trip, `MutableTableOfContents` API
Comments	Parse `word/comments.xml`, render as superscript markers with CSS hover popups in HTML, full round-trip, `MutableComment` API
Track changes	Parse `w:ins`/`w:del`, render as green underline / red strikethrough with hover popup (author, date, Accept/Reject buttons) in HTML, full round-trip, `MutableTrackedChange` API
Programmatic API	Open, edit, and save documents in pure Python
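
As an illustration of the row-span entry in the table above: Word marks a vertical merge with `w:vMerge` (`restart` on the first cell, a continuation on the cells below it), while HTML puts a single `rowspan` on the first cell. The sketch below shows one way that translation could work. The function name and the list-of-states input are assumptions for illustration, not docwow's internal representation.

```python
from __future__ import annotations


def rowspans(column: list[str | None]) -> list[int]:
    """Return the HTML rowspan for each cell in one table column.

    Each entry is the cell's vMerge state: "restart" starts a merged run,
    "continue" extends the run above, and None is an unmerged cell.
    A result of 0 marks a cell that is absorbed by the run above it.
    """
    spans = [1] * len(column)
    anchor = None  # index of the cell that started the current run
    for i, state in enumerate(column):
        if state == "restart":
            anchor = i
        elif state == "continue" and anchor is not None:
            spans[anchor] += 1
            spans[i] = 0
        else:
            anchor = None  # an unmerged cell (or a stray "continue") ends the run
    return spans


print(rowspans(["restart", "continue", "continue", None, "restart", "continue"]))
```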

⚠️ Headers, Footers & Page Numbers — Known Limitations

Headers and footers are supported for DOCX round-trips and basic HTML rendering, but several aspects are incomplete. These gaps are deliberate deferrals, not bugs.

What works

DOCX ↔ DOCX round-trip — all six slots (default/first/even × header/footer), page number fields (PAGE, NUMPAGES, SECTIONPAGES), and the title_pg flag survive a full write → parse cycle with no data loss.
HTML rendering — headers and footers with real text content are rendered as <header> / <footer> elements visible in the browser.
DOCX → HTML → DOCX round-trip — page-number-only paragraphs (e.g. "Page N of M") are kept as hidden elements in the HTML (display:none) so the fields survive the HTML → DOCX leg. The output DOCX will have a working page-number footer in Word.
Print / PDF export — render_document(doc, page_view=True) injects @media print + @page CSS with the correct paper size and margins, so Cmd+P / browser PDF export paginates correctly.
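
The print CSS mentioned above is mostly unit conversion: WordprocessingML stores page size and margins in twips (1/1440 inch), and CSS `@page` accepts lengths in inches. The following is a minimal sketch under that assumption. The function names are hypothetical, it emits only the `@page` rule, and it is not the actual implementation behind `render_document`.

```python
TWIPS_PER_INCH = 1440  # OOXML lengths are in twentieths of a point


def twips_to_in(twips: int) -> str:
    """Format a twip measurement as a CSS length in inches."""
    return f"{twips / TWIPS_PER_INCH:g}in"


def page_css(width: int, height: int, top: int, right: int, bottom: int, left: int) -> str:
    """Build an @page rule from page geometry given in twips."""
    size = f"{twips_to_in(width)} {twips_to_in(height)}"
    margin = " ".join(twips_to_in(m) for m in (top, right, bottom, left))
    return f"@page {{ size: {size}; margin: {margin}; }}"


# US Letter (12240 x 15840 twips) with one-inch margins.
letter_css = page_css(12240, 15840, 1440, 1440, 1440, 1440)
print(letter_css)
```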

What does not work (and why)

1. Page numbers always show "1" in the browser

Page number fields render as a static placeholder 1. Correct values require knowing which page each element lands on, which requires measuring rendered element heights — something only the browser layout engine can do after paint. Without a layout measurement API or a third-party pagination library, this cannot be solved in a pre-render Python step. Possible approach for contributors: use a small post-render JS snippet that walks data-dw-field spans and updates them after the browser has laid out the page, combined with data-dw-page markers on page-break divs.
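
A partial workaround is possible without layout measurement when a document's pagination is driven by explicit page breaks. The sketch below is not the JS approach described above: it is a Python approximation that counts only `dw-page-break` markers and ignores automatic page flow entirely, so it will be wrong for any document whose content overflows a page. The attribute value `data-dw-field="PAGE"` and the function name are assumptions, and regex matching on HTML is used only to keep the example short.

```python
import re

TOKEN = re.compile(
    r'<div class="dw-page-break"[^>]*>'  # explicit page-break marker
    r'|(<span[^>]*data-dw-field="PAGE"[^>]*>)[^<]*(</span>)'  # page field span
)


def fill_page_numbers(html: str) -> str:
    """Rewrite PAGE field placeholders using explicit page breaks only."""
    page = 1

    def replace(match: re.Match) -> str:
        nonlocal page
        if match.group(1) is None:  # matched a break marker, not a field
            page += 1
            return match.group(0)
        return f"{match.group(1)}{page}{match.group(2)}"

    return TOKEN.sub(replace, html)


sample = (
    '<p><span data-dw-field="PAGE">1</span></p>'
    '<div class="dw-page-break" data-dw-page="2"></div>'
    '<p><span data-dw-field="PAGE">1</span></p>'
)
numbered = fill_page_numbers(sample)
print(numbered)
```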

2. No visual page separation in the browser

Explicit page breaks are preserved as hidden <div class="dw-page-break" data-dw-page="N"> elements but produce no visible gap. Making pages visually discrete requires either CSS @page (print-only, not interactive) or JS that measures element heights to insert separators — again a layout-engine problem. Possible approach: a small inline JS block that reads the dw-page-break markers and inserts visual dividers, sizing each page section to the document's data-dw-page-height.

3. Header and footer appear once, not on every page

In Word, headers and footers repeat at the top/bottom of every page. In HTML there is a single <header> and <footer> element. Making them repeat requires knowing page boundaries (see point 2). Possible approach: same JS pagination pass — once page sections are created, clone the header/footer HTML into each section.

4. first-page and even-page slots not applied in HTML

The title_pg flag and even-page slots are preserved through DOCX round-trips but the HTML renderer emits all slots regardless. No CSS or JS selects the right slot per page. Possible approach: after the JS pagination pass, inspect the data-dw-title-pg attribute on the document div and apply header-first vs header-default to the appropriate page sections.
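
The selection rule itself is small once page boundaries are known. A minimal sketch, assuming the section's `title_pg` flag and Word's document-level even/odd headers setting (`w:evenAndOddHeaders`) are both available to the caller; the function name, the `even_and_odd` parameter, and the returned slot labels are hypothetical.

```python
def pick_slot(page_number: int, title_pg: bool, even_and_odd: bool) -> str:
    """Return which header/footer slot applies to a page of a section.

    page_number is 1-based and counted from the start of the section.
    """
    if page_number == 1 and title_pg:
        return "first"
    if even_and_odd and page_number % 2 == 0:
        return "even"
    return "default"


for n in (1, 2, 3):
    print(n, pick_slot(n, title_pg=True, even_and_odd=True))
```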

5. Page number start value not supported

DOCX allows <w:pgNumType w:start="N"/> to start numbering from a value other than 1. This element is not currently parsed or written. Possible approach: add page_num_start: int = 1 to the Document model and read/write it from w:sectPr.
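
The read side of that approach could look like the sketch below. It uses the standard library's `xml.etree.ElementTree` to stay self-contained (docwow itself depends on lxml), and the function name `read_page_num_start` is hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_page_num_start(sect_pr_xml: str, default: int = 1) -> int:
    """Return the w:start value of w:pgNumType inside a w:sectPr, or the default."""
    sect_pr = ET.fromstring(sect_pr_xml)
    pg_num_type = sect_pr.find(f"{{{W_NS}}}pgNumType")
    if pg_num_type is None:
        return default
    start = pg_num_type.get(f"{{{W_NS}}}start")
    return int(start) if start is not None else default


with_start = f'<w:sectPr xmlns:w="{W_NS}"><w:pgNumType w:start="5"/></w:sectPr>'
without_start = f'<w:sectPr xmlns:w="{W_NS}"/>'
print(read_page_num_start(with_start), read_page_num_start(without_start))
```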

🗓 Roadmap

The project follows a phased plan. Contributors are welcome at any level.

Phase 2 — General HTML → DOCX (next)

Best-effort conversion of arbitrary HTML (not just docwow HTML) into DOCX. This makes docwow useful as a general-purpose HTML-to-Word exporter.

Scope: h1–h6, p, b/strong, i/em, u, s, span[style], table, ul/ol/li, img, a, br. The plan is to map inline CSS properties (font-size, color, font-weight, etc.) to Word run formatting, and to handle nested structures and reasonable edge cases.
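
The CSS mapping could start from something like the sketch below, which handles a handful of properties and silently ignores anything it does not recognise. The function name and the dictionary keys are hypothetical. The one Word-specific detail is that run font size (`w:sz`) is stored in half-points, so 12pt becomes 24.

```python
from __future__ import annotations

import re


def css_to_run_props(style: str) -> dict[str, object]:
    """Map a subset of inline CSS declarations to run-formatting values."""
    props: dict[str, object] = {}
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if not sep:
            continue  # empty or malformed declaration
        name, value = name.strip().lower(), value.strip().lower()
        if name == "font-weight":
            props["bold"] = value in ("bold", "bolder") or (
                value.isdigit() and int(value) >= 600
            )
        elif name == "font-style":
            props["italic"] = value in ("italic", "oblique")
        elif name == "font-size":
            size = re.fullmatch(r"(\d+(?:\.\d+)?)pt", value)
            if size:
                props["size_half_points"] = round(float(size.group(1)) * 2)
        elif name == "color" and re.fullmatch(r"#[0-9a-f]{6}", value):
            props["color"] = value[1:].upper()
    return props


print(css_to_run_props("font-weight: bold; font-size: 12pt; color: #ff0000"))
```

A best-effort converter has to decide what to do with values outside this subset (named colours, `em` sizes, `rgb()` notation); dropping them, as here, is the simplest choice but not the only one.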

Estimated effort: 1–2 days. Entry point: docwow/html_parser/.

Phase 2b — Floating images and text boxes

Scope: wp:anchor-positioned images, text wrapping modes, and w:txbx inline text boxes. At present, anchored images are silently skipped during conversion.

Estimated effort: 4–6 hours.

Phase 3 — Tier 2 Word features

Individual features, each ~1–3 hours, all following the same 5-layer pattern (parser → renderer → html_parser → writer → API):

Feature	OOXML element	Notes
Paragraph borders	`w:pBdr`	Box, shadow, bar borders
Columns	`w:cols` in `w:sectPr`	Multi-column layouts
Field codes	`w:instrText`	DATE, AUTHOR, TITLE fields
Bidi / RTL text	`w:bidi`, `w:rtl`	Right-to-left paragraphs
Hidden text	`w:vanish`	`display:none` in HTML
Per-section headers/footers	`w:headerReference` in inline `w:sectPr`	Different header after section break
Page number start	`w:pgNumType w:start`	Section-level page number reset

How to contribute

Every feature follows the same 5-layer pattern — read CLAUDE.md for the full contributor guide. One branch per feature, all layers in one PR (parser + renderer + html_parser + writer + API + tests + docs).

Documentation

Full documentation is available at docwow.readthedocs.io.

Requirements

Python 3.10+
lxml
Pillow

Built with Claude Code

This library was vibe coded using Claude Code. Community suggestions, bug reports, and PRs are very welcome.

License

MIT

Release history

Version	Released
1.0.2	Apr 22, 2026
1.0.1	Apr 19, 2026
1.0.0	Apr 18, 2026
0.9.0	Apr 18, 2026
0.8.0 (this version)	Apr 17, 2026
0.7.0	Apr 17, 2026
0.6.0	Apr 17, 2026
0.5.0	Apr 13, 2026
0.4.0	Apr 13, 2026
0.3.0	Apr 13, 2026
0.2.0	Apr 13, 2026
0.1.0	Apr 10, 2026
