Pure Python Word (DOCX) ↔ HTML conversion with guaranteed round-trip fidelity

docwow

docwow converts Word documents to a self-contained HTML representation and back again — without losing a single paragraph indent, table merge, list level, or inline image.

Why docwow?

Existing libraries solve half the problem:

| Library | DOCX → HTML | HTML → DOCX | Round-trip |
|---|---|---|---|
| mammoth | good | | |
| python-docx | | basic | |
| docwow | yes | yes | guaranteed |

The key insight: docwow embeds every piece of Word metadata into data-dw-* HTML attributes alongside the visual CSS. The browser renders the CSS; when you convert back to DOCX, docwow reads the data attributes and reconstructs the original Word XML exactly.
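To make the idea concrete, here is a minimal standard-library sketch of the "read the data attributes back" half. The attribute names used (data-dw-indent-left, data-dw-style-id) are hypothetical and chosen only for illustration; the names docwow actually emits may differ, and this is not docwow's own parser.

```python
from html.parser import HTMLParser


class DwAttrCollector(HTMLParser):
    """Collect data-dw-* attributes per element, ignoring the visual CSS."""

    def __init__(self):
        super().__init__()
        self.elements = []  # list of (tag, {name-without-prefix: value})

    def handle_starttag(self, tag, attrs):
        prefix = "data-dw-"
        meta = {
            name[len(prefix):]: value
            for name, value in attrs
            if name.startswith(prefix)
        }
        if meta:
            self.elements.append((tag, meta))


# Hypothetical markup: the attribute names are illustrative assumptions.
sample = (
    '<p style="margin-left: 36pt" '
    'data-dw-indent-left="720" data-dw-style-id="BodyText">Hello</p>'
)

collector = DwAttrCollector()
collector.feed(sample)
print(collector.elements)
```

The browser only ever looks at the style attribute, while a converter can ignore it entirely and rebuild the Word properties from the collected metadata.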

Install

pip install docwow

Quick Start

import docwow

# DOCX → HTML
html = docwow.to_html("document.docx")

# HTML → DOCX (round-trip)
docwow.to_docx(html, "output.docx")

# Or use the Document object for programmatic editing
doc = docwow.open("document.docx")
para = doc.paragraphs.add_paragraph()
para.runs.add_text("Hello world", bold=True)
doc.to_docx("output.docx")

Feature Support

✅ Supported

| Feature | Notes |
|---|---|
| Paragraphs | Text, alignment, indentation, spacing, keep-together/with-next, page-break-before |
| Run formatting | Bold, italic, underline, strikethrough, font name/size, colour, highlight, superscript/subscript |
| Inline images | PNG, JPEG, GIF, BMP, TIFF, WebP, SVG, EMF, WMF |
| Tables | Column spans, row spans (vMerge), column/row widths, table-level styles |
| Lists | Bullet and numbered, up to 9 nesting levels, decimal/lowerLetter/upperLetter/lowerRoman/upperRoman formats |
| Hyperlinks | External URLs, mailto links |
| Paragraph styles | Style ID round-trip, Heading 1–9 and custom styles |
| Page geometry | Page size, margins |
| Headers & footers | Text content, page number fields, default/first/even slots (see limitations below) |
| Page breaks | Explicit page breaks parsed, written, and round-tripped |
| Programmatic API | Open, edit, and save documents in pure Python |

⚠️ Headers, Footers & Page Numbers — Known Limitations

Headers and footers are supported for DOCX round-trips and basic HTML rendering, but several aspects are incomplete. These are intentional deferments, not bugs.

What works

  • DOCX ↔ DOCX round-trip: all six slots (default/first/even × header/footer), page number fields (PAGE, NUMPAGES, SECTIONPAGES), and the title_pg flag survive a full write → parse cycle with no data loss.
  • HTML rendering: headers and footers with real text content are rendered as <header> / <footer> elements visible in the browser.
  • DOCX → HTML → DOCX round-trip: page-number-only paragraphs (e.g. "Page N of M") are kept as hidden elements in the HTML (display:none) so the fields survive the HTML → DOCX leg. The output DOCX will have a working page-number footer in Word.
  • Print / PDF export: render_document(doc, page_view=True) injects @media print + @page CSS with the correct paper size and margins, so Cmd+P / browser PDF export paginates correctly.
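As an illustration of the print CSS mentioned in the last bullet, the sketch below builds an @page rule inside @media print from a page size and margins. The function name build_print_css and its millimetre-based parameters are assumptions made for this example; the CSS that render_document actually emits may be structured differently.

```python
def build_print_css(width_mm, height_mm, margins_mm):
    """Return print-only CSS that sets the paper size and margins.

    margins_mm is a (top, right, bottom, left) tuple in millimetres.
    Hypothetical helper, not part of the docwow API.
    """
    top, right, bottom, left = margins_mm
    return (
        "@media print {\n"
        "  @page {\n"
        f"    size: {width_mm:g}mm {height_mm:g}mm;\n"
        f"    margin: {top:g}mm {right:g}mm {bottom:g}mm {left:g}mm;\n"
        "  }\n"
        "}\n"
    )


# A4 paper with one-inch (25.4 mm) margins on every side.
css = build_print_css(210, 297, (25.4, 25.4, 25.4, 25.4))
print(css)
```

Because the rule is wrapped in @media print, it has no effect on the on-screen view and only applies when the browser paginates for printing or PDF export.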

What does not work (and why)

1. Page numbers always show "1" in the browser

Page number fields render as a static placeholder 1. Correct values require knowing which page each element lands on, which requires measuring rendered element heights — something only the browser layout engine can do after paint. Without a layout measurement API or a third-party pagination library, this cannot be solved in a pre-render Python step. Possible approach for contributors: use a small post-render JS snippet that walks data-dw-field spans and updates them after the browser has laid out the page, combined with data-dw-page markers on page-break divs.

2. No visual page separation in the browser

Explicit page breaks are preserved as hidden <div class="dw-page-break" data-dw-page="N"> elements but produce no visible gap. Making pages visually discrete requires either CSS @page (print-only, not interactive) or JS that measures element heights to insert separators — again a layout-engine problem. Possible approach: a small inline JS block that reads the dw-page-break markers and inserts visual dividers, sizing each page section to the document's data-dw-page-height.
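The grouping step of that approach can be sketched independently of the browser. The example below is written in Python purely to show the logic (a real implementation would run as JS after layout) and it only handles explicit page breaks, not content that overflows a page naturally. The (css_class, html) block representation and the dw-para class name are assumptions for illustration.

```python
PAGE_BREAK = "dw-page-break"  # class name of the hidden page-break div


def split_into_pages(blocks):
    """Group a flat list of (css_class, html) blocks into pages.

    A block whose class is PAGE_BREAK ends the current page and is not
    itself included in any page.
    """
    pages = [[]]
    for css_class, html in blocks:
        if css_class == PAGE_BREAK:
            pages.append([])
        else:
            pages[-1].append(html)
    return pages


# Hypothetical block list; "dw-para" is an illustrative class name.
blocks = [
    ("dw-para", "<p>Intro</p>"),
    ("dw-para", "<p>More intro</p>"),
    (PAGE_BREAK, ""),
    ("dw-para", "<p>Chapter 1</p>"),
]
pages = split_into_pages(blocks)
for number, content in enumerate(pages, start=1):
    print(number, content)
```

Once blocks are grouped this way, each group can be wrapped in its own page container and a visual divider inserted between containers.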

3. Header and footer appear once, not on every page

In Word, headers and footers repeat at the top/bottom of every page. In HTML there is a single <header> and <footer> element. Making them repeat requires knowing page boundaries (see point 2). Possible approach: same JS pagination pass — once page sections are created, clone the header/footer HTML into each section.

4. first-page and even-page slots not applied in HTML

The title_pg flag and even-page slots are preserved through DOCX round-trips but the HTML renderer emits all slots regardless. No CSS or JS selects the right slot per page. Possible approach: after the JS pagination pass, inspect the data-dw-title-pg attribute on the document div and apply header-first vs header-default to the appropriate page sections.
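The slot-selection rule itself is small. The sketch below follows Word's usual behaviour: the first-page slot applies only when title_pg is set, and the even-page slot applies only when different odd and even headers are enabled. The function name and the even_and_odd parameter (standing in for Word's evenAndOddHeaders document setting) are assumptions, not docwow API.

```python
def pick_header_slot(page_number, title_pg, even_and_odd):
    """Return which header/footer slot applies to a 1-based page number."""
    if title_pg and page_number == 1:
        return "first"
    if even_and_odd and page_number % 2 == 0:
        return "even"
    return "default"


# Slots for the first four pages with both options enabled.
slots = [pick_header_slot(n, title_pg=True, even_and_odd=True) for n in range(1, 5)]
print(slots)
```

The same function works for footers, since Word resolves header and footer slots with the same rule.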

5. Single document section only

DOCX supports per-section headers/footers — a different header can appear after a manual section break mid-document. docwow parses only the last w:sectPr in w:body (the document-level section properties). Mid-document section changes are silently ignored. Possible approach: parse each w:sectPr in the body, associate it with the preceding paragraphs, and model sections explicitly in Document.
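A rough sketch of that approach follows, using the standard library's ElementTree rather than lxml so it runs without extra dependencies. It relies on how WordprocessingML stores sections: a mid-document section's w:sectPr sits inside the w:pPr of the section's last paragraph, while the final section's w:sectPr is a direct child of w:body. The function name split_sections and the tuple-based result are assumptions; docwow's Document model may represent sections differently.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def split_sections(body):
    """Return a list of (blocks, sectPr) tuples, one per section."""
    sections = []
    current = []
    for child in body:
        if child.tag == f"{{{W}}}sectPr":
            # Final section: its sectPr is a direct child of w:body.
            sections.append((current, child))
            current = []
            continue
        current.append(child)
        if child.tag == f"{{{W}}}p":
            # A paragraph carrying a sectPr closes the section it belongs to.
            sect_pr = child.find("w:pPr/w:sectPr", NS)
            if sect_pr is not None:
                sections.append((current, sect_pr))
                current = []
    return sections


xml = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Section one</w:t></w:r></w:p>
    <w:p><w:pPr><w:sectPr/></w:pPr></w:p>
    <w:p><w:r><w:t>Section two</w:t></w:r></w:p>
    <w:sectPr/>
  </w:body>
</w:document>
""".strip()

body = ET.fromstring(xml).find("w:body", NS)
sections = split_sections(body)
print([len(blocks) for blocks, _ in sections])
```

Each returned sectPr can then be parsed for its own header and footer references instead of reading only the last one.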

6. Page number start value not supported

DOCX allows <w:pgNumType w:start="N"/> to start numbering from a value other than 1. Not currently parsed or written. Possible approach: add page_num_start: int = 1 to the Document model and read/write it from w:sectPr.
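Reading and writing that attribute could look roughly like this. The helper names are hypothetical and the sketch uses the standard library's ElementTree; a production writer would also need to place w:pgNumType at the position the schema expects among the other w:sectPr children, which this example does not attempt.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_page_num_start(sect_pr, default=1):
    """Return the w:start value of w:pgNumType, or default if absent."""
    pg_num_type = sect_pr.find("w:pgNumType", NS)
    if pg_num_type is None:
        return default
    raw = pg_num_type.get(f"{{{W}}}start")
    return int(raw) if raw is not None else default


def write_page_num_start(sect_pr, start):
    """Set w:start on w:pgNumType, creating the element if needed."""
    pg_num_type = sect_pr.find("w:pgNumType", NS)
    if pg_num_type is None:
        # Appended at the end for simplicity; schema ordering is not enforced.
        pg_num_type = ET.SubElement(sect_pr, f"{{{W}}}pgNumType")
    pg_num_type.set(f"{{{W}}}start", str(start))


with_start = ET.fromstring(
    f'<w:sectPr xmlns:w="{W}"><w:pgNumType w:start="5"/></w:sectPr>'
)
without_start = ET.fromstring(f'<w:sectPr xmlns:w="{W}"/>')

print(read_page_num_start(with_start))
print(read_page_num_start(without_start))

write_page_num_start(without_start, 3)
print(read_page_num_start(without_start))
```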

🗓 Planned

| Feature | Notes |
|---|---|
| Table of contents | Requires bookmark support |
| Bookmarks | In-document anchor links and TOC targets |
| Comments | Annotations / review marks |
| Track changes | Accept/reject revision marks |
| Footnotes & endnotes | |
| General HTML → DOCX | Best-effort conversion of arbitrary HTML (not just docwow HTML) |

Documentation

Full documentation at docwow.readthedocs.io.

Requirements

  • Python 3.10+
  • lxml
  • Pillow

Built with Claude Code

This library was vibe coded using Claude Code. Community suggestions, bug reports, and PRs are very welcome.

License

MIT
