A package to convert DOCX to HTML and HTML to DOCX with formatting preservation.

DOCX-HTML Converter

This package converts DOCX documents to HTML and back, preserving formatting such as tables, lists, paragraphs, and inline styles. It also supports in-memory conversion through BytesIO objects, so DOCX and HTML data can be handled without writing files to disk.

Features

  • DOCX to HTML conversion: Preserves paragraphs, lists, tables, and inline formatting (bold, italic).
  • HTML to DOCX conversion: Supports lists, tables, paragraphs, and inline styles when converting back.
  • In-memory processing: Use BytesIO to handle DOCX and HTML data in memory, suitable for server-side or real-time applications.
  • Complex formatting: Preserves text alignment, font styles, and indentation during conversion.
  • Binary input/output: Convert between DOCX binary data and HTML strings without intermediate files.

Installation

Install the package from PyPI using pip:

pip install docxhtml-converter
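
To confirm that the install succeeded, one option is to look the package up by its import name. The sketch below uses only the standard library. The import name docxhtml_converter is taken from the usage examples in this README, and the helper name is_installed is hypothetical.

```python
from importlib.util import find_spec


def is_installed(import_name: str) -> bool:
    """Return True if a top-level module or package with this name can be found."""
    return find_spec(import_name) is not None


if __name__ == "__main__":
    # "docxhtml_converter" is the import name used in the usage examples.
    print(is_installed("docxhtml_converter"))
```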

Usage

1. Convert DOCX to HTML

Use the htmlifier function to convert a DOCX file into an HTML file:

from docxhtml_converter.docxhtml import htmlifier

docx_file_path = "input.docx"
html_output_file = "output.html"
htmlifier(docx_file_path, html_output_file)
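
When several documents need converting, the same call can be applied to a whole folder. The following is a minimal sketch, not part of the package: convert_folder is a hypothetical helper, and it assumes only that the converter takes an input path and an output path as strings, as htmlifier does in the example above. The converter is passed in as an argument so the helper can be tried without the package installed.

```python
from pathlib import Path
from typing import Callable, List


def convert_folder(src_dir: str, out_dir: str,
                   convert: Callable[[str, str], object]) -> List[Path]:
    """Call convert(docx_path, html_path) for every .docx file in src_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        # Word keeps temporary owner files named "~$<name>.docx" next to open documents.
        if docx_path.name.startswith("~$"):
            continue
        html_path = out / (docx_path.stem + ".html")
        convert(str(docx_path), str(html_path))
        written.append(html_path)
    return written

# Intended use with this package (assumes htmlifier has been imported as shown above):
#   convert_folder("docs", "html", htmlifier)
```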

2. Convert HTML to DOCX

Use the docxifier function to convert an HTML file back to a DOCX document:

from docxhtml_converter.htmldocx import docxifier

input_html_file = "output.html"
output_docx_file = "regenerated.docx"
docxifier(input_html_file, output_docx_file)

3. Convert DOCX Binary to HTML String

For in-memory operations, use get_html_from_docx_binary to convert DOCX binary data (for example, wrapped in a BytesIO object) into an HTML string:

from io import BytesIO
from docxhtml_converter.docxhtml import get_html_from_docx_binary

# Load DOCX binary data
with open("input.docx", "rb") as f:
    docx_binary = f.read()

# Convert to HTML string
html_string = get_html_from_docx_binary(BytesIO(docx_binary))
print(html_string[:500])  # Print first 500 characters for preview
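
In a server-side setting the bytes often come from an upload rather than a trusted file, so a cheap sanity check before conversion can help. A .docx file is a ZIP archive, which the standard library can detect. The sketch below is an illustration under assumptions: html_from_docx_bytes is a hypothetical wrapper, and the converter is passed in as an argument that is assumed to accept a BytesIO, as get_html_from_docx_binary does above.

```python
import zipfile
from io import BytesIO
from typing import Callable


def html_from_docx_bytes(data: bytes, convert: Callable[[BytesIO], str]) -> str:
    """Wrap raw bytes in a BytesIO and pass them to convert after a ZIP check."""
    stream = BytesIO(data)
    # A .docx file is a ZIP archive, so anything that is not a ZIP cannot be one.
    if not zipfile.is_zipfile(stream):
        raise ValueError("payload is not a ZIP archive, so it is not a DOCX file")
    stream.seek(0)  # the check may move the read position; rewind before converting
    return convert(stream)

# Intended use with this package:
#   html_string = html_from_docx_bytes(docx_binary, get_html_from_docx_binary)
```

The check only rules out payloads that are clearly not DOCX; a ZIP archive that is not a Word document will still pass it.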

4. Convert HTML String to DOCX Binary

To convert an HTML string into DOCX binary data (for example, to build a file in memory before writing it out), use docxifier_from_html_string:

from docxhtml_converter.htmldocx import docxifier_from_html_string

html_content = "<html><body><p>Hello, World!</p></body></html>"
docx_binary = docxifier_from_html_string(html_content)

# Save the DOCX binary output to a file
with open("output.docx", "wb") as f:
    f.write(docx_binary.read())
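
The example above calls .read() on the result, which suggests that docxifier_from_html_string returns a file-like object, presumably a BytesIO (an assumption; the return type is not stated here). Note that read() returns only the data after the current position, so a second read() on the same stream yields empty bytes. The hypothetical helper below sketches one way to get the full content regardless of position.

```python
from io import BytesIO
from typing import Union


def as_bytes(result: Union[bytes, bytearray, BytesIO]) -> bytes:
    """Return the full content of a bytes-like value or an in-memory stream."""
    if isinstance(result, (bytes, bytearray)):
        return bytes(result)
    # getvalue() returns the whole buffer regardless of the current read position,
    # whereas read() returns only what lies after it.
    return result.getvalue()
```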

Example Script

Here is a complete example demonstrating file-based and in-memory conversions:

from io import BytesIO
from docxhtml_converter.docxhtml import htmlifier, get_html_from_docx_binary
from docxhtml_converter.htmldocx import docxifier, docxifier_from_html_string

# Step 1: Convert DOCX to HTML
docx_file = "input.docx"
html_file = "output.html"
htmlifier(docx_file, html_file)
print(f"Converted DOCX to HTML: {html_file}")

# Step 2: Convert HTML back to DOCX
regenerated_docx_file = "regenerated.docx"
docxifier(html_file, regenerated_docx_file)
print(f"Converted HTML back to DOCX: {regenerated_docx_file}")

# Step 3: Convert DOCX binary to HTML string
with open(docx_file, "rb") as f:
    docx_binary_data = f.read()

html_string = get_html_from_docx_binary(BytesIO(docx_binary_data))
print(f"Generated HTML string from DOCX binary: {html_string[:500]}")

# Step 4: Convert HTML string back to DOCX binary
docx_binary_output = docxifier_from_html_string(html_string)

# Save the DOCX binary to a file
final_docx_file = "final_output.docx"
with open(final_docx_file, "wb") as f:
    f.write(docx_binary_output.read())
print(f"Final DOCX saved at: {final_docx_file}")
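
After a round trip like the one above, a quick way to check that no text was lost is to compare the text content of two HTML strings. The sketch below uses the standard library's html.parser. The names text_content and _TextCollector are hypothetical and not part of this package.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def text_content(html: str) -> str:
    """Return the text of html with tags dropped and whitespace runs collapsed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())

# Possible use: compare text_content(html_string) with the text content of HTML
# produced from the regenerated DOCX.
```

This compares text only; it says nothing about whether tables, lists, or styles survived the conversion.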

Download files

Source Distribution

docxhtml-converter-0.1.3.tar.gz (7.6 kB)

Uploaded Source

Built Distribution

docxhtml_converter-0.1.3-py3-none-any.whl (8.3 kB)

Uploaded Python 3

File details

Details for the file docxhtml-converter-0.1.3.tar.gz.

File metadata

  • Download URL: docxhtml-converter-0.1.3.tar.gz
  • Size: 7.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.11

File hashes

Hashes for docxhtml-converter-0.1.3.tar.gz
Algorithm Hash digest
SHA256 e20e8da032e9cad3bbd4397b07fd29bc52c3cb31aa533f156f5499d5f3e86170
MD5 9376300e9b461127a6db800ef25f1bb0
BLAKE2b-256 266c17051ee6a7932dc9dcafbe12f3378d4606987eb75a531f2276ff324e95f5

File details

Details for the file docxhtml_converter-0.1.3-py3-none-any.whl.

File hashes

Hashes for docxhtml_converter-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 5471b60d21e6fd2ec727dbaf32d78e1af21a32e536fa1025e37d01c07e1e7f40
MD5 2d8f54864c6f31117892d9b6323a2d62
BLAKE2b-256 be88f322717d71dc9e02e919510d04c20a2e89d400574709cf7385432e81beb1
