airun-hwp

AI-powered HWP/HWPX document processing library for Hamonize

Features

  • HWPX Parsing: Parse HWPX files with full document structure preservation
  • HWP Text Extraction: Extract plain text from HWP files (structure not preserved)
  • Ordered Content Extraction: Maintain original document flow with mixed content types (HWPX only)
  • Image Extraction: Extract and save all images from documents
  • Table Processing: Extract tables with proper formatting (HWPX only)
  • Markdown Conversion: Convert documents to well-structured Markdown
  • PDF Export: Generate PDF files with embedded images (included by default)
  • CLI Tool: Easy-to-use command-line interface

Installation

pip install airun-hwp

Note: PDF export functionality is included by default.

Development Installation

git clone https://github.com/chaeya/airun-hwp.git
cd airun-hwp
pip install -e ".[dev]"

Quick Start

Command Line Interface

# Convert to Markdown
airun-hwp convert document.hwpx --format markdown

# Convert to PDF
airun-hwp convert document.hwpx --format pdf --output ./results

# Convert to both Markdown and PDF in one step
airun-hwp process document.hwpx
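
For a folder of documents, the same commands can be driven from a short script. The sketch below only builds the argument lists, using the `convert` subcommand and the `--format` and `--output` flags shown above. The helper name `build_convert_commands` and the folder layout are illustrative assumptions, not part of airun-hwp.

```python
from pathlib import Path


def build_convert_commands(folder, fmt="markdown", output="./output"):
    """Return one `airun-hwp convert` argument list per .hwpx file in folder.

    Hypothetical helper: only the subcommand and flags documented in this
    README are used.
    """
    commands = []
    for path in sorted(Path(folder).glob("*.hwpx")):
        commands.append(
            ["airun-hwp", "convert", str(path), "--format", fmt, "--output", output]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once airun-hwp is installed.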

Python API

from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered
from airun_hwp.reader.hwpx_to_markdown import extract_text_from_file

# Parse HWPX file (full structure preserved)
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract text
text = document.get_all_text()
print(f"Total text length: {len(text)} characters")

# Extract images
images = document.extract_images("./output/images")
print(f"Extracted {len(images)} images")

# Convert to Markdown with tables
markdown_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Save Markdown
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)

# For HWP files (plain text only)
hwp_text = extract_text_from_file("document.hwp")
print(f"HWP text (tables not preserved): {len(hwp_text)} characters")

Advanced Usage

PDF Generation with Custom Styling

import markdown
import weasyprint
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered

# Parse document
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract images
document.extract_images("./output/images")

# Get Markdown content
md_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Convert to HTML
html = markdown.markdown(md_content, extensions=['tables', 'fenced_code'])

# Add custom CSS
css = """
<style>
    body { font-family: 'Malgun Gothic', Arial, sans-serif; }
    img { max-width: 100%; height: auto; }
    table { border-collapse: collapse; width: 100%; }
    th, td { border: 1px solid #333; padding: 8px; }
</style>
"""

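# Generate PDF (write_pdf returns None when given a target path).
# base_url lets WeasyPrint resolve relative image paths; "." assumes the
# Markdown references images relative to the current directory.
weasyprint.HTML(string=css + html, base_url=".").write_pdf("document.pdf")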

Document Structure

The library processes HWPX documents using a token-stream approach that preserves the original document order:

  • Text Runs: Consecutive text segments
  • Images: Embedded images with proper positioning
  • Tables: Structured table data
  • Paragraph Breaks: Logical document divisions
  • Page Breaks: Document pagination
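
To make the token-stream idea concrete, here is a minimal sketch of how an ordered stream of mixed tokens could be rendered to Markdown in document order. The `Token` class, its `kind` values, and `render_markdown` are hypothetical names chosen for illustration; they are not airun-hwp's internal types.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical token: kind is 'text', 'image', 'table', 'para_break' or 'page_break'."""
    kind: str
    text: str = ""  # text content, or the image path for 'image' tokens
    rows: list = field(default_factory=list)  # used by 'table' tokens only


def render_markdown(tokens):
    """Walk the stream once, emitting each token in its original position."""
    parts = []
    for tok in tokens:
        if tok.kind == "text":
            parts.append(tok.text)
        elif tok.kind == "image":
            parts.append(f"\n\n![]({tok.text})\n\n")
        elif tok.kind == "table":
            # Assumes the first row is the header and at least one row exists.
            header, *body = tok.rows
            lines = ["| " + " | ".join(header) + " |",
                     "| " + " | ".join("---" for _ in header) + " |"]
            lines += ["| " + " | ".join(row) + " |" for row in body]
            parts.append("\n\n" + "\n".join(lines) + "\n\n")
        elif tok.kind == "para_break":
            parts.append("\n\n")
        elif tok.kind == "page_break":
            parts.append("\n\n---\n\n")
    return "".join(parts)


stream = [
    Token("text", "Intro"),
    Token("para_break"),
    Token("image", "images/image1.png"),
    Token("table", rows=[["A", "B"], ["1", "2"]]),
]
print(render_markdown(stream))
```

Because the renderer never reorders or groups tokens by type, an image that sits between two paragraphs in the source stays between them in the output.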

CLI Commands

Convert Command

Convert HWPX files to different formats:

airun-hwp convert <input_file> [options]

Options:
  --format {markdown,md,pdf}  Output format (default: markdown)
  --output, -o PATH           Output directory (default: ./output)

Process Command

Process a document into both Markdown and PDF:

airun-hwp process <input_file> [options]

Options:
  --output, -o PATH           Output directory (default: ./output)
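
The documented command surface maps naturally onto `argparse` subcommands. The sketch below reproduces only the options listed above, as a way to see how the defaults interact. It is an illustration and not the actual airun-hwp CLI source.

```python
import argparse


def build_parser():
    """Mirror the documented `convert` and `process` options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="airun-hwp")
    sub = parser.add_subparsers(dest="command", required=True)

    convert = sub.add_parser("convert", help="Convert a file to one format")
    convert.add_argument("input_file")
    convert.add_argument("--format", choices=["markdown", "md", "pdf"],
                         default="markdown")
    convert.add_argument("--output", "-o", default="./output", metavar="PATH")

    process = sub.add_parser("process", help="Produce both Markdown and PDF")
    process.add_argument("input_file")
    process.add_argument("--output", "-o", default="./output", metavar="PATH")
    return parser


args = build_parser().parse_args(["convert", "document.hwpx", "--format", "pdf"])
print(args.command, args.format, args.output)  # prints: convert pdf ./output
```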

HWP vs HWPX: Important Differences

This library handles HWP and HWPX files differently due to their fundamental format differences:

HWPX Files (Recommended)

  • Format: XML-based, open standard
  • Structure: Preserves full document structure
  • Tables: ✅ Extracted with proper formatting
  • Images: ✅ Extracted with positioning
  • Layout: Maintains original document flow

HWP Files (Limited Support)

  • Format: Binary, proprietary format
  • Structure: Only plain text extraction available
  • Tables: ❌ Not preserved (extracted as plain text only)
  • Images: ❌ Cannot preserve original position/sequence
  • Layout: Original structure and order lost

Recommendation

For best results, use HWPX files. If you only have HWP files, either:

  1. Convert them to HWPX in Hanword (한글) before processing, or
  2. Use the library for plain-text extraction only
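
In a script that receives both kinds of file, the choice can be made from the file extension. The following is a minimal sketch; `choose_route` and its return labels are hypothetical and only encode the distinction described in this section.

```python
from pathlib import Path


def choose_route(path):
    """Pick a processing route from the file extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".hwpx":
        return "structured"  # full parse: tables, images, ordered content
    if suffix == ".hwp":
        return "text-only"   # plain text only; convert to HWPX for more
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
```

A caller could then use `HWPXReaderOrdered` for the "structured" route and `extract_text_from_file` for the "text-only" route, as in the Python API example.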

Output Structure

When processing a document named document.hwpx:

output/
└── document/
    ├── images/
    │   ├── image1.png
    │   ├── image2.png
    │   └── ...
    ├── document.md
    └── document.pdf
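
If a script needs to locate these files after a run, the layout above can be expressed with `pathlib`. This is a sketch that assumes the directory naming shown in the tree; `expected_outputs` is a hypothetical helper, not a function provided by airun-hwp.

```python
from pathlib import Path


def expected_outputs(input_file, output_dir="./output"):
    """Return the paths implied by the layout above (hypothetical helper)."""
    stem = Path(input_file).stem
    base = Path(output_dir) / stem
    return {
        "images": base / "images",
        "markdown": base / f"{stem}.md",
        "pdf": base / f"{stem}.pdf",
    }


for name, path in expected_outputs("document.hwpx").items():
    print(name, path.as_posix())
```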

Dependencies

  • pypandoc-hwpx>=0.1.0: HWPX file format support
  • PyYAML>=6.0: YAML configuration parsing
  • Pillow>=10.0.0: Image processing
  • weasyprint>=60.0: HTML to PDF conversion (included)
  • markdown>=3.5.0: Markdown processing (included)

Development

Running Tests

pytest

Code Coverage

pytest --cov=airun_hwp

Code Formatting

black airun_hwp/
ruff check airun_hwp/

Type Checking

mypy airun_hwp/

Building for Distribution

# Build source and wheel distributions
python -m build

# Check the built distributions with twine
twine check dist/*

Publishing to PyPI

# Upload to Test PyPI
twine upload --repository testpypi dist/*

# Upload to PyPI
twine upload dist/*

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Changelog

Version 0.2.5

  • Fixed get_all_text() method to properly extract text from token stream
  • Improved text extraction to handle both tokens and paragraphs
  • Added deduplication to prevent duplicate text extraction
  • Updated documentation to clarify HWP vs HWPX limitations

Version 0.2.0

  • HWPX parsing support
  • Markdown conversion
  • PDF export functionality
  • CLI tool
  • Image extraction
  • Table processing

Version 0.1.0

  • Initial release

Made with ❤️ for the Hamonize project
