AI-powered HWP/HWPX document processing library for Hamonize

airun-hwp

Features

  • HWPX Parsing: Parse HWPX files with full document structure preservation
  • Ordered Content Extraction: Maintain original document flow with mixed content types
  • Image Extraction: Extract and save all images from documents
  • Table Processing: Extract tables with proper formatting
  • Markdown Conversion: Convert documents to well-structured Markdown
  • PDF Export: Generate PDF files with embedded images
  • CLI Tool: Easy-to-use command-line interface

Installation

Basic Installation

pip install airun-hwp

With PDF Export Support

pip install "airun-hwp[pdf]"

Development Installation

git clone https://github.com/chaeya/airun-hwp.git
cd airun-hwp
pip install -e ".[dev]"

Quick Start

Command Line Interface

# Convert to Markdown
airun-hwp convert document.hwpx --format markdown

# Convert to PDF
airun-hwp convert document.hwpx --format pdf --output ./results

# Process to both formats
airun-hwp process document.hwpx

Python API

from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered

# Parse HWPX file
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract text
text = document.get_all_text()
print(f"Total text length: {len(text)} characters")

# Extract images
images = document.extract_images("./output/images")
print(f"Extracted {len(images)} images")

# Convert to Markdown
markdown_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Save Markdown
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)

Advanced Usage

PDF Generation with Custom Styling

import markdown
import weasyprint
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered

# Parse document
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract images
document.extract_images("./output/images")

# Get Markdown content
md_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Convert to HTML
html = markdown.markdown(md_content, extensions=['tables', 'fenced_code'])

# Add custom CSS
css = """
<style>
    body { font-family: 'Malgun Gothic', Arial, sans-serif; }
    img { max-width: 100%; height: auto; }
    table { border-collapse: collapse; width: 100%; }
    th, td { border: 1px solid #333; padding: 8px; }
</style>
"""

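# Generate PDF (base_url lets WeasyPrint resolve relative image paths;
# write_pdf returns None when given a target path, so nothing is assigned)
weasyprint.HTML(string=css + html, base_url=".").write_pdf("document.pdf")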

Document Structure

The library processes HWPX documents using a token-stream approach that preserves the original document order:

  • Text Runs: Consecutive text segments
  • Images: Embedded images with proper positioning
  • Tables: Structured table data
  • Paragraph Breaks: Logical document divisions
  • Page Breaks: Document pagination
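
To make the token-stream idea concrete, here is a minimal sketch of how an ordered list of tokens could be rendered to Markdown. The class names (TextRun, Image, Table, ParagraphBreak, PageBreak) and the render_markdown function are hypothetical and chosen only to mirror the list above; they are not the library's internal types.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical token types, one per content kind listed above.
@dataclass
class TextRun:
    text: str


@dataclass
class Image:
    path: str
    alt: str = ""


@dataclass
class Table:
    rows: List[List[str]]  # the first row is treated as the header


@dataclass
class ParagraphBreak:
    pass


@dataclass
class PageBreak:
    pass


Token = Union[TextRun, Image, Table, ParagraphBreak, PageBreak]


def render_table(table: Table) -> str:
    """Render a table as a pipe-delimited Markdown table."""
    header, *body = table.rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


def render_markdown(tokens: List[Token]) -> str:
    """Walk the tokens in document order and emit Markdown."""
    parts: List[str] = []
    for tok in tokens:
        if isinstance(tok, TextRun):
            parts.append(tok.text)
        elif isinstance(tok, Image):
            parts.append(f"![{tok.alt}]({tok.path})")
        elif isinstance(tok, Table):
            if tok.rows:  # skip empty tables
                parts.append(render_table(tok))
        elif isinstance(tok, ParagraphBreak):
            parts.append("\n\n")
        elif isinstance(tok, PageBreak):
            parts.append("\n\n---\n\n")
    return "".join(parts)


tokens = [
    TextRun("Intro "),
    TextRun("text."),
    ParagraphBreak(),
    Image("images/image1.png", "figure"),
    ParagraphBreak(),
    Table([["A", "B"], ["1", "2"]]),
    PageBreak(),
    TextRun("Next page."),
]
md = render_markdown(tokens)
```

Because the tokens are kept in a single list, the image and the table appear in the output at the position where they occur in the source rather than being collected at the end.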

CLI Commands

Convert Command

Convert HWPX files to different formats:

airun-hwp convert <input_file> [options]

Options:
  --format {markdown,md,pdf}  Output format (default: markdown)
  --output, -o PATH           Output directory (default: ./output)

Process Command

Convert a document to both Markdown and PDF in one step:

airun-hwp process <input_file> [options]

Options:
  --output, -o PATH           Output directory (default: ./output)
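
For batch jobs, the convert command can be driven from a script. The sketch below only uses the flags documented above; the helper names build_convert_command and convert_all are hypothetical, and it assumes the CLI exits with a non-zero status on failure, which this document does not confirm.

```python
import subprocess
from pathlib import Path
from typing import List

SUPPORTED_FORMATS = {"markdown", "md", "pdf"}  # values listed for --format


def build_convert_command(input_file: str, fmt: str = "markdown",
                          output_dir: str = "./output") -> List[str]:
    """Build the argument list for `airun-hwp convert`."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return ["airun-hwp", "convert", input_file,
            "--format", fmt, "--output", output_dir]


def convert_all(folder: str, fmt: str = "markdown",
                output_dir: str = "./output") -> None:
    """Run the convert command on every .hwpx file in a folder."""
    for path in sorted(Path(folder).glob("*.hwpx")):
        # check=True raises CalledProcessError on a non-zero exit status
        # (assumed here to signal a failed conversion).
        subprocess.run(build_convert_command(str(path), fmt, output_dir),
                       check=True)
```

Calling convert_all("./docs", fmt="pdf") would then convert each HWPX file under ./docs in turn, provided airun-hwp is installed and on the PATH.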

Output Structure

Processing a document named document.hwpx with the default output directory produces:

output/
└── document/
    ├── images/
    │   ├── image1.png
    │   ├── image2.png
    │   └── ...
    ├── document.md
    └── document.pdf
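
If a script needs to locate the generated files, the layout above can be computed from the input file name. This is a small sketch based solely on the tree shown here; expected_outputs is a hypothetical helper, not part of the library.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(input_file: str,
                     output_dir: str = "./output") -> Dict[str, Path]:
    """Return the paths the tree above implies for a given input file."""
    stem = Path(input_file).stem           # "document" for "document.hwpx"
    base = Path(output_dir) / stem         # output/document
    return {
        "images": base / "images",
        "markdown": base / f"{stem}.md",
        "pdf": base / f"{stem}.pdf",
    }


paths = expected_outputs("reports/document.hwpx")
```

Only the file stem is used, so the directory of the input file does not affect where results are expected.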

Dependencies

Core Dependencies

  • pypandoc-hwpx>=0.1.0: HWPX file format support
  • PyYAML>=6.0: YAML configuration parsing
  • Pillow>=10.0.0: Image processing

Optional Dependencies (PDF Export)

  • weasyprint>=60.0: HTML to PDF conversion
  • markdown>=3.5.0: Markdown processing
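
Since the PDF dependencies are optional, a script may want to check for them before attempting an export. The following sketch uses only the standard library; pdf_support_status and pdf_export_available are hypothetical helper names, and the check covers importability only, not the version constraints listed above.

```python
from importlib.util import find_spec
from typing import Dict

# Import names of the optional PDF dependencies listed above.
PDF_MODULES = ("weasyprint", "markdown")


def pdf_support_status() -> Dict[str, bool]:
    """Report, per module, whether it can be found without importing it."""
    return {name: find_spec(name) is not None for name in PDF_MODULES}


def pdf_export_available() -> bool:
    """True only if every optional PDF dependency is installed."""
    return all(pdf_support_status().values())


status = pdf_support_status()
if not pdf_export_available():
    missing = [name for name, ok in status.items() if not ok]
    print("PDF export unavailable; missing:", ", ".join(missing))
```

Using find_spec avoids importing weasyprint just to test for its presence, which keeps the check cheap.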

Development

Running Tests

pytest

Code Coverage

pytest --cov=airun_hwp

Code Formatting and Linting

black airun_hwp/
ruff check airun_hwp/

Type Checking

mypy airun_hwp/

Building for Distribution

# Build source and wheel distributions
python -m build

# Validate the built distributions with twine
twine check dist/*

Publishing to PyPI

# Upload to Test PyPI
twine upload --repository testpypi dist/*

# Upload to PyPI
twine upload dist/*

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Changelog

Version 0.1.0

  • Initial release
  • HWPX parsing support
  • Markdown conversion
  • PDF export functionality
  • CLI tool
  • Image extraction
  • Table processing

Made with ❤️ for the Hamonize project
