airun-hwp

AI-powered HWP/HWPX document processing library for Hamonize

Features

HWPX Parsing: Parse HWPX files with full document structure preservation
Ordered Content Extraction: Maintain original document flow with mixed content types
Image Extraction: Extract and save all images from documents
Table Processing: Extract tables with proper formatting
Markdown Conversion: Convert documents to well-structured Markdown
PDF Export: Generate PDF files with embedded images (included by default)
CLI Tool: Easy-to-use command-line interface

Installation

pip install airun-hwp

Note: PDF export functionality is included by default.

Development Installation

git clone https://github.com/chaeya/airun-hwp.git
cd airun-hwp
pip install -e ".[dev]"

Quick Start

Command Line Interface

# Convert to Markdown
airun-hwp convert document.hwpx --format markdown

# Convert to PDF
airun-hwp convert document.hwpx --format pdf --output ./results

# Convert to both Markdown and PDF in one step
airun-hwp process document.hwpx

Python API

from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered

# Parse HWPX file
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract text
text = document.get_all_text()
print(f"Total text length: {len(text)} characters")

# Extract images
images = document.extract_images("./output/images")
print(f"Extracted {len(images)} images")

# Convert to Markdown
markdown_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Save Markdown
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
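
The same calls can be applied to a whole folder. The sketch below is a minimal illustration, not part of the library: `convert_directory` is a hypothetical helper, and it only assumes that the reader and document objects expose the `parse`, `extract_images`, and `to_markdown_ordered` methods used in the example above.

```python
from pathlib import Path


def convert_directory(reader, src_dir, out_dir):
    """Convert every .hwpx file directly inside src_dir to Markdown in out_dir.

    `reader` is any object with a parse(path) method returning a document that
    exposes extract_images() and to_markdown_ordered() (hypothetical helper).
    Returns the list of Markdown files written.
    """
    written = []
    for hwpx_path in sorted(Path(src_dir).glob("*.hwpx")):
        # One sub-directory per document keeps image names from colliding.
        doc_dir = Path(out_dir) / hwpx_path.stem
        images_dir = doc_dir / "images"
        images_dir.mkdir(parents=True, exist_ok=True)

        document = reader.parse(str(hwpx_path))
        document.extract_images(str(images_dir))
        markdown_content = document.to_markdown_ordered(
            include_metadata=True,
            images_dir=str(images_dir),
        )

        md_path = doc_dir / f"{hwpx_path.stem}.md"
        md_path.write_text(markdown_content, encoding="utf-8")
        written.append(md_path)
    return written
```

With the package installed, this could be called as convert_directory(HWPXReaderOrdered(), "./docs", "./output"). Passing the reader in as an argument also makes the loop easy to exercise with a stub object.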

Advanced Usage

PDF Generation with Custom Styling

import markdown
import weasyprint
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered

# Parse document
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract images
document.extract_images("./output/images")

# Get Markdown content
md_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Convert to HTML
html = markdown.markdown(md_content, extensions=['tables', 'fenced_code'])

# Add custom CSS
css = """
<style>
    body { font-family: 'Malgun Gothic', Arial, sans-serif; }
    img { max-width: 100%; height: auto; }
    table { border-collapse: collapse; width: 100%; }
    th, td { border: 1px solid #333; padding: 8px; }
</style>
"""

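# Generate PDF (write_pdf returns None when given a target path).
# base_url lets relative image paths in the HTML resolve; "." assumes the
# Markdown references images relative to the current directory.
weasyprint.HTML(string=css + html, base_url=".").write_pdf("document.pdf")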
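
The example above hands the renderer a style block followed by an HTML fragment. If a complete HTML document is preferred (explicit charset, language, and title), a small wrapper such as the one below could be used. This is a sketch only: `build_html_document` is a hypothetical helper, not part of airun-hwp, and the default `lang="ko"` is an assumption.

```python
from html import escape


def build_html_document(body_html, css_rules, title="document", lang="ko"):
    """Wrap an HTML fragment in a complete HTML5 document.

    body_html is inserted as-is because it is already HTML; only the title
    and the lang attribute are escaped.
    """
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{escape(lang, quote=True)}">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        f"<style>\n{css_rules}\n</style>\n"
        "</head>\n"
        f"<body>\n{body_html}\n</body>\n"
        "</html>\n"
    )


page = build_html_document(
    "<h1>Report</h1>",
    "body { font-family: sans-serif; }",
    title="Q&A",
)
print(page)
```

The returned string can be passed as the string argument to weasyprint.HTML. Note that css_rules here holds bare CSS rules, without the surrounding style tags.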

Document Structure

The library processes HWPX documents using a token-stream approach that preserves the original document order:

Text Runs: Consecutive text segments
Images: Embedded images with proper positioning
Tables: Structured table data
Paragraph Breaks: Logical document divisions
Page Breaks: Document pagination
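
To make the idea of an ordered token stream concrete, here is a minimal sketch of how such a stream could be modelled and rendered. The class and function names are hypothetical and chosen for illustration; they are not the library's internal types, and the rendering rules (for example, a horizontal rule for a page break) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextRun:
    text: str


@dataclass
class Image:
    path: str
    alt: str = ""


@dataclass
class Table:
    rows: List[List[str]]  # the first row is treated as the header


@dataclass
class ParagraphBreak:
    pass


@dataclass
class PageBreak:
    pass


Token = Union[TextRun, Image, Table, ParagraphBreak, PageBreak]


def render_markdown(tokens: List[Token]) -> str:
    """Render tokens to Markdown strictly in the order they appear."""
    parts: List[str] = []
    for token in tokens:
        if isinstance(token, TextRun):
            parts.append(token.text)
        elif isinstance(token, Image):
            parts.append(f"\n\n![{token.alt}]({token.path})\n\n")
        elif isinstance(token, Table):
            if not token.rows:
                continue  # nothing to render for an empty table
            header, *body = token.rows
            lines = [
                "| " + " | ".join(header) + " |",
                "| " + " | ".join("---" for _ in header) + " |",
            ]
            lines += ["| " + " | ".join(row) + " |" for row in body]
            parts.append("\n\n" + "\n".join(lines) + "\n\n")
        elif isinstance(token, ParagraphBreak):
            parts.append("\n\n")
        elif isinstance(token, PageBreak):
            parts.append("\n\n---\n\n")
    return "".join(parts).strip() + "\n"


tokens = [
    TextRun("Intro "),
    TextRun("text"),
    ParagraphBreak(),
    Image("images/image1.png", "fig"),
    Table([["A", "B"], ["1", "2"]]),
    PageBreak(),
    TextRun("End"),
]
print(render_markdown(tokens))
```

Because the renderer walks a single list, an image that sits between two paragraphs in the source stays between them in the output, which is the property the token-stream approach is meant to preserve.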

CLI Commands

Convert Command

Convert HWPX files to different formats:

airun-hwp convert <input_file> [options]

Options:
  --format {markdown,md,pdf}  Output format (default: markdown)
  --output, -o PATH           Output directory (default: ./output)
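
When the CLI is driven from a script, building the argument list in one place avoids quoting mistakes. The following is a sketch under the assumption that only the options listed above are needed; `build_convert_command` is a hypothetical helper, not something shipped with the package.

```python
import shlex

# Formats accepted by --format, as listed in the options above.
ALLOWED_FORMATS = ("markdown", "md", "pdf")


def build_convert_command(input_file, fmt="markdown", output="./output"):
    """Return the argv list for `airun-hwp convert` using the documented options."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return [
        "airun-hwp", "convert", str(input_file),
        "--format", fmt,
        "--output", str(output),
    ]


cmd = build_convert_command("my report.hwpx", fmt="pdf", output="./results")
# shlex.join produces a shell-safe string, useful for logging the command.
print(shlex.join(cmd))
```

With airun-hwp installed, the list can be passed directly to subprocess.run(cmd, check=True); passing a list rather than a string means file names containing spaces need no extra quoting.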

Process Command

Convert a document to both Markdown and PDF in one step:

airun-hwp process <input_file> [options]

Options:
  --output, -o PATH           Output directory (default: ./output)

Output Structure

When processing a document named document.hwpx, the output directory is laid out as follows:

output/
└── document/
    ├── images/
    │   ├── image1.png
    │   ├── image2.png
    │   └── ...
    ├── document.md
    └── document.pdf
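
Scripts that post-process the results often need these paths up front. The helper below is a sketch that mirrors the tree shown above; `expected_outputs` is a hypothetical name, and the layout is inferred from this section rather than taken from the library's code, so it is worth checking against real output.

```python
from pathlib import Path


def expected_outputs(input_file, output_root="./output"):
    """Map an input file to the output paths shown in the tree above."""
    stem = Path(input_file).stem
    doc_dir = Path(output_root) / stem
    return {
        "images_dir": doc_dir / "images",
        "markdown": doc_dir / f"{stem}.md",
        "pdf": doc_dir / f"{stem}.pdf",
    }


paths = expected_outputs("reports/document.hwpx")
for name, path in paths.items():
    print(f"{name}: {path}")
```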

Dependencies

pypandoc-hwpx>=0.1.0: HWPX file format support
PyYAML>=6.0: YAML configuration parsing
Pillow>=10.0.0: Image processing
weasyprint>=60.0: HTML to PDF conversion (included)
markdown>=3.5.0: Markdown processing (included)

Development

Running Tests

pytest

Code Coverage

pytest --cov=airun_hwp

Code Formatting

black airun_hwp/
ruff check airun_hwp/

Type Checking

mypy airun_hwp/

Building for Distribution

# Build source and wheel distributions
python -m build

# Validate the built distributions with twine
twine check dist/*

Publishing to PyPI

# Upload to Test PyPI
twine upload --repository testpypi dist/*

# Upload to PyPI
twine upload dist/*

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Support

📧 Email: team@hamonize.com
🐛 Issues: GitHub Issues
📖 Documentation: GitHub Wiki

Changelog

Version 0.1.0

Initial release
HWPX parsing support
Markdown conversion
PDF export functionality
CLI tool
Image extraction
Table processing

Made with ❤️ for the Hamonize project

Release history

0.3.1: Jan 6, 2026
0.3.0: Dec 22, 2025
0.2.9: Dec 22, 2025
0.2.8: Dec 22, 2025
0.2.7: Dec 22, 2025
0.2.6: Dec 20, 2025
0.2.4: Dec 20, 2025
0.2.0: Dec 20, 2025 (this version)
0.1.0: Dec 20, 2025

Download files

Source Distribution

airun_hwp-0.2.0.tar.gz (57.7 kB), uploaded Dec 20, 2025, Source

Built Distribution

airun_hwp-0.2.0-py3-none-any.whl (44.9 kB), uploaded Dec 20, 2025, Python 3

File details

Details for the file airun_hwp-0.2.0.tar.gz.

File metadata

Download URL: airun_hwp-0.2.0.tar.gz
Upload date: Dec 20, 2025
Size: 57.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for airun_hwp-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`7e0dbe1e986fd6891223dd87ee8c018ddaf5d29faac00d93f6e3160dfa687546`
MD5	`1ff22500e5157428c673da7bdea790c6`
BLAKE2b-256	`2f4295758cf77f380f2fe9b37e9fab0232a0d204d24cf598b4a2f264f815df43`
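
A downloaded file can be checked against the digests listed above with the standard library. This is a minimal sketch: `file_digests` is a hypothetical helper, and it assumes the BLAKE2b-256 value corresponds to BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=1 << 20):
    """Return the SHA256 and BLAKE2b-256 hex digests of a file.

    The file is read in chunks so large archives do not need to fit in memory.
    """
    sha256 = hashlib.sha256()
    blake2b = hashlib.blake2b(digest_size=32)  # 32 bytes = 256 bits
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
            blake2b.update(chunk)
    return {"sha256": sha256.hexdigest(), "blake2b_256": blake2b.hexdigest()}
```

For example, file_digests("airun_hwp-0.2.0.tar.gz")["sha256"] should equal the SHA256 value in the table if the download is intact.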

File details

Details for the file airun_hwp-0.2.0-py3-none-any.whl.

File metadata

Download URL: airun_hwp-0.2.0-py3-none-any.whl
Upload date: Dec 20, 2025
Size: 44.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for airun_hwp-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a36be0b7ba05021c03df7867ddc18f87bed0ee33fb6219f85cfbfb0bf677cd9d`
MD5	`1d94094be8c208ffa54d87fac7f5f0a8`
BLAKE2b-256	`9eae921561f82bc201c328688c3009d23384dec74a5b3914d8f931f0bd3b3ebe`
