AI-powered HWP/HWPX document processing library for Hamonize

Maintainers

hckim

airun-hwp

Convert HWP/HWPX documents with ease! 📄

airun-hwp converts Hancom Office HWPX files to Markdown and PDF. Legacy binary HWP files are supported for plain-text extraction only.

⚡ Quick Start

# Install
pip install airun-hwp

# Convert (creates both Markdown and PDF)
airun-hwp document.hwpx

# Convert to PDF only
airun-hwp document.hwpx --format pdf

✨ Features

HWPX Document Conversion: Convert HWPX files to Markdown and PDF
Image Extraction: Automatically extract images from documents
Table Processing: Preserve table structure during conversion
Simple CLI: Easy-to-use command-line interface
Auto-completion: Tab completion support in bash, zsh, and fish

📦 Installation

pip install airun-hwp

PDF export functionality is included by default.

Installation from Source

For the latest features before PyPI release:

git clone https://github.com/chaeya/airun-hwp.git
cd airun-hwp
pip install .

For Developers

For contributors:

git clone https://github.com/chaeya/airun-hwp.git
cd airun-hwp
pip install -e ".[dev]"

🚀 Usage

Command Line Interface

# Basic usage (creates both Markdown and PDF)
airun-hwp document.hwpx

# Convert to specific format
airun-hwp document.hwpx --format pdf
airun-hwp document.hwpx -f markdown

# Specify output directory
airun-hwp document.hwpx --format pdf --output ./results
airun-hwp document.hwpx -o ./output_folder

# Auto-detect best PDF engine (recommended)
airun-hwp document.hwpx --format pdf --pdf-engine auto

# Use LibreOffice for PDF conversion (preserves original formatting)
airun-hwp document.hwpx --format pdf --pdf-engine libreoffice

# Use WeasyPrint (fast, but limited formatting)
airun-hwp document.hwpx --format pdf --pdf-engine weasyprint

# Adjust LibreOffice conversion timeout (default: 30 seconds)
airun-hwp large_document.hwpx --format pdf --pdf-engine libreoffice --timeout 60

# Get help
airun-hwp --help
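
When the converter is driven from a script rather than typed by hand, the options above can be assembled programmatically. The following is a minimal sketch, not part of the library: `build_command` is a hypothetical helper, and it uses only the flags shown in this section (`--format`, `--output`, `--pdf-engine`, `--timeout`).

```python
import shlex
from typing import List, Optional


def build_command(
    source: str,
    fmt: Optional[str] = None,
    output: Optional[str] = None,
    pdf_engine: Optional[str] = None,
    timeout: Optional[int] = None,
) -> List[str]:
    """Assemble an airun-hwp argument list from the documented options.

    Hypothetical helper: options left as None are omitted, so the CLI's
    own defaults apply.
    """
    cmd = ["airun-hwp", source]
    if fmt is not None:
        cmd += ["--format", fmt]
    if output is not None:
        cmd += ["--output", output]
    if pdf_engine is not None:
        cmd += ["--pdf-engine", pdf_engine]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


cmd = build_command("large_document.hwpx", fmt="pdf",
                    pdf_engine="libreoffice", timeout=60)
print(shlex.join(cmd))
# To actually run the conversion: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.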

PDF Conversion Engines

airun-hwp supports three PDF conversion engine modes:

Mode	Description	Formatting Preservation	Speed	LibreOffice Required
auto (default)	Uses LibreOffice if available, otherwise WeasyPrint	Depends on the engine chosen	Depends on the engine chosen	Optional
libreoffice	Direct HWPX → PDF	Excellent (⭐⭐⭐⭐)	Moderate	✅ Yes
weasyprint	Markdown → HTML → PDF	Basic (⭐⭐)	Fast	❌ No

Auto Mode (Recommended)

✅ Automatically detects LibreOffice availability
✅ Falls back to WeasyPrint if LibreOffice not found
✅ Best of both worlds
Best for: All scenarios, lets the tool decide

WeasyPrint Mode

✅ Fast conversion
✅ No LibreOffice installation required
✅ Pure Python solution
⚠️ Limited formatting preservation
Best for: Simple documents, quick conversions, environments without LibreOffice

LibreOffice Mode

✅ Preserves original formatting
✅ Better font support
✅ Accurate page layout
✅ Superior table rendering
Requires: LibreOffice with HWP support (LibreOffice is pre-installed on many desktop Linux distributions)
Best for: Complex documents, official files, formatting-critical tasks

Note: Use --pdf-engine auto to let the tool automatically choose the best available engine.
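
The fallback behavior of auto mode can be illustrated with a short sketch. This is not the library's actual detection code, which is not shown here: `choose_pdf_engine` is a hypothetical function, and the executable names `soffice` and `libreoffice` are assumptions about how LibreOffice is typically installed.

```python
import shutil
from typing import Callable, Optional

# Executable names LibreOffice is commonly installed under (an assumption).
LIBREOFFICE_BINARIES = ("soffice", "libreoffice")


def choose_pdf_engine(
    requested: str = "auto",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return 'libreoffice' or 'weasyprint' for the requested mode.

    `which` is injectable so the PATH lookup can be replaced in tests.
    """
    if requested in ("libreoffice", "weasyprint"):
        return requested
    if requested != "auto":
        raise ValueError(f"unknown PDF engine: {requested!r}")
    # Auto mode: prefer LibreOffice when an executable is found on PATH,
    # otherwise fall back to WeasyPrint.
    if any(which(name) for name in LIBREOFFICE_BINARIES):
        return "libreoffice"
    return "weasyprint"


print(choose_pdf_engine("auto"))
```

An explicit `libreoffice` or `weasyprint` request is returned unchanged, so only auto mode depends on what is installed.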

🔧 Shell Auto-completion

The CLI supports tab completion for bash, zsh, and fish shells. This makes it easier to use the command-line interface without remembering all options.

Automatic Installation (Recommended)

Run the completion installer after installing the package:

# Install completion automatically (detects your shell)
airun-hwp completion install

# Or manually run:
python -c "from airun_hwp.cli_simple import install; install()"

The installer will:

Detect your current shell (bash or zsh)
Add completion script to your shell configuration file
Show you how to activate it

Manual Setup

Bash

Add this line to your ~/.bashrc:

eval "$(_AIRUN_HWP_COMPLETE=bash_source airun-hwp)"

Then reload your shell:

source ~/.bashrc

Zsh

Add this line to your ~/.zshrc:

eval "$(_AIRUN_HWP_COMPLETE=zsh_source airun-hwp)"

Then reload your shell:

source ~/.zshrc

Fish

Create a completion file:

mkdir -p ~/.config/fish/completions
_AIRUN_HWP_COMPLETE=fish_source airun-hwp > ~/.config/fish/completions/airun-hwp.fish

Using Completion

Once enabled, you can use tab completion:

# File completion
airun-hwp doc<TAB>
# document.hwpx  report.hwpx  ...

# Option completion
airun-hwp <TAB>
# --format  --help  --output

# Option value completion
airun-hwp --format <TAB>
# all  markdown  md  pdf

📘 Python API

from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered
from airun_hwp.reader.hwpx_to_markdown import extract_text_from_file

# Parse HWPX file (full structure preserved)
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract text
text = document.get_all_text()
print(f"Total text length: {len(text)} characters")

# Extract images
images = document.extract_images("./output/images")
print(f"Extracted {len(images)} images")

# Convert to Markdown with tables
markdown_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Save Markdown
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)

# For HWP files (plain text only)
hwp_text = extract_text_from_file("document.hwp")
print(f"HWP text (tables not preserved): {len(hwp_text)} characters")

Advanced Usage

PDF Generation with Custom Styling

import markdown
import weasyprint
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered

# Parse document
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract images
document.extract_images("./output/images")

# Get Markdown content
md_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Convert to HTML
html = markdown.markdown(md_content, extensions=['tables', 'fenced_code'])

# Add custom CSS
css = """
<style>
    body { font-family: 'Malgun Gothic', Arial, sans-serif; }
    img { max-width: 100%; height: auto; }
    table { border-collapse: collapse; width: 100%; }
    th, td { border: 1px solid #333; padding: 8px; }
</style>
"""

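# Generate PDF (base_url lets WeasyPrint resolve the relative image paths;
# write_pdf returns None when given a target path)
weasyprint.HTML(string=css + html, base_url=".").write_pdf("document.pdf")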

Document Structure

The library processes HWPX documents using a token-stream approach that preserves the original document order:

Text Runs: Consecutive text segments
Images: Embedded images with proper positioning
Tables: Structured table data
Paragraph Breaks: Logical document divisions
Page Breaks: Document pagination
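
To make the token-stream idea concrete, here is a minimal sketch of how an ordered stream of such elements could be rendered to Markdown. The class names (`TextRun`, `Image`, `Table`, `ParagraphBreak`) and the `render_markdown` function are hypothetical and do not reflect the library's internal types; page breaks are omitted for brevity, and every table is assumed to have at least a header row.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextRun:
    text: str


@dataclass
class Image:
    path: str


@dataclass
class Table:
    rows: List[List[str]]  # first row is treated as the header


@dataclass
class ParagraphBreak:
    pass


Token = Union[TextRun, Image, Table, ParagraphBreak]


def render_markdown(tokens: List[Token]) -> str:
    """Emit Markdown by walking the tokens in document order."""
    parts: List[str] = []
    for token in tokens:
        if isinstance(token, TextRun):
            parts.append(token.text)
        elif isinstance(token, Image):
            parts.append(f"![]({token.path})")
        elif isinstance(token, Table):
            header, *body = token.rows
            lines = ["| " + " | ".join(header) + " |",
                     "| " + " | ".join("---" for _ in header) + " |"]
            lines += ["| " + " | ".join(row) + " |" for row in body]
            parts.append("\n".join(lines))
        elif isinstance(token, ParagraphBreak):
            parts.append("\n\n")
    return "".join(parts)


stream = [
    TextRun("Intro"), ParagraphBreak(),
    Image("images/image1.png"), ParagraphBreak(),
    Table([["A", "B"], ["1", "2"]]),
]
print(render_markdown(stream))
```

Because the renderer only ever moves forward through one list, an image or table appears in the output exactly where it appeared in the source document.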

HWP vs HWPX: Important Differences

This library handles HWP and HWPX files differently due to their fundamental format differences:

HWPX Files (Recommended)

Format: XML-based, open standard
Structure: Preserves full document structure
Tables: ✅ Extracted with proper formatting
Images: ✅ Extracted with positioning
Layout: Maintains original document flow

HWP Files (Limited Support)

Format: Binary, proprietary format
Structure: Only plain text extraction available
Tables: ❌ Not preserved (extracted as plain text only)
Images: ❌ Cannot preserve original position/sequence
Layout: Original structure and order lost

Recommendation

For best results, use HWPX files. If you have HWP files:

Convert HWP to HWPX in Hanword (한글) before processing
Or use for plain text extraction only
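
When a folder mixes both formats, it can help to classify files up front by the level of support described above. A minimal sketch follows; `extraction_mode` is a hypothetical helper, not a library function, and it decides purely by file extension.

```python
from pathlib import Path


def extraction_mode(path: str) -> str:
    """Classify a file by the level of support described above."""
    suffix = Path(path).suffix.lower()
    if suffix == ".hwpx":
        return "structured"   # tables, images, and order preserved
    if suffix == ".hwp":
        return "plain-text"   # text only; convert to HWPX first if possible
    raise ValueError(f"not an HWP/HWPX file: {path}")


for name in ("report.hwpx", "legacy.HWP"):
    print(name, "->", extraction_mode(name))
```

Files reported as "plain-text" are the candidates for conversion to HWPX before processing.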

Output Structure

When processing a document named document.hwpx:

output/
└── document/
    ├── images/
    │   ├── image1.png
    │   ├── image2.png
    │   └── ...
    ├── document.md
    └── document.pdf
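
Scripts that post-process the results can derive these paths from the input file name. The sketch below mirrors the tree above; `expected_outputs` is a hypothetical helper, and treating `output` as the root directory is an assumption taken from the example rather than a documented default.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(source: str, output_root: str = "output") -> Dict[str, Path]:
    """Return the paths the tree above implies for a given input file."""
    stem = Path(source).stem              # "document" for "document.hwpx"
    base = Path(output_root) / stem
    return {
        "images": base / "images",
        "markdown": base / f"{stem}.md",
        "pdf": base / f"{stem}.pdf",
    }


for kind, path in expected_outputs("document.hwpx").items():
    print(kind, path)
```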

Dependencies

pypandoc-hwpx>=0.1.0: HWPX file format support
PyYAML>=6.0: YAML configuration parsing
Pillow>=10.0.0: Image processing
weasyprint>=60.0: HTML to PDF conversion (included)
markdown>=3.5.0: Markdown processing (included)
click>=8.0.0: Command-line interface with auto-completion support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

🆘 Support

📧 Email: chaeya@gmail.com (Kevin Kim)
🐛 Report bugs: GitHub Issues
📖 Documentation: GitHub Wiki

Changelog

Version 0.3.0

Simplified CLI interface: use airun-hwp document.hwpx directly
Removed subcommands (convert, process) for cleaner UX
Default behavior creates both Markdown and PDF
Added shell auto-completion support (bash, zsh, fish)
Cleaner, more user-focused README

Version 0.2.9

Simplified CLI interface: removed subcommands for direct usage
Now use airun-hwp document.hwpx instead of airun-hwp convert document.hwpx
Default behavior creates both Markdown and PDF outputs
Added --format all option (default) for creating both formats
Maintained backward compatibility with deprecated subcommands
Cleaner and more intuitive command-line experience

Version 0.2.8

Added shell auto-completion support for bash, zsh, and fish
Migrated CLI from argparse to Click for better user experience
Added automatic completion installer (airun-hwp-completion)
Enhanced CLI with tab completion for commands and options
Improved error messages with Click's formatting

Version 0.2.7

Fixed PyPI publishing workflow with Trusted Publishing
Fixed license format for Python 3.8 compatibility
Updated build configuration

Version 0.2.5

Fixed get_all_text() method to properly extract text from token stream
Improved text extraction to handle both tokens and paragraphs
Added deduplication to prevent duplicate text extraction
Updated documentation to clarify HWP vs HWPX limitations

Version 0.2.0

HWPX parsing support
Markdown conversion
PDF export functionality
CLI tool
Image extraction
Table processing

Version 0.1.0

Initial release

Made with ❤️ for the Hamonize project

Release history

0.3.1 (this version): Jan 6, 2026
0.3.0: Dec 22, 2025
0.2.9: Dec 22, 2025
0.2.8: Dec 22, 2025
0.2.7: Dec 22, 2025
0.2.6: Dec 20, 2025
0.2.4: Dec 20, 2025
0.2.0: Dec 20, 2025
0.1.0: Dec 20, 2025

Download files

Source Distribution

airun_hwp-0.3.1.tar.gz (67.2 kB)

Uploaded Jan 6, 2026 Source

Built Distribution

airun_hwp-0.3.1-py3-none-any.whl (58.0 kB)

Uploaded Jan 6, 2026 Python 3

File details

Details for the file airun_hwp-0.3.1.tar.gz.

File metadata

Download URL: airun_hwp-0.3.1.tar.gz
Upload date: Jan 6, 2026
Size: 67.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for airun_hwp-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`2b2e18b990d16227f0097c195ef231b5d2c9ac0b955ab62e3454915fc5763086`
MD5	`3ebbc3a02d956813411eb5ec1d2d7ba9`
BLAKE2b-256	`44f1c4ca6ecb3eaaa653771702aa2498dffad4055345c3191b317cda07397455`
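
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: `sha256_of` and `matches_published_hash` are hypothetical helper names, and the file is assumed to have been downloaded into the current directory.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with the value listed in the table above."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


# SHA256 value copied from the table above.
EXPECTED_SDIST_SHA256 = (
    "2b2e18b990d16227f0097c195ef231b5d2c9ac0b955ab62e3454915fc5763086"
)
# After downloading the source distribution:
# matches_published_hash("airun_hwp-0.3.1.tar.gz", EXPECTED_SDIST_SHA256)
```

pip can perform the same check itself when a requirements file pins hashes and is installed with `--require-hashes`.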

Provenance

The following attestation bundles were made for airun_hwp-0.3.1.tar.gz:

Publisher: publish-to-pypi.yml on chaeya/airun-hwp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: airun_hwp-0.3.1.tar.gz
- Subject digest: 2b2e18b990d16227f0097c195ef231b5d2c9ac0b955ab62e3454915fc5763086
- Sigstore transparency entry: 797422235
- Sigstore integration time: Jan 6, 2026
Source repository:
- Permalink: chaeya/airun-hwp@050a5ad7adaa1ee7290333574ef37f1ef0ba302a
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/chaeya
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@050a5ad7adaa1ee7290333574ef37f1ef0ba302a
- Trigger Event: push

File details

Details for the file airun_hwp-0.3.1-py3-none-any.whl.

File metadata

Download URL: airun_hwp-0.3.1-py3-none-any.whl
Upload date: Jan 6, 2026
Size: 58.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for airun_hwp-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4869dca984a005987ad3ddad42129a051ebfbcbbb8d18a2d5738255812d1cc1d`
MD5	`d74d09c62e1548530717112fcdf94a4d`
BLAKE2b-256	`fcecec6df146daf8edc3f01c893e8f2a6db943e27feab598acf3527c09c1861c`

Provenance

The following attestation bundles were made for airun_hwp-0.3.1-py3-none-any.whl:

Publisher: publish-to-pypi.yml on chaeya/airun-hwp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: airun_hwp-0.3.1-py3-none-any.whl
- Subject digest: 4869dca984a005987ad3ddad42129a051ebfbcbbb8d18a2d5738255812d1cc1d
- Sigstore transparency entry: 797422236
- Sigstore integration time: Jan 6, 2026
Source repository:
- Permalink: chaeya/airun-hwp@050a5ad7adaa1ee7290333574ef37f1ef0ba302a
- Branch / Tag: refs/tags/v0.3.1
- Owner: https://github.com/chaeya
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@050a5ad7adaa1ee7290333574ef37f1ef0ba302a
- Trigger Event: push
