AI-powered HWP/HWPX document processing library for Hamonize
airun-hwp
Convert HWP/HWPX documents with ease! 📄
airun-hwp converts Hancom Office HWPX files to Markdown and PDF, and extracts plain text from legacy HWP files.
⚡ Quick Start
# Install
pip install airun-hwp
# Convert (creates both Markdown and PDF)
airun-hwp document.hwpx
# Convert to PDF only
airun-hwp document.hwpx --format pdf
✨ Features
- HWPX Document Conversion: Convert HWPX files to Markdown and PDF
- Image Extraction: Automatically extract images from documents
- Table Processing: Preserve table structure during conversion
- Simple CLI: Easy-to-use command-line interface
- Auto-completion: Tab completion support in bash, zsh, and fish
📦 Installation
pip install airun-hwp
PDF export functionality is included by default.
Installation from Source
For the latest features before PyPI release:
git clone https://github.com/chaeya/airun-hwp.git
cd airun-hwp
pip install .
For Developers
To work on the library itself, install it in editable mode with the development extras:
git clone https://github.com/chaeya/airun-hwp.git
cd airun-hwp
pip install -e ".[dev]"
🚀 Usage
Command Line Interface
# Basic usage (creates both Markdown and PDF)
airun-hwp document.hwpx
# Convert to specific format
airun-hwp document.hwpx --format pdf
airun-hwp document.hwpx -f markdown
# Specify output directory
airun-hwp document.hwpx --format pdf --output ./results
airun-hwp document.hwpx -o ./output_folder
# Auto-detect best PDF engine (recommended)
airun-hwp document.hwpx --format pdf --pdf-engine auto
# Use LibreOffice for PDF conversion (preserves original formatting)
airun-hwp document.hwpx --format pdf --pdf-engine libreoffice
# Use WeasyPrint (fast, but limited formatting)
airun-hwp document.hwpx --format pdf --pdf-engine weasyprint
# Adjust LibreOffice conversion timeout (default: 30 seconds)
airun-hwp large_document.hwpx --format pdf --pdf-engine libreoffice --timeout 60
# Get help
airun-hwp --help
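For batch jobs it can be convenient to drive the CLI from Python. The following is a minimal sketch, not part of airun-hwp: `build_command` and `convert` are hypothetical helper names, and only the flags documented above (`--format`, `--output`, `--pdf-engine`, `--timeout`) are used.

```python
import subprocess
from typing import List, Optional


def build_command(
    source: str,
    fmt: Optional[str] = None,
    output: Optional[str] = None,
    pdf_engine: Optional[str] = None,
    timeout: Optional[int] = None,
) -> List[str]:
    """Assemble an airun-hwp command line from the documented flags."""
    cmd = ["airun-hwp", source]
    if fmt is not None:
        cmd += ["--format", fmt]
    if output is not None:
        cmd += ["--output", output]
    if pdf_engine is not None:
        cmd += ["--pdf-engine", pdf_engine]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


def convert(source: str, **options) -> int:
    """Run the CLI and return its exit code (requires airun-hwp on PATH)."""
    completed = subprocess.run(build_command(source, **options), check=False)
    return completed.returncode
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces, which are common in HWPX documents.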
PDF Conversion Engines
airun-hwp supports three PDF conversion engine modes:
| Mode | Description | Formatting Preservation | Speed | LibreOffice Required |
|---|---|---|---|---|
| auto (default) | Auto-detects best engine | Auto-selects | Auto | Auto |
| libreoffice | Direct HWPX → PDF | Excellent (⭐⭐⭐⭐) | Moderate | ✅ Yes |
| weasyprint | Markdown → HTML → PDF | Basic (⭐⭐) | Fast | ❌ No |
Auto Mode (Recommended)
- ✅ Automatically detects LibreOffice availability
- ✅ Falls back to WeasyPrint if LibreOffice not found
- ✅ Best of both worlds
- Best for: All scenarios, lets the tool decide
WeasyPrint Mode
- ✅ Fast conversion
- ✅ No external dependencies
- ✅ Pure Python solution
- ⚠️ Limited formatting preservation
- Best for: Simple documents, quick conversions, environments without LibreOffice
LibreOffice Mode
- ✅ Preserves original formatting
- ✅ Better font support
- ✅ Accurate page layout
- ✅ Superior table rendering
- Requires: LibreOffice with HWP support (usually pre-installed on Linux)
- Best for: Complex documents, official files, formatting-critical tasks
Note: Use --pdf-engine auto to let the tool automatically choose the best available engine.
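The fallback behaviour of auto mode can be pictured roughly as follows. This is an illustrative sketch only: how airun-hwp actually detects LibreOffice is not documented here, and `pick_engine` and the executable names `soffice` and `libreoffice` are assumptions.

```python
import shutil
from typing import Callable, Optional

# Executable names LibreOffice commonly installs (an assumption, not taken from airun-hwp).
LIBREOFFICE_BINARIES = ("soffice", "libreoffice")


def pick_engine(
    requested: str = "auto",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Resolve 'auto' to a concrete engine name; pass explicit choices through."""
    if requested != "auto":
        return requested
    # Prefer LibreOffice when any of its executables is on PATH.
    found = any(which(name) for name in LIBREOFFICE_BINARIES)
    return "libreoffice" if found else "weasyprint"
```

The `which` parameter exists only so the lookup can be replaced in tests. A real check would probably also need to confirm that the installed LibreOffice can open HWPX files, which this sketch does not attempt.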
🔧 Shell Auto-completion
The CLI supports tab completion for bash, zsh, and fish shells. This makes it easier to use the command-line interface without remembering all options.
Automatic Installation (Recommended)
Run the completion installer after installing the package:
# Install completion automatically (detects your shell)
airun-hwp completion install
# Or manually run:
python -c "from airun_hwp.cli_simple import install; install()"
The installer will:
- Detect your current shell (bash or zsh)
- Add completion script to your shell configuration file
- Show you how to activate it
Manual Setup
Bash
Add this line to your ~/.bashrc:
eval "$(_AIRUN_HWP_COMPLETE=bash_source airun-hwp)"
Then reload your shell:
source ~/.bashrc
Zsh
Add this line to your ~/.zshrc:
eval "$(_AIRUN_HWP_COMPLETE=zsh_source airun-hwp)"
Then reload your shell:
source ~/.zshrc
Fish
Create a completion file:
mkdir -p ~/.config/fish/completions
_AIRUN_HWP_COMPLETE=fish_source airun-hwp > ~/.config/fish/completions/airun-hwp.fish
Using Completion
Once enabled, you can use tab completion:
# File completion
airun-hwp doc<TAB>
# document.hwpx report.hwpx ...
# Option completion
airun-hwp <TAB>
# --format --help --output
# Option value completion
airun-hwp --format <TAB>
# all markdown md pdf
📘 Python API
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered
from airun_hwp.reader.hwpx_to_markdown import extract_text_from_file
# Parse HWPX file (full structure preserved)
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")
# Extract text
text = document.get_all_text()
print(f"Total text length: {len(text)} characters")
# Extract images
images = document.extract_images("./output/images")
print(f"Extracted {len(images)} images")
# Convert to Markdown with tables
markdown_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)
# Save Markdown
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
# For HWP files (plain text only)
hwp_text = extract_text_from_file("document.hwp")
print(f"HWP text (tables not preserved): {len(hwp_text)} characters")
Advanced Usage
PDF Generation with Custom Styling
import markdown
import weasyprint
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered
# Parse document
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")
# Extract images
document.extract_images("./output/images")
# Get Markdown content
md_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)
# Convert to HTML
html = markdown.markdown(md_content, extensions=['tables', 'fenced_code'])
# Add custom CSS
css = """
<style>
body { font-family: 'Malgun Gothic', Arial, sans-serif; }
img { max-width: 100%; height: auto; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #333; padding: 8px; }
</style>
"""
# Generate PDF (write_pdf returns None when given a target path, so nothing is assigned)
weasyprint.HTML(string=css + html).write_pdf("document.pdf")
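The example above hands WeasyPrint a style block followed by an HTML fragment. If a complete page with an explicit character set is preferred, the fragment can be wrapped first. The helper below is a hypothetical sketch using only the standard library; `build_html_document` is not part of airun-hwp, and the default `lang="ko"` is an assumption.

```python
def build_html_document(body_html: str, css: str, lang: str = "ko") -> str:
    """Wrap an HTML fragment and raw CSS rules in a complete HTML5 page."""
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{lang}">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<style>\n{css}\n</style>\n"
        "</head>\n"
        f"<body>\n{body_html}\n</body>\n"
        "</html>\n"
    )
```

Note that `css` here is the bare rules, without the surrounding `<style>` tags. The resulting string can be passed to `weasyprint.HTML(string=page, base_url=".")`; setting `base_url` lets relative image paths such as `./output/images/...` resolve.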
Document Structure
The library processes HWPX documents using a token-stream approach that preserves the original document order:
- Text Runs: Consecutive text segments
- Images: Embedded images with proper positioning
- Tables: Structured table data
- Paragraph Breaks: Logical document divisions
- Page Breaks: Document pagination
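To make the token-stream idea concrete, here is a small standalone model of it. The class names (`TextRun`, `Image`, `Table`, `ParagraphBreak`) and the `render` function are hypothetical and do not reflect the library's internal types; page breaks are left out for brevity, and tables are assumed to have at least a header row.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextRun:
    text: str


@dataclass
class Image:
    path: str
    alt: str = ""


@dataclass
class Table:
    rows: List[List[str]]  # first row is treated as the header


@dataclass
class ParagraphBreak:
    pass


Token = Union[TextRun, Image, Table, ParagraphBreak]


def table_to_markdown(table: Table) -> str:
    """Render a table as a Markdown pipe table."""
    header, *body = table.rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def render(tokens: List[Token]) -> str:
    """Walk the stream once, emitting Markdown in original document order."""
    blocks: List[str] = []
    current = ""  # text of the paragraph being accumulated
    for token in tokens:
        if isinstance(token, TextRun):
            current += token.text
            continue
        # Any non-text token closes the paragraph in progress.
        if current:
            blocks.append(current)
            current = ""
        if isinstance(token, Image):
            blocks.append(f"![{token.alt}]({token.path})")
        elif isinstance(token, Table):
            blocks.append(table_to_markdown(token))
    if current:
        blocks.append(current)
    return "\n\n".join(blocks)
```

Because images and tables are tokens in the same sequence as text, they come out where they appeared in the source document instead of being collected at the end.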
HWP vs HWPX: Important Differences
This library handles HWP and HWPX files differently due to their fundamental format differences:
HWPX Files (Recommended)
- Format: XML-based, open standard
- Structure: Preserves full document structure
- Tables: ✅ Extracted with proper formatting
- Images: ✅ Extracted with positioning
- Layout: Maintains original document flow
HWP Files (Limited Support)
- Format: Binary, proprietary format
- Structure: Only plain text extraction available
- Tables: ❌ Not preserved (extracted as plain text only)
- Images: ❌ Cannot preserve original position/sequence
- Layout: Original structure and order lost
Recommendation
For best results, use HWPX files. If you have HWP files:
- Convert HWP to HWPX in Hanword (한글) before processing
- Or use for plain text extraction only
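When file extensions cannot be trusted, the two formats can usually be told apart by their leading bytes: HWPX is a ZIP container, while HWP 5.x is an OLE compound file. The sketch below relies on that; `detect_format` and `choose_pipeline` are hypothetical helpers, not library functions, and the ZIP check only identifies the container (other ZIP-based formats would match too).

```python
ZIP_MAGIC = b"PK\x03\x04"  # HWPX: ZIP container holding XML parts
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # HWP 5.x: OLE compound file


def detect_format(path: str) -> str:
    """Return 'hwpx', 'hwp', or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ZIP_MAGIC):
        return "hwpx"
    if head.startswith(OLE_MAGIC):
        return "hwp"
    return "unknown"


def choose_pipeline(path: str) -> str:
    """Map a file to the processing route described in this section."""
    kind = detect_format(path)
    if kind == "hwpx":
        return "structured"  # tables, images, and order preserved
    if kind == "hwp":
        return "plain_text"  # text only; structure is lost
    raise ValueError(f"Not an HWP/HWPX file: {path}")
```

A caller could use the result to decide between `HWPXReaderOrdered` and `extract_text_from_file` from the Python API section.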
Output Structure
When processing a document named document.hwpx:
output/
└── document/
    ├── images/
    │   ├── image1.png
    │   ├── image2.png
    │   └── ...
    ├── document.md
    └── document.pdf
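Scripts that post-process the results may want to compute these paths up front. The following sketch mirrors the layout shown above; `expected_outputs` is a hypothetical helper, and the default root of `output` is an assumption taken from the tree, not a confirmed CLI default.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(source: str, output_root: str = "output") -> Dict[str, Path]:
    """Return the paths where results for `source` would be written."""
    stem = Path(source).stem  # "document" for "document.hwpx"
    base = Path(output_root) / stem
    return {
        "dir": base,
        "images": base / "images",
        "markdown": base / f"{stem}.md",
        "pdf": base / f"{stem}.pdf",
    }
```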
Dependencies
- pypandoc-hwpx>=0.1.0: HWPX file format support
- PyYAML>=6.0: YAML configuration parsing
- Pillow>=10.0.0: Image processing
- weasyprint>=60.0: HTML to PDF conversion (included)
- markdown>=3.5.0: Markdown processing (included)
- click>=8.0.0: Command-line interface with auto-completion support
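When troubleshooting an installation, it can help to see which of these packages are actually present. A minimal sketch using the standard library's `importlib.metadata` follows; `installed_versions` is a hypothetical helper, and it reports versions without comparing them against the minimums listed above.

```python
from importlib import metadata
from typing import Callable, Dict, Iterable, Optional


def installed_versions(
    packages: Iterable[str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    result: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            result[name] = get_version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result
```

For example, `installed_versions(["weasyprint", "markdown", "click"])` returns a dictionary whose `None` entries point at packages that still need to be installed.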
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🤝 Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
🆘 Support
- 📧 Email: chaeya@gmail.com (Kevin Kim)
- 🐛 Report bugs: GitHub Issues
- 📖 Documentation: GitHub Wiki
Changelog
Version 0.3.0
- Simplified CLI interface: use airun-hwp document.hwpx directly
- Removed subcommands (convert, process) for cleaner UX
- Default behavior creates both Markdown and PDF
- Added shell auto-completion support (bash, zsh, fish)
- Cleaner, more user-focused README
Version 0.2.9
- Simplified CLI interface: removed subcommands for direct usage
- Now use airun-hwp document.hwpx instead of airun-hwp convert document.hwpx
- Default behavior creates both Markdown and PDF outputs
- Added --format all option (default) for creating both formats
- Maintained backward compatibility with deprecated subcommands
- Cleaner and more intuitive command-line experience
Version 0.2.8
- Added shell auto-completion support for bash, zsh, and fish
- Migrated CLI from argparse to Click for better user experience
- Added automatic completion installer (airun-hwp-completion)
- Enhanced CLI with tab completion for commands and options
- Improved error messages with Click's formatting
Version 0.2.7
- Fixed PyPI publishing workflow with Trusted Publishing
- Fixed license format for Python 3.8 compatibility
- Updated build configuration
Version 0.2.5
- Fixed get_all_text() method to properly extract text from token stream
- Improved text extraction to handle both tokens and paragraphs
- Added deduplication to prevent duplicate text extraction
- Updated documentation to clarify HWP vs HWPX limitations
Version 0.2.0
- HWPX parsing support
- Markdown conversion
- PDF export functionality
- CLI tool
- Image extraction
- Table processing
Version 0.1.0
- Initial release
Made with ❤️ for the Hamonize project