airun-hwp
AI-powered HWP/HWPX document processing library for Hamonize
Features
- HWPX Parsing: Parse HWPX files with full document structure preservation
- HWP Text Extraction: Extract plain text from HWP files (structure not preserved)
- Ordered Content Extraction: Maintain original document flow with mixed content types (HWPX only)
- Image Extraction: Extract and save all images from documents
- Table Processing: Extract tables with proper formatting (HWPX only)
- Markdown Conversion: Convert documents to well-structured Markdown
- PDF Export: Generate PDF files with embedded images (included by default)
- CLI Tool: Easy-to-use command-line interface
Installation
pip install airun-hwp
Note: PDF export functionality is included by default.
Development Installation
git clone https://github.com/chaeya/airun-hwp.git
cd airun-hwp
pip install -e ".[dev]"
Quick Start
Command Line Interface
# Convert to Markdown
airun-hwp convert document.hwpx --format markdown
# Convert to PDF
airun-hwp convert document.hwpx --format pdf --output ./results
# Process to both formats
airun-hwp process document.hwpx
Python API
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered
from airun_hwp.reader.hwpx_to_markdown import extract_text_from_file
# Parse HWPX file (full structure preserved)
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")
# Extract text
text = document.get_all_text()
print(f"Total text length: {len(text)} characters")
# Extract images
images = document.extract_images("./output/images")
print(f"Extracted {len(images)} images")
# Convert to Markdown with tables
markdown_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)
# Save Markdown
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
# For HWP files (plain text only)
hwp_text = extract_text_from_file("document.hwp")
print(f"HWP text (tables not preserved): {len(hwp_text)} characters")
Advanced Usage
PDF Generation with Custom Styling
import markdown
import weasyprint
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered
# Parse document
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")
# Extract images
document.extract_images("./output/images")
# Get Markdown content
md_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)
# Convert to HTML
html = markdown.markdown(md_content, extensions=['tables', 'fenced_code'])
# Add custom CSS
css = """
<style>
body { font-family: 'Malgun Gothic', Arial, sans-serif; }
img { max-width: 100%; height: auto; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #333; padding: 8px; }
</style>
"""
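# Generate PDF (write_pdf returns None when given a target path;
# base_url lets WeasyPrint resolve relative image paths)
weasyprint.HTML(string=css + html, base_url=".").write_pdf("document.pdf")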
Document Structure
The library processes HWPX documents using a token-stream approach that preserves the original document order:
- Text Runs: Consecutive text segments
- Images: Embedded images with proper positioning
- Tables: Structured table data
- Paragraph Breaks: Logical document divisions
- Page Breaks: Document pagination
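The token types listed above can be pictured with a small sketch. The class names and the `render_markdown` function below are hypothetical and exist only for illustration; they are not the library's internal types. The point is that a single ordered list of tokens lets text, tables, and images come out in their original sequence.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical token types, named after the list above.
@dataclass
class TextRun:
    text: str


@dataclass
class Image:
    path: str


@dataclass
class Table:
    rows: List[List[str]]  # first row is treated as the header


@dataclass
class ParagraphBreak:
    pass


@dataclass
class PageBreak:
    pass


Token = Union[TextRun, Image, Table, ParagraphBreak, PageBreak]


def render_markdown(tokens: List[Token]) -> str:
    """Render tokens to Markdown in the order they appear."""
    parts: List[str] = []
    for tok in tokens:
        if isinstance(tok, TextRun):
            parts.append(tok.text)
        elif isinstance(tok, Image):
            parts.append(f"\n\n![]({tok.path})\n\n")
        elif isinstance(tok, Table):
            if not tok.rows:
                continue  # nothing to render for an empty table
            header, *body = tok.rows
            lines = ["| " + " | ".join(header) + " |",
                     "|" + "---|" * len(header)]
            lines += ["| " + " | ".join(row) + " |" for row in body]
            parts.append("\n\n" + "\n".join(lines) + "\n\n")
        elif isinstance(tok, ParagraphBreak):
            parts.append("\n\n")
        elif isinstance(tok, PageBreak):
            parts.append("\n\n---\n\n")
    return "".join(parts).strip()


tokens = [
    TextRun("Quarterly report"),
    ParagraphBreak(),
    Table([["Item", "Count"], ["Pages", "12"]]),
    Image("images/image1.png"),
    TextRun("Closing remarks"),
]
print(render_markdown(tokens))
```

Because the renderer walks one list from start to finish, the table stays between the heading text and the image, which is the ordering guarantee described above.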
CLI Commands
Convert Command
Convert HWPX files to different formats:
airun-hwp convert <input_file> [options]
Options:
--format {markdown,md,pdf} Output format (default: markdown)
--output, -o PATH Output directory (default: ./output)
Process Command
Process document to multiple formats:
airun-hwp process <input_file> [options]
Options:
--output, -o PATH Output directory (default: ./output)
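For readers who want to wrap or script these commands, the documented interface can be mirrored with `argparse`. This is a sketch of the options as listed above, not the library's actual CLI source; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented convert/process options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="airun-hwp")
    sub = parser.add_subparsers(dest="command", required=True)

    convert = sub.add_parser("convert", help="Convert a file to one format")
    convert.add_argument("input_file")
    convert.add_argument("--format", choices=["markdown", "md", "pdf"],
                         default="markdown")
    convert.add_argument("--output", "-o", default="./output")

    process = sub.add_parser("process", help="Produce all output formats")
    process.add_argument("input_file")
    process.add_argument("--output", "-o", default="./output")
    return parser


args = build_parser().parse_args(
    ["convert", "document.hwpx", "--format", "pdf", "-o", "./results"]
)
print(args.command, args.input_file, args.format, args.output)
```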
HWP vs HWPX: Important Differences
This library handles HWP and HWPX files differently due to their fundamental format differences:
HWPX Files (Recommended)
- Format: XML-based, open standard
- Structure: Preserves full document structure
- Tables: ✅ Extracted with proper formatting
- Images: ✅ Extracted with positioning
- Layout: Maintains original document flow
HWP Files (Limited Support)
- Format: Binary, proprietary format
- Structure: Only plain text extraction available
- Tables: ❌ Not preserved (extracted as plain text only)
- Images: ❌ Cannot preserve original position/sequence
- Layout: Original structure and order lost
Recommendation
For best results, use HWPX files. If you have HWP files:
- Convert HWP to HWPX in Hanword (한글) before processing
- Or use for plain text extraction only
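Since the two formats are handled so differently, it can help to check which one a file really is before choosing a code path. The sketch below inspects the leading bytes: HWPX is a ZIP container and HWP 5.x is an OLE compound file. `sniff_hancom_format` is a hypothetical helper, not part of airun-hwp, and it only identifies the container type, so any other ZIP or OLE file would match as well.

```python
import tempfile
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header (HWPX container)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE compound file (HWP 5.x)


def sniff_hancom_format(path) -> str:
    """Return 'hwpx', 'hwp', or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ZIP_MAGIC):
        return "hwpx"
    if head.startswith(OLE_MAGIC):
        return "hwp"
    return "unknown"


# Demonstration with a synthetic file; a real check would pass a document path.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(ZIP_MAGIC + b"rest of archive")
    print(sniff_hancom_format(sample))
```

A caller could route "hwpx" results to the structured reader and "hwp" results to plain text extraction, regardless of the file extension.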
Output Structure
When processing a document named document.hwpx:
output/
└── document/
    ├── images/
    │   ├── image1.png
    │   ├── image2.png
    │   └── ...
    ├── document.md
    └── document.pdf
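Scripts that post-process the results may want to compute these locations up front. The helper below is a hypothetical sketch that assumes the layout shown above (a folder named after the input file's stem, under the output directory); it is not a function provided by the library.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(input_file: str, output_dir: str = "./output") -> Dict[str, Path]:
    """Build the paths the documented layout implies for one input file."""
    stem = Path(input_file).stem  # "document.hwpx" -> "document"
    doc_dir = Path(output_dir) / stem
    return {
        "images": doc_dir / "images",
        "markdown": doc_dir / f"{stem}.md",
        "pdf": doc_dir / f"{stem}.pdf",
    }


for name, path in expected_outputs("document.hwpx").items():
    print(f"{name}: {path.as_posix()}")
```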
Dependencies
- pypandoc-hwpx>=0.1.0: HWPX file format support
- PyYAML>=6.0: YAML configuration parsing
- Pillow>=10.0.0: Image processing
- weasyprint>=60.0: HTML to PDF conversion (included)
- markdown>=3.5.0: Markdown processing (included)
Development
Running Tests
pytest
Code Coverage
pytest --cov=airun_hwp
Code Formatting
black airun_hwp/
ruff check airun_hwp/
Type Checking
mypy airun_hwp/
Building for Distribution
# Build source and wheel distributions
python -m build
# Check the built distributions with twine
twine check dist/*
Publishing to PyPI
# Upload to Test PyPI
twine upload --repository testpypi dist/*
# Upload to PyPI
twine upload dist/*
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Support
- 📧 Email: chaeya@gmail.com (Kevin Kim)
- 🐛 Issues: GitHub Issues
- 📖 Documentation: GitHub Wiki
Changelog
Version 0.2.5
- Fixed get_all_text() method to properly extract text from the token stream
- Improved text extraction to handle both tokens and paragraphs
- Added deduplication to prevent duplicate text extraction
- Updated documentation to clarify HWP vs HWPX limitations
Version 0.2.0
- HWPX parsing support
- Markdown conversion
- PDF export functionality
- CLI tool
- Image extraction
- Table processing
Version 0.1.0
- Initial release
Made with ❤️ for the Hamonize project