High-performance HWP/HWPX document extraction library
unhwp
High-performance Python library for converting HWP/HWPX (Korean word processor) documents to Markdown and plain text.
Installation
pip install unhwp
Quick Start
import unhwp
# Simple conversion
markdown = unhwp.to_markdown("document.hwp")
print(markdown)
# Extract plain text
text = unhwp.extract_text("document.hwp")
# Full parsing with images
with unhwp.parse("document.hwp") as result:
    print(result.markdown)
    print(f"Sections: {result.section_count}")
    print(f"Paragraphs: {result.paragraph_count}")
    # Save images
    for img in result.images:
        img.save(f"output/{img.name}")
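For a folder of documents, the single-file conversion shown above can be wrapped in a small batch helper. The following is a minimal sketch, not part of unhwp: `convert_tree` and its `convert` parameter are hypothetical names introduced here, and the only library call it relies on is `unhwp.to_markdown(path)`.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_tree(
    src_dir: str,
    out_dir: str,
    convert: Optional[Callable[[str], str]] = None,
) -> list[Path]:
    """Convert every .hwp/.hwpx file under src_dir to a .md file in out_dir.

    Hypothetical helper, not part of unhwp.
    """
    if convert is None:
        # Imported lazily so the helper can be tried with a stand-in converter.
        import unhwp
        convert = unhwp.to_markdown

    src, out = Path(src_dir), Path(out_dir)
    written = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".hwp", ".hwpx"}:
            continue
        # Mirror the source layout so same-named files in different folders do not collide.
        target = out / path.relative_to(src).with_suffix(".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(str(path)), encoding="utf-8")
        written.append(target)
    return written
```

Passing the converter as a parameter keeps the directory walk independent of the library, so `unhwp.to_markdown_with_cleanup` or a test stub can be swapped in without changing the helper.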
Features
- Fast: Native Rust library with zero-copy parsing
- Complete: Extracts text, tables, images, and document structure
- Clean Output: Optional cleanup pipeline for polished Markdown
- Format Support: HWP 5.0, HWPX, and HWP 3.x (legacy)
API Reference
Functions
to_markdown(path) -> str
Convert an HWP/HWPX document to Markdown.
markdown = unhwp.to_markdown("document.hwp")
to_markdown_with_cleanup(path, cleanup_options=None) -> str
Convert with optional cleanup.
markdown = unhwp.to_markdown_with_cleanup(
    "document.hwp",
    cleanup_options=unhwp.CleanupOptions.aggressive()
)
extract_text(path) -> str
Extract plain text content.
text = unhwp.extract_text("document.hwp")
parse(path, render_options=None) -> ParseResult
Parse a document with full access to content and images.
with unhwp.parse("document.hwp") as result:
    print(result.markdown)
    print(result.text)
    for img in result.images:
        print(img.name, len(img.data))
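Because each image exposes `name` and `data`, the bytes can also be written out manually, for example to control the output directory. This is a rough sketch under two assumptions: that `img.data` is a bytes-like object (the example above only shows that it has a length), and that `img.name` may contain path separators. `save_images` is a hypothetical helper, not part of unhwp.

```python
from pathlib import Path


def save_images(images, out_dir: str) -> list[Path]:
    """Write each image's bytes into out_dir and return the paths written.

    Hypothetical helper; expects objects with .name (str) and .data (bytes-like).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for img in images:
        # Keep only the final path component so a name cannot escape out_dir.
        target = out / Path(img.name).name
        target.write_bytes(img.data)
        saved.append(target)
    return saved
```

Inside a `with unhwp.parse(...) as result:` block this would be called as `save_images(result.images, "output")`. Creating the directory up front avoids a failure when `output/` does not exist yet.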
detect_format(path) -> int
Detect the document format.
fmt = unhwp.detect_format("document.hwp")
if fmt == unhwp.FORMAT_HWP5:
    print("HWP 5.0 format")
elif fmt == unhwp.FORMAT_HWPX:
    print("HWPX format")
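To give a feel for what format detection involves, the sketch below inspects the leading bytes of a file. It is an illustration only, not how `unhwp.detect_format` is implemented, and it returns plain strings rather than the `FORMAT_*` constants. It assumes that HWP 5.0 files are OLE2 compound files, that HWPX files are ZIP archives, and that HWP 3.x files begin with the text `HWP Document File`; the function names are hypothetical.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 compound file signature
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header
HWP3_MAGIC = b"HWP Document File"                # assumed HWP 3.x text signature


def sniff_container(header: bytes) -> str:
    """Classify a file by its first bytes (coarse container check only)."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"      # candidate for HWP 5.0
    if header.startswith(ZIP_MAGIC):
        return "zip"       # candidate for HWPX
    if header.startswith(HWP3_MAGIC):
        return "hwp3"      # candidate for HWP 3.x
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_container(f.read(32))
```

The OLE2 and ZIP signatures are shared with many other formats (legacy .doc, .docx, and so on), so a check like this is only a cheap pre-filter; `unhwp.detect_format` remains the authoritative answer.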
Classes
RenderOptions
Options for Markdown rendering.
opts = unhwp.RenderOptions(
    include_frontmatter=True,
    image_path_prefix="images/",
    preserve_line_breaks=False,
)
CleanupOptions
Options for output cleanup.
# Presets
opts = unhwp.CleanupOptions.minimal()
opts = unhwp.CleanupOptions.default()
opts = unhwp.CleanupOptions.aggressive()
opts = unhwp.CleanupOptions.disabled()
# Custom
opts = unhwp.CleanupOptions(
    enabled=True,
    preset=1,
    detect_mojibake=True,
)
Constants
- FORMAT_UNKNOWN - Unknown format
- FORMAT_HWP5 - HWP 5.0 binary format
- FORMAT_HWPX - HWPX XML format
- FORMAT_HWP3 - HWP 3.x legacy format
Platform Support
- Windows (x64)
- Linux (x64)
- macOS (x64, ARM64)
License
MIT License - see LICENSE for details.