hwplib-py
A pure Python parser for the HWP (Hangul Word Processor) 5.0 binary file format.
This library allows you to read and extract text, metadata, and control objects (Tables, Pictures, Shapes, Equations) from HWP files without needing the official software. It is built from scratch based on the official file format specifications.
Background
The HWP format is the standard word processing format in South Korea. While there are existing tools, many rely on the official OLE Automation interface or are incomplete. hwplib-py aims to provide a robust, cross-platform, Pythonic interface for digging into HWP binaries (an OLE2 container holding zlib-compressed streams).
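To make the "OLE2 + Zlib" layering concrete, here is a minimal sketch of the decompression step only, using the standard library. It assumes that compressed HWP 5.0 streams are raw DEFLATE data without a zlib header, which is one reading of the public format documentation (hence the negative `wbits` value). The names `inflate_stream` and `deflate_stream` are made up for illustration and are not part of hwplib-py. Reading the streams out of the OLE2 container (for example with `olefile`) is left out.

```python
import zlib


def inflate_stream(data: bytes) -> bytes:
    """Decompress a raw DEFLATE stream (no zlib header or checksum)."""
    # wbits=-15 tells zlib to expect raw DEFLATE with a 32 KiB window.
    return zlib.decompress(data, -15)


def deflate_stream(data: bytes) -> bytes:
    """Produce raw DEFLATE data; used here only to build a test input."""
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(data) + compressor.flush()


# Round-trip a small UTF-16LE payload to show the two functions are inverses.
payload = "한글 문서".encode("utf-16-le")
packed = deflate_stream(payload)
restored = inflate_stream(packed)
print(restored.decode("utf-16-le"))
```

If the assumption about raw DEFLATE does not hold for a given stream, `zlib.decompress` raises `zlib.error`, which is a reasonable signal to fall back to treating the stream as uncompressed.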
Install
pip install hwplib-py
To install from source instead, clone the repository and install it locally:
git clone https://github.com/minseo0388/hwplib-py.git
cd hwplib-py
pip install .
Usage
from hwplib.hwp5.api import load
# Load an HWP file
doc = load("document.hwp")
# Print document metadata
print(f"Version: {doc.header.version_str}")
print(f"Compressed: {doc.header.is_compressed}")
# Extract plain text (including tables and hidden controls)
print(doc.get_text())
# Access sections and paragraphs directly
for section in doc.sections:
    for paragraph in section.paragraphs:
        print(paragraph.text)
        # Access embedded controls
        for ctrl in paragraph.controls:
            if ctrl.ctrl_id == 'tbl':
                print(f"Found Table with {len(ctrl.rows)} rows")
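The traversal above can be wrapped in a small helper. The sketch below tallies controls by `ctrl_id`. It relies only on the attribute names shown in the usage example (`sections`, `paragraphs`, `controls`, `ctrl_id`) and assumes they behave as plain iterables and strings. `count_controls` is a hypothetical helper, not part of hwplib-py, and the `SimpleNamespace` stand-in document exists only so the snippet runs without an HWP file.

```python
from collections import Counter
from types import SimpleNamespace


def count_controls(doc) -> Counter:
    """Count controls per ctrl_id across all sections and paragraphs."""
    counts = Counter()
    for section in doc.sections:
        for paragraph in section.paragraphs:
            # getattr guards against paragraphs that expose no controls attribute.
            for ctrl in getattr(paragraph, "controls", []):
                counts[ctrl.ctrl_id] += 1
    return counts


# Stand-in document mimicking the attribute layout used in the usage example.
fake_doc = SimpleNamespace(sections=[
    SimpleNamespace(paragraphs=[
        SimpleNamespace(text="intro", controls=[SimpleNamespace(ctrl_id="tbl")]),
        SimpleNamespace(text="body", controls=[]),
    ])
])

print(count_controls(fake_doc))
```

With a real file, the same function would be called as `count_controls(load("document.hwp"))`, assuming the loaded document exposes the attributes shown above.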
Features
- Pure Python: Zero dependency on Windows libraries. Uses `olefile` for OLE2 storage parsing and standard `zlib` for decompression. Cross-platform compatible.
- Core Engine (HWP 5.0):
  - Header: Version validation, encryption flags checking.
  - DocInfo: Complete parsing of document metadata including:
    - FaceNames (Font information)
    - Border/Fill
    - CharShapes & ParaShapes
    - Styles
  - BodyText: Section-based parsing of Paragraphs with support for high-throughput text extraction.
- Control Objects (The "Organs"):
  - Tables: Full object model with `Row` and `Cell` structures. Supports recursive text extraction from cells (converting table layout to tab-separated text).
  - Equations: Parses equation controls and extracts the raw LaTeX-like script (e.g., `y = ax + b`).
  - Shapes (GSO):
    - `ControlLine`: Start/End coordinates.
    - `ControlRect` & `ControlEllipse`: Width, Height, Attributes.
    - `ControlPolygon`: List of vertices.
  - Pictures: Meta-information parsing (Width, Height, BinData reference).
- Specialized Modules:
  - Chart: Binary parser for HWP chart objects.
  - Distribution Document: Detection logic for DistributeDoc (protected) files and crypto skeletons (AES-128).
  - Legacy: Partial support for HWP 3.0 records.
- Export & Integration:
  - JSON Export: Convert the entire `HwpDocument` object graph to JSON for easy integration with web services or NoSQL databases.
  - API: Simple `load()` and `doc.get_text()` interface for immediate productivity.
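As an illustration of the Header item in the list above (version validation and flag checking), here is a rough sketch of how a `FileHeader` stream could be decoded. The layout is an assumption taken from one reading of the public HWP 5.0 documentation: a 32-byte NUL-padded signature, a little-endian 32-bit version, then a 32-bit property field whose lowest bits mark compression, password encryption, and distribution documents, all inside a 256-byte stream. The class and function names are hypothetical and only mirror the attribute names from the usage example. This is not hwplib-py's actual implementation.

```python
import struct
from dataclasses import dataclass

SIGNATURE = b"HWP Document File"
HEADER_SIZE = 256  # assumed fixed size of the FileHeader stream


@dataclass
class FileHeader:
    version: tuple  # (major, minor, build, revision)
    is_compressed: bool
    is_encrypted: bool
    is_distribution: bool

    @property
    def version_str(self) -> str:
        return ".".join(str(part) for part in self.version)


def parse_file_header(data: bytes) -> FileHeader:
    if len(data) < HEADER_SIZE:
        raise ValueError("FileHeader stream is too short")
    # The signature field is assumed to be 32 bytes, NUL-padded.
    if data[:32].rstrip(b"\x00") != SIGNATURE:
        raise ValueError("not an HWP 5.0 FileHeader")
    # Two little-endian 32-bit integers follow: version and property flags.
    version_raw, flags = struct.unpack_from("<II", data, 32)
    version = (
        (version_raw >> 24) & 0xFF,
        (version_raw >> 16) & 0xFF,
        (version_raw >> 8) & 0xFF,
        version_raw & 0xFF,
    )
    if version[0] != 5:
        raise ValueError(f"unsupported major version: {version[0]}")
    return FileHeader(
        version=version,
        is_compressed=bool(flags & 0x01),
        is_encrypted=bool(flags & 0x02),
        is_distribution=bool(flags & 0x04),
    )


# Synthetic header for demonstration: version 5.0.3.0, compressed, not encrypted.
sample = SIGNATURE.ljust(32, b"\x00") + struct.pack("<II", 0x05000300, 0x01)
sample = sample.ljust(HEADER_SIZE, b"\x00")
header = parse_file_header(sample)
print(header.version_str, header.is_compressed)
```

Rejecting unknown signatures and major versions up front keeps later parsing stages from misreading unrelated OLE2 files as HWP documents.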
Maintainers
Contributing
PRs accepted.
Small note: If editing the Readme, please conform to the standard-readme specification.
License
Apache-2.0 © 2026 Choi Minseo
Legal Notice
This product was developed with reference to the public documentation of the Hangul document file (.hwp) format published by Hancom Inc.