hwplib-py

A pure Python parser for the HWP (Hangul Word Processor) 5.0 binary file format.

This library allows you to read and extract text, metadata, and control objects (Tables, Pictures, Shapes, Equations) from HWP files without needing the official software. It is built from scratch based on the official file format specifications.

Background

The HWP format is the standard word processing format in South Korea. Existing tools often rely on the official OLE Automation interface or are incomplete. hwplib-py aims to provide a robust, cross-platform, Pythonic interface for inspecting HWP binaries (an OLE2 container with zlib-compressed streams).
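To make the "OLE2 + Zlib" layering concrete, here is a minimal sketch of the decompression step only. It assumes that compressed HWP 5.0 streams are stored as raw deflate data (no zlib header or checksum), which is how such streams are commonly handled; it does not describe hwplib-py's internals. The helper names `inflate_raw` and `deflate_raw` are hypothetical, and reading the stream out of the OLE2 container (for example with olefile) is left out.

```python
import zlib


def inflate_raw(data: bytes) -> bytes:
    """Decompress a raw deflate stream (assumed layout of a compressed HWP stream)."""
    # A negative wbits value tells zlib to expect raw deflate data
    # without the usual zlib header and trailing checksum.
    return zlib.decompress(data, -15)


def deflate_raw(data: bytes) -> bytes:
    """Build raw deflate data; used here only to create a sample payload."""
    compressor = zlib.compressobj(wbits=-15)
    return compressor.compress(data) + compressor.flush()


# Round-trip a small UTF-16LE payload (HWP text is stored as UTF-16LE).
payload = "안녕하세요".encode("utf-16-le")
restored = inflate_raw(deflate_raw(payload))
print(restored.decode("utf-16-le"))
```

In practice the input bytes would come from a stream inside the OLE2 container rather than from `deflate_raw`.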

Install

pip install hwplib-py

(Note: If the package is not available from PyPI in your environment, clone the repository and install it locally.)

git clone https://github.com/minseo0388/hwplib-py.git
cd hwplib-py
pip install .

Usage

from hwplib.hwp5.api import load

# Load an HWP file
doc = load("document.hwp")

# Print document metadata
print(f"Version: {doc.header.version_str}")
print(f"Compressed: {doc.header.is_compressed}")

# Extract plain text (including tables and hidden controls)
print(doc.get_text())

# Access sections and paragraphs directly
for section in doc.sections:
    for paragraph in section.paragraphs:
        print(paragraph.text)
        
        # Access embedded controls
        for ctrl in paragraph.controls:
            if ctrl.ctrl_id == 'tbl':
                print(f"Found Table with {len(ctrl.rows)} rows")
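The nested loop above can be wrapped in a small generator when only the tables are of interest. This is a sketch, not part of the library: `iter_tables` is a hypothetical helper, and it relies only on the attribute names shown in the usage example (`sections`, `paragraphs`, `controls`, `ctrl_id`, `rows`). The stand-in objects below exist so the snippet runs without an HWP file; the `"other"` control id is made up for illustration.

```python
from types import SimpleNamespace


def iter_tables(doc):
    """Yield every control whose ctrl_id marks it as a table."""
    for section in doc.sections:
        for paragraph in section.paragraphs:
            for ctrl in paragraph.controls:
                if ctrl.ctrl_id == "tbl":
                    yield ctrl


# Stand-in objects that mimic the attributes used in the usage example.
table = SimpleNamespace(ctrl_id="tbl", rows=[["a", "b"], ["c", "d"]])
other = SimpleNamespace(ctrl_id="other", rows=[])  # hypothetical non-table control
paragraph = SimpleNamespace(text="sample", controls=[table, other])
fake_doc = SimpleNamespace(sections=[SimpleNamespace(paragraphs=[paragraph])])

row_counts = [len(t.rows) for t in iter_tables(fake_doc)]
print(row_counts)
```

With a real file, `fake_doc` would be replaced by the object returned from `load("document.hwp")`.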

Features

  • Pure Python: No dependency on Windows libraries. Uses olefile for OLE2 storage parsing and the standard zlib module for decompression. Cross-platform compatible.

  • Core Engine (HWP 5.0):

    • Header: Version validation, encryption flags checking.
    • DocInfo: Complete parsing of document metadata including:
      • FaceNames (Font information)
      • Border/Fill
      • CharShapes & ParaShapes
      • Styles
    • BodyText: Section-based parsing of Paragraphs with support for high-throughput text extraction.
  • Control Objects (The "Organs"):

    • Tables: Full object model with Row and Cell structures. Supports recursive text extraction from cells (converting table layout to tab-separated text).
    • Equations: Parses equation controls and extracts the raw LaTeX-like script (e.g., y = ax + b).
    • Shapes (GSO):
      • ControlLine: Start/End coordinates.
      • ControlRect & ControlEllipse: Width, Height, Attributes.
      • ControlPolygon: List of vertices.
    • Pictures: Meta-information parsing (Width, Height, BinData reference).
  • Specialized Modules:

    • Chart: Binary parser for HWP chart objects.
    • Distribution Document: Detection logic for DistributeDoc (protected) files and crypto skeletons (AES-128).
    • Legacy: Partial support for HWP 3.0 records.
  • Export & Integration:

    • JSON Export: Convert the entire HwpDocument object graph to JSON for easy integration with web services or NoSQL databases.
    • API: Simple load() and doc.get_text() interface for immediate productivity.
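As an illustration of the Header item above (version validation and flag checking), the following sketch decodes the first bytes of a FileHeader stream. The layout is an assumption taken from the public HWP 5.0 specification: a 32-byte signature, a 32-bit little-endian version, and a 32-bit property field whose lowest bits mark compression, password encryption, and distribution documents. `parse_file_header` and its dictionary keys are hypothetical and are not the library's actual API.

```python
import struct

SIGNATURE = b"HWP Document File"


def parse_file_header(data: bytes) -> dict:
    """Decode signature, version, and basic flags (assumed FileHeader layout)."""
    if len(data) < 40:
        raise ValueError("FileHeader is too short")
    # The signature occupies 32 bytes and is padded with NUL bytes.
    if data[:32].rstrip(b"\x00") != SIGNATURE:
        raise ValueError("not an HWP 5.0 FileHeader")
    # Two little-endian 32-bit values follow: version and property flags.
    version, flags = struct.unpack_from("<II", data, 32)
    parts = [(version >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return {
        "version_str": ".".join(str(p) for p in parts),
        "is_compressed": bool(flags & 0x01),
        "is_encrypted": bool(flags & 0x02),
        "is_distribution": bool(flags & 0x04),
    }


# Build a synthetic 256-byte header: version 5.0.3.0, compressed flag set.
sample = SIGNATURE.ljust(32, b"\x00") + struct.pack("<II", 0x05000300, 0x01)
sample = sample.ljust(256, b"\x00")
header = parse_file_header(sample)
print(header)
```

Checking the flags before touching BodyText lets a reader decide early whether streams need decompression or whether the file is protected and cannot be read directly.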

Maintainers

@ Choi Minseo

Contributing

PRs accepted.

Small note: If editing the README, please conform to the standard-readme specification.

License

Apache-2.0 © 2026 Choi Minseo

Legal Notice

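This product was developed with reference to the publicly released documentation of the Hangul document file (.hwp) format from Hancom Inc.

Original notice: 본 제품은 (주)한글과컴퓨터의 한글 문서 파일(.hwp) 공개 문서를 참고하여 개발하였습니다.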
