Edit HWP 5.0 (Hancom Office) files: inject, swap, and replace paragraph text without corrupting the document.

Project description

hwpkit

Read, fill, and edit Korean HWP (Hancom Office) documents in Python. Extract text for LLM / RAG pipelines, fill government & university forms programmatically, and rewrite the binary without corrupting it.

Korean government, universities, and most Korean enterprises run on .hwp — the binary format Hancom Office uses. If you need to ingest Korean enterprise documents into an LLM, automate form filling at scale, or just edit an HWP file without manually clicking through Hancom, hwpkit is the missing piece.

Under the hood, HWP 5.0 is a Microsoft Compound File Binary (MS-CFB) container holding a DocInfo stream and one or more Section streams (raw deflate). The standard olefile library can only rewrite a stream if it stays the same byte length, which is rarely true when you're inserting Korean text. hwpkit rewrites the whole CFB container while preserving the directory tree topology Hancom validates on open.
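To make that layout concrete, here is a minimal stdlib-only sketch of what a Section stream looks like once its bytes are out of the CFB container: raw deflate that inflates to a sequence of tagged records. The record header layout (10-bit tag, 10-bit level, 12-bit size, with 0xFFF signalling a 4-byte extended size) and the tag number 67 for PARA_TEXT are assumptions taken from Hancom's public HWP 5.0 format document, not from hwpkit. The names `inflate_section` and `iter_records` are hypothetical and are not part of hwpkit's API.

```python
import struct
import zlib

HWPTAG_PARA_TEXT = 67  # assumed: HWPTAG_BEGIN (16) + 51 in the public HWP 5.0 spec


def inflate_section(raw: bytes) -> bytes:
    """Inflate a Section stream; wbits=-15 selects raw deflate (no zlib header)."""
    return zlib.decompress(raw, -15)


def iter_records(data: bytes):
    """Yield (tag, level, payload) for each record in an inflated stream."""
    pos = 0
    while pos + 4 <= len(data):
        (header,) = struct.unpack_from("<I", data, pos)
        pos += 4
        tag = header & 0x3FF             # bits 0-9
        level = (header >> 10) & 0x3FF   # bits 10-19
        size = (header >> 20) & 0xFFF    # bits 20-31
        if size == 0xFFF:                # extended size stored in the next 4 bytes
            (size,) = struct.unpack_from("<I", data, pos)
            pos += 4
        yield tag, level, data[pos:pos + size]
        pos += size
```

With olefile installed, the raw bytes would typically come from `olefile.OleFileIO(path).openstream("BodyText/Section0").read()`, assuming the document is stored compressed. Reading this way is safe; it is writing a stream of a different length that needs the full container rewrite described above.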

Scope: targets HWP 5.0 (the binary .hwp format Hancom Office has shipped since 2010). The newer XML-based .hwpx format is not covered; for .hwpx you can edit the inner OWPML XML directly with any zip and XML library.

Install

Python 3.9 or newer. Release 0.1.0 is published on PyPI (see Download files below); you can also install from source:

pip install hwpkit
pip install git+https://github.com/psychofict/hwpkit

Quickstart

from hwpkit import fill_hwp, inject_text, swap_in_para_text, replace_text

def edit(records):
    inject_text(records, 24, "홍길동")                      # fill empty cell
    swap_in_para_text(records, 40, "□ 석사", "☑ 석사")      # tick checkbox
    replace_text(records, 75, "2026. 05. 19.")            # rewrite a cell

fill_hwp("template.hwp", "out.hwp", edit)

Finding paragraph indices

hwpkit-inspect template.hwp

Prints one line per record with a text preview, so you can identify which paragraph index is which form cell.

Extracting plain text

hwpkit-text file.hwp

Walks every section, strips inline controls (tables, images, footnote refs, etc.) and prints just the literal character content. From Python:

from hwpkit import extract_text_from_hwp
print(extract_text_from_hwp("file.hwp"))

For semantic HWP → XML (OWPML) conversion, use pyhwp — that's a much bigger job.

For LLM / RAG pipelines

Korean enterprises ship contracts, policies, regulations, government notices, internal memos, and academic papers as .hwp. If your retrieval / RAG pipeline can't read HWP, it can't index Korean enterprise data. The standard text-extraction stack (pdfplumber, python-docx, unstructured) doesn't cover HWP — they all need a preprocessing step.

hwpkit is that step. The library has no LLM dependencies; it's just a clean Korean-text source you can plug into anything:

# Index every HWP in a directory tree as documents for a vector DB
import glob
from hwpkit import extract_text_from_hwp

# vector_db is a placeholder for your own vector-store client
for path in glob.glob("corpus/**/*.hwp", recursive=True):
    text = extract_text_from_hwp(path)
    vector_db.add(doc_id=path, content=text)

# One-shot: pipe an HWP into any LLM CLI
hwpkit-text contract.hwp | llm "Summarize the key obligations in Korean"

# Bulk: convert a folder of HWPs to .txt for downstream tooling
for f in *.hwp; do hwpkit-text "$f" > "${f%.hwp}.txt"; done

The extractor walks every Section* stream, decodes UTF-16LE, and strips inline controls (tables, images, footnote refs, autonumbers, page-number ctrls, bookmarks) so what you get is clean text — usable directly as input to chunkers, embeddings, or any LLM context.
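The control-stripping step can be sketched roughly as follows. This is not hwpkit's actual implementation: it assumes the control-character rules from Hancom's public HWP 5.0 format document, where code units below 32 are controls, a handful of them occupy one UTF-16 code unit, and the rest (tables, images, footnotes, tabs, and so on) occupy eight. The function name `strip_controls` is hypothetical.

```python
import struct

# Assumed from the public HWP 5.0 spec: these controls take one code unit.
CHAR_CONTROLS = {0, 10, 13, 24, 25, 26, 27, 28, 29, 30, 31}
WIDE_CONTROL_UNITS = 8  # every other code below 32 spans eight code units


def strip_controls(para_text: bytes) -> str:
    """Return the literal text of a PARA_TEXT payload (UTF-16LE)."""
    n = len(para_text) // 2
    units = struct.unpack(f"<{n}H", para_text[: n * 2])
    out = []
    i = 0
    while i < n:
        u = units[i]
        if u >= 32:
            out.append(u)            # ordinary character (or surrogate half)
            i += 1
        elif u in CHAR_CONTROLS:
            if u in (10, 13):
                out.append(0x0A)     # line / paragraph break -> newline
            elif u in (30, 31):
                out.append(0x20)     # non-breaking / fixed-width space -> space
            i += 1
        else:
            if u == 9:
                out.append(0x09)     # tab is an 8-unit inline control
            i += WIDE_CONTROL_UNITS  # skip the control and its parameters
    return struct.pack(f"<{len(out)}H", *out).decode("utf-16-le", errors="replace")
```

Skipping eight units at a time matters: the parameter words inside a wide control are often values of 32 or more and would otherwise leak into the output as stray characters.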

Edit operations

Function	When to use	What it does
`inject_text(records, i, text)`	The paragraph is empty (cell on a blank template)	Adds a PARA_TEXT record, updates the char count, and dummies the cached layout
`swap_in_para_text(records, i, old, new)`	Same-length substring swap (checkboxes □ → ☑, single-char rewrites)	Pure byte replace; keeps the cached layout intact
`replace_text(records, i, text)`	Paragraph has existing text you want to overwrite entirely	Rewrites PARA_TEXT, updates char count, dummies layout if length changed
`charshape.flatten_to_face(rec, face_id)`	Mixed-script paragraph (Korean + English) won't pick up font changes	Sets all 7 per-script CharShape slots to the same face — see GOTCHAS §3
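The "pure byte replace" behind a same-length swap can be illustrated with a small stdlib sketch. It assumes the PARA_TEXT payload is plain UTF-16LE and is not hwpkit's code; `swap_same_length` is a hypothetical helper. The point it shows is why the cached layout can stay untouched: the payload length, and therefore the character count, does not change.

```python
def swap_same_length(payload: bytes, old: str, new: str) -> bytes:
    """Replace the first aligned occurrence of old with new in a UTF-16LE payload."""
    old_b = old.encode("utf-16-le")
    new_b = new.encode("utf-16-le")
    if len(old_b) != len(new_b):
        raise ValueError("old and new must encode to the same number of bytes")

    # Only accept matches that start on a 2-byte boundary, so we never
    # match across the halves of two neighbouring characters.
    idx = payload.find(old_b)
    while idx != -1 and idx % 2:
        idx = payload.find(old_b, idx + 1)
    if idx == -1:
        raise ValueError("old text not found in paragraph")

    return payload[:idx] + new_b + payload[idx + len(old_b):]
```

When the lengths differ, the length check fails early, which is the case where a full rewrite of the paragraph text and its cached layout is needed instead.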

What's tricky about HWP

See docs/GOTCHAS.md. The short version:

- PARA_LINE_SEG cache: when a paragraph grows, the cached layout record must be replaced with 36 zero bytes. Anything else (keeping it, deleting it, or faking multiple segments) either trips Hancom's corruption check or makes the text render on a single smashed line.
- CharShape has seven font slots: Hangul / Latin / Hanja / Japanese / Symbol / User / Other. Hancom's font dropdown typically changes only the Hangul slot, so mixed-script paragraphs need explicit per-slot control via hwpkit.charshape.
- replace_text(records, i, "") corrupts the file: wiping a paragraph to empty produces a (chars=1, PARA_TEXT="\r") state that opens fine on its own but fails Hancom's checks when combined with other edits. Use a space or em-dash placeholder instead.
- Naive CFB writers fail RB-tree validation: Hancom validates the red-black-tree directory invariants on open. hwpkit.cfb reads the original tree pointers byte-for-byte and reuses them.
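Two of the points above (the zeroed PARA_LINE_SEG and the empty-text placeholder) can be sketched with the stdlib alone. This is an illustration under stated assumptions, not hwpkit's implementation: the tag number 69 and the packed record header come from the public HWP 5.0 format document, the 36 zero bytes are read here as the record payload, and `pack_record`, `dummy_line_seg`, and `safe_replacement` are hypothetical names.

```python
import struct

HWPTAG_PARA_LINE_SEG = 69  # assumed: HWPTAG_BEGIN (16) + 53 in the public spec
LINE_SEG_SIZE = 36         # size of the zeroed layout payload described above


def pack_record(tag: int, level: int, payload: bytes) -> bytes:
    """Serialize one record: 32-bit packed header, then the payload."""
    size = len(payload)
    if size < 0xFFF:
        header = tag | (level << 10) | (size << 20)
        return struct.pack("<I", header) + payload
    # Large payloads store 0xFFF in the size field plus a 4-byte real size.
    header = tag | (level << 10) | (0xFFF << 20)
    return struct.pack("<II", header, size) + payload


def dummy_line_seg(level: int) -> bytes:
    """Build a PARA_LINE_SEG record whose payload is 36 zero bytes."""
    return pack_record(HWPTAG_PARA_LINE_SEG, level, bytes(LINE_SEG_SIZE))


def safe_replacement(text: str) -> str:
    """Swap an empty replacement for a single space placeholder."""
    return text if text else " "
```

A caller would pass `safe_replacement(value)` wherever a form field may legitimately be blank, so an empty string never reaches the paragraph rewrite.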

Comparison

Capability	`pyhwp`	`olefile`	`hwpkit`
Extract plain text	✅	❌	✅
Convert HWP → XML / OWPML (semantic)	✅	❌	❌
Read raw streams	✅	✅	✅
Rewrite same-size stream	❌	✅	✅
Rewrite stream that grew/shrank	❌	❌	✅
Hancom accepts the output	n/a	only if same-size	✅

License

MIT — see LICENSE.

Project details

Maintainers

psychofict

Release history

0.1.4

May 19, 2026

0.1.3

May 19, 2026

0.1.2

May 19, 2026

0.1.1

May 19, 2026

This version

0.1.0

May 19, 2026

Download files

Source Distribution

hwpkit-0.1.0.tar.gz (22.2 kB)

Uploaded May 19, 2026 Source

Built Distribution

hwpkit-0.1.0-py3-none-any.whl (18.3 kB)

Uploaded May 19, 2026 Python 3

File details

Details for the file hwpkit-0.1.0.tar.gz.

File metadata

Download URL: hwpkit-0.1.0.tar.gz
Upload date: May 19, 2026
Size: 22.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for hwpkit-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6598388b08732cb2bdc2baa40cd73c8eca56a922516a12ce4a1ad8a419ad2c58`
MD5	`4b90c104f5d33c41512f8f8fd90891ca`
BLAKE2b-256	`da7ec90f91e3327b8ea8a6035b9040b0695399eced1edbda5a1d3af1ea4b0145`

Provenance

The following attestation bundles were made for hwpkit-0.1.0.tar.gz:

Publisher: publish.yml on psychofict/hwpkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hwpkit-0.1.0.tar.gz
- Subject digest: 6598388b08732cb2bdc2baa40cd73c8eca56a922516a12ce4a1ad8a419ad2c58
- Sigstore transparency entry: 1571990264
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: psychofict/hwpkit@aaa6841a80438eb0e389f3a5aee96a3cdde95ec8
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/psychofict
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@aaa6841a80438eb0e389f3a5aee96a3cdde95ec8
- Trigger Event: release

File details

Details for the file hwpkit-0.1.0-py3-none-any.whl.

File metadata

Download URL: hwpkit-0.1.0-py3-none-any.whl
Upload date: May 19, 2026
Size: 18.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for hwpkit-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4c68a6069d66228273a4c12991cfe980979a44a7b064d4c9b4062257787b66fb`
MD5	`26d8918f86576f8bffb4324f88d12661`
BLAKE2b-256	`174f1587ba25eb1e1615f158128c83ef6b867f9f9ece034bd136a317c0f2c5cf`

Provenance

The following attestation bundles were made for hwpkit-0.1.0-py3-none-any.whl:

Publisher: publish.yml on psychofict/hwpkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hwpkit-0.1.0-py3-none-any.whl
- Subject digest: 4c68a6069d66228273a4c12991cfe980979a44a7b064d4c9b4062257787b66fb
- Sigstore transparency entry: 1571990289
- Sigstore integration time: May 19, 2026
Source repository:
- Permalink: psychofict/hwpkit@aaa6841a80438eb0e389f3a5aee96a3cdde95ec8
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/psychofict
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@aaa6841a80438eb0e389f3a5aee96a3cdde95ec8
- Trigger Event: release
