Read Korean HWP/HWPX documents in Python; edit paragraphs and table cells in HWPX; natural-language edits via Claude. AI-friendly API.

master-of-hwp

Read Korean HWP/HWPX documents in Python, edit paragraphs in HWPX, and expose structure to AI workflows.

master-of-hwp is a Python-first library for opening real .hwp and .hwpx files, inspecting sections / paragraphs / tables, querying content, and performing immutable paragraph edits. The API is designed to be LLM-friendly: results are plain Python data structures, every mutation returns a new document, and a round-trip fidelity harness validates that edits preserve document structure.
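
The "every mutation returns a new document" idea can be illustrated with a small, self-contained sketch. `SketchDocument` below is a hypothetical stand-in written for this example, not the library's `HwpDocument`; it only shows how an immutable replace can leave the original untouched while reusing unchanged sections.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchDocument:
    """Hypothetical stand-in: a tuple of sections, each a tuple of paragraph strings."""

    sections: tuple[tuple[str, ...], ...]

    def replace_paragraph(self, section: int, paragraph: int, text: str) -> "SketchDocument":
        # Reject out-of-range (including negative) indices up front.
        if not 0 <= section < len(self.sections):
            raise IndexError(f"section index {section} out of range")
        old_section = self.sections[section]
        if not 0 <= paragraph < len(old_section):
            raise IndexError(f"paragraph index {paragraph} out of range")
        # Rebuild only the affected section; all other sections are reused as-is.
        new_section = old_section[:paragraph] + (text,) + old_section[paragraph + 1:]
        new_sections = self.sections[:section] + (new_section,) + self.sections[section + 1:]
        return replace(self, sections=new_sections)


original = SketchDocument(sections=(("Title", "Old intro"), ("Appendix",)))
edited = original.replace_paragraph(0, 1, "New intro")
```

Because the original object never changes, a caller can keep it around as a free rollback point if a later verification step rejects the edit.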

Why this exists

Korean government, education, and enterprise workflows rely on HWP documents. Most AI tooling cannot read them directly, so documents get converted to DOCX and back, which shreds tables and formatting. master-of-hwp reads the real format, exposes the structure AI needs, and checks every edit against the original bytes through a round-trip fidelity harness.

30-Second Quickstart

pip install master-of-hwp

from master_of_hwp import HwpDocument

doc = HwpDocument.open("report.hwpx")

# Inspect
print(f"{doc.sections_count} sections, {len(list(doc.iter_paragraphs()))} paragraphs")
print(doc.summary())

# Query
for section, paragraph, text in doc.find_paragraphs("보도자료"):
    print(f"{section}.{paragraph}: {text}")

# Edit (HWPX) — immutable: returns a new document
edited = doc.replace_paragraph(0, 0, "New intro text")
edited.path.with_suffix(".edited.hwpx").write_bytes(edited.raw_bytes)
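
To make the query step more concrete, here is a minimal sketch of a substring-or-regex search over nested paragraph lists that yields (section, paragraph, text) tuples. It assumes the data is a plain list of lists of strings; the function name `find_paragraphs_sketch` and the default values for `regex` and `case_sensitive` are assumptions made for this example, not the library's implementation.

```python
import re
from typing import Iterator


def find_paragraphs_sketch(
    section_paragraphs: list[list[str]],
    query: str,
    regex: bool = False,
    case_sensitive: bool = True,
) -> Iterator[tuple[int, int, str]]:
    """Yield (section_index, paragraph_index, text) for every matching paragraph."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # A plain substring query is escaped so regex metacharacters match literally.
    pattern = re.compile(query if regex else re.escape(query), flags)
    for s, paragraphs in enumerate(section_paragraphs):
        for p, text in enumerate(paragraphs):
            if pattern.search(text):
                yield s, p, text


sections = [
    ["보도자료 제목", "Body text"],
    ["Contact: press@example.org", "보도자료 끝"],
]
substring_hits = list(find_paragraphs_sketch(sections, "보도자료"))
regex_hits = list(find_paragraphs_sketch(sections, r"\w+@\w+\.\w+", regex=True))
```

The returned indices are the same pair that a paragraph replacement takes, so a search hit can be fed straight into an edit.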

API at a Glance

API Purpose
HwpDocument.open(path) Open .hwp / .hwpx as an immutable document
.sections_count Number of sections
.byte_size Size of raw bytes
.section_texts Plain text per section
.section_paragraphs Paragraphs per section (nested list)
.section_tables Tables: [section][table][row][cell][paragraph]
.plain_text All sections concatenated, format-agnostic normalization
.iter_paragraphs() Yield (section, paragraph, text) tuples
.find_paragraphs(query, regex=, case_sensitive=) Substring or regex search
.summary() Compact JSON-serializable overview for LLM context
.replace_paragraph(s, p, text) Return a new document with one paragraph replaced
.replace_table_cell_paragraph(s, t, r, c, p, text) Edit a paragraph inside a table cell (HWPX)
.ai_edit(request, provider=, dry_run=) Natural-language edit pipeline (intent → locate → apply → verify)
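
The `[section][table][row][cell][paragraph]` nesting listed above can be easier to follow with a worked example. The sketch below uses made-up data in that shape, assuming each level is a plain Python list and each paragraph is a string; the helper names are hypothetical.

```python
from typing import Iterator

# Made-up sample data in the [section][table][row][cell][paragraph] shape.
section_tables = [
    [  # section 0
        [  # table 0
            [["Name"], ["Fee"]],           # row 0: two cells, one paragraph each
            [["Kim"], ["30,000", "KRW"]],  # row 1: the second cell holds two paragraphs
        ],
    ],
    [],  # section 1 contains no tables
]


def iter_cell_paragraphs(tables) -> Iterator[tuple[int, int, int, int, int, str]]:
    """Yield (section, table, row, cell, paragraph, text) for every cell paragraph."""
    for s, section in enumerate(tables):
        for t, table in enumerate(section):
            for r, row in enumerate(table):
                for c, cell in enumerate(row):
                    for p, text in enumerate(cell):
                        yield s, t, r, c, p, text


def table_as_grid(table) -> list[list[str]]:
    """Flatten each cell's paragraphs into one string to get a simple 2-D grid."""
    return [[" ".join(cell) for cell in row] for row in table]


all_cells = list(iter_cell_paragraphs(section_tables))
grid = table_as_grid(section_tables[0][0])
```

The five indices come out in the same order as the `(s, t, r, c, p)` arguments of `.replace_table_cell_paragraph`, which is why the traversal yields them as a flat tuple.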

Supported Formats

Capability | HWP 5.0 (.hwp) | HWPX (.hwpx)
Open document | ✅ | ✅
Count sections | ✅ | ✅
Extract section text | ✅ | ✅
Enumerate paragraphs | ✅ | ✅
Enumerate tables | Best effort* | ✅
Replace paragraph | Same-length only** | ✅
Replace table cell paragraph | ❌ (v0.3) | ✅
Insert / delete | ❌ (v0.3) | ❌ (v0.3)

* Minimal heuristic anchored on the TABLE(0x5B) record; exact row/cell recovery is pending a richer record-level parser.
** Different-length HWP 5.0 edits require a CFBF stream resize writer, scheduled for v0.3.
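
As a rough illustration of the same-length constraint, the sketch below checks whether a replacement would occupy the same number of UTF-16 code units as the original text. It assumes paragraph text is stored as UTF-16LE and that "same length" is measured in code units; both are assumptions for this example, and the helper names are hypothetical rather than part of the library.

```python
def utf16_units(text: str) -> int:
    """Number of UTF-16 code units needed to store the text (assumed storage encoding)."""
    return len(text.encode("utf-16-le")) // 2


def fits_same_length(old_text: str, new_text: str) -> bool:
    """True when the replacement could overwrite the original without resizing the stream."""
    return utf16_units(old_text) == utf16_units(new_text)


same = fits_same_length("급식비", "수업료")      # three code units each
different = fits_same_length("급식비", "등록금액")  # three versus four code units
```

Counting code units rather than Python characters matters for characters outside the Basic Multilingual Plane, which take two units each.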

Natural-Language Editing

pip install "master-of-hwp[ai]"  # adds anthropic SDK; quotes keep shells such as zsh from expanding the brackets
export ANTHROPIC_API_KEY=sk-ant-...

from master_of_hwp import HwpDocument
from master_of_hwp.ai.providers import AnthropicProvider

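# "가정통신문" is a school newsletter sent home to parents.
doc = HwpDocument.open("가정통신문.hwpx")
result = doc.ai_edit(
    # Request: "Change '급식비' (meal fee) to '수업료' (tuition) in the first paragraph."
    "첫 번째 문단의 '급식비'를 '수업료'로 바꿔줘",
    provider=AnthropicProvider(),
)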
if result.status == "applied":
    result.new_doc.path.with_suffix(".edited.hwpx").write_bytes(result.new_doc.raw_bytes)
else:
    print(result.message)  # refused / failed explanation

Without an API key, a rule-based fallback parser handles simple patterns (the Korean keywords 바꿔 "change" and 변경 "modify", plus keyword matches). See master_of_hwp.ai.providers for the LLMProvider Protocol, which lets you plug in OpenAI, a local Ollama model, and so on.
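
To give a feel for what a rule-based fallback of this kind might look like, here is a minimal sketch that extracts the old and new text from a request of the form "'X'를 'Y'로 바꿔". The pattern, the function name `parse_replace_request`, and the restriction to straight single quotes are all assumptions made for this example; the library's actual parser may work differently.

```python
import re
from typing import Optional

# 'old' + object particle (을/를) + 'new' + direction particle (으로/로) + a change keyword.
_REPLACE_PATTERN = re.compile(
    r"'(?P<old>[^']+)'\s*[을를]\s*'(?P<new>[^']+)'\s*(?:으로|로)\s*(?:바꿔|변경)"
)


def parse_replace_request(request: str) -> Optional[tuple[str, str]]:
    """Return (old_text, new_text) for a simple replace request, or None if it does not match."""
    match = _REPLACE_PATTERN.search(request)
    if match is None:
        return None
    return match.group("old"), match.group("new")


parsed = parse_replace_request("첫 번째 문단의 '급식비'를 '수업료'로 바꿔줘")
unparsed = parse_replace_request("표를 삭제해줘")  # "delete the table": no replace pattern
```

Returning None for anything the pattern does not cover gives the caller a clear signal to refuse the request instead of guessing.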

Studio (Non-developer GUI)

For teachers / office workers who want a one-click experience:

pip install master-of-hwp-studio
mohwp studio                    # launches web GUI + MCP server
mohwp mcp-config                # prints Claude Desktop config snippet

See studio/README.md.

Fidelity Harness

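from pathlib import Path

from master_of_hwp.fidelity.harness import verify_replace_roundtrip
from master_of_hwp.core.document import SourceFormat

# Load the original file as bytes; "report.hwpx" is a placeholder path.
raw_bytes = Path("report.hwpx").read_bytes()

# Replace paragraph 5 of section 0, then check the round trip.
report = verify_replace_roundtrip(
    raw_bytes, SourceFormat.HWPX, section_index=0, paragraph_index=5, new_text="New content"
)
assert report.structural_equal  # structure is unchanged by the edit
assert report.edited_paragraph_applied  # the new text is present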
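
The idea behind a round-trip check can be sketched with the standard library alone. The example below treats a document as a ZIP archive holding one XML part, replaces one paragraph's text, and compares the tag skeleton before and after. The member name `Contents/section0.xml`, the `sec` and `p` tags, and every helper here are simplified assumptions for illustration; this is not how the harness itself is implemented.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

PART = "Contents/section0.xml"  # assumed member name, for illustration only


def pack(root: ET.Element) -> bytes:
    """Serialize the XML tree into a single-member ZIP archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(PART, ET.tostring(root, encoding="utf-8"))
    return buffer.getvalue()


def read_root(raw: bytes) -> ET.Element:
    """Parse the XML part back out of the archive bytes."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return ET.fromstring(archive.read(PART))


def skeleton(root: ET.Element) -> list[str]:
    """Tag names in document order, ignoring text: a crude structural fingerprint."""
    return [element.tag for element in root.iter()]


def replace_paragraph(raw: bytes, index: int, new_text: str) -> bytes:
    """Return new archive bytes with one paragraph's text replaced."""
    root = read_root(raw)
    root.findall("p")[index].text = new_text
    return pack(root)


# Build a toy document with three paragraphs.
source = ET.Element("sec")
for text in ["Intro", "Body", "Closing"]:
    ET.SubElement(source, "p").text = text
original = pack(source)

edited = replace_paragraph(original, 1, "New content")
structural_equal = skeleton(read_root(original)) == skeleton(read_root(edited))
edited_paragraph_applied = read_root(edited).findall("p")[1].text == "New content"
```

A real harness would compare far more than tag order (attributes, other archive members, binary parts), but the two booleans mirror the two assertions in the snippet above.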

Examples

python examples/01_read_sections.py  samples/public-official/table-vpos-01.hwpx
python examples/02_extract_tables.py samples/public-official/table-vpos-01.hwpx
python examples/03_edit_paragraph.py samples/public-official/table-vpos-01.hwpx outputs/edited.hwpx

Roadmap

  • v0.1 ✅ — Read path, HWPX paragraph replacement, fidelity harness, AI scaffold
  • v0.2 — HWP 5.0 resize writer, paragraph insert/delete, table cell edit
  • v0.3 — Full agentic edit loop (intent → locate → operate → verify → rollback)
  • v1.0 — API compatibility contract starts

Details: docs/ROADMAP.md, docs/ARCHITECTURE.md.

Philosophy

  • Platform-first — infrastructure, not a template app.
  • Round-trip fidelity is the contract — opening and saving must not corrupt structure; proved by a benchmark, not a hope.
  • Agentic document intelligence — documents should understand themselves.
  • Solo OSS · no commercial pressure · quality first — take the time it needs.

Contributing

Contributions welcome. See CONTRIBUTING.md for development setup, test expectations, and scope.

License

MIT — see LICENSE.

Korean overview

For a Korean-language introduction to the project, see README.ko.md.
