Build LLM-friendly Markdown artifact bundles from HWPX and HWP documents.

hwpx2md

Convert Korean .hwpx or .hwp documents into an LLM-readable Markdown artifact bundle.

Use this when a model or script needs the document body, memos/comments, tables, and extracted assets in plain files instead of a binary office document.

Quick start

Run directly from PyPI with uvx:

uvx hwpx2md "input.hwpx" -o "input_bundle" --overwrite
uvx hwpx2md "input.hwp" -o "input_bundle" --overwrite

If -o is omitted, hwpx2md writes the bundle next to the source file as <input_stem>_llm_bundle. For example, input.hwpx produces input_llm_bundle in the same directory.

Use quotes around Windows paths, Korean filenames, and paths containing spaces.
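
If a script needs to know where the bundle will land when -o is omitted, the naming rule above can be reproduced with pathlib. This is a minimal sketch of that rule, not the tool's own code; the helper name default_bundle_dir is hypothetical.

```python
from pathlib import Path


def default_bundle_dir(source: Path) -> Path:
    """Mirror the documented default: <input_stem>_llm_bundle next to the source."""
    # with_name() keeps the parent directory and swaps only the final component.
    return source.with_name(f"{source.stem}_llm_bundle")


print(default_bundle_dir(Path("reports") / "input.hwpx"))
```

Because the path is built with pathlib rather than string concatenation, Korean filenames and names containing spaces need no special handling here.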

What you get

After conversion, read the files in this order:

  1. document.md: main body text. Memo, table, and asset references are linked inline.
  2. memos.md: all memo/comment text. Each memo includes the document anchor, paragraph text, and nearby context.
  3. chunks.jsonl: retrieval-friendly chunks. Each JSON line contains text plus related memo, table, and asset IDs.
  4. manifest.json: machine-readable inventory with backend, counts, warnings, memo metadata, table metadata, and asset metadata.
  5. tables/: one set of artifacts per table: full Markdown, CSV, compact Markdown, and cell JSON.
  6. assets/: copied embedded or preview assets when the backend exposes them.
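
For retrieval pipelines, chunks.jsonl can be read one JSON object per line. The sketch below assumes UTF-8 encoding and makes no assumption about field names beyond what is stated above, so it prints the keys of each chunk rather than relying on a particular schema. The helper name iter_chunks is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_chunks(bundle_dir: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of chunks.jsonl in the bundle."""
    with (bundle_dir / "chunks.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


bundle = Path("input_bundle")
if (bundle / "chunks.jsonl").exists():
    for chunk in iter_chunks(bundle):
        # Inspect the available keys before depending on any of them.
        print(sorted(chunk))
```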

The CLI prints a JSON summary with the output directory, backend, counts, files to read first, and warning codes:

{
  "output": "C:\\work\\input_bundle",
  "format": "hwpx",
  "backend": "native-hwpx",
  "counts": {
    "sections": 1,
    "paragraphs": 96,
    "tables": 5,
    "memos": 2,
    "assets": 1,
    "asset_refs": 0,
    "chunks": 2,
    "warnings": 1
  },
  "read_first": [
    "C:\\work\\input_bundle\\document.md",
    "C:\\work\\input_bundle\\memos.md",
    "C:\\work\\input_bundle\\chunks.jsonl",
    "C:\\work\\input_bundle\\manifest.json"
  ],
  "warnings": ["nested_tables_folded"]
}
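
A calling script can parse this summary to decide which files to read and whether to surface warnings. The following is a sketch that assumes the CLI's standard output contains only the JSON object shown above; the helper name read_summary is hypothetical, and the sample input is a shortened copy of that example.

```python
import json


def read_summary(cli_stdout: str) -> dict:
    """Parse the JSON summary and keep the fields a caller usually needs."""
    summary = json.loads(cli_stdout)
    return {
        "backend": summary["backend"],
        "read_first": summary["read_first"],
        # Treat a missing warnings list as "no warnings" (an assumption).
        "warnings": summary.get("warnings", []),
    }


# Shortened stand-in for real CLI output.
sample = json.dumps({
    "output": "input_bundle",
    "format": "hwpx",
    "backend": "native-hwpx",
    "counts": {"tables": 5, "memos": 2, "warnings": 1},
    "read_first": ["input_bundle/document.md", "input_bundle/memos.md"],
    "warnings": ["nested_tables_folded"],
})

info = read_summary(sample)
if info["warnings"]:
    print("warnings:", ", ".join(info["warnings"]))
```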

Backend fidelity

.hwpx uses the native XML backend. This is the highest-fidelity path. It preserves memo IDs, authors, timestamps when present, document anchors, table cell spans, and exposed assets.

.hwp uses hwp-hwpx-parser. This path does not require Hancom Office or Windows COM automation. It preserves body text, tables, memo text, and [MEMO:N] anchors, but the backend may not expose author, timestamp, exact memo IDs, image references, or table span metadata. Check manifest.json warnings for the exact limitations seen in a converted file.

For the best memo and table metadata, prefer .hwpx when you can. Use .hwp when the original binary file is all you have and text/memo extraction is more important than perfect layout metadata.
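
When both formats of the same document may sit side by side, the preference above can be applied before conversion. This is a sketch under the assumption that the .hwpx sibling shares the .hwp file's stem and directory; pick_source is a hypothetical helper, not part of hwpx2md.

```python
from pathlib import Path


def pick_source(path: Path) -> Path:
    """Prefer a sibling .hwpx over a .hwp when both exist on disk."""
    if path.suffix.lower() == ".hwp":
        sibling = path.with_suffix(".hwpx")
        if sibling.exists():
            return sibling
    # Either the input is already .hwpx or no higher-fidelity sibling exists.
    return path


print(pick_source(Path("proposal.hwp")))
```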

Commands

Convert HWPX:

uvx hwpx2md "proposal.hwpx" -o "proposal_bundle" --overwrite

Convert HWP:

uvx hwpx2md "proposal.hwp" -o "proposal_bundle" --overwrite

Show the self-contained CLI guide:

uvx hwpx2md --help

Use the package from a local checkout:

cd hwpx2md
uv run hwpx2md "..\email\sample.hwpx" -o "..\.backup\sample_bundle" --overwrite
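
The same commands can be driven from a Python script. The sketch below builds the argument list from the flags shown above (-o and --overwrite) and passes it to subprocess without a shell, so paths with spaces or Korean characters need no manual quoting. It assumes uvx is on PATH; build_command and convert are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_command(source: Path, output_dir: Path, overwrite: bool = False) -> list[str]:
    """Build the uvx invocation using only the flags shown in the commands above."""
    command = ["uvx", "hwpx2md", str(source), "-o", str(output_dir)]
    if overwrite:
        command.append("--overwrite")
    return command


def convert(source: Path, output_dir: Path, overwrite: bool = False) -> str:
    """Run the conversion and return the CLI's standard output as text."""
    result = subprocess.run(
        build_command(source, output_dir, overwrite),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return result.stdout


print(build_command(Path("proposal.hwpx"), Path("proposal_bundle"), overwrite=True))
```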

Python API

from pathlib import Path
from hwpx2md import bundle_document

manifest = bundle_document(
    Path("input.hwpx"),
    output_dir=Path("input_bundle"),
    overwrite=True,
)

print(manifest["counts"])

Use bundle_hwpx(...) if you want to accept only .hwpx, or bundle_hwp(...) if you want to accept only .hwp.
