Build LLM-friendly Markdown artifact bundles from HWPX and HWP documents.
hwpx2md
Convert Korean .hwpx or .hwp documents into an LLM-readable Markdown artifact bundle.
Use this when a model or script needs the document body, memos/comments, tables, and extracted assets in plain files instead of a binary office document.
Quick start
Run directly from PyPI with uvx:
uvx hwpx2md "input.hwpx" -o "input_bundle" --overwrite
uvx hwpx2md "input.hwp" -o "input_bundle" --overwrite
If -o is omitted, hwpx2md writes the bundle next to the source file as <input_stem>_llm_bundle.
Use quotes around Windows paths, Korean filenames, and paths containing spaces.
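The default output location described above can be sketched in a few lines of Python, which is handy when a script needs to know where the bundle will land before running the CLI. The helper name `default_bundle_dir` is hypothetical and not part of hwpx2md; it only mirrors the `<input_stem>_llm_bundle` rule stated here and assumes the stem is everything before the final extension.

```python
from pathlib import Path


def default_bundle_dir(source):
    """Return the sibling directory <input_stem>_llm_bundle for a source file.

    Hypothetical helper: it mirrors the documented naming rule and is not
    an hwpx2md API.
    """
    source = Path(source)
    # with_name keeps the parent directory and swaps only the final component.
    return source.with_name(f"{source.stem}_llm_bundle")


print(default_bundle_dir(Path("reports") / "input.hwpx"))
```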
What you get
After conversion, read the files in this order:
- document.md: main body text. Memo, table, and asset references are linked inline.
- memos.md: all memo/comment text. Each memo includes the document anchor, paragraph text, and nearby context.
- chunks.jsonl: retrieval-friendly chunks. Each JSON line contains text plus related memo, table, and asset IDs.
- manifest.json: machine-readable inventory with backend, counts, warnings, memo metadata, table metadata, and asset metadata.
- tables/: one set of artifacts per table: full Markdown, CSV, compact Markdown, and cell JSON.
- assets/: copied embedded or preview assets when the backend exposes them.
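Because chunks.jsonl holds one JSON object per line, it can be streamed without loading the whole file. The following is a minimal sketch, not an hwpx2md API: the helper names are hypothetical, the file is assumed to be UTF-8, and `"text"` is an assumed key name that should be checked against a real line from your own bundle.

```python
import json
from pathlib import Path


def iter_chunks(bundle_dir):
    """Yield one dict per non-empty line of <bundle_dir>/chunks.jsonl."""
    chunks_path = Path(bundle_dir) / "chunks.jsonl"
    # UTF-8 is an assumption; Korean text makes the explicit encoding worthwhile.
    with chunks_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def chunk_texts(bundle_dir, text_key="text"):
    """Collect the text of every chunk; "text" is an assumed key name."""
    return [chunk.get(text_key, "") for chunk in iter_chunks(bundle_dir)]
```

Calling `chunk_texts("input_bundle")` would return the chunk texts in file order, ready to pass to an embedding or retrieval step.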
The CLI prints a JSON summary with the output directory, backend, counts, files to read first, and warning codes:
{
  "output": "C:\\work\\input_bundle",
  "format": "hwpx",
  "backend": "native-hwpx",
  "counts": {
    "sections": 1,
    "paragraphs": 96,
    "tables": 5,
    "memos": 2,
    "assets": 1,
    "asset_refs": 0,
    "chunks": 2,
    "warnings": 1
  },
  "read_first": [
    "C:\\work\\input_bundle\\document.md",
    "C:\\work\\input_bundle\\memos.md",
    "C:\\work\\input_bundle\\chunks.jsonl",
    "C:\\work\\input_bundle\\manifest.json"
  ],
  "warnings": ["nested_tables_folded"]
}
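A calling script can parse this summary to decide which files to open and whether any warnings need attention. The sketch below assumes the JSON summary is the only thing the CLI writes to standard output, which the text here does not confirm; `summarize` is a hypothetical helper, and the `read_first` and `warnings` keys are taken from the example above.

```python
import json


def summarize(cli_stdout):
    """Parse the JSON summary and return (files to read first, warning codes)."""
    summary = json.loads(cli_stdout)
    # .get keeps the helper tolerant if a key is absent in some run.
    return summary.get("read_first", []), summary.get("warnings", [])


# A shortened stand-in for real CLI output, used only for illustration.
sample = (
    '{"output": "input_bundle", '
    '"read_first": ["input_bundle/document.md"], '
    '"warnings": ["nested_tables_folded"]}'
)
files, warnings = summarize(sample)
for code in warnings:
    print(f"warning: {code}")
```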
Backend fidelity
.hwpx uses the native XML backend. This is the highest-fidelity path. It preserves memo IDs, authors, timestamps when present, document anchors, table cell spans, and exposed assets.
.hwp uses hwp-hwpx-parser. This path does not require Hancom Office or Windows COM automation. It preserves body text, tables, memo text, and [MEMO:N] anchors, but the backend may not expose author, timestamp, exact memo IDs, image references, or table span metadata. Check manifest.json warnings for the exact limitations seen in a converted file.
For the best memo and table metadata, prefer .hwpx when you can. Use .hwp when the original binary file is all you have and text/memo extraction is more important than perfect layout metadata.
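One way to apply this preference in a script is to look for a .hwpx file next to each .hwp before converting. This is only a sketch under the assumption that both exports share a file stem and directory; `preferred_source` is a hypothetical helper, not part of hwpx2md.

```python
from pathlib import Path


def preferred_source(path):
    """Return the .hwpx sibling of a .hwp file when one exists, else the path itself."""
    path = Path(path)
    if path.suffix.lower() == ".hwp":
        sibling = path.with_suffix(".hwpx")
        if sibling.exists():
            # The native XML backend is the higher-fidelity path.
            return sibling
    return path
```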
Commands
Convert HWPX:
uvx hwpx2md "proposal.hwpx" -o "proposal_bundle" --overwrite
Convert HWP:
uvx hwpx2md "proposal.hwp" -o "proposal_bundle" --overwrite
Show the self-contained CLI guide:
uvx hwpx2md --help
Use the package from a local checkout:
cd hwpx2md
uv run hwpx2md "..\email\sample.hwpx" -o "..\.backup\sample_bundle" --overwrite
Python API
from pathlib import Path

from hwpx2md import bundle_document

manifest = bundle_document(
    Path("input.hwpx"),
    output_dir=Path("input_bundle"),
    overwrite=True,
)
print(manifest["counts"])
Use bundle_hwpx(...) if you want to accept only .hwpx, or bundle_hwp(...) if you want to accept only .hwp.
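For a folder of documents, the same call can be wrapped in a small batch loop. The sketch below uses only the `bundle_document` call shape shown above; `plan_bundles`, `convert_all`, and the output directory naming are hypothetical choices for illustration. The extension is kept in the directory name so that `a.hwp` and `a.hwpx` do not overwrite each other.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".hwpx", ".hwp"}


def plan_bundles(source_dir, output_root):
    """Pair each .hwpx/.hwp file in source_dir with its own output directory."""
    plan = []
    for source in sorted(Path(source_dir).iterdir()):
        if source.suffix.lower() in SUPPORTED_SUFFIXES:
            # "a.hwpx" -> "a_hwpx_bundle"; a naming choice made for this sketch.
            bundle_name = source.name.replace(".", "_") + "_bundle"
            plan.append((source, Path(output_root) / bundle_name))
    return plan


def convert_all(source_dir, output_root):
    """Convert every planned file and return its counts keyed by file name."""
    # Imported lazily so plan_bundles works without hwpx2md installed.
    from hwpx2md import bundle_document

    results = {}
    for source, output_dir in plan_bundles(source_dir, output_root):
        manifest = bundle_document(source, output_dir=output_dir, overwrite=True)
        results[source.name] = manifest["counts"]
    return results
```

Keeping the planning step separate from the conversion step makes it easy to review the source and output pairs before any files are written.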