PyO3 Python bindings for rhwp — parser and renderer for HWP/HWPX documents (Korean word processor format)
rhwp-python
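⚠️ This is an unofficial community package. It is not an official distribution of edwardkim/rhwp; the package deliberately uses the name
rhwp-python, deferring to the rhwp maintainers in case they publish to PyPI themselves. Please report rhwp core bugs upstream.
PyO3 Python bindings for rhwp, a Rust-based parser and renderer for HWP/HWPX (Hancom Office) documents.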
- PyPI package name: rhwp-python
- Python import: import rhwp
- Rust core: pinned as a git submodule at external/rhwp
Why rhwp-python
- HWP and HWPX in one package: the main alternative, pyhwp, supports only HWP5 and has been unmaintained since 2016. rhwp handles both formats through the same API.
- 62× faster text extraction: 96 ms vs 5,980 ms against pyhwp on HWP5 (sandbox benchmark).
- Ready-made LangChain integration: rhwp.integrations.langchain.HwpLoader ships as an extra and plugs directly into RAG pipelines.
- Fully typed: py.typed + .pyi stubs, pyright clean.
Requirements
- Python 3.9+ (a single abi3-py39 wheel covers 3.9 through 3.13+)
- The core API has no runtime Python dependencies
- The rhwp-python[langchain] extra installs only one additional package, langchain-core>=0.2
Installation
pip install rhwp-python
# or
uv add rhwp-python
Usage
import rhwp
# Parse HWP / HWPX; the GIL is released during file I/O and parsing
doc = rhwp.parse("report.hwp")
print(doc.section_count, doc.paragraph_count, doc.page_count)
# Text
full_text: str = doc.extract_text()  # empty paragraphs excluded, joined with "\n"
paragraphs: list[str] = doc.paragraphs()  # raw list, empty paragraphs included
# SVG rendering: a single page or all pages
svg_page0: str = doc.render_svg(page=0)
all_svgs: list[str] = doc.render_all_svg()
written: list[str] = doc.export_svg("output/", prefix="page")
# → page_001.svg, page_002.svg, ... (page.svg for a single-page document)
# PDF rendering: returns Python `bytes`, not list[int]
pdf: bytes = doc.render_pdf()
byte_size: int = doc.export_pdf("output.pdf")
rhwp.Document(path) behaves the same as rhwp.parse(path).
LangChain integration
pip install "rhwp-python[langchain]"
from rhwp.integrations.langchain import HwpLoader
# Whole document as a single Document (default: single mode)
docs = HwpLoader("report.hwp").load()
# One Document per paragraph, empty paragraphs excluded (for RAG chunking: paragraph mode)
docs = HwpLoader("report.hwp", mode="paragraph").load()
# lazy_load: yields Documents on the fly (O(1) peak memory in paragraph mode)
for d in HwpLoader("report.hwp", mode="paragraph").lazy_load():
    index_into_vector_store(d)  # your own pipeline
# Feeds straight into the standard LangChain text splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)
Every Document carries the metadata source, section_count, paragraph_count,
page_count, and rhwp_version. Paragraph mode adds paragraph_index.
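When indexing from lazy_load(), it can help to hand documents to a vector store in fixed-size batches rather than one at a time, while still never holding the whole document list in memory. The following is a minimal sketch of such a batching helper; the name `batched` is a hypothetical helper introduced here for illustration and is not part of rhwp or LangChain.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items, consuming `items` lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        # islice pulls only the next `size` items from the underlying iterator.
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Assuming a LangChain vector store object named `vector_store` (hypothetical here), usage could look like `for batch in batched(HwpLoader("report.hwp", mode="paragraph").lazy_load(), 64): vector_store.add_documents(batch)`. Peak memory is then bounded by the batch size rather than by the document length.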
Performance
Apple M2 (8 cores), release build. Parse = file read + full parse + Document creation.
Workload: 9 files (aift.hwp 5.5 MB + table-vpos-01.hwpx 359 KB + tac-img-02.hwpx 3.96 MB, ×3).
| Workers | Parse time | Speedup vs. sequential |
|---|---|---|
| 1 | 268 ms | 1.00× (baseline) |
| 2 | 141 ms | 1.91× |
| 4 | 97 ms | 2.76× |
| 8 | 67 ms | 4.01× |
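One way to read the table is as parallel efficiency, that is, speedup divided by worker count. The short sketch below recomputes it from the rounded millisecond values in the table, so the last digit may differ slightly from the speedup column; the function name `parallel_efficiency` is introduced here only for illustration.

```python
# Parse times in milliseconds, copied from the table above.
parse_ms = {1: 268, 2: 141, 4: 97, 8: 67}


def parallel_efficiency(times_ms):
    """Return {workers: speedup / workers} relative to the 1-worker time."""
    base = times_ms[1]
    return {workers: (base / t) / workers for workers, t in times_ms.items()}


efficiency = parallel_efficiency(parse_ms)
for workers, eff in efficiency.items():
    print(f"{workers} workers: efficiency {eff:.2f}")
```

Efficiency falls from about 0.95 at 2 workers to about 0.50 at 8 workers, which suggests that adding workers keeps helping on this workload but with diminishing returns.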
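parse() and the PDF conversion step release the GIL via py.detach, so a ThreadPoolExecutor
scales with the number of workers (about 4× at 8 workers in the table above). PDF rendering itself is
CPU/allocator bound inside usvg + pdf-writer, so 2 to 3 workers yield only about a 1.1× improvement.
See benches/bench_gil.py to reproduce.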
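A thread-pool setup along these lines can be sketched as follows. This is not a copy of benches/bench_gil.py, whose contents are not shown here; the helper names `summarize` and `summarize_all` are hypothetical, and `parse` is passed in as a parameter so that rhwp.parse (or a stand-in) can be supplied by the caller. Each worker parses and consumes its own document and returns only primitive values, so no Document object crosses a thread boundary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List, Tuple


def summarize(path: str, parse: Callable[[str], Any]) -> Tuple[str, int, int]:
    """Parse one file and reduce it to primitives inside the worker thread."""
    doc = parse(path)  # the document is created and used on this thread only
    text = doc.extract_text()
    return path, doc.page_count, len(text)


def summarize_all(
    paths: Iterable[str],
    parse: Callable[[str], Any],
    max_workers: int = 4,
) -> List[Tuple[str, int, int]]:
    """Run summarize() over all paths in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: summarize(p, parse), paths))
```

With rhwp installed, a call might look like `summarize_all(["a.hwp", "b.hwpx"], rhwp.parse, max_workers=8)`. Returning tuples of str and int instead of the parsed objects keeps the results safe to use from the main thread.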
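Known limitations (Phase 1)
- Document objects are #[pyclass(unsendable)]: only single-threaded access is allowed, and cross-thread access raises RuntimeError. For multithreaded use, follow the benches/bench_gil.py pattern: complete parse + consume inside the worker, then return only primitive types (int, str, bytes).
- No font embedding, debug overlay, or page metadata API (Phase 2+).
- No HWP/HWPX saving (serialization): read and render only.
- No structured access to tables, images, or equations: text extraction only.
- The PDF render path prints the rhwp core's [DEBUG_TAB_POS] / LAYOUT_OVERFLOW logs to stdout. If needed, filter them with grep -v -E "(DEBUG_TAB_POS|LAYOUT_OVERFLOW)".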
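Where grep is not available, or the output has already been captured as text (for example from a subprocess), the same filter can be sketched in Python. The function name `drop_core_logs` is a hypothetical helper, and the sketch assumes only that the unwanted lines contain one of the two markers named above.

```python
import re
from typing import Iterable, Iterator

# Same pattern as the grep expression: either marker anywhere in the line.
_CORE_LOG = re.compile(r"DEBUG_TAB_POS|LAYOUT_OVERFLOW")


def drop_core_logs(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not contain an rhwp core log marker."""
    for line in lines:
        if not _CORE_LOG.search(line):
            yield line
```

As a pipe filter this could be used as `sys.stdout.writelines(drop_core_logs(sys.stdin))`, which behaves like the grep -v -E command for these two markers.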
Development
This repo consumes the rhwp Rust core as a git submodule at external/rhwp.
git clone --recurse-submodules https://github.com/DanMeon/rhwp-python
cd rhwp-python
# Install dev + testing + linting tools in one go
uv sync --no-install-project --group all
uv run maturin develop --release
# Tests (core + LangChain, excluding slow PDF tests)
uv run pytest tests/ -m "not slow"
# PDF rendering tests
uv run pytest tests/ -m slow
# Type check
uv run pyright python/ tests/
# GIL release benchmark
uv run python benches/bench_gil.py 2>&1 | grep -v -E "(DEBUG_TAB_POS|LAYOUT_OVERFLOW)"
If you forgot --recurse-submodules when cloning:
git submodule update --init --recursive
Test fixtures live inside the submodule at external/rhwp/samples/,
and tests/conftest.py references that path.
Versioning
This Python package and the rhwp Rust core are versioned independently.
rhwp.version() returns the version of this package, and rhwp.rhwp_core_version()
returns the version of the Rust core contained in the pinned submodule.
License
MIT. Copyright holders: Edward Kim (rhwp Rust core) + DanMeon (rhwp-python bindings). See LICENSE for details.
Project home
- Bindings source / issues: https://github.com/DanMeon/rhwp-python
- rhwp Rust core: https://github.com/edwardkim/rhwp