Skip to main content

PyO3 Python bindings for rhwp — parser and renderer for HWP/HWPX documents (Korean word processor format)

Project description

rhwp-python

한국어 | English

PyPI Python CI License: MIT

⚠️ 비공식 커뮤니티 패키지입니다. 본 프로젝트는 edwardkim/rhwp공식 배포가 아니며, rhwp 메인테이너가 직접 PyPI 에 올릴 경우를 대비해 이름을 rhwp-python 으로 양보해 둔 상태입니다. rhwp 코어 버그는 업스트림 에 보고해 주세요.

rhwp — Rust 기반 HWP/HWPX(한컴오피스 문서) 파서·렌더러 — 의 PyO3 Python 바인딩.

  • PyPI 패키지명: rhwp-python
  • Python import: import rhwp
  • Rust 코어: edwardkim/rhwp

왜 rhwp-python 인가

  • HWP + HWPX 동시 지원 — 대표 대안인 pyhwp 는 HWP5 만 지원하고 2016년 이후 유지보수 중단 상태. rhwp 는 두 포맷을 같은 API 로 처리.
  • 텍스트 추출 62배 빠름 — HWP5 기준 pyhwp 대비 96 ms vs 5,980 ms (sandbox 벤치).
  • LangChain 즉시 연동rhwp.integrations.langchain.HwpLoader 를 extras 로 제공, RAG 파이프라인에 바로 플러그인 가능.
  • 타입 완비py.typed + .pyi 스텁, pyright clean.

요구 사항

  • Python 3.10+ (abi3-py310 wheel 하나로 3.10 ~ 3.13+ 커버)
  • 코어 API 는 런타임 Python 의존성 없음
  • rhwp-python[langchain] extras 는 langchain-core>=0.2 하나만 추가 설치

설치

pip install rhwp-python
# 또는
uv add rhwp-python

사용법

import rhwp

# HWP / HWPX 파싱 — 파일 I/O + 파싱 단계에서 GIL 해제
doc = rhwp.parse("report.hwp")
print(doc.section_count, doc.paragraph_count, doc.page_count)

# 텍스트
full_text: str = doc.extract_text()          # 빈 문단 제외, "\n" 으로 join
paragraphs: list[str] = doc.paragraphs()      # 빈 문단 포함 원본 리스트

# SVG 렌더링 — 단일 페이지 또는 전체
svg_page0: str = doc.render_svg(page=0)
all_svgs: list[str] = doc.render_all_svg()
written: list[str] = doc.export_svg("output/", prefix="page")
# → page_001.svg, page_002.svg, ... (단일 페이지면 page.svg)

# PDF 렌더링 — list[int] 가 아니라 Python `bytes` 반환
pdf: bytes = doc.render_pdf()
byte_size: int = doc.export_pdf("output.pdf")

rhwp.Document(path)rhwp.parse(path) 와 동일하게 동작.

LangChain 통합

pip install "rhwp-python[langchain]"
from rhwp.integrations.langchain import HwpLoader

# 문서 전체를 단일 Document 로 (기본 — single 모드)
docs = HwpLoader("report.hwp").load()

# 빈 문단 제외, 문단 1개당 Document 1개 (RAG 청킹용 — paragraph 모드)
docs = HwpLoader("report.hwp", mode="paragraph").load()

# lazy_load: Document 를 on-the-fly 로 yield (paragraph 모드에서 O(1) peak memory)
for d in HwpLoader("report.hwp", mode="paragraph").lazy_load():
    index_into_vector_store(d)   # 사용자 파이프라인

# 표준 LangChain 텍스트 스플리터에 바로 연결
from langchain_text_splitters import RecursiveCharacterTextSplitter
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)

모든 Document 메타데이터: source, section_count, paragraph_count, page_count, rhwp_version. paragraph 모드에서는 paragraph_index 추가.

Document IR

RAG / LLM 파이프라인이 직접 소비하는 구조화 문서 모델. Pydantic V2 모델 + JSON Schema (Draft 2020-12) — HWP 의 구역 / 단락 / 표 / 그림 / 수식 / 각주 / 목록 / 캡션 / 목차 / 필드를 손실 없이 노출한다.

from rhwp.ir.nodes import ParagraphBlock, TableBlock

doc = rhwp.parse("report.hwp")
ir = doc.to_ir()                       # -> rhwp.ir.nodes.HwpDocument (Pydantic, frozen)
json_str = doc.to_ir_json(indent=2)    # JSON 직렬화

# 본문 블록을 순서대로 스트리밍 (표/문단 혼합, TableCell.blocks 까지 재귀)
for block in ir.iter_blocks(scope="body"):
    if isinstance(block, ParagraphBlock):
        print("P", block.prov.section_idx, block.prov.para_idx, block.text)
    elif isinstance(block, TableBlock):
        print("T", block.rows, "x", block.cols, "cells=", len(block.cells))

표 3중 표현cells (구조화 SQL/순회용) + html (HtmlRAG 호환 LLM 프롬프트) + text (평문 검색 폴백) 가 병기된다. 중첩 표는 TableCell.blocks 재귀로 자연 지원.

LangChain 통합 — 기존 loader 에 mode="ir-blocks" 추가:

from rhwp.integrations.langchain import HwpLoader

docs = HwpLoader("report.hwp", mode="ir-blocks").load()
# ^ 단락은 page_content=text, 표는 page_content=HTML. 메타에 kind / section_idx /
#   para_idx / (표의 경우) rows / cols / text / caption 포함

JSON Schemarhwp.ir.schema.export_schema() / load_schema(). 공개 $id: https://danmeon.github.io/rhwp-python/schema/hwp_ir/v1/schema.json (불변 경로).

rhwp-py CLI

pip install "rhwp-python[cli]"          # parse / version / schema / ir / blocks
pip install "rhwp-python[cli-chunks]"    # + chunks (langchain text splitter)

rhwp-py parse report.hwp
rhwp-py blocks report.hwp --kind table --format ndjson | jq '.rows'
rhwp-py chunks report.hwp --size 1000 --format ndjson

rhwp-py 는 구조 추출 (IR / 블록 / 청크 / 스키마) 전담 — 시각 출력 (SVG/PDF) / 메타데이터 덤프는 상류 rhwp Rust 바이너리. 자세한 사용은 rhwp-py --help 또는 cli.md 참조.

성능

Apple M2 (8 코어) release 빌드. Parse = 파일 읽기 + 전체 파싱 + Document 생성. 워크로드: 9 개 파일 (aift.hwp 5.5 MB + table-vpos-01.hwpx 359 KB + tac-img-02.hwpx 3.96 MB, ×3).

워커 수 Parse 시간 순차 대비 가속
1 268 ms 1.00× (기준)
2 141 ms 1.91×
4 97 ms 2.76×
8 67 ms 4.01×

parse() 와 PDF 변환 단계는 py.detach 로 GIL 을 해제하므로 ThreadPoolExecutor 가 코어 수에 비례해 스케일. PDF 렌더링 자체는 usvg + pdf-writer 내부에서 CPU/allocator 바운드라 2 ~ 3 워커에서 약 1.1× 정도만 향상됨 — 재현은 benches/bench_gil.py 참고.

알려진 제약 / 운영 노트

운영상 제약 (Document 의 단일 스레드 모델, async 진입점, PDF stdout 노이즈) 및 미구현 영역 요약은 KNOWN_ISSUES.md. 작업 중 / 계획 항목은 docs/roadmap/ 의 활성 spec 인덱스.

개발

소스에서 빌드·테스트·기여하는 절차는 CONTRIBUTING.md 참조.

버전 관리

이 Python 패키지와 rhwp Rust 코어는 독립적으로 버저닝됩니다. rhwp.version() 은 이 패키지 버전을, rhwp.rhwp_core_version() 은 번들된 Rust 코어 버전을 반환합니다.

라이선스

MIT. 저작권자: Edward Kim (rhwp Rust 코어) + DanMeon (rhwp-python 바인딩). 자세한 내용은 LICENSE.

프로젝트 홈

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rhwp_python-0.3.2.tar.gz (61.4 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

rhwp_python-0.3.2-cp310-abi3-win_amd64.whl (3.6 MB view details)

Uploaded CPython 3.10+Windows x86-64

rhwp_python-0.3.2-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ x86-64

rhwp_python-0.3.2-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (3.5 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ ARM64

rhwp_python-0.3.2-cp310-abi3-macosx_11_0_arm64.whl (3.3 MB view details)

Uploaded CPython 3.10+macOS 11.0+ ARM64

rhwp_python-0.3.2-cp310-abi3-macosx_10_12_x86_64.whl (3.5 MB view details)

Uploaded CPython 3.10+macOS 10.12+ x86-64

File details

Details for the file rhwp_python-0.3.2.tar.gz.

File metadata

  • Download URL: rhwp_python-0.3.2.tar.gz
  • Upload date:
  • Size: 61.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rhwp_python-0.3.2.tar.gz
Algorithm Hash digest
SHA256 d46f0c0b5a95a420b3480e3c7c944e9be8bfad09460f9fbf757eaf1475fdb0cf
MD5 f7d21df4dfd162960b97ca75ae51e519
BLAKE2b-256 33f3c23f89a6a739f0ccab3faad7437db0e4f23328b170758f09c8ce720d6e99

See more details on using hashes here.

Provenance

The following attestation bundles were made for rhwp_python-0.3.2.tar.gz:

Publisher: publish.yml on DanMeon/rhwp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rhwp_python-0.3.2-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: rhwp_python-0.3.2-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 3.6 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rhwp_python-0.3.2-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 64fe5e51cc3ca9720c53eceeed674f01a28f286b96e0d062b7bdec55c3335ca2
MD5 fc6fc4b1cd06ac650b79c5d12548f596
BLAKE2b-256 bac43e31b8241cf3939a813d98b0c409770dcc89b2a4eb6b63de90c48f5239f1

See more details on using hashes here.

Provenance

The following attestation bundles were made for rhwp_python-0.3.2-cp310-abi3-win_amd64.whl:

Publisher: publish.yml on DanMeon/rhwp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rhwp_python-0.3.2-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rhwp_python-0.3.2-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a847394348ca3f58d08aba40c39530f1f72dfb51a579aa3c9d2a1297bd98a063
MD5 6da77eb50793bb2cb6f9089b8a04f2ef
BLAKE2b-256 1b4efc204911a1d082ec154ecc7bd93ac7125037bbc64ef6327adba85c04791a

See more details on using hashes here.

Provenance

The following attestation bundles were made for rhwp_python-0.3.2-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: publish.yml on DanMeon/rhwp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rhwp_python-0.3.2-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rhwp_python-0.3.2-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 24419b0a7bce004f4866714919bb637a6e61e1cad958f86a9e62484f5ebb65fc
MD5 23960d754a80bbf9c6563ff130474c1f
BLAKE2b-256 50b7ba0a0a275901ca790685c0308ae9a4a0ad32547be4af391e0aae171e80b9

See more details on using hashes here.

Provenance

The following attestation bundles were made for rhwp_python-0.3.2-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: publish.yml on DanMeon/rhwp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rhwp_python-0.3.2-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for rhwp_python-0.3.2-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 507c532573e1cd23b6cae7a8b16870e957a1bb7bc1b3b2b3dd5afb9381cc120a
MD5 6116d834543a1277f86b07666bb0d8d1
BLAKE2b-256 594626a968b5ff94b826e7503a092807f7e08d66bb803aee4138a1599068a574

See more details on using hashes here.

Provenance

The following attestation bundles were made for rhwp_python-0.3.2-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: publish.yml on DanMeon/rhwp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rhwp_python-0.3.2-cp310-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for rhwp_python-0.3.2-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 bfd2b90d3d3ad1d4041d76604eb6965ee56c6771f11da8f88c3ca422eea469dd
MD5 0c019781bbf096ebdcafb14ac515aaa1
BLAKE2b-256 999e9d9453d5300cce69384694bd29485940dc0bef3ab5052035f369c59d9d66

See more details on using hashes here.

Provenance

The following attestation bundles were made for rhwp_python-0.3.2-cp310-abi3-macosx_10_12_x86_64.whl:

Publisher: publish.yml on DanMeon/rhwp-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page