Python bindings and integrations for the Rust hangulang document conversion engine

hangulang-python

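hangulang-python is the Python binding / integration package for the hangulang Rust core. It lets you work with HWP 5.0 and HWPX documents (Hancom Office / Hangul) from Python.

The Rust hangulang engine sits on top of the rhwp parser core, lowers documents to a semantic IR, and generates DocLang XML, a semantic payload, Markdown, and resource asset/URI references. hangulang-python exposes this engine as Python wheels, a Pythonic API, typed errors, a CLI, and optional integrations.

Status: v0.1 alpha, under active development. The native extension currently links the Rust hangulang engine from the vendor/hangulang submodule. Before a public release, two decisions still need to be settled: whether to switch to a crates.io dependency, and the wheel release policy.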


Why this exists

HWP/HWPX remains important in Korean public-sector, legal, education, and corporate document workflows. Yet the Python document-processing ecosystem has far fewer structured extraction tools for HWP/HWPX than for PDF or DOCX.

hangulang-python fills this gap in the following ways:

  • It reuses the Rust parser/exporter. Instead of reimplementing HWP parsing in Python, it exposes the semantic extraction results validated by Rust hangulang.
  • It delivers structure, not just text. DocLang XML, Markdown, the semantic payload, asset references, and layout metadata are all first-class outputs.
  • It fits Python workflows. It provides dict, str, dataclass options, typed exceptions, and a CLI entrypoint.
  • Heavy integrations are optional. The LangChain and Docling adapters are separate from the core conversion API.
  • It does not expose the raw rhwp model. Rather than low-level parser structures, it offers a semantic export contract that downstream pipelines can use directly.

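Project scope

hangulang-python is not a replacement for Rust hangulang. It is a wrapper layer for Python distribution and integration.

| Layer | Responsibility |
| --- | --- |
| rhwp | Parses the HWP/HWPX file formats; provides the internal document model and render tree |
| hangulang Rust core | Lowers the rhwp model to a semantic IR and generates DocLang / payload / Markdown / assets |
| hangulang-python native extension | Calls the Rust core across the PyO3 boundary; returns JSON / strings / assets |
| Python API | Pythonic functions, option dataclasses, typed exceptions, asset write policy |
| Integrations | External adapters such as LangChain and Docling |

Where rhwp-python is closer to a low-level parser binding, hangulang-python aims to be a high-level document conversion API that is usable out of the box.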


Installation

The project is currently in alpha. Python 3.10+ and a Rust toolchain are required.

Development environment:

git submodule update --init --recursive
uv venv --python 3.12 .venv
uv pip install -e '.[dev]'

To rebuild the native extension explicitly:

VIRTUAL_ENV=.venv .venv/bin/maturin develop

wheel build:

.venv/bin/maturin build --interpreter .venv/bin/python

Dependency note: the Rust core is pinned as the vendor/hangulang Git submodule. The Cargo dependency pulls in the package named hangulang under the alias hangulang-engine.

hangulang-engine = { package = "hangulang", path = "vendor/hangulang", features = ["serde"] }

Quick start

Python API

from hangulang import convert_to_doclang, convert_to_markdown, convert_to_payload

xml = convert_to_doclang("document.hwp")
markdown = convert_to_markdown("document.hwpx")
payload = convert_to_payload("document.hwp", include_locations=True)

The input can be either a file path or bytes:

from pathlib import Path
from hangulang import convert_to_payload

data = Path("document.hwp").read_bytes()
payload = convert_to_payload(data)
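
When inputs arrive as raw bytes, it can help to check the container type before handing them to a converter. The sketch below is not part of hangulang; `sniff_format` is a hypothetical helper. It relies only on two well-known signatures: HWP 5.0 files are OLE Compound Files, and HWPX files are ZIP packages. It narrows the container type only, so other OLE or ZIP files (for example .doc or .docx) would match as well.

```python
from pathlib import Path
from typing import Union

# Well-known container signatures (not specific to hangulang).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE Compound File, used by HWP 5.0
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header, used by HWPX


def sniff_format(source: Union[str, Path, bytes]) -> str:
    """Guess "hwp", "hwpx", or "unknown" from the first bytes of the input.

    Hypothetical helper: this is a container-level guess, not a validation.
    """
    if isinstance(source, (bytes, bytearray)):
        head = bytes(source[:8])
    else:
        # Treat anything else as a file path and read only the header.
        with open(source, "rb") as fh:
            head = fh.read(8)
    if head.startswith(OLE_MAGIC):
        return "hwp"
    if head.startswith(ZIP_MAGIC):
        return "hwpx"
    return "unknown"
```

A caller could use the result to skip obviously unrelated files before invoking `convert_to_payload`.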

When the number of options grows, you can use ConversionOptions:

from hangulang import ConversionOptions, convert_to_doclang

options = ConversionOptions(include_locations=True)
xml = convert_to_doclang("document.hwp", options)

Output API

| API | Output | Notes |
| --- | --- | --- |
| convert_to_doclang | str | DocLang v0.6 XML |
| convert_to_markdown | str | Generated directly from the same Rust semantic IR |
| convert_to_payload | dict | Returns the stable semantic payload JSON as a Python dict |
| extract_assets | list[ExtractedAsset] | Extracts or references embedded image/resource assets |

Asset handling

Images are handled through the Rust core's resource policy and can be emitted as data URIs, as asset files, or under a URI prefix.

from hangulang import AssetPolicy, extract_assets

assets = extract_assets(
    "document.hwp",
    asset_policy=AssetPolicy.WRITE,
    output_dir="assets",
)

for asset in assets:
    print(asset.path, asset.mime_type, asset.uri)
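
To make the difference between a data URI and a URI prefix concrete, here is a minimal sketch that does not call hangulang at all. `AssetStub` is a hypothetical stand-in for `ExtractedAsset`. The `data` field in particular is an assumption, since the source only shows `path`, `mime_type`, and `uri`. The real policy logic lives in the Rust core and may differ.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssetStub:
    """Hypothetical stand-in for ExtractedAsset; the real class may differ."""
    mime_type: str
    data: bytes                 # assumed field, not confirmed by the package
    path: Optional[str] = None
    uri: Optional[str] = None


def to_data_uri(asset: AssetStub) -> str:
    """Inline the asset bytes as a base64 data URI (RFC 2397)."""
    encoded = base64.b64encode(asset.data).decode("ascii")
    return f"data:{asset.mime_type};base64,{encoded}"


def with_prefix(asset: AssetStub, uri_prefix: str) -> str:
    """Build a storage URI from a prefix and the asset's file name."""
    # Normalize Windows separators, then keep only the final path component.
    name = (asset.path or "asset").replace("\\", "/").rsplit("/", 1)[-1]
    return uri_prefix.rstrip("/") + "/" + name
```

Inlining keeps a document self-contained at the cost of size, while a prefix keeps the payload small and defers storage to the downstream system.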

CLI

The Python package provides a hangulang console script.

hangulang convert document.hwp --format doclang
hangulang convert document.hwp --format markdown
hangulang convert document.hwp --format payload --locations
hangulang assets document.hwp --out assets/

The CLI has no conversion implementation of its own. It is a thin layer over the public Python API, so the API and the CLI share the same Rust core and behave the same way.
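
The "thin layer" idea can be sketched with argparse. This is not the package's actual CLI code. The converter signature `(path, include_locations)` is an assumption, and converters are injected so the sketch runs without the native extension. Only the `convert` subcommand and the `--format` and `--locations` flags shown above are mirrored.

```python
import argparse
import json
from typing import Callable, Dict, List

# Assumed converter signature for this sketch: (path, include_locations) -> result.
Converter = Callable[[str, bool], object]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hangulang")
    sub = parser.add_subparsers(dest="command", required=True)
    convert = sub.add_parser("convert")
    convert.add_argument("input")
    convert.add_argument(
        "--format", choices=["doclang", "markdown", "payload"], required=True
    )
    convert.add_argument("--locations", action="store_true")
    return parser


def run(argv: List[str], converters: Dict[str, Converter]) -> str:
    """Parse arguments and delegate to the matching converter."""
    args = build_parser().parse_args(argv)
    result = converters[args.format](args.input, args.locations)
    # The payload is a dict, so serialize it; other formats are already text.
    if isinstance(result, dict):
        return json.dumps(result, ensure_ascii=False, indent=2)
    return str(result)
```

In a real entrypoint, the `converters` mapping would point at the public functions such as `convert_to_doclang`, which keeps all conversion behavior in one place.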


Options and errors

ConversionOptions

from hangulang import AssetPolicy, ConversionOptions

options = ConversionOptions(
    include_locations=True,
    asset_policy=AssetPolicy.INLINE,
    report_losses=False,
)
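
| Option | Default | Meaning |
| --- | --- | --- |
| include_locations | False | Request layout location / bbox metadata |
| bbox_resolution | "none" | Expresses the intended bbox resolution for the Python API |
| asset_policy | AssetPolicy.INLINE | How assets are handled: inline, write, URI reference, and so on |
| asset_output_dir | None | Output directory used by the asset write policy |
| uri_prefix | None | Asset URI prefix for downstream storage |
| report_losses | False | Reserved field for a future loss-reporting API |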
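
For readers who want to see the option table as code, the following is a rough stand-in and not the package's own definition. The field names and defaults come from the table. The class names `AssetPolicySketch` and `OptionsSketch`, the enum values, the `URI` member name, and the validation rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AssetPolicySketch(Enum):
    INLINE = "inline"
    WRITE = "write"
    URI = "uri"  # hypothetical member name for the URI-reference policy


@dataclass(frozen=True)
class OptionsSketch:
    """Mirrors the documented fields and defaults; not the real ConversionOptions."""
    include_locations: bool = False
    bbox_resolution: str = "none"
    asset_policy: AssetPolicySketch = AssetPolicySketch.INLINE
    asset_output_dir: Optional[str] = None
    uri_prefix: Optional[str] = None
    report_losses: bool = False

    def __post_init__(self) -> None:
        # Assumed rule, not confirmed by the package:
        # writing assets to disk needs a target directory.
        if (
            self.asset_policy is AssetPolicySketch.WRITE
            and self.asset_output_dir is None
        ):
            raise ValueError("asset_policy=WRITE requires asset_output_dir")
```

A frozen dataclass makes an options object safe to share across repeated conversions, which is one reason option bundles are often modeled this way.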

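Exceptions

| Exception | Meaning |
| --- | --- |
| HangulangError | Base class for all package errors |
| UnsupportedFormatError | Unsupported input format, such as encrypted or distribution-locked documents |
| ParseError | Failure while reading the file or during the parser stage |
| ConversionError | Failure during the conversion stage, such as XML/JSON/asset serialization |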
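
One way to use this hierarchy in a batch job is to catch the specific errors first and fall back to the base class. The classes below are local stand-ins with the same names so that the sketch runs on its own; in real code they would be imported from the package, assuming they are exposed there. `convert_or_label` and its labels are hypothetical.

```python
from typing import Any, Callable, Optional, Tuple


class HangulangError(Exception):
    """Stand-in for the package's base error class."""


class UnsupportedFormatError(HangulangError):
    """Stand-in: unsupported, encrypted, or distribution-locked input."""


class ParseError(HangulangError):
    """Stand-in: file read or parser failure."""


class ConversionError(HangulangError):
    """Stand-in: serialization failure during conversion."""


def convert_or_label(
    convert: Callable[[Any], Any], source: Any
) -> Tuple[Optional[Any], Optional[str]]:
    """Run a converter and turn typed errors into a short label."""
    try:
        return convert(source), None
    except UnsupportedFormatError:
        return None, "unsupported"
    except ParseError:
        return None, "parse-failed"
    except HangulangError:
        # Catches ConversionError and any other package error.
        return None, "conversion-failed"
```

Ordering matters here: the base-class handler comes last, otherwise it would shadow the more specific ones.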

Optional integrations

The core package does not install LangChain or Docling as required dependencies.

| Module | Status | Role |
| --- | --- | --- |
| hangulang.integrations.langchain | implemented | LangChain Document loader at block or document granularity |
| hangulang.integrations.docling | implemented | Docling handoff / payload / DocLang / Markdown adapter |

The LangChain integration is kept separate and targets langchain-core>=1.0,<2.0:

uv pip install -e '.[langchain]'

By default, the LangChain loader returns each text block of the semantic payload as its own Document, preserving source, schema_version, doclang_version, block_id, block_kind, page_number, bbox, and resource metadata where available.

from hangulang.integrations.langchain import HangulangLoader

docs = HangulangLoader("document.hwp", include_locations=True).load()

To get the whole document as a single Document:

docs = HangulangLoader("document.hwp", mode="document").load()
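
The block and document modes can be illustrated without LangChain installed. The sketch below is not the loader's implementation. The payload shape (a top-level `blocks` list whose items have `id`, `kind`, `text`, `page_number`, and `bbox`) is an assumption, and `Doc` is a minimal stand-in for a LangChain Document. Only the metadata key names are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def blocks_to_docs(
    payload: Dict[str, Any], source: str, mode: str = "block"
) -> List[Doc]:
    """Map a payload with an assumed shape to one Doc per text block, or one Doc total."""
    base = {
        "source": source,
        "schema_version": payload.get("schema_version"),
        "doclang_version": payload.get("doclang_version"),
    }
    # Keep only blocks that carry text (assumed key: "text").
    blocks = [b for b in payload.get("blocks", []) if b.get("text")]
    if mode == "document":
        text = "\n\n".join(b["text"] for b in blocks)
        return [Doc(text, dict(base))]
    docs = []
    for b in blocks:
        meta = dict(base, block_id=b.get("id"), block_kind=b.get("kind"))
        # Preserve optional layout metadata only when it is present.
        for key in ("page_number", "bbox"):
            if b.get(key) is not None:
                meta[key] = b[key]
        docs.append(Doc(b["text"], meta))
    return docs
```

Block mode suits retrieval pipelines that cite a specific location, while document mode suits summarization over the full text.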

The Docling adapter takes no hard dependency on any specific Docling runtime class and returns a framework-neutral handoff dict. If needed, you can request only the payload, the DocLang XML, or the Markdown.

from hangulang.integrations.docling import HangulangDoclingAdapter

adapter = HangulangDoclingAdapter(include_locations=True)
handoff = adapter.convert("document.hwp", format="handoff")
xml = adapter.convert("document.hwp", format="doclang")

Architecture

 HWP 5.0 (.hwp) ─┐
                 ├─► hangulang Rust core ─┬─► DocLang XML ───────► Python str
 HWPX (.hwpx) ──┘                         ├─► semantic payload ─► Python dict
                                           ├─► Markdown ─────────► Python str
                                           └─► resource assets ──► ExtractedAsset

 Python API / CLI ─► PyO3 native extension ─► Rust convert APIs

Principles of the Python layer:

  • Parser logic stays in Rust.
  • Python is responsible for API stability, packaging, typing, error mapping, and integrations.
  • Heavy downstream dependencies go into optional extras or separate adapters.
  • The public API stabilizes procedural functions first; an object-oriented API is added once repeated conversion or state becomes necessary.

Development

.venv/bin/python -m pytest
.venv/bin/python -m ruff check .
.venv/bin/python -m mypy python/hangulang
cargo test
.venv/bin/maturin build --interpreter .venv/bin/python

The Python tests currently reuse the Rust hangulang fixture corpus in vendor/hangulang/tests/fixtures.


Roadmap

  • Include the hangulang Rust submodule in the CI and wheel build flow.
  • Run Python tests, the Rust extension build, type checks, and a wheel smoke test in CI.
  • Refine convert_to_payload loss reporting and the Python option model.
  • Finalize the downstream contract for the asset URI/write policies.
  • Stabilize the LangChain loader's chunking strategy and metadata schema.
  • Once the Docling runtime plugin contract is settled, connect the handoff adapter as an official backend.
  • Set up a wheel build matrix for macOS, Linux, and Windows.
  • Publish a stable release to PyPI after hangulang / rhwp are published to crates.io.

License

MIT. See LICENSE for details.

This is an independent open-source project. HWP/HWPX are formats of Hancom Inc. (한글과컴퓨터), and this project is not affiliated with Hancom. DocLang is a project of the LF AI & Data Foundation. rhwp is © Edward Kim (MIT).

Download files

Source distribution:

  • hangulang-0.1.0a0.tar.gz (4.4 MB)

Built distributions:

  • hangulang-0.1.0a0-cp313-cp313-win_amd64.whl (2.4 MB): CPython 3.13, Windows x86-64
  • hangulang-0.1.0a0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB): CPython 3.13, manylinux (glibc 2.17+) x86-64
  • hangulang-0.1.0a0-cp313-cp313-macosx_11_0_arm64.whl (2.6 MB): CPython 3.13, macOS 11.0+ ARM64
  • hangulang-0.1.0a0-cp312-cp312-win_amd64.whl (2.4 MB): CPython 3.12, Windows x86-64
  • hangulang-0.1.0a0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB): CPython 3.12, manylinux (glibc 2.17+) x86-64
  • hangulang-0.1.0a0-cp312-cp312-macosx_11_0_arm64.whl (2.6 MB): CPython 3.12, macOS 11.0+ ARM64
  • hangulang-0.1.0a0-cp311-cp311-win_amd64.whl (2.4 MB): CPython 3.11, Windows x86-64
  • hangulang-0.1.0a0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB): CPython 3.11, manylinux (glibc 2.17+) x86-64
  • hangulang-0.1.0a0-cp311-cp311-macosx_11_0_arm64.whl (2.6 MB): CPython 3.11, macOS 11.0+ ARM64
  • hangulang-0.1.0a0-cp310-cp310-win_amd64.whl (2.4 MB): CPython 3.10, Windows x86-64
  • hangulang-0.1.0a0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB): CPython 3.10, manylinux (glibc 2.17+) x86-64
  • hangulang-0.1.0a0-cp310-cp310-macosx_11_0_arm64.whl (2.6 MB): CPython 3.10, macOS 11.0+ ARM64

File details

Details for the file hangulang-0.1.0a0.tar.gz.

File metadata

  • Download URL: hangulang-0.1.0a0.tar.gz
  • Upload date:
  • Size: 4.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for hangulang-0.1.0a0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 99ff439b4320a7dbde21f93debf79b9d7f4d24735d409d62d104bf573f4bfc3e |
| MD5 | db6cfff66caecd5913ba42de858f7cd3 |
| BLAKE2b-256 | c3d7c56351927fc1bb0a53ad7e7bd7a335b8faa102462252957804535948858c |

Provenance

The following attestation bundles were made for hangulang-0.1.0a0.tar.gz:

Publisher: release.yml on myeolinmalchi/hangulang-python

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
