Pure Python HWP/HWPX parser - extracts text, tables, footnotes, and memos without a JVM

Maintainers

Daehyeon

HWP-HWPX Parser

Pure Python HWP/HWPX parser - extracts text, tables, footnotes, endnotes, and memos without a JVM

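Features

No JVM required: pure Python implementation that works without installing Java
Lightweight: minimal dependencies (only olefile is required)
Quick start: ready to use after pip install hwp-hwpx-parser
Unified API: handles HWP and HWPX files through the same interface
Rich extraction: supports text, tables, images, footnotes, endnotes, hyperlinks, and memos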

Installation

pip install hwp-hwpx-parser

Quick Start

from hwp_hwpx_parser import Reader

# Use as a context manager (recommended)
with Reader("document.hwp") as r:
    print(r.text)                    # body text
    print(r.tables)                  # list of tables
    print(r.get_memos())             # list of memos

# HWPX files work the same way
with Reader("document.hwpx") as r:
    print(r.text)

API Reference

Reader (unified reader)

from hwp_hwpx_parser import Reader

with Reader("document.hwp") as r:
    # Basic attributes
    r.text                      # body text (str)
    r.tables                    # list of tables (List[TableData])
    r.file_type                 # file type (FileType.HWP5 or FileType.HWPX)
    r.is_valid                  # whether the file is valid (bool)
    r.is_encrypted              # whether the file is encrypted (bool)

    # Methods
    r.extract_text()                    # extract text
    r.extract_text_with_notes()         # text plus footnotes/endnotes/links/memos in one result
    r.get_tables()                      # list of tables
    r.get_images()                      # list of images
    r.get_memos()                       # list of memos
    r.get_tables_as_markdown()          # tables in Markdown format
    r.get_tables_as_csv()               # tables in CSV format
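
As a rough illustration of what Markdown and CSV table output involves, the sketch below converts a 2D list of cell strings into both formats using only the standard library. The helpers `rows_to_markdown` and `rows_to_csv` are hypothetical names, not part of hwp-hwpx-parser, and the library's own escaping and layout may differ.

```python
import csv
import io
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render a 2D list as a Markdown pipe table; the first row becomes the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: List[str]) -> str:
        # Pad short rows, escape pipes, and flatten newlines inside cells.
        cells = list(row) + [""] * (width - len(row))
        cells = [c.replace("|", "\\|").replace("\n", " ") for c in cells]
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in rows[1:])
    return "\n".join(lines)


def rows_to_csv(rows: List[List[str]], delimiter: str = ",") -> str:
    """Render a 2D list as CSV; the csv module takes care of quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = [["Criterion", "Points"], ["All 4 intervals", "100 %"], ["3 intervals", "75 %"]]
print(rows_to_markdown(rows))
print(rows_to_csv(rows))
```

Padding short rows keeps the Markdown table rectangular, which matters for tables with merged cells once they are flattened to rows.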

Individual Readers

from hwp_hwpx_parser import HWP5Reader, HWPXReader

# For HWP 5.0 files only
reader = HWP5Reader("document.hwp")

# For HWPX files only
reader = HWPXReader("document.hwpx")
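
When choosing between the two readers yourself, one plausible approach is to look at the file's leading bytes rather than its extension: HWP 5.0 files are OLE compound files (hence the olefile dependency) and HWPX files are ZIP archives. The sketch below uses only the standard library; `sniff_format` and its return strings are hypothetical, and how `Reader` actually detects the type is not documented here.

```python
import os
import tempfile

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file signature
ZIP_MAGIC = b"PK\x03\x04"                      # local file header of a ZIP archive


def sniff_format(path: str) -> str:
    """Return "hwp5", "hwpx", or "unknown" based on the first bytes of the file.

    This only identifies the container: any OLE or ZIP file matches, so the
    result is a candidate format, not proof that the file is a valid document.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE_MAGIC):
        return "hwp5"
    if head.startswith(ZIP_MAGIC):
        return "hwpx"
    return "unknown"


# Demonstration with throwaway files that contain only the signatures.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, payload in [("a.hwp", OLE_MAGIC), ("b.hwpx", ZIP_MAGIC), ("c.txt", b"hello")]:
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as f:
            f.write(payload)
        print(name, sniff_format(path))
```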

Convenience Functions

from hwp_hwpx_parser import read, extract_hwp5, extract_hwpx

# read() - returns a Reader instance
reader = read("document.hwp")
try:
    print(reader.text)
finally:
    reader.close()  # always release the file, even if reading fails

# extract_hwp5() - extract HWP text directly
text = extract_hwp5("document.hwp")

# extract_hwpx() - extract HWPX text directly
text = extract_hwpx("document.hwpx")

Data Models

from hwp_hwpx_parser import (
    ExtractOptions,    # extraction options
    TableData,         # table data
    TableStyle,        # table style (INLINE, MARKDOWN, CSV)
    ImageData,         # image data
    NoteData,          # footnote/endnote data
    HyperlinkData,     # hyperlink data
    MemoData,          # memo data
    ExtractResult,     # combined extraction result
)

# The snippets below assume `reader` is an open Reader instance.

# Using TableData
table = reader.tables[0]
print(table.rows)           # 2D list: [[cell1, cell2], ...]
print(table.row_count)      # number of rows
print(table.col_count)      # number of columns
print(table.to_markdown())  # convert to Markdown
print(table.to_csv())       # convert to CSV

# Using ImageData
images = reader.get_images()
for img in images:
    print(img.filename)     # file name (e.g. "BIN0001.png")
    print(img.format)       # image format (e.g. "PNG")
    print(len(img.data))    # size of the binary data
    with open(img.filename, "wb") as f:
        f.write(img.data)   # save to a file

# Using ExtractResult
result = reader.extract_text_with_notes()
print(result.text)          # body (footnotes appear as [^1], endnotes as [^e1])
print(result.footnotes)     # List[NoteData]
print(result.endnotes)      # List[NoteData]
print(result.hyperlinks)    # List[Tuple[str, str]]
print(result.memos)         # List[MemoData]
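
The image loop above writes each file under its embedded name in the current directory. For documents from untrusted sources it may be safer to drop any directory components from the name first. A minimal sketch, assuming a hypothetical `save_image` helper that takes the name and bytes directly so it runs without the library:

```python
import tempfile
from pathlib import Path, PurePosixPath


def save_image(filename: str, data: bytes, out_dir: Path) -> Path:
    """Write data into out_dir, keeping only the last path component of filename."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Normalise Windows separators, then keep the final component only.
    safe_name = PurePosixPath(filename.replace("\\", "/")).name
    if safe_name in ("", ".", ".."):
        safe_name = "unnamed.bin"  # fallback for names with no usable component
    target = out_dir / safe_name
    target.write_bytes(data)
    return target


with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_image("../outside/BIN0001.png", b"\x89PNG", Path(tmp_dir))
    print(saved.name, saved.stat().st_size)
```

With the library, the call would presumably look like `save_image(img.filename, img.data, Path("images"))` inside the loop.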

Extraction Options

from hwp_hwpx_parser import ExtractOptions, TableStyle, ImageMarkerStyle

options = ExtractOptions(
    table_style=TableStyle.MARKDOWN,        # table output style
    table_delimiter=",",                    # CSV delimiter
    image_marker=ImageMarkerStyle.SIMPLE,   # image marker style
    paragraph_separator="\n\n",             # paragraph separator
    line_separator="\n",                    # line separator
    include_empty_paragraphs=False,         # whether to include empty paragraphs
)

text = reader.extract_text(options)
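
To make the separator options concrete, here is a sketch of how `paragraph_separator`, `line_separator`, and `include_empty_paragraphs` could plausibly combine when paragraphs are joined into a single string. The function `join_paragraphs` and its input shape (a list of paragraphs, each a list of lines) are assumptions for illustration, not the library's internals.

```python
from typing import List


def join_paragraphs(
    paragraphs: List[List[str]],
    paragraph_separator: str = "\n\n",
    line_separator: str = "\n",
    include_empty_paragraphs: bool = False,
) -> str:
    """Join the lines of each paragraph, then join the paragraphs."""
    rendered = [line_separator.join(lines) for lines in paragraphs]
    if not include_empty_paragraphs:
        # Drop paragraphs that contain nothing but whitespace.
        rendered = [p for p in rendered if p.strip()]
    return paragraph_separator.join(rendered)


doc = [["Title"], [], ["First line", "second line"]]
print(repr(join_paragraphs(doc)))
print(repr(join_paragraphs(doc, include_empty_paragraphs=True)))
```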

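Supported Features

| Feature | HWP | HWPX |
| --- | --- | --- |
| Text extraction | ✅ | ✅ |
| Table extraction (Markdown) | ✅ | ✅ |
| Nested table extraction | ✅ | ✅ |
| Image extraction | ✅ | ✅ |
| Image position markers | ✅ | ✅ |
| Footnote extraction | ✅ | ✅ |
| Endnote extraction | ✅ | ✅ |
| Footnote/endnote markers inside tables | ✅ | ✅ |
| Hyperlink extraction | ✅ | ✅ |
| Memo extraction | ✅ | ✅ |
| Encrypted file detection | ✅ | ✅ |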
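
Encrypted-file detection for HWP 5.0 can plausibly be done by reading a flag rather than attempting decryption. The sketch below assumes the layout given in the published HWP 5.0 format description: the `FileHeader` stream starts with a 32-byte signature (`HWP Document File`, null-padded), followed by a 4-byte version and a 4-byte little-endian property field in which bit 1 marks a password-protected document. Treat these offsets as an assumption to check against the specification. `is_hwp5_encrypted` is a hypothetical helper that takes the raw stream bytes, and nothing here claims to be how this library implements `is_encrypted`.

```python
import struct

SIGNATURE = b"HWP Document File"
ENCRYPTED_BIT = 0x02  # assumed: bit 1 of the property field


def is_hwp5_encrypted(file_header: bytes) -> bool:
    """Inspect the raw bytes of the FileHeader stream for the password flag."""
    if len(file_header) < 40 or not file_header.startswith(SIGNATURE):
        raise ValueError("not an HWP 5.0 FileHeader stream")
    # Offset 36 = 32-byte signature + 4-byte version.
    (properties,) = struct.unpack_from("<I", file_header, 36)
    return bool(properties & ENCRYPTED_BIT)


def make_header(properties: int) -> bytes:
    """Build a synthetic header for demonstration purposes only."""
    return SIGNATURE.ljust(32, b"\x00") + struct.pack("<II", 0x05000300, properties)


print(is_hwp5_encrypted(make_header(0x01)))  # compressed flag only
print(is_hwp5_encrypted(make_header(0x03)))  # compressed and password flags
```

In practice the stream bytes would come from an OLE reader such as olefile; that step is left out so the sketch stays self-contained.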

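Footnotes and Endnotes Inside Tables

When a table cell contains a footnote or endnote, only the marker is inserted in the cell and the note text is output in a separate section. The sample below is output from a Korean-language document: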

| 채점 기준 | 배점 |
| --- | --- |
| 4 구간의 이각 변화[^e1]를 모두 옳게 서술한 경우 | 100 % |
| 3 구간의 이각 변화만 옳게 서술한 경우 | 75 % |

---
## 미주
[^e1]: 이각 변화는 지구에서 관측할 때 태양과 특정 천체 사이의 각도 거리가...
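
Because the markers follow a regular pattern (`[^1]` for footnotes, `[^e1]` for endnotes), downstream code can locate them with a regular expression. The following is an illustration only; `find_note_markers` is a hypothetical helper, not a library function.

```python
import re
from typing import List, Tuple

# An optional "e" marks an endnote; the digits are the note number.
NOTE_MARKER = re.compile(r"\[\^(e?)(\d+)\]")


def find_note_markers(text: str) -> List[Tuple[str, int]]:
    """Return (kind, number) pairs in order of appearance.

    Definition lines such as "[^e1]: ..." match as well, so apply this to the
    body text only if references and definitions need to be told apart.
    """
    return [
        ("endnote" if m.group(1) else "footnote", int(m.group(2)))
        for m in NOTE_MARKER.finditer(text)
    ]


sample = "| elongation change[^e1] described | 100 % |\nSee also the remark[^1]."
print(find_note_markers(sample))
```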

Image Extraction

Extracts the images embedded in a document and marks each image's position in the text:

from hwp_hwpx_parser import Reader, ExtractOptions, ImageMarkerStyle

with Reader("document.hwp") as r:
    # Extract images
    images = r.get_images()
    for img in images:
        with open(img.filename, "wb") as f:
            f.write(img.data)

    # Extract text with image position markers
    options = ExtractOptions(image_marker=ImageMarkerStyle.WITH_NAME)
    text = r.extract_text(options)
    # Output: "Body text [IMAGE: BIN0001.png] following text..."

ImageMarkerStyle options:

NONE: omit image markers
SIMPLE: shown as [IMAGE]
WITH_NAME: shown as [IMAGE: filename] (lets you trace which image file is referenced)

Even when the same image is used several times, each marker identifies exactly which file it refers to.
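
With `WITH_NAME` markers in the text, references can be tallied per file with a regular expression. This is a small sketch that assumes the `[IMAGE: filename]` marker shape shown in this section; `count_image_references` is a hypothetical helper name.

```python
import re
from collections import Counter

IMAGE_MARKER = re.compile(r"\[IMAGE: ([^\]]+)\]")


def count_image_references(text: str) -> Counter:
    """Count how many times each image file name appears in WITH_NAME markers."""
    return Counter(m.group(1).strip() for m in IMAGE_MARKER.finditer(text))


text = "Intro [IMAGE: BIN0001.png] middle [IMAGE: BIN0002.jpg] end [IMAGE: BIN0001.png]"
for name, count in count_image_references(text).items():
    print(name, count)
```

Plain `[IMAGE]` markers from the `SIMPLE` style carry no file name and are deliberately not matched.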

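If You Need Document Editing

This package is read-only. If you need to edit documents (modify text, manipulate tables, and so on), install hwp-hwpx-editor:

pip install hwp-hwpx-editor

hwp-hwpx-editor builds on this package and provides editing features by using a Java library.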

Requirements

Python: 3.8 or later
Dependencies: olefile>=0.46 (installed automatically)
Java: not required

License

Apache License 2.0

Release history

1.0.0 - Jan 29, 2026 (this version)
0.1.5 - Jan 14, 2026
0.1.4 - Jan 13, 2026
0.1.3 - Jan 13, 2026
0.1.2 - Jan 8, 2026
0.1.1 - Jan 8, 2026

Download files

Source Distribution

hwp_hwpx_parser-1.0.0.tar.gz (185.6 kB), uploaded Jan 29, 2026, Source

Built Distribution

hwp_hwpx_parser-1.0.0-py3-none-any.whl (28.4 kB), uploaded Jan 29, 2026, Python 3

File details

Details for the file hwp_hwpx_parser-1.0.0.tar.gz.

File metadata

Download URL: hwp_hwpx_parser-1.0.0.tar.gz
Upload date: Jan 29, 2026
Size: 185.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for hwp_hwpx_parser-1.0.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `27ba84c6871ebcd34bb2da257bdd2209bf15ad2b6d19b7aee1168f20c7d03f20` |
| MD5 | `21d037016e2d1ad6da6856d656751ebb` |
| BLAKE2b-256 | `e9f4a381b0e7c6d8cbc4c0013585746d143f61081a58c5a0483e37669155be89` |
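
A downloaded archive can be checked against the SHA256 digest listed for it using only the standard library. The sketch below is one way to do it; `sha256_of` and `matches` are hypothetical helper names, and the demonstration runs on a throwaway file rather than the real archive.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with the expected value, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Demonstration on a temporary file containing b"abc". For the real check,
# pass the downloaded archive's path and the SHA256 value listed for it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
print(matches(tmp.name, "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"))
os.remove(tmp.name)
```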

Provenance

The following attestation bundles were made for hwp_hwpx_parser-1.0.0.tar.gz:

Publisher: publish.yml on KimDaehyeon6873/hwp-hwpx-parser

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hwp_hwpx_parser-1.0.0.tar.gz
- Subject digest: 27ba84c6871ebcd34bb2da257bdd2209bf15ad2b6d19b7aee1168f20c7d03f20
- Sigstore transparency entry: 869343976
- Sigstore integration time: Jan 29, 2026
Source repository:
- Permalink: KimDaehyeon6873/hwp-hwpx-parser@339d1290ed46f90dc5d02a72622eb21df4b8925c
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/KimDaehyeon6873
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@339d1290ed46f90dc5d02a72622eb21df4b8925c
- Trigger Event: push

File details

Details for the file hwp_hwpx_parser-1.0.0-py3-none-any.whl.

File metadata

Download URL: hwp_hwpx_parser-1.0.0-py3-none-any.whl
Upload date: Jan 29, 2026
Size: 28.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for hwp_hwpx_parser-1.0.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `f853f6e7043c388b1eb7adcfa465f4d84fd7ee15ba0d49c383af9707924b9139` |
| MD5 | `66d2847d0bba6676a190de0508b26c68` |
| BLAKE2b-256 | `5592eb144ed0c7360ac0fc14a59193776b881b77a1a13b093f337bd386bb6086` |

Provenance

The following attestation bundles were made for hwp_hwpx_parser-1.0.0-py3-none-any.whl:

Publisher: publish.yml on KimDaehyeon6873/hwp-hwpx-parser

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: hwp_hwpx_parser-1.0.0-py3-none-any.whl
- Subject digest: f853f6e7043c388b1eb7adcfa465f4d84fd7ee15ba0d49c383af9707924b9139
- Sigstore transparency entry: 869343980
- Sigstore integration time: Jan 29, 2026
Source repository:
- Permalink: KimDaehyeon6873/hwp-hwpx-parser@339d1290ed46f90dc5d02a72622eb21df4b8925c
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/KimDaehyeon6873
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@339d1290ed46f90dc5d02a72622eb21df4b8925c
- Trigger Event: push
