LangChain document loaders for HWP/HWPX using hwp-hwpx-parser

LangChain HWP/HWPX Loader

langchain-hwp-hwpx-loader is a pure-Python LangChain loader built on hwp-hwpx-parser.
It reads Korean documents (.hwp, .hwpx) in offline or on-prem environments and makes them easy to feed into a RAG pipeline.

Usage guide

1) Installation

pip install langchain-hwp-hwpx-loader

2) Loading a single file (mode="single")

Returns the entire document as a single Document.

from pathlib import Path

from hwp_hwpx_parser import ExtractOptions, ImageMarkerStyle, TableStyle
from langchain_hwp_hwpx import HwpHwpxLoader

options = ExtractOptions(
    table_style=TableStyle.MARKDOWN,
    image_marker=ImageMarkerStyle.SIMPLE,
)

loader = HwpHwpxLoader(
    file_path=Path("docs/sample.hwp"),
    mode="single",
    extract_options=options,
    include_tables=True,
    include_notes=True,
    include_memos=True,
    include_hyperlinks=True,
)

docs = loader.load()
print("docs:", len(docs))
print("metadata:", docs[0].metadata)
print("content preview:", docs[0].page_content[:400])

3) Element-level loading (mode="elements")

Returns body text, tables, footnotes, endnotes, memos, links, and images as separate Documents.

from langchain_hwp_hwpx import HwpHwpxLoader

loader = HwpHwpxLoader("docs/sample.hwpx", mode="elements")

for doc in loader.lazy_load():
    print(
        doc.metadata["element_index"],
        doc.metadata["element_type"],
        doc.metadata.get("note_number"),
        doc.metadata.get("url"),
    )
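
Once the elements are loaded, a common next step is to bucket them by type, for example to index tables separately from body text. The following is a minimal sketch, not part of the package: StubDocument is a hypothetical stand-in for the LangChain Document class so the snippet runs without the loader installed, and the element_type values in the sample data ("text", "table", "footnote") are assumed for illustration rather than taken from the package.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_element_type(docs):
    """Bucket documents by metadata["element_type"], visiting them in element_index order."""
    groups = defaultdict(list)
    for doc in sorted(docs, key=lambda d: d.metadata.get("element_index", 0)):
        groups[doc.metadata.get("element_type", "unknown")].append(doc)
    return dict(groups)


# Illustrative data only; the element_type strings are assumptions.
sample = [
    StubDocument("See the appendix.", {"element_index": 2, "element_type": "footnote", "note_number": 1}),
    StubDocument("Body text.", {"element_index": 0, "element_type": "text"}),
    StubDocument("| a | b |", {"element_index": 1, "element_type": "table"}),
    StubDocument("Second paragraph.", {"element_index": 3, "element_type": "text"}),
]

groups = group_by_element_type(sample)
for element_type, items in groups.items():
    print(element_type, len(items))
```

With the real loader, the same function should work on list(loader.lazy_load()), since it only reads the metadata keys shown above.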

4) Loading a directory

Recursively walks the directory and loads .hwp and .hwpx files in order.

from langchain_hwp_hwpx import HwpHwpxDirectoryLoader

loader = HwpHwpxDirectoryLoader(
    dir_path="docs",
    glob="**/*",
    recursive=True,
    mode="single",
    on_error="warn",
)

docs = loader.load()
print("loaded:", len(docs))
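
Before loading a large folder, it can help to preview which files a recursive scan would pick up. The sketch below uses only pathlib and does not call the package. find_hwp_files is a hypothetical helper, and the case-insensitive suffix match and sorted order are assumptions about reasonable behavior, not a description of how HwpHwpxDirectoryLoader selects files internally.

```python
from pathlib import Path

# Extensions named in this README.
HWP_SUFFIXES = {".hwp", ".hwpx"}


def find_hwp_files(dir_path, recursive=True):
    """Return sorted paths of .hwp/.hwpx files under dir_path (hypothetical helper)."""
    root = Path(dir_path)
    if not root.is_dir():
        return []
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates if p.is_file() and p.suffix.lower() in HWP_SUFFIXES
    )


for path in find_hwp_files("docs"):
    print(path)
```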

5) Main options

  • mode: "single" or "elements"
  • include_tables, include_notes, include_memos, include_hyperlinks
  • include_images, images_dir, image_document_mode
  • on_encrypted: "raise" | "skip" | "placeholder"
  • on_invalid: "raise" | "skip" | "placeholder"
  • on_error: "raise" | "skip" | "warn"
  • extract_options: accepts a hwp_hwpx_parser.ExtractOptions instance
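
Because several of these options take a fixed set of string values, a typo such as on_encrypted="ignore" is easy to make when options come from a config file. The sketch below checks values against the choices listed above before they are passed to a loader. validate_loader_options and ALLOWED_VALUES are hypothetical names introduced here, and whether the package performs its own validation is not something this README states.

```python
# Allowed string values, copied from the option list above.
ALLOWED_VALUES = {
    "mode": {"single", "elements"},
    "on_encrypted": {"raise", "skip", "placeholder"},
    "on_invalid": {"raise", "skip", "placeholder"},
    "on_error": {"raise", "skip", "warn"},
}


def validate_loader_options(**options):
    """Raise ValueError for an unknown choice; return the options unchanged otherwise."""
    for name, value in options.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return options


# Options without a fixed value set (such as include_tables) pass through unchecked.
kwargs = validate_loader_options(mode="elements", on_encrypted="skip", include_tables=True)
print(kwargs)
```

The returned dictionary can then be unpacked into the loader call, for example HwpHwpxLoader("docs/sample.hwp", **kwargs), assuming the loader accepts these names as keyword arguments as the examples above suggest.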

6) Returned metadata

Common metadata:

  • source, file_name, file_type
  • loader, parser
  • extracted_at (default: UTC ISO timestamp)

Additional metadata for mode="elements":

  • element_type, element_index
  • Tables: row_count, col_count
  • Footnotes/endnotes: note_type, note_number
  • Links: url, text
  • Images: filename, image_format, saved_path (when images are saved to disk)
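
The type-specific keys above can be used to build a short label for each element, which is handy when logging or debugging a load. This is a minimal sketch under stated assumptions: describe_element is a hypothetical helper, it works on a plain metadata dictionary, and it dispatches on which keys carry a value rather than on element_type, because the exact element_type strings are not listed in this README.

```python
def describe_element(metadata):
    """Build a one-line label from the element-specific metadata keys (hypothetical helper)."""
    if metadata.get("row_count") is not None and metadata.get("col_count") is not None:
        return f"table {metadata['row_count']}x{metadata['col_count']}"
    if metadata.get("note_number") is not None:
        return f"{metadata.get('note_type', 'note')} #{metadata['note_number']}"
    if metadata.get("url") is not None:
        return f"link: {metadata.get('text', '')} <{metadata['url']}>"
    if metadata.get("image_format") is not None:
        return f"image: {metadata.get('filename')} ({metadata['image_format']})"
    # Fall back to whatever element_type the loader reported.
    return str(metadata.get("element_type", "unknown"))


# Illustrative metadata; values are made up for the example.
print(describe_element({"row_count": 3, "col_count": 2}))
print(describe_element({"url": "https://example.com", "text": "Example"}))
```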

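7) Frequently asked questions

  • Decrypting encrypted documents is not supported; they are detected and then handled according to the on_encrypted policy.
  • OCR and layout rendering are out of scope.
  • On Python 3.14 you may see langchain-core warnings, so Python 3.11 or 3.12 is recommended for production use.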

English (Brief)

Pure-Python LangChain loader for Korean .hwp / .hwpx documents.

  • Install: pip install langchain-hwp-hwpx-loader
  • Main classes: HwpHwpxLoader, HwpHwpxDirectoryLoader
  • Modes: single, elements
  • Python: >=3.10,<4.0
