LangChain document loaders for HWP/HWPX using hwp-hwpx-parser
LangChain HWP/HWPX Loader
langchain-hwp-hwpx-loader is a pure-Python LangChain loader built on hwp-hwpx-parser.
It reads Korean documents (.hwp, .hwpx) in offline or on-premises environments and makes them easy to feed into a RAG pipeline.
Usage guide
1) Installation
pip install langchain-hwp-hwpx-loader
2) Loading a single file (mode="single")
Returns the entire document as a single Document.
from pathlib import Path

from hwp_hwpx_parser import ExtractOptions, ImageMarkerStyle, TableStyle
from langchain_hwp_hwpx import HwpHwpxLoader

# Parser-level options: Markdown table style and simple image markers.
options = ExtractOptions(
    table_style=TableStyle.MARKDOWN,
    image_marker=ImageMarkerStyle.SIMPLE,
)

# mode="single" yields one Document for the whole file.
loader = HwpHwpxLoader(
    file_path=Path("docs/sample.hwp"),
    mode="single",
    extract_options=options,
    include_tables=True,
    include_notes=True,
    include_memos=True,
    include_hyperlinks=True,
)

docs = loader.load()
print("docs:", len(docs))
print("metadata:", docs[0].metadata)
print("content preview:", docs[0].page_content[:400])
3) Loading by element (mode="elements")
Returns body text, tables, footnotes, endnotes, memos, links, and images as separate Documents.
from langchain_hwp_hwpx import HwpHwpxLoader

loader = HwpHwpxLoader("docs/sample.hwpx", mode="elements")

# lazy_load() yields one Document per element.
for doc in loader.lazy_load():
    print(
        doc.metadata["element_index"],
        doc.metadata["element_type"],
        doc.metadata.get("note_number"),
        doc.metadata.get("url"),
    )
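Once elements are loaded, a common next step is to bucket them by kind. The following is a minimal sketch that groups document-like objects by their `element_type` metadata. It uses `SimpleNamespace` stand-ins in place of real loader output so that it runs without an HWP file; the helper name `group_by_element_type` and the type values `"text"` and `"table"` are illustrative assumptions, not names confirmed by the package.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_element_type(docs):
    """Bucket document-like objects by their ``element_type`` metadata."""
    groups = defaultdict(list)
    for doc in docs:
        # Fall back to "unknown" if a document carries no element_type.
        groups[doc.metadata.get("element_type", "unknown")].append(doc)
    return dict(groups)


# Stand-in objects; real input would come from loader.lazy_load().
# The element_type values below are assumed for illustration only.
sample_docs = [
    SimpleNamespace(
        page_content="Intro",
        metadata={"element_type": "text", "element_index": 0},
    ),
    SimpleNamespace(
        page_content="| a | b |",
        metadata={"element_type": "table", "element_index": 1},
    ),
    SimpleNamespace(
        page_content="Body",
        metadata={"element_type": "text", "element_index": 2},
    ),
]

grouped = group_by_element_type(sample_docs)
for element_type, items in grouped.items():
    print(element_type, [d.metadata["element_index"] for d in items])
```

Because the helper only reads `.metadata`, it should work unchanged on real `Document` objects, provided they carry the `element_type` key listed under the returned metadata.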
4) Loading a directory
Recursively walks the whole directory and loads .hwp and .hwpx files in order.
from langchain_hwp_hwpx import HwpHwpxDirectoryLoader

loader = HwpHwpxDirectoryLoader(
    dir_path="docs",
    glob="**/*",
    recursive=True,
    mode="single",
    on_error="warn",
)

docs = loader.load()
print("loaded:", len(docs))
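To make the recursive discovery step concrete, here is a rough standard-library sketch of the kind of file search described above. It is not the loader's actual implementation: the helper name `find_hwp_files`, the case-insensitive suffix match, and the sorted ordering are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

HWP_SUFFIXES = {".hwp", ".hwpx"}


def find_hwp_files(root):
    """Return .hwp/.hwpx files under ``root``, recursively, in sorted order."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in HWP_SUFFIXES
    )


# Demonstrate on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.hwp", "notes.txt", "sub/b.hwpx"):
        (root / name).write_bytes(b"")
    found = [p.relative_to(root).as_posix() for p in find_hwp_files(root)]

print(found)
```

A pre-scan like this can be useful for checking which files a directory load would be expected to pick up before running the real loader.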
5) Main options
- mode: "single" or "elements"
- include_tables, include_notes, include_memos, include_hyperlinks
- include_images, images_dir, image_document_mode
- on_encrypted: "raise" | "skip" | "placeholder"
- on_invalid: "raise" | "skip" | "placeholder"
- on_error: "raise" | "skip" | "warn"
- extract_options: accepts a hwp_hwpx_parser.ExtractOptions instance
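The three `on_error` values suggest a familiar error-policy pattern. The sketch below shows one plausible reading of those semantics, using a hypothetical `load_many` helper and a fake per-file loader. The package's real behavior is not documented here, so treat this as an assumption about intent rather than a description of the implementation.

```python
import warnings


def load_many(paths, load_one, on_error="raise"):
    """Apply ``load_one`` to each path, handling failures per ``on_error``."""
    if on_error not in ("raise", "skip", "warn"):
        raise ValueError(f"unsupported on_error: {on_error!r}")
    results = []
    for path in paths:
        try:
            results.extend(load_one(path))
        except Exception as exc:
            if on_error == "raise":
                raise
            if on_error == "warn":
                warnings.warn(f"failed to load {path}: {exc}")
            # Both "skip" and "warn" continue with the next file.
    return results


def fake_load(path):
    # Hypothetical stand-in for a real loader: fails for one file name.
    if path == "broken.hwp":
        raise ValueError("cannot parse")
    return [f"doc from {path}"]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    docs = load_many(
        ["a.hwp", "broken.hwp", "b.hwpx"], fake_load, on_error="warn"
    )

print(docs, len(caught))
```

Under this reading, "warn" and "skip" both keep a batch load going past a bad file, and differ only in whether the failure is reported.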
6) Returned metadata
Common metadata:
- source, file_name, file_type
- loader, parser
- extracted_at (default: UTC ISO timestamp)
Additional metadata for mode="elements":
- element_type, element_index
- Tables: row_count, col_count
- Footnotes/endnotes: note_type, note_number
- Links: url, text
- Images: filename, image_format, saved_path (when in save mode)
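The link metadata above can be used to pull hyperlinks out of an element stream. The following is a minimal sketch with a hypothetical `collect_links` helper. It relies only on the `url` and `text` keys listed above and uses `SimpleNamespace` stand-ins instead of real loader output, so the sample values are invented for illustration.

```python
from types import SimpleNamespace


def collect_links(docs):
    """Return (text, url) pairs for documents whose metadata has a ``url``."""
    return [
        (doc.metadata.get("text", ""), doc.metadata["url"])
        for doc in docs
        if doc.metadata.get("url")
    ]


# Stand-in objects; real input would come from a loader in "elements" mode.
sample_docs = [
    SimpleNamespace(page_content="Body text", metadata={"element_index": 0}),
    SimpleNamespace(
        page_content="Example",
        metadata={
            "element_index": 1,
            "url": "https://example.com",
            "text": "Example",
        },
    ),
]

links = collect_links(sample_docs)
print(links)
```

Filtering on the presence of `url` rather than on a specific `element_type` value avoids depending on type names that this page does not spell out.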
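7) Frequently asked questions
- Decrypting encrypted documents is not supported (they are detected and then handled according to the configured policy).
- OCR and layout rendering are out of scope.
- On Python 3.14, langchain-core may emit warnings, so Python 3.11 or 3.12 is recommended for production use.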
English (Brief)
Pure-Python LangChain loader for Korean .hwp / .hwpx documents.
- Install: pip install langchain-hwp-hwpx-loader
- Main classes: HwpHwpxLoader, HwpHwpxDirectoryLoader
- Modes: single, elements
- Python: >=3.10,<4.0
Download files
File details
Details for the file langchain_hwp_hwpx_loader-0.1.1.tar.gz.
File metadata
- Download URL: langchain_hwp_hwpx_loader-0.1.1.tar.gz
- Upload date:
- Size: 11.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 45d3a0928175a9af83f910d575c4df4ee95630e6a87cac586a87670f2489d3d6 |
| MD5 | b392741f402fb274e692c50cd319460b |
| BLAKE2b-256 | 23cfa7b439510b192f3fd8bbfe9f2b472a1a19829c15aa9fd0a29c9350532c2d |
File details
Details for the file langchain_hwp_hwpx_loader-0.1.1-py3-none-any.whl.
File metadata
- Download URL: langchain_hwp_hwpx_loader-0.1.1-py3-none-any.whl
- Upload date:
- Size: 11.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad1f11bf999824bd7386a22ef4bea1d3e8ec8750aad77e37557c1dbedbcb2c33 |
| MD5 | 8d61901a69f1229dce0aa5dcc7dc8450 |
| BLAKE2b-256 | cdc26acaba27d0560bbf6df13e5a6f0f65d107487753223b96c03b2d0557007b |