LangChain document loaders for HWP/HWPX using hwp-hwpx-parser

LangChain HWP/HWPX Loader

langchain-hwp-hwpx-loader is a pure-Python LangChain loader for Korean .hwp and .hwpx documents, powered by hwp-hwpx-parser.

It targets offline and on-premises RAG indexing. Body text, tables, footnotes and endnotes, memos, and hyperlinks are preserved, and loader options control which of these are included and how encrypted or invalid files are handled.

Installation

pip install langchain-hwp-hwpx-loader

Quickstart

Single document mode

from pathlib import Path

from hwp_hwpx_parser import ExtractOptions, ImageMarkerStyle, TableStyle
from langchain_hwp_hwpx import HwpHwpxLoader

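# Parser-level formatting: render tables as Markdown, mark image positions.
options = ExtractOptions(
    table_style=TableStyle.MARKDOWN,
    image_marker=ImageMarkerStyle.SIMPLE,
)

# Loader-level policy: choose which document parts end up in the output.
loader = HwpHwpxLoader(
    file_path=Path("docs/sample.hwp"),
    mode="single",
    extract_options=options,
    include_tables=True,
    include_notes=True,
    include_memos=True,
    include_hyperlinks=True,
)

docs = loader.load()
# Preview the first 400 characters of the text and the attached metadata.
print(docs[0].page_content[:400])
print(docs[0].metadata)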

Elements mode

from langchain_hwp_hwpx import HwpHwpxLoader

loader = HwpHwpxLoader("docs/sample.hwpx", mode="elements")
# lazy_load() yields documents one at a time instead of building a full list.
for doc in loader.lazy_load():
    print(doc.metadata["element_type"], doc.metadata["element_index"])

Directory loading

from langchain_hwp_hwpx import HwpHwpxDirectoryLoader

directory_loader = HwpHwpxDirectoryLoader(
    dir_path="docs",
    glob="**/*",
    recursive=True,
    mode="single",
    on_error="warn",
)

docs = directory_loader.load()
print(len(docs))
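
To preview which files a glob such as "**/*" would pick up before indexing, a small standard-library helper can be useful. This is only a sketch: find_hwp_files and the HWP_SUFFIXES set are hypothetical names, and the suffix filtering is an assumption about sensible behaviour, not a description of what HwpHwpxDirectoryLoader does internally.

```python
from pathlib import Path

# Assumed suffix set for illustration; the real loader's rules may differ.
HWP_SUFFIXES = {".hwp", ".hwpx"}


def find_hwp_files(dir_path, pattern="**/*"):
    """Return sorted .hwp/.hwpx file paths under dir_path (hypothetical helper)."""
    root = Path(dir_path)
    return sorted(
        path
        for path in root.glob(pattern)  # "**/*" descends into subdirectories
        if path.is_file() and path.suffix.lower() in HWP_SUFFIXES
    )


if __name__ == "__main__":
    for path in find_hwp_files("docs"):
        print(path)
```

Matching on the lower-cased suffix means files named like REPORT.HWP are listed as well.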

Main Options

  • mode: "single" or "elements"
  • include_tables, include_notes, include_memos, include_hyperlinks
  • include_images, images_dir, image_document_mode
  • on_encrypted: "raise" | "skip" | "placeholder"
  • on_invalid: "raise" | "skip" | "placeholder"
  • on_error: "raise" | "skip" | "warn"

extract_options accepts hwp_hwpx_parser.ExtractOptions.
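
The three on_error values can be read as a per-file failure policy. The sketch below illustrates one plausible interpretation using only the standard library; load_all and load_one are hypothetical names, and the package's actual control flow is not shown in this README and may differ.

```python
import warnings

VALID_POLICIES = ("raise", "skip", "warn")


def load_all(paths, load_one, on_error="raise"):
    """Apply load_one to each path under an assumed raise/skip/warn policy.

    load_one is any callable that returns a list of documents for one path.
    """
    if on_error not in VALID_POLICIES:
        raise ValueError(f"on_error must be one of {VALID_POLICIES}")
    results = []
    for path in paths:
        try:
            results.extend(load_one(path))
        except Exception as exc:
            if on_error == "raise":
                raise  # stop at the first failing file
            if on_error == "warn":
                warnings.warn(f"Skipping {path}: {exc}", stacklevel=2)
            # "skip" and "warn" both continue with the next file
    return results
```

Under this reading, "warn" behaves like "skip" but leaves a trace, which helps when auditing a large batch after indexing.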

Metadata

All returned documents include common metadata:

  • source, file_name, file_type
  • loader, parser
  • extracted_at (UTC ISO timestamp by default)
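
Because extracted_at is a UTC ISO timestamp, it can be turned back into a datetime for freshness checks or incremental re-indexing. The following is a minimal sketch, assuming the value is a string in a form that datetime.fromisoformat accepts (optionally with a trailing "Z"); parse_extracted_at is a hypothetical helper, not part of the package.

```python
from datetime import datetime, timezone


def parse_extracted_at(value: str) -> datetime:
    """Parse an ISO timestamp string into a timezone-aware UTC datetime."""
    # Python 3.10's fromisoformat rejects a trailing "Z", so normalize it.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a naive value is already in UTC, as the README states.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```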

mode="elements" adds:

  • element_type, element_index
  • table fields: row_count, col_count
  • note fields: note_type, note_number
  • optional memo/hyperlink/image specific fields when available
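
In elements mode, the element_type and element_index fields make it straightforward to route different kinds of content to different pipelines. The sketch below groups documents by type; group_by_element_type is a hypothetical helper, the SimpleNamespace objects stand in for real LangChain documents, and the type values "table" and "paragraph" are assumed for illustration since the README does not list the actual values.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_element_type(docs):
    """Group objects with a .metadata dict by element_type, ordered by element_index."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("element_type", "unknown")].append(doc)
    for items in groups.values():
        items.sort(key=lambda d: d.metadata.get("element_index", 0))
    return dict(groups)


if __name__ == "__main__":
    # Stand-in documents; the element_type values here are assumptions.
    sample = [
        SimpleNamespace(page_content="| a | b |", metadata={"element_type": "table", "element_index": 2}),
        SimpleNamespace(page_content="Intro", metadata={"element_type": "paragraph", "element_index": 0}),
    ]
    for element_type, items in group_by_element_type(sample).items():
        print(element_type, len(items))
```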

Limitations

  • Encrypted documents can be detected but not decrypted.
  • OCR and visual layout reconstruction are out of scope.
  • Parsing quality depends on hwp-hwpx-parser capabilities.

Compatibility

  • Python: >=3.10,<4.0
  • langchain-core>=1.0.0,<2.0.0
  • hwp-hwpx-parser>=1.0.0,<2.0.0
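
To check an environment against these constraints before installing, the standard library can report the interpreter version and any installed distributions. This is a rough sketch: installed_version and environment_report are hypothetical helpers, and the code only reports version strings rather than evaluating the full version specifiers.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def environment_report():
    """Summarize the interpreter and dependency versions relevant to this package."""
    return {
        "python_ok": (3, 10) <= sys.version_info[:2] < (4, 0),
        "langchain-core": installed_version("langchain-core"),
        "hwp-hwpx-parser": installed_version("hwp-hwpx-parser"),
    }


if __name__ == "__main__":
    print(environment_report())
```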
