
LangChain document loader for structured Microsoft Word (.docx) files using python-docx.

Project description

structured-docx-loader

A LangChain BaseLoader for Microsoft Word (.docx) files that preserves document structure instead of flattening it into one undifferentiated blob of text.

langchain-community's existing Word loaders either dump raw text (Docx2txtLoader) or depend on the heavyweight unstructured library (UnstructuredWordDocumentLoader). DocxLoader uses python-docx directly to walk the document in its native order and:

  • Renders heading styles (Heading 1 through Heading 9) as Markdown headings, preserving hierarchy.
  • Converts tables to Markdown (default), HTML, or a key-value row format suitable for retrieval.
  • Supports three loading granularities: a single document, one document per heading section, or one document per paragraph/table element.
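To make the heading and granularity bullets concrete, here is a minimal sketch of how heading styles could map to Markdown and how a per-section split could group elements. It is not the package's actual implementation: it assumes the document has already been reduced to (style_name, text) pairs, and heading_level, to_markdown, and split_sections are hypothetical names introduced only for illustration.

```python
import re

# Matches Word's built-in heading style names, "Heading 1" .. "Heading 9".
_HEADING_RE = re.compile(r"^Heading ([1-9])$")


def heading_level(style_name):
    """Return the heading level (1-9) for a style name, or None for body text."""
    match = _HEADING_RE.match(style_name)
    return int(match.group(1)) if match else None


def to_markdown(style_name, text):
    """Render one paragraph; headings get a '#' prefix matching their level."""
    level = heading_level(style_name)
    return f"{'#' * level} {text}" if level else text


def split_sections(elements):
    """Group (style_name, text) pairs so that each heading starts a new section."""
    sections = []
    current = []
    for style_name, text in elements:
        # A heading closes the section collected so far (hypothetical rule).
        if heading_level(style_name) is not None and current:
            sections.append("\n\n".join(current))
            current = []
        current.append(to_markdown(style_name, text))
    if current:
        sections.append("\n\n".join(current))
    return sections


elements = [
    ("Heading 1", "Overview"),
    ("Normal", "Intro text."),
    ("Heading 2", "Details"),
    ("Normal", "More text."),
]
sections = split_sections(elements)
```

In this sketch every heading, regardless of level, opens a new section. Whether the real "sections" mode nests subsections under their parent heading is not stated here, so treat that choice as an assumption.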

Install

pip install structured-docx-loader

Usage

from structured_docx_loader import DocxLoader

# Load the entire document as a single Document
loader = DocxLoader("example.docx")
docs = loader.load()

# Split by heading sections, with HTML tables
loader = DocxLoader("example.docx", mode="sections", table_format="html")
docs = loader.load()

# One Document per paragraph/table row, tables as key-value pairs
loader = DocxLoader(
    "example.docx",
    mode="elements",
    table_format="key_value",
    table_extraction_strategy="row",
)
docs = loader.load()

file_path also accepts an HTTP(S) URL, in which case the file is downloaded to a temporary location before parsing.
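The download-then-parse behaviour described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the loader's own code: is_remote and resolve_to_local are hypothetical helper names, and the loader may detect URLs or manage its temporary file differently.

```python
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_remote(file_path):
    """Return True when file_path looks like an HTTP(S) URL."""
    return urlparse(file_path).scheme in ("http", "https")


def resolve_to_local(file_path):
    """Return a local path, downloading to a temporary file if needed."""
    if not is_remote(file_path):
        return file_path
    # delete=False keeps the file on disk after the handle is closed,
    # so the caller is responsible for removing it once parsing is done.
    with urllib.request.urlopen(file_path) as response:
        with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name
```

Local paths pass through unchanged, so the same call site can serve both cases.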

Options

| Argument | Values | Description |
| --- | --- | --- |
| mode | "single" (default), "sections", "elements" | Granularity of the returned Document objects. |
| table_format | "markdown" (default), "html", "key_value" | How tables are rendered into text. |
| table_extraction_strategy | "table" (default), "row" | Whether a table becomes one block or one block per row. |
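
To illustrate what the table_format and table_extraction_strategy options might produce, the sketch below renders the same rows as a Markdown table (one block) and as key-value text (one string per row). The exact output format of the package is not documented here, so the separators, the treatment of the first row as a header, and the names table_to_markdown and table_to_key_value are all assumptions.

```python
def table_to_markdown(rows):
    """Render rows (first row assumed to be the header) as a Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def table_to_key_value(rows):
    """Render each body row as 'Header: value' pairs, one string per row."""
    header, *body = rows
    return ["; ".join(f"{k}: {v}" for k, v in zip(header, row)) for row in body]


rows = [["Name", "Role"], ["Ada", "Engineer"], ["Lin", "Editor"]]
markdown = table_to_markdown(rows)       # one block for the whole table
key_value_rows = table_to_key_value(rows)  # one string per row
```

Key-value rows repeat the column names in every chunk, which is why a per-row format can suit retrieval: each row remains interpretable without the header row.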

Development

pip install -e ".[test,lint,typing]"
pytest
ruff check .
mypy structured_docx_loader

License

MIT

Source Distribution

structured_docx_loader-0.1.0.tar.gz (8.0 kB view details)

Uploaded Source

Built Distribution

structured_docx_loader-0.1.0-py3-none-any.whl (7.2 kB view details)

Uploaded Python 3

File details

Details for the file structured_docx_loader-0.1.0.tar.gz.

File metadata

  • Download URL: structured_docx_loader-0.1.0.tar.gz
  • Upload date:
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for structured_docx_loader-0.1.0.tar.gz
Algorithm Hash digest
SHA256 3a06b789158d18726b9c1ebad982e84598c2d4829d320dc72f8f5533ded1bb2b
MD5 f6302d10c2ba84caabd2a4f402073fce
BLAKE2b-256 4ab5257beee4f34dec31132a03a1427d0451821d13b7e8709b0d0e3d73ee4eff

Provenance

The following attestation bundles were made for structured_docx_loader-0.1.0.tar.gz:

Publisher: publish.yml on Harshitn24/structured-docx-loader

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file structured_docx_loader-0.1.0-py3-none-any.whl.

File hashes

Hashes for structured_docx_loader-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d9338ecd8c91d6f7ebb1395c4b9e13af349ccc7b7eea485d5dd70972e6ce62ba
MD5 531dbc870fd1e1ed8b0d6d5860e6102b
BLAKE2b-256 54c53242cfd74fc7c9e0365d8d25228efd48b8f6d0422edeff0850bb57a5a047

Provenance

The following attestation bundles were made for structured_docx_loader-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Harshitn24/structured-docx-loader
