LangChain document loader for structured Microsoft Word (.docx) files using python-docx.
structured-docx-loader
A LangChain BaseLoader for Microsoft Word (.docx) files that preserves document structure instead of flattening it into one undifferentiated blob of text.
langchain-community's existing Word loaders either dump raw text (Docx2txtLoader) or depend on the heavyweight unstructured library (UnstructuredWordDocumentLoader). DocxLoader uses python-docx directly to walk the document in its native order and:
- Renders heading styles (Heading 1-Heading 9) as Markdown headings, preserving hierarchy.
- Converts tables to Markdown (default), HTML, or a key-value row format suitable for retrieval.
- Supports three loading granularities: a single document, one document per heading section, or one document per paragraph/table element.
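To illustrate the heading rendering described above, here is a minimal sketch of mapping a Word style name to a Markdown heading. The function name `heading_to_markdown` is hypothetical and is not part of this package's API; the sketch simply emits one `#` per heading level. CommonMark only recognizes up to six `#` characters, and how the loader itself treats Heading 7-Heading 9 is not stated here.

```python
import re

# Matches Word's built-in heading style names, "Heading 1" through "Heading 9".
_HEADING_STYLE = re.compile(r"^Heading ([1-9])$")


def heading_to_markdown(style_name: str, text: str) -> str:
    """Return text as a Markdown heading if style_name is a Word heading style.

    Hypothetical helper for illustration only; not the package's API.
    """
    match = _HEADING_STYLE.match(style_name)
    if match is None:
        return text  # not a heading style: leave the paragraph text unchanged
    level = int(match.group(1))
    return "#" * level + " " + text


print(heading_to_markdown("Heading 2", "Install"))
print(heading_to_markdown("Normal", "Plain body text"))
```

In a real loader, the style name would come from python-docx (for example `paragraph.style.name`), which is why the sketch works on plain strings.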
Install
pip install structured-docx-loader
Usage
from structured_docx_loader import DocxLoader
# Load the entire document as a single Document
loader = DocxLoader("example.docx")
docs = loader.load()
# Split by heading sections, with HTML tables
loader = DocxLoader("example.docx", mode="sections", table_format="html")
docs = loader.load()
# One Document per paragraph/table row, tables as key-value pairs
loader = DocxLoader(
    "example.docx",
    mode="elements",
    table_format="key_value",
    table_extraction_strategy="row",
)
docs = loader.load()
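
To make the three modes more concrete, the following sketch groups already-rendered elements into page-content strings. It is an assumption about how such grouping could work, not the loader's actual implementation; `group_elements` and the `(kind, text)` tuples are hypothetical names introduced for illustration.

```python
from typing import List, Tuple

# Each element is (kind, text), where kind is "heading", "paragraph", or "table".
# This representation is an assumption made for the sketch.
Element = Tuple[str, str]


def group_elements(elements: List[Element], mode: str = "single") -> List[str]:
    """Group rendered elements into page-content strings for a given mode."""
    texts = [text for _, text in elements]
    if mode == "single":
        # Everything joined into one string.
        return ["\n\n".join(texts)] if texts else []
    if mode == "elements":
        # One string per paragraph or table.
        return texts
    if mode == "sections":
        sections: List[List[str]] = []
        for kind, text in elements:
            # A heading opens a new section; content before the first
            # heading forms its own leading section.
            if kind == "heading" or not sections:
                sections.append([])
            sections[-1].append(text)
        return ["\n\n".join(section) for section in sections]
    raise ValueError(f"unknown mode: {mode!r}")


sample = [
    ("paragraph", "Preface text"),
    ("heading", "# Overview"),
    ("paragraph", "Body of the overview."),
    ("heading", "## Details"),
    ("table", "| a | b |"),
]
for mode in ("single", "sections", "elements"):
    print(mode, len(group_elements(sample, mode)))
```

Whether the loader keeps text that precedes the first heading as its own section is a design choice this sketch makes on its own; the package may behave differently.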
The file_path argument (the path passed to DocxLoader above) also accepts an HTTP(S) URL; in that case the file is downloaded to a temporary location before parsing.
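
The URL handling can be pictured roughly as follows. This is a sketch using only the standard library, under the assumption that a URL is detected by its scheme and fetched into a temporary .docx file; `is_http_url` and `resolve_to_local_path` are hypothetical helpers, and the package may use a different HTTP client or cleanup strategy.

```python
import os
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_http_url(path: str) -> bool:
    """Return True if path looks like an http:// or https:// URL."""
    parsed = urlparse(path)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def resolve_to_local_path(path: str) -> str:
    """Return a local file path, downloading first if path is an HTTP(S) URL.

    Hypothetical helper for illustration; the caller is responsible for
    deleting the temporary file once parsing is done.
    """
    if not is_http_url(path):
        return path
    fd, tmp_path = tempfile.mkstemp(suffix=".docx")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(path) as response:
            tmp.write(response.read())
    except Exception:
        os.remove(tmp_path)  # do not leave an empty temp file behind on failure
        raise
    return tmp_path


print(is_http_url("https://example.com/report.docx"))
print(resolve_to_local_path("example.docx"))
```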
Options
| Argument | Values | Description |
|---|---|---|
| mode | "single" (default), "sections", "elements" | Granularity of the returned Document objects. |
| table_format | "markdown" (default), "html", "key_value" | How tables are rendered into text. |
| table_extraction_strategy | "table" (default), "row" | Whether a table becomes one block or one block per row. |
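
As a rough picture of what the three table_format values could produce, the sketch below renders the same rows three ways. The exact output of the loader is not documented here, so the key-value layout in particular ("Header: value" pairs joined with "; ") is an assumption, and the function names are hypothetical. The first row is treated as the header, and pipe characters inside cells are not escaped.

```python
from html import escape
from typing import List

Rows = List[List[str]]  # assumption: the first row is the header row


def to_markdown(rows: Rows) -> str:
    """Render rows as a Markdown pipe table."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def to_html(rows: Rows) -> str:
    """Render rows as a compact HTML table with escaped cell text."""
    header, *body = rows
    head = "<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>"
    rest = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in body
    )
    return f"<table>{head}{rest}</table>"


def to_key_value(rows: Rows) -> List[str]:
    """Render each data row as 'Header: value' pairs, one string per row."""
    header, *body = rows
    return ["; ".join(f"{k}: {v}" for k, v in zip(header, row)) for row in body]


rows = [["Name", "Qty"], ["Apple", "3"], ["Pear", "5"]]
print(to_markdown(rows))
print(to_html(rows))
print(to_key_value(rows))
```

Returning one string per row from `to_key_value` mirrors the idea behind table_extraction_strategy="row": each row can stand alone as a retrievable block because it carries its own column names, while joining the strings would give a single block for the whole table.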
Development
pip install -e ".[test,lint,typing]"
pytest
ruff check .
mypy structured_docx_loader
License
MIT