Kreuzberg document loader for LangChain — extract text from 88+ file formats with true async and rich metadata
langchain-kreuzberg
Overview
langchain-kreuzberg is a LangChain document loader that wraps Kreuzberg's extraction API. It supports 88+ file formats out of the box, provides true async extraction powered by Rust's tokio runtime, and produces LangChain Document objects enriched with rich metadata including detected languages, quality scores, and extracted keywords.
Installation
pip install langchain-kreuzberg
Requires Python 3.10+.
Quick Start
from langchain_kreuzberg import KreuzbergLoader
loader = KreuzbergLoader(file_path="report.pdf")
docs = loader.load()
print(docs[0].page_content[:200])
print(docs[0].metadata["source"])
Features
- 88+ file formats -- PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, plain text, and many more
- True async -- native async extraction backed by Rust's tokio runtime; no thread-pool workarounds
- Rich metadata -- title, author, page count, detected languages, quality score, extracted keywords, and more
- OCR with 3 backends -- Tesseract, EasyOCR, and PaddleOCR with configurable language support
- Per-page splitting -- yield one Document per page for fine-grained RAG pipelines
- Bytes input -- load documents directly from raw bytes (e.g., API responses, S3 objects)
- Output format selection -- choose between plain text, Markdown, Djot, HTML, or structured output
Usage Examples
Load a PDF with defaults
from langchain_kreuzberg import KreuzbergLoader
loader = KreuzbergLoader(file_path="contract.pdf")
docs = loader.load()
Load multiple files
loader = KreuzbergLoader(
    file_path=["report.pdf", "notes.docx", "data.xlsx"],
)
docs = loader.load()
OCR a scanned document with Tesseract
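from kreuzberg import ExtractionConfig, OcrConfig
from langchain_kreuzberg import KreuzbergLoader
# Force OCR even if the PDF has a text layer, using Tesseract with English
config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(backend="tesseract", language="eng"),
)
loader = KreuzbergLoader(
    file_path="scanned.pdf",
    config=config,
)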
docs = loader.load()
Load all files from a directory
loader = KreuzbergLoader(
    file_path="./documents/",
    glob="**/*.pdf",
)
docs = loader.load()
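Before pointing the loader at a large directory, it can help to preview which files a pattern selects. The sketch below uses only the standard library and assumes the loader's glob option follows the same pattern semantics as pathlib.Path.glob, which this README does not state; preview_matches is a hypothetical helper, not part of langchain-kreuzberg.

```python
from __future__ import annotations

from pathlib import Path


def preview_matches(directory: str | Path, pattern: str = "**/*.pdf") -> list[Path]:
    """List regular files under `directory` that match `pattern`.

    Hypothetical helper: uses pathlib semantics, which are assumed
    (not confirmed) to match the loader's own glob handling.
    """
    # Sort so the preview is stable across runs and platforms.
    return sorted(p for p in Path(directory).glob(pattern) if p.is_file())
```

Calling preview_matches("./documents/") returns the matching paths without extracting anything, so an unexpectedly broad pattern is caught before any OCR work starts.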
Per-page splitting for RAG
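from kreuzberg import ExtractionConfig, PageConfig
from langchain_kreuzberg import KreuzbergLoader
# extract_pages=True makes the loader yield one Document per page
config = ExtractionConfig(pages=PageConfig(extract_pages=True))
loader = KreuzbergLoader(
    file_path="handbook.pdf",
    config=config,
)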
docs = loader.load()
# docs[0].metadata["page"] == 0 (zero-indexed)
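Because the page field is zero-indexed, a citation shown to end users usually needs an offset of one. The following is a minimal sketch that assumes only the source and page metadata keys described in this README; page_label is a hypothetical helper, not part of langchain-kreuzberg.

```python
def page_label(metadata: dict) -> str:
    """Build a human-readable citation from loader metadata.

    Hypothetical helper: relies only on the "source" and "page" keys.
    """
    source = metadata.get("source", "unknown")
    page = metadata.get("page")
    if page is None:
        # Not in per-page mode: cite the whole document.
        return source
    # "page" is zero-indexed, so add one for display.
    return f"{source}, p. {page + 1}"
```

In a RAG pipeline this could be applied to doc.metadata for each retrieved chunk when rendering sources next to an answer.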
Load from bytes (API response)
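import httpx
from langchain_kreuzberg import KreuzbergLoader
response = httpx.get("https://example.com/report.pdf")
response.raise_for_status()  # fail early instead of extracting an error page
loader = KreuzbergLoader(
    data=response.content,
    mime_type="application/pdf",
)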
docs = loader.load()
Advanced config
from kreuzberg import ExtractionConfig, OcrConfig, PageConfig
from langchain_kreuzberg import KreuzbergLoader
# Combine Markdown output, forced OCR (German), and per-page splitting
config = ExtractionConfig(
    output_format="markdown",
    ocr=OcrConfig(backend="easyocr", language="deu"),
    force_ocr=True,
    pages=PageConfig(extract_pages=True),
)
loader = KreuzbergLoader(
    file_path="report.pdf",
    config=config,
)
docs = loader.load()
Async loading
import asyncio
from langchain_kreuzberg import KreuzbergLoader
async def main():
    loader = KreuzbergLoader(file_path="report.pdf")
    docs = await loader.aload()
    print(f"Loaded {len(docs)} documents")
asyncio.run(main())
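When several independent loaders are involved, their aload() calls can be awaited together. The sketch below is one way to do that with asyncio.gather; load_concurrently and the AsyncLoader protocol are hypothetical names introduced for illustration, and the function works with any object exposing an async aload() method.

```python
import asyncio
from typing import Any, Protocol


class AsyncLoader(Protocol):
    """Anything with an async aload() method (hypothetical protocol)."""

    async def aload(self) -> list[Any]: ...


async def load_concurrently(loaders: list[AsyncLoader]) -> list[Any]:
    """Run aload() on every loader concurrently and flatten the results."""
    results = await asyncio.gather(*(loader.aload() for loader in loaders))
    # gather preserves input order, so documents stay grouped per loader.
    return [doc for docs in results for doc in docs]
```

A call such as await load_concurrently([KreuzbergLoader(file_path=p) for p in paths]) would then return one flat list. If any loader raises, gather propagates the first exception, so wrap the call in error handling where partial results matter.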
API Reference
KreuzbergLoader
from langchain_kreuzberg import KreuzbergLoader
Extends langchain_core.document_loaders.BaseLoader.
Constructor Parameters
All parameters are keyword-only.
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str \| Path \| list[str \| Path] \| None | None | File path, list of file paths, or directory path to load. |
| data | bytes \| None | None | Raw bytes to extract text from. Mutually exclusive with file_path. |
| mime_type | str \| None | None | MIME type hint. Required when using data, optional for file_path. |
| glob | str \| None | None | Glob pattern for directory loading. |
| config | ExtractionConfig \| None | None | Kreuzberg ExtractionConfig for controlling extraction behavior (output format, OCR settings, page splitting, etc.). See the Kreuzberg repository for all options. |
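The two argument rules in the table (data and file_path are mutually exclusive, and mime_type is required with data) can be checked before a loader is built, for example when the arguments come from user input. This is a sketch of such a pre-check; check_loader_args is a hypothetical helper, and how KreuzbergLoader itself reports invalid combinations is not specified here.

```python
from __future__ import annotations

from pathlib import Path


def check_loader_args(
    file_path: str | Path | list[str | Path] | None = None,
    data: bytes | None = None,
    mime_type: str | None = None,
) -> None:
    """Hypothetical pre-check mirroring the documented argument rules."""
    if file_path is not None and data is not None:
        raise ValueError("file_path and data are mutually exclusive")
    if data is not None and mime_type is None:
        raise ValueError("mime_type is required when using data")
```

Because every constructor parameter is keyword-only, the validated values can be passed straight through as KreuzbergLoader(file_path=..., data=..., mime_type=...).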
Methods
| Method | Return Type | Description |
|---|---|---|
| load() | list[Document] | Load all documents into memory. |
| lazy_load() | Iterator[Document] | Lazily yield documents one at a time (synchronous). |
| aload() | list[Document] | Load all documents asynchronously. |
| alazy_load() | AsyncIterator[Document] | Lazily yield documents one at a time (asynchronous). |
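The lazy methods are useful when a corpus is too large to hold in memory at once. One common pattern is to consume lazy_load() in fixed-size batches, for instance before writing to a vector store. The sketch below is generic over any iterable; in_batches is a hypothetical helper and the batch size is an arbitrary choice.

```python
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items without materializing the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls at most `size` items per round; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```

With a loader in hand, for batch in in_batches(loader.lazy_load(), 32) would process 32 documents at a time while extraction continues lazily.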
Metadata Fields
Each Document produced by KreuzbergLoader includes the following metadata fields (when available):
| Field | Type | Description |
|---|---|---|
| source | str | File path or bytes://<mime_type> for bytes input. |
| mime_type | str | Detected or provided MIME type. |
| page_count | int | Total number of pages in the document. |
| output_format | str | The output format used for extraction. |
| quality_score | float | Extraction quality score (0.0 -- 1.0). |
| detected_languages | list[str] | Languages detected in the document. |
| extracted_keywords | list[dict] | Keywords with text, score, and algorithm fields. |
| table_count | int | Number of tables found in the document. |
| tables | list[dict] | Table data with cells, markdown, and page_number fields. |
| processing_warnings | list[dict] | Warnings with source and message fields. |
| page | int | Zero-indexed page number (only present in per-page mode). |
| is_blank | bool | Whether the page is blank (only present in per-page mode). |
| title | str | Document title (from file metadata). |
| author | str | Document author (from file metadata). |
| subject | str | Document subject (from file metadata). |
| creator | str | Application that created the document. |
| producer | str | Application that produced the document. |
| creation_date | str | Document creation date. |
| modification_date | str | Document last modification date. |
Additional metadata fields from Kreuzberg's document-level metadata are flattened into the metadata dict when present.
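Since these fields are only present when available, downstream code should read them defensively. The sketch below filters out blank pages and low-quality extractions before indexing; keep_usable is a hypothetical helper, the 0.5 threshold is an arbitrary assumption, and the function only requires objects with a metadata dict.

```python
from typing import Any


def keep_usable(docs: list[Any], min_quality: float = 0.5) -> list[Any]:
    """Drop blank pages and documents scoring below `min_quality`.

    Fields are optional, so a missing "quality_score" or "is_blank"
    never causes a document to be dropped.
    """
    kept = []
    for doc in docs:
        meta = doc.metadata
        if meta.get("is_blank", False):
            continue  # blank page in per-page mode
        score = meta.get("quality_score")
        if score is not None and score < min_quality:
            continue  # extraction judged too noisy to index
        kept.append(doc)
    return kept
```

Treating a missing score as acceptable is a design choice: it avoids silently discarding formats for which no quality score is reported.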
Supported Formats
Kreuzberg supports 88+ file formats including PDF, DOCX, images (via OCR), spreadsheets, presentations, HTML, Markdown, and many more. For the full and up-to-date list of supported formats, see the Kreuzberg repository.
Contributing
This project uses uv for dependency management.
# Clone the repository
git clone https://github.com/kreuzberg-dev/langchain-kreuzberg.git
cd langchain-kreuzberg
# Install dependencies (including dev group)
uv sync
# Run linting
uv run ruff check .
uv run ruff format --check .
uv run mypy .
# Run unit tests
uv run pytest --cov
# Run integration tests (real file extraction, no mocks)
uv run pytest -m integration -v
# Install pre-commit hooks
prek install
License
This project is licensed under the MIT License.