Kreuzberg document loader for LangChain — extract text from 75+ file formats with true async and rich metadata

Maintainers

nhirschfeld

langchain-kreuzberg

Overview

langchain-kreuzberg is a LangChain document loader that wraps Kreuzberg's extraction API. It supports 75+ file formats out of the box, provides true async extraction powered by Rust's tokio runtime, and produces LangChain Document objects with rich metadata, including detected languages, quality scores, and extracted keywords.

Installation

pip install langchain-kreuzberg

Requires Python 3.10+.

Quick Start

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(file_path="report.pdf")
docs = loader.load()

print(docs[0].page_content[:200])
print(docs[0].metadata["source"])

Features

75+ file formats -- PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, plain text, and many more
True async -- native async extraction backed by Rust's tokio runtime; no thread-pool workarounds
Rich metadata -- title, author, page count, detected languages, quality score, extracted keywords, and more
OCR with 3 backends -- Tesseract, EasyOCR, and PaddleOCR with configurable language support
Per-page splitting -- yield one Document per page for fine-grained RAG pipelines
Bytes input -- load documents directly from raw bytes (e.g., API responses, S3 objects)
Output format selection -- choose between plain text, Markdown, Djot, HTML, or structured output

Usage Examples

Load a PDF with defaults

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(file_path="contract.pdf")
docs = loader.load()

Load multiple files

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(
    file_path=["report.pdf", "notes.docx", "data.xlsx"],
)
docs = loader.load()

OCR a scanned document with Tesseract

from kreuzberg import ExtractionConfig, OcrConfig
from langchain_kreuzberg import KreuzbergLoader

config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(backend="tesseract", language="eng"),
)

loader = KreuzbergLoader(
    file_path="scanned.pdf",
    config=config,
)
docs = loader.load()

Load all files from a directory

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(
    file_path="./documents/",
    glob="**/*.pdf",
)
docs = loader.load()

Per-page splitting for RAG

from kreuzberg import ExtractionConfig, PageConfig
from langchain_kreuzberg import KreuzbergLoader

config = ExtractionConfig(pages=PageConfig(extract_pages=True))

loader = KreuzbergLoader(
    file_path="handbook.pdf",
    config=config,
)
docs = loader.load()
# docs[0].metadata["page"] == 0  (zero-indexed)
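
Because `page` is zero-indexed, a RAG pipeline that cites sources usually wants to shift it by one before showing it to a reader. The sketch below builds a citation label from a document's metadata dict. `page_label` is a hypothetical helper, and it relies only on the `source` and `page` fields described in this README.

```python
def page_label(metadata: dict) -> str:
    """Build a human-readable citation such as 'handbook.pdf, p. 1'."""
    source = metadata.get("source", "unknown")
    page = metadata.get("page")
    if page is None:
        # No `page` field means the document was not loaded in per-page mode.
        return str(source)
    # `page` is zero-indexed, so add 1 for display.
    return f"{source}, p. {page + 1}"
```

With the loader above, `page_label(docs[0].metadata)` would label the first page of the handbook as page 1.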

Load from bytes (API response)

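import httpx
from langchain_kreuzberg import KreuzbergLoader

response = httpx.get("https://example.com/report.pdf")
response.raise_for_status()  # stop on HTTP errors rather than extracting an error page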

loader = KreuzbergLoader(
    data=response.content,
    mime_type="application/pdf",
)
docs = loader.load()

Advanced config

from kreuzberg import ExtractionConfig, OcrConfig, PageConfig
from langchain_kreuzberg import KreuzbergLoader

config = ExtractionConfig(
    output_format="markdown",
    ocr=OcrConfig(backend="easyocr", language="deu"),
    force_ocr=True,
    pages=PageConfig(extract_pages=True),
)

loader = KreuzbergLoader(
    file_path="report.pdf",
    config=config,
)
docs = loader.load()

Async loading

import asyncio
from langchain_kreuzberg import KreuzbergLoader

async def main():
    loader = KreuzbergLoader(file_path="report.pdf")
    docs = await loader.aload()
    print(f"Loaded {len(docs)} documents")

asyncio.run(main())
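
When several files need to be loaded at once, the `aload()` calls can run concurrently. The following is a sketch rather than a recommended pattern from the project: `load_concurrently` and its `max_concurrency` default are hypothetical, and the function works with any objects that expose an async `aload()` returning a list.

```python
import asyncio
from typing import Any, Iterable


async def load_concurrently(loaders: Iterable[Any], max_concurrency: int = 4) -> list[Any]:
    """Run `aload()` on several loaders concurrently and flatten the results.

    Hypothetical helper: each loader only needs an async `aload()` method
    that returns a list of documents.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(loader: Any) -> list[Any]:
        # Cap the number of extractions in flight at any one time.
        async with semaphore:
            return await loader.aload()

    # gather() returns results in the same order as its inputs.
    per_loader = await asyncio.gather(*(run_one(loader) for loader in loaders))
    return [doc for docs in per_loader for doc in docs]
```

For example, `asyncio.run(load_concurrently([KreuzbergLoader(file_path=p) for p in paths]))` would load every path in a list named `paths` with at most four extractions running at a time.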

API Reference

`KreuzbergLoader`

from langchain_kreuzberg import KreuzbergLoader

Extends langchain_core.document_loaders.BaseLoader.

Constructor Parameters

All parameters are keyword-only.

Parameter	Type	Default	Description
`file_path`	`str | Path | list[str | Path] | None`	`None`	File path, list of file paths, or directory path to load.
`data`	`bytes | None`	`None`	Raw bytes to extract text from. Mutually exclusive with `file_path`.
`mime_type`	`str | None`	`None`	MIME type hint. Required when using `data`, optional for `file_path`.
`glob`	`str | None`	`None`	Glob pattern for directory loading.
`config`	`ExtractionConfig | None`	`None`	Kreuzberg `ExtractionConfig` for controlling extraction behavior (output format, OCR settings, page splitting, etc.). See the Kreuzberg repository for all options.
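
The two constraints in the table (`data` excludes `file_path`, and `data` requires `mime_type`) can be checked before a loader is built, for example when the arguments come from user input. This is an illustrative sketch only: `check_loader_args` is a hypothetical helper, and the loader's own validation, error types, and messages may differ.

```python
def check_loader_args(file_path=None, data=None, mime_type=None) -> None:
    """Raise ValueError if the arguments break the rules documented in the table."""
    if file_path is not None and data is not None:
        raise ValueError("`file_path` and `data` are mutually exclusive")
    if data is not None and mime_type is None:
        raise ValueError("`mime_type` is required when `data` is given")
```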

Methods

Method	Return Type	Description
`load()`	`list[Document]`	Load all documents into memory.
`lazy_load()`	`Iterator[Document]`	Lazily yield documents one at a time (synchronous).
`aload()`	`list[Document]`	Load all documents asynchronously.
`alazy_load()`	`AsyncIterator[Document]`	Lazily yield documents one at a time (asynchronous).
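
The lazy methods are useful when documents should be handed to a downstream step in fixed-size groups instead of being held in memory all at once. A minimal sketch, using only the standard library: `in_batches` is a hypothetical helper that accepts any iterable, including the iterator returned by `lazy_load()`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls at most `size` items per round; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```

A loop such as `for batch in in_batches(loader.lazy_load(), 32): ...` would then process up to 32 documents at a time.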

Metadata Fields

Each Document produced by KreuzbergLoader includes the following metadata fields (when available):

Field	Type	Description
`source`	`str`	File path or `bytes://<mime_type>` for bytes input.
`mime_type`	`str`	Detected or provided MIME type.
`page_count`	`int`	Total number of pages in the document.
`output_format`	`str`	The output format used for extraction.
`quality_score`	`float`	Extraction quality score (0.0 -- 1.0).
`detected_languages`	`list[str]`	Languages detected in the document.
`extracted_keywords`	`list[dict]`	Keywords with `text`, `score`, and `algorithm` fields.
`table_count`	`int`	Number of tables found in the document.
`tables`	`list[dict]`	Table data with `cells`, `markdown`, and `page_number` fields.
`processing_warnings`	`list[dict]`	Warnings with `source` and `message` fields.
`page`	`int`	Zero-indexed page number (only present in per-page mode).
`is_blank`	`bool`	Whether the page is blank (only present in per-page mode).
`title`	`str`	Document title (from file metadata).
`author`	`str`	Document author (from file metadata).
`subject`	`str`	Document subject (from file metadata).
`creator`	`str`	Application that created the document.
`producer`	`str`	Application that produced the document.
`creation_date`	`str`	Document creation date.
`modification_date`	`str`	Document last modification date.

Additional metadata fields from Kreuzberg's document-level metadata are flattened into the metadata dict when present.

Supported Formats

Kreuzberg supports 75+ file formats including PDF, DOCX, images (via OCR), spreadsheets, presentations, HTML, Markdown, and many more. For the full and up-to-date list of supported formats, see the Kreuzberg repository.

Contributing

This project uses uv for dependency management.

# Clone the repository
git clone https://github.com/kreuzberg-dev/langchain-kreuzberg.git
cd langchain-kreuzberg

# Install dependencies (including dev group)
uv sync

# Run linting
uv run ruff check .
uv run ruff format --check .
uv run mypy .

# Run unit tests
uv run pytest --cov

# Run integration tests (real file extraction, no mocks)
uv run pytest -m integration -v

# Install pre-commit hooks
prek install

License

This project is licensed under the MIT License.

Release history

1.0.2 -- Mar 13, 2026
1.0.1 -- Mar 4, 2026 (this version)
1.0.0 -- Feb 24, 2026

Download files

Source Distribution

langchain_kreuzberg-1.0.1.tar.gz (8.7 kB)

Uploaded Mar 4, 2026 Source

Built Distribution

langchain_kreuzberg-1.0.1-py3-none-any.whl (8.8 kB)

Uploaded Mar 4, 2026 Python 3

File details

Details for the file langchain_kreuzberg-1.0.1.tar.gz.

File metadata

Download URL: langchain_kreuzberg-1.0.1.tar.gz
Upload date: Mar 4, 2026
Size: 8.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for langchain_kreuzberg-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`96dd5544fdec0c946646a051f3fde13c74c49702ccc65230f291876a317fc6b9`
MD5	`fa397325fcc76f264fd6067fdbb87aee`
BLAKE2b-256	`6d66a6796ad2f25eb08a153a3b79c52f36d98eaccd80c18c5e73bdcf96d77b71`

Provenance

The following attestation bundles were made for langchain_kreuzberg-1.0.1.tar.gz:

Publisher: publish.yaml on kreuzberg-dev/langchain-kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: langchain_kreuzberg-1.0.1.tar.gz
- Subject digest: 96dd5544fdec0c946646a051f3fde13c74c49702ccc65230f291876a317fc6b9
- Sigstore transparency entry: 1032589852
- Sigstore integration time: Mar 4, 2026
Source repository:
- Permalink: kreuzberg-dev/langchain-kreuzberg@89a37d3561ffc459cd8f5c5517a167df9b114c27
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/kreuzberg-dev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@89a37d3561ffc459cd8f5c5517a167df9b114c27
- Trigger Event: push

File details

Details for the file langchain_kreuzberg-1.0.1-py3-none-any.whl.

File metadata

Download URL: langchain_kreuzberg-1.0.1-py3-none-any.whl
Upload date: Mar 4, 2026
Size: 8.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for langchain_kreuzberg-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1878eeba656b99a4948df4f20ea551ce878edac4d125e52f253d26bd942f288f`
MD5	`1575f941b1adcdcd5816f27598633d57`
BLAKE2b-256	`6121929c7020743726c7f6be8a3cceaa70cb0f62736210967387c6d43a7325f2`

Provenance

The following attestation bundles were made for langchain_kreuzberg-1.0.1-py3-none-any.whl:

Publisher: publish.yaml on kreuzberg-dev/langchain-kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: langchain_kreuzberg-1.0.1-py3-none-any.whl
- Subject digest: 1878eeba656b99a4948df4f20ea551ce878edac4d125e52f253d26bd942f288f
- Sigstore transparency entry: 1032589891
- Sigstore integration time: Mar 4, 2026
Source repository:
- Permalink: kreuzberg-dev/langchain-kreuzberg@89a37d3561ffc459cd8f5c5517a167df9b114c27
- Branch / Tag: refs/tags/v1.0.1
- Owner: https://github.com/kreuzberg-dev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@89a37d3561ffc459cd8f5c5517a167df9b114c27
- Trigger Event: push
