Kreuzberg document loader for LangChain — extract text from 75+ file formats with true async and rich metadata

langchain-kreuzberg

Overview

langchain-kreuzberg is a LangChain document loader that wraps Kreuzberg's extraction API. It supports 75+ file formats out of the box, provides true async extraction powered by Rust's tokio runtime, and produces LangChain Document objects whose metadata includes detected languages, quality scores, and extracted keywords.

Installation

pip install langchain-kreuzberg

Requires Python 3.10+.

Quick Start

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(file_path="report.pdf")
docs = loader.load()

print(docs[0].page_content[:200])
print(docs[0].metadata["source"])

Features

  • 75+ file formats -- PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, plain text, and many more
  • True async -- native async extraction backed by Rust's tokio runtime; no thread-pool workarounds
  • Rich metadata -- title, author, page count, detected languages, quality score, extracted keywords, and more
  • OCR with 3 backends -- Tesseract, EasyOCR, and PaddleOCR with configurable language support
  • Per-page splitting -- yield one Document per page for fine-grained RAG pipelines
  • Bytes input -- load documents directly from raw bytes (e.g., API responses, S3 objects)
  • Output format selection -- choose between plain text, Markdown, Djot, HTML, or structured output

Usage Examples

Load a PDF with defaults

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(file_path="contract.pdf")
docs = loader.load()

Load multiple files

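from langchain_kreuzberg import KreuzbergLoader

# Pass a list to load several files with one loader.
loader = KreuzbergLoader(
    file_path=["report.pdf", "notes.docx", "data.xlsx"],
)
docs = loader.load()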

OCR a scanned document with Tesseract

from kreuzberg import ExtractionConfig, OcrConfig
from langchain_kreuzberg import KreuzbergLoader

# force_ocr=True runs OCR even if the PDF already has a text layer.
config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(backend="tesseract", language="eng"),
)

loader = KreuzbergLoader(
    file_path="scanned.pdf",
    config=config,
)
docs = loader.load()

Load all files from a directory

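from langchain_kreuzberg import KreuzbergLoader

# Point file_path at a directory and select files with a glob pattern.
loader = KreuzbergLoader(
    file_path="./documents/",
    glob="**/*.pdf",
)
docs = loader.load()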
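
Before handing a directory to the loader, it can help to preview which files a pattern would select. The sketch below uses only the standard library's pathlib. It assumes the loader's glob parameter follows pathlib-style pattern semantics, which this README does not confirm, and the helper name preview_glob is hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def preview_glob(directory: str | Path, pattern: str) -> list[Path]:
    """Return the files under `directory` that match `pattern`, sorted.

    Assumption: the loader interprets `glob` the way pathlib does.
    """
    root = Path(directory)
    # Keep only regular files; a recursive pattern can also match directories.
    return sorted(p for p in root.glob(pattern) if p.is_file())
```

For example, preview_glob("./documents/", "**/*.pdf") lists every PDF under the directory, including those in subdirectories.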

Per-page splitting for RAG

from kreuzberg import ExtractionConfig, PageConfig
from langchain_kreuzberg import KreuzbergLoader

# extract_pages=True yields one Document per page instead of one per file.
config = ExtractionConfig(pages=PageConfig(extract_pages=True))

loader = KreuzbergLoader(
    file_path="handbook.pdf",
    config=config,
)
docs = loader.load()
# docs[0].metadata["page"] == 0  (zero-indexed)
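
Because the page field is zero-indexed, a RAG pipeline that cites pages usually needs to shift it by one for display. The following is a minimal sketch of that step, together with dropping blank pages via the is_blank field listed under Metadata Fields. PageDoc is a hypothetical stand-in for a LangChain Document so the example runs without extra dependencies; page_label and non_blank_pages are illustrative names, not part of the library.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Stand-in for a LangChain Document (hypothetical, for illustration)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def page_label(doc: PageDoc) -> str:
    """Build a human-readable citation such as 'handbook.pdf, p. 1'."""
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page")
    if page is None:
        # Not loaded in per-page mode, so there is no page number to cite.
        return source
    # "page" is zero-indexed, so add 1 for display.
    return f"{source}, p. {page + 1}"


def non_blank_pages(docs: list[PageDoc]) -> list[PageDoc]:
    """Drop pages flagged as blank via the 'is_blank' metadata field."""
    return [d for d in docs if not d.metadata.get("is_blank", False)]
```

The same functions should work on real Document objects, since they only read the metadata dict.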

Load from bytes (API response)

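import httpx
from langchain_kreuzberg import KreuzbergLoader

response = httpx.get("https://example.com/report.pdf")
response.raise_for_status()  # fail early on HTTP errors

# mime_type is required when loading from raw bytes.
loader = KreuzbergLoader(
    data=response.content,
    mime_type="application/pdf",
)
docs = loader.load()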

Advanced config

from kreuzberg import ExtractionConfig, OcrConfig, PageConfig
from langchain_kreuzberg import KreuzbergLoader

# Combine Markdown output, forced OCR, and per-page splitting.
config = ExtractionConfig(
    output_format="markdown",
    ocr=OcrConfig(backend="easyocr", language="deu"),
    force_ocr=True,
    pages=PageConfig(extract_pages=True),
)

loader = KreuzbergLoader(
    file_path="report.pdf",
    config=config,
)
docs = loader.load()

Async loading

import asyncio
from langchain_kreuzberg import KreuzbergLoader

async def main():
    loader = KreuzbergLoader(file_path="report.pdf")
    docs = await loader.aload()
    print(f"Loaded {len(docs)} documents")

asyncio.run(main())
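
When several files need loading, the async API can be combined with asyncio.gather to run extractions concurrently. This is a sketch under assumptions: load_all and its max_concurrency parameter are hypothetical names, and the function accepts any object exposing an aload() coroutine rather than depending on KreuzbergLoader directly.

```python
from __future__ import annotations

import asyncio
from typing import Any, Protocol


class AsyncLoader(Protocol):
    """Anything with an aload() coroutine, e.g. a KreuzbergLoader."""

    async def aload(self) -> list[Any]: ...


async def load_all(loaders: list[AsyncLoader], max_concurrency: int = 4) -> list[Any]:
    """Run aload() on every loader, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def load_one(loader: AsyncLoader) -> list[Any]:
        async with semaphore:
            return await loader.aload()

    # gather() returns results in the same order as the input loaders.
    batches = await asyncio.gather(*(load_one(loader) for loader in loaders))
    return [doc for batch in batches for doc in batch]
```

A call might look like asyncio.run(load_all([KreuzbergLoader(file_path=p) for p in paths])), where paths is your own list of files. The semaphore bounds how many extractions are in flight at once.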

API Reference

KreuzbergLoader

from langchain_kreuzberg import KreuzbergLoader

Extends langchain_core.document_loaders.BaseLoader.

Constructor Parameters

All parameters are keyword-only.

Parameter   Type                                    Default   Description
file_path   str | Path | list[str | Path] | None    None      File path, list of file paths, or directory path to load.
data        bytes | None                            None      Raw bytes to extract text from. Mutually exclusive with file_path.
mime_type   str | None                              None      MIME type hint. Required when using data, optional for file_path.
glob        str | None                              None      Glob pattern for directory loading.
config      ExtractionConfig | None                 None      Kreuzberg ExtractionConfig for controlling extraction behavior (output format, OCR settings, page splitting, etc.). See the Kreuzberg repository for all options.
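
The two constraints stated in the table (data and file_path are mutually exclusive, and mime_type is required with data) can be expressed as a small pre-flight check. This is an illustrative sketch only: check_loader_args is a hypothetical helper, and it is not the validation KreuzbergLoader itself performs, which may differ in rules and error types.

```python
from __future__ import annotations


def check_loader_args(*, file_path=None, data=None, mime_type=None) -> None:
    """Raise ValueError if the arguments break the documented constraints.

    Hypothetical helper; mirrors the parameter table, not the library code.
    """
    if file_path is not None and data is not None:
        raise ValueError("file_path and data are mutually exclusive")
    if data is not None and mime_type is None:
        raise ValueError("mime_type is required when using data")
```

Like the constructor, the helper takes keyword-only arguments.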

Methods

Method         Return Type               Description
load()         list[Document]            Load all documents into memory.
lazy_load()    Iterator[Document]        Lazily yield documents one at a time (synchronous).
aload()        list[Document]            Load all documents asynchronously.
alazy_load()   AsyncIterator[Document]   Lazily yield documents one at a time (asynchronous).
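
Since lazy_load() yields documents one at a time, a caller can process them in fixed-size batches without holding the full list in memory. The generator below is a minimal sketch of that pattern using only the standard library. The name batched and the batch size are illustrative choices, and it works on any iterable, not just loader output.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of up to `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls at most `size` items per step; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```

With a loader, this could be used as `for batch in batched(loader.lazy_load(), 32): ...`, for example to embed or index 32 documents at a time.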

Metadata Fields

Each Document produced by KreuzbergLoader includes the following metadata fields (when available):

Field                 Type         Description
source                str          File path or bytes://<mime_type> for bytes input.
mime_type             str          Detected or provided MIME type.
page_count            int          Total number of pages in the document.
output_format         str          The output format used for extraction.
quality_score         float        Extraction quality score (0.0 to 1.0).
detected_languages    list[str]    Languages detected in the document.
extracted_keywords    list[dict]   Keywords with text, score, and algorithm fields.
table_count           int          Number of tables found in the document.
tables                list[dict]   Table data with cells, markdown, and page_number fields.
processing_warnings   list[dict]   Warnings with source and message fields.
page                  int          Zero-indexed page number (only present in per-page mode).
is_blank              bool         Whether the page is blank (only present in per-page mode).
title                 str          Document title (from file metadata).
author                str          Document author (from file metadata).
subject               str          Document subject (from file metadata).
creator               str          Application that created the document.
producer              str          Application that produced the document.
creation_date         str          Document creation date.
modification_date     str          Document last modification date.

Additional metadata fields from Kreuzberg's document-level metadata are flattened into the metadata dict when present.
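
Because these fields are only present when available, code that reads them should tolerate missing keys. As one possible illustration, the sketch below filters documents by quality_score while handling documents that have no score. The function name, the 0.5 default threshold, and the keep_unscored flag are all assumptions made for the example, not library behavior.

```python
from __future__ import annotations

from typing import Any


def filter_by_quality(
    docs: list[Any],
    min_score: float = 0.5,
    keep_unscored: bool = True,
) -> list[Any]:
    """Keep documents whose 'quality_score' metadata is at least `min_score`.

    Documents without a score are kept or dropped according to `keep_unscored`.
    """
    kept = []
    for doc in docs:
        # Use .get() because the field is only present when available.
        score = doc.metadata.get("quality_score")
        if score is None:
            if keep_unscored:
                kept.append(doc)
        elif score >= min_score:
            kept.append(doc)
    return kept
```

The function only relies on each document having a metadata dict, so it should apply to the Document objects returned by load().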

Supported Formats

Kreuzberg supports 75+ file formats including PDF, DOCX, images (via OCR), spreadsheets, presentations, HTML, Markdown, and many more. For the full and up-to-date list of supported formats, see the Kreuzberg repository.

Contributing

This project uses uv for dependency management.

# Clone the repository
git clone https://github.com/kreuzberg-dev/langchain-kreuzberg.git
cd langchain-kreuzberg

# Install dependencies (including dev group)
uv sync

# Run linting
uv run ruff check .
uv run ruff format --check .
uv run mypy .

# Run unit tests
uv run pytest --cov

# Run integration tests (real file extraction, no mocks)
uv run pytest -m integration -v

# Install pre-commit hooks
prek install

License

This project is licensed under the MIT License.
