Kreuzberg document loader for LangChain — extract text from 88+ file formats with true async and rich metadata

langchain-kreuzberg

Overview

langchain-kreuzberg is a LangChain document loader that wraps Kreuzberg's extraction API. It supports 88+ file formats out of the box, provides true async extraction powered by Rust's tokio runtime, and produces LangChain Document objects whose metadata includes detected languages, quality scores, and extracted keywords.

Installation

pip install langchain-kreuzberg

Requires Python 3.10+.

Quick Start

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(file_path="report.pdf")
docs = loader.load()

print(docs[0].page_content[:200])
print(docs[0].metadata["source"])

Features

  • 88+ file formats -- PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, plain text, and many more
  • True async -- native async extraction backed by Rust's tokio runtime; no thread-pool workarounds
  • Rich metadata -- title, author, page count, detected languages, quality score, extracted keywords, and more
  • OCR with 3 backends -- Tesseract, EasyOCR, and PaddleOCR with configurable language support
  • Per-page splitting -- yield one Document per page for fine-grained RAG pipelines
  • Bytes input -- load documents directly from raw bytes (e.g., API responses, S3 objects)
  • Output format selection -- choose between plain text, Markdown, Djot, HTML, or structured output

Usage Examples

Load a PDF with defaults

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(file_path="contract.pdf")
docs = loader.load()

Load multiple files

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(
    file_path=["report.pdf", "notes.docx", "data.xlsx"],
)
docs = loader.load()

OCR a scanned document with Tesseract

from kreuzberg import ExtractionConfig, OcrConfig
from langchain_kreuzberg import KreuzbergLoader

config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(backend="tesseract", language="eng"),
)

loader = KreuzbergLoader(
    file_path="scanned.pdf",
    config=config,
)
docs = loader.load()

Load all files from a directory

from langchain_kreuzberg import KreuzbergLoader

loader = KreuzbergLoader(
    file_path="./documents/",
    glob="**/*.pdf",
)
docs = loader.load()
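
To preview which files a pattern such as "**/*.pdf" would select, the sketch below applies the standard library's pathlib glob rules to a throwaway directory. It assumes the loader resolves patterns the same way pathlib does, which the text above does not confirm. preview_glob is a hypothetical helper, not part of langchain-kreuzberg.

```python
import tempfile
from pathlib import Path


def preview_glob(directory: Path, pattern: str) -> list:
    """Return sorted relative paths of files under `directory` that match `pattern`.

    Hypothetical helper using pathlib semantics; the loader is only assumed to agree.
    """
    root = Path(directory)
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


# Build a small temporary tree so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.pdf", "notes.docx", "sub/b.pdf"):
        (root / name).write_bytes(b"")
    matched = preview_glob(root, "**/*.pdf")

print(matched)
```

Under pathlib rules the recursive "**" also matches zero directories, so top-level PDFs are included alongside nested ones.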

Per-page splitting for RAG

from kreuzberg import ExtractionConfig, PageConfig
from langchain_kreuzberg import KreuzbergLoader

config = ExtractionConfig(pages=PageConfig(extract_pages=True))

loader = KreuzbergLoader(
    file_path="handbook.pdf",
    config=config,
)
docs = loader.load()
# docs[0].metadata["page"] == 0  (zero-indexed)
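
Because the page field is zero-indexed, a citation shown to a reader usually needs the page number shifted by one. The following is a minimal sketch that works on plain metadata dicts. page_label is a hypothetical helper, and the "source" and "page" keys are the ones listed under Metadata Fields.

```python
def page_label(metadata: dict) -> str:
    """Build a human-readable citation from a Document's metadata dict.

    Hypothetical helper: converts the zero-indexed "page" to a one-indexed label
    and falls back to the source alone when no page is present.
    """
    source = metadata.get("source", "unknown")
    page = metadata.get("page")
    if page is None:
        return source
    return f"{source}, p. {page + 1}"


print(page_label({"source": "handbook.pdf", "page": 0}))  # handbook.pdf, p. 1
```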

Load from bytes (API response)

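import httpx
from langchain_kreuzberg import KreuzbergLoader

response = httpx.get("https://example.com/report.pdf")
response.raise_for_status()  # fail early instead of parsing an HTTP error page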

loader = KreuzbergLoader(
    data=response.content,
    mime_type="application/pdf",
)
docs = loader.load()
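
Since mime_type is required with bytes input, it helps to derive the hint from a filename or object key when one is available. The sketch below uses the standard library's mimetypes module. mime_hint_for is a hypothetical helper, and raising on unknown extensions is a design choice made here, not the loader's behavior.

```python
import mimetypes


def mime_hint_for(name: str) -> str:
    """Guess a MIME type from a filename or object key (hypothetical helper).

    Raises ValueError when the extension is unknown, so the caller has to
    supply the MIME type explicitly rather than pass a wrong guess along.
    """
    guessed, _encoding = mimetypes.guess_type(name)
    if guessed is None:
        raise ValueError(f"cannot infer MIME type from {name!r}")
    return guessed


print(mime_hint_for("reports/2024/report.pdf"))  # application/pdf
```

The returned string could then be passed as mime_type alongside data.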

Advanced config

from kreuzberg import ExtractionConfig, OcrConfig, PageConfig
from langchain_kreuzberg import KreuzbergLoader

config = ExtractionConfig(
    output_format="markdown",
    ocr=OcrConfig(backend="easyocr", language="deu"),
    force_ocr=True,
    pages=PageConfig(extract_pages=True),
)

loader = KreuzbergLoader(
    file_path="report.pdf",
    config=config,
)
docs = loader.load()

Async loading

import asyncio
from langchain_kreuzberg import KreuzbergLoader

async def main():
    loader = KreuzbergLoader(file_path="report.pdf")
    docs = await loader.aload()
    print(f"Loaded {len(docs)} documents")

asyncio.run(main())
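
Because aload() is awaitable, several independent loaders can run concurrently. The sketch below caps concurrency with a semaphore and flattens the results. load_all is a hypothetical helper that accepts any object with an aload() coroutine. FakeLoader is a stand-in so the example runs without the package installed; in real use you would pass KreuzbergLoader instances.

```python
import asyncio


async def load_all(loaders, max_concurrency: int = 4) -> list:
    """Await aload() on each loader, at most `max_concurrency` at a time.

    Results are flattened into one list, in the order the loaders were given.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(loader):
        async with semaphore:
            return await loader.aload()

    per_loader = await asyncio.gather(*(run_one(loader) for loader in loaders))
    return [doc for docs in per_loader for doc in docs]


class FakeLoader:
    """Stand-in with the same aload() shape; not part of langchain-kreuzberg."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def aload(self) -> list:
        await asyncio.sleep(0)  # yield control, as real I/O would
        return [f"doc from {self.name}"]


docs = asyncio.run(load_all([FakeLoader("a.pdf"), FakeLoader("b.docx")]))
print(docs)  # ['doc from a.pdf', 'doc from b.docx']
```

The concurrency cap is an arbitrary default; OCR-heavy workloads may want a lower value.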

API Reference

KreuzbergLoader

from langchain_kreuzberg import KreuzbergLoader

Extends langchain_core.document_loaders.BaseLoader.

Constructor Parameters

All parameters are keyword-only.

  • file_path (str | Path | list[str | Path] | None, default None) -- File path, list of file paths, or directory path to load.
  • data (bytes | None, default None) -- Raw bytes to extract text from. Mutually exclusive with file_path.
  • mime_type (str | None, default None) -- MIME type hint. Required when using data, optional for file_path.
  • glob (str | None, default None) -- Glob pattern for directory loading.
  • config (ExtractionConfig | None, default None) -- Kreuzberg ExtractionConfig for controlling extraction behavior (output format, OCR settings, page splitting, etc.). See the Kreuzberg repository for all options.
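
The two constraints stated above (file_path and data are mutually exclusive, and mime_type is required with data) can be checked before a loader is constructed. The following is a minimal sketch. check_loader_args is a hypothetical helper and does not reproduce the loader's own validation or error messages. The case where neither input is given is left unchecked because the text above does not say how the loader handles it.

```python
def check_loader_args(file_path=None, data=None, mime_type=None) -> None:
    """Raise ValueError if the arguments break the documented constraints.

    Hypothetical pre-check; the real loader may validate differently.
    """
    if file_path is not None and data is not None:
        raise ValueError("file_path and data are mutually exclusive")
    if data is not None and mime_type is None:
        raise ValueError("mime_type is required when data is given")


# Valid combinations pass silently.
check_loader_args(file_path="report.pdf")
check_loader_args(data=b"%PDF-", mime_type="application/pdf")
```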

Methods

  • load() -> list[Document] -- Load all documents into memory.
  • lazy_load() -> Iterator[Document] -- Lazily yield documents one at a time (synchronous).
  • aload() -> list[Document] -- Load all documents asynchronously.
  • alazy_load() -> AsyncIterator[Document] -- Lazily yield documents one at a time (asynchronous).
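
Since lazy_load() yields documents one at a time, downstream steps such as embedding can consume them in fixed-size batches without holding everything in memory. The sketch below shows a generic batching helper. in_batches is hypothetical and works on any iterable, so it could wrap loader.lazy_load(); here it is shown on a plain range so it runs without the package.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[list]:
    """Yield lists of up to `size` items, pulling from `items` only as needed."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice consumes at most `size` items per step; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


batches = list(in_batches(range(5), 2))
print(batches)  # [[0, 1], [2, 3], [4]]
```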

Metadata Fields

Each Document produced by KreuzbergLoader includes the following metadata fields (when available):

  • source (str) -- File path or bytes://<mime_type> for bytes input.
  • mime_type (str) -- Detected or provided MIME type.
  • page_count (int) -- Total number of pages in the document.
  • output_format (str) -- The output format used for extraction.
  • quality_score (float) -- Extraction quality score (0.0 to 1.0).
  • detected_languages (list[str]) -- Languages detected in the document.
  • extracted_keywords (list[dict]) -- Keywords with text, score, and algorithm fields.
  • table_count (int) -- Number of tables found in the document.
  • tables (list[dict]) -- Table data with cells, markdown, and page_number fields.
  • processing_warnings (list[dict]) -- Warnings with source and message fields.
  • page (int) -- Zero-indexed page number (only present in per-page mode).
  • is_blank (bool) -- Whether the page is blank (only present in per-page mode).
  • title (str) -- Document title (from file metadata).
  • author (str) -- Document author (from file metadata).
  • subject (str) -- Document subject (from file metadata).
  • creator (str) -- Application that created the document.
  • producer (str) -- Application that produced the document.
  • creation_date (str) -- Document creation date.
  • modification_date (str) -- Document last modification date.

Additional metadata fields from Kreuzberg's document-level metadata are flattened into the metadata dict when present.
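
Because these fields appear only when available, code that reads them should tolerate missing keys. The sketch below drops blank pages and low-quality extractions before indexing. keep_usable is a hypothetical helper, the 0.5 threshold is an arbitrary assumption, and SimpleNamespace objects stand in for LangChain Document instances so the example runs on its own.

```python
from types import SimpleNamespace


def keep_usable(docs, min_quality: float = 0.5) -> list:
    """Filter out blank pages and documents scoring below `min_quality`.

    Documents without a quality_score are kept, since absence is not evidence
    of poor extraction.
    """
    kept = []
    for doc in docs:
        meta = doc.metadata
        if meta.get("is_blank", False):
            continue
        score = meta.get("quality_score")
        if score is not None and score < min_quality:
            continue
        kept.append(doc)
    return kept


# Stand-in documents; real ones would come from KreuzbergLoader.
sample = [
    SimpleNamespace(page_content="intro", metadata={"page": 0, "quality_score": 0.9}),
    SimpleNamespace(page_content="", metadata={"page": 1, "is_blank": True}),
    SimpleNamespace(page_content="n0isy", metadata={"page": 2, "quality_score": 0.2}),
    SimpleNamespace(page_content="no score", metadata={"page": 3}),
]
print([d.metadata["page"] for d in keep_usable(sample)])  # [0, 3]
```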

Supported Formats

Kreuzberg supports 88+ file formats including PDF, DOCX, images (via OCR), spreadsheets, presentations, HTML, Markdown, and many more. For the full and up-to-date list of supported formats, see the Kreuzberg repository.

Contributing

This project uses uv for dependency management.

# Clone the repository
git clone https://github.com/kreuzberg-dev/langchain-kreuzberg.git
cd langchain-kreuzberg

# Install dependencies (including dev group)
uv sync

# Run linting
uv run ruff check .
uv run ruff format --check .
uv run mypy .

# Run unit tests
uv run pytest --cov

# Run integration tests (real file extraction, no mocks)
uv run pytest -m integration -v

# Install pre-commit hooks
prek install

License

This project is licensed under the MIT License.
