Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats

Kreuzberg

A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.

📖 Complete Documentation

Framework Overview

Document Intelligence Capabilities

  • Text Extraction: High-fidelity text extraction preserving document structure and formatting
  • Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
  • Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
  • OCR Integration: Tesseract OCR with markdown output (default) and table extraction from scanned documents
  • Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)

Technical Architecture

  • Performance: Highest throughput among the benchmarked Python document processing frameworks (30+ files/second on tiny files, synchronous API)
  • Resource Efficiency: 71 MB installation, ~360 MB runtime memory footprint
  • Extensibility: Plugin architecture for custom extractors via the Extractor base class
  • API Design: Synchronous and asynchronous APIs with consistent interfaces
  • Type Safety: Complete type annotations throughout the codebase

Open Source Foundation

Kreuzberg leverages established open source technologies:

  • Pandoc: Universal document converter for robust format support
  • PDFium: Google's PDF rendering engine for accurate PDF processing
  • Tesseract: Open source OCR engine, long sponsored by Google, for text recognition
  • Python-docx/pptx: Native Microsoft Office format support

Quick Start

Extract Text with CLI

# Extract text from any supported file and save it as plain text
uvx kreuzberg extract document.pdf > output.txt

# Select the OCR backend and output format explicitly
uvx kreuzberg extract invoice.pdf --ocr-backend tesseract --output-format text

# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --output-format json

Python Usage

Async (recommended for web apps):

import asyncio

from kreuzberg import extract_file


async def main() -> None:
    # extract_file is a coroutine, so it must be awaited inside an async function
    result = await extract_file("presentation.pptx")
    print(result.content)

    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")


asyncio.run(main())

Sync (for scripts and CLI tools):

from kreuzberg import extract_file_sync

result = extract_file_sync("report.docx")
print(result.content)

# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
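
When many files need processing, calls to the async API can be fanned out with standard asyncio tools. The following is a minimal sketch, not part of Kreuzberg itself: extract_many and max_concurrency are hypothetical names introduced here, and the extraction coroutine is passed in as a parameter so the helper makes no assumptions about the library beyond the extract_file call shown above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,
) -> dict[str, Any]:
    """Run `extract` over `paths`, at most `max_concurrency` at a time.

    Returns a mapping of path -> result, or path -> exception if that
    file failed. (Hypothetical helper, not a Kreuzberg API.)
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Any:
        # The semaphore caps how many extractions are in flight at once.
        async with semaphore:
            return await extract(path)

    path_list = list(paths)
    # return_exceptions=True keeps one bad file from aborting the batch.
    outcomes = await asyncio.gather(
        *(worker(p) for p in path_list), return_exceptions=True
    )
    return dict(zip(path_list, outcomes))


# Usage with Kreuzberg (assumes the package is installed):
#
#     from kreuzberg import extract_file
#     results = asyncio.run(
#         extract_many(["report.docx", "presentation.pptx"], extract_file)
#     )
```

Capping concurrency matters mostly for OCR-heavy inputs, where each extraction can be CPU and memory intensive; the default of 4 is an arbitrary starting point, not a recommendation from the project.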

Docker

Two optimized images are available:

# Base image (API + CLI + multilingual OCR)
docker run -p 8000:8000 goldziher/kreuzberg

# Core image (+ chunking + crypto + document classification + language detection)
docker run -p 8000:8000 goldziher/kreuzberg-core:latest

# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
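
The same upload can be made from Python without third-party packages. This sketch mirrors the curl call above using only the standard library; the endpoint URL and the "file" field name come from that command, while build_multipart and post_file are hypothetical helper names. The response is returned as raw bytes because its format is not described here.

```python
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    # Guess the part's MIME type from the file name, as curl -F does for common types.
    part_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {part_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(path: str, url: str = "http://localhost:8000/extract") -> bytes:
    """POST the file at `path` to the extraction endpoint; returns the raw response body."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Usage (requires one of the containers above to be running):
#
#     print(post_file("document.pdf").decode("utf-8"))
```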

📖 Installation Guide • CLI Documentation • API Reference

Deployment Options

🤖 MCP Server (AI Integration)

Add via the claude CLI with one command:

claude mcp add kreuzberg uvx kreuzberg-mcp

Or configure manually in claude_desktop_config.json:

{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["kreuzberg-mcp"]
    }
  }
}
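
If the config file already lists other MCP servers, the entry should be merged rather than pasted over the file. A minimal sketch of that merge follows; add_kreuzberg_server is a hypothetical helper name, and the path to claude_desktop_config.json is passed in by the caller because its location depends on the platform.

```python
import json
from pathlib import Path


def add_kreuzberg_server(config_path: Path) -> dict:
    """Merge the kreuzberg entry into an MCP config file, keeping other servers."""
    config: dict = {}
    if config_path.exists():
        # Treat an empty file the same as a missing one.
        config = json.loads(config_path.read_text(encoding="utf-8") or "{}")
    servers = config.setdefault("mcpServers", {})
    # Same entry as the manual configuration shown above.
    servers["kreuzberg"] = {"command": "uvx", "args": ["kreuzberg-mcp"]}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Running it twice is harmless: the second call rewrites the same entry and leaves every other server untouched.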

MCP capabilities:

  • Extract text from PDFs, images, Office docs, and more
  • Multilingual OCR support with Tesseract
  • Metadata parsing and language detection

📖 MCP Documentation

Supported Formats

Category        Formats
Documents       PDF, DOCX, DOC, RTF, TXT, EPUB
Images          JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets    XLSX, XLS, CSV, ODS
Presentations   PPTX, PPT, ODP
Web             HTML, XML, MHTML
Archives        Support via extraction
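
The table can double as a pre-filter when walking a directory of mixed files. The sketch below simply transcribes the table into a lookup; SUPPORTED_EXTENSIONS and format_category are hypothetical names, and this is not how Kreuzberg itself detects file types. Extension aliases that the table does not list (for example .jpeg or .htm) are deliberately left out, as is the Archives row, which names no extensions.

```python
from pathlib import Path
from typing import Optional

# Transcribed from the Supported Formats table; not read from the library.
SUPPORTED_EXTENSIONS = {
    "Documents": {"pdf", "docx", "doc", "rtf", "txt", "epub"},
    "Images": {"jpg", "png", "tiff", "bmp", "gif", "webp"},
    "Spreadsheets": {"xlsx", "xls", "csv", "ods"},
    "Presentations": {"pptx", "ppt", "odp"},
    "Web": {"html", "xml", "mhtml"},
}


def format_category(path: str) -> Optional[str]:
    """Return the table category for a file name, or None if it is not listed."""
    extension = Path(path).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if extension in extensions:
            return category
    return None


# Usage: keep only files whose extension appears in the table.
candidates = ["report.PDF", "scan.webp", "notes.md", "budget.xlsx"]
listed = [name for name in candidates if format_category(name) is not None]
```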

📊 Performance Characteristics

View comprehensive benchmarks • Benchmark methodology • Detailed Analysis

Technical Specifications

Metric                     Kreuzberg Sync   Kreuzberg Async   Benchmarked
Throughput (tiny files)    31.78 files/s    23.94 files/s     Highest throughput
Throughput (small files)   8.91 files/s     9.31 files/s      Highest throughput
Memory footprint           359.8 MB         395.2 MB          Lowest usage
Installation size          71 MB            71 MB             Smallest size
Success rate               100%             100%              Perfect
Supported formats          18               18                Comprehensive

Architecture Advantages

  • Native C extensions: Built on PDFium and Tesseract for maximum performance
  • Async/await support: True asynchronous processing with intelligent task scheduling
  • Memory efficiency: Streaming architecture minimizes memory allocation
  • Process pooling: Automatic multiprocessing for CPU-intensive operations
  • Optimized data flow: Efficient data handling with minimal transformations

Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.

License

MIT License - see LICENSE for details.
