Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats
Kreuzberg
A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.
Framework Overview
Document Intelligence Capabilities
- Text Extraction: High-fidelity text extraction preserving document structure and formatting
- Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
- Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
- OCR Integration: Tesseract OCR with markdown output (default) and table extraction from scanned documents
- Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)
Technical Architecture
- Performance: Highest throughput among the Python document processing frameworks benchmarked (over 30 files/second on tiny files with the sync API)
- Resource Efficiency: 71 MB installation, ~360 MB runtime memory footprint
- Extensibility: Plugin architecture for custom extractors via the Extractor base class
- API Design: Synchronous and asynchronous APIs with consistent interfaces
- Type Safety: Complete type annotations throughout the codebase
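The plugin idea in the list above can be illustrated with a small registry. This is a minimal sketch of the general pattern only: the `Extractor` class, its `mime_types` attribute, the `extract` method signature, and `ExtractorRegistry` are all hypothetical names chosen for illustration and are not taken from Kreuzberg's actual interface, which is described in the API Reference.

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Hypothetical base class; NOT Kreuzberg's actual interface."""

    # MIME types this extractor claims to handle.
    mime_types: frozenset[str] = frozenset()

    @abstractmethod
    def extract(self, data: bytes) -> str:
        """Return the plain text contained in ``data``."""


class ExtractorRegistry:
    """Maps MIME types to extractor instances (illustrative only)."""

    def __init__(self) -> None:
        self._by_mime: dict[str, Extractor] = {}

    def register(self, extractor: Extractor) -> None:
        for mime in extractor.mime_types:
            # A later registration replaces an earlier one for the same type.
            self._by_mime[mime] = extractor

    def extract(self, data: bytes, mime_type: str) -> str:
        try:
            extractor = self._by_mime[mime_type]
        except KeyError:
            raise ValueError(f"no extractor registered for {mime_type!r}") from None
        return extractor.extract(data)


class PlainTextExtractor(Extractor):
    """A custom extractor plugged in by subclassing the base class."""

    mime_types = frozenset({"text/plain"})

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


registry = ExtractorRegistry()
registry.register(PlainTextExtractor())
print(registry.extract(b"hello", "text/plain"))
```

Dispatching on MIME type keeps the calling code identical for every format, which is what allows new formats to be added without touching existing call sites.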
Open Source Foundation
Kreuzberg leverages established open source technologies:
- Pandoc: Universal document converter for robust format support
- PDFium: Google's PDF rendering engine for accurate PDF processing
- Tesseract: Open source OCR engine, long sponsored by Google, for text recognition
- Python-docx/pptx: Native Microsoft Office format support
Quick Start
Extract Text with CLI
# Extract text from any file to text format
uvx kreuzberg extract document.pdf > output.txt
# With an explicit OCR backend and output format
uvx kreuzberg extract invoice.pdf --ocr-backend tesseract --output-format text
# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --output-format json
Python Usage
Async (recommended for web apps):
import asyncio

from kreuzberg import extract_file

async def main() -> None:
    result = await extract_file("presentation.pptx")
    print(result.content)
    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")

# Run the coroutine from synchronous code; inside a web app, await it directly
asyncio.run(main())
Sync (for scripts and CLI tools):
from kreuzberg import extract_file_sync
result = extract_file_sync("report.docx")
print(result.content)
# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
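When several documents need processing, the async API lends itself to concurrent extraction. The following is a minimal sketch assuming only that the extraction function is a coroutine taking a path, as in the async example above. The helper `extract_many`, its `max_concurrency` parameter, and the `fake_extract` stand-in are hypothetical names introduced here so the sketch runs without any dependencies.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import TypeVar

T = TypeVar("T")


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[T]],
    max_concurrency: int = 4,
) -> list[T]:
    """Run ``extract`` over ``paths`` concurrently, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(path: str) -> T:
        # The semaphore caps how many extractions are in flight at once.
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(run_one(p) for p in paths))


async def fake_extract(path: str) -> str:
    """Stand-in for a real extractor (hypothetical, for demonstration only)."""
    await asyncio.sleep(0)
    return f"text of {path}"


results = asyncio.run(extract_many(["a.pdf", "b.docx"], fake_extract))
print(results)
```

Passing `extract_file` in place of `fake_extract` should yield one result object per path, in input order. The concurrency cap is a design choice to bound memory use when OCR or large files are involved.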
Docker
Two optimized images are available:
# Base image (API + CLI + multilingual OCR)
docker run -p 8000:8000 goldziher/kreuzberg
# Core image (+ chunking + crypto + document classification + language detection)
docker run -p 8000:8000 goldziher/kreuzberg-core:latest
# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
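The same upload can be made from Python without third-party packages. This sketch mirrors the curl call above (form field `file`, endpoint `http://localhost:8000/extract`) using only the standard library. The helper names `build_multipart` and `post_file` are hypothetical, and the shape of the response body is not specified here, so it is returned as raw bytes.

```python
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    # Guess the part's media type from the file name, with a generic fallback.
    part_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {part_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(url: str, path: str, timeout: float = 60.0) -> bytes:
    """POST the file at ``path`` to ``url`` and return the raw response body."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Building the request body needs no running server.
body, content_type = build_multipart("file", "document.pdf", b"%PDF-1.7")
print(content_type.split(";")[0])
```

With the container running, `post_file("http://localhost:8000/extract", "document.pdf")` would send the same request as the curl command.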
📖 Installation Guide • CLI Documentation • API Reference
Deployment Options
🤖 MCP Server (AI Integration)
Add to Claude Code with one command:
claude mcp add kreuzberg uvx kreuzberg-mcp
Or, for Claude Desktop, configure manually in claude_desktop_config.json:
{
"mcpServers": {
"kreuzberg": {
"command": "uvx",
"args": ["kreuzberg-mcp"]
}
}
}
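If the configuration file already lists other servers, the entry has to be merged rather than pasted over the existing content. A minimal sketch of that merge follows. The helper name `add_mcp_server` is hypothetical, and the pre-existing `other` entry is a made-up placeholder. Reading and writing the actual file is left out because its location varies by platform.

```python
import json


def add_mcp_server(config: dict, name: str, command: str, args: list[str]) -> dict:
    """Return a copy of ``config`` with one entry added under "mcpServers"."""
    servers = dict(config.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    return {**config, "mcpServers": servers}


# Placeholder for a config that already contains an unrelated server.
existing = {"mcpServers": {"other": {"command": "npx", "args": ["other-server"]}}}
updated = add_mcp_server(existing, "kreuzberg", "uvx", ["kreuzberg-mcp"])
print(json.dumps(updated, indent=2))
```

The function returns a new dictionary instead of mutating its input, so the original configuration stays available if the write needs to be rolled back.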
MCP capabilities:
- Extract text from PDFs, images, Office docs, and more
- Multilingual OCR support with Tesseract
- Metadata parsing and language detection
Supported Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, RTF, TXT, EPUB |
| Images | JPG, PNG, TIFF, BMP, GIF, WEBP |
| Spreadsheets | XLSX, XLS, CSV, ODS |
| Presentations | PPTX, PPT, ODP |
| Web | HTML, XML, MHTML |
| Archives | Supported via extraction |
📊 Performance Characteristics
View comprehensive benchmarks • Benchmark methodology • Detailed Analysis
Technical Specifications
| Metric | Kreuzberg Sync | Kreuzberg Async | Benchmark ranking |
|---|---|---|---|
| Throughput (tiny files) | 31.78 files/s | 23.94 files/s | Highest throughput |
| Throughput (small files) | 8.91 files/s | 9.31 files/s | Highest throughput |
| Memory footprint | 359.8 MB | 395.2 MB | Lowest usage |
| Installation size | 71 MB | 71 MB | Smallest size |
| Success rate | 100% | 100% | Perfect |
| Supported formats | 18 | 18 | Comprehensive |
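For capacity planning, the throughput figures in the table can be read as an average time per file by taking the reciprocal. The sketch below does that arithmetic with the four throughput values from the table. It assumes files are processed one at a time, so with concurrent processing the result is an average cost per file rather than a true latency.

```python
# Throughput values copied from the table above (files per second).
throughput_files_per_s = {
    ("tiny", "sync"): 31.78,
    ("tiny", "async"): 23.94,
    ("small", "sync"): 8.91,
    ("small", "async"): 9.31,
}

# Average milliseconds per file is the reciprocal of throughput, scaled to ms.
avg_ms_per_file = {key: 1000.0 / rate for key, rate in throughput_files_per_s.items()}

for (size, api), ms in avg_ms_per_file.items():
    print(f"{size:>5} files, {api:>5} API: {ms:6.1f} ms per file")
```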
Architecture Advantages
- Native C extensions: Built on PDFium and Tesseract for maximum performance
- Async/await support: True asynchronous processing with intelligent task scheduling
- Memory efficiency: Streaming architecture minimizes memory allocation
- Process pooling: Automatic multiprocessing for CPU-intensive operations
- Optimized data flow: Efficient data handling with minimal transformations
Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.
Documentation
Quick Links
- Installation Guide - Setup and dependencies
- User Guide - Comprehensive usage guide
- Performance Analysis - Detailed benchmark results
- API Reference - Complete API documentation
- Docker Guide - Container deployment
- REST API - HTTP endpoints
- CLI Guide - Command-line usage
- OCR Configuration - OCR engine setup
License
MIT License - see LICENSE for details.