A text extraction library supporting PDFs, images, office documents and more
Kreuzberg
High-performance Python library for text extraction from documents. Extract text from PDFs, images, office documents, and more with both async and sync APIs.
Why Kreuzberg?
- 🚀 Fastest Performance: Benchmarked as the fastest text extraction library
- 💾 Memory Efficient: Up to 14x smaller install size than alternatives (71MB vs 1GB+)
- ⚡ Dual APIs: Only library with both sync and async support
- 🔧 Zero Configuration: Works out of the box with sane defaults
- 🏠 Local Processing: No cloud dependencies or external API calls
- 📦 Rich Format Support: PDFs, images, Office docs, HTML, and more
- 🔍 Multiple OCR Engines: Tesseract, EasyOCR, and PaddleOCR support
- 🐳 Production Ready: CLI, REST API, and Docker images included
Quick Start
Installation
# Basic installation
pip install kreuzberg
# With optional features
pip install "kreuzberg[cli,api]" # CLI + REST API
pip install "kreuzberg[easyocr,gmft]" # EasyOCR + table extraction
pip install "kreuzberg[all]" # Everything
System Dependencies
# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc
# macOS
brew install tesseract pandoc
# Windows
choco install tesseract pandoc
Basic Usage
import asyncio
from kreuzberg import extract_file
async def main():
    # Extract from any document type
    result = await extract_file("document.pdf")
    print(result.content)
    print(result.metadata)

asyncio.run(main())
Deployment Options
🐳 Docker (Recommended)
# Run API server
docker run -p 8000:8000 goldziher/kreuzberg:3.4.0
# Extract files
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
Available variants: 3.4.0, 3.4.0-easyocr, 3.4.0-paddle, 3.4.0-gmft, 3.4.0-all
🌐 REST API
# Install and run
pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run
# Health check
curl http://localhost:8000/health
# Extract files
curl -X POST http://localhost:8000/extract -F "data=@file.pdf"
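The same upload can be made from Python without extra dependencies. The sketch below assumes only what the curl command shows: a server on `localhost:8000`, the `/extract` endpoint, and a form field named `data`. The helper names `build_multipart` and `post_file` are illustrative and not part of Kreuzberg, and the response is returned as raw bytes because its format is not described here.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_file(path: str, url: str = "http://localhost:8000/extract") -> bytes:
    """POST a file to the extraction endpoint and return the raw response body."""
    file = Path(path)
    # "data" is the form field name used in the curl example.
    body, content_type = build_multipart("data", file.name, file.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the API server running, `post_file("file.pdf")` sends the same request as the curl command.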
💻 Command Line
# Install CLI
pip install "kreuzberg[cli]"
# Extract to stdout
kreuzberg extract document.pdf
# JSON output with metadata
kreuzberg extract document.pdf --output-format json --show-metadata
# Batch processing
kreuzberg extract *.pdf --output-dir ./extracted/
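For scripting, the CLI can also be driven from Python. This is a minimal sketch that uses only the `extract` subcommand and the `--output-format json --show-metadata` flags shown above; `extract_command` and `run_extract` are hypothetical helper names, and the `kreuzberg` executable is assumed to be on `PATH`.

```python
import subprocess


def extract_command(path: str, as_json: bool = False) -> list[str]:
    """Build the argv for `kreuzberg extract`, optionally requesting JSON with metadata."""
    command = ["kreuzberg", "extract", path]
    if as_json:
        command += ["--output-format", "json", "--show-metadata"]
    return command


def run_extract(path: str, as_json: bool = False) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on a non-zero exit."""
    completed = subprocess.run(
        extract_command(path, as_json), capture_output=True, text=True, check=True
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.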
Supported Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, RTF, TXT, EPUB |
| Images | JPG, PNG, TIFF, BMP, GIF, WEBP |
| Spreadsheets | XLSX, XLS, CSV, ODS |
| Presentations | PPTX, PPT, ODP |
| Web | HTML, XML, MHTML |
| Archives | Support via extraction |
Performance
Fastest extraction speeds with minimal resource usage:
| Library | Speed | Memory | Size | Success Rate |
|---|---|---|---|---|
| Kreuzberg | ⚡ Fastest | 💾 Lowest | 📦 71MB | ✅ 100% |
| Unstructured | 2-3x slower | 2x higher | 146MB | 95% |
| MarkItDown | 3-4x slower | 3x higher | 251MB | 90% |
| Docling | 4-5x slower | 10x higher | 1,032MB | 85% |
Rule of thumb: Use the async API for complex documents and batch processing (up to 4.5x faster).
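One way to apply this rule of thumb is to run several extractions concurrently with a cap on how many are in flight. The sketch below is generic asyncio code, not Kreuzberg's own batch mechanism: `extract_many` and `max_concurrency` are illustrative names, and the extractor is passed in as a parameter so any coroutine function that takes a path will work.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")


async def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Awaitable[T]],
    max_concurrency: int = 4,
) -> dict[str, T]:
    """Run `extract` over all paths concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(path: str) -> tuple[str, T]:
        # The semaphore bounds how many extractions run at once.
        async with semaphore:
            return path, await extract(path)

    pairs = await asyncio.gather(*(run_one(p) for p in paths))
    return dict(pairs)
```

With Kreuzberg installed, `extract_file` from the Basic Usage example can be passed as the extractor, for example `asyncio.run(extract_many(["a.pdf", "b.docx"], extract_file))`.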
Documentation
Quick Links
- Installation Guide - Setup and dependencies
- User Guide - Comprehensive usage guide
- API Reference - Complete API documentation
- Docker Guide - Container deployment
- REST API - HTTP endpoints
- CLI Guide - Command-line usage
- OCR Configuration - OCR engine setup
Advanced Features
- 📊 Table Extraction: Extract tables from PDFs with GMFT
- 🧩 Content Chunking: Split documents for RAG applications
- 🎯 Custom Extractors: Extend with your own document handlers
- 🔧 Configuration: Flexible TOML-based configuration
- 🪝 Hooks: Pre/post-processing customization
- 🌍 Multi-language OCR: 100+ languages supported
- ⚙️ Metadata Extraction: Rich document metadata
- 🔄 Batch Processing: Efficient bulk document processing
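To illustrate what content chunking for RAG involves, here is a generic fixed-size splitter with overlap. It is a sketch of the general idea only and does not reflect how Kreuzberg's chunking is implemented or configured; `chunk_text`, `max_chars`, and `overlap` are hypothetical names.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of up to `max_chars`, each sharing `overlap` chars with the previous one."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")

    step = max_chars - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Stop once the current chunk reaches the end of the text.
        if start + max_chars >= len(text):
            break
        start += step
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval.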
License
MIT License - see LICENSE for details.
Documentation • PyPI • Docker Hub • Discord
Made with ❤️ by the Kreuzberg contributors