A text extraction library supporting PDFs, images, office documents and more

Kreuzberg

High-performance Python library for text extraction from documents. Extract text from PDFs, images, office documents, and more with both async and sync APIs.

📖 Complete Documentation

Why Kreuzberg?

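  • 🚀 Fastest Performance: Fastest of the libraries compared in the benchmarks below (Unstructured, MarkItDown, Docling)
  • 💾 Memory Efficient: Lowest memory usage in the benchmarks below, and up to 14x smaller in size than alternatives (71MB vs 1,032MB for Docling)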
  • ⚡ Dual APIs: Only library with both sync and async support
  • 🔧 Zero Configuration: Works out of the box with sane defaults
  • 🏠 Local Processing: No cloud dependencies or external API calls
  • 📦 Rich Format Support: PDFs, images, Office docs, HTML, and more
  • 🔍 Multiple OCR Engines: Tesseract, EasyOCR, and PaddleOCR support
  • 🐳 Production Ready: CLI, REST API, and Docker images included

Quick Start

Installation

# Basic installation
pip install kreuzberg

# With optional features
pip install "kreuzberg[cli,api]"        # CLI + REST API
pip install "kreuzberg[easyocr,gmft]"   # EasyOCR + table extraction
pip install "kreuzberg[all]"            # Everything

System Dependencies

# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install tesseract pandoc

Basic Usage

import asyncio
from kreuzberg import extract_file

async def main():
    # Extract from any document type
    result = await extract_file("document.pdf")
    print(result.content)
    print(result.metadata)

asyncio.run(main())
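
For scripts that do not run an event loop, the sync API mentioned above can be wrapped in a small helper. This is a minimal sketch, assuming the package exports `extract_file_sync` as the blocking counterpart of `extract_file`. The names `extract_text` and `extractor` are hypothetical and exist only so the helper can be exercised with a stand-in when the library is not installed.

```python
from pathlib import Path
from typing import Any, Callable, Optional, Union


def extract_text(
    path: Union[str, Path],
    extractor: Optional[Callable[[Any], Any]] = None,
) -> str:
    """Return the extracted text of one file using a blocking call."""
    if extractor is None:
        # Assumption: kreuzberg exposes extract_file_sync at the top level.
        # Imported lazily so this module loads even without the package.
        from kreuzberg import extract_file_sync

        extractor = extract_file_sync
    result = extractor(path)
    # The result object carries the text in .content, as in the async example.
    return result.content


# Typical call, with kreuzberg installed:
#     print(extract_text("document.pdf"))
```

Passing the extractor as a parameter is a design choice for testability, not something the library requires.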

Deployment Options

🐳 Docker (Recommended)

# Run API server
docker run -p 8000:8000 goldziher/kreuzberg:3.4.0

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"

Available variants: 3.4.0, 3.4.0-easyocr, 3.4.0-paddle, 3.4.0-gmft, 3.4.0-all

🌐 REST API

# Install and run
pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run

# Health check
curl http://localhost:8000/health

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@file.pdf"
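
The same upload can be made from Python without extra dependencies. The sketch below mirrors the curl call above: it posts one file in a multipart form field named `data` to `/extract`. It assumes the server replies with JSON and that it reads each part's Content-Type, so the type is guessed from the file name. `build_multipart` and `post_extract` are hypothetical helper names.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Any, Tuple


def build_multipart(field: str, filename: str, payload: bytes) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content type)."""
    boundary = uuid.uuid4().hex
    part_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {part_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_extract(path: str, url: str = "http://localhost:8000/extract") -> Any:
    """Upload a file to the extraction endpoint and decode the reply."""
    file_path = Path(path)
    body, content_type = build_multipart("data", file_path.name, file_path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        # Assumption: the response body is JSON.
        return json.loads(response.read().decode("utf-8"))


# Typical call, with the server running locally:
#     print(post_extract("file.pdf"))
```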

💻 Command Line

# Install CLI
pip install "kreuzberg[cli]"

# Extract to stdout
kreuzberg extract document.pdf

# JSON output with metadata
kreuzberg extract document.pdf --output-format json --show-metadata

# Batch processing
kreuzberg extract *.pdf --output-dir ./extracted/

Supported Formats

Category        Formats
Documents       PDF, DOCX, DOC, RTF, TXT, EPUB
Images          JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets    XLSX, XLS, CSV, ODS
Presentations   PPTX, PPT, ODP
Web             HTML, XML, MHTML
Archives        Support via extraction

Performance

Fastest extraction speeds with minimal resource usage:

Library        Speed          Memory       Size        Success Rate
Kreuzberg      Fastest        💾 Lowest    📦 71MB     100%
Unstructured   2-3x slower    2x higher    146MB       95%
MarkItDown     3-4x slower    3x higher    251MB       90%
Docling        4-5x slower    10x higher   1,032MB     85%

Rule of thumb: use the async API for complex documents and batch processing (up to 4.5x faster).
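
One way to apply this rule of thumb to a batch is to run several extractions concurrently while capping how many are in flight. The sketch below is generic asyncio code, not part of Kreuzberg: `extract_all` and `max_concurrency` are hypothetical names, and the `extract` argument is any coroutine function that takes a path, such as `extract_file` from the Basic Usage example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def extract_all(
    paths: Iterable[str],
    extract: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,
) -> List[Any]:
    """Run extract() over all paths, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Any:
        # The semaphore keeps memory and CPU use bounded for large batches.
        async with semaphore:
            return await extract(path)

    # gather() returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(path) for path in paths))


# Typical call, with kreuzberg installed:
#     from kreuzberg import extract_file
#     results = asyncio.run(extract_all(["a.pdf", "b.docx"], extract_file))
```

A suitable `max_concurrency` depends on document size and available cores, so treat the default of 4 as a placeholder.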

Documentation

Advanced Features

  • 📊 Table Extraction: Extract tables from PDFs with GMFT
  • 🧩 Content Chunking: Split documents for RAG applications
  • 🎯 Custom Extractors: Extend with your own document handlers
  • 🔧 Configuration: Flexible TOML-based configuration
  • 🪝 Hooks: Pre/post-processing customization
  • 🌍 Multi-language OCR: 100+ languages supported
  • ⚙️ Metadata Extraction: Rich document metadata
  • 🔄 Batch Processing: Efficient bulk document processing
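
To make the content chunking item above concrete, here is a minimal sketch of fixed-size chunking with overlap, a common way to prepare text for RAG pipelines. It is an illustration in plain Python, not Kreuzberg's own chunker. The function name `chunk_text` and its parameters are hypothetical.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of max_chars that share `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")

    chunks: List[str] = []
    step = max_chars - overlap  # how far each window advances
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

The overlap repeats the tail of each chunk at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.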

License

MIT License - see LICENSE for details.


Made with ❤️ by the Kreuzberg contributors

This version

3.4.1

Download files

Source Distribution

kreuzberg-3.4.1.tar.gz (9.4 MB)

Built Distribution

kreuzberg-3.4.1-py3-none-any.whl (85.4 kB)

File details

Details for the file kreuzberg-3.4.1.tar.gz.

File metadata

  • Download URL: kreuzberg-3.4.1.tar.gz
  • Upload date:
  • Size: 9.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kreuzberg-3.4.1.tar.gz
Algorithm Hash digest
SHA256 fbc96bf34a46c3c47d731cafd305a5b30f771bb305254cafb6cdfac95a53d6e3
MD5 fef0c3f6e71b080e8a6b03b4b9862343
BLAKE2b-256 6f828bee89691f020ef25f620b25b3097089f87103d4dcf5cdff625c6a75a0f3
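
The SHA256 digest listed above can be checked against a downloaded copy of the source distribution before installing it. This is a minimal sketch using only the standard library. The helper names `sha256_of` and `matches` are illustrative, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path
from typing import Union

# SHA256 for kreuzberg-3.4.1.tar.gz, copied from the table above.
EXPECTED_SHA256 = "fbc96bf34a46c3c47d731cafd305a5b30f771bb305254cafb6cdfac95a53d6e3"


def sha256_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: Union[str, Path], expected_hex: str) -> bool:
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Typical call, after downloading the archive to the current directory:
#     print(matches("kreuzberg-3.4.1.tar.gz", EXPECTED_SHA256))
```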

Provenance

The following attestation bundles were made for kreuzberg-3.4.1.tar.gz:

Publisher: release.yaml on Goldziher/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kreuzberg-3.4.1-py3-none-any.whl.

File metadata

  • Download URL: kreuzberg-3.4.1-py3-none-any.whl
  • Upload date:
  • Size: 85.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kreuzberg-3.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2e721fb7e4abe36d95984b4801ca89361e20caa1d2d8930818323d115cc08af6
MD5 19e9932d98d38ca67dacede6731f6fe9
BLAKE2b-256 0cb946f314708d349a9858d0364a33fa40b3a373910b7c3133188c948ec89c5f

Provenance

The following attestation bundles were made for kreuzberg-3.4.1-py3-none-any.whl:

Publisher: release.yaml on Goldziher/kreuzberg
