A text extraction library supporting PDFs, images, office documents and more

Kreuzberg

Kreuzberg is a high-performance Python library for text extraction from documents. Benchmarked as one of the fastest text extraction libraries available, it provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs optimized for speed and efficiency.

Why Kreuzberg?

  • 🚀 Substantially Faster: Extraction speeds that significantly outperform other text extraction libraries
  • ⚡ Unique Dual API: The only framework supporting both sync and async APIs for maximum flexibility
  • 💾 Memory Efficient: Lower memory footprint compared to competing libraries
  • 📊 Proven Performance: Comprehensive benchmarks demonstrate superior performance across formats
  • Simple and Hassle-Free: Clean API that just works, without complex configuration
  • Local Processing: No external API calls or cloud dependencies required
  • Resource Efficient: Lightweight processing without GPU requirements
  • Format Support: Comprehensive support for documents, images, and text formats
  • Multiple OCR Engines: Support for Tesseract, EasyOCR, and PaddleOCR
  • Command Line Interface: Powerful CLI for batch processing and automation
  • Metadata Extraction: Get document metadata alongside text content
  • Table Extraction: Extract tables from documents using the excellent GMFT library
  • Modern Python: Built with async/await, type hints, and a functional-first approach
  • Permissive OSS: MIT licensed with permissively licensed dependencies

Quick Start

pip install kreuzberg

# Or install with CLI support
pip install "kreuzberg[cli]"

# Or install with API server
pip install "kreuzberg[api]"

Install the system dependencies (Tesseract and pandoc):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install -y tesseract pandoc

Tesseract is the default OCR engine. You can opt out of it and either use one of the two alternative OCR engines or run without OCR entirely.

Alternative OCR engines

# Install with EasyOCR support
pip install "kreuzberg[easyocr]"

# Install with PaddleOCR support
pip install "kreuzberg[paddleocr]"

Quick Example

import asyncio
from kreuzberg import extract_file

async def main():
    # Extract text from a PDF
    result = await extract_file("document.pdf")
    print(result.content)

    # Extract text from an image
    result = await extract_file("scan.jpg")
    print(result.content)

    # Extract text from a Word document
    result = await extract_file("report.docx")
    print(result.content)

asyncio.run(main())
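
Because `extract_file` is a coroutine, several files can be awaited concurrently instead of one after another. The following is a minimal sketch, not part of Kreuzberg itself: `extract_many` is a hypothetical helper, and its `extract` parameter is an assumption added so the helper can be tried with a stand-in coroutine when Kreuzberg is not installed.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List, Optional


async def extract_many(
    paths: Iterable[str],
    extract: Optional[Callable[[str], Awaitable[Any]]] = None,
) -> List[Any]:
    """Run one extraction per path concurrently; results keep input order."""
    if extract is None:
        # Imported lazily so this helper can be loaded without Kreuzberg installed.
        from kreuzberg import extract_file

        extract = extract_file
    # asyncio.gather schedules all coroutines at once and preserves ordering.
    return await asyncio.gather(*(extract(path) for path in paths))


# Usage (requires Kreuzberg and the files to exist):
# results = asyncio.run(extract_many(["document.pdf", "scan.jpg", "report.docx"]))
# for result in results:
#     print(result.content)
```

Passing the extraction coroutine as a parameter keeps the sketch independent of any particular configuration; in real code you would likely call `extract_file` directly.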

Docker

Docker images are available for easy deployment:

# Run the API server
docker run -p 8000:8000 goldziher/kreuzberg:latest

# Extract files via API
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"

See the Docker documentation for more options.
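
The `curl` call above sends the file as a multipart form field named `data`. As a rough Python equivalent using only the standard library, the sketch below builds the same kind of request. The helper names `build_multipart` and `post_document` are hypothetical, and the response is returned as raw bytes because its exact shape is not described here; consult the API documentation before parsing it.

```python
import os
import uuid
from typing import Tuple
from urllib import request


def build_multipart(field: str, filename: str, payload: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_document(path: str, url: str = "http://localhost:8000/extract") -> bytes:
    """POST a local file to the extract endpoint and return the raw response."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("data", os.path.basename(path), fh.read())
    req = request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with request.urlopen(req) as resp:
        return resp.read()


# Usage (requires the Docker container from above to be running):
# print(post_document("document.pdf").decode())
```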

REST API

Run Kreuzberg as a REST API server:

pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run

See the API documentation for endpoints and usage.

Command Line Interface

Kreuzberg includes a powerful CLI for processing documents from the command line:

# Extract text from a file
kreuzberg extract document.pdf

# Extract with JSON output and metadata
kreuzberg extract document.pdf --output-format json --show-metadata

# Extract from stdin
cat document.html | kreuzberg extract

# Use specific OCR backend
kreuzberg extract image.png --ocr-backend easyocr --easyocr-languages en,de

# Extract with configuration file
kreuzberg extract document.pdf --config config.toml
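
For automation from Python, the commands above can be assembled and run with `subprocess`. This is a sketch under stated assumptions: `build_extract_command` and `run_extract` are hypothetical wrappers, and they use only the flags shown in the examples above (`--output-format`, `--show-metadata`, `--ocr-backend`, `--config`).

```python
import shutil
import subprocess
from typing import List, Optional


def build_extract_command(
    path: str,
    output_format: Optional[str] = None,
    show_metadata: bool = False,
    ocr_backend: Optional[str] = None,
    config: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `kreuzberg extract` from the options given."""
    cmd = ["kreuzberg", "extract", path]
    if output_format:
        cmd += ["--output-format", output_format]
    if show_metadata:
        cmd.append("--show-metadata")
    if ocr_backend:
        cmd += ["--ocr-backend", ocr_backend]
    if config:
        cmd += ["--config", config]
    return cmd


def run_extract(path: str, **options) -> str:
    """Run the CLI and return its standard output; raises if the CLI is missing."""
    if shutil.which("kreuzberg") is None:
        raise RuntimeError('kreuzberg CLI not found; try: pip install "kreuzberg[cli]"')
    completed = subprocess.run(
        build_extract_command(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


# Usage:
# print(run_extract("document.pdf", output_format="json", show_metadata=True))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.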

CLI Configuration

Configure via pyproject.toml:

[tool.kreuzberg]
force_ocr = true
chunk_content = false
extract_tables = true
max_chars = 4000
ocr_backend = "tesseract"

[tool.kreuzberg.tesseract]
language = "eng+deu"
psm = 3

For full CLI documentation, see the CLI Guide.

Documentation

For comprehensive documentation, visit our GitHub Pages.

Supported Formats

Kreuzberg supports a wide range of document formats:

  • Documents: PDF, DOCX, RTF, TXT, EPUB, etc.
  • Images: JPG, PNG, TIFF, BMP, GIF, etc.
  • Spreadsheets: XLSX, XLS, CSV, etc.
  • Presentations: PPTX, PPT, etc.
  • Web Content: HTML, XML, etc.

OCR Engines

Kreuzberg supports multiple OCR engines:

  • Tesseract (Default): Lightweight, fast startup, requires system installation
  • EasyOCR: Good for many languages, installs via pip with no system package required, but downloads models on first use
  • PaddleOCR: Excellent for Asian languages, installs via pip with no system package required, but downloads models on first use

For comparison and selection guidance, see the OCR Backends documentation.

Performance

Kreuzberg delivers exceptional performance compared to other text extraction libraries:

🏆 Competitive Benchmarks

Comprehensive benchmarks comparing Kreuzberg against other popular Python text extraction libraries show:

  • Fastest Extraction: Consistently fastest processing times across file formats
  • Lowest Memory Usage: Most memory-efficient text extraction solution
  • 100% Success Rate: Reliable extraction across all tested document types
  • Optimal for High-Throughput: Designed for real-time, production applications

💾 Installation Size Efficiency

Kreuzberg delivers maximum performance with minimal overhead:

  1. Kreuzberg: 71.0 MB (20 deps) - Most lightweight
  2. Unstructured: 145.8 MB (54 deps) - Moderate footprint
  3. MarkItDown: 250.7 MB (25 deps) - ML inference overhead
  4. Docling: 1,031.9 MB (88 deps) - Full ML stack included

Kreuzberg is up to 14x smaller than competing solutions while delivering superior performance.
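
The size ratios can be checked directly from the figures listed above. The short calculation below uses only those four numbers and takes Kreuzberg's install size as the baseline.

```python
# Install sizes in MB, copied from the list above.
sizes_mb = {
    "Kreuzberg": 71.0,
    "Unstructured": 145.8,
    "MarkItDown": 250.7,
    "Docling": 1031.9,
}

baseline = sizes_mb["Kreuzberg"]
ratios = {name: size / baseline for name, size in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
# Kreuzberg: 1.0x
# Unstructured: 2.1x
# MarkItDown: 3.5x
# Docling: 14.5x
```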

⚡ Sync vs Async Performance

Kreuzberg is the only library offering both sync and async APIs. Choose based on your use case:

| Operation | Sync Time | Async Time | Async Advantage |
|---|---|---|---|
| Simple text (Markdown) | 0.4ms | 17.5ms | ❌ 41x slower |
| HTML documents | 1.6ms | 1.1ms | ✅ 1.5x faster |
| Complex PDFs | 39.0s | 8.5s | ✅ 4.6x faster |
| OCR processing | 0.4s | 0.7s | ✅ 1.7x faster |
| Batch operations | 38.6s | 8.5s | ✅ 4.5x faster |

Rule of thumb: Use async for complex documents, OCR, batch processing, and backend APIs; use sync for simple text, where the async overhead dominates.

For detailed benchmarks and methodology, see our Performance Documentation.
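
Since the better choice depends on the workload, it can help to time both APIs on your own documents. The helpers below are a hypothetical sketch, not Kreuzberg's benchmark methodology: they accept any sync function or coroutine function, and the usage comment assumes that `extract_file_sync` is the synchronous counterpart of `extract_file` (check the API documentation for the exact name).

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


def time_sync(extract: Callable[[str], Any], path: str) -> float:
    """Return the wall-clock seconds taken by one synchronous extraction."""
    start = time.perf_counter()
    extract(path)
    return time.perf_counter() - start


def time_async(extract: Callable[[str], Awaitable[Any]], path: str) -> float:
    """Return the wall-clock seconds taken by one async extraction.

    The measurement includes event-loop startup, which is part of the cost
    when async code is called from a synchronous program.
    """
    start = time.perf_counter()
    asyncio.run(extract(path))
    return time.perf_counter() - start


# Usage (assumed names, requires Kreuzberg and a local file):
# from kreuzberg import extract_file, extract_file_sync
# print("sync :", time_sync(extract_file_sync, "document.pdf"))
# print("async:", time_async(extract_file, "document.pdf"))
```

A single run is noisy; repeating each measurement several times and comparing medians gives a more reliable picture.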

Contributing

We welcome contributions! Please see our Contributing Guide for details on setting up your development environment and submitting pull requests.

License

This library is released under the MIT license.

Project details


This version

3.4.0

Download files

Source Distribution

kreuzberg-3.4.0.tar.gz (9.4 MB view details)

Uploaded Source

Built Distribution

kreuzberg-3.4.0-py3-none-any.whl (86.3 kB view details)

Uploaded Python 3

File details

Details for the file kreuzberg-3.4.0.tar.gz.

File metadata

  • Download URL: kreuzberg-3.4.0.tar.gz
  • Size: 9.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kreuzberg-3.4.0.tar.gz
Algorithm Hash digest
SHA256 416e245cc01c2f7096d457b8678a39f1d11812f216e0ee055f55a4b4431b38b9
MD5 3d169081eb3552d271cc8eb36fb98947
BLAKE2b-256 5540a97ad667637d69796afadf0f3889b44842be34c6c7471f373203975dd0d2

Provenance

The following attestation bundles were made for kreuzberg-3.4.0.tar.gz:

Publisher: release.yaml on Goldziher/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kreuzberg-3.4.0-py3-none-any.whl.

File metadata

  • Download URL: kreuzberg-3.4.0-py3-none-any.whl
  • Size: 86.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kreuzberg-3.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c921dbf95a878de19608a209fad3b6354eff4a1261650c64f297e3c702d2398f
MD5 2ac987b560323bc8c98d816fe47f5712
BLAKE2b-256 21d36c7f5af502f98ee2b5a2557e35c924be05bf38679e63914869edd00ea361

Provenance

The following attestation bundles were made for kreuzberg-3.4.0-py3-none-any.whl:

Publisher: release.yaml on Goldziher/kreuzberg
