A text extraction library supporting PDFs, images, office documents and more

Maintainers

nhirschfeld

Project description

Kreuzberg

High-performance Python library for text extraction from documents. Extract text from PDFs, images, office documents, and more with both async and sync APIs.

📖 Complete Documentation

Why Kreuzberg?

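🚀 Fastest Performance: Benchmarked as the fastest of the text extraction libraries compared under Performance
💾 Memory Efficient: Lowest memory usage in the benchmark, with a 71MB install size that is about 14x smaller than the largest alternative (1GB+)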
⚡ Dual APIs: Only library with both sync and async support
🔧 Zero Configuration: Works out of the box with sane defaults
🏠 Local Processing: No cloud dependencies or external API calls
📦 Rich Format Support: PDFs, images, Office docs, HTML, and more
🔍 Multiple OCR Engines: Tesseract, EasyOCR, and PaddleOCR support
🐳 Production Ready: CLI, REST API, and Docker images included

Quick Start

Installation

# Basic installation
pip install kreuzberg

# With optional features
pip install "kreuzberg[cli,api]"        # CLI + REST API
pip install "kreuzberg[easyocr,gmft]"   # EasyOCR + table extraction
pip install "kreuzberg[all]"            # Everything

System Dependencies

# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install tesseract pandoc

Basic Usage

import asyncio
from kreuzberg import extract_file

async def main():
    # Extract from any document type
    result = await extract_file("document.pdf")
    print(result.content)
    print(result.metadata)

asyncio.run(main())

Deployment Options

🐳 Docker (Recommended)

# Run API server
docker run -p 8000:8000 goldziher/kreuzberg:3.4.0

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"

Available variants: 3.4.0, 3.4.0-easyocr, 3.4.0-paddle, 3.4.0-gmft, 3.4.0-all

🌐 REST API

# Install and run
pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run

# Health check
curl http://localhost:8000/health

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@file.pdf"
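
The same upload can be issued from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on localhost:8000 and accepts the file under the multipart field name `data`, as in the curl example. The helper names `build_multipart` and `post_file` are illustrative and not part of Kreuzberg, and the response is returned as raw bytes because its schema is not described here.

```python
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field, filename, payload):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    # Guess the part's content type from the file name, as curl does for common extensions.
    part_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {part_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_file(path, url="http://localhost:8000/extract"):
    """POST a file to the extract endpoint (URL and field name taken from the curl example)."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("data", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `post_file("file.pdf")` requires the server to be running; `build_multipart` can be inspected on its own without any network access.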

💻 Command Line

# Install CLI
pip install "kreuzberg[cli]"

# Extract to stdout
kreuzberg extract document.pdf

# JSON output with metadata
kreuzberg extract document.pdf --output-format json --show-metadata

# Batch processing
kreuzberg extract *.pdf --output-dir ./extracted/

Supported Formats

Category	Formats
Documents	PDF, DOCX, DOC, RTF, TXT, EPUB
Images	JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets	XLSX, XLS, CSV, ODS
Presentations	PPTX, PPT, ODP
Web	HTML, XML, MHTML
Archives	Support via extraction
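
When filtering a directory before extraction, the table above can be expressed as a small lookup. This is a sketch only: `FORMATS` and `category_for` are hypothetical names, not part of Kreuzberg, and the extension sets are copied from the table rather than read from the library.

```python
from pathlib import Path

# Extensions copied from the table above. The Archives row lists no
# concrete formats, so it is left out.
FORMATS = {
    "Documents": {"pdf", "docx", "doc", "rtf", "txt", "epub"},
    "Images": {"jpg", "png", "tiff", "bmp", "gif", "webp"},
    "Spreadsheets": {"xlsx", "xls", "csv", "ods"},
    "Presentations": {"pptx", "ppt", "odp"},
    "Web": {"html", "xml", "mhtml"},
}


def category_for(filename):
    """Return the table category for a file name, or None if the extension is not listed."""
    extension = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in FORMATS.items():
        if extension in extensions:
            return category
    return None


print(category_for("report.PDF"))  # Documents
print(category_for("notes.md"))    # None
```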

Performance

Fastest extraction speeds with minimal resource usage:

Library	Speed	Memory	Size	Success Rate
Kreuzberg	⚡ Fastest	💾 Lowest	📦 71MB	✅ 100%
Unstructured	2-3x slower	2x higher	146MB	95%
MarkItDown	3-4x slower	3x higher	251MB	90%
Docling	4-5x slower	10x higher	1,032MB	85%
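
The "14x smaller" figure quoted earlier follows from the Size column. A short check, using only the numbers in the table above (the function name `size_ratios` is illustrative):

```python
# Install sizes in MB, copied from the Size column above.
sizes_mb = {"Kreuzberg": 71, "Unstructured": 146, "MarkItDown": 251, "Docling": 1032}


def size_ratios(sizes, baseline="Kreuzberg"):
    """Return each library's install size as a multiple of the baseline's size."""
    base = sizes[baseline]
    return {name: round(mb / base, 1) for name, mb in sizes.items() if name != baseline}


print(size_ratios(sizes_mb))
# {'Unstructured': 2.1, 'MarkItDown': 3.5, 'Docling': 14.5}
```

The 14x figure therefore applies to the largest library in the comparison; the other two are roughly 2x and 3.5x larger.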

Rule of thumb: Use the async API for complex documents and batch processing (up to 4.5x faster).
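
One common way to apply the async API to a batch is to run the extractions concurrently while capping how many are in flight. The sketch below shows that pattern with `asyncio.gather` and a semaphore. `fake_extract` is a stand-in for `extract_file` so the block runs without any documents, and `extract_many` and `max_concurrency` are hypothetical names, not part of Kreuzberg.

```python
import asyncio


async def fake_extract(path):
    """Stand-in for kreuzberg.extract_file so this sketch runs without documents."""
    await asyncio.sleep(0.01)  # simulates waiting on extraction work
    return f"text of {path}"


async def extract_many(paths, extract=fake_extract, max_concurrency=4):
    """Run `extract` over all paths concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path):
        async with semaphore:
            return await extract(path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(bounded(path) for path in paths))


results = asyncio.run(extract_many(["a.pdf", "b.docx", "c.png"]))
print(results)
```

To use it with real files, pass `extract=extract_file` (imported as in Basic Usage) and read `.content` from each result.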

Documentation

Quick Links

Installation Guide - Setup and dependencies
User Guide - Comprehensive usage guide
API Reference - Complete API documentation
Docker Guide - Container deployment
REST API - HTTP endpoints
CLI Guide - Command-line usage
OCR Configuration - OCR engine setup

Advanced Features

📊 Table Extraction: Extract tables from PDFs with GMFT
🧩 Content Chunking: Split documents for RAG applications
🎯 Custom Extractors: Extend with your own document handlers
🔧 Configuration: Flexible TOML-based configuration
🪝 Hooks: Pre/post-processing customization
🌍 Multi-language OCR: 100+ languages supported
⚙️ Metadata Extraction: Rich document metadata
🔄 Batch Processing: Efficient bulk document processing
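
To illustrate what content chunking for RAG means in practice, here is a minimal sketch of fixed-size chunking with overlap applied to already extracted text. It is not Kreuzberg's chunking implementation; `chunk_text`, `max_chars`, and `overlap` are hypothetical names chosen for this example.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into chunks of at most max_chars, each sharing `overlap` chars with the previous one."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", max_chars=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap keeps sentences that straddle a boundary visible in both neighboring chunks, at the cost of some duplicated text.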

License

MIT License - see LICENSE for details.

Made with ❤️ by the Kreuzberg contributors

Release history

4.7.2

Apr 4, 2026

4.7.1

Apr 3, 2026

4.7.0

Apr 3, 2026

4.6.3

Mar 27, 2026

4.6.2

Mar 26, 2026

4.6.1

Mar 25, 2026

4.6.0

Mar 24, 2026

4.5.4

Mar 23, 2026

4.5.3

Mar 22, 2026

4.5.2

Mar 21, 2026

4.5.1

Mar 20, 2026

4.4.6

Mar 13, 2026

4.4.5

Mar 10, 2026

4.4.4

Mar 7, 2026

4.4.3

Mar 6, 2026

4.4.2

Mar 4, 2026

4.4.1

Feb 28, 2026

4.4.0

Feb 27, 2026

4.3.8

Feb 21, 2026

4.3.7

Feb 20, 2026

4.3.6

Feb 19, 2026

4.3.5

Feb 17, 2026

4.3.4

Feb 16, 2026

4.3.3

Feb 14, 2026

4.3.2

Feb 13, 2026

4.3.1

Feb 12, 2026

4.3.0

Feb 11, 2026

4.2.15

Feb 8, 2026

4.2.14

Feb 7, 2026

4.2.13

Feb 7, 2026

4.2.12

Feb 6, 2026

4.2.11

Feb 6, 2026

4.2.10

Feb 5, 2026

4.2.9

Feb 3, 2026

4.2.8

Feb 2, 2026

4.2.7

Feb 1, 2026

4.2.6

Jan 31, 2026

4.2.5

Jan 30, 2026

4.2.4

Jan 29, 2026

4.2.3

Jan 29, 2026

4.2.2

Jan 28, 2026

4.2.1

Jan 27, 2026

4.2.0

Jan 26, 2026

4.1.2

Jan 25, 2026

4.1.1

Jan 23, 2026

4.1.0

Jan 22, 2026

4.0.8

Jan 17, 2026

4.0.7

Jan 16, 2026

4.0.6

Jan 14, 2026

4.0.5

Jan 14, 2026

4.0.4

Jan 13, 2026

4.0.3

Jan 13, 2026

4.0.2

Jan 12, 2026

4.0.1

Jan 11, 2026

4.0.0

Jan 11, 2026

4.0.0rc29 pre-release

Jan 9, 2026

4.0.0rc28 pre-release

Jan 7, 2026

4.0.0rc27 pre-release

Jan 4, 2026

4.0.0rc26 pre-release

Jan 3, 2026

4.0.0rc25 pre-release

Jan 3, 2026

4.0.0rc24 pre-release

Jan 1, 2026

4.0.0rc23 pre-release

Dec 30, 2025

4.0.0rc22 pre-release

Dec 28, 2025

4.0.0rc21 pre-release

Dec 26, 2025

4.0.0rc20 pre-release

Dec 25, 2025

4.0.0rc19 pre-release

Dec 24, 2025

4.0.0rc18 pre-release

Dec 23, 2025

4.0.0rc17 pre-release

Dec 22, 2025

4.0.0rc16 pre-release

Dec 21, 2025

4.0.0rc15 pre-release

Dec 20, 2025

4.0.0rc14 pre-release

Dec 20, 2025

4.0.0rc13 pre-release

Dec 19, 2025

4.0.0rc12 pre-release

Dec 19, 2025

4.0.0rc11 pre-release

Dec 19, 2025

4.0.0rc10 pre-release

Dec 17, 2025

4.0.0rc9 pre-release

Dec 15, 2025

4.0.0rc8 pre-release

Dec 14, 2025

4.0.0rc7 pre-release

Dec 12, 2025

4.0.0rc6 pre-release

Dec 10, 2025

4.0.0rc2 pre-release

Nov 30, 2025

4.0.0rc1 pre-release

Nov 23, 2025

3.22.0

Nov 27, 2025

3.21.0

Nov 5, 2025

3.20.2

Oct 11, 2025

3.20.1

Oct 11, 2025

3.20.0

Oct 11, 2025

3.19.1

Sep 30, 2025

3.19.0

Sep 29, 2025

3.18.0

Sep 27, 2025

3.17.3

Sep 23, 2025

3.17.2

Sep 22, 2025

3.17.1

Sep 19, 2025

3.17.0

Sep 17, 2025

3.16.0

Sep 16, 2025

3.15.0

Sep 14, 2025

3.14.1

Sep 13, 2025

3.14.0

Sep 13, 2025

3.13.3

Sep 10, 2025

3.13.2

Sep 4, 2025

3.13.1

Sep 4, 2025

3.13.0

Sep 4, 2025

3.11.4

Aug 24, 2025

3.11.3

Aug 24, 2025

3.11.2

Aug 15, 2025

3.11.1

Aug 13, 2025

3.11.0

Aug 1, 2025

3.10.1

Jul 31, 2025

3.10.0

Jul 29, 2025

3.9.1

Jul 29, 2025

3.9.0

Jul 17, 2025

3.8.2

Jul 13, 2025

3.8.1

Jul 13, 2025

3.8.0

Jul 12, 2025

3.7.0

Jul 11, 2025

3.6.2

Jul 11, 2025

3.6.1

Jul 4, 2025

3.6.0

Jul 4, 2025

3.5.0

Jul 4, 2025

3.4.2

Jul 3, 2025

3.4.1 (this version)

Jul 3, 2025

3.4.0

Jul 3, 2025

3.3.0

Jul 2, 2025

3.2.0

Jun 23, 2025

3.1.7

Jun 9, 2025

3.1.6

May 26, 2025

3.1.5

May 13, 2025

3.1.4

Apr 26, 2025

3.1.3

Apr 10, 2025

3.1.2

Apr 8, 2025

3.1.1

Apr 2, 2025

3.1.0

Mar 28, 2025

3.0.1

Mar 26, 2025

3.0.0

Mar 23, 2025

2.1.2

Mar 1, 2025

2.1.1

Mar 1, 2025

2.1.0

Feb 20, 2025

2.0.1

Feb 15, 2025

2.0.0

Feb 15, 2025

1.7.0

Feb 14, 2025

1.6.0

Feb 9, 2025

1.5.0

Feb 8, 2025

1.4.0

Feb 8, 2025

1.3.0

Feb 3, 2025

1.2.0

Feb 2, 2025

1.1.0

Feb 1, 2025

1.0.0

Feb 1, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kreuzberg-3.4.1.tar.gz (9.4 MB)

Uploaded Jul 3, 2025 Source

Built Distribution

kreuzberg-3.4.1-py3-none-any.whl (85.4 kB)

Uploaded Jul 3, 2025 Python 3

File details

Details for the file kreuzberg-3.4.1.tar.gz.

File metadata

Download URL: kreuzberg-3.4.1.tar.gz
Upload date: Jul 3, 2025
Size: 9.4 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kreuzberg-3.4.1.tar.gz
Algorithm	Hash digest
SHA256	`fbc96bf34a46c3c47d731cafd305a5b30f771bb305254cafb6cdfac95a53d6e3`
MD5	`fef0c3f6e71b080e8a6b03b4b9862343`
BLAKE2b-256	`6f828bee89691f020ef25f620b25b3097089f87103d4dcf5cdff625c6a75a0f3`
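
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; `sha256_of` and `verify` are illustrative helper names, and the local path passed to `verify` is whatever location the file was saved to.

```python
import hashlib

# SHA256 digest listed above for kreuzberg-3.4.1.tar.gz
EXPECTED_SHA256 = "fbc96bf34a46c3c47d731cafd305a5b30f771bb305254cafb6cdfac95a53d6e3"


def sha256_of(path, block_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("kreuzberg-3.4.1.tar.gz")` should return `True` for an intact download of the source distribution.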

Provenance

The following attestation bundles were made for kreuzberg-3.4.1.tar.gz:

Publisher: release.yaml on Goldziher/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-3.4.1.tar.gz
- Subject digest: fbc96bf34a46c3c47d731cafd305a5b30f771bb305254cafb6cdfac95a53d6e3
- Sigstore transparency entry: 261632320
- Sigstore integration time: Jul 3, 2025
Source repository:
- Permalink: Goldziher/kreuzberg@d708c161ede7c4bdd718bd22aa5bbd3c852d7450
- Branch / Tag: refs/tags/v3.4.1
- Owner: https://github.com/Goldziher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@d708c161ede7c4bdd718bd22aa5bbd3c852d7450
- Trigger Event: release

File details

Details for the file kreuzberg-3.4.1-py3-none-any.whl.

File metadata

Download URL: kreuzberg-3.4.1-py3-none-any.whl
Upload date: Jul 3, 2025
Size: 85.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kreuzberg-3.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2e721fb7e4abe36d95984b4801ca89361e20caa1d2d8930818323d115cc08af6`
MD5	`19e9932d98d38ca67dacede6731f6fe9`
BLAKE2b-256	`0cb946f314708d349a9858d0364a33fa40b3a373910b7c3133188c948ec89c5f`

Provenance

The following attestation bundles were made for kreuzberg-3.4.1-py3-none-any.whl:

Publisher: release.yaml on Goldziher/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-3.4.1-py3-none-any.whl
- Subject digest: 2e721fb7e4abe36d95984b4801ca89361e20caa1d2d8930818323d115cc08af6
- Sigstore transparency entry: 261632322
- Sigstore integration time: Jul 3, 2025
Source repository:
- Permalink: Goldziher/kreuzberg@d708c161ede7c4bdd718bd22aa5bbd3c852d7450
- Branch / Tag: refs/tags/v3.4.1
- Owner: https://github.com/Goldziher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@d708c161ede7c4bdd718bd22aa5bbd3c852d7450
- Trigger Event: release

kreuzberg 3.4.1
