Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats

These details have been verified by PyPI

Project links

homepage

GitHub Statistics

Maintainers

nhirschfeld

These details have not been verified by PyPI

Project links

documentation

Project description

Kreuzberg

A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.

📖 Complete Documentation

Framework Overview

Document Intelligence Capabilities

Text Extraction: High-fidelity text extraction preserving document structure and formatting
Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
OCR Integration: Tesseract OCR with markdown output (default) and table extraction from scanned documents
Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)

Technical Architecture

Performance: Highest throughput among the Python document processing frameworks benchmarked (about 32 files/second on tiny files with the synchronous API)
Resource Efficiency: 71 MB installation, ~360 MB runtime memory footprint (synchronous API)
Extensibility: Plugin architecture for custom extractors via the Extractor base class
API Design: Synchronous and asynchronous APIs with consistent interfaces
Type Safety: Complete type annotations throughout the codebase
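
Because the asynchronous API returns awaitables, several files can be extracted concurrently with plain asyncio. The following is a minimal sketch rather than part of Kreuzberg: extract_many, its limit parameter, and the injectable extractor argument are hypothetical names introduced for illustration. Only extract_file comes from the library.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Sequence


async def extract_many(
    paths: Sequence[str],
    extractor: Optional[Callable[[str], Awaitable[Any]]] = None,
    limit: int = 4,
) -> list:
    """Extract several files concurrently, at most `limit` at a time."""
    if extractor is None:
        # Imported lazily so the helper can be tried without Kreuzberg installed.
        from kreuzberg import extract_file as extractor

    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> Any:
        # The semaphore caps how many extractions run at once.
        async with semaphore:
            return await extractor(path)

    # gather() returns results in the same order as `paths`.
    return await asyncio.gather(*(worker(p) for p in paths))
```

A call such as asyncio.run(extract_many(["a.pdf", "b.docx"])) would then return one result per input path. The concurrency limit of 4 is an arbitrary default, not a recommendation from the project.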

Open Source Foundation

Kreuzberg leverages established open source technologies:

Pandoc: Universal document converter for robust format support
PDFium: Google's PDF rendering engine for accurate PDF processing
Tesseract: Open source OCR engine for text recognition
python-docx / python-pptx: Native Microsoft Office format support

Quick Start

Extract Text with CLI

# Extract text from any file to text format
uvx kreuzberg extract document.pdf > output.txt

# Select the OCR backend and output format explicitly
uvx kreuzberg extract invoice.pdf --ocr-backend tesseract --output-format text

# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --output-format json
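
The CLI can also be driven from a script. The sketch below assembles the same command line as the last example and parses its output. The helper names build_extract_command and extract_json are hypothetical, and the structure of the JSON output is not described here, so the parsed value is returned unchanged.

```python
import json
import subprocess
from typing import Any, List


def build_extract_command(
    path: str, show_metadata: bool = False, output_format: str = "text"
) -> List[str]:
    """Assemble the argument list for `uvx kreuzberg extract`."""
    command = ["uvx", "kreuzberg", "extract", path]
    if show_metadata:
        command.append("--show-metadata")
    command += ["--output-format", output_format]
    return command


def extract_json(path: str) -> Any:
    """Run the CLI and parse its standard output as JSON."""
    command = build_extract_command(path, show_metadata=True, output_format="json")
    # check=True raises CalledProcessError if the CLI exits with an error.
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.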

Python Usage

Async (recommended for web apps):

import asyncio

from kreuzberg import extract_file


async def main() -> None:
    # extract_file must be awaited inside an async function
    result = await extract_file("presentation.pptx")
    print(result.content)

    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")


asyncio.run(main())

Sync (for scripts and CLI tools):

from kreuzberg import extract_file_sync

result = extract_file_sync("report.docx")
print(result.content)

# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
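
For scripts, the synchronous call fits naturally into a loop over a directory. This is a minimal sketch under stated assumptions: extract_directory, its suffixes default, and the injectable extractor argument are hypothetical, and only the content attribute of the result is used.

```python
from pathlib import Path
from typing import Any, Callable, Iterable, List, Optional


def extract_directory(
    source_dir: str,
    suffixes: Iterable[str] = (".pdf", ".docx"),
    extractor: Optional[Callable[[str], Any]] = None,
) -> List[Path]:
    """Extract every matching file in `source_dir` and save the text beside it."""
    if extractor is None:
        # Imported lazily so the helper can be tried without Kreuzberg installed.
        from kreuzberg import extract_file_sync as extractor

    wanted = {s.lower() for s in suffixes}
    written = []
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        result = extractor(str(path))
        # report.docx becomes report.docx.txt, so earlier output is never re-read.
        target = path.with_name(path.name + ".txt")
        target.write_text(result.content, encoding="utf-8")
        written.append(target)
    return written
```

The function returns the paths it wrote, which makes it easy to log or post-process the output.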

Docker

Two optimized images are available:

# Base image (API + CLI + multilingual OCR)
docker run -p 8000:8000 goldziher/kreuzberg

# Core image (+ chunking + crypto + document classification + language detection)
docker run -p 8000:8000 goldziher/kreuzberg-core:latest

# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
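
The same upload can be made from Python with only the standard library. The sketch below mirrors the curl call (form field "file", POST to /extract on port 8000). The helper names are hypothetical, and treating the response as JSON is an assumption, since the response format is not described here.

```python
import json
import os
import urllib.request
import uuid
from typing import Any, Tuple


def build_multipart(field: str, filename: str, payload: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, content type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_via_api(path: str, url: str = "http://localhost:8000/extract") -> Any:
    """POST a file to the extraction endpoint and decode the reply."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart("file", os.path.basename(path), handle.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the server answers with a JSON document.
        return json.loads(response.read().decode("utf-8"))
```

With one of the containers running, extract_via_api("document.pdf") would send the same request as the curl command.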

📖 Installation Guide • CLI Documentation • API Reference

Deployment Options

🤖 MCP Server (AI Integration)

Add with one command using the Claude Code CLI:

claude mcp add kreuzberg uvx kreuzberg-mcp

Or, for Claude Desktop, configure manually in claude_desktop_config.json:

{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["kreuzberg-mcp"]
    }
  }
}
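
If the configuration file already lists other servers, the entry has to be merged rather than pasted over the file. A minimal sketch of that merge follows. The function name add_kreuzberg_server is hypothetical, and the file location is passed in by the caller because it differs between platforms.

```python
import json
from pathlib import Path


def add_kreuzberg_server(config_path: str) -> dict:
    """Add the Kreuzberg MCP entry to a config file, keeping other entries."""
    path = Path(config_path)
    # Start from the existing configuration when the file is already there.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["kreuzberg"] = {"command": "uvx", "args": ["kreuzberg-mcp"]}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Other entries under mcpServers are left untouched, and running the function twice produces the same file.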

MCP capabilities:

Extract text from PDFs, images, Office docs, and more
Multilingual OCR support with Tesseract
Metadata parsing and language detection

📖 MCP Documentation

Supported Formats

Category	Formats
Documents	PDF, DOCX, DOC, RTF, TXT, EPUB
Images	JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets	XLSX, XLS, CSV, ODS
Presentations	PPTX, PPT, ODP
Web	HTML, XML, MHTML
Archives	Support via extraction
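
When filtering input files before extraction, the table above can be turned into a simple lookup. This is an illustrative sketch, not part of the Kreuzberg API: FORMATS and category_for are hypothetical names, and only the extensions printed in the table are included (no aliases such as .jpeg or .htm).

```python
from pathlib import Path
from typing import Optional

# Transcribed from the table above. Archives are omitted because the table
# lists no specific extensions for them.
FORMATS = {
    "Documents": {"pdf", "docx", "doc", "rtf", "txt", "epub"},
    "Images": {"jpg", "png", "tiff", "bmp", "gif", "webp"},
    "Spreadsheets": {"xlsx", "xls", "csv", "ods"},
    "Presentations": {"pptx", "ppt", "odp"},
    "Web": {"html", "xml", "mhtml"},
}


def category_for(filename: str) -> Optional[str]:
    """Return the table category for a file name, or None if it is not listed."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in FORMATS.items():
        if suffix in extensions:
            return category
    return None
```

The lookup is case-insensitive, so scan.TIFF and scan.tiff fall into the same category.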

📊 Performance Characteristics

View comprehensive benchmarks • Benchmark methodology • Detailed Analysis

Technical Specifications

Metric	Kreuzberg Sync	Kreuzberg Async	Benchmark standing
Throughput (tiny files)	31.78 files/s	23.94 files/s	Highest throughput
Throughput (small files)	8.91 files/s	9.31 files/s	Highest throughput
Memory footprint	359.8 MB	395.2 MB	Lowest usage
Installation size	71 MB	71 MB	Smallest size
Success rate	100%	100%	Perfect
Supported formats	18	18	Comprehensive
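
To put the throughput figures in perspective, they can be converted into rough processing times. The sketch below assumes throughput stays constant as the batch grows, which real workloads may not satisfy. THROUGHPUT and estimated_seconds are hypothetical names, and the numbers are copied from the table above.

```python
# Throughput figures from the table above, in files per second.
THROUGHPUT = {
    ("tiny", "sync"): 31.78,
    ("tiny", "async"): 23.94,
    ("small", "sync"): 8.91,
    ("small", "async"): 9.31,
}


def estimated_seconds(file_count: int, size_class: str, api: str) -> float:
    """Estimate wall-clock seconds as file_count / throughput."""
    return file_count / THROUGHPUT[(size_class, api)]


for (size_class, api), rate in THROUGHPUT.items():
    seconds = estimated_seconds(1000, size_class, api)
    print(f"1000 {size_class} files, {api} ({rate} files/s): about {seconds:.1f} s")
```

Under that assumption, 1,000 tiny files take roughly half a minute with the synchronous API, while 1,000 small files take a little under two minutes with either API.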

Architecture Advantages

Native C extensions: Built on PDFium and Tesseract for maximum performance
Async/await support: True asynchronous processing with intelligent task scheduling
Memory efficiency: Streaming architecture minimizes memory allocation
Process pooling: Automatic multiprocessing for CPU-intensive operations
Optimized data flow: Efficient data handling with minimal transformations

Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.

Documentation

License

MIT License - see LICENSE for details.

Release history

4.7.2

Apr 4, 2026

4.7.1

Apr 3, 2026

4.7.0

Apr 3, 2026

4.6.3

Mar 27, 2026

4.6.2

Mar 26, 2026

4.6.1

Mar 25, 2026

4.6.0

Mar 24, 2026

4.5.4

Mar 23, 2026

4.5.3

Mar 22, 2026

4.5.2

Mar 21, 2026

4.5.1

Mar 20, 2026

4.4.6

Mar 13, 2026

4.4.5

Mar 10, 2026

4.4.4

Mar 7, 2026

4.4.3

Mar 6, 2026

4.4.2

Mar 4, 2026

4.4.1

Feb 28, 2026

4.4.0

Feb 27, 2026

4.3.8

Feb 21, 2026

4.3.7

Feb 20, 2026

4.3.6

Feb 19, 2026

4.3.5

Feb 17, 2026

4.3.4

Feb 16, 2026

4.3.3

Feb 14, 2026

4.3.2

Feb 13, 2026

4.3.1

Feb 12, 2026

4.3.0

Feb 11, 2026

4.2.15

Feb 8, 2026

4.2.14

Feb 7, 2026

4.2.13

Feb 7, 2026

4.2.12

Feb 6, 2026

4.2.11

Feb 6, 2026

4.2.10

Feb 5, 2026

4.2.9

Feb 3, 2026

4.2.8

Feb 2, 2026

4.2.7

Feb 1, 2026

4.2.6

Jan 31, 2026

4.2.5

Jan 30, 2026

4.2.4

Jan 29, 2026

4.2.3

Jan 29, 2026

4.2.2

Jan 28, 2026

4.2.1

Jan 27, 2026

4.2.0

Jan 26, 2026

4.1.2

Jan 25, 2026

4.1.1

Jan 23, 2026

4.1.0

Jan 22, 2026

4.0.8

Jan 17, 2026

4.0.7

Jan 16, 2026

4.0.6

Jan 14, 2026

4.0.5

Jan 14, 2026

4.0.4

Jan 13, 2026

4.0.3

Jan 13, 2026

4.0.2

Jan 12, 2026

4.0.1

Jan 11, 2026

4.0.0

Jan 11, 2026

4.0.0rc29 pre-release

Jan 9, 2026

4.0.0rc28 pre-release

Jan 7, 2026

4.0.0rc27 pre-release

Jan 4, 2026

4.0.0rc26 pre-release

Jan 3, 2026

4.0.0rc25 pre-release

Jan 3, 2026

4.0.0rc24 pre-release

Jan 1, 2026

4.0.0rc23 pre-release

Dec 30, 2025

4.0.0rc22 pre-release

Dec 28, 2025

4.0.0rc21 pre-release

Dec 26, 2025

4.0.0rc20 pre-release

Dec 25, 2025

4.0.0rc19 pre-release

Dec 24, 2025

4.0.0rc18 pre-release

Dec 23, 2025

4.0.0rc17 pre-release

Dec 22, 2025

4.0.0rc16 pre-release

Dec 21, 2025

4.0.0rc15 pre-release

Dec 20, 2025

4.0.0rc14 pre-release

Dec 20, 2025

4.0.0rc13 pre-release

Dec 19, 2025

4.0.0rc12 pre-release

Dec 19, 2025

4.0.0rc11 pre-release

Dec 19, 2025

4.0.0rc10 pre-release

Dec 17, 2025

4.0.0rc9 pre-release

Dec 15, 2025

4.0.0rc8 pre-release

Dec 14, 2025

4.0.0rc7 pre-release

Dec 12, 2025

4.0.0rc6 pre-release

Dec 10, 2025

4.0.0rc2 pre-release

Nov 30, 2025

4.0.0rc1 pre-release

Nov 23, 2025

3.22.0

Nov 27, 2025

3.21.0

Nov 5, 2025

3.20.2

Oct 11, 2025

3.20.1

Oct 11, 2025

3.20.0

Oct 11, 2025

3.19.1

Sep 30, 2025

3.19.0

Sep 29, 2025

3.18.0

Sep 27, 2025

3.17.3

Sep 23, 2025

3.17.2

Sep 22, 2025

3.17.1

Sep 19, 2025

3.17.0

Sep 17, 2025

3.16.0

Sep 16, 2025

3.15.0

Sep 14, 2025

3.14.1

Sep 13, 2025

3.14.0

Sep 13, 2025

3.13.3

Sep 10, 2025

3.13.2 (this version)

Sep 4, 2025

3.13.1

Sep 4, 2025

3.13.0

Sep 4, 2025

3.11.4

Aug 24, 2025

3.11.3

Aug 24, 2025

3.11.2

Aug 15, 2025

3.11.1

Aug 13, 2025

3.11.0

Aug 1, 2025

3.10.1

Jul 31, 2025

3.10.0

Jul 29, 2025

3.9.1

Jul 29, 2025

3.9.0

Jul 17, 2025

3.8.2

Jul 13, 2025

3.8.1

Jul 13, 2025

3.8.0

Jul 12, 2025

3.7.0

Jul 11, 2025

3.6.2

Jul 11, 2025

3.6.1

Jul 4, 2025

3.6.0

Jul 4, 2025

3.5.0

Jul 4, 2025

3.4.2

Jul 3, 2025

3.4.1

Jul 3, 2025

3.4.0

Jul 3, 2025

3.3.0

Jul 2, 2025

3.2.0

Jun 23, 2025

3.1.7

Jun 9, 2025

3.1.6

May 26, 2025

3.1.5

May 13, 2025

3.1.4

Apr 26, 2025

3.1.3

Apr 10, 2025

3.1.2

Apr 8, 2025

3.1.1

Apr 2, 2025

3.1.0

Mar 28, 2025

3.0.1

Mar 26, 2025

3.0.0

Mar 23, 2025

2.1.2

Mar 1, 2025

2.1.1

Mar 1, 2025

2.1.0

Feb 20, 2025

2.0.1

Feb 15, 2025

2.0.0

Feb 15, 2025

1.7.0

Feb 14, 2025

1.6.0

Feb 9, 2025

1.5.0

Feb 8, 2025

1.4.0

Feb 8, 2025

1.3.0

Feb 3, 2025

1.2.0

Feb 2, 2025

1.1.0

Feb 1, 2025

1.0.0

Feb 1, 2025

Download files

Download the file for your platform.

Source Distribution

kreuzberg-3.13.2.tar.gz (9.9 MB)

Uploaded Sep 4, 2025 Source

Built Distribution


kreuzberg-3.13.2-py3-none-any.whl (104.5 kB)

Uploaded Sep 4, 2025 Python 3

File details

Details for the file kreuzberg-3.13.2.tar.gz.

File metadata

Download URL: kreuzberg-3.13.2.tar.gz
Upload date: Sep 4, 2025
Size: 9.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kreuzberg-3.13.2.tar.gz
Algorithm	Hash digest
SHA256	`bf1f6f28691b89a07f0292ae2af3f70d30617843d8afc4bbbd0b9d6f46d65bee`
MD5	`f152a518b1f1012c46d5a45dc81369af`
BLAKE2b-256	`2b6b6f2e4a0a2e31faa4fa0a4b8b10593d3e0ff2ee00290c6be35bd031d20bbb`
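
The digests listed above can be checked locally after downloading the archive. The sketch below uses Python's hashlib. file_digest and verify are hypothetical helper names, and BLAKE2b-256 is computed as BLAKE2b with a 32-byte digest.

```python
import hashlib

# SHA256 digest copied from the table above for kreuzberg-3.13.2.tar.gz.
EXPECTED_SHA256 = "bf1f6f28691b89a07f0292ae2af3f70d30617843d8afc4bbbd0b9d6f46d65bee"


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    if algorithm == "blake2b-256":
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 digest matches the expected value."""
    return file_digest(path) == expected.lower()
```

Calling verify("kreuzberg-3.13.2.tar.gz") on the downloaded file should return True if the download is intact. A mismatch means the file should not be installed.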


Provenance

The following attestation bundles were made for kreuzberg-3.13.2.tar.gz:

Publisher: release.yaml on Goldziher/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-3.13.2.tar.gz
- Subject digest: bf1f6f28691b89a07f0292ae2af3f70d30617843d8afc4bbbd0b9d6f46d65bee
- Sigstore transparency entry: 469620507
- Sigstore integration time: Sep 4, 2025
Source repository:
- Permalink: Goldziher/kreuzberg@f15c826168be019346b2df4d916c7ba7c18618f6
- Branch / Tag: refs/tags/v3.13.2
- Owner: https://github.com/Goldziher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@f15c826168be019346b2df4d916c7ba7c18618f6
- Trigger Event: release

File details

Details for the file kreuzberg-3.13.2-py3-none-any.whl.

File metadata

Download URL: kreuzberg-3.13.2-py3-none-any.whl
Upload date: Sep 4, 2025
Size: 104.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kreuzberg-3.13.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f7d1f60ce81f239e08b1c5897f5c237dde867c1d7e2844e93140edbaea46b4c7`
MD5	`8cff12ecb39a9a44eb50753e50934811`
BLAKE2b-256	`1f9b635b9483bea4d0c94bfb0ec8cc78a27f5650f561f3ad20cb429f62e9ac3f`


Provenance

The following attestation bundles were made for kreuzberg-3.13.2-py3-none-any.whl:

Publisher: release.yaml on Goldziher/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-3.13.2-py3-none-any.whl
- Subject digest: f7d1f60ce81f239e08b1c5897f5c237dde867c1d7e2844e93140edbaea46b4c7
- Sigstore transparency entry: 469620518
- Sigstore integration time: Sep 4, 2025
Source repository:
- Permalink: Goldziher/kreuzberg@f15c826168be019346b2df4d916c7ba7c18618f6
- Branch / Tag: refs/tags/v3.13.2
- Owner: https://github.com/Goldziher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@f15c826168be019346b2df4d916c7ba7c18618f6
- Trigger Event: release

kreuzberg 3.13.2
