A text extraction library supporting PDFs, images, office documents and more
Kreuzberg
Kreuzberg is a Python library for text extraction from documents. It provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs.
Why Kreuzberg?
- Simple and Hassle-Free: Clean API that just works, without complex configuration
- Local Processing: No external API calls or cloud dependencies required
- Resource Efficient: Lightweight processing without GPU requirements
- Format Support: Comprehensive support for documents, images, and text formats
- Multiple OCR Engines: Support for Tesseract, EasyOCR, and PaddleOCR
- Command Line Interface: Powerful CLI for batch processing and automation
- Metadata Extraction: Get document metadata alongside text content
- Table Extraction: Extract tables from documents using the excellent GMFT library
- Modern Python: Built with async/await, type hints, and a functional-first approach
- Permissive OSS: MIT licensed with permissively licensed dependencies
Quick Start
pip install kreuzberg
# Or install with CLI support
pip install "kreuzberg[cli]"
Install the system dependencies, Tesseract OCR and Pandoc:
# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc
# macOS
brew install tesseract pandoc
# Windows
choco install -y tesseract pandoc
Tesseract is the default OCR engine. If you prefer not to use it, you can either switch to one of the two alternative OCR engines or run without OCR.
Alternative OCR engines
# Install with EasyOCR support
pip install "kreuzberg[easyocr]"
# Install with PaddleOCR support
pip install "kreuzberg[paddleocr]"
Quick Example
import asyncio
from kreuzberg import extract_file


async def main():
    # Extract text from a PDF
    result = await extract_file("document.pdf")
    print(result.content)

    # Extract text from an image
    result = await extract_file("scan.jpg")
    print(result.content)

    # Extract text from a Word document
    result = await extract_file("report.docx")
    print(result.content)


asyncio.run(main())
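The quick example awaits each file in turn. When the files are independent, the same calls can be scheduled concurrently. The sketch below is a minimal illustration of that idea, not part of Kreuzberg: `extract_many` is a hypothetical helper, and `fake_extract` is a stand-in coroutine used so the snippet runs without any documents on disk.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[object]],
) -> list:
    """Run one extraction per path concurrently, preserving input order."""
    return await asyncio.gather(*(extract(path) for path in paths))


async def fake_extract(path: str) -> str:
    """Hypothetical stand-in for an async extractor, used only for this demo."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"text from {path}"


if __name__ == "__main__":
    paths = ["document.pdf", "scan.jpg", "report.docx"]
    for text in asyncio.run(extract_many(paths, fake_extract)):
        print(text)
```

In real use, pass `extract_file` in place of `fake_extract`; each result then exposes `.content` as in the example above. For larger sets of files, Kreuzberg's own batch operations are likely the better fit.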
Command Line Interface
Kreuzberg includes a powerful CLI for processing documents from the command line:
# Extract text from a file
kreuzberg extract document.pdf
# Extract with JSON output and metadata
kreuzberg extract document.pdf --output-format json --show-metadata
# Extract from stdin
cat document.html | kreuzberg extract
# Use specific OCR backend
kreuzberg extract image.png --ocr-backend easyocr --easyocr-languages en,de
# Extract with configuration file
kreuzberg extract document.pdf --config config.toml
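For automation, the same commands can be assembled from Python and handed to `subprocess`. The helper below is a hypothetical sketch that only uses flags shown in the examples above (`--output-format`, `--show-metadata`, `--config`); it builds the argument list and does not run the CLI itself.

```python
import shlex
from typing import List, Optional


def build_extract_command(
    path: str,
    output_format: Optional[str] = None,
    show_metadata: bool = False,
    config: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `kreuzberg extract` from the flags shown above."""
    argv = ["kreuzberg", "extract", path]
    if output_format is not None:
        argv += ["--output-format", output_format]
    if show_metadata:
        argv.append("--show-metadata")
    if config is not None:
        argv += ["--config", config]
    return argv


command = build_extract_command("document.pdf", output_format="json", show_metadata=True)
print(shlex.join(command))
# prints: kreuzberg extract document.pdf --output-format json --show-metadata
```

With the CLI installed, the list can be passed to `subprocess.run(command, capture_output=True, text=True, check=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.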
CLI Configuration
Configure via pyproject.toml:
[tool.kreuzberg]
force_ocr = true
chunk_content = false
extract_tables = true
max_chars = 4000
ocr_backend = "tesseract"
[tool.kreuzberg.tesseract]
language = "eng+deu"
psm = 3
For full CLI documentation, see the CLI Guide.
Documentation
For comprehensive documentation, visit our GitHub Pages:
- Getting Started - Installation and basic usage
- User Guide - In-depth usage information
- CLI Guide - Command-line interface documentation
- API Reference - Detailed API documentation
- Examples - Code examples for common use cases
- OCR Configuration - Configure OCR engines
- OCR Backends - Choose the right OCR engine
Supported Formats
Kreuzberg supports a wide range of document formats:
- Documents: PDF, DOCX, RTF, TXT, EPUB, etc.
- Images: JPG, PNG, TIFF, BMP, GIF, etc.
- Spreadsheets: XLSX, XLS, CSV, etc.
- Presentations: PPTX, PPT, etc.
- Web Content: HTML, XML, etc.
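When sorting a mixed folder before extraction, it can help to group files by the categories above. The lookup below is an illustrative sketch: `CATEGORIES` and `category_for` are hypothetical names, the mapping only covers the extensions listed explicitly (not the "etc." entries), and it is not how Kreuzberg detects file types internally.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the list above; the "etc." entries are not covered.
CATEGORIES = {
    "Documents": {".pdf", ".docx", ".rtf", ".txt", ".epub"},
    "Images": {".jpg", ".png", ".tiff", ".bmp", ".gif"},
    "Spreadsheets": {".xlsx", ".xls", ".csv"},
    "Presentations": {".pptx", ".ppt"},
    "Web Content": {".html", ".xml"},
}


def category_for(path: str) -> Optional[str]:
    """Return the category for a path's extension, or None if it is not listed."""
    suffix = Path(path).suffix.lower()
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


for name in ["report.docx", "scan.JPG", "archive.zip"]:
    print(name, "->", category_for(name))
```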
OCR Engines
Kreuzberg supports multiple OCR engines:
- Tesseract (Default): Lightweight, fast startup, requires system installation
- EasyOCR: Good for many languages, pure Python, but downloads models on first use
- PaddleOCR: Excellent for Asian languages, pure Python, but downloads models on first use
For comparison and selection guidance, see the OCR Backends documentation.
Performance
Kreuzberg offers both sync and async APIs. Choose the right one based on your use case:
| Operation | Sync Time | Async Time | Async Advantage |
|---|---|---|---|
| Simple text (Markdown) | 0.4ms | 17.5ms | ❌ 41x slower |
| HTML documents | 1.6ms | 1.1ms | ✅ 1.5x faster |
| Complex PDFs | 39.0s | 8.5s | ✅ 4.6x faster |
| OCR processing | 0.7s | 0.4s | ✅ 1.7x faster |
| Batch operations | 38.6s | 8.5s | ✅ 4.5x faster |
Rule of thumb:
- Use sync for simple documents and CLI applications
- Use async for complex PDFs, OCR, and batch processing
- Use batch operations for multiple files
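The rule of thumb can be written down as a small decision helper. This is only a sketch of the three bullets above: `suggest_api` and the labels `"simple"`, `"complex_pdf"`, and `"ocr"` are hypothetical names chosen for illustration, not Kreuzberg identifiers.

```python
def suggest_api(kind: str, file_count: int = 1) -> str:
    """Map the rule of thumb above to one of "sync", "async", or "batch"."""
    if file_count > 1:
        # Multiple files: prefer batch operations.
        return "batch"
    if kind in {"complex_pdf", "ocr"}:
        # Heavy single-file work: prefer the async API.
        return "async"
    # Simple documents and CLI-style scripts: the sync API has less overhead.
    return "sync"


print(suggest_api("simple"))                      # sync
print(suggest_api("complex_pdf"))                 # async
print(suggest_api("complex_pdf", file_count=20))  # batch
```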
For detailed benchmarks and methodology, see our Performance Documentation.
Contributing
We welcome contributions! Please see our Contributing Guide for details on setting up your development environment and submitting pull requests.
License
This library is released under the MIT license.
File details
Details for the file kreuzberg-3.3.0.tar.gz.
File metadata
- Download URL: kreuzberg-3.3.0.tar.gz
- Upload date:
- Size: 9.3 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f297a789b2f40f75c13e1cff9da42ebcd4fbc84110f02f2a643c9bb819ddda21 |
| MD5 | 0e804c89b7c9ae1f4d7a1f3254cdce0d |
| BLAKE2b-256 | d5eb13bb19dcda2eb76730fe7989760d34a9bda3e7f2fb7a7257d44f143b4fd4 |
Provenance
The following attestation bundles were made for kreuzberg-3.3.0.tar.gz:
Publisher: release.yaml on Goldziher/kreuzberg
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-3.3.0.tar.gz
- Subject digest: f297a789b2f40f75c13e1cff9da42ebcd4fbc84110f02f2a643c9bb819ddda21
- Sigstore transparency entry: 260205612
- Permalink: Goldziher/kreuzberg@cf1954579eaca8ec932df1d8d8e81bf679211cfc
- Branch / Tag: refs/tags/v3.3.0
- Owner: https://github.com/Goldziher
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@cf1954579eaca8ec932df1d8d8e81bf679211cfc
- Trigger Event: release
File details
Details for the file kreuzberg-3.3.0-py3-none-any.whl.
File metadata
- Download URL: kreuzberg-3.3.0-py3-none-any.whl
- Upload date:
- Size: 84.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3080e51981b696354555ba79571d3bc68545555df27f0810076a839d80a167fa |
| MD5 | ca3219c58f0bc51dc34c8efb09afff0d |
| BLAKE2b-256 | 86c19d9120cf6ebb934c5162bd7d3ae1c36a2f834885d86ed4120064fbbb9159 |
Provenance
The following attestation bundles were made for kreuzberg-3.3.0-py3-none-any.whl:
Publisher: release.yaml on Goldziher/kreuzberg
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-3.3.0-py3-none-any.whl
- Subject digest: 3080e51981b696354555ba79571d3bc68545555df27f0810076a839d80a167fa
- Sigstore transparency entry: 260205627
- Permalink: Goldziher/kreuzberg@cf1954579eaca8ec932df1d8d8e81bf679211cfc
- Branch / Tag: refs/tags/v3.3.0
- Owner: https://github.com/Goldziher
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@cf1954579eaca8ec932df1d8d8e81bf679211cfc
- Trigger Event: release
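The SHA256 digests listed under File hashes can be checked against a downloaded file with the standard library. The helpers below are a minimal sketch; `sha256_of` and `matches_published_digest` are hypothetical names, and the file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest(Path("kreuzberg-3.3.0.tar.gz"), "f297a789b2f40f75c13e1cff9da42ebcd4fbc84110f02f2a643c9bb819ddda21")` should return `True` for an intact copy of the source distribution.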