High-performance document intelligence library for Python. Extract text, metadata, and structured data from PDFs, Office documents, images, and 50+ other formats. Powered by a Rust core for 10-50x speed improvements.
Python
Extract text, tables, images, and metadata from 56 file formats, including PDF, Office documents, and images. Native Python bindings with async/await support, multiple OCR backends (Tesseract, EasyOCR, PaddleOCR), and an extensible plugin system.
Installation
Package Installation
Install via pip:
pip install kreuzberg
For async support and additional features:
pip install kreuzberg[async]
System Requirements
- Python 3.10+ required
- Optional: ONNX Runtime version 1.22.x for embeddings support
- Optional: Tesseract OCR for OCR functionality
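The requirements above can be checked before installing. The following is a minimal sketch using only the standard library; it assumes the Tesseract binary is named `tesseract` and that ONNX Runtime would be visible as the `onnxruntime` Python module, neither of which is stated in this README. It does not verify the 1.22.x version constraint.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict[str, bool]:
    """Report which of the listed requirements appear to be met on this machine."""
    return {
        # Required: Python 3.10 or newer
        "python_3_10_plus": sys.version_info >= (3, 10),
        # Optional: Tesseract CLI on PATH (assumed binary name: "tesseract")
        "tesseract": shutil.which("tesseract") is not None,
        # Optional: ONNX Runtime (assumed module name: "onnxruntime")
        "onnxruntime": importlib.util.find_spec("onnxruntime") is not None,
    }


for name, available in check_environment().items():
    print(f"{name}: {'ok' if available else 'missing'}")
```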
Quick Start
Basic Extraction
Extract text, metadata, and structure from any supported document format:
import asyncio
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    config = ExtractionConfig(
        use_cache=True,
        enable_quality_processing=True
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())
Common Use Cases
Extract with Custom Configuration
Most use cases benefit from configuration to control extraction behavior:
With OCR (for scanned documents):
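import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig

async def main() -> None:
    # Force OCR so scanned pages without a text layer are still read
    config = ExtractionConfig(force_ocr=True, ocr=OcrConfig(backend="tesseract", language="eng"))
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())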
Table Extraction
import asyncio
from kreuzberg import extract_file

async def main() -> None:
    result = await extract_file("document.pdf")
    content: str = result.content
    tables: int = len(result.tables)
    format_type: str | None = result.metadata.format_type
    print(f"Content length: {len(content)} characters")
    print(f"Tables found: {tables}")
    print(f"Format: {format_type}")

asyncio.run(main())
Processing Multiple Files
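import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig

async def main() -> None:
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(
            backend="tesseract",
            language="eng",
            tesseract_config=TesseractConfig(psm=3)
        )
    )
    # Placeholder file names; extract them concurrently with one shared config
    paths = ["scanned_1.pdf", "scanned_2.pdf"]
    results = await asyncio.gather(*(extract_file(path, config=config) for path in paths))
    for path, result in zip(paths, results):
        print(f"--- {path} ---")
        print(result.content)
        print(f"Detected Languages: {result.detected_languages}")

asyncio.run(main())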
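When the number of files is large, it can help to cap how many extractions run at once. The sketch below is one way to do that with `asyncio.Semaphore`; `extract_many`, `fake_extract`, and `max_concurrency` are hypothetical names introduced for illustration, and the extractor is passed in as a callable so the example runs without Kreuzberg installed.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


async def extract_many(
    paths: list[str],
    extract: Callable[[str], Awaitable[Any]],
    max_concurrency: int = 4,
) -> list[Any]:
    """Run `extract` over `paths` with at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Any:
        async with semaphore:
            return await extract(path)

    # return_exceptions=True keeps one bad file from aborting the rest;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(worker(p) for p in paths), return_exceptions=True)


async def fake_extract(path: str) -> str:
    """Stand-in for an extraction call so the sketch is self-contained."""
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"text of {path}"


results = asyncio.run(extract_many(["a.pdf", "b.docx", "c.bad"], fake_extract))
for item in results:
    print(repr(item))
```

With Kreuzberg, the callable could be something like `lambda p: extract_file(p, config=config)`, using the `extract_file` call shown in the examples above.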
Async Processing
For non-blocking document processing:
import asyncio
from pathlib import Path
from kreuzberg import extract_file

async def main() -> None:
    file_path: Path = Path("document.pdf")
    result = await extract_file(file_path)
    print(f"Content: {result.content}")
    print(f"Format: {result.metadata.format_type}")
    print(f"Tables: {len(result.tables)}")

asyncio.run(main())
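Because extraction is awaitable, it composes with the usual asyncio tools. As an illustration, the sketch below wraps an extraction call in `asyncio.wait_for` to enforce a time limit. `extract_with_timeout` and `slow_fake_extract` are hypothetical helpers, and whether cancelling the awaitable also stops work already running in the Rust core is not something this README states.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any, Optional


async def extract_with_timeout(
    path: str,
    extract: Callable[[str], Awaitable[Any]],
    timeout_s: float,
) -> Optional[Any]:
    """Await `extract(path)`; return None if it takes longer than `timeout_s`."""
    try:
        return await asyncio.wait_for(extract(path), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def slow_fake_extract(path: str) -> str:
    """Stand-in for an extraction call that takes about 0.2 seconds."""
    await asyncio.sleep(0.2)
    return f"text of {path}"


within_limit = asyncio.run(extract_with_timeout("document.pdf", slow_fake_extract, 1.0))
over_limit = asyncio.run(extract_with_timeout("document.pdf", slow_fake_extract, 0.01))
print(within_limit, over_limit)
```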
Next Steps
- Installation Guide - Platform-specific setup
- API Documentation - Complete API reference
- Examples & Guides - Full code examples and usage guides
- Configuration Guide - Advanced configuration options
Features
Supported File Formats (56+)
56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
Office Documents
| Category | Formats | Capabilities |
|---|---|---|
| Word Processing | .docx, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
Images (OCR-Enabled)
| Category | Formats | Features |
|---|---|---|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .pnm, .pbm, .pgm, .ppm | OCR, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |
Web & Data
| Category | Formats | Features |
|---|---|---|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .rst, .org, .rtf | CommonMark, GFM, reStructuredText, Org Mode |
Email & Archives
| Category | Formats | Features |
|---|---|---|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |
Academic & Scientific
| Category | Formats | Features |
|---|---|---|
| Citations | .bib, .biblatex, .ris, .enw, .csl | Bibliography parsing, citation extraction |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |
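The extension lists in the tables above can be used to pre-filter a directory before handing files to the extractor. The following is a rough sketch that maps a file name to the table section its extension appears in. `EXTENSIONS_BY_SECTION` and `section_for` are hypothetical helpers, not part of Kreuzberg, and the library's own format detection may work differently (for example by inspecting file content rather than the extension).

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the tables above, grouped by section heading.
EXTENSIONS_BY_SECTION: dict[str, set[str]] = {
    "Office Documents": {
        ".docx", ".odt", ".xlsx", ".xlsm", ".xlsb", ".xls", ".xla", ".xlam",
        ".xltm", ".ods", ".pptx", ".ppt", ".ppsx", ".pdf", ".epub", ".fb2",
    },
    "Images": {
        ".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff", ".tif",
        ".jp2", ".jpx", ".jpm", ".mj2", ".pnm", ".pbm", ".pgm", ".ppm", ".svg",
    },
    "Web & Data": {
        ".html", ".htm", ".xhtml", ".xml", ".svg", ".json", ".yaml", ".yml",
        ".toml", ".csv", ".tsv", ".txt", ".md", ".markdown", ".rst", ".org", ".rtf",
    },
    "Email & Archives": {".eml", ".msg", ".zip", ".tar", ".tgz", ".gz", ".7z"},
    "Academic & Scientific": {
        ".bib", ".biblatex", ".ris", ".enw", ".csl", ".tex", ".latex", ".typst",
        ".jats", ".ipynb", ".docbook", ".opml", ".pod", ".mdoc", ".troff",
    },
}


def section_for(path: str) -> Optional[str]:
    """Return the first section listing the file's extension, or None."""
    suffix = Path(path).suffix.lower()
    for section, extensions in EXTENSIONS_BY_SECTION.items():
        if suffix in extensions:
            return section
    return None


for name in ["report.PDF", "scan.tiff", "movie.mp4"]:
    print(name, "->", section_for(name))
```

Note that `.svg` is listed under both Images and Markup; this sketch returns whichever section comes first.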
Key Capabilities
- Text Extraction - Extract all text content with position and formatting information
- Metadata Extraction - Retrieve document properties, creation date, author, etc.
- Table Extraction - Parse tables with structure and cell content preservation
- Image Extraction - Extract embedded images and render page previews
- OCR Support - Integrate multiple OCR backends for scanned documents
- Async/Await - Non-blocking document processing with concurrent operations
- Plugin System - Extensible post-processing for custom text transformation
- Embeddings - Generate vector embeddings using ONNX Runtime models
- Batch Processing - Efficiently process multiple documents in parallel
- Memory Efficient - Stream large files without loading entirely into memory
- Language Detection - Detect and support multiple languages in documents
- Configuration - Fine-grained control over extraction behavior
Performance Characteristics
| Format | Speed | Memory | Notes |
|---|---|---|---|
| PDF (text) | 10-100 MB/s | ~50MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
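The throughput ranges in the table give a rough way to size a job. As a worked example, the sketch below turns a total input size into a best-case and worst-case duration. `THROUGHPUT_MB_PER_S` and `estimate_seconds` are hypothetical helpers, and the result is only as good as the indicative figures in the table; real timings depend on hardware and document content.

```python
# Throughput ranges in MB/s (slowest, fastest), copied from the table above.
THROUGHPUT_MB_PER_S: dict[str, tuple[float, float]] = {
    "pdf_text": (10.0, 100.0),
    "office": (20.0, 200.0),
    "image_ocr": (1.0, 5.0),
    "archive": (5.0, 50.0),
    "web": (50.0, 200.0),
}


def estimate_seconds(kind: str, total_mb: float) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds to process `total_mb` of `kind`."""
    slowest, fastest = THROUGHPUT_MB_PER_S[kind]
    return total_mb / fastest, total_mb / slowest


best, worst = estimate_seconds("image_ocr", 500.0)
print(f"500 MB of scanned images: {best:.0f}-{worst:.0f} s")
```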
OCR Support
Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
- Tesseract
- EasyOCR
- PaddleOCR
OCR Configuration Example
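import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig

async def main() -> None:
    # Select the OCR backend and language; force_ocr runs OCR even if text exists
    config = ExtractionConfig(force_ocr=True, ocr=OcrConfig(backend="tesseract", language="eng"))
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())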
Async Support
This binding provides full async/await support for non-blocking document processing:
import asyncio
from pathlib import Path
from kreuzberg import extract_file

async def main() -> None:
    file_path: Path = Path("document.pdf")
    result = await extract_file(file_path)
    print(f"Content: {result.content}")
    print(f"Format: {result.metadata.format_type}")
    print(f"Tables: {len(result.tables)}")

asyncio.run(main())
Plugin System
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
For detailed plugin documentation, see the Plugin System Guide.
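To make the idea of post-processing concrete, here is a minimal sketch of text transformations chained over extracted content. It is plain Python and does not use Kreuzberg's plugin registration API, which this README does not show; `PostProcessor`, `collapse_whitespace`, `drop_blank_lines`, and `run_pipeline` are all hypothetical names.

```python
import re
from collections.abc import Callable

# A post-processor here is simply a function from text to text (an assumption).
PostProcessor = Callable[[str], str]


def collapse_whitespace(text: str) -> str:
    """Replace runs of spaces and tabs with one space and trim each line."""
    return "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())


def drop_blank_lines(text: str) -> str:
    """Remove lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def run_pipeline(text: str, processors: list[PostProcessor]) -> str:
    """Apply each processor in order, feeding the output of one into the next."""
    for process in processors:
        text = process(text)
    return text


raw = "Invoice   42\n\n\n  Total:\t100 EUR  \n"
cleaned = run_pipeline(raw, [collapse_whitespace, drop_blank_lines])
print(cleaned)
```

In practice the input would be `result.content` from an extraction call rather than a literal string.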
Embeddings Support
Generate vector embeddings for extracted text using the built-in ONNX Runtime support. This requires ONNX Runtime to be installed (see System Requirements).
Batch Processing
Process multiple documents efficiently:
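import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig

async def main() -> None:
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(
            backend="tesseract",
            language="eng",
            tesseract_config=TesseractConfig(psm=3)
        )
    )
    # Placeholder file names; extract them concurrently with one shared config
    paths = ["scanned_1.pdf", "scanned_2.pdf"]
    results = await asyncio.gather(*(extract_file(path, config=config) for path in paths))
    for path, result in zip(paths, results):
        print(f"--- {path} ---")
        print(result.content)
        print(f"Detected Languages: {result.detected_languages}")

asyncio.run(main())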
Configuration
For advanced configuration options, including language detection, table extraction, OCR settings, and more, see the Configuration Guide.
Documentation
Contributing
Contributions are welcome! See the Contributing Guide.
License
MIT License - see the LICENSE file for details.
Support
- Discord Community: Join our Discord
- GitHub Issues: Report bugs
- Discussions: Ask questions