High-performance document intelligence library for Python. Extract text, metadata, and structured data from PDFs, Office documents, images, and other formats (91+ in total). Powered by a Rust core for 10-50x speed improvements.

These details have been verified by PyPI

Maintainers

nhirschfeld

Project description

Python

Extract text, tables, images, and metadata from 91+ file formats, including PDFs, Office documents, and images, and analyze source code in 248 programming languages. Native Python bindings provide async/await support, multiple OCR backends (Tesseract, EasyOCR, PaddleOCR), and an extensible plugin system.

Installation

Package Installation

Install via pip:

pip install kreuzberg

For async support and additional features:

pip install "kreuzberg[async]"

System Requirements

Python 3.10+ required
Optional: ONNX Runtime version 1.22.x for embeddings support
Optional: Tesseract OCR for OCR functionality
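
The requirements above can be checked from Python before relying on OCR. The following is a minimal sketch using only the standard library; the function name `check_environment` is illustrative, and it assumes Tesseract is installed as an executable named `tesseract` on the PATH.

```python
import shutil
import sys


def check_environment(min_python: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report whether the interpreter and optional OCR tooling look usable."""
    return {
        # Python 3.10+ is the documented minimum.
        "python_ok": sys.version_info >= min_python,
        # Assumption: Tesseract is installed as a `tesseract` executable on PATH.
        "tesseract_found": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A missing `tesseract` executable only matters when the Tesseract backend is used; plain text extraction does not depend on it.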

Quick Start

Basic Extraction

Extract text, metadata, and structure from any supported document format:

import asyncio
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    config = ExtractionConfig(
        use_cache=True,
        enable_quality_processing=True
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())

Common Use Cases

Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

With OCR (for scanned documents):

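import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig

async def main() -> None:
    # psm=3 is Tesseract's fully automatic page segmentation mode.
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(backend="tesseract", language="eng", tesseract_config=TesseractConfig(psm=3)),
    )
    result = await extract_file("scanned.pdf", config=config)
    print(result.content)

asyncio.run(main())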

Table Extraction

import asyncio
from kreuzberg import extract_file

async def main() -> None:
    result = await extract_file("document.pdf")

    content: str = result.content
    tables: int = len(result.tables)
    format_type: str | None = result.metadata.format_type

    print(f"Content length: {len(content)} characters")
    print(f"Tables found: {tables}")
    print(f"Format: {format_type}")

asyncio.run(main())

Processing Multiple Files

import asyncio
from kreuzberg import extract_file

async def main() -> None:
    # Placeholder paths; replace them with real files.
    paths = ["report.pdf", "notes.docx", "scanned.png"]

    # Start all extractions concurrently; gather returns results in input order.
    results = await asyncio.gather(*(extract_file(path) for path in paths))

    for path, result in zip(paths, results):
        print(f"{path}: {len(result.content)} characters, {len(result.tables)} tables")

asyncio.run(main())

Async Processing

For non-blocking document processing:

import asyncio
from pathlib import Path
from kreuzberg import extract_file

async def main() -> None:
    file_path: Path = Path("document.pdf")

    result = await extract_file(file_path)

    print(f"Content: {result.content}")
    print(f"Format: {result.metadata.format_type}")
    print(f"Tables: {len(result.tables)}")

asyncio.run(main())
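
When many documents are processed in one run, it can help to cap how many extractions are in flight at once. The sketch below shows one way to do that with `asyncio.Semaphore`. The helper `extract_many` and its parameters are hypothetical, not part of Kreuzberg; it accepts any async callable that takes a path, and the demo uses a stand-in coroutine so it runs without the library installed.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import TypeVar

T = TypeVar("T")


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[T]],
    max_concurrency: int = 4,
) -> list[T]:
    """Await extract(path) for every path, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(path: str) -> T:
        async with semaphore:
            return await extract(path)

    # gather preserves input order, so results[i] belongs to paths[i].
    return await asyncio.gather(*(run_one(path) for path in paths))


async def fake_extract(path: str) -> str:
    """Stand-in for an extractor; returns a placeholder string."""
    await asyncio.sleep(0)
    return f"text from {path}"


if __name__ == "__main__":
    print(asyncio.run(extract_many(["a.pdf", "b.docx"], fake_extract)))
```

In real use, `extract_file` from the examples above could be passed as the `extract` argument in place of `fake_extract`.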

Next Steps

Installation Guide - Platform-specific setup
API Documentation - Complete API reference
Examples & Guides - Full code examples and usage guides
Configuration Guide - Advanced configuration options

Features

Supported File Formats (91+)

91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

Category	Formats	Capabilities
Word Processing	`.docx`, `.docm`, `.dotx`, `.dotm`, `.dot`, `.odt`	Full text, tables, images, metadata, styles
Spreadsheets	`.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods`	Sheet data, formulas, cell metadata, charts
Presentations	`.pptx`, `.pptm`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.ppt`	Slides, speaker notes, images, metadata
PDF	`.pdf`	Text, tables, images, metadata, OCR support
eBooks	`.epub`, `.fb2`	Chapters, metadata, embedded resources
Database	`.dbf`	Table data extraction, field type support
Hangul	`.hwp`, `.hwpx`	Korean document format, text extraction

Images (OCR-Enabled)

Category	Formats	Features
Raster	`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`	OCR, table detection, EXIF metadata, dimensions, color space
Advanced	`.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`	OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata
Vector	`.svg`	DOM parsing, embedded text, graphics metadata

Web & Data

Category	Formats	Features
Markup	`.html`, `.htm`, `.xhtml`, `.xml`, `.svg`	DOM parsing, metadata (Open Graph, Twitter Card), link extraction
Structured Data	`.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv`	Schema detection, nested structures, validation
Text & Markdown	`.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf`	CommonMark, GFM, Djot, reStructuredText, Org Mode

Email & Archives

Category	Formats	Features
Email	`.eml`, `.msg`	Headers, body (HTML/plain), attachments, threading
Archives	`.zip`, `.tar`, `.tgz`, `.gz`, `.7z`	File listing, nested archives, metadata

Academic & Scientific

Category	Formats	Features
Citations	`.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`	Structured parsing: RIS (structured), PubMed/MEDLINE, EndNote XML (structured), BibTeX, CSL JSON
Scientific	`.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook`	LaTeX, Jupyter notebooks, PubMed JATS
Documentation	`.opml`, `.pod`, `.mdoc`, `.troff`	Technical documentation formats
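
The format tables above can be turned into a simple lookup, for example to sort a directory of files by category before extraction. This is a minimal sketch covering only a subset of the listed extensions; the names `FORMAT_CATEGORIES` and `categorize` are illustrative, and the lookup is a local pre-filter based on file extension, not Kreuzberg's own format detection.

```python
from pathlib import Path
from typing import Optional

# Subset of the categories and extensions listed in the tables above.
FORMAT_CATEGORIES: dict[str, tuple[str, ...]] = {
    "Word Processing": (".docx", ".docm", ".dotx", ".dotm", ".dot", ".odt"),
    "Spreadsheets": (".xlsx", ".xlsm", ".xlsb", ".xls", ".ods"),
    "Presentations": (".pptx", ".pptm", ".ppsx", ".ppt"),
    "PDF": (".pdf",),
    "Raster": (".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff", ".tif"),
    "Email": (".eml", ".msg"),
    "Archives": (".zip", ".tar", ".tgz", ".gz", ".7z"),
}

# Invert once so each lookup is a single dictionary access.
EXTENSION_TO_CATEGORY = {
    ext: category for category, exts in FORMAT_CATEGORIES.items() for ext in exts
}


def categorize(path: str) -> Optional[str]:
    """Return the table category for a path, or None if it is not in the subset."""
    return EXTENSION_TO_CATEGORY.get(Path(path).suffix.lower())
```

Only the final suffix is inspected, so `backup.tar.gz` is classified by `.gz`.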

Code Intelligence (248 Languages)

Feature	Description
Structure Extraction	Functions, classes, methods, structs, interfaces, enums
Import/Export Analysis	Module dependencies, re-exports, wildcard imports
Symbol Extraction	Variables, constants, type aliases, properties
Docstring Parsing	Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats
Diagnostics	Parse errors with line/column positions
Syntax-Aware Chunking	Split code by semantic boundaries, not arbitrary byte offsets
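
To illustrate what splitting "by semantic boundaries" means, the sketch below cuts Python source at top-level function and class definitions using the standard `ast` module. This is a stand-in for illustration only and handles Python alone; it is not how Kreuzberg implements chunking, and `chunk_by_definitions` is a hypothetical name.

```python
import ast


def chunk_by_definitions(source: str) -> list[str]:
    """Return one chunk per top-level function or class in Python source."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks


SAMPLE = '''\
import os

def first():
    return 1

class Second:
    def method(self):
        return 2
'''

if __name__ == "__main__":
    for chunk in chunk_by_definitions(SAMPLE):
        print(chunk, end="\n---\n")
```

Each chunk is a complete definition, so a function body is never split in the middle the way a fixed byte-offset split could.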

Complete Format Reference

Key Capabilities

Text Extraction - Extract all text content with position and formatting information
Metadata Extraction - Retrieve document properties, creation date, author, etc.
Table Extraction - Parse tables with structure and cell content preservation
Image Extraction - Extract embedded images and render page previews
OCR Support - Integrate multiple OCR backends for scanned documents
Async/Await - Non-blocking document processing with concurrent operations
Plugin System - Extensible post-processing for custom text transformation
Embeddings - Generate vector embeddings using ONNX Runtime models
Batch Processing - Efficiently process multiple documents in parallel
Memory Efficient - Stream large files without loading entirely into memory
Language Detection - Detect the languages used in a document
Code Intelligence - Extract structure, imports, exports, symbols, and docstrings from 248 programming languages via tree-sitter
Configuration - Fine-grained control over extraction behavior

Performance Characteristics

Format	Speed	Memory	Notes
PDF (text)	10-100 MB/s	~50MB per doc	Fastest extraction
Office docs	20-200 MB/s	~100MB per doc	DOCX, XLSX, PPTX
Images (OCR)	1-5 MB/s	Variable	Depends on OCR backend
Archives	5-50 MB/s	~200MB per doc	ZIP, TAR, etc.
Web formats	50-200 MB/s	Streaming	HTML, XML, JSON
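
The speed ranges in the table translate into rough time bounds via time = size / throughput. For example, 500 MB of text PDFs at 10-100 MB/s would take between 5 and 50 seconds. The sketch below does this arithmetic with the figures copied from the table; the function name is illustrative, and actual timings depend on hardware and document content.

```python
# Throughput ranges in MB/s, copied from the table above.
THROUGHPUT_MB_S: dict[str, tuple[float, float]] = {
    "PDF (text)": (10, 100),
    "Office docs": (20, 200),
    "Images (OCR)": (1, 5),
    "Archives": (5, 50),
    "Web formats": (50, 200),
}


def estimate_seconds(format_name: str, size_mb: float) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds for size_mb of the given format."""
    slowest, fastest = THROUGHPUT_MB_S[format_name]
    return size_mb / fastest, size_mb / slowest


if __name__ == "__main__":
    best, worst = estimate_seconds("PDF (text)", 500)
    print(f"500 MB of text PDFs: {best:.0f}-{worst:.0f} s")
```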

OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

Tesseract
EasyOCR
PaddleOCR

OCR Configuration Example

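import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig

async def main() -> None:
    # psm=3 is Tesseract's fully automatic page segmentation mode.
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(backend="tesseract", language="eng", tesseract_config=TesseractConfig(psm=3)),
    )
    result = await extract_file("scanned.pdf", config=config)
    print(result.content)
    print(f"Detected Languages: {result.detected_languages}")

asyncio.run(main())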

Async Support

This binding provides full async/await support for non-blocking document processing:

import asyncio
from pathlib import Path
from kreuzberg import extract_file

async def main() -> None:
    file_path: Path = Path("document.pdf")

    result = await extract_file(file_path)

    print(f"Content: {result.content}")
    print(f"Format: {result.metadata.format_type}")
    print(f"Tables: {len(result.tables)}")

asyncio.run(main())

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, see the Plugin System Guide.
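
As a rough picture of what post-processing for "custom text transformation and filtering" can look like, the sketch below chains plain functions over extracted text. It does not use Kreuzberg's plugin interface, whose registration API is covered in the Plugin System Guide; every name here is hypothetical.

```python
import re
from collections.abc import Callable, Iterable

TextTransform = Callable[[str], str]


def collapse_whitespace(text: str) -> str:
    """Replace runs of spaces and tabs with a single space."""
    return re.sub(r"[ \t]+", " ", text)


def drop_blank_lines(text: str) -> str:
    """Remove lines that contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def apply_transforms(text: str, transforms: Iterable[TextTransform]) -> str:
    """Apply each transform in order, feeding each output into the next."""
    for transform in transforms:
        text = transform(text)
    return text


if __name__ == "__main__":
    raw = "Invoice   2024\n\n\nTotal:\t 42"
    print(apply_transforms(raw, [collapse_whitespace, drop_blank_lines]))
```

The same chain could be applied to `result.content` from any of the extraction examples.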

Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires an ONNX Runtime installation (version 1.22.x, as listed under System Requirements).

Embeddings Guide

Batch Processing

Process multiple documents efficiently:

import asyncio
from pathlib import Path
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    # One shared configuration for the whole batch.
    config = ExtractionConfig(use_cache=True)

    # Placeholder directory; every PDF in it is extracted concurrently.
    paths = sorted(Path("documents").glob("*.pdf"))
    results = await asyncio.gather(*(extract_file(path, config=config) for path in paths))

    for path, result in zip(paths, results):
        print(f"{path.name}: {len(result.content)} characters")

asyncio.run(main())

Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

Configuration Guide

Documentation

Contributing

Contributions are welcome! See Contributing Guide.

License

MIT License - see LICENSE file for details.

Support

Discord Community: Join our Discord
GitHub Issues: Report bugs
Discussions: Ask questions

Release history

This version

4.9.5

Apr 23, 2026

4.9.4

Apr 22, 2026

4.9.3

Apr 22, 2026

4.9.2

Apr 19, 2026

4.9.1

Apr 19, 2026

4.8.6

Apr 17, 2026

4.8.5

Apr 14, 2026

4.8.2

Apr 10, 2026

4.8.0

Apr 8, 2026

4.7.4

Apr 6, 2026

4.7.3

Apr 5, 2026

4.7.2

Apr 4, 2026

4.7.1

Apr 3, 2026

4.7.0

Apr 3, 2026

4.6.3

Mar 27, 2026

4.6.2

Mar 26, 2026

4.6.1

Mar 25, 2026

4.6.0

Mar 24, 2026

4.5.4

Mar 23, 2026

4.5.3

Mar 22, 2026

4.5.2

Mar 21, 2026

4.5.1

Mar 20, 2026

4.4.6

Mar 13, 2026

4.4.5

Mar 10, 2026

4.4.4

Mar 7, 2026

4.4.3

Mar 6, 2026

4.4.2

Mar 4, 2026

4.4.1

Feb 28, 2026

4.4.0

Feb 27, 2026

4.3.8

Feb 21, 2026

4.3.7

Feb 20, 2026

4.3.6

Feb 19, 2026

4.3.5

Feb 17, 2026

4.3.4

Feb 16, 2026

4.3.3

Feb 14, 2026

4.3.2

Feb 13, 2026

4.3.1

Feb 12, 2026

4.3.0

Feb 11, 2026

4.2.15

Feb 8, 2026

4.2.14

Feb 7, 2026

4.2.13

Feb 7, 2026

4.2.12

Feb 6, 2026

4.2.11

Feb 6, 2026

4.2.10

Feb 5, 2026

4.2.9

Feb 3, 2026

4.2.8

Feb 2, 2026

4.2.7

Feb 1, 2026

4.2.6

Jan 31, 2026

4.2.5

Jan 30, 2026

4.2.4

Jan 29, 2026

4.2.3

Jan 29, 2026

4.2.2

Jan 28, 2026

4.2.1

Jan 27, 2026

4.2.0

Jan 26, 2026

4.1.2

Jan 25, 2026

4.1.1

Jan 23, 2026

4.1.0

Jan 22, 2026

4.0.8

Jan 17, 2026

4.0.7

Jan 16, 2026

4.0.6

Jan 14, 2026

4.0.5

Jan 14, 2026

4.0.4

Jan 13, 2026

4.0.3

Jan 13, 2026

4.0.2

Jan 12, 2026

4.0.1

Jan 11, 2026

4.0.0

Jan 11, 2026

4.0.0rc29 pre-release

Jan 9, 2026

4.0.0rc28 pre-release

Jan 7, 2026

4.0.0rc27 pre-release

Jan 4, 2026

4.0.0rc26 pre-release

Jan 3, 2026

4.0.0rc25 pre-release

Jan 3, 2026

4.0.0rc24 pre-release

Jan 1, 2026

4.0.0rc23 pre-release

Dec 30, 2025

4.0.0rc22 pre-release

Dec 28, 2025

4.0.0rc21 pre-release

Dec 26, 2025

4.0.0rc20 pre-release

Dec 25, 2025

4.0.0rc19 pre-release

Dec 24, 2025

4.0.0rc18 pre-release

Dec 23, 2025

4.0.0rc17 pre-release

Dec 22, 2025

4.0.0rc16 pre-release

Dec 21, 2025

4.0.0rc15 pre-release

Dec 20, 2025

4.0.0rc14 pre-release

Dec 20, 2025

4.0.0rc13 pre-release

Dec 19, 2025

4.0.0rc12 pre-release

Dec 19, 2025

4.0.0rc11 pre-release

Dec 19, 2025

4.0.0rc10 pre-release

Dec 17, 2025

4.0.0rc9 pre-release

Dec 15, 2025

4.0.0rc8 pre-release

Dec 14, 2025

4.0.0rc7 pre-release

Dec 12, 2025

4.0.0rc6 pre-release

Dec 10, 2025

4.0.0rc2 pre-release

Nov 30, 2025

4.0.0rc1 pre-release

Nov 23, 2025

3.22.0

Nov 27, 2025

3.21.0

Nov 5, 2025

3.20.2

Oct 11, 2025

3.20.1

Oct 11, 2025

3.20.0

Oct 11, 2025

3.19.1

Sep 30, 2025

3.19.0

Sep 29, 2025

3.18.0

Sep 27, 2025

3.17.3

Sep 23, 2025

3.17.2

Sep 22, 2025

3.17.1

Sep 19, 2025

3.17.0

Sep 17, 2025

3.16.0

Sep 16, 2025

3.15.0

Sep 14, 2025

3.14.1

Sep 13, 2025

3.14.0

Sep 13, 2025

3.13.3

Sep 10, 2025

3.13.2

Sep 4, 2025

3.13.1

Sep 4, 2025

3.13.0

Sep 4, 2025

3.11.4

Aug 24, 2025

3.11.3

Aug 24, 2025

3.11.2

Aug 15, 2025

3.11.1

Aug 13, 2025

3.11.0

Aug 1, 2025

3.10.1

Jul 31, 2025

3.10.0

Jul 29, 2025

3.9.1

Jul 29, 2025

3.9.0

Jul 17, 2025

3.8.2

Jul 13, 2025

3.8.1

Jul 13, 2025

3.8.0

Jul 12, 2025

3.7.0

Jul 11, 2025

3.6.2

Jul 11, 2025

3.6.1

Jul 4, 2025

3.6.0

Jul 4, 2025

3.5.0

Jul 4, 2025

3.4.2

Jul 3, 2025

3.4.1

Jul 3, 2025

3.4.0

Jul 3, 2025

3.3.0

Jul 2, 2025

3.2.0

Jun 23, 2025

3.1.7

Jun 9, 2025

3.1.6

May 26, 2025

3.1.5

May 13, 2025

3.1.4

Apr 26, 2025

3.1.3

Apr 10, 2025

3.1.2

Apr 8, 2025

3.1.1

Apr 2, 2025

3.1.0

Mar 28, 2025

3.0.1

Mar 26, 2025

3.0.0

Mar 23, 2025

2.1.2

Mar 1, 2025

2.1.1

Mar 1, 2025

2.1.0

Feb 20, 2025

2.0.1

Feb 15, 2025

2.0.0

Feb 15, 2025

1.7.0

Feb 14, 2025

1.6.0

Feb 9, 2025

1.5.0

Feb 8, 2025

1.4.0

Feb 8, 2025

1.3.0

Feb 3, 2025

1.2.0

Feb 2, 2025

1.1.0

Feb 1, 2025

1.0.0

Feb 1, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_x86_64.whl (35.5 MB)

Uploaded Apr 23, 2026. Tags: CPython 3.10+, manylinux: glibc 2.28+ x86-64

kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_aarch64.whl (33.3 MB)

Uploaded Apr 23, 2026. Tags: CPython 3.10+, manylinux: glibc 2.28+ ARM64

kreuzberg-4.9.5-cp310-abi3-macosx_14_0_arm64.whl (28.8 MB)

Uploaded Apr 23, 2026. Tags: CPython 3.10+, macOS 14.0+ ARM64
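
The wheel file names above follow the standard layout `{distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl`. Here `cp310` with `abi3` denotes a CPython stable-ABI build, which is why a single wheel is listed as "CPython 3.10+". The sketch below splits a name into those parts; `WheelTags` and `parse_wheel_name` are illustrative names, and real tooling would normally rely on an existing packaging library instead.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_name(filename: str) -> WheelTags:
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    return WheelTags(*parts)


if __name__ == "__main__":
    print(parse_wheel_name("kreuzberg-4.9.5-cp310-abi3-macosx_14_0_arm64.whl"))
```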

File details

Details for the file kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_x86_64.whl.

File metadata

Download URL: kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_x86_64.whl
Upload date: Apr 23, 2026
Size: 35.5 MB
Tags: CPython 3.10+, manylinux: glibc 2.28+ x86-64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm	Hash digest
SHA256	`87c702b584103b8addb142dd879742f0c71ea15170fa6849166b79b870e6b595`
MD5	`ae85261fb4ee8599c660f0d8e12c9929`
BLAKE2b-256	`e0365ff6fe57b6b91ef065b2ce0ba79ca46e800632843f3dc800aa4274f16cdd`

See more details on using hashes here.
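
A downloaded wheel can be checked against the SHA256 digest listed above with the standard `hashlib` module. This is a minimal sketch; the helper names are illustrative, and the wheel is assumed to be in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the published value."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    wheel = Path("kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_x86_64.whl")
    expected = "87c702b584103b8addb142dd879742f0c71ea15170fa6849166b79b870e6b595"
    if wheel.exists():
        print("hash matches:", matches_published_hash(wheel, expected))
```

Reading in chunks keeps memory use flat regardless of the wheel's size.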

Provenance

The following attestation bundles were made for kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_x86_64.whl:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_x86_64.whl
- Subject digest: 87c702b584103b8addb142dd879742f0c71ea15170fa6849166b79b870e6b595
- Sigstore transparency entry: 1362024597
- Sigstore integration time: Apr 23, 2026
Source repository:
- Permalink: kreuzberg-dev/kreuzberg@315319934506b12d74ac322586d334bc0c6edfd7
- Branch / Tag: refs/tags/v4.9.5
- Owner: https://github.com/kreuzberg-dev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@315319934506b12d74ac322586d334bc0c6edfd7
- Trigger Event: release

File details

Details for the file kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_aarch64.whl.

File metadata

Download URL: kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_aarch64.whl
Upload date: Apr 23, 2026
Size: 33.3 MB
Tags: CPython 3.10+, manylinux: glibc 2.28+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_aarch64.whl
Algorithm	Hash digest
SHA256	`335cf8a6565ca2b0a03e4f646afe549586a2a3387b68b770fb768c1de7dc35b1`
MD5	`86f1371b47ae1f7910b63c5a76aabc35`
BLAKE2b-256	`09ce94cdc6856cb0596cc03f69d9df0f0879727be8d8ad65c5b2284a6a7aca4c`

See more details on using hashes here.

Provenance

The following attestation bundles were made for kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_aarch64.whl:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-4.9.5-cp310-abi3-manylinux_2_28_aarch64.whl
- Subject digest: 335cf8a6565ca2b0a03e4f646afe549586a2a3387b68b770fb768c1de7dc35b1
- Sigstore transparency entry: 1362024421
- Sigstore integration time: Apr 23, 2026
Source repository:
- Permalink: kreuzberg-dev/kreuzberg@315319934506b12d74ac322586d334bc0c6edfd7
- Branch / Tag: refs/tags/v4.9.5
- Owner: https://github.com/kreuzberg-dev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@315319934506b12d74ac322586d334bc0c6edfd7
- Trigger Event: release

File details

Details for the file kreuzberg-4.9.5-cp310-abi3-macosx_14_0_arm64.whl.

File metadata

Download URL: kreuzberg-4.9.5-cp310-abi3-macosx_14_0_arm64.whl
Upload date: Apr 23, 2026
Size: 28.8 MB
Tags: CPython 3.10+, macOS 14.0+ ARM64
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kreuzberg-4.9.5-cp310-abi3-macosx_14_0_arm64.whl
Algorithm	Hash digest
SHA256	`9319c01a0ae90dfbbe5d019109ed0ccc123508ecf0f5ed1430b2a6b56cbef99c`
MD5	`40bd479f48df59c6b5960640801d4471`
BLAKE2b-256	`65bc1b77430cdd71e32ee1d985d8bb2faaeba1589217d38951f4acaa9f8ce0da`

See more details on using hashes here.

Provenance

The following attestation bundles were made for kreuzberg-4.9.5-cp310-abi3-macosx_14_0_arm64.whl:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-4.9.5-cp310-abi3-macosx_14_0_arm64.whl
- Subject digest: 9319c01a0ae90dfbbe5d019109ed0ccc123508ecf0f5ed1430b2a6b56cbef99c
- Sigstore transparency entry: 1362024334
- Sigstore integration time: Apr 23, 2026
Source repository:
- Permalink: kreuzberg-dev/kreuzberg@315319934506b12d74ac322586d334bc0c6edfd7
- Branch / Tag: refs/tags/v4.9.5
- Owner: https://github.com/kreuzberg-dev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@315319934506b12d74ac322586d334bc0c6edfd7
- Trigger Event: release
