A text extraction library supporting PDFs, images, office documents and more

Maintainers

nhirschfeld

Kreuzberg

Kreuzberg is a Python library for text extraction from documents. It provides a unified async interface for extracting text from PDFs, images, office documents, and more.

Why Kreuzberg?

Simple and Hassle-Free: Clean API that just works, without complex configuration
Local Processing: No external API calls or cloud dependencies required
Resource Efficient: Lightweight processing without GPU requirements
Small Package Size: Has few curated dependencies and a minimal footprint
Format Support: Comprehensive support for documents, images, and text formats
Modern Python: Built with async/await, type hints, and a functional-first approach
Permissive OSS: Kreuzberg and its dependencies have a permissive OSS license

Kreuzberg was built for RAG (Retrieval Augmented Generation) applications, focusing on local processing with minimal dependencies. It is designed for modern async applications, serverless functions, and dockerized applications.

Installation

1. Install the Python Package

pip install kreuzberg

2. Install System Dependencies

Kreuzberg requires two system-level dependencies:

Pandoc - For document format conversion. Minimum required version is Pandoc 2.
Tesseract OCR - For image and PDF OCR. Minimum required version is Tesseract 5.

You can install these with:

Linux (Ubuntu)

sudo apt-get install pandoc tesseract-ocr

macOS

brew install tesseract pandoc

Windows

choco install -y tesseract pandoc

Notes:

In most distributions the tesseract-ocr package is split into multiple packages, so you may need to install language models other than English separately.
Please consult the official documentation for Pandoc and Tesseract for the most up-to-date installation instructions for your platform.
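
After installing, it can be useful to confirm that both executables are actually on the PATH. The following is a minimal sketch using only the standard library; the function name find_system_dependencies is made up for illustration and is not part of Kreuzberg, and the sketch does not check the minimum versions listed above.

```python
import shutil


def find_system_dependencies(names=("pandoc", "tesseract")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_system_dependencies().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

A missing entry usually means the package is not installed or its install directory has not been added to PATH (common on Windows).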

Architecture

Kreuzberg integrates:

PDF Processing:
- pypdfium2 for searchable PDFs
- Tesseract OCR for scanned content
Document Conversion:
- Pandoc for many document and markup formats
- python-pptx for PowerPoint files
- html-to-markdown for HTML content
- calamine for Excel spreadsheets (with multi-sheet support)
Text Processing:
- Smart encoding detection
- Markdown and plain text handling

Supported Formats

Document Formats

PDF (.pdf, both searchable and scanned)
Microsoft Word (.docx)
PowerPoint presentations (.pptx)
OpenDocument Text (.odt)
Rich Text Format (.rtf)
EPUB (.epub)
DocBook XML (.dbk, .xml)
FictionBook (.fb2)
LaTeX (.tex, .latex)
Typst (.typ)

Markup and Text Formats

HTML (.html, .htm)
Plain text (.txt) and Markdown (.md, .markdown)
reStructuredText (.rst)
Org-mode (.org)
DokuWiki (.txt)
Pod (.pod)
Troff/Man (.1, .2, etc.)

Data and Research Formats

Spreadsheets (.xlsx, .xls, .xlsm, .xlsb, .xlam, .xla, .ods)
CSV (.csv) and TSV (.tsv) files
OPML files (.opml)
Jupyter Notebooks (.ipynb)
BibTeX (.bib) and BibLaTeX (.bib)
CSL-JSON (.json)
EndNote and JATS XML (.xml)
RIS (.ris)

Image Formats

JPEG (.jpg, .jpeg, .pjpeg)
PNG (.png)
TIFF (.tiff, .tif)
BMP (.bmp)
GIF (.gif)
JPEG 2000 family (.jp2, .jpm, .jpx, .mj2)
WebP (.webp)
Portable anymap formats (.pbm, .pgm, .ppm, .pnm)

Usage

Kreuzberg provides both async and sync APIs for text extraction, including batch processing. The library exports the following main functions:

Single Item Processing:
- extract_file(): Async function to extract text from a file (accepts string path or pathlib.Path)
- extract_bytes(): Async function to extract text from bytes (accepts a byte string)
- extract_file_sync(): Synchronous version of extract_file()
- extract_bytes_sync(): Synchronous version of extract_bytes()
Batch Processing:
- batch_extract_file(): Async function to extract text from multiple files concurrently
- batch_extract_bytes(): Async function to extract text from multiple byte contents concurrently
- batch_extract_file_sync(): Synchronous version of batch_extract_file()
- batch_extract_bytes_sync(): Synchronous version of batch_extract_bytes()
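
To illustrate how an async function and a blocking counterpart can relate, here is a small sketch. It uses a stand-in coroutine named extract_text rather than Kreuzberg itself, and it should not be read as a description of how the library implements its *_sync functions.

```python
import asyncio


async def extract_text(data: bytes) -> str:
    """Stand-in for an async extraction call (hypothetical, not Kreuzberg's API)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return data.decode("utf-8")


def extract_text_sync(data: bytes) -> str:
    """Blocking wrapper: runs the coroutine to completion on a fresh event loop."""
    return asyncio.run(extract_text(data))
```

Note that asyncio.run() cannot be called from code that is already running inside an event loop, so in async code you would await the async variant directly.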

Configuration Parameters

All extraction functions accept the following optional parameters for configuring OCR and performance:

OCR Configuration

force_ocr (default: False): Forces OCR processing even for searchable PDFs.
language (default: eng): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for documents in different languages. Examples:
- eng for English
- deu for German
- eng+deu for English and German
Note: the order of languages affects processing time. The first language is the primary language, the second is the secondary language, and so on.
psm (Page Segmentation Mode, default: PSMMode.AUTO): Controls how Tesseract analyzes page layout. In most cases you do not need to change this value.

Consult the Tesseract documentation for more information on the language and psm options.
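
As a rough illustration of the language and psm values, the sketch below builds a combined language string and lists a few of Tesseract's page segmentation mode numbers. TesseractPSM and build_language are names invented for this example; they are not Kreuzberg's PSMMode or any library function.

```python
from enum import IntEnum


class TesseractPSM(IntEnum):
    """A few of Tesseract's --psm values (stand-in, not Kreuzberg's PSMMode)."""

    AUTO = 3  # fully automatic page segmentation, no OSD (Tesseract default)
    SINGLE_BLOCK = 6  # assume a single uniform block of text
    SINGLE_LINE = 7  # treat the image as a single text line


def build_language(codes):
    """Join Tesseract language codes with '+', keeping order and dropping duplicates."""
    seen = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)
```

Because order matters, build_language(["deu", "eng"]) and build_language(["eng", "deu"]) produce different strings, with the first code acting as the primary language.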

Processing Configuration

max_processes (default: CPU count): Maximum number of concurrent processes for Tesseract.
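
The general idea of capping concurrent work can be sketched as follows. This example limits asyncio tasks with a semaphore, as a stand-in for limiting Tesseract processes. The names run_bounded and fake_page are assumptions for illustration, and the sketch is not Kreuzberg's implementation.

```python
import asyncio


async def run_bounded(jobs, max_concurrent):
    """Run coroutine factories with at most max_concurrent in flight; keep input order."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_page(n):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # pretend to OCR a page
        in_flight -= 1
        return n * n

    jobs = [lambda n=n: fake_page(n) for n in range(8)]
    results = await run_bounded(jobs, max_concurrent=3)
    return results, peak
```

Running asyncio.run(demo()) returns the results in input order along with the observed peak concurrency, which stays at or below the limit.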

Quick Start

from pathlib import Path
from kreuzberg import extract_file
from kreuzberg import ExtractionResult
from kreuzberg import PSMMode


# Basic file extraction
async def extract_document():
    # Extract from a PDF file with default settings
    pdf_result: ExtractionResult = await extract_file("document.pdf")
    print(f"Content: {pdf_result.content}")

    # Extract from an image with German language model
    img_result = await extract_file(
        "scan.png",
        language="deu",  # German language model
        psm=PSMMode.SINGLE_BLOCK,  # Treat as single block of text
        max_processes=4  # Limit concurrent processes
    )
    print(f"Image text: {img_result.content}")

    # Extract from Word document with metadata
    docx_result = await extract_file(Path("document.docx"))
    if docx_result.metadata:
        print(f"Title: {docx_result.metadata.get('title')}")
        print(f"Author: {docx_result.metadata.get('creator')}")

Extracting Bytes

from kreuzberg import extract_bytes
from kreuzberg import ExtractionResult


async def process_upload(file_content: bytes, mime_type: str) -> ExtractionResult:
    """Process uploaded file content with known MIME type."""
    return await extract_bytes(
        file_content,
        mime_type=mime_type,
    )


# Example usage with different file types
async def handle_uploads(docx_bytes: bytes, pdf_bytes: bytes, image_bytes: bytes):
    # Process PDF upload
    pdf_result = await process_upload(pdf_bytes, mime_type="application/pdf")
    print(f"PDF content: {pdf_result.content}")
    print(f"PDF metadata: {pdf_result.metadata}")

    # Process image upload (will use OCR)
    img_result = await process_upload(image_bytes, mime_type="image/jpeg")
    print(f"Image text: {img_result.content}")

    # Process Word document upload
    docx_result = await process_upload(
        docx_bytes,
        mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    print(f"Word content: {docx_result.content}")
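
When only a file name is known, the MIME type to pass as mime_type can often be guessed from the extension. This is a minimal sketch using the standard library mimetypes module; guess_mime_type and EXTRA_TYPES are hypothetical helpers, not part of Kreuzberg.

```python
import mimetypes

# Explicit overrides for types that mimetypes may not know on every platform.
EXTRA_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".md": "text/markdown",
}


def guess_mime_type(filename: str, default: str = "application/octet-stream") -> str:
    """Return a MIME type for filename, preferring the explicit overrides."""
    lowered = filename.lower()
    for extension, mime_type in EXTRA_TYPES.items():
        if lowered.endswith(extension):
            return mime_type
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default
```

Extension-based guessing is only a hint. If the upload comes with a Content-Type header you trust, that is usually the better source.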

Batch Processing

Kreuzberg supports efficient batch processing of multiple files or byte contents:

from pathlib import Path
from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync


# Process multiple files concurrently
async def process_documents(file_paths: list[Path]) -> None:
    # Extract from multiple files
    results = await batch_extract_file(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")


# Process multiple uploaded files concurrently
async def process_uploads(contents: list[tuple[bytes, str]]) -> None:
    # Each item is a tuple of (content, mime_type)
    results = await batch_extract_bytes(contents)
    for (_, mime_type), result in zip(contents, results):
        print(f"Upload {mime_type}: {result.content[:100]}...")


# Synchronous batch processing is also available
def process_documents_sync(file_paths: list[Path]) -> None:
    results = batch_extract_file_sync(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")

Features:

Ordered results
Concurrent processing
Error handling per item
Async and sync interfaces
Same options as single extraction
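
Ordered results and per-item error handling can be pictured with plain asyncio. The sketch below uses a stand-in coroutine, fake_extract, and collects failures as values. Whether Kreuzberg's batch functions raise or return per-item errors is not specified here, so treat this only as an illustration of the general pattern.

```python
import asyncio


async def fake_extract(name: str) -> str:
    """Stand-in for a single extraction; fails for names ending in '.bad'."""
    await asyncio.sleep(0)
    if name.endswith(".bad"):
        raise ValueError(f"cannot process {name}")
    return f"text of {name}"


async def extract_all(names):
    """Run all extractions concurrently; results come back in input order."""
    outcomes = await asyncio.gather(
        *(fake_extract(name) for name in names),
        return_exceptions=True,  # a failure becomes a value instead of aborting the batch
    )
    return list(zip(names, outcomes))
```

Each pair holds the input name and either the extracted text or the exception raised for that item, so one bad input does not discard the rest of the batch.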

PDF Processing

Kreuzberg employs a smart approach to PDF text extraction:

Searchable Text Detection: First attempts to extract text directly from searchable PDFs using pypdfium2.
Text Validation: Extracted text is validated for corruption by checking for:
- Control and non-printable characters
- Unicode replacement characters (�)
- Zero-width spaces and other invisible characters
- Empty or whitespace-only content
Automatic OCR Fallback: If the extracted text appears corrupted or if the PDF is image-based, automatically falls back to OCR using Tesseract.

This approach works well for searchable PDFs and standard text documents. For complex OCR (e.g., handwriting, photographs), use a specialized tool.
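
The validation checks listed above can be sketched as a simple heuristic. This is not Kreuzberg's actual implementation: the function name looks_corrupted, the character sets, and the 5% threshold are all assumptions chosen for illustration.

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
ALLOWED_CONTROL = {"\n", "\r", "\t"}


def looks_corrupted(text: str, max_bad_ratio: float = 0.05) -> bool:
    """Heuristic: True if text is empty or has too many suspicious characters."""
    if not text.strip():
        return True  # empty or whitespace-only content
    bad = 0
    for char in text:
        if char in ALLOWED_CONTROL:
            continue
        if char == "\ufffd" or char in ZERO_WIDTH:
            bad += 1  # replacement character or invisible character
        elif unicodedata.category(char) in {"Cc", "Cf", "Cn", "Co", "Cs"}:
            bad += 1  # control, format, unassigned, private-use, surrogate
    return bad / len(text) > max_bad_ratio
```

A caller would fall back to OCR when looks_corrupted(extracted_text) is true. Using a ratio rather than a single hit avoids rejecting long documents over one stray character.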

PDF Processing Options

You can control PDF processing behavior using optional parameters:

from kreuzberg import extract_file


async def process_pdf():
    # Default behavior: auto-detect and use OCR if needed.
    # max_processes is left at its default (see Processing Configuration).
    result = await extract_file("document.pdf")
    print(result.content)

    # Force OCR even for searchable PDFs
    result = await extract_file("document.pdf", force_ocr=True)
    print(result.content)

    # Control OCR concurrency for large documents
    # Warning: High concurrency values can cause system resource exhaustion
    # Start with a low value and increase based on your system's capabilities
    result = await extract_file(
        "large_document.pdf",
        max_processes=4  # Process up to 4 pages concurrently
    )
    print(result.content)

    # Process a scanned PDF (automatically uses OCR)
    result = await extract_file("scanned.pdf")
    print(result.content)

ExtractionResult Object

All extraction functions return an ExtractionResult or a list thereof (for batch functions). The ExtractionResult object is a NamedTuple:

content: The extracted text (str)
mime_type: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
metadata: A metadata dictionary. Currently this dictionary is only populated when extracting documents using Pandoc.

from kreuzberg import extract_file, ExtractionResult, Metadata

async def process_document(path: str) -> tuple[str, str, Metadata]:
    # Access as a named tuple
    result: ExtractionResult = await extract_file(path)
    print(f"Content: {result.content}")
    print(f"Format: {result.mime_type}")

    # Or unpack as a tuple
    content, mime_type, metadata = await extract_file(path)
    return content, mime_type, metadata

Error Handling

Kreuzberg provides comprehensive error handling through several exception types, all inheriting from KreuzbergError. Each exception includes helpful context information for debugging.

from kreuzberg import (
    extract_file,
    ValidationError,
    ParsingError,
    OCRError,
    MissingDependencyError
)

async def safe_extract(path: str) -> str:
    try:
        result = await extract_file(path)
        return result.content

    except ValidationError as e:
        # Input validation issues
        # - Unsupported or undetectable MIME types
        # - Missing files
        # - Invalid input parameters
        print(f"Validation failed: {e}")

    except OCRError as e:
        # OCR-specific issues
        # - Tesseract processing failures
        # - Image conversion problems
        print(f"OCR failed: {e}")

    except MissingDependencyError as e:
        # System dependency issues
        # - Missing Tesseract OCR
        # - Missing Pandoc
        # - Incompatible versions
        print(f"Dependency missing: {e}")

    except ParsingError as e:
        # General processing errors
        # - PDF parsing failures
        # - Format conversion issues
        # - Encoding problems
        print(f"Processing failed: {e}")

    return ""

All exceptions include:

Error message
Context in the context attribute
String representation
Exception chaining
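
To show what a context attribute and exception chaining look like in practice, here is a self-contained sketch. ExtractionError and parse_page_count are stand-ins invented for this example; they mirror the ideas listed above but are not Kreuzberg's classes.

```python
class ExtractionError(Exception):
    """Stand-in base error carrying a context mapping (hypothetical, not Kreuzberg's class)."""

    def __init__(self, message: str, context=None):
        super().__init__(message)
        self.context = context or {}

    def __str__(self) -> str:
        base = super().__str__()
        return f"{base} (context: {self.context})" if self.context else base


def parse_page_count(raw: str) -> int:
    """Convert raw to an int, wrapping failures in ExtractionError."""
    try:
        return int(raw)
    except ValueError as exc:
        # "raise ... from exc" keeps the original error available as __cause__
        raise ExtractionError("invalid page count", context={"raw": raw}) from exc
```

When catching such an error, the original cause is available through the __cause__ attribute and the extra details through context, which helps when logging failures.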

Contribution

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.

Local Development

Clone the repo
Install the system dependencies
Install the full dependencies with uv sync

Install the pre-commit hooks with:

pre-commit install && pre-commit install --hook-type commit-msg

Make your changes and submit a PR

License

This library uses the MIT license.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kreuzberg-2.1.2.tar.gz (26.5 kB)

Uploaded Mar 1, 2025 Source

Built Distribution

kreuzberg-2.1.2-py3-none-any.whl (27.0 kB)

Uploaded Mar 1, 2025 Python 3

File details

Details for the file kreuzberg-2.1.2.tar.gz.

File metadata

Download URL: kreuzberg-2.1.2.tar.gz
Upload date: Mar 1, 2025
Size: 26.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kreuzberg-2.1.2.tar.gz
Algorithm	Hash digest
SHA256	`4cee25446307aa5db191259c3ec02cde659a3a26f3f6b5d5f36be2fff7e0b552`
MD5	`2230fdbea27441edc94c2a36d8bb5858`
BLAKE2b-256	`9e815523dd244d3b0035404f516bc40d9ebc25766caeffaa15af6226b46afba2`
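
A downloaded file can be checked against the published SHA256 digest with the standard library. This is a minimal sketch; the helper names sha256_of_file and matches_published_hash are made up for illustration.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with the published one (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, matches_published_hash("kreuzberg-2.1.2.tar.gz", "4cee25446307aa5db191259c3ec02cde659a3a26f3f6b5d5f36be2fff7e0b552") should be true for an intact copy of the source distribution saved under that name.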

Provenance

The following attestation bundles were made for kreuzberg-2.1.2.tar.gz:

Publisher: release.yaml on Goldziher/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-2.1.2.tar.gz
- Subject digest: 4cee25446307aa5db191259c3ec02cde659a3a26f3f6b5d5f36be2fff7e0b552
- Sigstore transparency entry: 175757324
- Sigstore integration time: Mar 1, 2025
Source repository:
- Permalink: Goldziher/kreuzberg@6bb73340f6e7272363b2384e8245162e25d00e62
- Branch / Tag: refs/tags/v2.1.2
- Owner: https://github.com/Goldziher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@6bb73340f6e7272363b2384e8245162e25d00e62
- Trigger Event: release

File details

Details for the file kreuzberg-2.1.2-py3-none-any.whl.

File metadata

Download URL: kreuzberg-2.1.2-py3-none-any.whl
Upload date: Mar 1, 2025
Size: 27.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for kreuzberg-2.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`57c06a2338b04ea7d1a463da030f568d85143f8083d36e6b7251c4692d744b9a`
MD5	`1117cff814f104d553508082a51c3c76`
BLAKE2b-256	`f1f882fb9cdb4e7ff11345474a1c1eb5a04603c6eb60d4d0057c3b74b2b992a2`

Provenance

The following attestation bundles were made for kreuzberg-2.1.2-py3-none-any.whl:

Publisher: release.yaml on Goldziher/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg-2.1.2-py3-none-any.whl
- Subject digest: 57c06a2338b04ea7d1a463da030f568d85143f8083d36e6b7251c4692d744b9a
- Sigstore transparency entry: 175757326
- Sigstore integration time: Mar 1, 2025
Source repository:
- Permalink: Goldziher/kreuzberg@6bb73340f6e7272363b2384e8245162e25d00e62
- Branch / Tag: refs/tags/v2.1.2
- Owner: https://github.com/Goldziher
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@6bb73340f6e7272363b2384e8245162e25d00e62
- Trigger Event: release
