
High-performance document intelligence library for Python. Extract text, metadata, and structured data from PDFs, Office documents, images, and 91+ formats. Powered by a Rust core for 10-50x speed improvements.


Extract text, tables, images, and metadata from 91+ file formats, including PDF, Office documents, and images, and analyze source code in 248 programming languages. Native Python bindings provide async/await support, multiple OCR backends (Tesseract, EasyOCR, PaddleOCR), and an extensible plugin system.

Installation

Package Installation

Install via pip:

pip install kreuzberg

For async support and additional features:

pip install "kreuzberg[async]"

System Requirements

  • Python 3.10+ required
  • Optional: ONNX Runtime version 1.22.x for embeddings support
  • Optional: Tesseract OCR for OCR functionality

Quick Start

Basic Extraction

Extract text, metadata, and structure from any supported document format:

import asyncio
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    config = ExtractionConfig(
        use_cache=True,
        enable_quality_processing=True
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())

Common Use Cases

Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

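With OCR (for scanned documents):

import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig

async def main() -> None:
    # Force OCR and select the Tesseract backend with English as the language.
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(backend="tesseract", language="eng"),
    )
    result = await extract_file("scanned.pdf", config=config)
    print(result.content)

asyncio.run(main())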

Table Extraction

import asyncio
from kreuzberg import extract_file

async def main() -> None:
    result = await extract_file("document.pdf")

    content: str = result.content
    tables: int = len(result.tables)
    format_type: str | None = result.metadata.format_type

    print(f"Content length: {len(content)} characters")
    print(f"Tables found: {tables}")
    print(f"Format: {format_type}")

asyncio.run(main())
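
Once tables are extracted, a common next step is rendering them for display. The example above only counts `result.tables` and does not show the shape of a table object, so this sketch works on plain rows instead: `rows_to_markdown` is a hypothetical helper that assumes you have already converted a table into a list of rows of strings, with the first row as the header.

```python
def rows_to_markdown(rows: list[list[str]]) -> str:
    """Render rows (first row = header) as a pipe-delimited table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of cells.
    padded = [row + [""] * (width - len(row)) for row in rows]
    header, *body = padded
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_markdown([["Name", "Qty"], ["Widget", "3"], ["Gadget"]]))
```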

Processing Multiple Files

import asyncio
from kreuzberg import extract_file

async def main() -> None:
    paths = ["report.pdf", "notes.docx", "slides.pptx"]

    # Start every extraction and wait for all of them to finish.
    results = await asyncio.gather(*(extract_file(path) for path in paths))

    for path, result in zip(paths, results):
        print(f"{path}: {len(result.content)} characters")

asyncio.run(main())
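
When processing many files at once, it can help to cap how many extractions run at the same time. The sketch below shows one way to do that with `asyncio.Semaphore`. The names `gather_bounded` and `fake_extract` are hypothetical: `fake_extract` is a stand-in so the example runs without Kreuzberg installed, and in real use you would pass a coroutine function that awaits `extract_file`.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def gather_bounded(
    paths: list[str],
    worker: Callable[[str], Awaitable[T]],
    limit: int = 4,
) -> list[T | BaseException]:
    """Run worker(path) for every path, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(path: str) -> T:
        async with semaphore:
            return await worker(path)

    # return_exceptions=True keeps one failing file from cancelling the rest.
    return await asyncio.gather(*(run_one(p) for p in paths), return_exceptions=True)


async def fake_extract(path: str) -> str:
    """Stand-in for a real extraction call (assumption, not Kreuzberg's API)."""
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError(f"cannot read {path}")
    return path.upper()


if __name__ == "__main__":
    print(asyncio.run(gather_bounded(["a.pdf", "b.bad", "c.docx"], fake_extract, limit=2)))
```

Results come back in the same order as `paths`, so they can be paired with `zip(paths, results)`.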

Async Processing

For non-blocking document processing:

import asyncio
from pathlib import Path
from kreuzberg import extract_file

async def main() -> None:
    file_path: Path = Path("document.pdf")

    result = await extract_file(file_path)

    print(f"Content: {result.content}")
    print(f"Format: {result.metadata.format_type}")
    print(f"Tables: {len(result.tables)}")

asyncio.run(main())


Features

Supported File Formats (91+)

91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

| Category | Formats | Capabilities |
| --- | --- | --- |
| Word Processing | .docx, .docm, .dotx, .dotm, .dot, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .pptm, .ppsx, .potx, .potm, .pot, .ppt | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
| Database | .dbf | Table data extraction, field type support |
| Hangul | .hwp, .hwpx | Korean document format, text extraction |

Images (OCR-Enabled)

| Category | Formats | Features |
| --- | --- | --- |
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

Web & Data

| Category | Formats | Features |
| --- | --- | --- |
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .rst, .org, .rtf | CommonMark, GFM, Djot, reStructuredText, Org Mode |

Email & Archives

| Category | Formats | Features |
| --- | --- | --- |
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |

Academic & Scientific

| Category | Formats | Features |
| --- | --- | --- |
| Citations | .bib, .biblatex, .ris, .nbib, .enw, .csl | Structured parsing: RIS (structured), PubMed/MEDLINE, EndNote XML (structured), BibTeX, CSL JSON |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |

Code Intelligence (248 Languages)

| Feature | Description |
| --- | --- |
| Structure Extraction | Functions, classes, methods, structs, interfaces, enums |
| Import/Export Analysis | Module dependencies, re-exports, wildcard imports |
| Symbol Extraction | Variables, constants, type aliases, properties |
| Docstring Parsing | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| Diagnostics | Parse errors with line/column positions |
| Syntax-Aware Chunking | Split code by semantic boundaries, not arbitrary byte offsets |

Powered by tree-sitter-language-pack; see its documentation.
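
To illustrate what splitting by semantic boundaries means, the sketch below chunks Python source at top-level function and class definitions. It uses the standard library `ast` module rather than tree-sitter, handles only Python, and is not Kreuzberg's implementation; `chunk_by_definition` is a hypothetical name.

```python
import ast


def chunk_by_definition(source: str) -> list[str]:
    """Return one chunk of source text per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Include decorators, which sit on lines above the def/class line.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            # lineno is 1-based and end_lineno is inclusive.
            chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
    return chunks


if __name__ == "__main__":
    sample = "import os\n\ndef a():\n    return 1\n\nclass B:\n    x = 2\n"
    for chunk in chunk_by_definition(sample):
        print(chunk)
        print("---")
```

Each chunk is a complete definition, so a function body is never cut in half the way a fixed byte-offset split could.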

Complete Format Reference

Key Capabilities

  • Text Extraction - Extract all text content with position and formatting information

  • Metadata Extraction - Retrieve document properties, creation date, author, etc.

  • Table Extraction - Parse tables with structure and cell content preservation

  • Image Extraction - Extract embedded images and render page previews

  • OCR Support - Integrate multiple OCR backends for scanned documents

  • Async/Await - Non-blocking document processing with concurrent operations

  • Plugin System - Extensible post-processing for custom text transformation

  • Embeddings - Generate vector embeddings using ONNX Runtime models

  • Batch Processing - Efficiently process multiple documents in parallel

  • Memory Efficient - Stream large files without loading entirely into memory

  • Language Detection - Detect and support multiple languages in documents

  • Code Intelligence - Extract structure, imports, exports, symbols, and docstrings from 248 programming languages via tree-sitter

  • Configuration - Fine-grained control over extraction behavior

Performance Characteristics

| Format | Speed | Memory | Notes |
| --- | --- | --- | --- |
| PDF (text) | 10-100 MB/s | ~50MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
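
The throughput ranges above translate into rough time estimates by dividing file size by speed. The helper below is a back-of-the-envelope sketch (`estimate_seconds` is a hypothetical name); real timings depend on hardware and document content.

```python
def estimate_seconds(size_mb: float, low_mb_s: float, high_mb_s: float) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds for a file of size_mb megabytes."""
    if size_mb < 0 or low_mb_s <= 0 or high_mb_s <= 0:
        raise ValueError("size must be non-negative and speeds must be positive")
    # The fastest rate gives the best case; the slowest rate gives the worst case.
    return size_mb / high_mb_s, size_mb / low_mb_s


if __name__ == "__main__":
    # 500 MB of text PDFs at the 10-100 MB/s range quoted above.
    best, worst = estimate_seconds(500, 10, 100)
    print(f"between {best:.0f} and {worst:.0f} seconds")
```

For 500 MB of text PDFs this gives between 5 and 50 seconds, while the same volume of OCR input at 1-5 MB/s would take between 100 and 500 seconds.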

OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

  • Tesseract

  • EasyOCR

  • PaddleOCR

OCR Configuration Example

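import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig

async def main() -> None:
    config = ExtractionConfig(
        ocr=OcrConfig(
            backend="tesseract",
            language="eng",
            tesseract_config=TesseractConfig(psm=3),  # Tesseract page segmentation mode
        )
    )
    result = await extract_file("scanned.pdf", config=config)
    print(result.content)

asyncio.run(main())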

Async Support

This binding provides full async/await support for non-blocking document processing:

import asyncio
from pathlib import Path
from kreuzberg import extract_file

async def main() -> None:
    file_path: Path = Path("document.pdf")

    result = await extract_file(file_path)

    print(f"Content: {result.content}")
    print(f"Format: {result.metadata.format_type}")
    print(f"Tables: {len(result.tables)}")

asyncio.run(main())

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, see the Plugin System Guide.
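
The idea of post-processing can be sketched without the plugin API, which this README does not show. The example below chains plain functions over extracted text; every name in it (`PostProcessor`, `apply_chain`, `collapse_whitespace`, `drop_page_markers`) is hypothetical, and it does not register anything with Kreuzberg. In practice the input would be something like `result.content`.

```python
import re
from collections.abc import Callable, Iterable

# A post-processor here is just a function from text to text (an assumption).
PostProcessor = Callable[[str], str]


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of spaces and tabs into a single space."""
    return re.sub(r"[ \t]+", " ", text).strip()


def drop_page_markers(text: str) -> str:
    """Remove lines that consist only of a marker such as 'Page 3'."""
    kept = [line for line in text.splitlines() if not re.fullmatch(r"\s*Page \d+\s*", line)]
    return "\n".join(kept)


def apply_chain(text: str, processors: Iterable[PostProcessor]) -> str:
    """Apply each processor in order, feeding the output of one into the next."""
    for processor in processors:
        text = processor(text)
    return text


if __name__ == "__main__":
    raw = "Hello   world\nPage 2\nBye"
    print(apply_chain(raw, [drop_page_markers, collapse_whitespace]))
```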

Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
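
A typical use of embeddings is comparing two pieces of text by the cosine similarity of their vectors. The sketch below shows that comparison on plain Python lists; it does not call Kreuzberg or ONNX Runtime, and the vectors are made-up placeholders rather than real model output.

```python
import math
from collections.abc import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; report 0.0 by convention here.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```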

Embeddings Guide

Batch Processing

Process multiple documents efficiently:

import asyncio
from pathlib import Path
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    config = ExtractionConfig(use_cache=True)
    paths = sorted(Path("documents").glob("*.pdf"))

    # Run the extractions concurrently; failures are returned instead of raised.
    results = await asyncio.gather(
        *(extract_file(path, config=config) for path in paths),
        return_exceptions=True,
    )
    for path, result in zip(paths, results):
        if isinstance(result, Exception):
            print(f"{path.name}: failed ({result})")
        else:
            print(f"{path.name}: {len(result.content)} characters")

asyncio.run(main())

Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

Configuration Guide

Documentation

Contributing

Contributions are welcome! See the Contributing Guide.

License

MIT License - see LICENSE file for details.

Support


Download files


Source Distribution

kreuzberg-4.9.3.tar.gz (2.4 MB view details)

Uploaded Source

Built Distributions


kreuzberg-4.9.3-cp310-abi3-win_amd64.whl (33.0 MB view details)

Uploaded CPython 3.10+Windows x86-64

kreuzberg-4.9.3-cp310-abi3-manylinux_2_28_x86_64.whl (35.5 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.28+ x86-64

kreuzberg-4.9.3-cp310-abi3-manylinux_2_28_aarch64.whl (33.2 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.28+ ARM64

kreuzberg-4.9.3-cp310-abi3-macosx_14_0_arm64.whl (28.7 MB view details)

Uploaded CPython 3.10+macOS 14.0+ ARM64

File details

Details for the file kreuzberg-4.9.3.tar.gz.

File metadata

  • Download URL: kreuzberg-4.9.3.tar.gz
  • Upload date:
  • Size: 2.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kreuzberg-4.9.3.tar.gz
Algorithm Hash digest
SHA256 24787bef2608016b2b742aeccc2d68c3e6f2e9ed08b7bfa0ddb529b5d31f4710
MD5 ab334e0cff76d7d591db751cf76f1543
BLAKE2b-256 1ee0235b1deb8eea72734f2f5728555371c3a248f58e5f806b960e40d043c1c1


Provenance

The following attestation bundles were made for kreuzberg-4.9.3.tar.gz:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kreuzberg-4.9.3-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: kreuzberg-4.9.3-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 33.0 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kreuzberg-4.9.3-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 4fcc4853e1eb98ee4e2d735f92cd8ad5b9c1706280e8dee267c91cdf1572f8ff
MD5 81dff0d189eee24eac96f60c6c50b5d1
BLAKE2b-256 6593e1755f5e1c6f6037422f2ca10e06873d0f20c0401cd25b78537273fda7b3


Provenance

The following attestation bundles were made for kreuzberg-4.9.3-cp310-abi3-win_amd64.whl:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kreuzberg-4.9.3-cp310-abi3-manylinux_2_28_x86_64.whl.


File hashes

Hashes for kreuzberg-4.9.3-cp310-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 d736b318eab0e14fe183484d24962e30d1edf450d535d4d7099501c4197fffd4
MD5 63f89773d2b0697ecb508e2ad1cd4d25
BLAKE2b-256 d5525867b6c771a162d454f912003a2a89cb937253108fd316ed0ef00d0848e9


Provenance

The following attestation bundles were made for kreuzberg-4.9.3-cp310-abi3-manylinux_2_28_x86_64.whl:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kreuzberg-4.9.3-cp310-abi3-manylinux_2_28_aarch64.whl.


File hashes

Hashes for kreuzberg-4.9.3-cp310-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 0bcc3a4d26a7ba25fc12e2089002fea1aa6ddea5d237f09d0b031da6a5869615
MD5 65a97c11a4e1098778f46ca7090b6f63
BLAKE2b-256 ad1694409321c11f192ad8892abb7bb8f5efe8518683aa3fdf89c9f952312af9


Provenance

The following attestation bundles were made for kreuzberg-4.9.3-cp310-abi3-manylinux_2_28_aarch64.whl:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kreuzberg-4.9.3-cp310-abi3-macosx_14_0_arm64.whl.


File hashes

Hashes for kreuzberg-4.9.3-cp310-abi3-macosx_14_0_arm64.whl
Algorithm Hash digest
SHA256 e9ee2a4e0c7e8b6ef0d0952c1ee32e47e9e0e0125bc185a88e74f94b7c1c8fd0
MD5 39b5e2644c2c713cd6bd42b14baf4580
BLAKE2b-256 8206c2d88380c78e203ab8e788a2fa2831acfca95aae4a28a633a9c4eed79dcd


Provenance

The following attestation bundles were made for kreuzberg-4.9.3-cp310-abi3-macosx_14_0_arm64.whl:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
