
Kreuzberg document extraction pipeline for txtai and other callable-based frameworks


kreuzberg-txtai


A Kreuzberg-backed document extraction pipeline for txtai and any Python framework built around the __call__ convention.

KreuzbergPipeline replaces txtai's built-in Textractor (Apache Tika-based) with Kreuzberg's Rust-powered extraction stack, turning document paths into a list[dict] with content and metadata fields — surfacing title, MIME type, and page count that Tika flattens away.

Installation

pip install kreuzberg-txtai

For the txtai integration examples below:

pip install "kreuzberg-txtai[txtai]"

Requires Python 3.10+.

Quick Start

from kreuzberg_txtai import KreuzbergPipeline

pipeline = KreuzbergPipeline()
docs = pipeline(["doc1.pdf", "doc2.docx", "doc3.html"])

for doc in docs:
    print(doc["metadata"]["source"], "->", len(doc["content"]), "chars")

Each element in docs looks like:

{
    "content": "# Sample Document\n\nExtracted text...",
    "metadata": {
        "source": "doc1.pdf",
        "mime_type": "application/pdf",
        "title": "Sample Document",
        "page_count": 5,
    },
}
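
As a rough sketch, the shape above can be written down as types plus a small runtime check. The names ExtractedMetadata, ExtractedDoc, and has_expected_shape are hypothetical and used only for illustration; the package is not confirmed to export them, and typing source as a plain str is an assumption.

```python
from typing import Any, Optional, TypedDict


# Hypothetical type names for illustration; not confirmed exports of kreuzberg_txtai.
class ExtractedMetadata(TypedDict):
    source: str  # assumption: the input path is always present
    mime_type: Optional[str]
    title: Optional[str]
    page_count: Optional[int]


class ExtractedDoc(TypedDict):
    content: str
    metadata: ExtractedMetadata


METADATA_KEYS = {"source", "mime_type", "title", "page_count"}


def has_expected_shape(doc: Any) -> bool:
    """Return True if doc has exactly the documented top-level and metadata keys."""
    if not isinstance(doc, dict) or set(doc) != {"content", "metadata"}:
        return False
    metadata = doc["metadata"]
    return isinstance(metadata, dict) and set(metadata) == METADATA_KEYS


sample: ExtractedDoc = {
    "content": "# Sample Document\n\nExtracted text...",
    "metadata": {
        "source": "doc1.pdf",
        "mime_type": "application/pdf",
        "title": "Sample Document",
        "page_count": 5,
    },
}

print(has_expected_shape(sample))
```

A check like this only inspects key names, not value types, so it is best treated as a cheap guard in tests rather than full validation.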

Features

  • 88+ file formats — PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, plain text, and more via Kreuzberg
  • Stable dict contract — every extraction returns content plus a metadata dict that always has the same four keys, regardless of source format
  • Rich metadata — source path, MIME type, title, and page count surface directly
  • Batch support — pass a single path or a list[str]; output is always list[dict] in input order
  • Full Kreuzberg control — pass an ExtractionConfig to drive output format, OCR backend/language, force_ocr, and every other Kreuzberg knob
  • Framework-agnostic — txtai is an optional extra, not a hard dependency; the pipeline works in any framework that accepts a callable
  • Typed — ships with a py.typed marker; full mypy strict compatibility
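
Because the contract is just "a callable that takes a path or a list of paths and returns list[dict] in input order", downstream code can be unit-tested against a stand-in that never touches real files. The sketch below is such a test double. FakePipeline is a hypothetical name, it is not how KreuzbergPipeline is implemented, and it guesses the MIME type from the file extension with the standard library instead of extracting anything.

```python
import mimetypes
from typing import Union


class FakePipeline:
    """Test double that mimics the documented contract; performs no real extraction."""

    def __call__(self, paths: Union[str, list[str]]) -> list[dict]:
        # A single path is wrapped so the output is always a list.
        if isinstance(paths, str):
            paths = [paths]
        # A list comprehension preserves input order.
        return [self._fake_extract(path) for path in paths]

    @staticmethod
    def _fake_extract(path: str) -> dict:
        # Guess from the extension only; the result may be None for unknown types.
        mime_type, _ = mimetypes.guess_type(path)
        return {
            "content": f"placeholder text for {path}",
            "metadata": {
                "source": path,
                "mime_type": mime_type,
                "title": None,
                "page_count": None,
            },
        }


fake = FakePipeline()
for doc in fake(["doc1.pdf", "doc2.docx"]):
    print(doc["metadata"]["source"], "->", len(doc["content"]), "chars")
```

Any code written against this stand-in should accept the real pipeline unchanged, as long as it relies only on the two top-level keys and the four metadata keys.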

Usage Examples

RAG ingestion with txtai.Embeddings

The typical pattern is extract, index, search:

from kreuzberg_txtai import KreuzbergPipeline
from txtai import Embeddings

pipeline = KreuzbergPipeline()
docs = pipeline(["doc1.pdf", "doc2.docx", "doc3.html"])

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
})
embeddings.index([(i, doc["content"], None) for i, doc in enumerate(docs)])

results = embeddings.search("query", limit=5)
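
The example above indexes only the text. One possible way to keep the metadata next to it is to index dicts instead of strings. The helper below is a sketch, not part of kreuzberg-txtai: to_index_rows is a hypothetical name, and it assumes that txtai, with content storage enabled, accepts a dict whose "text" key holds the text to embed and stores the remaining keys as extra fields. Check that assumption against the txtai version you use.

```python
def to_index_rows(docs: list[dict]) -> list[tuple]:
    """Build (id, data, tags) rows that carry metadata alongside the text."""
    rows = []
    for i, doc in enumerate(docs):
        # Drop None values so absent fields are simply not stored.
        fields = {k: v for k, v in doc["metadata"].items() if v is not None}
        # Assumption: txtai embeds the "text" key and stores the other keys as fields.
        rows.append((i, {"text": doc["content"], **fields}, None))
    return rows


docs = [
    {
        "content": "Quarterly results...",
        "metadata": {
            "source": "doc1.pdf",
            "mime_type": "application/pdf",
            "title": None,
            "page_count": 5,
        },
    },
]

for row in to_index_rows(docs):
    print(row)
```

If the assumption holds, the rows can be passed to embeddings.index(...) in place of the list comprehension shown earlier.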

Inside a txtai.workflow.Task

Task accepts any callable, so KreuzbergPipeline drops in without wrappers. Because the pipeline returns list[dict], downstream tasks that expect strings need a one-line adapter:

from txtai.workflow import Task, Workflow
from kreuzberg_txtai import KreuzbergPipeline

extract = KreuzbergPipeline()

wf = Workflow([
    Task(extract),
    Task(lambda docs: [d["content"] for d in docs]),  # flatten dicts -> strings
])

list(wf(["doc1.pdf", "doc2.pdf"]))

Framework-free loop

from kreuzberg import ExtractionConfig
from kreuzberg_txtai import KreuzbergPipeline

pipeline = KreuzbergPipeline(config=ExtractionConfig(output_format="plain"))
for doc in pipeline(["scan1.pdf", "scan2.pdf"]):
    print(doc["metadata"]["source"], "->", len(doc["content"]), "chars")

No txtai needed — the class works on just the core kreuzberg dependency.
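
Since each result is a plain dict, the output of such a loop can be persisted with nothing but the standard library. The sketch below writes one JSON object per line (JSONL). write_jsonl is a hypothetical helper, and the sample docs list is hand-written to match the documented shape rather than produced by a real extraction.

```python
import io
import json
from typing import IO, Iterable


def write_jsonl(docs: Iterable[dict], out: IO[str]) -> int:
    """Write each extracted document as one JSON line; return the number written."""
    count = 0
    for doc in docs:
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        out.write(json.dumps(doc, ensure_ascii=False) + "\n")
        count += 1
    return count


# Hand-written stand-in for the pipeline output.
docs = [
    {
        "content": "Extracted text...",
        "metadata": {
            "source": "scan1.pdf",
            "mime_type": "application/pdf",
            "title": None,
            "page_count": 3,
        },
    },
]

buffer = io.StringIO()
written = write_jsonl(docs, buffer)
print(written, "document(s) written")
```

In real use, out would be a file opened with open(path, "w", encoding="utf-8"), and docs would be the list returned by the pipeline.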

Tuning extraction with ExtractionConfig

Every Kreuzberg knob — output format, OCR backend and language, force_ocr, chunking, custom MIME handling — lives on ExtractionConfig. Build one and hand it to the pipeline:

from kreuzberg import ExtractionConfig, OcrConfig
from kreuzberg_txtai import KreuzbergPipeline

custom = ExtractionConfig(
    output_format="markdown",
    ocr=OcrConfig(backend="tesseract", language="eng+deu"),
    force_ocr=True,
)

pipeline = KreuzbergPipeline(config=custom)
docs = pipeline("scanned_report.pdf")

See the Kreuzberg docs for the full list of ExtractionConfig and OcrConfig fields.

Constructor

  • Parameter: config
  • Type: ExtractionConfig | None
  • Default: None
  • Notes: Drives output format, OCR settings, force_ocr, and every other Kreuzberg option. None falls back to Kreuzberg's defaults.

Return Shape

__call__ always returns list[dict] — a single-path input still returns a length-1 list. Each dict has exactly two top-level keys:

  • content — the extracted text in the format set by config.output_format (Kreuzberg's default when no config is passed)
  • metadata — a dict with exactly four keys: source, mime_type, title, page_count

Missing metadata fields are None (rather than omitted) to keep the dict shape stable across document types.
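
Because the keys are always present, consumers can index into metadata directly and only need to handle None values. A minimal sketch with two hypothetical helpers, display_title and total_pages, and hand-written sample data:

```python
from pathlib import Path


def display_title(doc: dict) -> str:
    """Return the document title, falling back to the file name when title is None."""
    meta = doc["metadata"]
    return meta["title"] or Path(meta["source"]).name


def total_pages(docs: list[dict]) -> int:
    """Sum page counts, treating a missing (None) page_count as zero."""
    return sum(doc["metadata"]["page_count"] or 0 for doc in docs)


# Hand-written stand-ins for pipeline output.
docs = [
    {
        "content": "...",
        "metadata": {
            "source": "reports/q1.pdf",
            "mime_type": "application/pdf",
            "title": None,
            "page_count": 5,
        },
    },
    {
        "content": "...",
        "metadata": {
            "source": "notes.html",
            "mime_type": "text/html",
            "title": "Notes",
            "page_count": None,
        },
    },
]

print([display_title(doc) for doc in docs])
print(total_pages(docs))
```

No key-existence checks are needed here; only the None fallback matters, which is the practical benefit of the stable shape.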


License

MIT — see LICENSE.
