Kreuzberg document extraction pipeline for txtai and other callable-based frameworks

Maintainers

v-tan

kreuzberg-txtai

A Kreuzberg-backed document extraction pipeline for txtai and any Python framework built around the __call__ convention.

KreuzbergPipeline replaces txtai's built-in Textractor (Apache Tika-based) with Kreuzberg's Rust-powered extraction stack, turning document paths into a list[dict] with content and metadata fields — surfacing title, MIME type, and page count that Tika flattens away.

Installation

pip install kreuzberg-txtai

For the txtai integration examples below:

pip install "kreuzberg-txtai[txtai]"

Requires Python 3.10+.
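
To confirm the interpreter meets the 3.10 floor and to see whether the optional txtai extra is present, a quick standard-library check can help. This is a minimal sketch; the helper name `check_environment` is hypothetical and not part of kreuzberg-txtai:

```python
from __future__ import annotations

import importlib.util
import sys


def check_environment() -> dict[str, bool]:
    """Report whether the interpreter and the two packages look usable."""
    return {
        # The README states that Python 3.10+ is required.
        "python_ok": sys.version_info >= (3, 10),
        # find_spec returns None when a top-level module is not installed.
        "kreuzberg_txtai": importlib.util.find_spec("kreuzberg_txtai") is not None,
        "txtai": importlib.util.find_spec("txtai") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```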

Quick Start

from kreuzberg_txtai import KreuzbergPipeline

pipeline = KreuzbergPipeline()
docs = pipeline(["doc1.pdf", "doc2.docx", "doc3.html"])

for doc in docs:
    print(doc["metadata"]["source"], "->", len(doc["content"]), "chars")

Each element in docs looks like:

{
    "content": "# Sample Document\n\nExtracted text...",
    "metadata": {
        "source": "doc1.pdf",
        "mime_type": "application/pdf",
        "title": "Sample Document",
        "page_count": 5,
    },
}

Features

88+ file formats — PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, plain text, and more via Kreuzberg
Stable dict contract — every extraction returns content plus a metadata dict with the same four keys, regardless of source format
Rich metadata — source path, MIME type, title, and page count surface directly
Batch support — pass a single path or a list[str]; output is always list[dict] in input order
Full Kreuzberg control — pass an ExtractionConfig to drive output format, OCR backend/language, force_ocr, and every other Kreuzberg knob
Framework-agnostic — txtai is an optional extra, not a hard dependency; the pipeline works in any framework that accepts a callable
Typed — ships with a py.typed marker; full mypy strict compatibility
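
The batch contract listed above (one path or a list in, a list of dicts out, in input order) can be illustrated with a stand-in callable. This is a sketch under assumptions, not the package's implementation: `FakePipeline` and its `_extract_one` helper are hypothetical, and no real extraction takes place.

```python
from __future__ import annotations

from pathlib import Path


class FakePipeline:
    """Hypothetical stand-in that mimics the documented call contract."""

    def __call__(self, paths: str | list[str]) -> list[dict]:
        # A single path is wrapped so the output is always a list.
        batch = [paths] if isinstance(paths, str) else list(paths)
        # The list comprehension preserves input order.
        return [self._extract_one(p) for p in batch]

    def _extract_one(self, path: str) -> dict:
        # Placeholder values only; a real extractor would read the file.
        return {
            "content": f"(text of {Path(path).name})",
            "metadata": {
                "source": path,
                "mime_type": None,
                "title": None,
                "page_count": None,
            },
        }


pipeline = FakePipeline()
single = pipeline("a.pdf")
many = pipeline(["a.pdf", "b.docx"])
print(len(single), [d["metadata"]["source"] for d in many])
```

Any framework that accepts a callable can treat an object like this the same way it would treat a plain function.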

Usage Examples

RAG ingestion with `txtai.Embeddings`

The dominant real-world pattern — extract, index, search:

from kreuzberg_txtai import KreuzbergPipeline
from txtai import Embeddings

pipeline = KreuzbergPipeline()
docs = pipeline(["doc1.pdf", "doc2.docx", "doc3.html"])

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
})
embeddings.index([(i, doc["content"], None) for i, doc in enumerate(docs)])

results = embeddings.search("query", limit=5)
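
The indexing step above uses positional integers as ids. One variation is to key each row by its source path and to skip documents with no extracted text. This is a minimal sketch on sample data shaped like the documented output; `to_index_rows` is a hypothetical helper, and the `(id, text, tags)` tuple form mirrors the one passed to `embeddings.index` above.

```python
from __future__ import annotations


def to_index_rows(docs: list[dict]) -> list[tuple[str, str, None]]:
    """Build (id, text, tags) rows keyed by source path (hypothetical helper)."""
    rows = []
    for doc in docs:
        text = doc["content"]
        if not text or not text.strip():
            # Skip empty extractions so they do not end up in the index.
            continue
        rows.append((doc["metadata"]["source"], text, None))
    return rows


# Sample data in the documented shape; not real extraction output.
sample_docs = [
    {
        "content": "# Report\n\nBody",
        "metadata": {
            "source": "doc1.pdf",
            "mime_type": "application/pdf",
            "title": "Report",
            "page_count": 5,
        },
    },
    {
        "content": "   ",
        "metadata": {
            "source": "blank.pdf",
            "mime_type": "application/pdf",
            "title": None,
            "page_count": 1,
        },
    },
]
rows = to_index_rows(sample_docs)
```

The resulting rows could then be passed to `embeddings.index(rows)` in place of the enumerated list.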

Inside a `txtai.workflow.Task`

Task accepts any callable, so KreuzbergPipeline drops in without wrappers. Because the pipeline returns list[dict], downstream tasks that expect strings need a one-line adapter:

from txtai.workflow import Task, Workflow
from kreuzberg_txtai import KreuzbergPipeline

extract = KreuzbergPipeline()

wf = Workflow([
    Task(extract),
    Task(lambda docs: [d["content"] for d in docs]),  # flatten dicts -> strings
])

list(wf(["doc1.pdf", "doc2.pdf"]))
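
If the adapter grows beyond a single expression, a named function is easier to test than a lambda. A sketch under assumptions: `to_text` is a hypothetical adapter that prefixes the title when one is present, since `title` may be `None`.

```python
from __future__ import annotations


def to_text(docs: list[dict]) -> list[str]:
    """Flatten pipeline dicts to strings, prefixing the title when present."""
    texts = []
    for doc in docs:
        title = doc["metadata"].get("title")
        body = doc["content"]
        texts.append(f"{title}\n\n{body}" if title else body)
    return texts


# Sample data in the documented shape; not real extraction output.
sample = [
    {
        "content": "Body A",
        "metadata": {
            "source": "a.pdf",
            "mime_type": "application/pdf",
            "title": "Title A",
            "page_count": 2,
        },
    },
    {
        "content": "Body B",
        "metadata": {
            "source": "b.html",
            "mime_type": "text/html",
            "title": None,
            "page_count": None,
        },
    },
]
texts = to_text(sample)
```

In the workflow above it would stand in for the lambda as `Task(to_text)`.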

Framework-free loop

from kreuzberg import ExtractionConfig
from kreuzberg_txtai import KreuzbergPipeline

pipeline = KreuzbergPipeline(config=ExtractionConfig(output_format="plain"))
for doc in pipeline(["scan1.pdf", "scan2.pdf"]):
    print(doc["metadata"]["source"], "->", len(doc["content"]), "chars")

No txtai needed — the class works on just the core kreuzberg dependency.
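
Because the output is plain dicts, it also serializes with the standard library alone. The following is a minimal sketch that writes one JSON object per line; `write_jsonl` is a hypothetical helper and the sample data stands in for real pipeline output.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def write_jsonl(docs: list[dict], target: Path) -> int:
    """Write each document dict as one JSON line; return the count written."""
    with target.open("w", encoding="utf-8") as fh:
        for doc in docs:
            # None metadata values become JSON null, so the shape stays stable.
            fh.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return len(docs)


# Sample data in the documented shape; not real extraction output.
sample = [
    {
        "content": "First page\nSecond line",
        "metadata": {
            "source": "scan1.pdf",
            "mime_type": "application/pdf",
            "title": None,
            "page_count": 3,
        },
    },
    {
        "content": "Other text",
        "metadata": {
            "source": "scan2.pdf",
            "mime_type": "application/pdf",
            "title": "Scan 2",
            "page_count": 1,
        },
    },
]

out_path = Path(tempfile.mkdtemp()) / "docs.jsonl"
written = write_jsonl(sample, out_path)

# Read the file back to confirm the round trip preserves every field.
with out_path.open(encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh]
```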

Tuning extraction with `ExtractionConfig`

Every Kreuzberg knob — output format, OCR backend and language, force_ocr, chunking, custom MIME handling — lives on ExtractionConfig. Build one and hand it to the pipeline:

from kreuzberg import ExtractionConfig, OcrConfig
from kreuzberg_txtai import KreuzbergPipeline

custom = ExtractionConfig(
    output_format="markdown",
    ocr=OcrConfig(backend="tesseract", language="eng+deu"),
    force_ocr=True,
)

pipeline = KreuzbergPipeline(config=custom)
docs = pipeline("scanned_report.pdf")

See the Kreuzberg docs for the full list of ExtractionConfig and OcrConfig fields.

Constructor

Parameter	Type	Default	Notes
`config`	`ExtractionConfig | None`	`None`	Drives output format, OCR settings, `force_ocr`, and every other Kreuzberg option. `None` falls back to Kreuzberg's defaults.

Return Shape

__call__ always returns list[dict] — a single-path input still returns a length-1 list. Each dict has exactly two top-level keys:

content — the extracted text in the format set by config.output_format (Kreuzberg's default when no config is passed)
metadata — a dict with exactly four keys: source, mime_type, title, page_count

Missing metadata fields are None (rather than omitted) to keep the dict shape stable across document types.
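
The contract above can be written down as types and checked at runtime. A sketch under assumptions: the `TypedDict` names and `check_shape` are hypothetical, and the value types (`Optional[str]`, `Optional[int]`) are inferred from the example output rather than confirmed by the package.

```python
from __future__ import annotations

from typing import Optional, TypedDict


class Metadata(TypedDict):
    # Every key is always present; missing values are None.
    source: Optional[str]
    mime_type: Optional[str]
    title: Optional[str]
    page_count: Optional[int]


class ExtractedDoc(TypedDict):
    content: str
    metadata: Metadata


METADATA_KEYS = {"source", "mime_type", "title", "page_count"}


def check_shape(doc: dict) -> bool:
    """Return True when doc has exactly the documented keys."""
    return (
        set(doc) == {"content", "metadata"}
        and isinstance(doc["metadata"], dict)
        and set(doc["metadata"]) == METADATA_KEYS
    )


# Sample data in the documented shape; not real extraction output.
good: ExtractedDoc = {
    "content": "Extracted text...",
    "metadata": {
        "source": "doc1.pdf",
        "mime_type": "application/pdf",
        "title": None,
        "page_count": 5,
    },
}
# A dict whose metadata omits page_count breaks the contract.
bad = {"content": "x", "metadata": {"source": "doc1.pdf", "mime_type": None, "title": None}}

print(check_shape(good), check_shape(bad))
```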

Related Projects

kreuzberg — the extraction engine powering this package
langchain-kreuzberg — Kreuzberg document loader for LangChain
llama-index-kreuzberg — LlamaIndex reader and node parser
kreuzberg-crewai — CrewAI agent tool
kreuzberg-surrealdb — SurrealDB ingestion connector

License

MIT — see LICENSE.

Release history

This version

0.1.0

Apr 16, 2026

Download files

Source Distribution

kreuzberg_txtai-0.1.0.tar.gz (6.5 kB)

Uploaded Apr 16, 2026 Source

Built Distribution

kreuzberg_txtai-0.1.0-py3-none-any.whl (6.6 kB)

Uploaded Apr 16, 2026 Python 3

File details

Details for the file kreuzberg_txtai-0.1.0.tar.gz.

File metadata

Download URL: kreuzberg_txtai-0.1.0.tar.gz
Upload date: Apr 16, 2026
Size: 6.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for kreuzberg_txtai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`886c194e4762205c90353d382d9f7725123e701b84169c7e7a88f97c7ef7e4c1`
MD5	`30b00100c88e45e4410b47bfb0cd651b`
BLAKE2b-256	`a4d7a7de0a2f3b89bb72156ad008646018394aa50f2e519dceb858a1030c7dea`

Provenance

The following attestation bundles were made for kreuzberg_txtai-0.1.0.tar.gz:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg-txtai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg_txtai-0.1.0.tar.gz
- Subject digest: 886c194e4762205c90353d382d9f7725123e701b84169c7e7a88f97c7ef7e4c1
- Sigstore transparency entry: 1319720815
- Sigstore integration time: Apr 16, 2026
Source repository:
- Permalink: kreuzberg-dev/kreuzberg-txtai@a5656dcfbd0c9a61631da01b95a8ed4297096ef8
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/kreuzberg-dev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@a5656dcfbd0c9a61631da01b95a8ed4297096ef8
- Trigger Event: push

File details

Details for the file kreuzberg_txtai-0.1.0-py3-none-any.whl.

File metadata

Download URL: kreuzberg_txtai-0.1.0-py3-none-any.whl
Upload date: Apr 16, 2026
Size: 6.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for kreuzberg_txtai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8e91f6a85c204d3845ae3961ecb45c3167f91b4280ba90b40d91e0603883e7b7`
MD5	`822e2f0e0fa52ec1935381a3677ba3f5`
BLAKE2b-256	`c1526432f2cea53b4daa130bacb8eb58de749fe034197b25a62bcef225add1ae`

Provenance

The following attestation bundles were made for kreuzberg_txtai-0.1.0-py3-none-any.whl:

Publisher: publish.yaml on kreuzberg-dev/kreuzberg-txtai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kreuzberg_txtai-0.1.0-py3-none-any.whl
- Subject digest: 8e91f6a85c204d3845ae3961ecb45c3167f91b4280ba90b40d91e0603883e7b7
- Sigstore transparency entry: 1319720919
- Sigstore integration time: Apr 16, 2026
Source repository:
- Permalink: kreuzberg-dev/kreuzberg-txtai@a5656dcfbd0c9a61631da01b95a8ed4297096ef8
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/kreuzberg-dev
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@a5656dcfbd0c9a61631da01b95a8ed4297096ef8
- Trigger Event: push
