Kreuzberg document extraction pipeline for txtai and other callable-based frameworks
kreuzberg-txtai
A Kreuzberg-backed document extraction pipeline for txtai and any Python framework built around the __call__ convention.
KreuzbergPipeline replaces txtai's built-in Textractor (Apache Tika-based) with Kreuzberg's Rust-powered extraction stack, turning document paths into a list[dict] with content and metadata fields — surfacing title, MIME type, and page count that Tika flattens away.
Installation
pip install kreuzberg-txtai
For the txtai integration examples below:
pip install "kreuzberg-txtai[txtai]"
Requires Python 3.10+.
Quick Start
from kreuzberg_txtai import KreuzbergPipeline
pipeline = KreuzbergPipeline()
docs = pipeline(["doc1.pdf", "doc2.docx", "doc3.html"])
for doc in docs:
    print(doc["metadata"]["source"], "->", len(doc["content"]), "chars")
Each element in docs looks like:
{
    "content": "# Sample Document\n\nExtracted text...",
    "metadata": {
        "source": "doc1.pdf",
        "mime_type": "application/pdf",
        "title": "Sample Document",
        "page_count": 5,
    },
}
Features
- 88+ file formats: PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, plain text, and more via Kreuzberg
- Stable dict contract: every extraction returns content + metadata with the same four metadata keys, regardless of source format
- Rich metadata: source path, MIME type, title, and page count surface directly
- Batch support: pass a single path or a list[str]; output is always list[dict] in input order
- Full Kreuzberg control: pass an ExtractionConfig to drive output format, OCR backend/language, force_ocr, and every other Kreuzberg knob
- Framework-agnostic: txtai is an optional extra, not a hard dependency; the pipeline works in any framework that accepts a callable
- Typed: ships with a py.typed marker; full mypy strict compatibility
Usage Examples
RAG ingestion with txtai.Embeddings
The dominant real-world pattern — extract, index, search:
from kreuzberg_txtai import KreuzbergPipeline
from txtai import Embeddings
pipeline = KreuzbergPipeline()
docs = pipeline(["doc1.pdf", "doc2.docx", "doc3.html"])
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
})
embeddings.index([(i, doc["content"], None) for i, doc in enumerate(docs)])
results = embeddings.search("query", limit=5)
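The example above indexes only the extracted text. One way to keep the metadata alongside it is to index dict rows instead. The sketch below assumes txtai's convention that, with "content": True, a dict row is embedded from its "text" field and the remaining keys are stored as additional fields. The helper name to_txtai_rows is hypothetical and not part of kreuzberg-txtai, and the sample data is illustrative only.

```python
from typing import Any


def to_txtai_rows(docs: list[dict[str, Any]]) -> list[tuple[int, dict[str, Any], None]]:
    """Turn pipeline output into (id, data, tags) rows that carry metadata."""
    rows = []
    for i, doc in enumerate(docs):
        # "text" is the field to embed; the metadata keys ride along as extra fields.
        data = {"text": doc["content"], **doc["metadata"]}
        rows.append((i, data, None))
    return rows


# Sample data in the documented return shape (values are illustrative only).
sample_docs = [
    {
        "content": "# Sample Document\n\nExtracted text...",
        "metadata": {
            "source": "doc1.pdf",
            "mime_type": "application/pdf",
            "title": "Sample Document",
            "page_count": 5,
        },
    },
]

rows = to_txtai_rows(sample_docs)
```

The resulting rows could then be passed to embeddings.index(rows) in place of the tuple list above.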
Inside a txtai.workflow.Task
Task accepts any callable, so KreuzbergPipeline drops in without wrappers. Because the pipeline returns list[dict], downstream tasks that expect strings need a one-line adapter:
from txtai.workflow import Task, Workflow
from kreuzberg_txtai import KreuzbergPipeline
extract = KreuzbergPipeline()
wf = Workflow([
    Task(extract),
    Task(lambda docs: [d["content"] for d in docs]),  # flatten dicts -> strings
])
list(wf(["doc1.pdf", "doc2.pdf"]))
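The lambda can also be a named function when the adapter needs to do more than flatten, for example prepending the title so downstream tasks still see it. This is a minimal sketch: with_title is a hypothetical helper, not part of kreuzberg-txtai or txtai, and it keeps the output one-to-one with the input.

```python
from typing import Any


def with_title(docs: list[dict[str, Any]]) -> list[str]:
    """Flatten pipeline dicts to strings, prefixing the title when one exists."""
    texts = []
    for d in docs:
        title = d["metadata"].get("title")
        # Title may be None for formats that carry no title metadata.
        texts.append(f"{title}\n\n{d['content']}" if title else d["content"])
    return texts


# Illustrative input in the pipeline's return shape (metadata trimmed for brevity).
sample = [
    {"content": "Body text", "metadata": {"title": "Report"}},
    {"content": "Untitled body", "metadata": {"title": None}},
]
texts = with_title(sample)
```

Task(with_title) would then take the place of the lambda task in the workflow.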
Framework-free loop
from kreuzberg import ExtractionConfig
from kreuzberg_txtai import KreuzbergPipeline
pipeline = KreuzbergPipeline(config=ExtractionConfig(output_format="plain"))
for doc in pipeline(["scan1.pdf", "scan2.pdf"]):
    print(doc["metadata"]["source"], "->", len(doc["content"]), "chars")
No txtai needed — the class works on just the core kreuzberg dependency.
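Since the pipeline is just a callable, consuming code can be typed against the call shape instead of the concrete class, which also makes it easy to swap in a stub during tests. The sketch below illustrates that idea. Extractor, FakeExtractor, and total_chars are hypothetical names, and the Protocol signature (including the parameter name paths) is inferred from the usage shown here, not taken from the package.

```python
from __future__ import annotations

from typing import Any, Protocol


class Extractor(Protocol):
    """Anything callable with path(s) that returns content/metadata dicts."""

    def __call__(self, paths: str | list[str]) -> list[dict[str, Any]]: ...


class FakeExtractor:
    """Test stub that mimics the documented return shape without reading files."""

    def __call__(self, paths: str | list[str]) -> list[dict[str, Any]]:
        if isinstance(paths, str):
            paths = [paths]  # a single path still yields a length-1 list
        return [
            {
                "content": f"text of {p}",
                "metadata": {"source": p, "mime_type": None, "title": None, "page_count": None},
            }
            for p in paths
        ]


def total_chars(extract: Extractor, paths: list[str]) -> int:
    """Sum extracted content lengths; works with any Extractor-shaped callable."""
    return sum(len(doc["content"]) for doc in extract(paths))


count = total_chars(FakeExtractor(), ["a.pdf", "b.pdf"])
```

A function written like total_chars should accept a KreuzbergPipeline instance just as well as the stub, as long as the real call signature matches the assumption above.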
Tuning extraction with ExtractionConfig
Every Kreuzberg knob (output format, OCR backend and language, force_ocr, chunking, custom MIME handling) lives on ExtractionConfig. Build one and hand it to the pipeline:
from kreuzberg import ExtractionConfig, OcrConfig
from kreuzberg_txtai import KreuzbergPipeline
custom = ExtractionConfig(
    output_format="markdown",
    ocr=OcrConfig(backend="tesseract", language="eng+deu"),
    force_ocr=True,
)
pipeline = KreuzbergPipeline(config=custom)
docs = pipeline("scanned_report.pdf")
See the Kreuzberg docs for the full list of ExtractionConfig and OcrConfig fields.
Constructor
| Parameter | Type | Default | Notes |
|---|---|---|---|
| config | ExtractionConfig \| None | None | Drives output format, OCR settings, force_ocr, and every other Kreuzberg option. None falls back to Kreuzberg's defaults. |
Return Shape
__call__ always returns list[dict] — a single-path input still returns a length-1 list. Each dict has exactly two top-level keys:
- content: the extracted text in the format set by config.output_format (Kreuzberg's default when no config is passed)
- metadata: a dict with exactly four keys: source, mime_type, title, page_count
Missing metadata fields are None (rather than omitted) to keep the dict shape stable across document types.
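For type-checked consumers, the shape described above can be written down as a pair of TypedDicts. This is a sketch based only on the description in this section: the class names are hypothetical, the package may not export equivalent types, and the value types (str for source, int for page_count, and so on) are inferred from the example output, not confirmed.

```python
from __future__ import annotations

from typing import TypedDict


class DocMetadata(TypedDict):
    # All four keys are always present; a value is None when the field is missing.
    source: str | None
    mime_type: str | None
    title: str | None
    page_count: int | None


class ExtractedDoc(TypedDict):
    content: str
    metadata: DocMetadata


def describe(doc: ExtractedDoc) -> str:
    """Build a one-line summary, falling back when a metadata field is None."""
    meta = doc["metadata"]
    title = meta["title"] or meta["source"] or "<unknown>"
    pages = "?" if meta["page_count"] is None else str(meta["page_count"])
    return f"{title} ({pages} pages, {len(doc['content'])} chars)"


# Illustrative document for a format that has no title or page count.
doc: ExtractedDoc = {
    "content": "plain text",
    "metadata": {"source": "notes.txt", "mime_type": "text/plain", "title": None, "page_count": None},
}
summary = describe(doc)
```

Because the keys are always present, consumers can index them directly and only need to handle None values, which is the point of the stable dict shape.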
Related Projects
- kreuzberg — the extraction engine powering this package
- langchain-kreuzberg — Kreuzberg document loader for LangChain
- llama-index-kreuzberg — LlamaIndex reader and node parser
- kreuzberg-crewai — CrewAI agent tool
- kreuzberg-surrealdb — SurrealDB ingestion connector
License
MIT — see LICENSE.