xrag
RAG ingest + retrieval engine for documents with text and tables. Section/table-aware chunking, hybrid retrieval, optional Cohere reranking, optional LLM-based enrichment, and end-to-end Q&A generation. Works as both a CLI and an installable Python library.
xrag is designed for teams that want one Python library for:
- ingesting real documents into a retrieval system
- indexing pre-chunked offline data
- retrieving typed chunks for downstream LLM apps
- running a complete retrieve-and-answer flow with one client
Install
pip install pyxrag
# or with optional extras:
pip install "pyxrag[rerank-cohere,cli]"
The distribution name on PyPI is pyxrag; the import name is xrag:
import xrag
For local development from source:
git clone https://github.com/henryle97/xrag.git
cd xrag
uv sync --group dev
30-Second Quickstart
If you want the main library path, start with Xrag:
from xrag import Xrag
client = Xrag(
qdrant_path="data/dev/xrag_quickstart",
generation="openai:gpt-4.1-nano",
)
doc = await client.documents.ingest("annual-report.docx", collection="ir_docs")
result = await client.rag.ask(
query="What is this document about?",
collection="ir_docs",
)
print(doc.id)
print(result.answer)
See the runnable example: examples/01_quickstart.py
For the full Python guide, see docs/library.md.
Choose Your Path
xrag has three main entrypoints:
- Xrag: the higher-level async client for application code
- run_ingest/run_rag: the lower-level functional API for callers that want direct control over AppConfig and pipeline wiring
- CLI: local workflows, debugging, artifact inspection, and evaluation runs
Use Xrag when you want the application-facing library surface.
Use the functional API when you already work in terms of AppConfig.
Use the CLI when you are operating the pipeline directly from the shell.
Using Xrag
Create one client and reuse it across operations:
from xrag import Xrag
client = Xrag(
qdrant_url="http://localhost:6333",
embedding="openai:text-embedding-3-small",
generation="openai:gpt-4.1-nano",
)
Ingest A Document
Parse, chunk, embed, and index a source document:
doc = await client.documents.ingest(
"annual-report.docx",
collection="ir_docs",
)
print(doc.id)
Runnable example: examples/01_quickstart.py
Index From Offline Chunks
If you already have chunks prepared offline, skip parsing and index them directly:
doc = await client.documents.index_chunks(
"data/chunks.json",
collection="ir_docs",
name="f1-offline-chunks",
)
print(doc.id)
Runnable examples: examples/05_index_chunks.py and examples/05b_index_chunks_from_json.py
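Before calling index_chunks, it can help to sanity-check the offline chunks file. The authoritative schema is xrag's own chunk model (see docs/chunker.md); the sketch below assumes only a minimal shape, with id and text fields, purely for illustration:

```python
import json

# Assumed minimal chunk shape for illustration; xrag's real chunk schema
# (see docs/chunker.md) defines the authoritative field set.
REQUIRED_KEYS = {"id", "text"}

def validate_chunks(raw: str) -> list[dict]:
    """Parse a chunks JSON payload and check each record's minimal shape."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("chunks file must be a JSON array")
    for i, chunk in enumerate(data):
        if not isinstance(chunk, dict):
            raise ValueError(f"chunk {i} is not an object")
        missing = REQUIRED_KEYS - chunk.keys()
        if missing:
            raise ValueError(f"chunk {i} missing keys: {sorted(missing)}")
    return data
```

A check like this fails fast on malformed exports before any embedding calls are made.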
Retrieve Chunks
If you only want retrieval and already have your own LLM stack:
result = await client.retrievals.search(
query="What was revenue?",
collection="ir_docs",
top_k=5,
)
for chunk in result.chunks:
print(chunk.document_id, chunk.score, chunk.text[:120])
Runnable example: examples/03_retrieve_only.py
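If you feed retrieved chunks into your own LLM stack, a typical next step is packing them into a prompt context under a size budget. This is a generic sketch, not xrag API; the Chunk fields simply mirror the attributes printed in the loop above:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # Stand-in with the same attributes the retrieval loop prints;
    # not xrag's actual chunk type.
    document_id: str
    score: float
    text: str

def build_context(chunks: list[Chunk], max_chars: int = 2000) -> str:
    """Pack retrieved chunks into a prompt context, best-scored first."""
    parts: list[str] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        snippet = chunk.text.strip()
        if used + len(snippet) > max_chars:
            break  # stop once the character budget is exhausted
        parts.append(f"[{chunk.document_id}] {snippet}")
        used += len(snippet)
    return "\n\n".join(parts)
```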
Ask A Question
Retrieve context and generate an answer:
result = await client.rag.ask(
query="What was revenue?",
collection="ir_docs",
)
print(result.answer)
Runnable example: examples/01_quickstart.py
Example Gallery
- examples/01_quickstart.py: Ingest one document and ask one question.
- examples/02_from_env.py: Build the client from environment variables.
- examples/03_retrieve_only.py: Use xrag for retrieval only.
- examples/04_multi_tenant.py: Scope operations by tenant with client.for_tenant(...).
- examples/05_index_chunks.py: Index pre-built chunks from Python objects.
- examples/05b_index_chunks_from_json.py: Re-index a chunks JSON file from an offline pipeline run.
- examples/06_export_and_validate.py: Export, validate, and round-trip chunks.
- examples/06b_export_chunks.py: Export chunks from an indexed collection.
- examples/07_error_handling.py: Handle typed xrag exceptions.
- examples/08_per_call_overrides.py: Override ingest and retrieval behavior per call.
- examples/09_parse_docx_api_vs_local.py: Compare hosted and local DOCX parsing on the same file.
Configuring Xrag
Xrag(...) is the main application-facing configuration surface. In
practice, most users set storage, embedding, and optionally generation,
then reuse the same client for ingest and query operations.
Full Example
This shows the full constructor surface:
from xrag import Xrag
from xrag.config.models import RetrievalConfig
client = Xrag(
qdrant_url="http://localhost:6333",
qdrant_path=None,
embedding="openai:text-embedding-3-small",
generation="openai:gpt-4.1-nano",
enrichment=("auto_keywords", "auto_questions"),
reranker=None,
parser="unstructured",
chunker="section_table",
retrieval_defaults=RetrievalConfig(
provider="hybrid",
options={
"top_k": 10,
"bm25_candidates": 50,
"vector_candidates": 50,
"rrf_k": 60,
"dedup_family": True,
},
),
upload_dir="/tmp/xrag_uploads",
artifacts_dir=None,
timeout_s=30.0,
ingest_timeout_s=600.0,
tracing=None,
)
Parameter Reference
Storage
- qdrant_url: str | None = None
  URL of a running Qdrant server. Use this for normal server-backed deployments.
- qdrant_path: str | Path | None = None
  Local filesystem path for Qdrant local mode. Use this for local development or single-machine experiments.
You must set exactly one of qdrant_url or qdrant_path.
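That exclusive-or constraint can be sketched as a small validator. This is illustrative only; the real check lives inside the Xrag constructor:

```python
def resolve_storage(qdrant_url=None, qdrant_path=None):
    """Enforce that exactly one Qdrant storage target is configured.

    Illustrative only; the actual validation happens inside Xrag(...).
    """
    # Both None or both set is invalid: the truth values must differ.
    if (qdrant_url is None) == (qdrant_path is None):
        raise ValueError("set exactly one of qdrant_url or qdrant_path")
    return ("url", qdrant_url) if qdrant_url is not None else ("path", qdrant_path)
```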
Core Model Configuration
- embedding: str = "openai:text-embedding-3-small"
  Embedding provider string. Required for indexing and retrieval. Example value: openai:text-embedding-3-small
- generation: str | None = None
  Generation provider string for client.rag.ask(...). Omit this for retrieve-only usage. Example value: openai:gpt-4.1-nano
- reranker: str | None = None
  Optional reranker configuration. The client only uses it on calls where reranking is enabled. Example value: cohere:rerank-v3.5
- tracing: str | None = None
  Optional tracing backend. Example value: langsmith:my-project
Ingest Configuration
- parser: str = "unstructured"
  Parser configuration used by client.documents.ingest(...). Default is the hosted Unstructured API parser. Use unstructured_local:fast for lightweight local parsing. Add xrag[parser-unstructured-local-pdf] only when you want local PDF/OCR parsing.
- chunker: str = "section_table"
  Chunking strategy used during ingest. Default is the section/table-aware chunker.
- enrichment: tuple[str, ...] | None = ("auto_keywords", "auto_questions")
  Ingest-time enrichment stages. Set None or () to disable enrichment. Available stage names currently include: auto_keywords, auto_questions, table_context, table_summary
- upload_dir: str | Path = "/tmp/xrag_uploads"
  Local spool directory for byte inputs and file-like inputs passed to documents.ingest(...).
- artifacts_dir: str | Path | None = None
  Base directory for ingest artifacts. When unset, xrag creates an ephemeral artifacts directory per ingest call.
Retrieval Configuration
- retrieval_defaults: RetrievalConfig | None = None
  Default retrieval settings used by client.retrievals.search(...) and client.rag.ask(...). If unset, xrag uses:
RetrievalConfig(
provider="hybrid",
options={
"top_k": 10,
"bm25_candidates": 50,
"vector_candidates": 50,
"rrf_k": 60,
"dedup_family": True,
},
)
By default, reranking is still off at call time unless you explicitly
pass rerank=True.
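The hybrid provider's options (bm25_candidates, vector_candidates, rrf_k) suggest reciprocal rank fusion over the two candidate lists. A generic RRF sketch, assuming the standard formula where each document scores the sum of 1/(rrf_k + rank) over the lists it appears in; this is not xrag's actual implementation:

```python
def rrf_fuse(
    bm25_ids: list[str],
    vector_ids: list[str],
    rrf_k: int = 60,
    top_k: int = 10,
) -> list[str]:
    """Fuse two ranked candidate lists with reciprocal rank fusion.

    Each document scores sum(1 / (rrf_k + rank)) over the lists it appears
    in, so agreement between BM25 and vector retrieval pushes it upward.
    """
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Sort by fused score (descending), breaking ties deterministically by id.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_k]
```

A document ranked moderately by both retrievers ("b" below) outscores one ranked highly by only a single retriever.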
Timeouts
- timeout_s: float = 30.0
  General request timeout for non-ingest operations.
- ingest_timeout_s: float = 600.0
  Longer timeout budget for ingest operations.
What Most Users Actually Change
For most applications, the parameters that matter most are:
- qdrant_url or qdrant_path
- embedding
- generation if you use rag.ask(...)
- reranker if you want reranked retrieval
- enrichment if you want to disable or change ingest-time enrichment
- artifacts_dir if you want stable ingest artifacts on disk
Environment-Based Setup
If you prefer environment variables:
from xrag import Xrag
client = Xrag.from_env()
Useful environment variables:
- XRAG_QDRANT_URL
- XRAG_QDRANT_PATH
- XRAG_DEFAULT_EMBEDDING
- XRAG_DEFAULT_GENERATION
- XRAG_DEFAULT_RERANKER
- XRAG_UPLOAD_DIR
- XRAG_ARTIFACTS_DIR
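A rough sketch of what an env-based builder reads. The variable names come from the list above; the defaults mirror the parameter reference, but the precedence and validation here are assumptions, not Xrag.from_env()'s exact logic:

```python
import os

def config_from_env(env=os.environ) -> dict:
    """Collect XRAG_* settings into a plain config mapping.

    Defaults are taken from the Xrag parameter reference; the real
    Xrag.from_env() may apply different precedence or validation.
    """
    return {
        "qdrant_url": env.get("XRAG_QDRANT_URL"),
        "qdrant_path": env.get("XRAG_QDRANT_PATH"),
        "embedding": env.get("XRAG_DEFAULT_EMBEDDING", "openai:text-embedding-3-small"),
        "generation": env.get("XRAG_DEFAULT_GENERATION"),
        "reranker": env.get("XRAG_DEFAULT_RERANKER"),
        "upload_dir": env.get("XRAG_UPLOAD_DIR", "/tmp/xrag_uploads"),
        "artifacts_dir": env.get("XRAG_ARTIFACTS_DIR"),
    }
```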
Docs Map
Detailed docs live under docs/.
Recommended path:
- Read docs/README.md for library-vs-CLI doc navigation.
- Read docs/library.md for the dedicated Python library guide.
- Use docs/command.md for exact CLI commands.
- Use docs/chunker.md for chunking strategies and output schema.
- Use docs/enrichment.md for indexing-time text enrichment.
- Use docs/rag-baseline.md for the simple baseline RAG flow.
- Use docs/sub-plans/xrag-public-api.md for the high-level Python client design.
- Use docs/plans.md and docs/sub-plans/ for roadmap and implementation details.
Quickstart — library (v0.1, functional API)
from pathlib import Path
from xrag import (
AppConfig, IngestRequest,
load_config_yaml, run_ingest, run_rag,
)
cfg: AppConfig = load_config_yaml("configs/rag-ir-mvp3-all-docs.yml")
# Ingest
ingest_result = run_ingest(IngestRequest(config=cfg, output_dir=Path("./data/dev/ir_document")))
# Query
result = run_rag("What was Cash as of December 31, 2024?", cfg)
print(result.answer)
The high-level Xrag async client (client.documents.ingest(...), client.retrievals.search(...), client.rag.ask(...)) ships in v0.2 — see docs/sub-plans/xrag-public-api.md.
Quickstart — CLI
End-to-end pipeline:
source document
-> convert
-> parser
-> parser preprocess
-> chunk prepare
-> indexing enrichment (optional)
-> rag baseline OR indexing + query
-> eval create-dataset / eval run
Common commands:
uv run python -m xrag.cli convert --help
uv run python -m xrag.cli parser --help
uv run python -m xrag.cli parser preprocess --help
uv run python -m xrag.cli chunk prepare --help
uv run python -m xrag.cli rag baseline --help
uv run python -m xrag.cli --help
Smoke flow:
# Parse
uv run python -m xrag.cli parser \
--config configs/baseline.yml \
--output-dir data/dev/ir_document \
--pretty
# Preprocess
uv run python -m xrag.cli parser preprocess \
--input data/dev/ir_document/parsed/unstructured_elements.json \
--output-dir data/dev/ir_document/normalized
# Chunk
uv run python -m xrag.cli chunk prepare \
--input data/dev/ir_document/normalized/elements.json \
--config configs/chunker.yml \
--output-dir data/dev/ir_document/chunked
# Baseline RAG
uv run python -m xrag.cli rag baseline \
--input data/dev/ir_document/chunked/chunks.json \
--question "What was Cash as of December 31, 2024?" \
--config configs/rag-simple-baseline.yml \
--pretty
# Preview enrichment without embedding/vector-store writes
uv run python scripts/enrichment/preview.py \
--input data/dev/ir_document/chunked/chunks.json \
--config configs/enrichment/table-context.yml
Setup
Install dependencies:
uv sync
Environment for the Unstructured parser:
UNSTRUCTURED_API_KEY=...
UNSTRUCTURED_API_URL=... # optional
For DOCX-to-PDF conversion, install LibreOffice so that soffice / libreoffice is on PATH.
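A small helper can verify that conversion prerequisite up front. This is a generic PATH probe, not part of xrag:

```python
import shutil

def find_soffice(which=shutil.which):
    """Return the first LibreOffice launcher found on PATH, else None.

    `which` is injectable so the probe can be tested without LibreOffice.
    """
    for name in ("soffice", "libreoffice"):
        path = which(name)
        if path:
            return path
    return None
```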
Phasing
| Version | Surface |
|---|---|
| v0.1 (current) | Functional API: run_ingest, run_rag, Pydantic models, configs, loaders. |
| v0.2 | High-level Xrag async client with resource-namespaced ops (documents.*, retrievals.*, rag.*). Tenant binding via for_tenant. Typed error hierarchy. |
| v0.3 | XragSync mirror. |
| v0.4 | Per-call config_overrides, batch ingest, persistent registry contract. |
| v0.5+ (backlog) | Streaming surfaces (documents.ingest_stream, rag.ask_stream). |
Full design: docs/sub-plans/xrag-public-api.md.
Optional dependency extras
- xrag[parser-unstructured-local-pdf]: local PDF/image/OCR parsing with Unstructured
- xrag[rerank-cohere]: Cohere v3.5 reranker
- xrag[cli]: Typer-based CLI (rag-cli entry point)
- xrag[chroma]: Chroma backend (alternative to Qdrant)
- xrag[all]: everything
The base xrag install includes the hosted Unstructured API path, parser
preprocess support, and lightweight local parsing such as DOCX/text/HTML.
Add xrag[parser-unstructured-local-pdf] only if you want local PDF/image/OCR
parsing on the same machine.
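Extras like these are usually guarded with a lazy import that points the user at the right pip command when the dependency is missing. A generic sketch of that pattern, not xrag's actual code:

```python
import importlib

def require_extra(module: str, extra: str):
    """Import an optional dependency, failing with install guidance.

    For example, require_extra("cohere", "rerank-cohere") could guard the
    reranker path (module and extra names per the lists above).
    """
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f'this feature needs an optional extra: pip install "pyxrag[{extra}]"'
        ) from exc
```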
Dev-only tooling (tools/)
Eval scoring (RAGAS), dataset generation, and HTML viewers live under tools/ at the repo root. They are not packaged into the wheel — wheel consumers don't need them. Running them requires a source clone plus the dev dependency group:
uv sync --group dev --extra cli --extra chroma --extra rerank-cohere
Add the heavy local PDF parser stack only when you need it:
uv sync --group dev --extra cli --extra chroma --extra rerank-cohere --extra parser-unstructured-local-pdf
make eval, make unit-test, python -m xrag.cli eval ... and similar dev commands rely on tools/ being on sys.path (pytest is configured for this).
Key defaults
- Parser provider: unstructured.
- Preprocess config: configs/preprocess.yml.
- Default chunker: section_table in configs/chunker.yml.
- Alternative chunker: ragflow in configs/chunker-ragflow.yml.
- Simple one-shot RAG config: configs/rag-simple-baseline.yml.
- Persistent index/query config: configs/rag-baseline.yml.
- Enrichment scenario configs: configs/enrichment/.
Project layout
- xrag/cli.py: Typer CLI entrypoint (gated behind the [cli] extra).
- xrag/config/: YAML config and settings.
- xrag/core/: parser, chunker, enrichment, embedding, retrieval, reranker, generation, and tracing subsystems.
- xrag/pipelines/: library pipelines: parser, parser_preprocess, chunk_prepare, indexing, ingest, rag (single-query, batch, and eval-shaped retrieve+generate). Eval scoring/dataset/HTML-viewer pipelines live under tools/.
- tools/: dev-only eval scoring, dataset generation, HTML viewers, and eval datasets. Excluded from the wheel.
- configs/: runtime YAML configs.
- docs/: user docs, plans, surveys, and terminology.
- data/: local inputs and generated artifacts.
Development checks
make lint
make check
make unit-test
After changing behavior, run a live CLI command that exercises the changed path.
CI/CD
GitHub Actions now verifies the same core paths contributors should run locally:
- lint: Ruff lint and format checks on a locked dev environment
- unit-test: full-extras unit tests plus a real python -m xrag.cli --help smoke check
- package: wheel + sdist build plus twine check
- contract: installed-wheel compatibility against Python 3.11 and 3.12, with both the pinned LangChain 0.2.17 floor and the 0.3 line
Tag pushes matching v* reuse the verified package artifacts and publish them to the GitHub Release instead of rebuilding a second time during release.
License
MIT — see LICENSE.