xrag
RAG ingest + retrieval engine for documents with text and tables. Section/table-aware chunking, hybrid retrieval, optional Cohere reranking, optional LLM-based enrichment, and end-to-end Q&A generation. Works as both a CLI and an installable Python library.
xrag is designed for teams that want one Python library for:
- ingesting real documents into a retrieval system
- indexing pre-chunked offline data
- retrieving typed chunks for downstream LLM apps
- running a complete retrieve-and-answer flow with one client
Install
pip install pyxrag
# or with optional extras:
pip install "pyxrag[rerank-cohere,cli]"
The distribution name on PyPI is pyxrag; the import name is xrag:
import xrag
For local development from source:
git clone https://github.com/henryle97/xrag.git
cd xrag
uv sync --group dev
30-Second Quickstart
If you want the main library path, start with Xrag:
from xrag import Xrag
client = Xrag(
qdrant_path="data/dev/xrag_quickstart",
generation="openai:gpt-4.1-nano",
)
doc = await client.documents.ingest("annual-report.docx", collection="ir_docs")
result = await client.rag.ask(
query="What is this document about?",
collection="ir_docs",
)
print(doc.id)
print(result.answer)
See the runnable example: examples/01_quickstart.py
For the full Python guide, see docs/library.md.
Choose Your Path
xrag has three main entrypoints:
- Xrag: the higher-level async client for application code
- run_ingest/run_rag: the lower-level functional API for callers that want direct control over AppConfig and pipeline wiring
- CLI: local workflows, debugging, artifact inspection, and evaluation runs
Use Xrag when you want the application-facing library surface.
Use the functional API when you already work in terms of AppConfig.
Use the CLI when you are operating the pipeline directly from the shell.
Using Xrag
Create one client and reuse it across operations:
from xrag import Xrag
client = Xrag(
qdrant_url="http://localhost:6333",
embedding="openai:text-embedding-3-small",
generation="openai:gpt-4.1-nano",
)
Ingest A Document
Parse, chunk, embed, and index a source document:
doc = await client.documents.ingest(
"annual-report.docx",
collection="ir_docs",
)
print(doc.id)
Runnable example: examples/01_quickstart.py
Index From Offline Chunks
If you already have chunks prepared offline, skip parsing and index them directly:
doc = await client.documents.index_chunks(
"data/chunks.json",
collection="ir_docs",
name="f1-offline-chunks",
)
print(doc.id)
Runnable examples: examples/05_index_chunks.py and examples/05b_index_chunks_from_json.py
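Before calling index_chunks, it can help to sanity-check the offline chunks file. The authoritative schema is xrag's own chunk model (see docs/chunker.md); the sketch below assumes only a minimal shape, with id and text fields, purely for illustration:

```python
import json

# Assumed minimal chunk shape for illustration; xrag's real chunk schema
# (see docs/chunker.md) defines the authoritative field set.
REQUIRED_KEYS = {"id", "text"}

def validate_chunks(raw: str) -> list[dict]:
    """Parse a chunks JSON payload and check each record's minimal shape."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("chunks file must be a JSON array")
    for i, chunk in enumerate(data):
        if not isinstance(chunk, dict):
            raise ValueError(f"chunk {i} is not an object")
        missing = REQUIRED_KEYS - chunk.keys()
        if missing:
            raise ValueError(f"chunk {i} missing keys: {sorted(missing)}")
    return data
```

A check like this fails fast on malformed exports before any embedding calls are made.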
Retrieve Chunks
If you only want retrieval and already have your own LLM stack:
result = await client.retrievals.search(
query="What was revenue?",
collection="ir_docs",
top_k=5,
)
for chunk in result.chunks:
print(chunk.document_id, chunk.score, chunk.text[:120])
Runnable example: examples/03_retrieve_only.py
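If you feed retrieved chunks into your own LLM stack, a typical next step is packing them into a prompt context under a size budget. This is a generic sketch, not xrag API; the Chunk fields simply mirror the attributes printed in the loop above:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    # Stand-in with the same attributes the retrieval loop prints;
    # not xrag's actual chunk type.
    document_id: str
    score: float
    text: str

def build_context(chunks: list[Chunk], max_chars: int = 2000) -> str:
    """Pack retrieved chunks into a prompt context, best-scored first."""
    parts: list[str] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        snippet = chunk.text.strip()
        if used + len(snippet) > max_chars:
            break  # stop once the character budget is exhausted
        parts.append(f"[{chunk.document_id}] {snippet}")
        used += len(snippet)
    return "\n\n".join(parts)
```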
Ask A Question
Retrieve context and generate an answer:
result = await client.rag.ask(
query="What was revenue?",
collection="ir_docs",
)
print(result.answer)
Runnable example: examples/01_quickstart.py
Example Gallery
- examples/01_quickstart.py: Ingest one document and ask one question.
- examples/02_from_env.py: Build the client from environment variables.
- examples/03_retrieve_only.py: Use xrag for retrieval only.
- examples/04_multi_tenant.py: Scope operations by tenant with client.for_tenant(...).
- examples/05_index_chunks.py: Index pre-built chunks from Python objects.
- examples/05b_index_chunks_from_json.py: Re-index a chunks JSON file from an offline pipeline run.
- examples/06_export_and_validate.py: Export, validate, and round-trip chunks.
- examples/06b_export_chunks.py: Export chunks from an indexed collection.
- examples/07_error_handling.py: Handle typed xrag exceptions.
- examples/08_per_call_overrides.py: Override ingest and retrieval behavior per call.
- examples/09_parse_docx_api_vs_local.py: Compare hosted and local DOCX parsing on the same file.
Configuring Xrag
Xrag(...) is the main application-facing configuration surface. In
practice, most users set storage, embedding, and optionally generation,
then reuse the same client for ingest and query operations.
Full Example
This shows the full constructor surface:
from xrag import Xrag
from xrag.config.models import RetrievalConfig
client = Xrag(
qdrant_url="http://localhost:6333",
qdrant_path=None,
embedding="openai:text-embedding-3-small",
generation="openai:gpt-4.1-nano",
enrichment=("auto_keywords", "auto_questions"),
reranker=None,
parser="unstructured",
chunker="section_table",
retrieval_defaults=RetrievalConfig(
provider="hybrid",
options={
"top_k": 10,
"bm25_candidates": 50,
"vector_candidates": 50,
"rrf_k": 60,
"dedup_family": True,
},
),
upload_dir="/tmp/xrag_uploads",
artifacts_dir=None,
timeout_s=30.0,
ingest_timeout_s=600.0,
tracing=None,
)
Parameter Reference
Storage
- qdrant_url: str | None = None
  URL of a running Qdrant server. Use this for normal server-backed deployments.
- qdrant_path: str | Path | None = None
  Local filesystem path for Qdrant local mode. Use this for local development or single-machine experiments.
You must set exactly one of qdrant_url or qdrant_path.
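That exclusive-or constraint can be sketched as a small validator. This is illustrative only; the real check lives inside the Xrag constructor:

```python
def resolve_storage(qdrant_url=None, qdrant_path=None):
    """Enforce that exactly one Qdrant storage target is configured.

    Illustrative only; the actual validation happens inside Xrag(...).
    """
    # Both None or both set is invalid: the truth values must differ.
    if (qdrant_url is None) == (qdrant_path is None):
        raise ValueError("set exactly one of qdrant_url or qdrant_path")
    return ("url", qdrant_url) if qdrant_url is not None else ("path", qdrant_path)
```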
Core Model Configuration
- embedding: str = "openai:text-embedding-3-small"
  Embedding provider string. Required for indexing and retrieval. Example value: openai:text-embedding-3-small
- generation: str | None = None
  Generation provider string for client.rag.ask(...). Omit this for retrieve-only usage. Example value: openai:gpt-4.1-nano
- reranker: str | None = None
  Optional reranker configuration. The client only uses it on calls where reranking is enabled. Example value: cohere:rerank-v3.5
- tracing: str | None = None
  Optional tracing backend. Example value: langsmith:my-project
Ingest Configuration
- parser: str = "unstructured"
  Parser configuration used by client.documents.ingest(...). Default is the hosted Unstructured API parser. Use unstructured_local:fast for lightweight local parsing. Add xrag[parser-unstructured-local-pdf] only when you want local PDF/OCR parsing.
- chunker: str = "section_table"
  Chunking strategy used during ingest. Default is the section/table-aware chunker.
- enrichment: tuple[str, ...] | None = ("auto_keywords", "auto_questions")
  Ingest-time enrichment stages. Set None or () to disable enrichment. Available stage names currently include: auto_keywords, auto_questions, table_context, table_summary
- upload_dir: str | Path = "/tmp/xrag_uploads"
  Local spool directory for byte inputs and file-like inputs passed to documents.ingest(...).
- artifacts_dir: str | Path | None = None
  Base directory for ingest artifacts. When unset, xrag creates an ephemeral artifacts directory per ingest call.
Retrieval Configuration
- retrieval_defaults: RetrievalConfig | None = None
  Default retrieval settings used by client.retrievals.search(...) and client.rag.ask(...). If unset, xrag uses:
RetrievalConfig(
provider="hybrid",
options={
"top_k": 10,
"bm25_candidates": 50,
"vector_candidates": 50,
"rrf_k": 60,
"dedup_family": True,
},
)
By default, reranking is still off at call time unless you explicitly
pass rerank=True.
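The hybrid provider's options (bm25_candidates, vector_candidates, rrf_k) suggest reciprocal rank fusion over the two candidate lists. A generic RRF sketch, assuming the standard formula where each document scores the sum of 1/(rrf_k + rank) over the lists it appears in; this is not xrag's actual implementation:

```python
def rrf_fuse(
    bm25_ids: list[str],
    vector_ids: list[str],
    rrf_k: int = 60,
    top_k: int = 10,
) -> list[str]:
    """Fuse two ranked candidate lists with reciprocal rank fusion.

    Each document scores sum(1 / (rrf_k + rank)) over the lists it appears
    in, so agreement between BM25 and vector retrieval pushes it upward.
    """
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Sort by fused score (descending), breaking ties deterministically by id.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_k]
```

A document ranked moderately by both retrievers ("b" below) outscores one ranked highly by only a single retriever.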
Timeouts
- timeout_s: float = 30.0
  General request timeout for non-ingest operations.
- ingest_timeout_s: float = 600.0
  Longer timeout budget for ingest operations.
What Most Users Actually Change
For most applications, the parameters that matter most are:
- qdrant_url or qdrant_path
- embedding
- generation if you use rag.ask(...)
- reranker if you want reranked retrieval
- enrichment if you want to disable or change ingest-time enrichment
- artifacts_dir if you want stable ingest artifacts on disk
Environment-Based Setup
If you prefer environment variables:
from xrag import Xrag
client = Xrag.from_env()
Useful environment variables:
- XRAG_QDRANT_URL
- XRAG_QDRANT_PATH
- XRAG_DEFAULT_EMBEDDING
- XRAG_DEFAULT_GENERATION
- XRAG_DEFAULT_RERANKER
- XRAG_UPLOAD_DIR
- XRAG_ARTIFACTS_DIR
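A rough sketch of what an env-based builder reads. The variable names come from the list above; the defaults mirror the parameter reference, but the precedence and validation here are assumptions, not Xrag.from_env()'s exact logic:

```python
import os

def config_from_env(env=os.environ) -> dict:
    """Collect XRAG_* settings into a plain config mapping.

    Defaults are taken from the Xrag parameter reference; the real
    Xrag.from_env() may apply different precedence or validation.
    """
    return {
        "qdrant_url": env.get("XRAG_QDRANT_URL"),
        "qdrant_path": env.get("XRAG_QDRANT_PATH"),
        "embedding": env.get("XRAG_DEFAULT_EMBEDDING", "openai:text-embedding-3-small"),
        "generation": env.get("XRAG_DEFAULT_GENERATION"),
        "reranker": env.get("XRAG_DEFAULT_RERANKER"),
        "upload_dir": env.get("XRAG_UPLOAD_DIR", "/tmp/xrag_uploads"),
        "artifacts_dir": env.get("XRAG_ARTIFACTS_DIR"),
    }
```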
Docs Map
Detailed docs live under docs/.
Recommended path:
- Read docs/README.md for library-vs-CLI doc navigation.
- Read docs/library.md for the dedicated Python library guide.
- Use docs/command.md for exact CLI commands.
- Use docs/chunker.md for chunking strategies and output schema.
- Use docs/enrichment.md for indexing-time text enrichment.
- Use docs/rag-baseline.md for the simple baseline RAG flow.
- Use docs/sub-plans/xrag-public-api.md for the high-level Python client design.
- Use docs/plans.md and docs/sub-plans/ for roadmap and implementation details.
Quickstart — library (v0.1, functional API)
from pathlib import Path
from xrag import (
AppConfig, IngestRequest,
load_config_yaml, run_ingest, run_rag,
)
cfg: AppConfig = load_config_yaml("configs/rag-ir-mvp3-all-docs.yml")
# Ingest
ingest_result = run_ingest(IngestRequest(config=cfg, output_dir=Path("./data/dev/ir_document")))
# Query
result = run_rag("What was Cash as of December 31, 2024?", cfg)
print(result.answer)
The high-level Xrag async client (client.documents.ingest(...), client.retrievals.search(...), client.rag.ask(...)) ships in v0.2 — see docs/sub-plans/xrag-public-api.md.
Quickstart — CLI
End-to-end pipeline:
source document
-> convert
-> parser
-> parser preprocess
-> chunk prepare
-> indexing enrichment (optional)
-> rag baseline OR indexing + query
-> eval create-dataset / eval run
Common commands:
uv run python -m xrag.cli convert --help
uv run python -m xrag.cli parser --help
uv run python -m xrag.cli parser preprocess --help
uv run python -m xrag.cli chunk prepare --help
uv run python -m xrag.cli rag baseline --help
uv run python -m xrag.cli --help
Smoke flow:
# Parse
uv run python -m xrag.cli parser \
--config configs/baseline.yml \
--output-dir data/dev/ir_document \
--pretty
# Preprocess
uv run python -m xrag.cli parser preprocess \
--input data/dev/ir_document/parsed/unstructured_elements.json \
--output-dir data/dev/ir_document/normalized
# Chunk
uv run python -m xrag.cli chunk prepare \
--input data/dev/ir_document/normalized/elements.json \
--config configs/chunker.yml \
--output-dir data/dev/ir_document/chunked
# Baseline RAG
uv run python -m xrag.cli rag baseline \
--input data/dev/ir_document/chunked/chunks.json \
--question "What was Cash as of December 31, 2024?" \
--config configs/rag-simple-baseline.yml \
--pretty
# Preview enrichment without embedding/vector-store writes
uv run python scripts/enrichment/preview.py \
--input data/dev/ir_document/chunked/chunks.json \
--config configs/enrichment/table-context.yml
Setup
Install dependencies:
uv sync
Environment for the Unstructured parser:
UNSTRUCTURED_API_KEY=...
UNSTRUCTURED_API_URL=... # optional
For DOCX-to-PDF conversion, install LibreOffice so that soffice / libreoffice is on PATH.
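A small helper can verify that conversion prerequisite up front. This is a generic PATH probe, not part of xrag:

```python
import shutil

def find_soffice(which=shutil.which):
    """Return the first LibreOffice launcher found on PATH, else None.

    `which` is injectable so the probe can be tested without LibreOffice.
    """
    for name in ("soffice", "libreoffice"):
        path = which(name)
        if path:
            return path
    return None
```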
Phasing
| Version | Surface |
|---|---|
| v0.1 (current) | Functional API: run_ingest, run_rag, Pydantic models, configs, loaders. |
| v0.2 | High-level Xrag async client with resource-namespaced ops (documents.*, retrievals.*, rag.*). Tenant binding via for_tenant. Typed error hierarchy. |
| v0.3 | XragSync mirror. |
| v0.4 | Per-call config_overrides, batch ingest, persistent registry contract. |
| v0.5+ (backlog) | Streaming surfaces (documents.ingest_stream, rag.ask_stream). |
Full design: docs/sub-plans/xrag-public-api.md.
Optional dependency extras
- xrag[parser-unstructured-local-pdf]: local PDF/image/OCR parsing with Unstructured
- xrag[rerank-cohere]: Cohere v3.5 reranker
- xrag[cli]: Typer-based CLI (rag-cli entry point)
- xrag[chroma]: Chroma backend (alternative to Qdrant)
- xrag[all]: everything
The base xrag install includes the hosted Unstructured API path, parser
preprocess support, and lightweight local parsing such as DOCX/text/HTML.
Add xrag[parser-unstructured-local-pdf] only if you want local PDF/image/OCR
parsing on the same machine.
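Extras like these are usually guarded with a lazy import that points the user at the right pip command when the dependency is missing. A generic sketch of that pattern, not xrag's actual code:

```python
import importlib

def require_extra(module: str, extra: str):
    """Import an optional dependency, failing with install guidance.

    For example, require_extra("cohere", "rerank-cohere") could guard the
    reranker path (module and extra names per the lists above).
    """
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f'this feature needs an optional extra: pip install "pyxrag[{extra}]"'
        ) from exc
```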
Dev-only tooling (tools/)
Eval scoring (RAGAS), dataset generation, and HTML viewers live under tools/ at the repo root. They are not packaged into the wheel — wheel consumers don't need them. Running them requires a source clone plus the dev dependency group:
uv sync --group dev --extra cli --extra chroma --extra rerank-cohere
Add the heavy local PDF parser stack only when you need it:
uv sync --group dev --extra cli --extra chroma --extra rerank-cohere --extra parser-unstructured-local-pdf
make eval, make unit-test, python -m xrag.cli eval ... and similar dev commands rely on tools/ being on sys.path (pytest is configured for this).
Key defaults
- Parser provider: unstructured.
- Preprocess config: configs/preprocess.yml.
- Default chunker: section_table in configs/chunker.yml.
- Alternative chunker: ragflow in configs/chunker-ragflow.yml.
- Simple one-shot RAG config: configs/rag-simple-baseline.yml.
- Persistent index/query config: configs/rag-baseline.yml.
- Enrichment scenario configs: configs/enrichment/.
Project layout
- xrag/cli.py: Typer CLI entrypoint (gated behind the [cli] extra).
- xrag/config/: YAML config and settings.
- xrag/core/: parser, chunker, enrichment, embedding, retrieval, reranker, generation, and tracing subsystems.
- xrag/pipelines/: library pipelines: parser, parser_preprocess, chunk_prepare, indexing, ingest, rag (single-query, batch, and eval-shaped retrieve+generate). Eval scoring/dataset/HTML-viewer pipelines live under tools/.
- tools/: dev-only eval scoring, dataset generation, HTML viewers, and eval datasets. Excluded from the wheel.
- configs/: runtime YAML configs.
- docs/: user docs, plans, surveys, and terminology.
- data/: local inputs and generated artifacts.
Development checks
make lint
make check
make unit-test
After changing behavior, run a live CLI command that exercises the changed path.
CI/CD
GitHub Actions now verifies the same core paths contributors should run locally:
- lint: Ruff lint and format checks on a locked dev environment
- unit-test: full-extras unit tests plus a real python -m xrag.cli --help smoke check
- package: wheel + sdist build plus twine check
- contract: installed-wheel compatibility against Python 3.11 and 3.12, with both the pinned LangChain 0.2.17 floor and the 0.3 line
Tag pushes matching v* reuse the verified package artifacts and publish them to the GitHub Release instead of rebuilding a second time during release.
License
MIT — see LICENSE.