xrag

RAG ingest + retrieval engine — section_table chunker, hybrid retrieval, optional rerank/enrichment

RAG ingest + retrieval engine for documents with text and tables. Section/table-aware chunking, hybrid retrieval, optional Cohere reranking, optional LLM-based enrichment, and end-to-end Q&A generation. Works as both a CLI and an installable Python library.

xrag is designed for teams that want one Python library for:

  • ingesting real documents into a retrieval system
  • indexing pre-chunked offline data
  • retrieving typed chunks for downstream LLM apps
  • running a complete retrieve-and-answer flow with one client

Install

pip install pyxrag
# or with optional extras:
pip install "pyxrag[rerank-cohere,cli]"

The distribution name on PyPI is pyxrag; the import name is xrag:

import xrag

For local development from source:

git clone https://github.com/henryle97/xrag.git
cd xrag
uv sync --group dev

30-Second Quickstart

If you want the main library path, start with Xrag:

import asyncio

from xrag import Xrag

async def main() -> None:
    client = Xrag(
        qdrant_path="data/dev/xrag_quickstart",
        generation="openai:gpt-4.1-nano",
    )

    doc = await client.documents.ingest("annual-report.docx", collection="ir_docs")
    print(doc.id)

    result = await client.rag.ask(
        query="What is this document about?",
        collection="ir_docs",
    )
    print(result.answer)

asyncio.run(main())

See the runnable example examples/01_quickstart.py. For the full Python guide, see docs/library.md.

Choose Your Path

xrag has three main entrypoints:

  • Xrag: the higher-level async client for application code
  • run_ingest / run_rag: the lower-level functional API for callers that want direct control over AppConfig and pipeline wiring
  • CLI: local workflows, debugging, artifact inspection, and evaluation runs

Use Xrag when you want the application-facing library surface. Use the functional API when you already work in terms of AppConfig. Use the CLI when you are operating the pipeline directly from the shell.

Using Xrag

Create one client and reuse it across operations. The snippets below call await directly, so they are assumed to run inside an async function:

from xrag import Xrag

client = Xrag(
    qdrant_url="http://localhost:6333",
    embedding="openai:text-embedding-3-small",
    generation="openai:gpt-4.1-nano",
)

Ingest A Document

Parse, chunk, embed, and index a source document:

doc = await client.documents.ingest(
    "annual-report.docx",
    collection="ir_docs",
)

print(doc.id)

Runnable example: examples/01_quickstart.py

Index From Offline Chunks

If you already have chunks prepared offline, skip parsing and index them directly:

doc = await client.documents.index_chunks(
    "data/chunks.json",
    collection="ir_docs",
    name="f1-offline-chunks",
)

print(doc.id)


Retrieve Chunks

If you only want retrieval and already have your own LLM stack:

result = await client.retrievals.search(
    query="What was revenue?",
    collection="ir_docs",
    top_k=5,
)

for chunk in result.chunks:
    print(chunk.document_id, chunk.score, chunk.text[:120])

Runnable example: examples/03_retrieve_only.py

Ask A Question

Retrieve context and generate an answer:

result = await client.rag.ask(
    query="What was revenue?",
    collection="ir_docs",
)

print(result.answer)

Runnable example: examples/01_quickstart.py

Configuring Xrag

Xrag(...) is the main application-facing configuration surface. In practice, most users set storage, embedding, and optionally generation, then reuse the same client for ingest and query operations.

Full Example

This shows the full constructor surface:

from xrag import Xrag
from xrag.config.models import RetrievalConfig

client = Xrag(
    qdrant_url="http://localhost:6333",
    qdrant_path=None,
    embedding="openai:text-embedding-3-small",
    generation="openai:gpt-4.1-nano",
    enrichment=("auto_keywords", "auto_questions"),
    reranker=None,
    parser="unstructured",
    chunker="section_table",
    retrieval_defaults=RetrievalConfig(
        provider="hybrid",
        options={
            "top_k": 10,
            "bm25_candidates": 50,
            "vector_candidates": 50,
            "rrf_k": 60,
            "dedup_family": True,
        },
    ),
    upload_dir="/tmp/xrag_uploads",
    artifacts_dir=None,
    timeout_s=30.0,
    ingest_timeout_s=600.0,
    tracing=None,
)

Parameter Reference

Storage

  • qdrant_url: str | None = None URL of a running Qdrant server. Use this for normal server-backed deployments.
  • qdrant_path: str | Path | None = None Local filesystem path for Qdrant local mode. Use this for local development or single-machine experiments.

You must set exactly one of qdrant_url or qdrant_path.

Core Model Configuration

  • embedding: str = "openai:text-embedding-3-small" Embedding provider string. Required for indexing and retrieval. Example values: openai:text-embedding-3-small
  • generation: str | None = None Generation provider string for client.rag.ask(...). Omit this for retrieve-only usage. Example values: openai:gpt-4.1-nano
  • reranker: str | None = None Optional reranker configuration. The client only uses it on calls where reranking is enabled. Example values: cohere:rerank-v3.5
  • tracing: str | None = None Optional tracing backend. Example values: langsmith:my-project
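
The embedding, generation, reranker, and tracing values above all share one provider:model shape. As an illustration only (xrag's actual parsing code may differ), splitting on the first colon recovers the two parts:

```python
def parse_provider(spec: str) -> tuple[str, str]:
    """Split a 'provider:model' spec, e.g. 'openai:gpt-4.1-nano'.

    Hypothetical helper for illustration; not xrag's internal parser.
    """
    provider, sep, model = spec.partition(":")
    if not (sep and provider and model):
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return provider, model

print(parse_provider("openai:text-embedding-3-small"))
# ('openai', 'text-embedding-3-small')
```

Using partition on the first colon keeps model names that contain further punctuation (e.g. cohere:rerank-v3.5) intact.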

Ingest Configuration

  • parser: str = "unstructured" Parser configuration used by client.documents.ingest(...). Default is the hosted Unstructured API parser. Use unstructured_local:fast for lightweight local parsing. Add xrag[parser-unstructured-local-pdf] only when you want local PDF/OCR parsing.
  • chunker: str = "section_table" Chunking strategy used during ingest. Default is the section/table-aware chunker.
  • enrichment: tuple[str, ...] | None = ("auto_keywords", "auto_questions") Ingest-time enrichment stages. Set None or () to disable enrichment. Available stage names currently include: auto_keywords, auto_questions, table_context, table_summary
  • upload_dir: str | Path = "/tmp/xrag_uploads" Local spool directory for byte inputs and file-like inputs passed to documents.ingest(...).
  • artifacts_dir: str | Path | None = None Base directory for ingest artifacts. When unset, xrag creates an ephemeral artifacts directory per ingest call.
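
To make the section_table idea concrete: a minimal sketch of section/table-aware chunking (assumed behavior for illustration, not the library's implementation), which accumulates text within a section and always emits tables as standalone chunks:

```python
def chunk_elements(elements: list[dict], max_chars: int = 1000) -> list[dict]:
    """Group parsed elements into chunks: text accumulates per section,
    tables are never merged into surrounding text."""
    chunks: list[dict] = []
    buf: list[dict] = []

    def flush() -> None:
        if buf:
            chunks.append({"type": "text", "text": "\n".join(e["text"] for e in buf)})
            buf.clear()

    for el in elements:
        if el["type"] == "table":
            flush()  # a table always becomes its own chunk
            chunks.append({"type": "table", "text": el["text"]})
        elif el["type"] == "heading":
            flush()  # a new section starts a new chunk
            buf.append(el)
        else:
            buf.append(el)
            if sum(len(e["text"]) for e in buf) > max_chars:
                flush()
    flush()
    return chunks

elements = [
    {"type": "heading", "text": "Revenue"},
    {"type": "text", "text": "Revenue grew 12% year over year."},
    {"type": "table", "text": "Q1 | Q2 | Q3 | Q4"},
    {"type": "text", "text": "See the notes for details."},
]
for c in chunk_elements(elements):
    print(c["type"])
# text, table, text
```

Keeping tables whole is what lets downstream enrichment stages such as table_context and table_summary operate on a complete table rather than a fragment.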

Retrieval Configuration

  • retrieval_defaults: RetrievalConfig | None = None Default retrieval settings used by client.retrievals.search(...) and client.rag.ask(...). If unset, xrag uses:
RetrievalConfig(
    provider="hybrid",
    options={
        "top_k": 10,
        "bm25_candidates": 50,
        "vector_candidates": 50,
        "rrf_k": 60,
        "dedup_family": True,
    },
)

By default, reranking is still off at call time unless you explicitly pass rerank=True.
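
The option names above (bm25_candidates, vector_candidates, rrf_k) point at reciprocal rank fusion. As a hedged sketch of how standard RRF merges the two candidate lists (not necessarily xrag's exact scoring), each list contributes 1 / (rrf_k + rank) for every document it ranks:

```python
def rrf_fuse(bm25_ids: list[str], vector_ids: list[str],
             rrf_k: int = 60, top_k: int = 10) -> list[str]:
    """Reciprocal rank fusion: ids ranked highly by both lists win."""
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(rrf_fuse(["a", "b", "c"], ["b", "c", "d"]))
# ['b', 'c', 'a', 'd'] — 'b' appears near the top of both lists
```

The rrf_k constant (60 by default here, matching the config above) damps the gap between adjacent ranks, so agreement between the two retrievers matters more than any single rank.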

Timeouts

  • timeout_s: float = 30.0 General request timeout for non-ingest operations.
  • ingest_timeout_s: float = 600.0 Longer timeout budget for ingest operations.

What Most Users Actually Change

For most applications, the parameters that matter most are:

  • qdrant_url or qdrant_path
  • embedding
  • generation if you use rag.ask(...)
  • reranker if you want reranked retrieval
  • enrichment if you want to disable or change ingest-time enrichment
  • artifacts_dir if you want stable ingest artifacts on disk

Environment-Based Setup

If you prefer environment variables:

from xrag import Xrag

client = Xrag.from_env()

Useful environment variables:

  • XRAG_QDRANT_URL
  • XRAG_QDRANT_PATH
  • XRAG_DEFAULT_EMBEDDING
  • XRAG_DEFAULT_GENERATION
  • XRAG_DEFAULT_RERANKER
  • XRAG_UPLOAD_DIR
  • XRAG_ARTIFACTS_DIR
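
Xrag.from_env() reads these variables; the exact precedence and defaults are internal to the library. A hypothetical sketch of the mapping, with assumed fallbacks taken from the parameter reference above:

```python
import os

def config_from_env(env=os.environ) -> dict:
    """Hypothetical sketch of what Xrag.from_env() might assemble.
    Variable names come from the README; the fallbacks are assumptions."""
    return {
        "qdrant_url": env.get("XRAG_QDRANT_URL"),
        "qdrant_path": env.get("XRAG_QDRANT_PATH"),
        "embedding": env.get("XRAG_DEFAULT_EMBEDDING", "openai:text-embedding-3-small"),
        "generation": env.get("XRAG_DEFAULT_GENERATION"),
        "reranker": env.get("XRAG_DEFAULT_RERANKER"),
        "upload_dir": env.get("XRAG_UPLOAD_DIR", "/tmp/xrag_uploads"),
        "artifacts_dir": env.get("XRAG_ARTIFACTS_DIR"),
    }

cfg = config_from_env({"XRAG_QDRANT_URL": "http://localhost:6333"})
print(cfg["qdrant_url"])
# http://localhost:6333
```

Remember the constraint from the storage section: exactly one of XRAG_QDRANT_URL or XRAG_QDRANT_PATH should be set.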

Docs Map

Detailed docs live under docs/.

Recommended path:

  1. Read docs/README.md for library-vs-CLI doc navigation.
  2. Read docs/library.md for the dedicated Python library guide.
  3. Use docs/command.md for exact CLI commands.
  4. Use docs/chunker.md for chunking strategies and output schema.
  5. Use docs/enrichment.md for indexing-time text enrichment.
  6. Use docs/rag-baseline.md for the simple baseline RAG flow.
  7. Use docs/sub-plans/xrag-public-api.md for the high-level Python client design.
  8. Use docs/plans.md and docs/sub-plans/ for roadmap and implementation details.

Quickstart — library (v0.1, functional API)

from pathlib import Path

from xrag import (
    AppConfig, IngestRequest,
    load_config_yaml, run_ingest, run_rag,
)

cfg: AppConfig = load_config_yaml("configs/rag-ir-mvp3-all-docs.yml")

# Ingest
ingest_result = run_ingest(IngestRequest(config=cfg, output_dir=Path("./data/dev/ir_document")))

# Query
result = run_rag("What was Cash as of December 31, 2024?", cfg)
print(result.answer)

The high-level Xrag async client (client.documents.ingest(...), client.retrievals.search(...), client.rag.ask(...)) ships in v0.2 — see docs/sub-plans/xrag-public-api.md.

Quickstart — CLI

End-to-end pipeline:

source document
  -> convert
  -> parser
  -> parser preprocess
  -> chunk prepare
  -> indexing enrichment (optional)
  -> rag baseline OR indexing + query
  -> eval create-dataset / eval run

Common commands:

uv run python -m xrag.cli convert --help
uv run python -m xrag.cli parser --help
uv run python -m xrag.cli parser preprocess --help
uv run python -m xrag.cli chunk prepare --help
uv run python -m xrag.cli rag baseline --help
uv run python -m xrag.cli --help

Smoke flow:

# Parse
uv run python -m xrag.cli parser \
  --config configs/baseline.yml \
  --output-dir data/dev/ir_document \
  --pretty

# Preprocess
uv run python -m xrag.cli parser preprocess \
  --input data/dev/ir_document/parsed/unstructured_elements.json \
  --output-dir data/dev/ir_document/normalized

# Chunk
uv run python -m xrag.cli chunk prepare \
  --input data/dev/ir_document/normalized/elements.json \
  --config configs/chunker.yml \
  --output-dir data/dev/ir_document/chunked

# Baseline RAG
uv run python -m xrag.cli rag baseline \
  --input data/dev/ir_document/chunked/chunks.json \
  --question "What was Cash as of December 31, 2024?" \
  --config configs/rag-simple-baseline.yml \
  --pretty

# Preview enrichment without embedding/vector-store writes
uv run python scripts/enrichment/preview.py \
  --input data/dev/ir_document/chunked/chunks.json \
  --config configs/enrichment/table-context.yml

Setup

Install dependencies:

uv sync

Environment for the Unstructured parser:

UNSTRUCTURED_API_KEY=...
UNSTRUCTURED_API_URL=...   # optional

For DOCX-to-PDF conversion install LibreOffice so soffice / libreoffice is on PATH.

Phasing

  • v0.1 (current): functional API — run_ingest, run_rag, Pydantic models, configs, loaders.
  • v0.2: high-level Xrag async client with resource-namespaced ops (documents.*, retrievals.*, rag.*), tenant binding via for_tenant, and a typed error hierarchy.
  • v0.3: XragSync mirror.
  • v0.4: per-call config_overrides, batch ingest, persistent registry contract.
  • v0.5+ (backlog): streaming surfaces (documents.ingest_stream, rag.ask_stream).

Full design: docs/sub-plans/xrag-public-api.md.

Optional dependency extras

  • xrag[parser-unstructured-local-pdf] — local PDF/image/OCR parsing with Unstructured
  • xrag[rerank-cohere] — Cohere v3.5 reranker
  • xrag[cli] — Typer-based CLI (rag-cli entry point)
  • xrag[chroma] — Chroma backend (alternative to Qdrant)
  • xrag[all] — everything

The base xrag install includes the hosted Unstructured API path, parser preprocess support, and lightweight local parsing such as DOCX/text/HTML. Add xrag[parser-unstructured-local-pdf] only if you want local PDF/image/OCR parsing on the same machine.

Dev-only tooling (tools/)

Eval scoring (RAGAS), dataset generation, and HTML viewers live under tools/ at the repo root. They are not packaged into the wheel — wheel consumers don't need them. Running them requires a source clone plus the dev dependency group:

uv sync --group dev --extra cli --extra chroma --extra rerank-cohere

Add the heavy local PDF parser stack only when you need it:

uv sync --group dev --extra cli --extra chroma --extra rerank-cohere --extra parser-unstructured-local-pdf

make eval, make unit-test, python -m xrag.cli eval ... and similar dev commands rely on tools/ being on sys.path (pytest is configured for this).

Project layout

  • xrag/cli.py: Typer CLI entrypoint (gated behind [cli] extra).
  • xrag/config/: YAML config and settings.
  • xrag/core/: parser, chunker, enrichment, embedding, retrieval, reranker, generation, and tracing subsystems.
  • xrag/pipelines/: library pipelines — parser, parser_preprocess, chunk_prepare, indexing, ingest, rag (single-query, batch, and eval-shaped retrieve+generate). Eval scoring/dataset/HTML-viewer pipelines live under tools/.
  • tools/: dev-only — eval scoring, dataset generation, HTML viewers, eval datasets. Excluded from the wheel.
  • configs/: runtime YAML configs.
  • docs/: user docs, plans, surveys, and terminology.
  • data/: local inputs and generated artifacts.

Development checks

make lint
make check
make unit-test

After changing behavior, run a live CLI command that exercises the changed path.

CI/CD

GitHub Actions now verifies the same core paths contributors should run locally:

  • lint: Ruff lint and format checks on a locked dev environment
  • unit-test: full-extras unit tests plus a real python -m xrag.cli --help smoke check
  • package: wheel + sdist build plus twine check
  • contract: installed-wheel compatibility against Python 3.11 and 3.12, with both the pinned LangChain 0.2.17 floor and the 0.3 line

Tag pushes matching v* reuse the verified package artifacts and publish them to the GitHub Release instead of rebuilding during the release.

License

MIT — see LICENSE.
