
RAG ingest + retrieval engine — section_table chunker, hybrid retrieval, optional rerank/enrichment


xrag

RAG ingest + retrieval engine for documents with text and tables. Section/table-aware chunking, hybrid retrieval, optional Cohere reranking, optional LLM-based enrichment, and end-to-end Q&A generation. Works as both a CLI and an installable Python library.

xrag is designed for teams that want one Python library for:

  • ingesting real documents into a retrieval system
  • indexing pre-chunked offline data
  • retrieving typed chunks for downstream LLM apps
  • running a complete retrieve-and-answer flow with one client

Install

pip install pyxrag
# or with optional extras:
pip install "pyxrag[rerank-cohere,cli]"

The distribution name on PyPI is pyxrag; the import name is xrag:

import xrag

For local development from source:

git clone https://github.com/henryle97/xrag.git
cd xrag
uv sync --group dev

30-Second Quickstart

If you want the main library path, start with Xrag:

import asyncio

from xrag import Xrag


async def main() -> None:
    client = Xrag(
        qdrant_path="data/dev/xrag_quickstart",
        generation="openai:gpt-4.1-nano",
    )

    doc = await client.documents.ingest("annual-report.docx", collection="ir_docs")
    print(doc.id)

    result = await client.rag.ask(
        query="What is this document about?",
        collection="ir_docs",
    )
    print(result.answer)


asyncio.run(main())

See the runnable example: examples/01_quickstart.py. For the full Python guide, see docs/library.md.

Choose Your Path

xrag has three main entrypoints:

  • Xrag: the higher-level async client for application code
  • run_ingest / run_rag: the lower-level functional API for callers that want direct control over AppConfig and pipeline wiring
  • CLI: local workflows, debugging, artifact inspection, and evaluation runs

Use Xrag when you want the application-facing library surface. Use the functional API when you already work in terms of AppConfig. Use the CLI when you are operating the pipeline directly from the shell.

Using Xrag

Create one client and reuse it across operations:

from xrag import Xrag

client = Xrag(
    qdrant_url="http://localhost:6333",
    embedding="openai:text-embedding-3-small",
    generation="openai:gpt-4.1-nano",
)

Ingest A Document

Parse, chunk, embed, and index a source document:

doc = await client.documents.ingest(
    "annual-report.docx",
    collection="ir_docs",
)

print(doc.id)

Runnable example: examples/01_quickstart.py

Index From Offline Chunks

If you already have chunks prepared offline, skip parsing and index them directly:

doc = await client.documents.index_chunks(
    "data/chunks.json",
    collection="ir_docs",
    name="f1-offline-chunks",
)

print(doc.id)

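index_chunks consumes a JSON file of pre-chunked records. The authoritative schema lives in docs/chunker.md; the sketch below uses hypothetical field names (id, text, metadata) purely to illustrate preparing such a file:

```python
import json

# Hypothetical record shape for illustration only -- the real chunk
# schema is documented in docs/chunker.md.
chunks = [
    {
        "id": "chunk-0001",
        "text": "Cash was $4.2M as of December 31, 2024.",
        "metadata": {"section": "Balance Sheet"},
    },
    {
        "id": "chunk-0002",
        "text": "Revenue grew 12% year over year.",
        "metadata": {"section": "Income Statement"},
    },
]

payload = json.dumps(chunks, indent=2)
# Write payload to data/chunks.json, then pass that path to index_chunks.
print(payload[:20])
```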

Retrieve Chunks

If you only want retrieval and already have your own LLM stack:

result = await client.retrievals.search(
    query="What was revenue?",
    collection="ir_docs",
    top_k=5,
)

for chunk in result.chunks:
    print(chunk.document_id, chunk.score, chunk.text[:120])

Runnable example: examples/03_retrieve_only.py
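If you pair retrieval with your own LLM stack, a common next step is packing the returned chunks into one context string. A minimal sketch using plain dicts in place of xrag's chunk objects (the document_id, score, and text fields mirror the attributes printed above; the packing logic is illustrative, not part of xrag):

```python
def build_context(chunks: list[dict], max_chars: int = 4000) -> str:
    """Concatenate chunk texts, highest score first, up to a character budget."""
    parts: list[str] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        text = chunk["text"].strip()
        if used + len(text) > max_chars:
            break
        parts.append(f"[{chunk['document_id']}] {text}")
        used += len(text)
    return "\n\n".join(parts)


chunks = [
    {"document_id": "doc-1", "score": 0.82, "text": "Revenue was $10M."},
    {"document_id": "doc-1", "score": 0.61, "text": "Costs fell 5%."},
]
print(build_context(chunks))
```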

Ask A Question

Retrieve context and generate an answer:

result = await client.rag.ask(
    query="What was revenue?",
    collection="ir_docs",
)

print(result.answer)

Runnable example: examples/01_quickstart.py


Configuring Xrag

Xrag(...) is the main application-facing configuration surface. In practice, most users set storage, embedding, and optionally generation, then reuse the same client for ingest and query operations.

Full Example

This shows the full constructor surface:

from xrag import Xrag
from xrag.config.models import RetrievalConfig

client = Xrag(
    qdrant_url="http://localhost:6333",
    qdrant_path=None,
    embedding="openai:text-embedding-3-small",
    generation="openai:gpt-4.1-nano",
    enrichment=("auto_keywords", "auto_questions"),
    reranker=None,
    parser="unstructured",
    chunker="section_table",
    retrieval_defaults=RetrievalConfig(
        provider="hybrid",
        options={
            "top_k": 10,
            "bm25_candidates": 50,
            "vector_candidates": 50,
            "rrf_k": 60,
            "dedup_family": True,
        },
    ),
    upload_dir="/tmp/xrag_uploads",
    artifacts_dir=None,
    timeout_s=30.0,
    ingest_timeout_s=600.0,
    tracing=None,
)

Parameter Reference

Storage

  • qdrant_url: str | None = None URL of a running Qdrant server. Use this for normal server-backed deployments.
  • qdrant_path: str | Path | None = None Local filesystem path for Qdrant local mode. Use this for local development or single-machine experiments.

You must set exactly one of qdrant_url or qdrant_path.

Core Model Configuration

  • embedding: str = "openai:text-embedding-3-small" Embedding provider string. Required for indexing and retrieval. Example values: openai:text-embedding-3-small
  • generation: str | None = None Generation provider string for client.rag.ask(...). Omit this for retrieve-only usage. Example values: openai:gpt-4.1-nano
  • reranker: str | None = None Optional reranker configuration. The client only uses it on calls where reranking is enabled. Example values: cohere:rerank-v3.5
  • tracing: str | None = None Optional tracing backend. Example values: langsmith:my-project
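All four settings share the same provider:name string shape. A sketch of splitting such a spec on the first colon (illustrative only; not the library's actual parser):

```python
def split_provider(spec: str) -> tuple[str, str]:
    """Split a "provider:name" spec on the first colon."""
    provider, _, name = spec.partition(":")
    if not provider or not name:
        raise ValueError(f"expected 'provider:name', got {spec!r}")
    return provider, name


print(split_provider("openai:text-embedding-3-small"))
# → ('openai', 'text-embedding-3-small')
```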

Ingest Configuration

  • parser: str = "unstructured" Parser configuration used by client.documents.ingest(...). Default is the hosted Unstructured API parser. Use unstructured_local:fast for lightweight local parsing. Add xrag[parser-unstructured-local-pdf] only when you want local PDF/OCR parsing.
  • chunker: str = "section_table" Chunking strategy used during ingest. Default is the section/table-aware chunker.
  • enrichment: tuple[str, ...] | None = ("auto_keywords", "auto_questions") Ingest-time enrichment stages. Set None or () to disable enrichment. Available stage names currently include: auto_keywords, auto_questions, table_context, table_summary
  • upload_dir: str | Path = "/tmp/xrag_uploads" Local spool directory for byte inputs and file-like inputs passed to documents.ingest(...).
  • artifacts_dir: str | Path | None = None Base directory for ingest artifacts. When unset, xrag creates an ephemeral artifacts directory per ingest call.

Retrieval Configuration

  • retrieval_defaults: RetrievalConfig | None = None Default retrieval settings used by client.retrievals.search(...) and client.rag.ask(...). If unset, xrag uses:
RetrievalConfig(
    provider="hybrid",
    options={
        "top_k": 10,
        "bm25_candidates": 50,
        "vector_candidates": 50,
        "rrf_k": 60,
        "dedup_family": True,
    },
)

By default, reranking is still off at call time unless you explicitly pass rerank=True.
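The rrf_k option refers to reciprocal rank fusion, which merges the BM25 and vector candidate lists by summed reciprocal ranks. A self-contained sketch of the idea (illustrative; not xrag's internal implementation):

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked candidate lists with reciprocal rank fusion.

    Each candidate scores sum(1 / (k + rank)) over the lists it appears
    in; higher is better. Larger k dampens the influence of top ranks.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_top = ["c3", "c1", "c7"]    # e.g. top of the bm25_candidates list
vector_top = ["c1", "c9", "c3"]  # e.g. top of the vector_candidates list
fused = rrf_fuse([bm25_top, vector_top], k=60)
print(fused)
# → ['c1', 'c3', 'c9', 'c7']
```

Candidates appearing high in both lists (here c1 and c3) rise to the top, which is why hybrid retrieval can beat either signal alone.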

Timeouts

  • timeout_s: float = 30.0 General request timeout for non-ingest operations.
  • ingest_timeout_s: float = 600.0 Longer timeout budget for ingest operations.

What Most Users Actually Change

For most applications, the parameters that matter most are:

  • qdrant_url or qdrant_path
  • embedding
  • generation if you use rag.ask(...)
  • reranker if you want reranked retrieval
  • enrichment if you want to disable or change ingest-time enrichment
  • artifacts_dir if you want stable ingest artifacts on disk

Environment-Based Setup

If you prefer environment variables:

from xrag import Xrag

client = Xrag.from_env()

Useful environment variables:

  • XRAG_QDRANT_URL
  • XRAG_QDRANT_PATH
  • XRAG_DEFAULT_EMBEDDING
  • XRAG_DEFAULT_GENERATION
  • XRAG_DEFAULT_RERANKER
  • XRAG_UPLOAD_DIR
  • XRAG_ARTIFACTS_DIR
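A sketch of what env-based resolution amounts to: reading XRAG_* variables and mapping them onto constructor parameters. The mapping below is an assumption for illustration; Xrag.from_env() is the supported entry point:

```python
import os

# Illustrative values; in real use these come from your shell or deployment env.
os.environ["XRAG_QDRANT_URL"] = "http://localhost:6333"
os.environ["XRAG_DEFAULT_EMBEDDING"] = "openai:text-embedding-3-small"

# Assumed variable-to-parameter mapping, based on the list above.
settings = {
    "qdrant_url": os.environ.get("XRAG_QDRANT_URL"),
    "qdrant_path": os.environ.get("XRAG_QDRANT_PATH"),
    "embedding": os.environ.get("XRAG_DEFAULT_EMBEDDING"),
    "generation": os.environ.get("XRAG_DEFAULT_GENERATION"),
}
print(settings["embedding"])
```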

Docs Map

Detailed docs live under docs/.

Recommended path:

  1. Read docs/README.md for library-vs-CLI doc navigation.
  2. Read docs/library.md for the dedicated Python library guide.
  3. Use docs/command.md for exact CLI commands.
  4. Use docs/chunker.md for chunking strategies and output schema.
  5. Use docs/enrichment.md for indexing-time text enrichment.
  6. Use docs/rag-baseline.md for the simple baseline RAG flow.
  7. Use docs/sub-plans/xrag-public-api.md for the high-level Python client design.
  8. Use docs/plans.md and docs/sub-plans/ for roadmap and implementation details.

Quickstart — library (v0.1, functional API)

from pathlib import Path

from xrag import (
    AppConfig, IngestRequest,
    load_config_yaml, run_ingest, run_rag,
)

cfg: AppConfig = load_config_yaml("configs/rag-ir-mvp3-all-docs.yml")

# Ingest
ingest_result = run_ingest(IngestRequest(config=cfg, output_dir=Path("./data/dev/ir_document")))

# Query
result = run_rag("What was Cash as of December 31, 2024?", cfg)
print(result.answer)

The high-level Xrag async client (client.documents.ingest(...), client.retrievals.search(...), client.rag.ask(...)) ships in v0.2 — see docs/sub-plans/xrag-public-api.md.

Quickstart — CLI

End-to-end pipeline:

source document
  -> convert
  -> parser
  -> parser preprocess
  -> chunk prepare
  -> indexing enrichment (optional)
  -> rag baseline OR indexing + query
  -> eval create-dataset / eval run

Common commands:

uv run python -m xrag.cli convert --help
uv run python -m xrag.cli parser --help
uv run python -m xrag.cli parser preprocess --help
uv run python -m xrag.cli chunk prepare --help
uv run python -m xrag.cli rag baseline --help
uv run python -m xrag.cli --help

Smoke flow:

# Parse
uv run python -m xrag.cli parser \
  --config configs/baseline.yml \
  --output-dir data/dev/ir_document \
  --pretty

# Preprocess
uv run python -m xrag.cli parser preprocess \
  --input data/dev/ir_document/parsed/unstructured_elements.json \
  --output-dir data/dev/ir_document/normalized

# Chunk
uv run python -m xrag.cli chunk prepare \
  --input data/dev/ir_document/normalized/elements.json \
  --config configs/chunker.yml \
  --output-dir data/dev/ir_document/chunked

# Baseline RAG
uv run python -m xrag.cli rag baseline \
  --input data/dev/ir_document/chunked/chunks.json \
  --question "What was Cash as of December 31, 2024?" \
  --config configs/rag-simple-baseline.yml \
  --pretty

# Preview enrichment without embedding/vector-store writes
uv run python scripts/enrichment/preview.py \
  --input data/dev/ir_document/chunked/chunks.json \
  --config configs/enrichment/table-context.yml

Setup

Install dependencies:

uv sync

Environment for the Unstructured parser:

UNSTRUCTURED_API_KEY=...
UNSTRUCTURED_API_URL=...   # optional

For DOCX-to-PDF conversion, install LibreOffice so that soffice / libreoffice is on PATH.

Phasing

  • v0.1 (current): Functional API: run_ingest, run_rag, Pydantic models, configs, loaders.
  • v0.2: High-level Xrag async client with resource-namespaced ops (documents.*, retrievals.*, rag.*). Tenant binding via for_tenant. Typed error hierarchy.
  • v0.3: XragSync mirror.
  • v0.4: Per-call config_overrides, batch ingest, persistent registry contract.
  • v0.5+ (backlog): Streaming surfaces (documents.ingest_stream, rag.ask_stream).

Full design: docs/sub-plans/xrag-public-api.md.

Optional dependency extras

  • xrag[parser-unstructured-local-pdf] — local PDF/image/OCR parsing with Unstructured
  • xrag[rerank-cohere] — Cohere v3.5 reranker
  • xrag[cli] — Typer-based CLI (rag-cli entry point)
  • xrag[chroma] — Chroma backend (alternative to Qdrant)
  • xrag[all] — everything

The base xrag install includes the hosted Unstructured API path, parser preprocess support, and lightweight local parsing such as DOCX/text/HTML. Add xrag[parser-unstructured-local-pdf] only if you want local PDF/image/OCR parsing on the same machine.

Dev-only tooling (tools/)

Eval scoring (RAGAS), dataset generation, and HTML viewers live under tools/ at the repo root. They are not packaged into the wheel — wheel consumers don't need them. Running them requires a source clone plus the dev dependency group:

uv sync --group dev --extra cli --extra chroma --extra rerank-cohere

Add the heavy local PDF parser stack only when you need it:

uv sync --group dev --extra cli --extra chroma --extra rerank-cohere --extra parser-unstructured-local-pdf

make eval, make unit-test, python -m xrag.cli eval ... and similar dev commands rely on tools/ being on sys.path (pytest is configured for this).


Project layout

  • xrag/cli.py: Typer CLI entrypoint (gated behind [cli] extra).
  • xrag/config/: YAML config and settings.
  • xrag/core/: parser, chunker, enrichment, embedding, retrieval, reranker, generation, and tracing subsystems.
  • xrag/pipelines/: library pipelines — parser, parser_preprocess, chunk_prepare, indexing, ingest, rag (single-query, batch, and eval-shaped retrieve+generate). Eval scoring/dataset/HTML-viewer pipelines live under tools/.
  • tools/: dev-only — eval scoring, dataset generation, HTML viewers, eval datasets. Excluded from the wheel.
  • configs/: runtime YAML configs.
  • docs/: user docs, plans, surveys, and terminology.
  • data/: local inputs and generated artifacts.

Development checks

make lint
make check
make unit-test

After changing behavior, run a live CLI command that exercises the changed path.

CI/CD

GitHub Actions now verifies the same core paths contributors should run locally:

  • lint: Ruff lint and format checks on a locked dev environment
  • unit-test: full-extras unit tests plus a real python -m xrag.cli --help smoke check
  • package: wheel + sdist build plus twine check
  • contract: installed-wheel compatibility against Python 3.11 and 3.12, with both the pinned LangChain 0.2.17 floor and the 0.3 line

Tag pushes matching v* reuse the verified package artifacts and publish them to the GitHub Release instead of rebuilding.

License

MIT — see LICENSE.
