Compare and evaluate RAG systems side-by-side with LLM evaluation. Use as a library or CLI tool.

RAGDiff v2.0

A domain-based framework for comparing Retrieval-Augmented Generation (RAG) systems with LLM evaluation support.

What's New in v2.0

RAGDiff v2.0 introduces a domain-based architecture that organizes RAG system comparison around problem domains:

Domains: Separate workspaces for different problem areas (e.g., tafsir, legal, medical)
Systems: RAG system configurations that can be version-controlled
Query Sets: Reusable collections of test queries
Runs: Reproducible executions with config snapshots
Comparisons: LLM-based evaluations with detailed analysis

This replaces the v1.x adapter-based approach with a more structured, reproducible workflow suited to systematic RAG system development and A/B testing.
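
To make the relationship between these concepts concrete, here is a minimal sketch using plain dataclasses. The class and field names are illustrative assumptions, not RAGDiff's actual models; it only shows how a run could carry snapshots of the system config and query set it was executed with.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SystemConfig:
    """A RAG system configuration (hypothetical shape)."""
    name: str
    tool: str
    config: dict


@dataclass(frozen=True)
class QuerySet:
    """A reusable, named collection of test queries."""
    name: str
    queries: tuple


@dataclass
class Run:
    """One execution of a query set against a system, with snapshots."""
    run_id: str
    domain: str
    system: SystemConfig   # snapshot of the config used for this run
    query_set: QuerySet    # snapshot of the queries used for this run
    results: dict = field(default_factory=dict)


@dataclass
class Comparison:
    """An LLM-based evaluation over two or more runs in one domain."""
    domain: str
    run_ids: list
    analysis: str = ""


system = SystemConfig("vectara-default", "vectara", {"timeout": 30})
queries = QuerySet("test-queries", ("What is machine learning?",))
run = Run("abc123", "my-domain", system, queries)
comparison = Comparison("my-domain", [run.run_id, "def456"])
```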

Features

Domain-Driven Organization: Separate workspaces for different problem domains
Reproducible Runs: Config and query set snapshots for full reproducibility
Multi-System Support: Compare Vectara, MongoDB, Agentset, and more
LLM Evaluation: Subjective quality assessment via LiteLLM (GPT, Claude, Gemini)
Rich CLI: Beautiful terminal output with progress bars and summary tables
Multiple Output Formats: Table, JSON, and Markdown reports
Comprehensive Testing: 78 v2.0 tests
Parallel Execution: Fast query execution with configurable concurrency

Installation

Prerequisites

Python 3.9+
uv - Fast Python package installer and resolver

To install uv:

# On macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with Homebrew
brew install uv

# Or with pip
pip install uv

Setup

# Clone the repository
git clone https://github.com/ansari-project/ragdiff.git
cd ragdiff

# Install dependencies with uv
uv sync --all-extras  # Install all dependencies including dev tools

# Install the package in editable mode
uv pip install -e .

# Copy environment template
cp .env.example .env
# Edit .env and add your API keys

Quick Start

1. Create a Domain

# Create domain directory structure
mkdir -p domains/my-domain/{systems,query-sets,runs,comparisons}

# Create domain config
cat > domains/my-domain/domain.yaml <<EOF
name: my-domain
description: My RAG comparison domain
evaluator:
  model: gpt-4
  temperature: 0.0
  prompt_template: |
    Compare these RAG results for relevance and accuracy.
    Query: {query}

    Results:
    {results}

    Provide winner and analysis.
EOF

2. Configure Systems

# Create Vectara system config
cat > domains/my-domain/systems/vectara-default.yaml <<EOF
name: vectara-default
tool: vectara
config:
  api_key: \${VECTARA_API_KEY}
  corpus_id: \${VECTARA_CORPUS_ID}
  timeout: 30
EOF

# Create MongoDB system config
cat > domains/my-domain/systems/mongodb-local.yaml <<EOF
name: mongodb-local
tool: mongodb
config:
  connection_uri: \${MONGODB_URI}
  database: my_db
  collection: documents
  index_name: vector_index
  embedding_model: all-MiniLM-L6-v2
  timeout: 60
EOF

3. Create Query Sets

# Create test queries
cat > domains/my-domain/query-sets/test-queries.txt <<EOF
What is machine learning?
Explain neural networks
How does backpropagation work?
EOF

4. Run Comparisons

# Execute query sets against different systems
uv run ragdiff run my-domain vectara-default test-queries
uv run ragdiff run my-domain mongodb-local test-queries

# Compare the runs (use run IDs from output or check domains/my-domain/runs/)
uv run ragdiff compare my-domain <run-id-1> <run-id-2>

# Export comparison to different formats
uv run ragdiff compare my-domain <run-id-1> <run-id-2> --format markdown --output report.md
uv run ragdiff compare my-domain <run-id-1> <run-id-2> --format json --output comparison.json

CLI Commands

RAGDiff v2.0 provides two main CLI commands:

`run` - Execute Query Sets

Execute a query set against a system:

# Basic usage
uv run ragdiff run <domain> <system> <query-set>

# Examples
uv run ragdiff run tafsir vectara-default test-queries
uv run ragdiff run tafsir mongodb-local test-queries --concurrency 5

# With options
uv run ragdiff run tafsir vectara-default test-queries \
  --domains-dir ./domains \
  --concurrency 10 \
  --timeout 30 \
  --quiet

What it does:

Loads system config from domains/<domain>/systems/<system>.yaml
Loads queries from domains/<domain>/query-sets/<query-set>.txt
Executes all queries with progress bar
Saves results to domains/<domain>/runs/<run-id>.json
Displays summary table

Options:

--concurrency N: Max concurrent queries (default: 10)
--timeout N: Timeout per query in seconds (default: 30.0)
--domains-dir PATH: Custom domains directory (default: ./domains)
--quiet: Suppress progress output
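
As a rough illustration of what --concurrency and --timeout control, the sketch below runs queries through a bounded thread pool. It is an assumption about the general technique, not RAGDiff's execution engine; run_queries and the result dictionary keys are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_queries(search, queries, concurrency=10, timeout=30.0):
    """Run search(query) for each query with at most `concurrency` workers.

    Note: the timeout bounds how long we wait for each result, not how
    long the worker thread runs; a timed-out call still finishes in the
    background before the pool shuts down.
    """
    results = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Submit everything up front; the pool limits parallelism.
        futures = [(query, pool.submit(search, query)) for query in queries]
        for query, future in futures:
            try:
                chunks = future.result(timeout=timeout)
                results.append({"query": query, "status": "ok", "chunks": chunks})
            except FutureTimeout:
                results.append({"query": query, "status": "timeout", "chunks": []})
            except Exception as exc:  # record the failure instead of hiding it
                results.append(
                    {"query": query, "status": "error", "error": str(exc), "chunks": []}
                )
    return results
```

Results come back in the same order as the input queries, which keeps the summary table stable regardless of which query finishes first.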

`compare` - Evaluate Runs

Compare multiple runs using LLM evaluation:

# Basic usage
uv run ragdiff compare <domain> <run-id-1> <run-id-2> [<run-id-3> ...]

# Examples
uv run ragdiff compare tafsir abc123 def456
uv run ragdiff compare tafsir abc123 def456 --format json --output comparison.json

# With options
uv run ragdiff compare tafsir abc123 def456 \
  --model gpt-4 \
  --temperature 0.0 \
  --format markdown \
  --output report.md

What it does:

Loads runs from domains/<domain>/runs/
Uses LLM (via LiteLLM) for evaluation
Saves comparison to domains/<domain>/comparisons/<comparison-id>.json
Outputs in specified format

Output formats:

table: Rich console table (default)
json: JSON output
markdown: Markdown report

Options:

--model MODEL: Override LLM model
--temperature N: Override temperature
--format FORMAT: Output format (table, json, markdown)
--output PATH: Save to file
--domains-dir PATH: Custom domains directory
--quiet: Suppress progress output
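
If you want to drive compare from a script, one option is to assemble the command line in Python and pass it to subprocess. The helper below is a hypothetical convenience wrapper, not part of RAGDiff; it only uses the flags documented above.

```python
import shlex

VALID_FORMATS = ("table", "json", "markdown")


def build_compare_command(domain, run_ids, fmt=None, output=None,
                          model=None, temperature=None):
    """Build the argument list for `uv run ragdiff compare`."""
    if len(run_ids) < 2:
        raise ValueError("compare needs at least two run IDs")
    cmd = ["uv", "run", "ragdiff", "compare", domain, *run_ids]
    if model is not None:
        cmd += ["--model", model]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if fmt is not None:
        if fmt not in VALID_FORMATS:
            raise ValueError(f"format must be one of {VALID_FORMATS}")
        cmd += ["--format", fmt]
    if output is not None:
        cmd += ["--output", str(output)]
    return cmd


cmd = build_compare_command(
    "tafsir", ["abc123", "def456"], fmt="json", output="comparison.json"
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) so that a failed comparison raises instead of passing silently.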

Domain Directory Structure

domains/
├── tafsir/                    # Domain: Islamic tafsir
│   ├── domain.yaml            # Domain config (evaluator settings)
│   ├── systems/               # System configurations
│   │   ├── vectara-default.yaml
│   │   ├── mongodb-local.yaml
│   │   └── agentset-prod.yaml
│   ├── query-sets/            # Query collections
│   │   ├── test-queries.txt
│   │   └── production-queries.txt
│   ├── runs/                  # Run results (auto-created)
│   │   ├── <run-id-1>.json
│   │   └── <run-id-2>.json
│   └── comparisons/           # Comparison results (auto-created)
│       └── <comparison-id>.json
└── legal/                     # Domain: Legal documents
    ├── domain.yaml
    ├── systems/
    └── query-sets/
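
The same skeleton can be created from Python instead of the shell. This is a small sketch using only pathlib; create_domain_skeleton is a hypothetical helper, and it does not write domain.yaml for you.

```python
from pathlib import Path

# Subdirectories shown in the tree above
SUBDIRS = ("systems", "query-sets", "runs", "comparisons")


def create_domain_skeleton(domains_dir, name):
    """Create domains_dir/name with the four standard subdirectories."""
    root = Path(domains_dir) / name
    for sub in SUBDIRS:
        # parents=True creates the domain directory itself on first use
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Calling create_domain_skeleton("domains", "my-domain") is safe to repeat, because existing directories are left untouched.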

Configuration

Domain Configuration

domains/<domain>/domain.yaml:

name: tafsir
description: Islamic tafsir RAG systems
evaluator:
  model: gpt-4                    # LLM model for evaluation
  temperature: 0.0                # Temperature for evaluation
  prompt_template: |              # Evaluation prompt template
    Compare these RAG results for the query: {query}

    Results:
    {results}

    Determine which system provided better results and explain why.
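
The {query} and {results} placeholders suggest Python-style string formatting. The sketch below shows how such a template could be filled in; the way RAGDiff actually formats the results block is not documented here, so render_prompt and its layout are assumptions.

```python
PROMPT_TEMPLATE = (
    "Compare these RAG results for the query: {query}\n"
    "\n"
    "Results:\n"
    "{results}\n"
    "\n"
    "Determine which system provided better results and explain why.\n"
)


def render_prompt(template, query, results_by_system):
    """Fill the template with the query and one block per system."""
    blocks = []
    for system, chunks in results_by_system.items():
        lines = [f"[{system}]"] + [f"- {chunk}" for chunk in chunks]
        blocks.append("\n".join(lines))
    return template.format(query=query, results="\n\n".join(blocks))


prompt = render_prompt(
    PROMPT_TEMPLATE,
    "Explain the concept of zakat",
    {
        "vectara-default": ["chunk A", "chunk B"],
        "mongodb-local": ["chunk C"],
    },
)
```

If a template needs literal braces, str.format requires them to be doubled ({{ and }}).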

System Configuration

domains/<domain>/systems/<system>.yaml:

Vectara:

name: vectara-default
tool: vectara
config:
  api_key: ${VECTARA_API_KEY}
  corpus_id: ${VECTARA_CORPUS_ID}
  timeout: 30

MongoDB:

name: mongodb-local
tool: mongodb
config:
  connection_uri: ${MONGODB_URI}
  database: my_db
  collection: documents
  index_name: vector_index
  embedding_model: all-MiniLM-L6-v2  # sentence-transformers model
  timeout: 60

Agentset:

name: agentset-prod
tool: agentset
config:
  api_token: ${AGENTSET_API_TOKEN}
  namespace_id: ${AGENTSET_NAMESPACE_ID}
  rerank: true
  timeout: 60

Query Sets

domains/<domain>/query-sets/<name>.txt:

Simple text files with one query per line:

What is Islamic inheritance law?
Explain the concept of zakat
What are the five pillars of Islam?
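
Reading such a file takes only a few lines of Python. In this sketch, load_queries is a hypothetical helper, and skipping blank lines is an assumption rather than documented RAGDiff behavior.

```python
from pathlib import Path


def load_queries(path):
    """Return the non-empty lines of a query-set file, stripped."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```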

Environment Variables

Create a .env file with:

# Vectara
VECTARA_API_KEY=your_key
VECTARA_CORPUS_ID=your_corpus_id

# MongoDB Atlas
MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/

# Agentset
AGENTSET_API_TOKEN=your_token
AGENTSET_NAMESPACE_ID=your_namespace_id

# LLM Providers (for evaluation via LiteLLM)
OPENAI_API_KEY=your_key          # For GPT models
ANTHROPIC_API_KEY=your_key       # For Claude models
GEMINI_API_KEY=your_key          # For Gemini models
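
System configs refer to these variables with ${VAR} placeholders. One way such placeholders can be resolved is with string.Template from the standard library, as sketched below. This is an assumption about the technique, not a description of RAGDiff's loader, and expand_env is a hypothetical name.

```python
import os
from string import Template


def expand_env(value, env=None):
    """Recursively replace ${VAR} placeholders in a loaded config.

    Raises KeyError if a referenced variable is missing, so a
    misconfigured environment fails early rather than silently.
    """
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    if isinstance(value, str):
        return Template(value).substitute(env)
    return value  # numbers, booleans, None pass through unchanged


config = {"api_key": "${VECTARA_API_KEY}", "timeout": 30}
resolved = expand_env(config, {"VECTARA_API_KEY": "secret"})
```

Note that Template treats a lone $ as invalid, so a value containing a literal dollar sign would need to be written as $$ under this approach.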

Supported RAG Systems

RAGDiff v2.0 supports the following RAG systems:

Vectara: Enterprise RAG platform with built-in neural search
MongoDB Atlas: Vector search with MongoDB Atlas and sentence-transformers
Agentset: RAG-as-a-Service platform

Adding New Systems

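Create a system implementation in src/ragdiff/systems/ (for example, mysystem.py):

from ..core.errors import ConfigError, RunError
from ..core.models_v2 import RetrievedChunk
from .abc import System
from .registry import register_tool


class MySystem(System):
    """Example system that wraps a search API of your choice."""

    def __init__(self, config: dict):
        super().__init__(config)
        # Fail fast on incomplete configuration
        if "api_key" not in config:
            raise ConfigError("Missing required field: api_key")
        self.api_key = config["api_key"]

    def search(self, query: str, top_k: int = 5) -> list[RetrievedChunk]:
        # Query the backend and convert each hit into a RetrievedChunk
        results = self._call_api(query, top_k)
        return [
            RetrievedChunk(
                content=r["text"],
                score=r["score"],
                metadata={"source": r["source"]},
            )
            for r in results
        ]

    def _call_api(self, query: str, top_k: int) -> list[dict]:
        # Placeholder: call your backend here and return dicts with
        # "text", "score", and "source" keys
        raise RunError("MySystem._call_api is not implemented")


# Register the system under the name used in the tool: field of system configs
register_tool("mysystem", MySystem)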

Import in src/ragdiff/systems/__init__.py:

from . import mysystem  # noqa: F401

Add tests in tests/test_systems.py
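
For readers unfamiliar with the registry pattern used above, here is a standalone sketch of how a name-to-class registry typically works. It is not the code in registry.py; the internals, the duplicate check, and the create_system helper are assumptions for illustration only.

```python
_TOOLS = {}


def register_tool(name, cls):
    """Map a tool name (as used in a system config) to a class."""
    if name in _TOOLS:
        raise ValueError(f"Tool {name!r} is already registered")
    _TOOLS[name] = cls


def create_system(system_config):
    """Instantiate the class registered for system_config['tool']."""
    tool = system_config["tool"]
    try:
        cls = _TOOLS[tool]
    except KeyError:
        known = ", ".join(sorted(_TOOLS)) or "none"
        raise ValueError(f"Unknown tool {tool!r} (registered: {known})") from None
    return cls(system_config.get("config", {}))


class EchoSystem:
    """Toy system that returns the query itself as its only result."""

    def __init__(self, config):
        self.config = config

    def search(self, query, top_k=5):
        return [query][:top_k]


register_tool("echo", EchoSystem)
system = create_system({"name": "echo-test", "tool": "echo", "config": {"timeout": 5}})
```

Because registration happens at import time, the module has to be imported somewhere for the tool name to become available, which is what the import in __init__.py takes care of.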

Example Workflows

A/B Testing Different System Configurations

# Create two MongoDB variants with different embedding models
cat > domains/ml/systems/mongodb-minilm.yaml <<EOF
name: mongodb-minilm
tool: mongodb
config:
  connection_uri: \${MONGODB_URI}
  database: ml_docs
  collection: articles
  index_name: vector_index
  embedding_model: all-MiniLM-L6-v2
EOF

cat > domains/ml/systems/mongodb-mpnet.yaml <<EOF
name: mongodb-mpnet
tool: mongodb
config:
  connection_uri: \${MONGODB_URI}
  database: ml_docs
  collection: articles
  index_name: vector_index
  embedding_model: all-mpnet-base-v2
EOF

# Run both systems
uv run ragdiff run ml mongodb-minilm test-queries
uv run ragdiff run ml mongodb-mpnet test-queries

# Compare results
uv run ragdiff compare ml <run-id-1> <run-id-2> --format markdown --output ab-test-results.md

Systematic RAG System Development

# 1. Create baseline run
uv run ragdiff run legal vectara-baseline prod-queries

# 2. Make improvements to your RAG system
# (update embeddings, indexing, etc.)

# 3. Create new run with improved system
uv run ragdiff run legal vectara-improved prod-queries

# 4. Compare baseline vs improved
uv run ragdiff compare legal <baseline-id> <improved-id> --format markdown --output improvements.md

# 5. If improved system is better, make it the new baseline
cp domains/legal/systems/vectara-improved.yaml domains/legal/systems/vectara-baseline.yaml

Multi-System Comparison

# Run same query set across all systems
uv run ragdiff run tafsir vectara-default test-queries
uv run ragdiff run tafsir mongodb-local test-queries
uv run ragdiff run tafsir agentset-prod test-queries

# Compare all three
uv run ragdiff compare tafsir <vectara-id> <mongodb-id> <agentset-id> \
  --format markdown \
  --output three-way-comparison.md

Development

Running Tests

# Run all tests
uv run pytest tests/

# Run v2.0 tests only
uv run pytest tests/test_core_v2.py tests/test_systems.py tests/test_execution.py tests/test_cli_v2.py

# Run with coverage
uv run pytest tests/ --cov=src

# Run with verbose output
uv run pytest tests/ -v

Code Quality

The project uses pre-commit hooks:

ruff for linting and formatting
pytest for testing
Whitespace and YAML validation

# Install pre-commit hooks
pre-commit install

# Run manually
pre-commit run --all-files

Project Structure

ragdiff/
├── src/ragdiff/              # Main package
│   ├── cli.py                # Main CLI entry point
│   ├── cli_v2.py             # v2.0 CLI implementation
│   ├── core/                 # Core v2.0 models
│   │   ├── models_v2.py      # Domain-based models
│   │   ├── loaders.py        # File loading utilities
│   │   ├── storage.py        # Persistence utilities
│   │   └── errors.py         # Custom exceptions
│   ├── systems/              # System implementations
│   │   ├── abc.py            # System abstract base class
│   │   ├── registry.py       # System registration
│   │   ├── vectara.py        # Vectara system
│   │   ├── mongodb.py        # MongoDB system
│   │   └── agentset.py       # Agentset system
│   ├── execution/            # Run execution engine
│   └── comparison/           # Comparison engine
├── tests/                    # Test suite (78 v2.0 tests)
├── domains/                  # Domain workspaces
└── pyproject.toml            # Package configuration

Architecture

RAGDiff v2.0 follows the SPIDER protocol for systematic development:

Specification: Clear goals documented in codev/specs/
Planning: Phased implementation (6 phases)
Implementation: Clean domain-based architecture
Defense: Comprehensive test coverage (78 v2.0 tests)
Evaluation: Code reviews and validation
Reflection: Architecture documentation

Key Design Principles

Domain-Driven: Organize work around problem domains
Reproducibility: Snapshot configs and queries in runs
Fail Fast: Clear error messages, no silent failures
Type Safety: Pydantic models with validation
Testability: Every feature has tests
Separation of Concerns: Clean module boundaries
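
The reproducibility principle can be illustrated with a short sketch that stores a copy of the system config and the queries next to the results. The JSON field names, the use of a UUID as run ID, and save_run itself are assumptions, not RAGDiff's actual run format.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def save_run(runs_dir, system_config, queries, results):
    """Write a run record with config and query snapshots; return its path."""
    run_id = uuid.uuid4().hex
    record = {
        "run_id": run_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        # Snapshots: later edits to the YAML or query file do not
        # change what this run recorded.
        "system_snapshot": system_config,
        "query_set_snapshot": list(queries),
        "results": results,
    }
    path = Path(runs_dir) / f"{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

In a design like this it is safer to snapshot the config with its ${VAR} placeholders still unexpanded, so that API keys never end up in the run files.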

License

MIT License - see LICENSE file for details

Contributing

Contributions welcome! Please:

Follow existing code style (ruff formatting)
Add tests for new features
Update documentation
Ensure all tests pass

Acknowledgments

Built following the SPIDER protocol for systematic development.

Supported RAG platforms: Vectara, MongoDB Atlas, Agentset

Release history

2.2.0: Oct 30, 2025
2.1.0: Oct 30, 2025
2.0.0: Oct 29, 2025
1.3.0: Oct 28, 2025
1.2.1 (this version): Oct 28, 2025

Download files

Source Distribution

ragdiff-1.2.1.tar.gz (14.2 MB), uploaded Oct 28, 2025, Source

Built Distribution

ragdiff-1.2.1-py3-none-any.whl (133.6 kB), uploaded Oct 28, 2025, Python 3

File details

Details for the file ragdiff-1.2.1.tar.gz.

File metadata

Download URL: ragdiff-1.2.1.tar.gz
Upload date: Oct 28, 2025
Size: 14.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.5

File hashes

Hashes for ragdiff-1.2.1.tar.gz
Algorithm	Hash digest
SHA256	`ffe074911800587f853986d57944d30f6ea38e6b69dae8929320bc179c1d2d07`
MD5	`b6767945aed5629470181d477c35941f`
BLAKE2b-256	`b36db1bee5920c148605be7cc55be41188b31ddfbefb83d7e84c5d7db0437511`

File details

Details for the file ragdiff-1.2.1-py3-none-any.whl.

File metadata

Download URL: ragdiff-1.2.1-py3-none-any.whl
Upload date: Oct 28, 2025
Size: 133.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.6.5

File hashes

Hashes for ragdiff-1.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c6bc4104d5d85f144b73f853a1bb4afec2c5fffe1d26c1d5627b1451c25f725e`
MD5	`8913c3356644851f18f9f985494ef53c`
BLAKE2b-256	`b32823d4d2e7e4d1805a3946f9adc5eeb2977dd941d7a964949f5a86690c3bac`
