
RAGtester

Python 3.9+ License: Apache 2.0 PyPI version

A comprehensive Python library for testing and evaluating Retrieval-Augmented Generation (RAG) systems with LLM-generated questions and automated evaluation metrics.

🎯 Overview

RAGtester is a powerful evaluation framework designed to assess the quality, reliability, and safety of RAG systems through automated testing. It generates context-aware questions from your documents and evaluates responses across multiple dimensions using state-of-the-art LLM judges.

Why RAGtester?

  • 🔍 Comprehensive Evaluation: 5-dimensional assessment covering faithfulness, quality, toxicity, robustness, and security
  • 🤖 LLM-Powered: Uses advanced language models for intelligent question generation and evaluation
  • 🔄 Multi-Provider Support: Works with OpenAI, Anthropic, AWS Bedrock, and local models
  • 📊 Rich Reporting: Detailed CSV, JSON, and Markdown reports with actionable insights
  • ⚡ Easy Integration: Simple API that works with any RAG system

🚀 Key Features

📊 5-Dimensional Evaluation System

| Dimension | Description | What It Tests |
|---|---|---|
| Faithfulness | How well responses match the provided context | Factual accuracy, hallucination detection |
| Answer Quality | Fluency, clarity, and conciseness | Response coherence, completeness |
| Toxicity | Detection of harmful content | Safety, appropriateness, bias |
| Robustness & Reliability | System behavior under stress | Error handling, edge cases |
| Security & Safety | Resistance to malicious inputs | Prompt injection, data protection |

🎯 Smart Question Generation

  • Context-Aware: Questions tailored to specific document content
  • Random Page Selection: Each question uses different document pages
  • Metric-Specific: Questions designed for each evaluation dimension
  • Behavior Testing: General questions to test system behavior

🤖 Multiple LLM Support

Supported providers: OpenAI, Anthropic, AWS Bedrock, Grok (xAI), Google Gemini, Mistral AI, Cohere, Hugging Face, Fireworks AI, Together AI, Perplexity, and local models.

๐Ÿ“ Document Support

  • PDF Files: Automatic text extraction and page selection
  • Text Files: Direct processing with encoding detection
  • Markdown Files: Full support with formatting preservation
  • Extensible: Easy to add new document types

📦 Installation

Basic Installation

pip install ragtester

With Optional Dependencies

# For specific LLM providers
pip install ragtester[openai]        # OpenAI API support
pip install ragtester[anthropic]     # Anthropic API support
pip install ragtester[bedrock]       # AWS Bedrock support
pip install ragtester[grok]          # Grok (xAI) API support
pip install ragtester[gemini]        # Google Gemini API support
pip install ragtester[mistral]       # Mistral AI API support
pip install ragtester[cohere]        # Cohere API support
pip install ragtester[huggingface]   # Hugging Face Inference API support
pip install ragtester[fireworks]     # Fireworks AI API support
pip install ragtester[together]      # Together AI API support
pip install ragtester[perplexity]    # Perplexity AI API support
pip install ragtester[deepseek]      # DeepSeek API support
pip install ragtester[reka]          # Reka AI API support
pip install ragtester[qwen]          # Qwen (Alibaba) API support
pip install ragtester[moonshot]      # Moonshot AI API support
pip install ragtester[zhipu]         # Zhipu AI API support
pip install ragtester[baidu]         # Baidu ERNIE API support
pip install ragtester[zeroone]       # 01.AI API support


# Local model support
pip install ragtester[local-llama]         # llama.cpp (GGUF) models
pip install ragtester[ollama]              # Ollama local models
pip install ragtester[local-transformers]  # Local Transformers models

From Source

git clone https://github.com/abhilashms230/ragtester.git
cd ragtester
pip install -e .

🎯 Quick Start

1. Basic RAG Evaluation

from ragtester import RAGTester, RAGTestConfig, LLMConfig
from ragtester.config import GenerationPlan
from ragtester.types import TestCategory

def my_rag_function(question: str) -> str:
    """Your RAG system implementation"""
    # Your retrieval and generation logic here
    return "Generated answer based on documents"

# Configure the evaluation
config = RAGTestConfig(
    llm=LLMConfig(
        provider="openai",  # or "anthropic", "grok", "gemini", "mistral", "cohere", "huggingface", "fireworks", "together", "perplexity", "bedrock", "local"
        model="gpt-4o-mini",
        api_key="your-api-key",
        temperature=0.7, # configurable by user
        max_tokens=2048, # configurable by user
    ),
    generation=GenerationPlan(
        per_category={
            TestCategory.FAITHFULNESS: 10, # configurable by user
            TestCategory.ANSWER_QUALITY: 10, # configurable by user
            TestCategory.TOXICITY: 10, # configurable by user
            TestCategory.ROBUSTNESS_RELIABILITY: 10, # configurable by user
            TestCategory.SECURITY_SAFETY: 10, # configurable by user
        }
    )
)

# Create tester and run evaluation
tester = RAGTester(rag_callable=my_rag_function, config=config)
tester.upload_documents(["docs/manual.pdf", "docs/guide.txt"])
results = tester.run_all_tests()
# Export results to CSV
csv_path = "rag_test_results.csv"
print(f"\n💾 Exporting detailed results to: {csv_path}")
tester.export_results(results, csv_path)
# View results
tester.print_summary(results)

2. API-Based RAG Evaluation

from ragtester import RAGTester, RAGTestConfig, LLMConfig

config = RAGTestConfig(
    llm=LLMConfig(provider="anthropic", model="claude-3-5-sonnet-20241022")
)

tester = RAGTester(
    rag_api_url="https://your-rag-api.com/query",
    config=config
)

tester.upload_documents(["docs/knowledge_base.pdf"])
results = tester.run_all_tests()
tester.print_summary(results)
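The request and response schema expected by `rag_api_url` is not documented here. If your endpoint's contract differs, one option is to wrap the HTTP call yourself and pass it as `rag_callable` instead. A minimal sketch; the `{"question": ...}` request body and `"answer"` response field are assumptions to adapt, not part of the library's API:

```python
import json
from urllib.request import Request, urlopen

def make_rag_callable(api_url: str, transport=None):
    """Wrap an HTTP RAG endpoint as a plain callable for RAGTester.

    Assumes the endpoint accepts {"question": ...} and returns
    {"answer": ...}; adjust the field names to match your API.
    `transport` is injectable for testing (defaults to urlopen).
    """
    def default_transport(req: Request) -> bytes:
        with urlopen(req, timeout=60) as resp:
            return resp.read()

    send = transport or default_transport

    def ask(question: str) -> str:
        payload = json.dumps({"question": question}).encode("utf-8")
        req = Request(api_url, data=payload,
                      headers={"Content-Type": "application/json"})
        return json.loads(send(req))["answer"]

    return ask

# tester = RAGTester(
#     rag_callable=make_rag_callable("https://your-rag-api.com/query"),
#     config=config,
# )
```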

3. Local Model RAG

from ragtester import RAGTester, RAGTestConfig, LLMConfig
from ragtester.config import GenerationPlan
from ragtester.types import TestCategory

def my_rag_function(question: str) -> str:
    """Your RAG system implementation"""
    # Your retrieval and generation logic here
    return "Generated answer based on documents"

# Configure the evaluation
config = RAGTestConfig(
    llm=LLMConfig(
        provider="local",
        model="path/to/your/model.gguf",  # Replace with actual path
        temperature=0.7,
        max_tokens=2048, # configurable by user
        extra={
            "n_ctx": 4096 # configurable by user
        }
    ),
    generation=GenerationPlan(
        per_category={
            TestCategory.FAITHFULNESS: 5, # configurable by user
            TestCategory.ANSWER_QUALITY: 5, # configurable by user
            TestCategory.TOXICITY: 3, # configurable by user
            TestCategory.ROBUSTNESS_RELIABILITY: 3, # configurable by user
            TestCategory.SECURITY_SAFETY: 3, # configurable by user
        }
    )
)

# Create tester and run evaluation
tester = RAGTester(rag_callable=my_rag_function, config=config)
tester.upload_documents(["docs/manual.pdf", "docs/guide.txt"])
results = tester.run_all_tests()
# Export results to CSV
csv_path = "rag_test_results.csv"
print(f"\n💾 Exporting detailed results to: {csv_path}")
tester.export_results(results, csv_path)
# View results
tester.print_summary(results)

📊 Output Formats

Console Summary

============================================================
RAG EVALUATION RESULTS
============================================================
Overall Score: 3.8/5.0

📊 Detailed Breakdown:
├── Faithfulness: 4.2/5.0 (5 questions)
├── Answer Quality: 3.6/5.0 (5 questions)
├── Toxicity: 4.0/5.0 (3 questions)
├── Robustness & Reliability: 3.4/5.0 (3 questions)
└── Security & Safety: 3.8/5.0 (3 questions)

✅ 19/19 tests completed successfully
⏱️  Total evaluation time: 2m 34s

CSV Export

from ragtester.reporter import export_csv

# Export detailed results
export_csv(results, "rag_evaluation_results.csv")

CSV Columns:

  • category: The metric being evaluated
  • question: The generated question
  • rag_Answer: Your RAG system's response
  • score: Integer score (1-5)
  • reasoning: Detailed evaluation reasoning
  • page_number: The document page used as context for the question

JSON Export

from ragtester.reporter import export_json

export_json(results, "results.json")

Markdown Report

from ragtester.reporter import export_markdown

export_markdown(results, "results.md")

๐ŸŒ Provider Setup

Anthropic Claude Setup

  1. Install Dependencies

    pip install anthropic
    
  2. Get API Key (create one in the Anthropic console)

  3. Configure Environment

    export ANTHROPIC_API_KEY=your_api_key_here
    

AWS Bedrock Setup

  1. Install Dependencies

    pip install boto3
    
  2. Configure Credentials

    # Environment variables
    export AWS_ACCESS_KEY_ID=your_access_key
    export AWS_SECRET_ACCESS_KEY=your_secret_key
    export AWS_DEFAULT_REGION=us-east-1
    
  3. Enable Model Access

    • Go to AWS Bedrock console
    • Navigate to "Model access"
    • Request access to desired models
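Once model access is granted, the Bedrock provider is selected in `LLMConfig` like any other. A sketch under the assumption that credentials come from the environment variables above; the model id shown is a placeholder, so substitute any Bedrock model id you have enabled in your region:

```python
from ragtester import RAGTestConfig, LLMConfig

config = RAGTestConfig(
    llm=LLMConfig(
        provider="bedrock",
        # Placeholder model id -- replace with a model enabled in your account
        model="anthropic.claude-3-5-sonnet-20241022-v2:0",
        temperature=0.7,
        max_tokens=2048,
    )
)
```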

Local Model Setup

  1. Download GGUF Model

    # Example: Download Vicuna 7B
    wget https://huggingface.co/TheBloke/vicuna-7B-v1.5-GGUF/resolve/main/vicuna-7b-v1.5.Q4_K_M.gguf
    
  2. Configure Local Provider

    config = RAGTestConfig(
        llm=LLMConfig(
            provider="local",
            model="path/to/vicuna-7b-v1.5.Q4_K_M.gguf",
            temperature=0.7,
            max_tokens=2048,
            extra={
                "n_ctx": 4096,
                "n_gpu_layers": -1,  # Use GPU if available
            }
        )
    )
    

๐Ÿ“ Project Structure

ragtester/
├── __init__.py                # Main package exports
├── config.py                  # Configuration classes
├── tester.py                  # Main RAGTester class
├── types.py                   # Data structures and enums
├── utils.py                   # Utility functions
├── document_loader/           # Document processing
│   ├── base.py                # Base document loader
│   ├── simple_loaders.py      # PDF, text, markdown loaders
│   └── random_page_loader.py  # Advanced page selection
├── evaluator/                 # Response evaluation
│   ├── base.py                # Base evaluator interface
│   └── metrics_judge.py       # LLM-based evaluation
├── llm/                       # LLM providers
│   ├── base.py                # Base LLM interface
│   ├── providers.py           # Provider factory
│   ├── providers_openai.py    # OpenAI integration
│   ├── providers_anthropic.py # Anthropic integration
│   ├── providers_bedrock.py   # AWS Bedrock integration
│   └── providers_local.py     # Local model support
├── question_generator/        # Question generation
│   ├── base.py                # Base generator interface
│   └── generators.py          # LLM-based generation
├── rag_client/                # RAG system clients
│   ├── base.py                # Base client interface
│   └── clients.py             # API and callable clients
└── reporter/                  # Result reporting
    ├── base.py                # Base reporter interface
    └── reporter.py            # CSV, JSON, Markdown export

🔧 Troubleshooting

Common Issues

AWS Bedrock Model Access Issues

CUDA/llama-cpp-python Issues

# Install CPU-only version (recommended)
pip install llama-cpp-python --force-reinstall --no-cache-dir

# For GPU support (if you have CUDA), rebuild with CUDA enabled
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Import Errors

# Check Python version (3.9+ required)
python --version

API Key Issues

# Set environment variables
import os
os.environ["OPENAI_API_KEY"] = "your-key"
os.environ["ANTHROPIC_API_KEY"] = "your-key"
os.environ["XAI_API_KEY"] = "your-key"
os.environ["GOOGLE_API_KEY"] = "your-key"
os.environ["MISTRAL_API_KEY"] = "your-key"
os.environ["COHERE_API_KEY"] = "your-key"
os.environ["HF_TOKEN"] = "your-key"
os.environ["FIREWORKS_API_KEY"] = "your-key"
os.environ["TOGETHER_API_KEY"] = "your-key"
os.environ["PERPLEXITY_API_KEY"] = "your-key"
os.environ["DEEPSEEK_API_KEY"] = "your-key"
os.environ["REKA_API_KEY"] = "your-key"
os.environ["QWEN_API_KEY"] = "your-key"
os.environ["MOONSHOT_API_KEY"] = "your-key"
os.environ["ZHIPU_API_KEY"] = "your-key"
os.environ["BAIDU_API_KEY"] = "your-key"
os.environ["ZEROONE_API_KEY"] = "your-key"

# Or pass directly
config = RAGTestConfig(
    llm=LLMConfig(
        provider="openai",  # or "anthropic", "grok", "gemini", "mistral", "cohere", "huggingface", "fireworks", "together", "perplexity", "deepseek", "reka", "qwen", "moonshot", "zhipu", "baidu", "zeroone"
        api_key="your-key"
    )
)

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🆘 Support

๐Ÿ™ Acknowledgments

  • Built with ❤️ for the RAG community
  • Inspired by the need for standardized RAG evaluation
  • Thanks to all contributors and users

Made with ❤️ by ABHILASH M S

Star ⭐ this repository if you find it helpful!
