
RAGtester

Python 3.9+ License: Apache 2.0 PyPI version

A comprehensive Python library for testing and evaluating Retrieval-Augmented Generation (RAG) systems with LLM-generated questions and automated evaluation metrics.

🎯 Overview

RAGtester is a powerful evaluation framework designed to assess the quality, reliability, and safety of RAG systems through automated testing. It generates context-aware questions from your documents and evaluates responses across multiple dimensions using state-of-the-art LLM judges.

Why RAGtester?

  • 🔍 Comprehensive Evaluation: 5-dimensional assessment covering faithfulness, quality, toxicity, robustness, and security
  • 🤖 LLM-Powered: Uses advanced language models for intelligent question generation and evaluation
  • 🔄 Multi-Provider Support: Works with OpenAI, Anthropic, AWS Bedrock, and local models
  • 📊 Rich Reporting: Detailed CSV, JSON, and Markdown reports with actionable insights
  • ⚡ Easy Integration: Simple API that works with any RAG system

🚀 Key Features

📊 5-Dimensional Evaluation System

| Dimension | Description | What It Tests |
|---|---|---|
| Faithfulness | How well responses match the provided context | Factual accuracy, hallucination detection |
| Answer Quality | Fluency, clarity, and conciseness | Response coherence, completeness |
| Toxicity | Detection of harmful content | Safety, appropriateness, bias |
| Robustness & Reliability | System behavior under stress | Error handling, edge cases |
| Security & Safety | Resistance to malicious inputs | Prompt injection, data protection |

🎯 Smart Question Generation

  • Context-Aware: Questions tailored to specific document content
  • Random Page Selection: Each question uses different document pages
  • Metric-Specific: Questions designed for each evaluation dimension
  • Behavior Testing: General questions to test system behavior

🤖 Multiple LLM Support

Supported providers: OpenAI, Anthropic, AWS Bedrock, Grok (xAI), Google Gemini, Mistral AI, Cohere, Hugging Face, Fireworks AI, Together AI, Perplexity, and local models.

๐Ÿ“ Document Support

  • PDF Files: Automatic text extraction and page selection
  • Text Files: Direct processing with encoding detection
  • Markdown Files: Full support with formatting preservation
  • Extensible: Easy to add new document types

📦 Installation

Basic Installation

pip install ragtester

With Optional Dependencies

# For specific LLM providers
pip install ragtester[openai]        # OpenAI API support
pip install ragtester[anthropic]     # Anthropic API support
pip install ragtester[bedrock]       # AWS Bedrock support
pip install ragtester[grok]          # Grok (xAI) API support
pip install ragtester[gemini]        # Google Gemini API support
pip install ragtester[mistral]       # Mistral AI API support
pip install ragtester[cohere]        # Cohere API support
pip install ragtester[huggingface]   # Hugging Face Inference API support
pip install ragtester[fireworks]     # Fireworks AI API support
pip install ragtester[together]      # Together AI API support
pip install ragtester[perplexity]    # Perplexity AI API support
pip install ragtester[deepseek]      # DeepSeek API support
pip install ragtester[reka]          # Reka AI API support
pip install ragtester[qwen]          # Qwen (Alibaba) API support
pip install ragtester[moonshot]      # Moonshot AI API support
pip install ragtester[zhipu]         # Zhipu AI API support
pip install ragtester[baidu]         # Baidu ERNIE API support
pip install ragtester[zeroone]       # 01.AI API support


# Local model support
pip install ragtester[local-llama]         # llama.cpp (GGUF) models
pip install ragtester[ollama]              # Ollama local models
pip install ragtester[local-transformers]  # Local Transformers models

From Source

git clone https://github.com/abhilashms230/ragtester.git
cd ragtester
pip install -e .

🎯 Quick Start

1. Basic RAG Evaluation

from ragtester import RAGTester, RAGTestConfig, LLMConfig
from ragtester.config import GenerationPlan
from ragtester.types import TestCategory

def my_rag_function(question: str) -> str:
    """Your RAG system implementation"""
    # Your retrieval and generation logic here
    return "Generated answer based on documents"

# Configure the evaluation
config = RAGTestConfig(
    llm=LLMConfig(
        provider="openai",  # or "anthropic", "grok", "gemini", "mistral", "cohere", "huggingface", "fireworks", "together", "perplexity", "bedrock", "local"
        model="gpt-4o-mini",
        api_key="your-api-key",
        temperature=0.7, # configurable by user
        max_tokens=2048, # configurable by user
    ),
    generation=GenerationPlan(
        per_category={
            TestCategory.FAITHFULNESS: 10, # configurable by user
            TestCategory.ANSWER_QUALITY: 10, # configurable by user
            TestCategory.TOXICITY: 10, # configurable by user
            TestCategory.ROBUSTNESS_RELIABILITY: 10, # configurable by user
            TestCategory.SECURITY_SAFETY: 10, # configurable by user
        }
    )
)

# Create tester and run evaluation
tester = RAGTester(rag_callable=my_rag_function, config=config)
tester.upload_documents(["docs/manual.pdf", "docs/guide.txt"])
results = tester.run_all_tests()
# Export results to CSV
csv_path = "rag_test_results.csv"
print(f"\n💾 Exporting detailed results to: {csv_path}")
tester.export_results(results, csv_path)
# View results
tester.print_summary(results)

2. API-Based RAG Evaluation

from ragtester import RAGTester, RAGTestConfig, LLMConfig

config = RAGTestConfig(
    llm=LLMConfig(provider="anthropic", model="claude-3-5-sonnet-20241022")
)

tester = RAGTester(
    rag_api_url="https://your-rag-api.com/query",
    config=config
)

tester.upload_documents(["docs/knowledge_base.pdf"])
results = tester.run_all_tests()
tester.print_summary(results)
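The request and response schema expected by `rag_api_url` is not documented here. If your endpoint's contract differs, one option is to wrap the HTTP call yourself and pass it as `rag_callable` instead. A minimal sketch; the `{"question": ...}` request body and `"answer"` response field are assumptions to adapt, not part of the library's API:

```python
import json
from urllib.request import Request, urlopen

def make_rag_callable(api_url: str, transport=None):
    """Wrap an HTTP RAG endpoint as a plain callable for RAGTester.

    Assumes the endpoint accepts {"question": ...} and returns
    {"answer": ...}; adjust the field names to match your API.
    `transport` is injectable for testing (defaults to urlopen).
    """
    def default_transport(req: Request) -> bytes:
        with urlopen(req, timeout=60) as resp:
            return resp.read()

    send = transport or default_transport

    def ask(question: str) -> str:
        payload = json.dumps({"question": question}).encode("utf-8")
        req = Request(api_url, data=payload,
                      headers={"Content-Type": "application/json"})
        return json.loads(send(req))["answer"]

    return ask

# tester = RAGTester(
#     rag_callable=make_rag_callable("https://your-rag-api.com/query"),
#     config=config,
# )
```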

3. Local Model RAG

from ragtester import RAGTester, RAGTestConfig, LLMConfig
from ragtester.config import GenerationPlan
from ragtester.types import TestCategory

def my_rag_function(question: str) -> str:
    """Your RAG system implementation"""
    # Your retrieval and generation logic here
    return "Generated answer based on documents"

# Configure the evaluation
config = RAGTestConfig(
    llm=LLMConfig(
        provider="local",
        model="path/to/your/model.gguf",  # Replace with actual path
        temperature=0.7,
        max_tokens=2048, # configurable by user
        extra={
            "n_ctx": 4096 # configurable by user
        }
    ),
    generation=GenerationPlan(
        per_category={
            TestCategory.FAITHFULNESS: 5, # configurable by user
            TestCategory.ANSWER_QUALITY: 5, # configurable by user
            TestCategory.TOXICITY: 3, # configurable by user
            TestCategory.ROBUSTNESS_RELIABILITY: 3, # configurable by user
            TestCategory.SECURITY_SAFETY: 3, # configurable by user
        }
    )
)

# Create tester and run evaluation
tester = RAGTester(rag_callable=my_rag_function, config=config)
tester.upload_documents(["docs/manual.pdf", "docs/guide.txt"])
results = tester.run_all_tests()
# Export results to CSV
csv_path = "rag_test_results.csv"
print(f"\n💾 Exporting detailed results to: {csv_path}")
tester.export_results(results, csv_path)
# View results
tester.print_summary(results)

📊 Output Formats

Console Summary

============================================================
RAG EVALUATION RESULTS
============================================================
Overall Score: 3.8/5.0

📊 Detailed Breakdown:
├── Faithfulness: 4.2/5.0 (5 questions)
├── Answer Quality: 3.6/5.0 (5 questions)
├── Toxicity: 4.0/5.0 (3 questions)
├── Robustness & Reliability: 3.4/5.0 (3 questions)
└── Security & Safety: 3.8/5.0 (3 questions)

✅ 19/19 tests completed successfully
⏱️  Total evaluation time: 2m 34s

CSV Export

from ragtester.reporter import export_csv

# Export detailed results
export_csv(results, "rag_evaluation_results.csv")

CSV Columns:

  • category: The metric being evaluated
  • question: The generated question
  • rag_Answer: Your RAG system's response
  • score: Integer score (1-5)
  • reasoning: Detailed evaluation reasoning
  • page_number: The document page used as context for the question

JSON Export

from ragtester.reporter import export_json

export_json(results, "results.json")

Markdown Report

from ragtester.reporter import export_markdown

export_markdown(results, "results.md")

๐ŸŒ Provider Setup

Anthropic Claude Setup

  1. Install Dependencies

    pip install anthropic
    
  2. Get API Key (create one in the Anthropic console)

  3. Configure Environment

    export ANTHROPIC_API_KEY=your_api_key_here
    

AWS Bedrock Setup

  1. Install Dependencies

    pip install boto3
    
  2. Configure Credentials

    # Environment variables
    export AWS_ACCESS_KEY_ID=your_access_key
    export AWS_SECRET_ACCESS_KEY=your_secret_key
    export AWS_DEFAULT_REGION=us-east-1
    
  3. Enable Model Access

    • Go to AWS Bedrock console
    • Navigate to "Model access"
    • Request access to desired models
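Once model access is granted, the Bedrock provider is selected in `LLMConfig` like any other. A sketch under the assumption that credentials come from the environment variables above; the model id shown is a placeholder, so substitute any Bedrock model id you have enabled in your region:

```python
from ragtester import RAGTestConfig, LLMConfig

config = RAGTestConfig(
    llm=LLMConfig(
        provider="bedrock",
        # Placeholder model id -- replace with a model enabled in your account
        model="anthropic.claude-3-5-sonnet-20241022-v2:0",
        temperature=0.7,
        max_tokens=2048,
    )
)
```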

Local Model Setup

  1. Download GGUF Model

    # Example: Download Vicuna 7B
    wget https://huggingface.co/TheBloke/vicuna-7B-v1.5-GGUF/resolve/main/vicuna-7b-v1.5.Q4_K_M.gguf
    
  2. Configure Local Provider

    config = RAGTestConfig(
        llm=LLMConfig(
            provider="local",
            model="path/to/vicuna-7b-v1.5.Q4_K_M.gguf",
            temperature=0.7,
            max_tokens=2048,
            extra={
                "n_ctx": 4096,
                "n_gpu_layers": -1,  # Use GPU if available
            }
        )
    )
    

๐Ÿ“ Project Structure

ragtester/
├── __init__.py                # Main package exports
├── config.py                  # Configuration classes
├── tester.py                  # Main RAGTester class
├── types.py                   # Data structures and enums
├── utils.py                   # Utility functions
├── document_loader/           # Document processing
│   ├── base.py                # Base document loader
│   ├── simple_loaders.py      # PDF, text, markdown loaders
│   └── random_page_loader.py  # Advanced page selection
├── evaluator/                 # Response evaluation
│   ├── base.py                # Base evaluator interface
│   └── metrics_judge.py       # LLM-based evaluation
├── llm/                       # LLM providers
│   ├── base.py                # Base LLM interface
│   ├── providers.py           # Provider factory
│   ├── providers_openai.py    # OpenAI integration
│   ├── providers_anthropic.py # Anthropic integration
│   ├── providers_bedrock.py   # AWS Bedrock integration
│   └── providers_local.py     # Local model support
├── question_generator/        # Question generation
│   ├── base.py                # Base generator interface
│   └── generators.py          # LLM-based generation
├── rag_client/                # RAG system clients
│   ├── base.py                # Base client interface
│   └── clients.py             # API and callable clients
└── reporter/                  # Result reporting
    ├── base.py                # Base reporter interface
    └── reporter.py            # CSV, JSON, Markdown export

🔧 Troubleshooting

Common Issues

AWS Bedrock Model Access Issues

CUDA/llama-cpp-python Issues

# Install CPU-only version (recommended)
pip install llama-cpp-python --force-reinstall --no-cache-dir

# For GPU support (if you have CUDA), rebuild with CUDA enabled
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Import Errors

# Check Python version (3.9+ required)
python --version

API Key Issues

# Set environment variables
import os
os.environ["OPENAI_API_KEY"] = "your-key"
os.environ["ANTHROPIC_API_KEY"] = "your-key"
os.environ["XAI_API_KEY"] = "your-key"
os.environ["GOOGLE_API_KEY"] = "your-key"
os.environ["MISTRAL_API_KEY"] = "your-key"
os.environ["COHERE_API_KEY"] = "your-key"
os.environ["HF_TOKEN"] = "your-key"
os.environ["FIREWORKS_API_KEY"] = "your-key"
os.environ["TOGETHER_API_KEY"] = "your-key"
os.environ["PERPLEXITY_API_KEY"] = "your-key"
os.environ["DEEPSEEK_API_KEY"] = "your-key"
os.environ["REKA_API_KEY"] = "your-key"
os.environ["QWEN_API_KEY"] = "your-key"
os.environ["MOONSHOT_API_KEY"] = "your-key"
os.environ["ZHIPU_API_KEY"] = "your-key"
os.environ["BAIDU_API_KEY"] = "your-key"
os.environ["ZEROONE_API_KEY"] = "your-key"

# Or pass directly
config = RAGTestConfig(
    llm=LLMConfig(
        provider="openai",  # or "anthropic", "grok", "gemini", "mistral", "cohere", "huggingface", "fireworks", "together", "perplexity", "deepseek", "reka", "qwen", "moonshot", "zhipu", "baidu", "zeroone"
        api_key="your-key"
    )
)

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🆘 Support

๐Ÿ™ Acknowledgments

  • Built with ❤️ for the RAG community
  • Inspired by the need for standardized RAG evaluation
  • Thanks to all contributors and users

Made with ❤️ by ABHILASH M S

Star ⭐ this repository if you find it helpful!
