RAGtester
A comprehensive Python library for testing and evaluating Retrieval-Augmented Generation (RAG) systems with LLM-generated questions and automated evaluation metrics.
Table of Contents
- Overview
- Key Features
- Installation
- Quick Start
- Output Formats
- Provider Setup
- Project Structure
- Troubleshooting
- License
- Support
Overview
RAGtester is a powerful evaluation framework designed to assess the quality, reliability, and safety of RAG systems through automated testing. It generates context-aware questions from your documents and evaluates responses across multiple dimensions using state-of-the-art LLM judges.
Why RAGtester?
- Comprehensive Evaluation: 5-dimensional assessment covering faithfulness, quality, toxicity, robustness, and security
- LLM-Powered: Uses advanced language models for intelligent question generation and evaluation
- Multi-Provider Support: Works with OpenAI, Anthropic, AWS Bedrock, and local models
- Rich Reporting: Detailed CSV, JSON, and Markdown reports with actionable insights
- Easy Integration: Simple API that works with any RAG system
Key Features
5-Dimensional Evaluation System
| Dimension | Description | What It Tests |
|---|---|---|
| Faithfulness | How well responses match provided context | Factual accuracy, hallucination detection |
| Answer Quality | Fluency, clarity, and conciseness | Response coherence, completeness |
| Toxicity | Detection of harmful content | Safety, appropriateness, bias |
| Robustness & Reliability | System behavior under stress | Error handling, edge cases |
| Security & Safety | Resistance to malicious inputs | Prompt injection, data protection |
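Each dimension is scored by an LLM judge on a 1-5 scale. The exact judge prompts and parsing logic live inside the library; purely as an illustration, a judge's free-text verdict might be reduced to an integer score like this (the `Score: N` format and the `parse_judge_score` helper are assumptions for the sketch, not the library's actual API):

```python
import re
from typing import Optional

def parse_judge_score(verdict: str) -> Optional[int]:
    """Extract a 1-5 integer score from a judge's free-text verdict."""
    match = re.search(r"Score:\s*([1-5])", verdict)
    return int(match.group(1)) if match else None

verdict = "The answer sticks closely to the provided context. Score: 4."
print(parse_judge_score(verdict))  # 4
```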
Smart Question Generation
- Context-Aware: Questions tailored to specific document content
- Random Page Selection: Each question uses different document pages
- Metric-Specific: Questions designed for each evaluation dimension
- Behavior Testing: General questions to test system behavior
Multiple LLM Support
| Provider |
|---|
| OpenAI |
| Anthropic |
| AWS Bedrock |
| Grok (xAI) |
| Google Gemini |
| Mistral AI |
| Cohere |
| Hugging Face |
| Fireworks AI |
| Together AI |
| Perplexity |
| Local |
Document Support
- PDF Files: Automatic text extraction and page selection
- Text Files: Direct processing with encoding detection
- Markdown Files: Full support with formatting preservation
- Extensible: Easy to add new document types
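The "encoding detection" behavior above can be approximated with a simple fallback chain. This standalone sketch is illustrative only and does not mirror the library's actual loader interface:

```python
def load_text(path: str) -> str:
    """Read a text file, trying UTF-8 first and falling back to Latin-1."""
    for encoding in ("utf-8", "latin-1"):
        try:
            with open(path, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path}")
```

Latin-1 accepts any byte sequence, so the chain always returns text; a real loader would likely try more encodings or use a detection library.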
Installation
Basic Installation
pip install ragtester
With Optional Dependencies
# For specific LLM providers
pip install ragtester[openai] # OpenAI API support
pip install ragtester[anthropic] # Anthropic API support
pip install ragtester[bedrock] # AWS Bedrock support
pip install ragtester[grok] # Grok (xAI) API support
pip install ragtester[gemini] # Google Gemini API support
pip install ragtester[mistral] # Mistral AI API support
pip install ragtester[cohere] # Cohere API support
pip install ragtester[huggingface] # Hugging Face Inference API support
pip install ragtester[fireworks] # Fireworks AI API support
pip install ragtester[together] # Together AI API support
pip install ragtester[perplexity] # Perplexity AI API support
pip install ragtester[deepseek] # DeepSeek API support
pip install ragtester[reka] # Reka AI API support
pip install ragtester[qwen] # Qwen (Alibaba) API support
pip install ragtester[moonshot] # Moonshot AI API support
pip install ragtester[zhipu] # Zhipu AI API support
pip install ragtester[baidu] # Baidu ERNIE API support
pip install ragtester[zeroone] # 01.AI API support
# Local model support
pip install ragtester[local-llama] # llama.cpp (GGUF) models
pip install ragtester[ollama] # Ollama local models
pip install ragtester[local-transformers] # Local transformers models
From Source
```bash
git clone https://github.com/abhilashms230/ragtester.git
cd ragtester
pip install -e .
```
Quick Start
1. Basic RAG Evaluation
from ragtester import RAGTester, RAGTestConfig, LLMConfig
from ragtester.config import GenerationPlan
from ragtester.types import TestCategory

def my_rag_function(question: str) -> str:
    """Your RAG system implementation."""
    # Your retrieval and generation logic here
    return "Generated answer based on documents"

# Configure the evaluation
config = RAGTestConfig(
    llm=LLMConfig(
        provider="openai",  # or "anthropic", "grok", "gemini", "mistral", "cohere", "huggingface", "fireworks", "together", "perplexity", "bedrock", "local"
        model="gpt-4o-mini",
        api_key="your-api-key",
        temperature=0.7,  # configurable by user
        max_tokens=2048,  # configurable by user
    ),
    generation=GenerationPlan(
        per_category={
            TestCategory.FAITHFULNESS: 10,
            TestCategory.ANSWER_QUALITY: 10,
            TestCategory.TOXICITY: 10,
            TestCategory.ROBUSTNESS_RELIABILITY: 10,
            TestCategory.SECURITY_SAFETY: 10,
        }  # counts per category are configurable
    ),
)

# Create tester and run evaluation
tester = RAGTester(rag_callable=my_rag_function, config=config)
tester.upload_documents(["docs/manual.pdf", "docs/guide.txt"])
results = tester.run_all_tests()

# Export results to CSV
csv_path = "rag_test_results.csv"
print(f"\nExporting detailed results to: {csv_path}")
tester.export_results(results, csv_path)

# View results
tester.print_summary(results)
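With the plan above, the total number of generated questions is simply the sum of the per-category counts (here 5 categories × 10 questions = 50). A quick sanity check of that arithmetic, using plain strings in place of the `TestCategory` enum:

```python
# Mirror of the per_category plan, with plain strings standing in for TestCategory
per_category = {
    "faithfulness": 10,
    "answer_quality": 10,
    "toxicity": 10,
    "robustness_reliability": 10,
    "security_safety": 10,
}
total_questions = sum(per_category.values())
print(total_questions)  # 50
```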
2. API-Based RAG Evaluation
from ragtester import RAGTester, RAGTestConfig, LLMConfig

config = RAGTestConfig(
    llm=LLMConfig(provider="anthropic", model="claude-3-5-sonnet-20241022")
)

tester = RAGTester(
    rag_api_url="https://your-rag-api.com/query",
    config=config,
)
tester.upload_documents(["docs/knowledge_base.pdf"])
results = tester.run_all_tests()
tester.print_summary(results)
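The request/response contract expected at `rag_api_url` is not documented on this page. As a hedged assumption, an endpoint like this commonly accepts a JSON body carrying the question and returns a JSON body carrying the answer; the sketch below models that hypothetical contract as pure functions (`build_payload` and `parse_answer` are illustrative names, not part of ragtester):

```python
import json

def build_payload(question: str) -> str:
    """Serialize a question into a hypothetical JSON request body."""
    return json.dumps({"question": question})

def parse_answer(body: str) -> str:
    """Extract the answer field from a hypothetical JSON response body."""
    return json.loads(body)["answer"]

payload = build_payload("What does the manual say about setup?")
response = '{"answer": "See chapter 2."}'
print(parse_answer(response))  # See chapter 2.
```

Check your own API's schema; if it differs, wrapping it in a `rag_callable` function (as in example 1) is a straightforward alternative.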
3. Local Model RAG
from ragtester import RAGTester, RAGTestConfig, LLMConfig
from ragtester.config import GenerationPlan
from ragtester.types import TestCategory

def my_rag_function(question: str) -> str:
    """Your RAG system implementation."""
    # Your retrieval and generation logic here
    return "Generated answer based on documents"

# Configure the evaluation
config = RAGTestConfig(
    llm=LLMConfig(
        provider="local",
        model="path/to/your/model.gguf",  # Replace with actual path
        temperature=0.7,
        max_tokens=2048,  # configurable by user
        extra={
            "n_ctx": 4096,  # configurable by user
        },
    ),
    generation=GenerationPlan(
        per_category={
            TestCategory.FAITHFULNESS: 5,
            TestCategory.ANSWER_QUALITY: 5,
            TestCategory.TOXICITY: 3,
            TestCategory.ROBUSTNESS_RELIABILITY: 3,
            TestCategory.SECURITY_SAFETY: 3,
        }  # counts per category are configurable
    ),
)

# Create tester and run evaluation
tester = RAGTester(rag_callable=my_rag_function, config=config)
tester.upload_documents(["docs/manual.pdf", "docs/guide.txt"])
results = tester.run_all_tests()

# Export results to CSV
csv_path = "rag_test_results.csv"
print(f"\nExporting detailed results to: {csv_path}")
tester.export_results(results, csv_path)

# View results
tester.print_summary(results)
Output Formats
Console Summary
============================================================
RAG EVALUATION RESULTS
============================================================
Overall Score: 3.8/5.0
Detailed Breakdown:
├── Faithfulness: 4.2/5.0 (5 questions)
├── Answer Quality: 3.6/5.0 (5 questions)
├── Toxicity: 4.0/5.0 (3 questions)
├── Robustness & Reliability: 3.4/5.0 (3 questions)
└── Security & Safety: 3.8/5.0 (3 questions)
19/19 tests completed successfully
Total evaluation time: 2m 34s
CSV Export
from ragtester.reporter import export_csv
# Export detailed results
export_csv(results, "rag_evaluation_results.csv")
CSV Columns:
- category: The metric being evaluated
- question: The generated question
- rag_Answer: Your RAG system's response
- score: Integer score (1-5)
- reasoning: Detailed evaluation reasoning
- page_number: Context page number
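Once exported, the CSV can be summarized with the standard library alone. For example, averaging the score per category (column names taken from the list above; the `average_scores` helper is illustrative, not part of ragtester):

```python
import csv
import io
from collections import defaultdict
from statistics import mean

def average_scores(csv_text: str) -> dict:
    """Compute the mean score per category from exported results."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row["category"]].append(int(row["score"]))
    return {category: mean(values) for category, values in scores.items()}

# Minimal fabricated sample in the exported column layout
sample = (
    "category,question,rag_Answer,score,reasoning,page_number\n"
    "Faithfulness,Q1,A1,4,ok,2\n"
    "Faithfulness,Q2,A2,5,good,7\n"
    "Toxicity,Q3,A3,3,fair,1\n"
)
print(average_scores(sample))
```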
JSON Export
from ragtester.reporter import export_json
export_json(results, "results.json")
Markdown Report
from ragtester.reporter import export_markdown
export_markdown(results, "results.md")
Provider Setup
Anthropic Claude Setup
1. Install Dependencies
pip install anthropic
2. Get API Key
- Visit the Anthropic Console
- Create an account and get your API key
3. Configure Environment
export ANTHROPIC_API_KEY=your_api_key_here
AWS Bedrock Setup
1. Install Dependencies
pip install boto3
2. Configure Credentials
# Environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
3. Enable Model Access
- Go to the AWS Bedrock console
- Navigate to "Model access"
- Request access to desired models
Local Model Setup
1. Download GGUF Model
# Example: Download Vicuna 7B
wget https://huggingface.co/TheBloke/vicuna-7B-v1.5-GGUF/resolve/main/vicuna-7b-v1.5.Q4_K_M.gguf
2. Configure Local Provider
config = RAGTestConfig(
    llm=LLMConfig(
        provider="local",
        model="path/to/vicuna-7b-v1.5.Q4_K_M.gguf",
        temperature=0.7,
        max_tokens=2048,
        extra={
            "n_ctx": 4096,
            "n_gpu_layers": -1,  # Use GPU if available
        },
    )
)
Project Structure
ragtester/
├── __init__.py               # Main package exports
├── config.py                 # Configuration classes
├── tester.py                 # Main RAGTester class
├── types.py                  # Data structures and enums
├── utils.py                  # Utility functions
├── document_loader/          # Document processing
│   ├── base.py               # Base document loader
│   ├── simple_loaders.py     # PDF, text, markdown loaders
│   └── random_page_loader.py # Advanced page selection
├── evaluator/                # Response evaluation
│   ├── base.py               # Base evaluator interface
│   └── metrics_judge.py      # LLM-based evaluation
├── llm/                      # LLM providers
│   ├── base.py               # Base LLM interface
│   ├── providers.py          # Provider factory
│   ├── providers_openai.py   # OpenAI integration
│   ├── providers_anthropic.py # Anthropic integration
│   ├── providers_bedrock.py  # AWS Bedrock integration
│   └── providers_local.py    # Local model support
├── question_generator/       # Question generation
│   ├── base.py               # Base generator interface
│   └── generators.py         # LLM-based generation
├── rag_client/               # RAG system clients
│   ├── base.py               # Base client interface
│   └── clients.py            # API and callable clients
└── reporter/                 # Result reporting
    ├── base.py               # Base reporter interface
    └── reporter.py           # CSV, JSON, Markdown export
Troubleshooting
Common Issues
AWS Bedrock Model Access Issues
CUDA/llama-cpp-python Issues
# Install CPU-only version (recommended)
pip install llama-cpp-python --force-reinstall --no-cache-dir
# For GPU support (requires the CUDA toolkit; build llama-cpp-python with CUDA enabled)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Import Errors
# Check Python version (3.9+ required)
python --version
API Key Issues
# Set environment variables
import os
os.environ["OPENAI_API_KEY"] = "your-key"
os.environ["ANTHROPIC_API_KEY"] = "your-key"
os.environ["XAI_API_KEY"] = "your-key"
os.environ["GOOGLE_API_KEY"] = "your-key"
os.environ["MISTRAL_API_KEY"] = "your-key"
os.environ["COHERE_API_KEY"] = "your-key"
os.environ["HF_TOKEN"] = "your-key"
os.environ["FIREWORKS_API_KEY"] = "your-key"
os.environ["TOGETHER_API_KEY"] = "your-key"
os.environ["PERPLEXITY_API_KEY"] = "your-key"
os.environ["DEEPSEEK_API_KEY"] = "your-key"
os.environ["REKA_API_KEY"] = "your-key"
os.environ["QWEN_API_KEY"] = "your-key"
os.environ["MOONSHOT_API_KEY"] = "your-key"
os.environ["ZHIPU_API_KEY"] = "your-key"
os.environ["BAIDU_API_KEY"] = "your-key"
os.environ["ZEROONE_API_KEY"] = "your-key"
# Or pass directly
config = RAGTestConfig(
    llm=LLMConfig(
        provider="openai",  # or "anthropic", "grok", "gemini", "mistral", "cohere", "huggingface", "fireworks", "together", "perplexity", "deepseek", "reka", "qwen", "moonshot", "zhipu", "baidu", "zeroone"
        api_key="your-key",
    )
)
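To keep keys out of source code, a small helper can resolve the key from the environment and fail fast when it is missing. The `resolve_api_key` name and the env-var mapping below are illustrative (the variable names follow the list above; check each provider's docs for the authoritative name):

```python
import os

# Illustrative provider -> environment variable mapping (subset)
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "grok": "XAI_API_KEY",
}

def resolve_api_key(provider: str) -> str:
    """Look up the provider's API key from the environment, or fail loudly."""
    var = ENV_VARS[provider]
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} before running the evaluation")
    return key

os.environ["OPENAI_API_KEY"] = "sk-demo"  # for demonstration only
print(resolve_api_key("openai"))  # sk-demo
```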
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Support
- Email: abhilashms230@gmail.com
- Discussions: GitHub Discussions
Acknowledgments
- Built with ❤️ for the RAG community
- Inspired by the need for standardized RAG evaluation
- Thanks to all contributors and users
Made with ❤️ by ABHILASH M S
Star ⭐ this repository if you find it helpful!