A robust Python package for evaluating Retrieval-Augmented Generation (RAG) systems.
RAG Evaluation
RAG Evaluation is a Python package designed for evaluating Retrieval-Augmented Generation (RAG) systems. It provides a systematic way to score and analyze the quality of responses generated by RAG systems. The package is particularly suited to projects leveraging Large Language Models (LLMs) such as GPT, Gemini, and Llama.
It integrates easily with the OpenAI API via the openai package and automatically handles environment variable-based API key loading through python-dotenv.
Features
- Multi-Metric Evaluation: Evaluate RAG system output using the following metrics:
- Query Relevance: Measures how well the RAG system output addresses the user’s query.
- Factual Accuracy: Ensures the response is factually correct with respect to the source document.
- Coverage: Checks that all key points from the source relevant to the query are included.
- Coherence: Assesses the logical flow and organization of ideas in the RAG system output.
- Fluency: Assesses readability, grammar, and naturalness of the language.
- Standardized Prompting: Uses a well-defined prompt template to assess RAG systems consistently.
- Customizable Weighting: Allows users to adjust the relative importance of each metric to tailor the overall accuracy to their specific priorities.
- Easy Integration: Provides a high-level function to integrate evaluation into your RAG system pipelines.
Installation
pip install rag_evaluation
API-Key Management
1. Environment / .env file – zero code changes
# .env or exported in the shell
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIzaSy.
The package loads these automatically via python-dotenv.
2. One-time, in-memory key (per Python session)
import rag_evaluation as rag_eval
rag_eval.set_api_key("openai", "sk-live...")
rag_eval.set_api_key("gemini", "AIzaSy...")
# Any subsequent call picks these keys up automatically;
# keys set this way take precedence over environment variables.
3. Explicit lookup / fallback
from rag_evaluation.config import get_api_key
key = get_api_key("openai", default_key="sk-fallback...")
# Priority inside get_api_key
# cache (set_api_key) ➜ env/.env ➜ default_key ➜ ValueError.
Usage
Open-Source Local Models (Ollama models; no external API required)
Currently, the package supports Llama, Mistral, and Qwen.
Step 1: Download and install Ollama (see the Ollama website for instructions).
Step 2: Choose a model from Ollama's model library and download it with ollama pull <model_name>
# Check the list of models available on your local machine
from openai import OpenAI

client = OpenAI(
    api_key="ollama",  # placeholder; the local Ollama endpoint does not validate it
    base_url="http://localhost:11434/v1",
)
# List all available models
models = client.models.list()
print(models.to_json())
Usage with Open-Source Models (Ollama models)
from rag_evaluation.evaluator import evaluate_response
# Define the inputs
query = "Which large language model is currently the largest and most capable?"
response_text = """The largest and most capable LLMs are the generative pretrained transformers (GPTs). These models are
designed to handle complex language tasks, and their vast number of parameters gives them the ability to
understand and generate human-like text."""
document = """A large language model (LLM) is a type of machine learning model designed for natural language processing
tasks such as language generation. LLMs are language models with many parameters, and are trained with
self-supervised learning on a vast amount of text. The largest and most capable LLMs are
generative pretrained transformers (GPTs). Modern models can be fine-tuned for specific tasks or guided
by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies
inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained in."""
# Llama usage (ollama pull llama3.2:1b to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name="llama3.2:1b",
)
print(report)
# Mistral usage (ollama pull mistral to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name="mistral:latest",
    # Optional metric_weights, in the order:
    # [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
    metric_weights=[0.1, 0., 0.9, 0., 0.],
)
print(report)
# Qwen usage (ollama pull qwen to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name="qwen:latest",
)
print(report)
For API-based Models (GPT and Gemini)
# Define the inputs (same as above)
# OpenAI usage
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="openai",
    model_name="gpt-4.1",
)
print(report)
# Gemini usage
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="gemini",
    model_name="gemini-2.5-flash",
    # Optional metric_weights, in the order:
    # [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
    metric_weights=[0.1, 0.4, 0.5, 0., 0.],
)
print(report)
Customizing Metric Weights
By default, RAG Evaluation balances the five core metrics as follows:
| Metric | Default Weight |
|---|---|
| Query Relevance | 0.25 |
| Factual Accuracy | 0.25 |
| Coverage | 0.25 |
| Coherence | 0.125 |
| Fluency | 0.125 |
These defaults reflect our view that modern LLMs (GPT, Llama, Gemini, etc.) already excel at coherence and fluency, so we place greater emphasis on the metrics that most impact RAG system accuracy.
If you’d like to emphasize certain aspects of your RAG system’s output (say you care twice as much about factual accuracy as about coherence), you can supply your own weights via the metric_weights parameter in evaluate_response. Your custom weights must:
- Be a list of five floats.
- Sum to 1.0 (ensuring they form a valid weighted average).
- Follow this order:
[Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
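These constraints can be checked up front with a few lines of your own. The helper below is a minimal sketch for illustration, not part of the package API:

```python
def check_metric_weights(weights):
    # Expected order:
    # [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
    if len(weights) != 5:
        raise ValueError("metric_weights must contain exactly five floats")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("metric_weights must sum to 1.0")
    return True
```
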
Example
from rag_evaluation.evaluator import evaluate_response
# Define the inputs
query = "..."
response = "..."
document = "..."
# Emphasize factual accuracy and coverage
custom_weights = [0.1, 0.4, 0.4, 0.05, 0.05]
report = evaluate_response(
query=query,
response=response,
document=document,
model_type="openai",
model_name="gpt-4.1",
metric_weights=custom_weights
)
print(report)
This simple mechanism lets users tailor the Overall Accuracy score to whatever aspects matter most in their evaluation scenario.
Output
The evaluate_response function returns a pandas DataFrame with:
- Metric Names: Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency.
- Normalized Scores: A 0–1 score for each metric.
- Percentage Scores: The normalized score expressed as a percentage.
- Overall Accuracy: A weighted average score across all metrics.
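For instance, with the default weights and some hypothetical normalized scores, the Overall Accuracy works out as a straightforward weighted average:

```python
# Default weights, in the order:
# [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
weights = [0.25, 0.25, 0.25, 0.125, 0.125]
scores = [0.9, 0.8, 0.7, 1.0, 1.0]  # hypothetical normalized (0-1) metric scores

overall_accuracy = sum(w * s for w, s in zip(weights, scores))
print(round(overall_accuracy, 3))  # 0.85
```
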
Need help?
- Open an issue or pull request on GitHub
- For more examples of how to use the package, see the example notebook
File details
Details for the file rag_evaluation-0.2.3.tar.gz.
File metadata
- Download URL: rag_evaluation-0.2.3.tar.gz
- Upload date:
- Size: 14.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38fed2202bc77549936e8bcd9e5719f82d13c3ce3eff5c4ca86fa05477d458a6 |
| MD5 | 213a1430ce13806b46a3a083b9bd370f |
| BLAKE2b-256 | 0d9d4750dd2e391ca3a133dd175ba035cb2aadbc21f50f098c2aeffad322cd67 |
File details
Details for the file rag_evaluation-0.2.3-py3-none-any.whl.
File metadata
- Download URL: rag_evaluation-0.2.3-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fed9ac79fc4125cb8d7c43e2cecb4aa778e79ef7a6d75cdfe355b71ca2fd0e32 |
| MD5 | 67a2cfbdc517225799be30b904bc4ddc |
| BLAKE2b-256 | 45e70cd3cc8d118adf4a3225b7fac84f268977eecf5a32b2fe8290491e37fbc9 |