
A robust Python package for evaluating Retrieval-Augmented Generation (RAG) systems.


RAG Evaluation

RAG Evaluation is a Python package designed for evaluating Retrieval-Augmented Generation (RAG) systems. It provides a systematic way to score and analyze the quality of responses generated by RAG systems. The package is particularly suited to projects leveraging Large Language Models (LLMs) such as GPT, Gemini, and Llama.

It integrates easily with the OpenAI API via the openai package and automatically handles environment variable-based API key loading through python-dotenv.

Features

  • Multi-Metric Evaluation: Evaluate RAG system output using the following metrics:
    • Query Relevance: Measures how well the RAG system output addresses the user’s query.
    • Factual Accuracy: Ensures the response is factually correct with respect to the source document.
    • Coverage: Checks that all key points from the source relevant to the query are included.
    • Coherence: Assesses the logical flow and organization of ideas in the RAG system output.
    • Fluency: Assesses the readability, grammar, and naturalness of the language.
  • Standardized Prompting: Uses a well-defined prompt template to consistently assess RAG systems.
  • Customizable Weighting: Allows users to adjust the relative importance of each metric to tailor the overall accuracy to their specific priorities.
  • Easy Integration: Provides a high-level function to integrate evaluation into your RAG system pipelines.

Installation

pip install rag_evaluation

API-Key Management

1. Environment / .env file – zero code changes

# .env  or exported in the shell
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIzaSy...

The package loads these automatically via python-dotenv.

2. One-time, in-memory key (per Python session)

import rag_evaluation as rag_eval

rag_eval.set_api_key("openai", "sk-live...")
rag_eval.set_api_key("gemini", "AIzaSy...")

# Any subsequent call picks these keys up automatically;
# they take precedence over environment variables.

3. Explicit lookup / fallback

from rag_evaluation.config import get_api_key

key = get_api_key("openai", default_key="sk-fallback...")

# Priority inside get_api_key
# cache (set_api_key) ➜ env/.env ➜ default_key ➜ ValueError.

Usage

Open-Source Local Models (Ollama models; does not require external APIs)

Currently, the package supports Llama, Mistral, and Qwen.

Step 1: Download Ollama. Check here for instructions

Step 2: Choose a model from the list and download it using ollama pull <model_name>

# Check the list of models available on your local machine
from openai import OpenAI

client = OpenAI(
    api_key='ollama',
    base_url="http://localhost:11434/v1"
)

# List all available models
models = client.models.list()
print(models.to_json())

Usage with Open-Source Models (Ollama models)

from rag_evaluation.evaluator import evaluate_response

# Define the inputs
query = "Which large language model is currently the largest and most capable?"

response_text = """The largest and most capable LLMs are the generative pretrained transformers (GPTs). These models are 
                designed to handle complex language tasks, and their vast number of parameters gives them the ability to 
                understand and generate human-like text."""
                 
document = """A large language model (LLM) is a type of machine learning model designed for natural language processing 
            tasks such as language generation. LLMs are language models with many parameters, and are trained with 
            self-supervised learning on a vast amount of text. The largest and most capable LLMs are 
            generative pretrained transformers (GPTs). Modern models can be fine-tuned for specific tasks or guided 
            by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies 
            inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained in."""

# Llama usage (ollama pull llama3.2:1b to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='llama3.2:1b'
)
print(report)

# Mistral usage (ollama pull mistral to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='mistral:latest',
    metric_weights=[0.1, 0., 0.9, 0., 0.] # optional metric_weights [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
)
print(report)

# Qwen usage (ollama pull qwen to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='qwen:latest',
)
print(report)

For API-based Models (GPT and Gemini)

# Define the inputs (same as above)

# OpenAI usage 
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="openai",
    model_name='gpt-4.1',
)
print(report)

# Gemini usage 
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="gemini",
    model_name='gemini-2.5-flash',
    metric_weights=[0.1, 0.4, 0.5, 0., 0.] # optional metric_weights [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
)
print(report)

Customizing Metric Weights

By default, RAG Evaluation balances the five core metrics as follows:

Metric            Default Weight
Query Relevance   0.25
Factual Accuracy  0.25
Coverage          0.25
Coherence         0.125
Fluency           0.125

These defaults reflect our view that modern LLMs (GPT, Llama, Gemini, etc.) already excel at coherence and fluency, so we place greater emphasis on the metrics that most impact RAG system accuracy.

If you’d like to emphasize certain aspects of your RAG system’s output (say, you care twice as much about factual accuracy as about coherence), you can supply your own weights via the metric_weights parameter in evaluate_response. Your custom weights must:

  1. Be a list of five floats.
  2. Sum to 1.0 (ensuring they form a valid weighted average).
  3. Follow this order:
    [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
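
A quick way to sanity-check custom weights against these three rules before passing them in (an illustrative helper, not part of the package API):

```python
import math

def validate_metric_weights(weights):
    """Check that weights are five numbers summing to 1.0.

    Order: [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
    """
    if len(weights) != 5:
        raise ValueError(f"Expected 5 weights, got {len(weights)}")
    if not all(isinstance(w, (int, float)) for w in weights):
        raise TypeError("All weights must be numbers")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError(f"Weights must sum to 1.0, got {sum(weights)}")
    return [float(w) for w in weights]
```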

Example

from rag_evaluation.evaluator import evaluate_response

# Define the inputs
query = "..."
response = "..."
document = "..."

# Emphasize factual accuracy and coverage
custom_weights = [0.1, 0.4, 0.4, 0.05, 0.05]

report = evaluate_response(
    query=query,
    response=response,
    document=document,
    model_type="openai",
    model_name="gpt-4.1",
    metric_weights=custom_weights
)

print(report)

This simple mechanism lets users tailor the Overall Accuracy score to whatever aspects matter most in their evaluation scenario.

Output

The evaluate_response function returns a pandas DataFrame with:

  • Metric Names: Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency.
  • Normalized Scores: A 0–1 score for each metric.
  • Percentage Scores: The normalized score expressed as a percentage.
  • Overall Accuracy: A weighted average score across all metrics.
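
Conceptually, the Overall Accuracy is the weighted average of the normalized per-metric scores. A minimal sketch, using the default weights and invented example scores:

```python
metrics = ["Query Relevance", "Factual Accuracy", "Coverage", "Coherence", "Fluency"]
weights = [0.25, 0.25, 0.25, 0.125, 0.125]   # default weights
scores  = [0.9, 0.8, 0.7, 0.95, 0.9]         # example normalized scores (0-1)

# Weighted average across all metrics
overall = sum(w * s for w, s in zip(weights, scores))
print(f"Overall Accuracy: {overall:.5f} ({overall:.1%})")  # 0.83125 (83.1%)
```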

Need help?

  • Open an issue or pull request on GitHub
  • For more examples of how to use the package, see the example notebook
