
A robust Python package for evaluating Retrieval-Augmented Generation (RAG) systems.


RAG Evaluation

RAG Evaluation is a Python package designed for evaluating Retrieval-Augmented Generation (RAG) systems. It provides a systematic way to score and analyze the quality of responses generated by RAG systems. The package is particularly suited to projects leveraging Large Language Models (LLMs) such as GPT, Gemini, and Llama.

It integrates easily with the OpenAI API via the openai package and automatically handles environment variable-based API key loading through python-dotenv.

Features

  • Multi-Metric Evaluation: Evaluate RAG system output using the following metrics:
    • Query Relevance: Measures how well the RAG system output addresses the user’s query.
    • Factual Accuracy: Ensures the response is factually correct with respect to the source document.
    • Coverage: Checks that all key points from the source relevant to the query are included.
    • Coherence: Assesses the logical flow and organization of ideas in the RAG system output.
    • Fluency: Assesses the readability, grammar, and naturalness of the language.
  • Standardized Prompting: Uses a well-defined prompt template to consistently assess RAG systems.
  • Customizable Weighting: Allows users to adjust the relative importance of each metric to tailor the overall accuracy to their specific priorities.
  • Easy Integration: Provides a high-level function to integrate evaluation into your RAG system pipelines.

Installation

pip install rag_evaluation

API-Key Management

1. Environment / .env file – zero code changes

# .env  or exported in the shell
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=AIzaSy...

The package loads these automatically via python-dotenv.

2. One-time, in-memory key (per Python session)

import rag_evaluation as rag_eval

rag_eval.set_api_key("openai", "sk-live...")
rag_eval.set_api_key("gemini", "AIzaSy...")

# Any subsequent call picks these keys up automatically;
# they take precedence over environment variables.

3. Explicit lookup / fallback

from rag_evaluation.config import get_api_key

key = get_api_key("openai", default_key="sk-fallback...")

# Priority inside get_api_key
# cache (set_api_key) ➜ env/.env ➜ default_key ➜ ValueError.

Usage

Open-Source Local Models (Ollama models; does not require external APIs)

Currently, the package supports Llama, Mistral, and Qwen.

Step 1: Download Ollama. Check here for instructions

Step 2: Choose a model from the list and download it using ollama pull <model_name>

# Check the list of models available on your local machine
from openai import OpenAI

client = OpenAI(
    api_key='ollama',
    base_url="http://localhost:11434/v1"
)

# List all available models
models = client.models.list()
print(models.to_json())

Usage with Open-Source Models (Ollama models)

from rag_evaluation.evaluator import evaluate_response

# Define the inputs
query = "Which large language model is currently the largest and most capable?"

response_text = """The largest and most capable LLMs are the generative pretrained transformers (GPTs). These models are 
                designed to handle complex language tasks, and their vast number of parameters gives them the ability to 
                understand and generate human-like text."""
                 
document = """A large language model (LLM) is a type of machine learning model designed for natural language processing 
            tasks such as language generation. LLMs are language models with many parameters, and are trained with 
            self-supervised learning on a vast amount of text. The largest and most capable LLMs are 
            generative pretrained transformers (GPTs). Modern models can be fine-tuned for specific tasks or guided 
            by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies 
            inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained in."""

# Llama usage (ollama pull llama3.2:1b to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='llama3.2:1b'
)
print(report)

# Mistral usage (ollama pull mistral to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='mistral:latest',
    metric_weights=[0.1, 0., 0.9, 0., 0.] # optional metric_weights [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
)
print(report)

# Qwen usage (ollama pull qwen to download from terminal)
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="ollama",
    model_name='qwen:latest',
)
print(report)

For API-based Models (GPT and Gemini)

# Define the inputs (same as above)

# OpenAI usage 
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="openai",
    model_name='gpt-4.1',
)
print(report)

# Gemini usage 
report = evaluate_response(
    query=query,
    response=response_text,
    document=document,
    model_type="gemini",
    model_name='gemini-2.5-flash',
    metric_weights=[0.1, 0.4, 0.5, 0., 0.] # optional metric_weights [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
)
print(report)

Customizing Metric Weights

By default, RAG Evaluation balances the five core metrics as follows:

Metric            Default Weight
Query Relevance   0.25
Factual Accuracy  0.25
Coverage          0.25
Coherence         0.125
Fluency           0.125

These defaults reflect our view that modern LLMs (GPT, Llama, Gemini, etc.) already excel at coherence and fluency, so we place greater emphasis on the metrics that most impact RAG system accuracy.

If you’d like to emphasize certain aspects of your RAG system’s output (say, you care twice as much about factual accuracy as about coherence), you can supply your own weights via the metric_weights parameter in evaluate_response. Your custom weights must:

  1. Be a list of five floats.
  2. Sum to 1.0 (ensuring they form a valid weighted average).
  3. Follow this order:
    [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
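
A quick way to sanity-check custom weights against these three rules before passing them in (an illustrative helper, not part of the package API):

```python
import math

def validate_metric_weights(weights):
    """Check that weights are five numbers summing to 1.0.

    Order: [Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency]
    """
    if len(weights) != 5:
        raise ValueError(f"Expected 5 weights, got {len(weights)}")
    if not all(isinstance(w, (int, float)) for w in weights):
        raise TypeError("All weights must be numbers")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError(f"Weights must sum to 1.0, got {sum(weights)}")
    return [float(w) for w in weights]
```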

Example

from rag_evaluation.evaluator import evaluate_response

# Define the inputs
query = "..."
response = "..."
document = "..."

# Emphasize factual accuracy and coverage
custom_weights = [0.1, 0.4, 0.4, 0.05, 0.05]

report = evaluate_response(
    query=query,
    response=response,
    document=document,
    model_type="openai",
    model_name="gpt-4.1",
    metric_weights=custom_weights
)

print(report)

This simple mechanism lets users tailor the Overall Accuracy score to whatever aspects matter most in their evaluation scenario.

Output

The evaluate_response function returns a pandas DataFrame with:

  • Metric Names: Query Relevance, Factual Accuracy, Coverage, Coherence, Fluency.
  • Normalized Scores: A 0–1 score for each metric.
  • Percentage Scores: The normalized score expressed as a percentage.
  • Overall Accuracy: A weighted average score across all metrics.
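
Conceptually, the Overall Accuracy is the weighted average of the normalized per-metric scores. A minimal sketch, using the default weights and invented example scores:

```python
metrics = ["Query Relevance", "Factual Accuracy", "Coverage", "Coherence", "Fluency"]
weights = [0.25, 0.25, 0.25, 0.125, 0.125]   # default weights
scores  = [0.9, 0.8, 0.7, 0.95, 0.9]         # example normalized scores (0-1)

# Weighted average across all metrics
overall = sum(w * s for w, s in zip(weights, scores))
print(f"Overall Accuracy: {overall:.5f} ({overall:.1%})")  # 0.83125 (83.1%)
```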

Need help?

  • Open an issue or pull request on GitHub
  • For more examples of how to use the package, see the example notebook
