Library for RAG evaluation
AI DIAL RAG EVAL
Overview
A library for evaluating RAG (Retrieval-Augmented Generation) systems. It calculates both retrieval and generation metrics.
Usage
Install the library using pip:
pip install aidial-rag-eval
spaCy language model
The generation metrics require the English language model for spaCy. Download it after installation:
python -m spacy download en_core_web_sm
Alternatively, you can install the model directly via URL:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl
Or as a Poetry dependency:
en-core-web-sm = {url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl"}
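Whichever installation route is used, it can be handy to confirm that the model is importable before running the generation metrics. The sketch below relies only on the standard library and on the fact that spaCy models are installed as regular Python packages; the helper name `is_package_installed` is illustrative and not part of aidial-rag-eval.

```python
import importlib.util


def is_package_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    # find_spec returns None when no top-level package of that name is available.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    model = "en_core_web_sm"
    if is_package_installed(model):
        print(f"{model} is installed")
    else:
        print(f"{model} is missing; run: python -m spacy download {model}")
```

This only checks that the package can be located; it does not load the model or validate its version.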
Example
The following example shows how to get retrieval metrics along with inference metrics, such as answer inference based on the context.
import pandas as pd
from langchain_openai import AzureChatOpenAI
from aidial_rag_eval import create_rag_eval_metrics_report
from aidial_rag_eval.metric_binds import (
    CONTEXT_TO_ANSWER_INFERENCE,
    ANSWER_TO_GROUND_TRUTH_INFERENCE,
    GROUND_TRUTH_TO_ANSWER_INFERENCE,
    ANSWER_TO_FACTS_INFERENCE,
    FACTS_TO_ANSWER_INFERENCE,
)

llm = AzureChatOpenAI(model="gemini-2.5-flash-lite")

df_ground_truth = pd.DataFrame([
    {
        "question": "What is the diameter of the Earth and the name of the biggest ocean?",
        "documents": ["earth.pdf"],
        "facts": [
            "The diameter of the Earth is approximately 12,742 kilometers.",
            "The biggest ocean on Earth is the Pacific Ocean.",
        ],
        "answer": "The Earth's diameter measures about 12,742 kilometers, and the Pacific Ocean is the largest ocean on our planet.",
    },
])

df_answer = pd.DataFrame([
    {
        "question": "What is the diameter of the Earth and the name of the biggest ocean?",
        "documents": ["earth.pdf"],
        "context": [
            "The Earth, our home planet, is the third planet from the sun. It's the only planet known to have an atmosphere containing free oxygen and oceans of liquid water on its surface. The diameter of the Earth is approximately 12,742 kilometers.",
            "The Pacific Ocean is the largest and deepest of Earth's oceanic divisions, extending from the Arctic Ocean in the north to the Southern Ocean in the south.",
        ],
        "answer": "The Earth has a diameter of approximately 12,742 kilometers.",
    },
])

df_metrics = create_rag_eval_metrics_report(
    df_ground_truth,
    df_answer,
    llm=llm,
    metric_binds=[
        CONTEXT_TO_ANSWER_INFERENCE,
        ANSWER_TO_GROUND_TRUTH_INFERENCE,
        GROUND_TRUTH_TO_ANSWER_INFERENCE,
        ANSWER_TO_FACTS_INFERENCE,
        FACTS_TO_ANSWER_INFERENCE,
    ],
)

print(df_metrics[[
    "facts_ranks", "recall", "precision", "mrr", "f1",
    "ctx_ans_inference", "ans_gt_inference", "gt_ans_inference",
    "ans_fct_inference", "fct_ans_inference",
]])
The expected results are shown below (the printed facts_ranks column is omitted from the table):
| recall | precision | mrr | f1 | ctx_ans_inference | ans_gt_inference | gt_ans_inference | ans_fct_inference | fct_ans_inference |
|---|---|---|---|---|---|---|---|---|
| 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 0.5 | 1.0 | 0.5 | 1.0 |
In this table:
- "recall" of 0.5 indicates that only 1 out of 2 ground truth facts was found in the context.
- "precision" of 0.5 reflects that just 1 context chunk out of 2 includes any ground truth facts.
- The prefix of each inference metric names the premise and the hypothesis in the format premise_hypothesis_inference:
  - "ctx" refers to 'context'
  - "ans" refers to 'answer'
  - "gt" refers to 'ground truth answer'
  - "fct" refers to 'facts'
- "ctx_ans_inference" of 1.0 means our answer can be fully derived from the context.
- "ans_gt_inference" of 0.5 means the ground truth answer is only partially entailed by our answer.
- "gt_ans_inference" of 1.0 means our answer can be fully derived from the ground truth answer.
- "ans_fct_inference" of 0.5 means only half of the ground truth facts are entailed by our answer (the Pacific Ocean fact is missing).
- "fct_ans_inference" of 1.0 means our answer can be fully derived from the ground truth facts.
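To make the arithmetic behind the retrieval numbers above concrete, here is a minimal sketch that reproduces recall, precision and F1 for the example. It assumes the definitions given in the list (recall as the share of ground truth facts found in the context, precision as the share of context chunks containing any fact) and the usual harmonic-mean F1. The function names and the input shape are hypothetical and are not the library's API; MRR is left out because its exact definition is not stated here. The second helper simply decodes the premise_hypothesis_inference naming scheme.

```python
# Illustrative only: these helpers are NOT part of aidial-rag-eval.

ABBREVIATIONS = {
    "ctx": "context",
    "ans": "answer",
    "gt": "ground truth answer",
    "fct": "facts",
}


def retrieval_metrics(facts_per_chunk: list[set[int]], total_facts: int) -> dict[str, float]:
    """Compute recall, precision and F1 from the fact indices found in each chunk."""
    found = set().union(*facts_per_chunk)
    chunks_with_facts = sum(1 for facts in facts_per_chunk if facts)
    recall = len(found) / total_facts if total_facts else 0.0
    precision = chunks_with_facts / len(facts_per_chunk) if facts_per_chunk else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


def split_inference_metric(name: str) -> tuple[str, str]:
    """Decode 'premise_hypothesis_inference' into (premise, hypothesis)."""
    premise, hypothesis, suffix = name.split("_")
    if suffix != "inference":
        raise ValueError(f"not an inference metric: {name}")
    return ABBREVIATIONS[premise], ABBREVIATIONS[hypothesis]


if __name__ == "__main__":
    # Chunk 0 contains fact 0 (the diameter); chunk 1 contains no ground truth fact.
    print(retrieval_metrics([{0}, set()], total_facts=2))
    # → {'recall': 0.5, 'precision': 0.5, 'f1': 0.5}
    print(split_inference_metric("ctx_ans_inference"))
    # → ('context', 'answer')
```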
Recommended models
The algorithm is token-intensive. Considering the balance between quality and price, the following models are recommended:
- gemini-3.1-flash-lite
- gemini-2.5-flash-lite
- gpt-5-mini
- gpt-5-nano
- gpt-5.4-mini
Developer environment
This project uses Python>=3.11 and Poetry>=2.2.1 as a dependency manager.
Check out Poetry's documentation on how to install it on your system before proceeding.
To install requirements:
poetry install
This will install all requirements for running the package, linting, formatting and tests.
Lint
Run the linting before committing:
make lint
To auto-fix formatting issues run:
make format
Test
Run unit tests locally for available Python versions:
make test
Run unit tests for a specific Python version:
make test PYTHON=3.11
The generation evaluation requires access to an LLM. The generation evaluation tests (located in the tests/llm_tests directory) use cached LLM responses by default. To run the tests with real LLM responses, add the --llm-mode=real argument to the test command:
make test PYTHON=3.11 ARGS="--llm-mode=real"
The test run with real LLM responses requires the following environment variables to be set:
| Variable | Description |
|---|---|
| DIAL_URL | The URL of the DIAL server. |
| DIAL_API_KEY | The API key for the DIAL server. |
Copy .env.example to .env and customize it for your environment.
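Before starting a run with real LLM responses, a small pre-flight check can catch a missing variable early. The sketch below is an assumption-laden convenience script, not something shipped with the project: the helper name `missing_env_vars` is hypothetical, and it only inspects variables already present in the process environment (it does not read the .env file itself).

```python
import os
from collections.abc import Mapping

# Variable names taken from the table above.
REQUIRED_VARS = ("DIAL_URL", "DIAL_API_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set")
```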
Clean
To remove the virtual environment and build artifacts run:
make clean
Build
To build the package run:
make build
File details
Details for the file aidial_rag_eval-0.6.0.dev23.tar.gz.
File metadata
- Download URL: aidial_rag_eval-0.6.0.dev23.tar.gz
- Upload date:
- Size: 30.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.2.1 CPython/3.11.15 Linux/6.17.0-1013-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c860e6969f5f0530a746f2d55997af3eac8d3feeeb83f04260ad1d68a1463680 |
| MD5 | b635057c3286405ecd09cea56ce8e12d |
| BLAKE2b-256 | 36d0b6211e42d6ea00bf016cbc4fda1d677c6281adb46adf9c1b10e0b923bdab |
File details
Details for the file aidial_rag_eval-0.6.0.dev23-py3-none-any.whl.
File metadata
- Download URL: aidial_rag_eval-0.6.0.dev23-py3-none-any.whl
- Upload date:
- Size: 45.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.2.1 CPython/3.11.15 Linux/6.17.0-1013-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2309656e9191c26d2f699f42ee61fee687eff7b9a34b0c3c08bab6625b1040f8 |
| MD5 | 4d054ad80a991b57a80b3492dd32ddf0 |
| BLAKE2b-256 | 5c357d3d17c85ee71cbcf15c81db6991c5f767ffa6e5962fedb02b2c7be78b3a |
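A downloaded file can be checked against the SHA256 digests listed above with the standard library alone. The following is a minimal sketch; the function names are illustrative, and the local file path is an assumption about where the wheel was saved.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path: adjust to wherever the wheel was downloaded.
    wheel = Path("aidial_rag_eval-0.6.0.dev23-py3-none-any.whl")
    expected = "2309656e9191c26d2f699f42ee61fee687eff7b9a34b0c3c08bab6625b1040f8"
    if wheel.exists():
        print(matches_published_hash(wheel, expected))
```

pip can perform the same verification at install time when hashes are pinned in a requirements file and `--require-hashes` is used.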