Library for RAG evaluation

AI DIAL RAG EVAL

Overview

A library for evaluating RAG (Retrieval-Augmented Generation) systems. It calculates both retrieval metrics and generation metrics.

Usage

Install the library using pip:

pip install aidial-rag-eval

Example

The following example shows how to get retrieval metrics along with answer inference metrics based on the context.

import pandas as pd
from langchain_openai import AzureChatOpenAI
from aidial_rag_eval import create_rag_eval_metrics_report
from aidial_rag_eval.metric_binds import (
    CONTEXT_TO_ANSWER_INFERENCE,
    ANSWER_TO_GROUND_TRUTH_INFERENCE,
    GROUND_TRUTH_TO_ANSWER_INFERENCE,
)

llm = AzureChatOpenAI(model="gemini-2.5-flash-lite")

df_ground_truth = pd.DataFrame([
    {
        "question": "What is the diameter of the Earth and the name of the biggest ocean?",
        "documents": ["earth.pdf"],
        "facts": ["The diameter of the Earth is approximately 12,742 kilometers.", "The biggest ocean on Earth is the Pacific Ocean."],
        "answer": "The Earth's diameter measures about 12,742 kilometers, and the Pacific Ocean is the largest ocean on our planet."
    },])
df_answer = pd.DataFrame([
    {
        "question": "What is the diameter of the Earth and the name of the biggest ocean?",
        "documents": ["earth.pdf"],
        "context":  [
            "The Earth, our home planet, is the third planet from the sun. It's the only planet known to have an atmosphere containing free oxygen and oceans of liquid water on its surface. The diameter of the Earth is approximately 12,742 kilometers.",
            "The Pacific Ocean is the largest and deepest of Earth's oceanic divisions, extending from the Arctic Ocean in the north to the Southern Ocean in the south."
        ],
        "answer": "The Earth has a diameter of approximately 12,742 kilometers."
    },
])

df_metrics = create_rag_eval_metrics_report(
    df_ground_truth,
    df_answer,
    llm=llm,
    metric_binds=[
        CONTEXT_TO_ANSWER_INFERENCE,
        ANSWER_TO_GROUND_TRUTH_INFERENCE,
        GROUND_TRUTH_TO_ANSWER_INFERENCE,
    ],
)
print(df_metrics[["facts_ranks", "recall", "precision", "mrr", "f1", "ctx_ans_inference", "ans_gt_inference", "gt_ans_inference"]])

The expected results are shown below (the facts_ranks column is omitted from this table):

recall	precision	mrr	f1	ctx_ans_inference	ans_gt_inference	gt_ans_inference
0.5	0.5	0.5	0.5	1.0	0.5	1.0

In this table:

"recall" of 0.5 indicates that only 1 of the 2 ground truth facts was found in the context.
"precision" of 0.5 reflects that only 1 of the 2 context chunks includes any ground truth fact.
The prefix of each inference metric names the premise and the hypothesis, in the format premise_hypothesis_inference:
- "ctx" refers to 'context'
- "ans" refers to 'answer'
- "gt" refers to 'ground truth answer'
"ctx_ans_inference" and "gt_ans_inference" values of 1.0 mean our answer can be derived directly from the context and the ground truth answer, respectively.
"ans_gt_inference" of 0.5 denotes that the ground truth answer can only be partially inferred from our answer.
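
To make the retrieval numbers above concrete, here is a minimal sketch that reproduces them by hand for the example data. It is not the library's implementation: it assumes a fact counts as found only when it appears verbatim in a context chunk, it assumes MRR is averaged over all ground truth facts, and the function name retrieval_metrics is hypothetical.

```python
def retrieval_metrics(facts, context):
    """Toy retrieval metrics (illustrative assumption, not the library's code).

    A fact is treated as found if it occurs verbatim in a context chunk.
    Both `facts` and `context` are assumed to be non-empty lists of strings.
    """
    # 0-based index of the first chunk containing each fact, or None if absent.
    ranks = [
        next((i for i, chunk in enumerate(context) if fact in chunk), None)
        for fact in facts
    ]
    found = [r for r in ranks if r is not None]

    # Recall: share of ground truth facts found anywhere in the context.
    recall = len(found) / len(facts)
    # Precision: share of context chunks containing at least one fact.
    relevant_chunks = sum(any(fact in chunk for fact in facts) for chunk in context)
    precision = relevant_chunks / len(context)
    # MRR: mean reciprocal rank over all facts; a missing fact contributes 0.
    mrr = sum(1 / (r + 1) for r in found) / len(facts)
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"recall": recall, "precision": precision, "mrr": mrr, "f1": f1}


facts = [
    "The diameter of the Earth is approximately 12,742 kilometers.",
    "The biggest ocean on Earth is the Pacific Ocean.",
]
context = [
    "The Earth, our home planet, is the third planet from the sun. It's the only planet known to have an atmosphere containing free oxygen and oceans of liquid water on its surface. The diameter of the Earth is approximately 12,742 kilometers.",
    "The Pacific Ocean is the largest and deepest of Earth's oceanic divisions, extending from the Arctic Ocean in the north to the Southern Ocean in the south.",
]
metrics = retrieval_metrics(facts, context)
print(metrics)
```

Under these assumptions only the first fact is found, and only in the first chunk, so all four values come out as 0.5, in line with the table.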

Recommended models

The algorithm is token-intensive. Considering the balance between quality and price, the following models are recommended:

gemini-2.5-flash-lite
gpt-5-mini
gemini-2.0-flash-lite
gpt-5-nano

Developer environment

This project requires Python>=3.11 and uses Poetry>=2.2.1 as its dependency manager.

Check out Poetry's documentation on how to install it on your system before proceeding.

To install requirements:

poetry install

This will install all requirements for running the package, linting, formatting and tests.

Lint

Run the linting before committing:

make lint

To auto-fix formatting issues run:

make format

Test

Run unit tests locally for all available Python versions:

make test

Run unit tests for a specific Python version:

make test PYTHON=3.11

Generation evaluation requires access to an LLM. The generation evaluation tests (located in the tests/llm_tests directory) use cached LLM responses by default. To run the tests with real LLM responses, add the --llm-mode=real argument to the test command:

make test PYTHON=3.11 ARGS="--llm-mode=real"

The test run with real LLM responses requires the following environment variables to be set:

Variable	Description
DIAL_URL	The URL of the DIAL server.
DIAL_API_KEY	The API key for the DIAL server.

Copy .env.example to .env and customize it for your environment.
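
Before starting a run with real LLM responses, it can help to confirm that both variables from the table above are visible to the process. The following is a minimal sketch of such a check; the helper name missing_dial_variables is hypothetical and not part of the project. It only inspects the process environment, so values kept in a .env file must already be exported or loaded by whatever tooling you use.

```python
import os

# Variable names taken from the table above.
REQUIRED_VARIABLES = ("DIAL_URL", "DIAL_API_KEY")


def missing_dial_variables(env=None):
    """Return the required variables that are unset or empty in `env`.

    `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


missing = missing_dial_variables()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("DIAL settings found.")
```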

Clean

To remove the virtual environment and build artifacts run:

make clean

Build

To build the package run:

make build

Release history

Version	Release date	Notes
0.6.0	May 15, 2026
0.6.0rc0	Mar 4, 2026	pre-release
0.6.0.dev23	May 14, 2026	pre-release
0.6.0.dev22	May 13, 2026	pre-release
0.6.0.dev21	May 6, 2026	pre-release
0.6.0.dev20	May 5, 2026	pre-release
0.6.0.dev19	Apr 28, 2026	pre-release
0.6.0.dev16	Apr 22, 2026	pre-release
0.6.0.dev15	Apr 21, 2026	pre-release
0.6.0.dev14	Apr 20, 2026	pre-release
0.5.0	Feb 13, 2026
0.5.0rc0	Jan 20, 2026	pre-release, this version
0.4.0	Jan 19, 2026
0.4.0rc0	Jan 15, 2026	pre-release
0.3.0	Jan 14, 2026
0.3.0rc0	Jun 10, 2025	pre-release
0.2.0	Jun 2, 2025
0.2.0rc0	May 27, 2025	pre-release
0.1.0	May 26, 2025
0.1.0rc0	May 23, 2025	pre-release

Download files

Source Distribution

aidial_rag_eval-0.5.0rc0.tar.gz (28.8 kB)

Uploaded Jan 20, 2026 Source

Built Distribution

aidial_rag_eval-0.5.0rc0-py3-none-any.whl (42.8 kB)

Uploaded Jan 20, 2026 Python 3

File details

Details for the file aidial_rag_eval-0.5.0rc0.tar.gz.

File metadata

Download URL: aidial_rag_eval-0.5.0rc0.tar.gz
Upload date: Jan 20, 2026
Size: 28.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.5 CPython/3.12.3 Linux/6.11.0-1018-azure

File hashes

Hashes for aidial_rag_eval-0.5.0rc0.tar.gz
Algorithm	Hash digest
SHA256	`5d395d72ba4eeb47edccbf5667fb55f84c5159030f4af65fb16e6bf4ac35516a`
MD5	`6b928c68cf5bad4ace3552b61599c2d8`
BLAKE2b-256	`f40019d47ed081161ff7a6b1fa1bc7a649927bf189ba0c1ae0cc7f5a071f3d5e`

File details

Details for the file aidial_rag_eval-0.5.0rc0-py3-none-any.whl.

File metadata

Download URL: aidial_rag_eval-0.5.0rc0-py3-none-any.whl
Upload date: Jan 20, 2026
Size: 42.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.5 CPython/3.12.3 Linux/6.11.0-1018-azure

File hashes

Hashes for aidial_rag_eval-0.5.0rc0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dba5229deaf7d6859e9cccf47e31fd4dcce086606475b584efc4568bec5c24b8`
MD5	`c1bf6f5b0453f915441efe0c9064bbaf`
BLAKE2b-256	`0c5c7e597d069558f6adf4f4df4df922f4087c166bd0b5b314d70b9296919177`
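
The SHA256 digests listed above can be checked against a downloaded file with Python's standard hashlib module. The sketch below is one way to do that; the helper names sha256_of and verify are hypothetical, and the expected digests are copied from the tables above.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the "File hashes" tables above.
EXPECTED_SHA256 = {
    "aidial_rag_eval-0.5.0rc0.tar.gz":
        "5d395d72ba4eeb47edccbf5667fb55f84c5159030f4af65fb16e6bf4ac35516a",
    "aidial_rag_eval-0.5.0rc0-py3-none-any.whl":
        "dba5229deaf7d6859e9cccf47e31fd4dcce086606475b584efc4568bec5c24b8",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's digest matches the one listed for its name."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected
```

For example, verify("aidial_rag_eval-0.5.0rc0.tar.gz") returns True only when a file with that name exists in the current directory and its contents hash to the listed digest. A file whose name is not in the table is reported as not verified.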

aidial-rag-eval 0.5.0rc0
