
Metric to measure RAG responses with inline citations


Trust Eval

Welcome to Trust Eval! 🌟

A comprehensive tool for evaluating the trustworthiness of inline-cited outputs generated by large language models (LLMs) within the Retrieval-Augmented Generation (RAG) framework. Our suite of metrics measures correctness, citation quality, and groundedness.

This is the official implementation of the metrics introduced in the paper "Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse" (accepted at ICLR '25).

Installation 🛠️

Prerequisites

  • OS: Linux
  • Python: Versions 3.10 – 3.12 (preferably 3.10.13)
  • GPU: Compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100)

Steps

  1. Set up a Python environment

    conda create -n trust_eval python=3.10.13
    conda activate trust_eval
    
  2. Install dependencies

    pip install trust_eval
    

    Note: vLLM will be installed with CUDA 12.1. Please ensure your CUDA setup is compatible.

  3. Set up NLTK

    import nltk
    nltk.download('punkt_tab')
    

Quickstart 🔥

Set up

Download eval_data from the Trust-Align Hugging Face page and place it at the same level as the prompts folder. To use the default path configurations, do not rename the folders. If you do rename them, you will need to specify your own paths.

quickstart/
├── eval_data/
├── prompts/

Quick look at the data

Here, we are working with ASQA, whose questions are long-form factoid QA. Each sample has 3 fields: question, answers, docs. Below is one example from the dataset:

[ ...
    {   // The question asked.
        "question": "Who has the highest goals in world football?",

        // A list containing all correct (short) answers to the question, represented as arrays where each element contains variations of the answer. 
        "answers": [
            ["Daei", "Ali Daei"],                // Variations for Ali Daei
            ["Bican", "Josef Bican"],            // Variations for Josef Bican
            ["Sinclair", "Christine Sinclair"]   // Variations for Christine Sinclair
        ],

        // A list of 100 dictionaries where each dictionary contains one document.
        "docs": [
            {   
                // The title of the document being referenced.
                "title": "Argentina\u2013Brazil football rivalry",

                // A snippet of text from the document.
                "text": "\"Football Player of the Century\", ...",

                // A binary list where each element indicates if the respective answer was found in the document (1 for found, 0 for not found).
                "answers_found": [0,0,0],

                // A recall score calculated as the percentage of correct answers that the document entails.
                "rec_score": 0.0
            },
        ]
    },
...
]
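The per-document fields answers_found and rec_score can be illustrated with a minimal sketch. It assumes a case-insensitive substring match and a 0 to 100 scale; the dataset itself may use a different matching rule (the comment above speaks of entailment), and the helper names and the document text are hypothetical.

```python
def answers_found(doc_text, answers):
    """Return one 0/1 flag per answer group.

    A group counts as found if any of its variations occurs in the text.
    Assumption: plain case-insensitive substring matching.
    """
    text = doc_text.lower()
    return [int(any(v.lower() in text for v in group)) for group in answers]


def rec_score(found):
    """Percentage of answer groups found in the document (assumed 0-100 scale)."""
    return 100.0 * sum(found) / len(found) if found else 0.0


answers = [
    ["Daei", "Ali Daei"],
    ["Bican", "Josef Bican"],
    ["Sinclair", "Christine Sinclair"],
]
# Hypothetical document text, for illustration only.
doc_text = "Ali Daei and Josef Bican are both mentioned in this passage."

found = answers_found(doc_text, answers)
print(found, rec_score(found))
```

A document that mentions none of the variations yields flags of all zeros and a score of 0.0, matching the example entry above.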
    

Please refer to the datasets page for examples of ELI5 and QAMPARI samples.

Configuring yaml files

For generator-related configurations, three fields are mandatory: data_type, model and max_length. data_type determines which benchmark dataset to evaluate on, model determines which model to evaluate, and max_length is the maximum context length of the model. We will be using Qwen2.5-3B-Instruct in this tutorial, but you can replace it with the path to your own model checkpoints.

data_type: "asqa"
model: Qwen/Qwen2.5-3B-Instruct
max_length: 8192

For evaluation-related configurations, only data_type is mandatory.

data_type: "asqa"
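Before launching a long run, it can help to confirm that the mandatory fields are present. The following is a rough sketch on plain dictionaries; missing_fields and the two tuples are hypothetical helpers, not part of the trust_eval API, and the library may perform its own validation.

```python
# Mandatory fields as described above (names taken from the YAML examples).
GENERATOR_REQUIRED = ("data_type", "model", "max_length")
EVAL_REQUIRED = ("data_type",)


def missing_fields(config, required):
    """Return the required keys that are absent or empty in the config dict."""
    return [key for key in required if config.get(key) in (None, "")]


generator_config = {
    "data_type": "asqa",
    "model": "Qwen/Qwen2.5-3B-Instruct",
    "max_length": 8192,
}
eval_config = {"data_type": "asqa"}

print(missing_fields(generator_config, GENERATOR_REQUIRED))  # → []
print(missing_fields(eval_config, EVAL_REQUIRED))            # → []
```

The same check can be applied to the dictionary obtained after loading each YAML file with whatever loader you already use.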

Your directory should now look like this:

quickstart/
├── eval_data/
├── prompts/
├── generator_config.yaml
├── eval_config.yaml

Running evals

Now define your main script:

Generating Responses

from config import EvaluationConfig, ResponseGeneratorConfig
from evaluator import Evaluator
from logging_config import logger
from response_generator import ResponseGenerator

# Configure the response generator
generator_config = ResponseGeneratorConfig.from_yaml(yaml_path="generator_config.yaml")

# Generate and save responses
generator = ResponseGenerator(generator_config)
generator.generate_responses()
generator.save_responses()

Evaluating Responses

# Configure the evaluator
evaluation_config = EvaluationConfig.from_yaml(yaml_path="eval_config.yaml")

# Compute and save evaluation metrics
evaluator = Evaluator(evaluation_config)
evaluator.compute_metrics()
evaluator.save_results()

Your directory should look like this:

quickstart/
├── eval_data/
├── prompts/
├── example_usage.py
├── generator_config.yaml
├── eval_config.yaml

Then run the script:

CUDA_VISIBLE_DEVICES=0,1 python example_usage.py

Note: Define the GPUs you wish to run on in CUDA_VISIBLE_DEVICES. For reference, we are able to run models of up to 7B parameters on two A40s.

Sample output:

{ // refusal response: "I apologize, but I couldn't find an answer..."
    
    // Basic statistics
    "num_samples": 948,
    "answered_ratio": 50.0, // Ratio of (# answered qns / total # qns)
    "answered_num": 5, // # of qns where response is not refusal response
    "answerable_num": 7, // # of qns that are ground truth answerable, given the documents
    "overlapped_num": 5, // # of qns that are both answered and answerable
    "regular_length": 46.6, // Average length of all responses
    "answered_length": 28.0, // Average length of non-refusal responses

    // Refusal groundedness metrics

    // # qns where (model refused to respond & is ground truth unanswerable) / # qns that are ground truth unanswerable
    "reject_rec": 100.0,

    // # qns where (model refused to respond & is ground truth unanswerable) / # qns where model refused to respond
    "reject_prec": 60.0,

    // F1 of reject_rec and reject_prec
    "reject_f1": 75.0,

    // # qns where (model responded & is ground truth answerable) / # qns that are ground truth answerable
    "answerable_rec": 71.42857142857143,

    // # qns where (model responded & is ground truth answerable) / # qns where model responded
    "answerable_prec": 100.0,

    // F1 of answerable_rec and answerable_prec
    "answerable_f1": 83.33333333333333,

    // Avg of reject_rec and answerable_rec
    "macro_avg": 85.71428571428572,

    // Avg of reject_f1 and answerable_f1
    "macro_f1": 79.16666666666666,

    // Response correctness metrics

    // Regardless of response type (refusal or answered), check if ground truth claim is in the response. 
    "regular_str_em": 41.666666666666664,

    // Only for qns with answered responses, check if ground truth claim is in the response. 
    "answered_str_em": 66.66666666666666,

    // Calculate EM for all qns that are answered and answerable, avg by # of answered questions (EM_alpha)
    "calib_answered_str_em": 100.0,

    // Calculate EM for all qns that are answered and answerable, avg by # of answerable questions (EM_beta)
    "calib_answerable_str_em": 71.42857142857143,

    // F1 of calib_answered_str_em and calib_answerable_str_em
    "calib_str_em_f1": 83.33333333333333,

    // EM score of qns that are answered and ground truth unanswerable, indicating use of parametric knowledge
    "parametric_str_em": 0.0,

    // Citation quality metrics

    // (Avg across all qns) Does the set of citations support statement s_i? 
    "regular_citation_rec": 28.333333333333332,

    // (Avg across all qns) Any redundant citations? (1) Does citation c_i,j fully support statement s_i? (2) Is the set of citations without c_i,j insufficient to support statement s_i? 
    "regular_citation_prec": 35.0,

    // F1 of regular_citation_rec and regular_citation_prec
    "regular_citation_f1": 31.315789473684212,

    // (Avg across answered qns only)
    "answered_citation_rec": 50.0,

    // (Avg across answered qns only)
    "answered_citation_prec": 60.0,

    // F1 of answered_citation_rec and answered_citation_prec
    "answered_citation_f1": 54.54545454545455,

    // Avg (macro_f1, calib_str_em_f1, answered_citation_f1)
    "trust_score": 72.34848484848486
}
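The refusal groundedness metrics and the trust score in the sample output can be reproduced from the four counts alone. The sketch below follows the formulas in the comments above; the function names are hypothetical, and it assumes 10 questions in total, which is what an answered_ratio of 50.0 with an answered_num of 5 implies (the num_samples value shown does not match those counts).

```python
def pct(numerator, denominator):
    """Percentage, returning 0.0 when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else 0.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def refusal_metrics(total, answered, answerable, overlapped):
    """Refusal groundedness metrics from question counts.

    overlapped = questions that are both answered and answerable.
    """
    refused = total - answered
    unanswerable = total - answerable
    # Unanswerable questions minus those the model answered anyway.
    refused_unanswerable = unanswerable - (answered - overlapped)
    m = {
        "reject_rec": pct(refused_unanswerable, unanswerable),
        "reject_prec": pct(refused_unanswerable, refused),
        "answerable_rec": pct(overlapped, answerable),
        "answerable_prec": pct(overlapped, answered),
    }
    m["reject_f1"] = f1(m["reject_prec"], m["reject_rec"])
    m["answerable_f1"] = f1(m["answerable_prec"], m["answerable_rec"])
    m["macro_avg"] = (m["reject_rec"] + m["answerable_rec"]) / 2
    m["macro_f1"] = (m["reject_f1"] + m["answerable_f1"]) / 2
    return m


def trust_score(macro_f1, calib_em_f1, citation_f1):
    """Average of the three component F1 scores."""
    return (macro_f1 + calib_em_f1 + citation_f1) / 3


metrics = refusal_metrics(total=10, answered=5, answerable=7, overlapped=5)
score = trust_score(metrics["macro_f1"], 83.33333333333333, 54.54545454545455)
print(round(metrics["macro_f1"], 2), round(score, 2))  # → 79.17 72.35
```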

Please refer to metrics for explanations of outputs when evaluating with ELI5 or QAMPARI.
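For the ASQA correctness metrics in the sample output, the two calibrated exact-match scores differ only in their denominator. The sketch below takes per-question EM scores as given and applies the averaging described in the comments; the function name, the input layout, and the sample values are hypothetical.

```python
def calibrated_em(samples):
    """Compute EM_alpha, EM_beta, and their F1.

    samples: list of dicts with keys
      "em"         per-question exact-match score on a 0-100 scale
      "answered"   True if the response is not a refusal
      "answerable" True if the documents support an answer
    """
    # Only questions that are both answered and answerable contribute EM.
    overlap_em = sum(s["em"] for s in samples if s["answered"] and s["answerable"])
    n_answered = sum(1 for s in samples if s["answered"])
    n_answerable = sum(1 for s in samples if s["answerable"])

    em_alpha = overlap_em / n_answered if n_answered else 0.0      # avg over answered
    em_beta = overlap_em / n_answerable if n_answerable else 0.0   # avg over answerable
    total = em_alpha + em_beta
    em_f1 = 2 * em_alpha * em_beta / total if total else 0.0
    return em_alpha, em_beta, em_f1


# Hypothetical per-question results, for illustration only.
samples = [
    {"em": 100.0, "answered": True, "answerable": True},
    {"em": 50.0, "answered": True, "answerable": True},
    {"em": 0.0, "answered": False, "answerable": True},
    {"em": 0.0, "answered": False, "answerable": False},
]
em_alpha, em_beta, em_f1 = calibrated_em(samples)
print(em_alpha, em_beta, em_f1)  # → 75.0 50.0 60.0
```

A model that answers only when it is right scores high on EM_alpha but is penalised on EM_beta for the answerable questions it refused.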

The end

Congratulations! You have reached the end of the quickstart tutorial and you are now ready to benchmark your own RAG application (running the evaluations with custom data instead of benchmark data) or reproduce our experimental setup! 🥳

Contact 📬

For questions or feedback, reach out to Shang Hong (simshanghong@gmail.com).

Citation 📝

If you use this software in your research, please cite the Trust-Eval paper as below.

@misc{song2024measuringenhancingtrustworthinessllms,
      title={Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse}, 
      author={Maojia Song and Shang Hong Sim and Rishabh Bhardwaj and Hai Leong Chieu and Navonil Majumder and Soujanya Poria},
      year={2024},
      eprint={2409.11242},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.11242}, 
}
