GroUSE
Evaluate Grounded Question Answering (GQA) models and GQA evaluator models. We implement the evaluation methods described in GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering.
Install
pip install -e .
Command Line Usage
Evaluation of the Grounded Question Answering task
Build your dataset as a JSONL file in which each line is an object with the following format:
{
    "references": [...],    # List of references
    "input": "",            # Query
    "actual_output": "",    # Predicted answer generated by the model we want to evaluate
    "expected_output": ""   # Ground truth answer to the input
}
You can also check the example in example_data/grounded_qa.jsonl.
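As a minimal sketch of producing such a file, the snippet below writes one JSON object per line using only the standard library. The helper name `write_dataset`, the `REQUIRED_KEYS` tuple, and the temporary output path are illustrative and not part of GroUSE. Note that the `#` annotations in the format above are explanatory only: JSON has no comment syntax, so they must not appear in the actual file.

```python
import json
import tempfile
from pathlib import Path

# The four fields listed in the per-line format above.
REQUIRED_KEYS = ("references", "input", "actual_output", "expected_output")


def write_dataset(rows, path):
    """Write one JSON object per line, checking that each row has the expected keys."""
    with open(path, "w", encoding="utf-8") as f:
        for i, row in enumerate(rows):
            missing = [key for key in REQUIRED_KEYS if key not in row]
            if missing:
                raise ValueError(f"row {i} is missing keys: {missing}")
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


rows = [
    {
        "references": ["Paris is the capital of France."],
        "input": "What is the capital of France?",
        "actual_output": "The capital of France is Marseille.",
        "expected_output": "The capital of France is Paris.",
    }
]

# Hypothetical location; any writable path works.
dataset_path = Path(tempfile.mkdtemp()) / "grounded_qa.jsonl"
write_dataset(rows, dataset_path)
print(dataset_path.read_text(encoding="utf-8"))
```

The resulting path can then be passed as the dataset argument of the command below.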
Then, run this command:
grouse evaluate {PATH_TO_DATASET_WITH_GENERATIONS} outputs/gpt-4o
We recommend using GPT-4 as an evaluator model as we optimised prompts for this model, but you can change the model and the prompts with the --model-name and --prompts-path options.
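If you launch evaluations from a script, the command line can be assembled programmatically. The sketch below only builds and prints the argument list. The helper `build_evaluate_command` is hypothetical, `my-model` is a placeholder value, and placing the options after the positional arguments is an assumption about the CLI rather than something this README specifies.

```python
import shlex


def build_evaluate_command(dataset_path, output_dir, model_name=None, prompts_path=None):
    """Assemble the `grouse evaluate` argument list with the optional overrides."""
    cmd = ["grouse", "evaluate", str(dataset_path), str(output_dir)]
    if model_name is not None:
        cmd += ["--model-name", model_name]
    if prompts_path is not None:
        cmd += ["--prompts-path", str(prompts_path)]
    return cmd


cmd = build_evaluate_command(
    "example_data/grounded_qa.jsonl",
    "outputs/my-model",
    model_name="my-model",  # placeholder, not a recommended model
)
print(shlex.join(cmd))

# To actually run it (requires grouse to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```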
Unit Testing of Evaluators with GroUSE
Meta-evaluation consists of evaluating GQA evaluators with the GroUSE unit tests.
grouse meta-evaluate gpt-4o meta-outputs/gpt-4o
Plot Matrices of Unit Test Success
You can plot the unit test results as matrices:
grouse plot meta-outputs/gpt-4o
The resulting matrices look like this: (figure: matrices of unit test results)
Python Usage
from grouse import EvaluationSample, GroundedQAEvaluator

sample = EvaluationSample(
    input="What is the capital of France?",
    # Replace this with the actual output from your LLM application
    actual_output="The capital of France is Marseille.",
    expected_output="The capital of France is Paris.",
    references=["Paris is the capital of France."],
)

evaluator = GroundedQAEvaluator()
evaluator.evaluate([sample])
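To evaluate a whole JSONL dataset from Python rather than a single sample, one possible approach is sketched below. The helpers `load_rows` and `evaluate_file` are hypothetical. Only `EvaluationSample`, `GroundedQAEvaluator`, and `evaluate` come from the example above, and the sketch assumes the JSONL keys map one-to-one onto the `EvaluationSample` arguments, which share the same names.

```python
import json


def load_rows(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def evaluate_file(path):
    """Build one EvaluationSample per JSONL row and evaluate them together."""
    # Imported lazily so load_rows works even when grouse is not installed.
    from grouse import EvaluationSample, GroundedQAEvaluator

    samples = [
        EvaluationSample(
            input=row["input"],
            actual_output=row["actual_output"],
            expected_output=row["expected_output"],
            references=row["references"],
        )
        for row in load_rows(path)
    ]
    # Returns whatever evaluate() returns; its shape is not described here.
    return GroundedQAEvaluator().evaluate(samples)
```

Calling `evaluate_file("example_data/grounded_qa.jsonl")` would then run the evaluator on every line of the example dataset, provided the evaluator model is configured.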
Citation
@misc{muller2024grouse,
    title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering},
    author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
    year={2024},
    archivePrefix={arXiv},
    primaryClass={cs.IR},
}