Evaluate Grounded Question Answering models and Grounded Question Answering evaluator models.

GroUSE

Evaluate Grounded Question Answering (GQA) models and GQA evaluator models. We implement the evaluation methods described in GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering.

Install
Command Line Usage
Python Usage
Links
Citation

Install

pip install grouse

Then set up your OpenAI credentials: copy the .env.dist file to .env, fill in your OpenAI API key and organization ID, and export the environment variables with export $(cat .env | xargs).
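If you call GroUSE from Python rather than from a shell, the variables can also be loaded inside the process. The following is a minimal sketch, not part of the grouse package: load_env_file is a hypothetical helper, and it assumes the .env file contains plain KEY=VALUE lines. It only affects the current Python process, so the shell export above is still needed for the command line tool.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a dotenv-style file into os.environ.

    Hypothetical helper for illustration; returns the variables it set.
    """
    loaded = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # Not a KEY=VALUE line, ignore it.
        key = key.strip()
        # Drop optional surrounding quotes around the value.
        value = value.strip().strip('"').strip("'")
        os.environ[key] = value
        loaded[key] = value
    return loaded
```

Call load_env_file() before creating an evaluator so that the credentials are visible to the process.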

Command Line Usage

Evaluation of the Grounded Question Answering task

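Build a dataset as a jsonl file in which each line is a JSON object with the following format (shown here across several lines and with comments for readability; in the file, each record sits on a single line without comments):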

{
    "references": ["", ...], // List of references
    "input": "", // Query
    "actual_output": "", // Predicted answer generated by the model we want to evaluate
    "expected_output": "" // Ground truth answer to the input
}

For an example, see example_data/grounded_qa.jsonl.
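A dataset in this format can be generated programmatically. The sketch below is illustrative only: write_jsonl and the file name my_dataset.jsonl are hypothetical, and the check that all four fields are present is an assumption made here, not a documented requirement of grouse.

```python
import json

# Field names taken from the format described above.
REQUIRED_FIELDS = {"references", "input", "actual_output", "expected_output"}


def write_jsonl(records, path):
    """Write one JSON object per line, checking the expected fields first."""
    with open(path, "w", encoding="utf-8") as f:
        for i, record in enumerate(records):
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
            # json.dumps emits a single line, as the jsonl format requires.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    records = [
        {
            "references": ["Paris is the capital of France."],
            "input": "What is the capital of France?",
            "actual_output": "The capital of France is Marseille.",
            "expected_output": "The capital of France is Paris.",
        }
    ]
    write_jsonl(records, "my_dataset.jsonl")  # hypothetical output path
```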

Then, run this command:

grouse evaluate {PATH_TO_DATASET_WITH_GENERATIONS} outputs/gpt-4o

We recommend using GPT-4 as the evaluator model because the prompts were optimised for it, but you can change the model and prompts using the optional arguments:

--evaluator_model_name: Name of the evaluator model. It can be any LiteLLM model. The default model is GPT-4.
--prompts_path: Path to the folder containing the prompts of the evaluator. By default, the prompts are those optimized for GPT-4.
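When the evaluation is launched from a script, the command and its optional arguments can be assembled as a list. This is a sketch under assumptions: build_evaluate_command is a hypothetical helper, the dataset, output, and prompt paths are placeholders, and it assumes the options are accepted after the positional arguments.

```python
import shlex


def build_evaluate_command(dataset_path, output_dir,
                           evaluator_model_name=None, prompts_path=None):
    """Assemble the argument list for `grouse evaluate` (illustrative helper)."""
    command = ["grouse", "evaluate", dataset_path, output_dir]
    # Only add an option when the caller overrides the default.
    if evaluator_model_name is not None:
        command += ["--evaluator_model_name", evaluator_model_name]
    if prompts_path is not None:
        command += ["--prompts_path", prompts_path]
    return command


command = build_evaluate_command(
    "my_dataset.jsonl",        # hypothetical dataset path
    "outputs/my-evaluator",    # hypothetical output folder
    evaluator_model_name="gpt-4o",
    prompts_path="my_prompts",  # hypothetical prompts folder
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once grouse is installed and the credentials are exported.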

Unit Testing of Evaluators with GroUSE

Meta-evaluation consists of evaluating GQA evaluators with the GroUSE unit tests.

grouse meta-evaluate gpt-4o meta-outputs/gpt-4o

Optional arguments:

--prompts_path: Path to the folder containing the prompts of the evaluator. By default, the prompts are those optimized for GPT-4.
--train_set: Optional flag to meta-evaluate on the train set (16 tests) instead of the test set (144 tests). The train set is meant to be used during the prompt engineering phase.
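A typical prompt engineering loop would iterate on the train set and run the test set once the prompts are settled. The sketch below only builds the two commands; build_meta_evaluate_command and the folder names are hypothetical, and it assumes --train_set is a value-less flag placed after the positional arguments.

```python
def build_meta_evaluate_command(model_name, output_dir,
                                prompts_path=None, train_set=False):
    """Assemble the argument list for `grouse meta-evaluate` (illustrative helper)."""
    command = ["grouse", "meta-evaluate", model_name, output_dir]
    if prompts_path is not None:
        command += ["--prompts_path", prompts_path]
    if train_set:
        # Assumed to be a boolean flag that takes no value.
        command.append("--train_set")
    return command


# Iterate on the 16 train tests while editing the prompts...
train_command = build_meta_evaluate_command(
    "gpt-4o", "meta-outputs/gpt-4o-train",
    prompts_path="my_prompts", train_set=True,
)
# ...then run the 144 test-set tests with the final prompts.
test_command = build_meta_evaluate_command(
    "gpt-4o", "meta-outputs/gpt-4o", prompts_path="my_prompts",
)
```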

Plot Matrices of unit tests success

You can plot the results of unit tests in the shape of matrices:

grouse plot meta-outputs/gpt-4o

The resulting matrices look like this:

[Figure: result_matrices_plot]

Python Usage

from grouse import EvaluationSample, GroundedQAEvaluator

sample = EvaluationSample(
    input="What is the capital of France?",
    # Replace this with the actual output from your LLM application
    actual_output="The capital of France is Marseille.",
    expected_output="The capital of France is Paris.",
    references=["Paris is the capital of France."]
)
evaluator = GroundedQAEvaluator()
evaluator.evaluate([sample])
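The same API can be fed from a jsonl dataset in the format used by the command line tool. The following is a sketch rather than a documented workflow: read_jsonl and evaluate_file are hypothetical helpers, and the only grouse calls used are the ones shown above (EvaluationSample with its four fields, and GroundedQAEvaluator().evaluate on a list of samples).

```python
import json


def read_jsonl(path):
    """Return the non-empty lines of a jsonl file as a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records


def evaluate_file(path):
    """Build one EvaluationSample per record and evaluate them together."""
    # Imported lazily so that read_jsonl stays usable without grouse installed.
    from grouse import EvaluationSample, GroundedQAEvaluator

    samples = [
        EvaluationSample(
            input=record["input"],
            actual_output=record["actual_output"],
            expected_output=record["expected_output"],
            references=record["references"],
        )
        for record in read_jsonl(path)
    ]
    return GroundedQAEvaluator().evaluate(samples)
```

Calling evaluate_file requires grouse to be installed and the OpenAI credentials to be set, since the evaluator queries the model.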

Citation

@misc{muller2024grousebenchmarkevaluateevaluators,
      title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering}, 
      author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
      year={2024},
      eprint={2409.06595},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.06595}, 
}

Maintainers

antoniol

Release history

0.4.2 (Nov 20, 2024)
0.4.1 (Sep 29, 2024)
0.4.0 (Sep 24, 2024)
0.3.1 (Sep 20, 2024)
0.3.0 (Sep 11, 2024), this version
0.2.1 (Sep 5, 2024)
0.2.0 (Sep 5, 2024)
0.1.0 (Aug 23, 2024)

Download files

Source Distribution

grouse-0.3.0.tar.gz (1.3 MB)

Uploaded Sep 11, 2024 Source

Built Distribution

grouse-0.3.0-py3-none-any.whl (20.8 kB)

Uploaded Sep 11, 2024 Python 3

File details

Details for the file grouse-0.3.0.tar.gz.

File metadata

Download URL: grouse-0.3.0.tar.gz
Upload date: Sep 11, 2024
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for grouse-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`42da105c53bd505dfa2e5fdafb95d8c57228aa5b9531572e5439c6dd883f64cd`
MD5	`366903db3f85568ab2244ccb953c447f`
BLAKE2b-256	`5f859c25b537402452dfc09d58fc52142bc7884657e127702eeb86d8612018f5`

File details

Details for the file grouse-0.3.0-py3-none-any.whl.

File metadata

Download URL: grouse-0.3.0-py3-none-any.whl
Upload date: Sep 11, 2024
Size: 20.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for grouse-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1ba9bae0ec23eb5227903594006a918329ebdb085f331a397cab921d22c7b11b`
MD5	`b3c805289c71afa5dc7eb21ee3b917d0`
BLAKE2b-256	`db7e593360631aeb15c348b300e7b03a651000c48aec7654bbcbdd48b9db97af`
