Evaluate Grounded Question Answering models and Grounded Question Answering evaluator models.

GroUSE


Evaluate Grounded Question Answering (GQA) models and GQA evaluator models. We implement the evaluation methods described in GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering.

Install

pip install grouse

Then set up your OpenAI credentials: create a .env file by copying the .env.dist file, fill in your OpenAI API key and organization ID, and export the environment variables with export $(cat .env | xargs).
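
If you work from a Python session rather than a shell, the same effect can be approximated by reading the .env file yourself. The sketch below is not part of grouse: load_env_file is a hypothetical helper, and it assumes the file holds simple KEY=VALUE lines, as the export command above implies. The variable names are whatever your .env file defines.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Hypothetical helper: copy KEY=VALUE pairs from a file into os.environ."""
    loaded = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop optional surrounding quotes around the value.
        value = value.strip().strip('"').strip("'")
        os.environ[key] = value
        loaded[key] = value
    return loaded


if __name__ == "__main__":
    # Only the variable names are printed, never the secret values.
    print(sorted(load_env_file(".env")))
```

Call load_env_file() before creating an evaluator so the credentials are already in the environment.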

Command Line Usage

Evaluation of the Grounded Question Answering task

You can build a dataset in a jsonl file with the following format per line:

{
    "references": ["", ...], // List of references
    "input": "", // Query
    "actual_output": "", // Predicted answer generated by the model we want to evaluate
    "expected_output": "" // Ground truth answer to the input
}

You can also check the example file example_data/grounded_qa.jsonl.
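
As a minimal sketch of producing such a file from Python: the // comments in the format above are annotations only and are not valid JSON, so each written line is one plain JSON object. The helper name write_jsonl and the output file name my_dataset.jsonl are illustrative, not part of grouse.

```python
import json
from pathlib import Path

# The four fields listed in the per-line format above.
REQUIRED_FIELDS = {"references", "input", "actual_output", "expected_output"}


def write_jsonl(records, path):
    """Write one JSON object per line, checking the expected fields first."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for index, record in enumerate(records):
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"record {index} is missing fields: {sorted(missing)}")
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    {
        "references": ["Paris is the capital of France."],
        "input": "What is the capital of France?",
        "actual_output": "The capital of France is Marseille.[1]",
        "expected_output": "The capital of France is Paris.[1]",
    }
]

if __name__ == "__main__":
    write_jsonl(records, "my_dataset.jsonl")  # hypothetical output path
```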

Then, run this command:

grouse evaluate {PATH_TO_DATASET_WITH_GENERATIONS} outputs/gpt-4o

We recommend using GPT-4 as the evaluator model because the prompts were optimised for it, but you can change the model and prompts using the optional arguments:

  • --evaluator_model_name: Name of the evaluator model. It can be any LiteLLM model. The default model is GPT-4.
  • --prompts_path: Path to the folder containing the prompts of the evaluator. By default, the prompts are those optimized for GPT-4.
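
To script several runs, the command line can be assembled in Python. This is a sketch under assumptions: build_evaluate_command is a hypothetical helper, it uses only the two flags listed above, and it assumes each flag takes its value as the following argument. The function only builds the argument list; nothing is executed until you pass it to subprocess.run.

```python
import shlex


def build_evaluate_command(dataset_path, output_dir,
                           evaluator_model_name=None, prompts_path=None):
    """Hypothetical helper: build the argv list for `grouse evaluate`."""
    command = ["grouse", "evaluate", str(dataset_path), str(output_dir)]
    # Optional flags are added only when a value is given,
    # so the tool's own defaults apply otherwise.
    if evaluator_model_name is not None:
        command += ["--evaluator_model_name", evaluator_model_name]
    if prompts_path is not None:
        command += ["--prompts_path", str(prompts_path)]
    return command


command = build_evaluate_command(
    "example_data/grounded_qa.jsonl",
    "outputs/gpt-4o",
    evaluator_model_name="gpt-4o",
)
print(shlex.join(command))
# To run it: import subprocess; subprocess.run(command, check=True)
```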

Unit Testing of Evaluators with GroUSE

Meta-evaluation consists of evaluating GQA evaluators with the GroUSE unit tests.

grouse meta-evaluate gpt-4o meta-outputs/gpt-4o

Optional arguments:

  • --prompts_path: Path to the folder containing the prompts of the evaluator. By default, the prompts are those optimized for GPT-4.
  • --train_set: Optional flag to meta-evaluate on the train set (16 tests) instead of the test set (144 tests). The train set is meant to be used during the prompt engineering phase.

Plot Matrices of Unit Test Success

You can plot the results of unit tests in the shape of matrices:

grouse plot meta-outputs/gpt-4o

The resulting matrices look like this:

[Figure: result matrices plot]

Python Usage

from grouse import EvaluationSample, GroundedQAEvaluator

sample = EvaluationSample(
    input="What is the capital of France?",
    # Replace this with the actual output from your LLM application
    actual_output="The capital of France is Marseille.[1]",
    expected_output="The capital of France is Paris.[1]",
    references=["Paris is the capital of France."]
)
evaluator = GroundedQAEvaluator()
evaluator.evaluate([sample])
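
To evaluate a whole jsonl dataset from Python rather than a single sample, one possible sketch is shown below. It assumes each line uses the four field names from the dataset format above, which match the EvaluationSample keyword arguments in the example. load_records and evaluate_file are hypothetical helpers, and the structure of the value returned by evaluate is not described here, so it is passed through unchanged.

```python
import json
from pathlib import Path

# Keyword arguments used by EvaluationSample in the example above.
SAMPLE_FIELDS = ("input", "actual_output", "expected_output", "references")


def load_records(path):
    """Read a jsonl file and keep only the fields EvaluationSample expects."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():  # tolerate blank lines
            record = json.loads(line)
            records.append({field: record[field] for field in SAMPLE_FIELDS})
    return records


def evaluate_file(path):
    """Build one EvaluationSample per line and evaluate them together."""
    # Imported lazily so load_records works without grouse installed.
    from grouse import EvaluationSample, GroundedQAEvaluator

    samples = [EvaluationSample(**record) for record in load_records(path)]
    return GroundedQAEvaluator().evaluate(samples)
```

Calling evaluate_file requires the grouse package and the OpenAI credentials from the install step.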

Tutorial

You can check this tutorial to get started on some examples.

Citation

@misc{muller2024grousebenchmarkevaluateevaluators,
      title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering}, 
      author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
      year={2024},
      eprint={2409.06595},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.06595}, 
}
