GroUSE

Evaluate Grounded Question Answering (GQA) models and GQA evaluator models. We implement the evaluation methods described in GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering.

Install

pip install -e .

Command Line Usage

Evaluation of the Grounded Question Answering task

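Build your dataset as a JSONL file with one JSON object per line. Each object has the following fields (shown here across several lines, with # annotations, for readability only):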

{
    "references": [...], # List of references
    "input": "", # Query
    "actual_output": "", # Predicted answer generated by the model we want to evaluate
    "expected_output": "" # Ground truth answer to the input
}

See example_data/grounded_qa.jsonl for an example.
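As an illustration, a file in this format could be produced with the Python standard library. This is a minimal sketch, not part of GroUSE: the helper name write_jsonl and the output file name are hypothetical, and the sample values are taken from the Python usage example in this README.

```python
import json
from pathlib import Path


def write_jsonl(samples, path):
    """Write each sample as one compact JSON object per line (hypothetical helper)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for sample in samples:
            # json.dumps emits a single line, which is what JSONL requires.
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    return path


samples = [
    {
        "references": ["Paris is the capital of France."],
        "input": "What is the capital of France?",
        "actual_output": "The capital of France is Marseille.",
        "expected_output": "The capital of France is Paris.",
    }
]

if __name__ == "__main__":
    # "my_dataset.jsonl" is a placeholder file name.
    write_jsonl(samples, "my_dataset.jsonl")
```

JSON itself does not allow comments, so the annotations in the format description must not appear in the actual file.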

Then, run this command:

grouse evaluate {PATH_TO_DATASET_WITH_GENERATIONS} outputs/gpt-4o

We recommend GPT-4 as the evaluator model because the prompts were optimised for it, but you can change the model and the prompts with the --model-name and --prompts-path options.

Unit Testing of Evaluators with GroUSE

Meta-evaluation consists of evaluating GQA evaluators with the GroUSE unit tests.

grouse meta-evaluate gpt-4o meta-outputs/gpt-4o

Plot Matrices of Unit Test Success

You can plot the results of unit tests in the shape of matrices:

grouse plot meta-outputs/gpt-4o

The resulting matrices look like this:

[Figure: result matrices plot]

Python Usage

from grouse import EvaluationSample, GroundedQAEvaluator

sample = EvaluationSample(
    input="What is the capital of France?",
    # Replace this with the actual output from your LLM application
    actual_output="The capital of France is Marseille.",
    expected_output="The capital of France is Paris.",
    references=["Paris is the capital of France."]
)
evaluator = GroundedQAEvaluator()
evaluator.evaluate([sample])
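To evaluate a whole JSONL dataset from Python rather than a single sample, the rows can first be read and checked for the four fields described above. The sketch below uses only the standard library; load_rows and REQUIRED_FIELDS are hypothetical names, and the commented usage assumes that EvaluationSample accepts exactly the keyword arguments shown in the example above.

```python
import json

# Field names taken from the dataset format described in this README.
REQUIRED_FIELDS = ("references", "input", "actual_output", "expected_output")


def load_rows(path):
    """Read a JSONL file and return one dict per non-empty line (hypothetical helper)."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            missing = [key for key in REQUIRED_FIELDS if key not in row]
            if missing:
                raise ValueError(f"line {line_number}: missing fields {missing}")
            # Keep only the expected fields so each row can be passed as keyword arguments.
            rows.append({key: row[key] for key in REQUIRED_FIELDS})
    return rows


# Assumed usage with the classes shown above (requires grouse to be installed):
# from grouse import EvaluationSample, GroundedQAEvaluator
# samples = [EvaluationSample(**row) for row in load_rows("example_data/grounded_qa.jsonl")]
# GroundedQAEvaluator().evaluate(samples)
```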

Citation

@misc{muller2024grouse,
      title={GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering}, 
      author={Sacha Muller and António Loison and Bilel Omrani and Gautier Viaud},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
}
