Skip to main content

Empirical evaluation of retrieval-augmented instruction-following models.

Project description

Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering

arXiv License PyPi

Quick Start

Installation

Make sure you have Python 3.7+ installed. It is also a good idea to use a virtual environment.

Show instructions for creating a Virtual Environment
python3 -m venv instruct-qa-venv
source instruct-qa-venv/bin/activate

You can install the library via pip:

# Install the latest release
pip3 install instruct-qa

# Install the latest version from GitHub
pip3 install git+https://github.com/McGill-NLP/instruct-qa

For development, you can install it in editable mode with:

git clone https://github.com/McGill-NLP/instruct-qa
cd instruct-qa/
pip3 install -e .

Usage

Here is a simple example to get started. Using this library, use can easily leverage retrieval-augmented instruction-following models for question-answering in ~25 lines of code. The source file for this example is examples/get_started.py.

from instruct_qa.collections.utils import load_collection
from instruct_qa.retrieval.utils import load_retriever, load_index
from instruct_qa.prompt.utils import load_template
from instruct_qa.generation.utils import load_model
from instruct_qa.response_runner import ResponseRunner

collection = load_collection("dpr_wiki_collection")
index = load_index("dpr-nq-multi-hnsw")
retriever = load_retriever("facebook-dpr-question_encoder-multiset-base", index)
model = load_model("flan-t5-xxl")
prompt_template = load_template("qa")

queries = ["what is haleys comet"]

runner = ResponseRunner(
    model=model,
    retriever=retriever,
    document_collection=collection,
    prompt_template=prompt_template,
    queries=queries,
)

responses = runner()
print(responses[0]["response"])
# Halley's Comet Halley's Comet or Comet Halley, officially designated 1P/Halley, is a short-period comet visible from Earth every 75–76 years. Halley is the only known short-period comet that is regularly visible to the naked eye from Earth, and the only naked-eye comet that might appear twice in a human lifetime. Halley last appeared...

You can also check the input prompt given to the instruction-sollowing model that contains the instruction and the retrieved passages.

print(responses[0]["prompt"])
"""
Please answer the following question given the following passages:
- Title: Bill Haley
then known as Bill Haley's Saddlemen...

- Title: C/2016 R2 (PANSTARRS)
(CO) with a blue coma. The blue color...

...

Question: what is haleys comet
Answer:
"""

Detailed documentation of different modules of the library can be found here

Generating responses for entire datasets

Our library supports both question answering (QA) and conversational question answering (ConvQA) datasets. The following datasets are currently incorporated in the library

Here is an example to generate responses for Natural Questions using DPR retriever and Flan-T5 generator.

python experiments/question_answering.py \
--prompt_type qa \
--dataset_name natural_questions \
--document_collection_name dpr_wiki_collection \
--index_name dpr-nq-multi-hnsw \
--retriever_name facebook-dpr-question_encoder-multiset-base \
--batch_size 1 \
--model_name flan-t5-xxl \
--k 8 \
--max_new_tokens 500 \
--post_process_response

By default, a results directory is created within the repository that stores the model responses. The default directory location can be overidden by providing an additional command line argument --persistent_dir <OUTPUT_DIR> More examples are present in the examples directory.

Download model responses and human evaluation data

We release the model responses generated using the above commands for all three datasets. The scores reported in the paper are based on these responses. The responses can be downloaded with the following command:

python download_data.py --resource results

The responses are automatically unzipped and stored as JSON lines in the following directory structure:

results
├── {dataset_name}
│   ├── response
│   │   ├── {dataset}_{split}_c-{collection}_m-{model}_r-{retriever}_prompt-{prompt}_p-{top_p}_t-{temperature}_s-{seed}.jsonl

Currently, the following models are included:

  • fid (Fusion-in-Decoder, separately fine-tuned on each dataset)
  • gpt-3.5-turbo (GPT-3.5)
  • alpaca-7b (Alpaca)
  • llama-2-7b-chat (Llama-2)
  • flan-t5-xxl (Flan-T5)

We also release the human annotations for correctness and faithfulness on a subset of responses for all datasets. The annotations can be downloaded with the following command:

python download_data.py --resource human_eval_annotations

The responses will be automatically unzipped in the following directory structure:

human_eval_annotations
├── correctness
│   ├── {dataset_name}
│   │   ├── {model}_human_eval_results.json
|
├── faithfulness
│   ├── {dataset_name}
│   │   ├── {model}_human_eval_results.json

Evaluating model responses (Coming soon!)

Documentation to evaluate model responses and add your own evaluation criterion coming soon! Stay tuned!

LLM-based evaluation

The following prompt templates and instructions were used for LLM-based evaluation.

Correctness

System prompt: You are CompareGPT, a machine to verify the correctness of predictions. Answer with only yes/no.

You are given a question, the corresponding ground-truth answer and a prediction from a model. Compare the "Ground-truth answer" and the "Prediction" to determine whether the prediction correctly answers the question. All information in the ground-truth answer must be present in the prediction, including numbers and dates. You must answer "no" if there are any specific details in the ground-truth answer that are not mentioned in the prediction. There should be no contradicting statements in the prediction. The prediction may contain extra information. If the prediction states something as a possibility, treat it as a definitive answer.

Question: {Question}
Ground-truth answer: {Reference answer}
Prediction:  {{Model response}

CompareGPT response:

Faithfulness

System prompt: You are CompareGPT, a machine to verify the groundedness of predictions. Answer with only yes/no.

You are given a question, the corresponding evidence and a prediction from a model. Compare the "Prediction" and the "Evidence" to determine whether all the information of the prediction in present in the evidence or can be inferred from the evidence. You must answer "no" if there are any specific details in the prediction that are not mentioned in the evidence or cannot be inferred from the evidence.

Question: {Question}
Prediction:  {Model response}
Evidence: {Reference passage}

CompareGPT response:

License

This work is licensed under the Apache 2 license. See LICENSE for details.

Citation

To cite this work, please use the following citation:

@article{adlakha2023evaluating,
      title={Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering}, 
      author={Vaibhav Adlakha and Parishad BehnamGhader and Xing Han Lu and Nicholas Meade and Siva Reddy},
      year={2023},
      journal={arXiv:2307.16877},
}

Contact

For queries and clarifications please contact vaibhav.adlakha (at) mila (dot) quebec

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

instruct-qa-0.0.3rc3.tar.gz (42.4 kB view details)

Uploaded Source

Built Distribution

instruct_qa-0.0.3rc3-py3-none-any.whl (53.1 kB view details)

Uploaded Python 3

File details

Details for the file instruct-qa-0.0.3rc3.tar.gz.

File metadata

  • Download URL: instruct-qa-0.0.3rc3.tar.gz
  • Upload date:
  • Size: 42.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes

Hashes for instruct-qa-0.0.3rc3.tar.gz
Algorithm Hash digest
SHA256 432e07820328d05be09f47983691f0df824ccb08192229e9be053f892f24a210
MD5 a9082180b13c8eeac6b7912c380ea874
BLAKE2b-256 d29d248e48900e06ebcc4b1bf798f8174d4e6ce8383a002016ca01dfc0f8c8b8

See more details on using hashes here.

File details

Details for the file instruct_qa-0.0.3rc3-py3-none-any.whl.

File metadata

File hashes

Hashes for instruct_qa-0.0.3rc3-py3-none-any.whl
Algorithm Hash digest
SHA256 7ea0226a8e60e3f92565749ee7e8036418bfd4aa0012cdaf7d57e8b7c8efb942
MD5 ca576481bdf67b92dd71f2f71a759b20
BLAKE2b-256 f90656f4c94ade6679622710971952b31561ef0c1326bc7d31810c60b333e4d8

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page