3k questions generated by 15 QG models (including LLMs) based on 200 passages
Project description
QGEval
Resources for the paper "QGEval: A Benchmark for Question Generation Evaluation".
Data
We share the questions generated by the 15 QG models, together with the averaged annotation scores of three annotators, in data/scores.xlsx; the instances grouped by passage are in data/instances.json. We also share the annotation results of each individual annotator in data/annotation result.
Example of an instance:
```json
{
  "id": "572882242ca10214002da423",
  "passage": "... The publication of a Taoist text inscribed with the name of Töregene Khatun, Ögedei's wife, ...",
  "reference": "Who was Ögedei's wife?",
  "answer": "Töregene Khatun",
  "questions": [
    {
      "prediction": "Who was the author of the Taoist text inscribed with the name of?",
      "source": "SQuAD_BART-base_finetune",
      "fluency": 3.0,
      "clarity": 2.6667,
      "conciseness": 3.0,
      "relevance": 3.0,
      "consistency": 2.0,
      "answerability": 1.0,
      "answer_consistency": 1.0
    },
    // ... 14 more questions
  ]
}
```
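For reference, here is a minimal sketch of reading this file, assuming data/instances.json is a JSON array of objects shaped like the example above (the aggregation shown is just an illustration, not part of the released code):

```python
import json
from collections import defaultdict

# Assumes instances.json is a JSON array of objects shaped like the example above.
with open('data/instances.json', encoding='utf-8') as f:
    instances = json.load(f)

# Example: collect answerability scores per QG system ("source").
scores = defaultdict(list)
for inst in instances:
    for q in inst['questions']:
        scores[q['source']].append(q['answerability'])

for source, vals in sorted(scores.items()):
    print(f"{source}: mean answerability = {sum(vals) / len(vals):.3f}")
```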
The average annotation scores of each QG model over the 7 dimensions (Fluency, Clarity, Conciseness, Relevance, Consistency, Answerability, Answer Consistency) are shown in the table below.
Models | Flu. | Clar. | Conc. | Rel. | Cons. | Ans. | AnsC. | Avg. |
---|---|---|---|---|---|---|---|---|
M1 - Reference | 2.968 | 2.930 | 2.998 | 2.993 | 2.923 | 2.832 | 2.768 | 2.916 |
M2 - BART-base-finetune | 2.958 | 2.882 | 2.898 | 2.995 | 2.920 | 2.732 | 2.588 | 2.853 |
M3 - BART-large-finetune | 2.932 | 2.915 | 2.828 | 2.995 | 2.935 | 2.825 | 2.737 | 2.881 |
M4 - T5-base-finetune | 2.972 | 2.923 | 2.922 | 3.000 | 2.917 | 2.788 | 2.652 | 2.882 |
M5 - T5-large-finetune | 2.978 | 2.930 | 2.907 | 2.995 | 2.933 | 2.795 | 2.720 | 2.894 |
M6 - Flan-T5-base-finetune | 2.963 | 2.888 | 2.938 | 2.998 | 2.925 | 2.775 | 2.665 | 2.879 |
M7 - Flan-T5-large-finetune | 2.982 | 2.902 | 2.895 | 2.995 | 2.950 | 2.818 | 2.727 | 2.895 |
M8 - Flan-T5-XL-LoRA | 2.913 | 2.843 | 2.880 | 2.997 | 2.928 | 2.772 | 2.667 | 2.857 |
M9 - Flan-T5-XXL-LoRA | 2.938 | 2.848 | 2.907 | 3.000 | 2.943 | 2.757 | 2.678 | 2.867 |
M10 - Flan-T5-XL-fewshot | 2.975 | 2.820 | 2.985 | 2.955 | 2.908 | 2.652 | 2.193 | 2.784 |
M11 - Flan-T5-XXL-fewshot | 2.987 | 2.882 | 2.990 | 2.988 | 2.920 | 2.687 | 2.432 | 2.841 |
M12 - GPT-3.5-Turbo-fewshot | 2.972 | 2.927 | 2.858 | 2.995 | 2.955 | 2.850 | 2.335 | 2.842 |
M13 - GPT-4-Turbo-fewshot | 2.988 | 2.987 | 2.897 | 2.992 | 2.947 | 2.922 | 2.772 | 2.929 |
M14 - GPT-3.5-Turbo-zeroshot | 2.995 | 2.977 | 2.913 | 2.992 | 2.917 | 2.823 | 2.157 | 2.825 |
M15 - GPT-4-Turbo-zeroshot | 2.983 | 2.990 | 2.943 | 2.970 | 2.932 | 2.883 | 2.723 | 2.918 |
Avg. | 2.967 | 2.910 | 2.917 | 2.991 | 2.930 | 2.794 | 2.588 | - |
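As an illustration, per-model averages like those above can be recomputed from data/scores.xlsx. This sketch assumes the file has one row per generated question, with the dimension columns shown in the instance example and a "source" column naming the dataset and model (so the groups below are per dataset-model pair rather than the merged per-model rows in the table):

```python
import pandas as pd

dimensions = ['fluency', 'clarity', 'conciseness', 'relevance',
              'consistency', 'answerability', 'answer_consistency']

# One row per generated question; "source" names the base dataset + QG model.
df = pd.read_excel('data/scores.xlsx')

# Average each annotation dimension per source, plus an overall average.
per_model = df.groupby('source')[dimensions].mean()
per_model['avg'] = per_model.mean(axis=1)
print(per_model.round(3))
```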
Metrics
We implemented 15 automatic metrics for re-evaluation.
We share the results of each metric on each generated question in data/metric_result.xlsx. Results of the LLM-based metrics on answerability are in data/test_answerability.xlsx.
Models
You can find our trained QG models on Hugging Face.
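As an illustration, a minimal generation sketch with the Hugging Face transformers library. The model id and the "answer: ... context: ..." input format below are placeholders and assumptions; check the model card for the actual checkpoint name and prompt convention.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical checkpoint name; substitute the actual QGEval model id from Hugging Face.
# The "answer: ... context: ..." input format is also an assumption - see the model card.
model_name = "your-org/qgeval-t5-base-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

passage = "The publication of a Taoist text inscribed with the name of Töregene Khatun, Ögedei's wife, ..."
answer = "Töregene Khatun"
inputs = tokenizer(f"answer: {answer} context: {passage}", return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```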
How to use
Our code supports evaluating automatic metrics against the benchmark, training Question Generation models, and calculating automatic metrics.
Evaluation of Automatic Metrics
The code for the automatic metrics is in metrics.
Taking QRelScore as an example, you can use the QGEval benchmark to evaluate a metric step by step:
- Prepare data for evaluation: you can get the QGEval dataset at data/scores.xlsx.
  Column explanation: "passage" - the passage the question is based on; "reference" - the reference question; "answer" - the provided answer; "prediction" - the generated question; "source" - the base dataset and the model used to generate the "prediction" question.
- Run the automatic metric:
  - `cd ./metric`
  - run `pip install -r requirements.txt` to install the required packages
  - run the specific code file to get results from the metric. To get QRelScore results, run `python metrics.py`:

```python
import pandas as pd

# load data
data_path = 'your data path'
save_path = 'result save path'
data = pd.read_excel(data_path)

# prepare parameters
hypos = data['prediction'].tolist()
refs_list = [data['reference'].tolist()]
contexts = data['passage'].tolist()
answers = data['answer'].tolist()

# metric to use
score_names = ['QRelScore']

# run metric
res = get_metrics(hypos, refs_list, contexts, answers, score_names=score_names)

# handle results
for k, v in res.items():
    data[k] = v

# save results
data.to_excel(save_path, index=False)
print('Metrics saved to {}'.format(save_path))
```
- Calculate correlations: run `python coeff.py` to obtain the Pearson, Spearman, and Kendall correlation coefficients between the metric results and the human-annotated scores (a standalone scipy variant is sketched after this step):

```python
import pandas as pd

result_data_path = 'your result path'
df = pd.read_excel(result_data_path)
metrics = ['QRelScore']

# dimensions to calculate correlation with
dimensions = ['fluency', 'clarity', 'conciseness', 'relevance',
              'consistency', 'answerability', 'answer_consistency']

# calculate correlations
coeff = Coeff()
for metric in metrics:
    print(f"Pearson of {metric}")
    for dimension in dimensions:
        labels = df[dimension].to_list()
        preds = df[metric].to_list()
        per, spea, ken = coeff.apply(labels, preds)
        print(f"{dimension}: Pearson={per}, Spearman={spea}, Kendall={ken}")
    print()
```
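If you only need the raw coefficients for a single metric and dimension, they can also be computed directly with scipy.stats. This is a standalone sketch; the assumption that it mirrors what `Coeff.apply` returns is not verified here.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr, kendalltau

# Standalone sketch using scipy.stats instead of the repository's Coeff helper.
df = pd.read_excel('your result path')   # metric results saved in the previous step
labels = df['answerability'].tolist()    # human-annotated dimension
preds = df['QRelScore'].tolist()         # automatic metric scores

print('Pearson :', pearsonr(labels, preds)[0])
print('Spearman:', spearmanr(labels, preds)[0])
print('Kendall :', kendalltau(labels, preds)[0])
```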
More details about the code for the automatic metrics are in metrics/readme.
Question Generation
The code and the data for Question Generation are in qg. Train your own QG models with these steps:
- `cd ./qg`
- run `pip install -r requirements.txt` to install the required packages
- run `python process.py` to process the data
- run the code file for the specific model you want to train; for example, run `python T5.py` to train a T5-based QG model
Find more details in qg/readme.
Automatic Metrics Calculation
The code for automatic metrics calculation (e.g., BLEU-4) is in metrics. Calculate automatic metrics with these steps:
- prepare the data: you can get the Question Generation dataset at qg/data, or prepare your own data
- `cd ./metric`
- run `pip install -r requirements.txt` to install the required packages
- run `python metrics.py` to get the evaluation results for your chosen metrics (see the sketch after this list)
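For illustration, a minimal sketch of this workflow run from inside ./metric, reusing get_metrics as in the QRelScore example above. The import path, the input file name, and the 'BLEU-4' score name are assumptions; check metrics/readme for the exact names supported by your checkout.

```python
import pandas as pd
from metrics import get_metrics  # assumed import; get_metrics is used in metrics.py

# Hypothetical input file with your own generated questions; it only needs the
# same columns as data/scores.xlsx ("passage", "reference", "answer", "prediction").
data = pd.read_excel('my_generated_questions.xlsx')

hypos = data['prediction'].tolist()
refs_list = [data['reference'].tolist()]
contexts = data['passage'].tolist()
answers = data['answer'].tolist()

# 'BLEU-4' is an assumed score name - see metrics/readme for the supported list.
res = get_metrics(hypos, refs_list, contexts, answers, score_names=['BLEU-4'])
for name, scores in res.items():
    data[name] = scores
data.to_excel('my_metric_results.xlsx', index=False)
```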
Find more details in metrics/readme.
Citation
Please cite:
@misc{fu2024qgeval,
title={QGEval: A Benchmark for Question Generation Evaluation},
author={Weiping Fu and Bifan Wei and Jianxiang Hu and Zhongmin Cai and Jun Liu},
year={2024},
eprint={2406.05707},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
File details
Details for the file QGEval_qg-1.0.0.tar.gz.
File metadata
- Download URL: QGEval_qg-1.0.0.tar.gz
- Upload date:
- Size: 22.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.10
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 907c96426e7d667d2f0f1b7826bc807237b858ef0168c1c28984546656b6448e |
MD5 | 2050bcff3967c16301effadb890bc936 |
BLAKE2b-256 | 8165f7ccea793b52fb427cb7fb01f4be336cdaff89328610049281ff6a5cebe3 |
File details
Details for the file QGEval_qg-1.0.0-py3-none-any.whl.
File metadata
- Download URL: QGEval_qg-1.0.0-py3-none-any.whl
- Upload date:
- Size: 30.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.10
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | da40daf59901889a1efc0cee7bdd35a11b0acbde15eaf67b782c4875c191c55c |
MD5 | f013b6fa3efa10774e33960d60cf51f2 |
BLAKE2b-256 | d0a2e90ffbb8a76d225e5e0b7f914abdd50b6105b9b67e0a34a6315e6479475f |