
QGEval

3k questions generated by 15 QG models (including LLMs) based on 200 passages.

Resources for the paper QGEval: A Benchmark for Question Generation Evaluation.

Data

We share the questions generated by the 15 QG models, with annotation scores averaged over three annotators, in data/scores.xlsx; the instances grouped by passage are in data/instances.json. We also share each annotator's individual annotations in data/annotation result.

An example instance:

{
  "id": "572882242ca10214002da423",
  "passage": "... The publication of a Taoist text inscribed with the name of T枚regene Khatun, 脰gedei's wife, ...",
  "reference": "Who was 脰gedei's wife?"
  "answer": "T枚regene Khatun",
  "questions": [
      {
        "prediction": "Who was the author of the Taoist text inscribed with the name of?",
        "source": "SQuAD_BART-base_finetune",
        "fluency": 3.0,
        "clarity": 2.6667,
        "conciseness": 3.0,
        "relevance": 3.0,
        "consistency": 2.0,
        "answerability": 1.0,
        "answer_consistency": 1.0
      },
      // ... 14 more questions
  ]
}
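
If you want to work with the instances programmatically, a minimal sketch for loading them follows (it assumes data/instances.json is a JSON list of objects in the format shown above):

import json

# Load the passage-level instances; each one bundles the passage, the reference
# question, the answer, and the generated questions with their human scores.
with open('data/instances.json', encoding='utf-8') as f:
    instances = json.load(f)

example = instances[0]
print(example['reference'])
for q in example['questions']:
    print(q['source'], q['answerability'], q['prediction'])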

The average annotation scores of each QG model on the seven dimensions are shown in the table below (Flu. = fluency, Clar. = clarity, Conc. = conciseness, Rel. = relevance, Cons. = consistency, Ans. = answerability, AnsC. = answer consistency).

Models Flu. Clar. Conc. Rel. Cons. Ans. AnsC. Avg.
M1 - Reference 2.968 2.930 2.998 2.993 2.923 2.832 2.768 2.916
M2 - BART-base-finetune 2.958 2.882 2.898 2.995 2.920 2.732 2.588 2.853
M3 - BART-large-finetune 2.932 2.915 2.828 2.995 2.935 2.825 2.737 2.881
M4 - T5-base-finetune 2.972 2.923 2.922 3.000 2.917 2.788 2.652 2.882
M5 - T5-large-finetune 2.978 2.930 2.907 2.995 2.933 2.795 2.720 2.894
M6 - Flan-T5-base-finetune 2.963 2.888 2.938 2.998 2.925 2.775 2.665 2.879
M7 - Flan-T5-large-finetune 2.982 2.902 2.895 2.995 2.950 2.818 2.727 2.895
M8 - Flan-T5-XL-LoRA 2.913 2.843 2.880 2.997 2.928 2.772 2.667 2.857
M9 - Flan-T5-XXL-LoRA 2.938 2.848 2.907 3.000 2.943 2.757 2.678 2.867
M10 - Flan-T5-XL-fewshot 2.975 2.820 2.985 2.955 2.908 2.652 2.193 2.784
M11 - Flan-T5-XXL-fewshot 2.987 2.882 2.990 2.988 2.920 2.687 2.432 2.841
M12 - GPT-3.5-Turbo-fewshot 2.972 2.927 2.858 2.995 2.955 2.850 2.335 2.842
M13 - GPT-4-Turbo-fewshot 2.988 2.987 2.897 2.992 2.947 2.922 2.772 2.929
M14 - GPT-3.5-Turbo-zeroshot 2.995 2.977 2.913 2.992 2.917 2.823 2.157 2.825
M15 - GPT-4-Turbo-zeroshot 2.983 2.990 2.943 2.970 2.932 2.883 2.723 2.918
Avg. 2.967 2.910 2.917 2.991 2.930 2.794 2.588
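
The per-model averages in this table can be recomputed from data/scores.xlsx. A minimal sketch, assuming the file carries a source column and the seven dimension columns shown in the instance example:

import pandas as pd

dimensions = ['fluency', 'clarity', 'conciseness', 'relevance',
              'consistency', 'answerability', 'answer_consistency']

# 'source' encodes dataset and model (e.g. SQuAD_BART-base_finetune),
# so average each dimension per source, then average across dimensions.
df = pd.read_excel('data/scores.xlsx')
per_model = df.groupby('source')[dimensions].mean().round(3)
per_model['avg'] = per_model[dimensions].mean(axis=1).round(3)
print(per_model)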

Metrics

We implemented 15 metrics for re-evaluation:

Metric - Paper
BLEU-4 - BLEU: a Method for Automatic Evaluation of Machine Translation
ROUGE-L - ROUGE: A Package for Automatic Evaluation of Summaries
METEOR - METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
BERTScore - BERTScore: Evaluating Text Generation with BERT
MoverScore - MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
BLEURT - BLEURT: Learning Robust Metrics for Text Generation
BARTScore-ref - BARTScore: Evaluating Generated Text as Text Generation
GPTScore-ref - GPTScore: Evaluate as You Desire
Q-BLEU4 - Towards a Better Metric for Evaluating Question Generation Systems
QSTS - QSTS: A Question-Sensitive Text Similarity Measure for Question Generation
BARTScore-src - BARTScore: Evaluating Generated Text as Text Generation
GPTScore-src - GPTScore: Evaluate as You Desire
QRelScore - QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance
UniEval - Towards a Unified Multi-Dimensional Evaluator for Text Generation
RQUGE - RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question

We share the results of each metric on each generated question in data/metric_result.xlsx. Results of LLM-based metrics on answerability are in data/test_answerability.xlsx.

Models

You can find our trained QG model on Hugging Face.
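
If the released checkpoints are standard seq2seq models, they should load with the transformers library. A minimal sketch with a placeholder repository id (the real ids are listed on the Hugging Face page, and the exact input format is described in qg/readme):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = 'your-qgeval-checkpoint'  # placeholder: replace with an actual repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Assumed answer-aware input format; check qg/readme for the exact prompt.
text = 'answer: Töregene Khatun context: The publication of a Taoist text ...'
inputs = tokenizer(text, return_tensors='pt', truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))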

How to use

Our code supports evaluating automatic metrics against the benchmark; you can also use it to train question generation models and to compute automatic metrics.

Evaluation of Automatic Metrics

The code for the automatic metrics is in metrics.

Taking QRelScore as an example, you can evaluate it against the QGEval benchmark step by step:

  1. Prepare data for evaluation: You can get the QGEval dataset at data/scores.xlsx.

    Column Explanation
    "passage" - the passage of the question based on.
    "reference" - the reference question.
    "answer" - the provided answer.
    "prediction" - the generated question.
    "source" - the base dataset and model used to   generate the 'prediction' question.
    
  2. Run automatic metrics

    • cd ./metric
    • run pip install -r requirements.txt to install the required packages
    • run the code file for the specific metric. To get QRelScore results, run python metrics.py:
     import pandas as pd
     # get_metrics is defined in metrics.py; import it if you adapt this snippet outside that file
     # load data
     data_path = 'your data path'
     save_path = 'result save path'
     data = pd.read_excel(data_path)
     # prepare parameters
     hypos = data['prediction'].tolist()
     refs_list = [data['reference'].tolist()]
     contexts = data['passage'].tolist()
     answers = data['answer'].tolist()
     # metric to use
     score_names = ['QRelScore']
     # run metric
     res = get_metrics(hypos, refs_list, contexts, answers, score_names=score_names)
     # handle results
     for k, v in res.items():
         data[k] = v
     # save results
     data.to_excel(save_path, index=False)
     print('Metrics saved to {}'.format(save_path))
    
  3. Calculate Correlations

    run python coeff.py to obtain the Pearson, Spearman, and Kendall correlation coefficients between the metric results and the human-annotated scores.

    import pandas as pd
    # Coeff is defined in coeff.py; import it if you adapt this snippet outside that file
    result_data_path = 'your result path'
    df = pd.read_excel(result_data_path)
    metrics = ['QRelScore']
    
    # dimensions to calculate correlation with
    dimensions = ['fluency','clarity','conciseness','relevance','consistency','answerability','answer_consistency']
    
    # calculate pearson
    coeff = Coeff()
    
    for metric in metrics:
     print(f"Pearson of {metric}")
     for dimension in dimensions:
       labels = df[dimension].to_list()
       preds = df[metric].to_list()
       per, spea, ken = coeff.apply(labels, preds)
       print(f"{dimension}: Pearson={per}, Spearman={spea}, Kendall={ken}")
       print()
    

More details about the code for the automatic metrics are in metrics/readme.
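
In step 3 above, coeff.apply presumably returns the Pearson, Spearman, and Kendall statistics in that order, as the print statement suggests. If you want to compute the same correlations directly, a minimal sketch with scipy.stats:

import pandas as pd
from scipy.stats import pearsonr, spearmanr, kendalltau

df = pd.read_excel('your result path')  # the file written by metrics.py
metric = 'QRelScore'
dimensions = ['fluency', 'clarity', 'conciseness', 'relevance',
              'consistency', 'answerability', 'answer_consistency']

for dim in dimensions:
    labels, preds = df[dim], df[metric]
    # each scipy function returns (statistic, p-value); keep the statistic
    print(f'{dim}: Pearson={pearsonr(labels, preds)[0]:.3f}, '
          f'Spearman={spearmanr(labels, preds)[0]:.3f}, '
          f'Kendall={kendalltau(labels, preds)[0]:.3f}')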

Question Generation

The code and data for question generation are in qg. Train your own QG models with these steps:

  1. cd ./qg
  2. run pip install -r requirements.txt to install the required packages
  3. run python process.py to process data
  4. run the code file for the specific model you want to train. For example, run python T5.py to train a T5-based QG model

Find more details in qg/readme.
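
For orientation, fine-tuning a T5-style QG model with the transformers library roughly follows the pattern sketched below. This is only an illustrative sketch: the actual input format, data files, and hyperparameters are defined by process.py, T5.py, and qg/readme, and the file names and column names used here are assumptions.

from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base')

# Hypothetical processed file; the real training data comes from qg/data and process.py.
data = load_dataset('json', data_files={'train': 'train.json'})

def preprocess(batch):
    # Assumed answer-aware format: "answer: <answer> context: <passage>" -> question
    inputs = [f'answer: {a} context: {p}' for a, p in zip(batch['answer'], batch['passage'])]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=batch['question'], max_length=64, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized = data.map(preprocess, batched=True, remove_columns=data['train'].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir='qg-t5-base', num_train_epochs=3,
                                  per_device_train_batch_size=8, learning_rate=3e-4),
    train_dataset=tokenized['train'],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()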

Automatic Metrics Calculation

The code for automatic metrics calculation (e.g., BLEU-4) is in metrics. Calculate automatic metrics with these steps:

  1. prepare the data: you can use the question generation dataset at qg/data or prepare your own data
  2. cd ./metric
  3. run pip install -r requirements.txt to install the required packages
  4. run python metrics.py to get the evaluation results for your chosen metrics

Find more details in metrics/readme.
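
To compute several metrics in one pass, extend the score_names list from the QRelScore walkthrough above, reusing the hypos, refs_list, contexts, and answers variables prepared there. The exact name strings accepted by get_metrics are defined in metrics.py, so the ones below are illustrative only:

# Illustrative metric names; check metrics.py / metrics/readme for the exact strings.
score_names = ['BLEU-4', 'ROUGE-L', 'BERTScore', 'QRelScore']
res = get_metrics(hypos, refs_list, contexts, answers, score_names=score_names)
for name, scores in res.items():
    data[name] = scores
data.to_excel(save_path, index=False)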

Citation

Please cite:

@misc{fu2024qgeval,
      title={QGEval: A Benchmark for Question Generation Evaluation}, 
      author={Weiping Fu and Bifan Wei and Jianxiang Hu and Zhongmin Cai and Jun Liu},
      year={2024},
      eprint={2406.05707},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
