
A factuality evaluation metric for evaluating plain language summaries using question answering

Project description


PlainQAFact

PlainQAFact is a retrieval-augmented, question-answering (QA)-based factuality evaluation framework for assessing the factuality of biomedical plain language summaries. PlainFact is a high-quality, human-annotated dataset with fine-grained explanation (i.e., added information) annotations.
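
The QA-based idea can be illustrated with a toy sketch (this is not the actual PlainQAFact pipeline, which uses a learned classifier, an LLM answer extractor, a fine-tuned QG model, and retrieval): derive a question-answer pair from the summary, answer the question against the source, and score the agreement between the two answers.

```python
# Toy illustration of QA-based factuality scoring (not the actual
# PlainQAFact models): an answer is extracted from the summary, the
# corresponding question is answered against the source, and the two
# answers are compared with token-level F1.
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two answer strings."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# Hypothetical outputs of the answer-extraction and QA steps:
answer_from_summary = "aspirin reduces heart attack risk"
answer_from_source = "aspirin decreased heart attack incidence"

score = token_f1(answer_from_summary, answer_from_source)
print(f"factuality score: {score:.2f}")  # → factuality score: 0.60
```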

News

  • (2025.03.11) PlainFact is now available on 🤗 Hugging Face: PlainFact for sentence-level data and PlainFact-summary for summary-level data.
  • (2025.03.02) Pre-embedded vector bases of Textbooks and StatPearls can be downloaded here.
  • (2025.03.01) 🚨🚨🚨 PlainQAFact is now on PyPI! Simply use pip install plainqafact to load our pipeline!
  • (2025.02.24) Our PlainFact dataset can be downloaded here: PlainFact, including sentence-level and summary-level granularities.
    • Target_Sentence: The plain language sentence/summary.
    • Original_Abstract: The scientific abstract corresponding to each sentence/summary.
    • External: Whether the sentence includes information that is not explicitly present in the scientific abstract. (yes: explanation, no: simplification)
    • We will release the full version of PlainFact soon (including Category and Relation information). Stay tuned!
  • (2025.02.24) Our fine-tuned Question Generation model is available on 🤗 Hugging Face: QG model (or download it here)
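
Given the fields above, PlainFact-style rows can be split into explanation and simplification subsets by the External flag. A minimal sketch, assuming the column names listed above (shown on an inline toy table rather than the real file):

```python
# Split PlainFact-style rows into explanation vs. simplification
# subsets using the External flag (toy inline data; the real dataset
# uses the same column names).
import csv
import io

toy_csv = """Target_Sentence,Original_Abstract,External
"Aspirin lowers heart attack risk.","Daily aspirin decreased myocardial infarction incidence.",no
"This matters because heart disease is a leading cause of death.","Daily aspirin decreased myocardial infarction incidence.",yes
"""

rows = list(csv.DictReader(io.StringIO(toy_csv)))
explanations = [r for r in rows if r["External"] == "yes"]
simplifications = [r for r in rows if r["External"] == "no"]
print(len(explanations), len(simplifications))  # → 1 1
```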

NOTE: This repo relies heavily on QAFactEval, QAEval, and MedRAG.

Overall Framework

Model Downloading

In PlainQAFact, we use a pre-trained classifier to distinguish simplification from explanation sentences, Llama 3.1 8B Instruct for answer extraction, a fine-tuned QG model for question generation, and the original question-answering model from QAFactEval.

Download the pre-trained QA model and our pre-trained classifier by running bash download_question_answering.sh.

Quickstart

conda create -n plainqafact python=3.9
pip install plainqafact

After installation, make sure you have initialized git-lfs as required by MedRAG. Then you can use PlainQAFact directly:

from plainqafact import PlainQAFact

metric = PlainQAFact(
    cuda_device=0,
    classifier_type='learned',
    classifier_path='models/learned_classifier',
    llm_model_path='meta-llama/Llama-3.1-8B-Instruct',
    question_generation_model_path='uzw/bart-large-question-generation',
    qa_answering_model_dir='models/answering',
    knowledge_base='combined', # retrieve from both Textbooks and StatPearls KBs
    scoring_batch_size=1,
    answer_selection_strategy='llm-keywords'
)

# choice 1: interactively evaluate summaries
# summaries:
target_sentences = [
    "The study shows aspirin reduces heart attack risk.",
    "Patients with high blood pressure should exercise regularly."
]
# scientific abstracts:
abstracts = [
    "A comprehensive clinical trial demonstrated that daily aspirin administration significantly decreased the incidence of myocardial infarction in high-risk patients.",
    "Research indicates that regular physical activity is an effective intervention for managing hypertension in adult patients."
]

results = metric.evaluate(target_sentences, abstracts)

print(f"Explanation score (mean: {results['external_mean']:.4f}):", results['external_scores'])
print(f"Simplification score (mean: {results['internal_mean']:.4f}):", results['internal_scores'])
print(f"PlainQAFact score: {results['overall_mean']:.4f}")

Or you can evaluate a data file directly:

from plainqafact import PlainQAFact

metric = PlainQAFact(
    cuda_device=0,
    classifier_type='learned',
    classifier_path='models/learned_classifier',
    llm_model_path='meta-llama/Llama-3.1-8B-Instruct',
    question_generation_model_path='uzw/bart-large-question-generation',
    qa_answering_model_dir='models/answering',
    knowledge_base='combined',
    scoring_batch_size=1,
    answer_selection_strategy='llm-keywords',
    target_sentence_col='Target_Sentence', # name of your summary's key (column)
    abstract_col='Original_Abstract', # name of your abstract's key (column)
    input_file_format='csv' # your input file format
)

# choice 2: directly evaluate a data file:
results = metric.evaluate_all(input_file='your_data.csv')

print(f"Explanation score (mean: {results['external_mean']:.4f}):", results['external_scores'])
print(f"Simplification score (mean: {results['internal_mean']:.4f}):", results['internal_scores'])
print(f"PlainQAFact score: {results['overall_mean']:.4f}")

Option 2: Install from source

Installation

  • First, create a new conda env: conda create -n plainqafact python=3.9 and clone our repo.
  • cd PlainQAFact
  • Follow the instructions in MedRAG to install PyTorch and other required packages.
  • Then, run the following command:
    conda install git
    pip install -r requirements.txt
    
  • Finally, install the old tokenizer package through:
    pip install transformers_old_tokenizer-3.1.0-py3-none-any.whl
    

Running through our PlainFact dataset

Before running the following command, please download the question answering and learned classifier models as described in the instructions above.

# classifier_type options: 'learned', 'llama', 'gpt'
# knowledge_base options: textbooks, statpearls, pubmed, wikipedia, combined
# answer_selection_strategy options: 'llm-keywords', 'gpt-keywords', 'none'
python3 run.py \
    --cuda_device 0 \
    --classifier_type learned \
    --input_file data/summary_level.csv \
    --classifier_path path/to/learned_classifier \
    --llm_model_path meta-llama/Llama-3.1-8B-Instruct \
    --question_generation_model_path uzw/bart-large-question-generation \
    --qa_answering_model_dir models/answering \
    --knowledge_base combined \
    --answer_selection_strategy llm-keywords

Running through your own data

Please modify Lines 17-19 of the default_config.py file to indicate the heading/key names of your dataset. We currently support .json, .txt, and .csv files.
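
As a sketch of what those settings look like (the actual variable names in default_config.py may differ; check Lines 17-19 of your copy):

```python
# Hypothetical sketch of the column-name settings in default_config.py
# (the real variable names in the file may differ):
target_sentence_col = "Target_Sentence"   # key holding the plain-language summary
abstract_col = "Original_Abstract"        # key holding the scientific abstract
input_file_format = "json"                # one of: "json", "txt", "csv"
```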

python3 run.py \
    --cuda_device 0 \
    --classifier_type learned \
    --input_file your_own_data.json \
    --input_file_format json \
    --classifier_path path/to/learned_classifier \
    --llm_model_path meta-llama/Llama-3.1-8B-Instruct \
    --question_generation_model_path uzw/bart-large-question-generation \
    --qa_answering_model_dir models/answering \
    --knowledge_base textbooks \
    --answer_selection_strategy llm-keywords

Easily replace the pre-trained classifier with OpenAI models or your own

We provide options to easily replace our pre-trained classifier, which is tailored for biomedical plain language summarization, for other tasks. Simply set --classifier_type to gpt and provide your OpenAI API key on Line 26 of the default_config.py file to run PlainQAFact.

python3 run.py \
    --cuda_device 0 \
    --classifier_type gpt \
    --input_file your_own_data.json \
    --input_file_format json \
    --llm_model_path meta-llama/Llama-3.1-8B-Instruct \
    --question_generation_model_path uzw/bart-large-question-generation \
    --qa_answering_model_dir models/answering \
    --knowledge_base textbooks \
    --answer_selection_strategy llm-keywords
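
The decision the classifier makes can be illustrated with a toy heuristic (a stand-in only; the shipped options are a learned model, Llama, or GPT): a sentence whose content words all appear in the abstract is treated as simplification, otherwise as explanation.

```python
# Toy stand-in for the simplification/explanation classifier: label a
# sentence "explanation" if it introduces content words absent from the
# abstract. The real classifier is a learned model (or Llama/GPT).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "that"}

def classify(sentence: str, abstract: str) -> str:
    abstract_words = set(abstract.lower().split())
    content = {w for w in sentence.lower().rstrip(".").split() if w not in STOPWORDS}
    new_words = content - abstract_words
    return "explanation" if new_words else "simplification"

print(classify("aspirin lowers risk", "aspirin lowers risk in patients"))
# → simplification
print(classify("this matters because heart disease kills", "aspirin lowers risk"))
# → explanation
```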

Using other Knowledge Bases for retrieval

Currently, we only experiment with two KBs: Textbooks and StatPearls. You may want to use customized KBs for more accurate retrieval. In PlainQAFact, we combine both Textbooks and StatPearls and concatenate the retrieved passages with the scientific abstracts. Set --knowledge_base to combined to reproduce our results.
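
The retrieval-and-concatenate step can be sketched as follows (the real pipeline uses MedRAG's dense retrieval over the pre-embedded Textbooks/StatPearls vector bases; the word-overlap scorer here is only a stand-in):

```python
# Toy sketch of the retrieval step: rank KB passages by word overlap
# with the query, then concatenate the top hits with the abstract.
def retrieve(query: str, passages: list, k: int = 1) -> list:
    """Rank passages by word overlap with the query and keep the top k."""
    q = set(query.lower().split())
    score = lambda p: len(q & set(p.lower().replace(".", "").split()))
    return sorted(passages, key=score, reverse=True)[:k]

textbooks = ["Aspirin inhibits platelet aggregation.",
             "Hypertension raises stroke risk."]
statpearls = ["Myocardial infarction is commonly called a heart attack."]
combined_kb = textbooks + statpearls  # knowledge_base='combined'

# Retrieved passages are concatenated with the scientific abstract:
abstract = "Daily aspirin decreased myocardial infarction incidence."
top_passages = retrieve("heart attack aspirin", combined_kb, k=2)
context = abstract + " " + " ".join(top_passages)
```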

NOTE: Using the Llama 3.1 8B model for both classification and answer extraction takes over 40 GB of GPU memory. We recommend using our pre-trained classifier or OpenAI models for classification if GPU memory is limited.

Citation Information

For the use of PlainQAFact and PlainFact benchmark, please cite:

@misc{you2025plainqafactretrievalaugmentedfactualconsistency,
      title={PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization}, 
      author={Zhiwen You and Yue Guo},
      year={2025},
      eprint={2503.08890},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.08890}, 
}

Contact Information

If you have any questions, please email zhiweny2@illinois.edu.
