This package provides standard and classifier-based short-form QA evaluation methods.

QA-Evaluation-Metrics 📊

A fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models.

pip install qa-metrics is all you need!

🎉 Latest Updates

  • Version 0.2.19 Released!
    • Paper accepted to EMNLP 2024 Findings! 🎓
    • Enhanced PEDANTS with multi-pipeline support and improved edge case handling
    • Added support for OpenAI GPT-series and Claude-series models (requires openai >= 1.0)
    • Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via deepinfra
    • Introduced trained tiny-bert for QA evaluation (18MB model size)
    • Added direct Huggingface model download support for TransformerMatcher

🚀 Quick Start

Prerequisites

  • Python >= 3.6
  • openai >= 1.0

Installation

pip install qa-metrics

💡 Features

Our package offers six QA evaluation methods with varying strengths:

Method                     | Best For                                | Cost | Correlation with Human Judgment
---------------------------|-----------------------------------------|------|--------------------------------
Normalized Exact Match     | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good
PEDANTS                    | Both short & medium-form QA             | Free | Very High
Neural Evaluation          | Both short & long-form QA               | Free | High
Open Source LLM Evaluation | All QA types                            | Free | High
Black-box LLM Evaluation   | All QA types                            | Paid | Highest

📖 Documentation

1. Normalized Exact Match

Method: em_match

Parameters

  • reference_answer (list of str): A list of gold (correct) answers to the question
  • candidate_answer (str): The answer provided by a candidate that needs to be evaluated

Returns

  • boolean: True if the candidate answer exactly matches any gold answer after normalization

from qa_metrics.em import em_match

# Gold answers and the model output to evaluate
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
# Boolean verdict over all gold answers
match_result = em_match(reference_answer, candidate_answer)
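
To make "normalized" concrete, the sketch below shows a common SQuAD-style normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace) followed by an equality check. It is an illustration under that assumption only: `normalize_answer` and `em_match_sketch` are hypothetical names, and the rules that `em_match` actually applies may differ.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Hypothetical SQuAD-style normalization; not the package's own code."""
    text = text.lower()
    # Drop ASCII punctuation characters
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Remove English articles as whole words
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse runs of whitespace
    return " ".join(text.split())


def em_match_sketch(reference_answers, candidate_answer) -> bool:
    """True if the normalized candidate equals any normalized gold answer."""
    candidate = normalize_answer(candidate_answer)
    return any(normalize_answer(ref) == candidate for ref in reference_answers)


golds = ["The Frog Prince", "The Princess and the Frog"]
print(em_match_sketch(golds, "the frog prince."))
print(em_match_sketch(golds, "A movie about a frog"))
```

Under this strict-equality assumption, a long free-form candidate does not match even when it mentions a gold answer, which is why exact match suits short-form QA best.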

2. F1 Score

Method: f1_score_with_precision_recall

Parameters

  • reference_answer (str): A gold (correct) answer to the question
  • candidate_answer (str): The answer provided by a candidate that needs to be evaluated

Returns

  • dictionary: Contains the F1 score, precision, and recall between a gold and candidate answer

Method: f1_match

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • threshold (float): F1 score threshold for considering a match (default: 0.5)

Returns

  • boolean: True if F1 score exceeds threshold for any gold answer

from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

# Score the candidate against a single gold answer
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
# Check the candidate against every gold answer at the given threshold
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
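
For readers unfamiliar with token-level F1, here is a minimal sketch of how precision, recall, and F1 are usually computed from overlapping tokens, and how a threshold turns the score into a match decision. The helper names, the dictionary keys, and the plain lowercase-and-split tokenization are assumptions made for illustration; the package's own tokenization and return format may differ.

```python
from collections import Counter


def f1_stats_sketch(reference_answer: str, candidate_answer: str) -> dict:
    """Token-level F1, precision, and recall (hypothetical helper)."""
    ref_tokens = reference_answer.lower().split()
    cand_tokens = candidate_answer.lower().split()
    # Multiset intersection: a shared token counts as often as it appears in both
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}


def f1_match_sketch(reference_answers, candidate_answer, threshold=0.5) -> bool:
    """True if the F1 score exceeds the threshold for any gold answer."""
    return any(
        f1_stats_sketch(ref, candidate_answer)["f1"] > threshold
        for ref in reference_answers
    )


stats = f1_stats_sketch("The Frog Prince", "the frog prince movie")
print(stats)
```

In this example three of the four candidate tokens appear in the gold answer, so precision is 0.75, recall is 1.0, and F1 is 6/7.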

3. PEDANTS

Method: get_score

Parameters

  • reference_answer (str): A gold answer
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • float: The similarity score between two strings (0 to 1)

Method: get_highest_score

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains the gold answer and candidate answer pair with highest matching score

Method: get_scores

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains matching scores for all gold answer and candidate answer pairs

Method: evaluate

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • boolean: True if candidate answer matches any gold answer

Method: get_question_type

Parameters

  • reference_answer (list of str): List of gold answers
  • question (str): The question being evaluated

Returns

  • list: The type of the question (what, who, when, how, why, which, where)

Method: get_judgement_type

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

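  • list: A list of revised rules applicable for judging answer correctness

from qa_metrics.pedant import PEDANT

# Example question for the gold and candidate answers defined earlier
question = "Which movie is loosely based on the Brothers Grimm's \"Iron Henry\"?"
pedant = PEDANT()
# Scores for every gold/candidate pair, then a single boolean verdict
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)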
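
The scoring methods above relate to each other in a simple way: score the candidate against every gold answer, keep the best pair, and compare it with a cutoff to get a boolean. The sketch below illustrates that relationship with a stand-in Jaccard token-overlap scorer, which is not the PEDANTS classifier. All function names, the return shapes, and the 0.5 threshold are assumptions, since the source does not show the real return structures.

```python
def token_overlap(reference: str, candidate: str) -> float:
    """Stand-in scorer (Jaccard token overlap); NOT the PEDANTS classifier."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    union = ref | cand
    return len(ref & cand) / len(union) if union else 0.0


def all_scores(reference_answers, candidate_answer, scorer=token_overlap) -> dict:
    """Score the candidate against every gold answer."""
    return {ref: scorer(ref, candidate_answer) for ref in reference_answers}


def highest_score(reference_answers, candidate_answer, scorer=token_overlap):
    """Return the (gold answer, score) pair with the highest score."""
    scores = all_scores(reference_answers, candidate_answer, scorer)
    best = max(scores, key=scores.get)
    return best, scores[best]


def is_match(reference_answers, candidate_answer, threshold=0.5, scorer=token_overlap) -> bool:
    """Binary verdict: does the best score exceed the assumed threshold?"""
    return highest_score(reference_answers, candidate_answer, scorer)[1] > threshold


golds = ["The Frog Prince", "The Princess and the Frog"]
print(highest_score(golds, "frog prince"))
print(is_match(golds, "frog prince"))
```

Passing the scorer as a parameter keeps the aggregation logic independent of how a single pair is scored, so a learned classifier could be swapped in for the toy overlap measure.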

4. Transformer Neural Evaluation

Method: get_score

Parameters

  • reference_answer (str): A gold answer
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • float: The similarity score between two strings (0 to 1)

Method: get_highest_score

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains the gold answer and candidate answer pair with highest matching score

Method: get_scores

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • dictionary: Contains matching scores for all gold answer and candidate answer pairs

Method: transformer_match

Parameters

  • reference_answer (list of str): List of gold answers
  • candidate_answer (str): Candidate answer to evaluate
  • question (str): The question being evaluated

Returns

  • boolean: True if the transformer model considers the candidate answer equivalent to any gold answer

from qa_metrics.transformerMatcher import TransformerMatcher

# Also supports `zli12321/answer_equivalence_bert`, `zli12321/answer_equivalence_distilbert`, `zli12321/answer_equivalence_roberta`, `zli12321/answer_equivalence_distilroberta`
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
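
Because the boolean matchers documented here take the same three arguments (gold answers, candidate answer, question), a dataset-level accuracy loop can accept any of them as a callable. The sketch below assumes that shared signature. The `accuracy` helper, the record keys, the toy `substring_matcher`, and the two example records are all made up for illustration, and the stand-in matcher exists only so the snippet runs without downloading a model.

```python
from typing import Callable, Dict, List


def accuracy(records: List[Dict], matcher: Callable[[List[str], str, str], bool]) -> float:
    """Fraction of records whose candidate answer the matcher accepts."""
    if not records:
        return 0.0
    correct = sum(
        bool(matcher(r["reference_answer"], r["candidate_answer"], r["question"]))
        for r in records
    )
    return correct / len(records)


def substring_matcher(reference_answers, candidate_answer, question) -> bool:
    """Toy stand-in so the sketch runs without model downloads."""
    return any(ref.lower() in candidate_answer.lower() for ref in reference_answers)


records = [
    {"question": "Who wrote Hamlet?",
     "reference_answer": ["William Shakespeare", "Shakespeare"],
     "candidate_answer": "It was written by Shakespeare."},
    {"question": "What is the capital of France?",
     "reference_answer": ["Paris"],
     "candidate_answer": "Lyon"},
]
# Swap in tm.transformer_match or pedant.evaluate to use a real matcher
print(accuracy(records, substring_matcher))
```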

5. LLM Integration

Method: prompt_gpt

Parameters

  • prompt (str): The input prompt text
  • model_engine (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
  • temperature (float): Controls randomness (0-1)
  • max_tokens (int): Maximum tokens in response

from qa_metrics.prompt_llm import CloseLLM

model = CloseLLM()
# YOUR_OPENAI_KEY is a placeholder for your OpenAI API key string; prompt holds the input text
model.set_openai_api_key(YOUR_OPENAI_KEY)
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')

Method: prompt_claude

Parameters

  • prompt (str): The input prompt text
  • model_engine (str): Claude model to use
  • anthropic_version (str): API version
  • max_tokens_to_sample (int): Maximum tokens in response
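  • temperature (float): Controls randomness (0-1)

from qa_metrics.prompt_llm import CloseLLM
# YOUR_ANTHROPIC_KEY is a placeholder for your Anthropic API key string
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')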

Method: prompt

Parameters

  • message (str): The input message text
  • model_engine (str): Model to use
  • temperature (float): Controls randomness (0-1)
  • max_tokens (int): Maximum tokens in response

from qa_metrics.prompt_open_llm import OpenLLM

model = OpenLLM()
# YOUR_DEEPINFRA_KEY is a placeholder for your deepinfra API key string
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
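
The prompting methods above take free-form text, so using an LLM as a QA judge means building a grading prompt and interpreting the reply yourself. Below is one possible sketch of that wrapper. `build_judge_prompt`, `parse_verdict`, and the prompt wording are hypothetical and not part of the package, and the commented-out call assumes the prompting method returns the reply as a string.

```python
def build_judge_prompt(question: str, reference_answers, candidate_answer: str) -> str:
    """Assemble a hypothetical answer-equivalence prompt for an LLM judge."""
    golds = "; ".join(reference_answers)
    return (
        "You are grading a question-answering system.\n"
        f"Question: {question}\n"
        f"Gold answers: {golds}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with exactly one word: correct or incorrect."
    )


def parse_verdict(response: str) -> bool:
    """Map the model's reply to a boolean by inspecting its first word."""
    words = response.strip().lower().split()
    # Compare the whole first word so that "incorrect" is not read as "correct"
    return bool(words) and words[0].strip(".,!:;\"'") == "correct"


prompt = build_judge_prompt(
    "Who wrote Hamlet?", ["William Shakespeare"], "Shakespeare wrote it."
)
print(prompt)
# With a configured client, the reply could then be parsed, e.g. (assumed to return str):
# verdict = parse_verdict(model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo'))
```

Constraining the reply to a single word keeps parsing trivial; a looser prompt would need a more forgiving parser.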

🤗 Model Hub

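Our fine-tuned models are available on Hugging Face:

  • zli12321/answer_equivalence_bert
  • zli12321/answer_equivalence_distilbert
  • zli12321/answer_equivalence_roberta
  • zli12321/answer_equivalence_distilroberta
  • zli12321/answer_equivalence_tiny_bert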

📚 Resources

📄 Citation

@misc{li2024pedantspreciseevaluationsdiverse,
      title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence}, 
      author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
      year={2024},
      eprint={2402.11161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.11161}, 
}

📝 License
