This package provides standard and classifier-based short-form QA evaluation methods.

QA-Evaluation-Metrics 📊

A fast and lightweight Python package for evaluating question-answering models and prompting of black-box and open-source large language models.

`pip install qa-metrics` is all you need!

🎉 Latest Updates

Version 0.2.19 Released!
- Paper accepted to EMNLP 2024 Findings! 🎓
- Enhanced PEDANTS with multi-pipeline support and improved edge case handling
- Added support for OpenAI GPT-series and Claude-series models (requires openai >= 1.0)
- Integrated support for open-source models (LLaMA-2-70B-chat, LLaVA-1.5, etc.) via DeepInfra
- Introduced a trained tiny-bert model for QA evaluation (18MB model size)
- Added direct Hugging Face model download support for TransformerMatcher

Prerequisites

Python >= 3.6
openai >= 1.0

Installation

pip install qa-metrics

💡 Features

Our package offers six QA evaluation methods with varying strengths. The table compares five of them; F1 Score is documented in section 2.

| Method | Best For | Cost | Correlation with Human Judgment |
|---|---|---|---|
| Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
| PEDANTS | Both short & medium-form QA | Free | Very High |
| Neural Evaluation | Both short & long-form QA | Free | High |
| Open Source LLM Evaluation | All QA types | Free | High |
| Black-box LLM Evaluation | All QA types | Paid | Highest |

📖 Documentation

1. Normalized Exact Match

Method: `em_match`

Parameters

reference_answer (list of str): A list of gold (correct) answers to the question
candidate_answer (str): The answer provided by a candidate that needs to be evaluated

Returns

boolean: True if there are any exact normalized matches between gold and candidate answers

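from qa_metrics.em import em_match

# Gold answers and a model-generated answer to compare against them
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
# True only if the normalized candidate exactly matches one of the gold answers
match_result = em_match(reference_answer, candidate_answer)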
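
To make "exact normalized match" concrete, here is a minimal standalone sketch of the idea. It assumes a common SQuAD-style normalization (lowercasing, dropping punctuation and English articles, collapsing whitespace); the exact steps `em_match` applies are not stated here and may differ. The names `normalize_answer` and `normalized_exact_match` are hypothetical and not part of `qa_metrics`.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_exact_match(reference_answers, candidate_answer):
    """Return True if the normalized candidate equals any normalized reference."""
    candidate = normalize_answer(candidate_answer)
    return any(normalize_answer(ref) == candidate for ref in reference_answers)


reference_answer = ["The Frog Prince", "The Princess and the Frog"]
print(normalized_exact_match(reference_answer, "the princess and the frog!"))  # True
print(normalized_exact_match(reference_answer, "A movie about a frog"))        # False
```

Under this kind of normalization a long, sentence-style answer does not match a short gold answer even when it contains it, which is why exact match suits short-form QA best.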

2. F1 Score

Method: `f1_score_with_precision_recall`

Parameters

reference_answer (str): A gold (correct) answer to the question
candidate_answer (str): The answer provided by a candidate that needs to be evaluated

Returns

dictionary: Contains the F1 score, precision, and recall between a gold and candidate answer

Method: `f1_match`

Parameters

reference_answer (list of str): List of gold answers
candidate_answer (str): Candidate answer to evaluate
threshold (float): F1 score threshold for considering a match (default: 0.5)

Returns

boolean: True if F1 score exceeds threshold for any gold answer

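from qa_metrics.f1 import f1_match, f1_score_with_precision_recall

# reference_answer and candidate_answer are defined in the exact-match example above
# F1, precision, and recall against a single gold answer
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
# True if the F1 score against any gold answer exceeds the threshold
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)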
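
The arithmetic behind a token-level F1 score can be sketched in a few lines. This is an illustration only: it assumes lowercase whitespace tokenization, and the function names and dictionary keys below are hypothetical rather than the ones `qa_metrics` uses.

```python
from collections import Counter


def token_f1_stats(reference, candidate):
    """Token-overlap precision, recall, and F1 between one gold answer and a candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}


def token_f1_match(reference_answers, candidate, threshold=0.5):
    """True if the F1 score against any gold answer exceeds the threshold."""
    return any(token_f1_stats(ref, candidate)["f1"] > threshold for ref in reference_answers)


stats = token_f1_stats("the frog prince", "the frog")
print(stats["precision"], stats["recall"])
print(token_f1_match(["the frog prince"], "the frog"))
```

Here the candidate shares two of its two tokens with the gold answer (precision 1.0) and covers two of the three gold tokens (recall 2/3), giving an F1 of about 0.8.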

3. PEDANTS

Method: `get_score`

Parameters

reference_answer (str): A gold answer
candidate_answer (str): Candidate answer to evaluate
question (str): The question being evaluated

Returns

float: The similarity score between two strings (0 to 1)

Method: `get_highest_score`

Parameters

reference_answer (list of str): List of gold answers
candidate_answer (str): Candidate answer to evaluate
question (str): The question being evaluated

Returns

dictionary: Contains the gold answer and candidate answer pair with highest matching score

Method: `get_scores`

Parameters

reference_answer (list of str): List of gold answers
candidate_answer (str): Candidate answer to evaluate
question (str): The question being evaluated

Returns

dictionary: Contains matching scores for all gold answer and candidate answer pairs

Method: `evaluate`

Parameters

reference_answer (list of str): List of gold answers
candidate_answer (str): Candidate answer to evaluate
question (str): The question being evaluated

Returns

boolean: True if candidate answer matches any gold answer

Method: `get_question_type`

Parameters

reference_answer (list of str): List of gold answers
question (str): The question being evaluated

Returns

list: The type of the question (what, who, when, how, why, which, where)

Method: `get_judgement_type`

Parameters

reference_answer (list of str): List of gold answers
candidate_answer (str): Candidate answer to evaluate
question (str): The question being evaluated

Returns

list: A list of the revised rules that apply when judging answer correctness

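from qa_metrics.pedant import PEDANT

# Example question for the answers defined earlier (wording assumed for illustration)
question = "Which movie is loosely based on the Brothers Grimm's \"Iron Henry\"?"
pedant = PEDANT()
# Scores for every gold/candidate pair, then a single correct/incorrect judgment
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)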
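
The list-based methods can be read as aggregations over a pairwise score. The sketch below shows that relationship with a toy scorer; it is not the PEDANTS classifier. The helper names, the returned dictionary layout, the Jaccard stand-in, and the 0.5 decision threshold are all assumptions made for illustration.

```python
def toy_score(reference, candidate, question):
    """Stand-in pairwise scorer: Jaccard overlap of lowercase tokens (question unused)."""
    ref_tokens = set(reference.lower().split())
    cand_tokens = set(candidate.lower().split())
    union = ref_tokens | cand_tokens
    return len(ref_tokens & cand_tokens) / len(union) if union else 0.0


def all_pair_scores(score_fn, reference_answers, candidate, question):
    """Score the candidate against every gold answer."""
    return {ref: score_fn(ref, candidate, question) for ref in reference_answers}


def best_pair(score_fn, reference_answers, candidate, question):
    """Pick the gold answer with the highest score (assumes at least one gold answer)."""
    scores = all_pair_scores(score_fn, reference_answers, candidate, question)
    best_ref = max(scores, key=scores.get)
    return {"reference": best_ref, "candidate": candidate, "score": scores[best_ref]}


def is_match(score_fn, reference_answers, candidate, question, threshold=0.5):
    """Accept the candidate if its best score reaches the (assumed) threshold."""
    return best_pair(score_fn, reference_answers, candidate, question)["score"] >= threshold


refs = ["The Frog Prince", "The Princess and the Frog"]
best = best_pair(toy_score, refs, "the frog prince story", "Which fairy tale is it?")
print(best["reference"], best["score"])  # The Frog Prince 0.75
print(is_match(toy_score, refs, "the frog prince story", "Which fairy tale is it?"))  # True
```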

4. Transformer Neural Evaluation

Method: `get_score`

Parameters

reference_answer (str): A gold answer
candidate_answer (str): Candidate answer to evaluate
question (str): The question being evaluated

Returns

float: The similarity score between two strings (0 to 1)

Method: `get_highest_score`

Parameters

reference_answer (list of str): List of gold answers
candidate_answer (str): Candidate answer to evaluate
question (str): The question being evaluated

Returns

dictionary: Contains the gold answer and candidate answer pair with highest matching score

Method: `get_scores`

Parameters

reference_answer (list of str): List of gold answers
candidate_answer (str): Candidate answer to evaluate
question (str): The question being evaluated

Returns

dictionary: Contains matching scores for all gold answer and candidate answer pairs

Method: `transformer_match`

Parameters

reference_answer (list of str): List of gold answers
candidate_answer (str): Candidate answer to evaluate
question (str): The question being evaluated

Returns

boolean: True if transformer model considers candidate answer equivalent to any gold answer

from qa_metrics.transformerMatcher import TransformerMatcher

# Also supports `zli12321/answer_equivalence_bert`, `zli12321/answer_equivalence_distilbert`, `zli12321/answer_equivalence_roberta`, `zli12321/answer_equivalence_distilroberta`
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
# reference_answer (list of str), candidate_answer (str), and question (str) as described under Parameters
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
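
Because a matcher such as `tm.transformer_match` takes `(reference_answer, candidate_answer, question)` and returns a boolean, scoring a whole dataset reduces to counting accepted answers. The sketch below uses a stub matcher so it runs without downloading a model. The `accuracy` helper, the stub, and the field names in each example are hypothetical.

```python
def accuracy(examples, matcher):
    """Fraction of examples whose candidate answer the matcher accepts."""
    if not examples:
        return 0.0
    correct = sum(
        bool(matcher(ex["reference_answer"], ex["candidate_answer"], ex["question"]))
        for ex in examples
    )
    return correct / len(examples)


def stub_matcher(reference_answer, candidate_answer, question):
    """Stand-in for a real matcher: case-insensitive equality with any gold answer."""
    return candidate_answer.lower() in [ref.lower() for ref in reference_answer]


examples = [
    {"question": "Q1", "reference_answer": ["The Frog Prince"], "candidate_answer": "the frog prince"},
    {"question": "Q2", "reference_answer": ["Iron Henry"], "candidate_answer": "Cinderella"},
]
print(accuracy(examples, stub_matcher))  # 0.5
```

Passing `tm.transformer_match` in place of `stub_matcher` should work the same way, assuming it keeps the argument order documented above.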

5. LLM Integration

Method: `prompt_gpt`

Parameters

prompt (str): The input prompt text
model_engine (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
temperature (float): Controls randomness (0-1)
max_tokens (int): Maximum tokens in response

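from qa_metrics.prompt_llm import CloseLLM

model = CloseLLM()
# YOUR_OPENAI_KEY is a placeholder for your OpenAI API key string
model.set_openai_api_key(YOUR_OPENAI_KEY)
# prompt is the text to send to the model
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')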
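
Rather than pasting a key into source code, one option is to read it from an environment variable and fail early with a clear message when it is missing. This is a small sketch, not part of `qa_metrics`; the helper name `read_api_key` and the variable name `OPENAI_API_KEY` are assumptions.

```python
import os


def read_api_key(env_var):
    """Return the API key stored in env_var, or raise a clear error if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before prompting.")
    return key


# Hypothetical usage with the class shown above:
# model.set_openai_api_key(read_api_key("OPENAI_API_KEY"))
```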

Method: `prompt_claude`

Parameters

prompt (str): The input prompt text
model_engine (str): Claude model to use
anthropic_version (str): API version
max_tokens_to_sample (int): Maximum tokens in response
temperature (float): Controls randomness (0-1)

from qa_metrics.prompt_llm import CloseLLM

model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)  # placeholder for your Anthropic API key
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')

Method: `prompt`

Parameters

message (str): The input message text
model_engine (str): Model to use
temperature (float): Controls randomness (0-1)
max_tokens (int): Maximum tokens in response

from qa_metrics.prompt_open_llm import OpenLLM

model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
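
To use any of these prompting methods as an answer judge, the question, gold answers, and candidate answer need to be packed into a prompt, and the reply mapped back to a verdict. The sketch below shows one way to do that. The prompt wording and both helper names are assumptions, and it presumes the prompting call returns the reply as plain text.

```python
def build_judge_prompt(question, reference_answers, candidate_answer):
    """Assemble a grading prompt that asks for a one-word verdict."""
    refs = "; ".join(reference_answers)
    return (
        "You are grading a question-answering system.\n"
        f"Question: {question}\n"
        f"Gold answers: {refs}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Is the candidate answer correct? Reply with exactly one word: correct or incorrect."
    )


def parse_verdict(response):
    """Map a reply to True/False; return None when no verdict is recognizable."""
    text = response.strip().lower()
    if text.startswith("incorrect"):
        return False
    if text.startswith("correct"):
        return True
    return None


judge_prompt = build_judge_prompt(
    "Which fairy tale features Iron Henry?",
    ["The Frog Prince"],
    "The Frog Prince",
)
print(judge_prompt)
print(parse_verdict(" Correct."))  # True
```

The resulting `judge_prompt` string could then be passed as `message` (or `prompt`) to the methods above. Returning `None` for unrecognized replies keeps ambiguous model output from being silently counted as wrong.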

🤗 Model Hub

Our fine-tuned models are available on Huggingface:

📚 Resources

📄 Citation

@misc{li2024pedantspreciseevaluationsdiverse,
      title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence}, 
      author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
      year={2024},
      eprint={2402.11161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.11161}, 
}

📝 License
