Skip to main content

a python package for evaluating evaluation metrics

Project description

MetricEval

MetricEval: A framework that conceptualizes and operationalizes key desiderata of metric evaluation, in terms of reliability and validity. Please see Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory for more details.

Summary

In this Github repo, you will find the implementation of our framework, metric-eval, a python pkg for evaluation metrics analysis.

@article{xiao2023evaluating,
  title={Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory},
  author={Xiao, Ziang and Zhang, Susu and Lai, Vivian and Liao, Q Vera},
  journal={arXiv preprint arXiv:2305.14889},
  year={2023}
}

metirc-eval

Quick Start

Install from PyPI

pip install metric-eval

Install from Repo

Clone the repository:

git clone git@github.com:isle-dev/MetricEval.git
cd metirc-eval

Install Dependencies:

conda create --name metirc-eval python=3.10
conda activate metirc-eval
pip install -r requirements.txt

Usage

Please refer to metric_eval/example.py for the detailed usage of metirc-eval, see Github repo. To interprete the evalaution results, see paper.

Import Module

import metric_eval

Load Data

Example data could be found in metric_eval/data/* , see Github repo. The data is a csv file with the following format:

- test_id: array containing ids of test examples
- model_id: character array containing ids of models
- additional columns containing scores on each metric (metric name as column name)
data = pd.read_csv("data/metric_scores_long.csv")
data_2nd_run = pd.read_csv("data/metric_scores_long_2nd_run.csv")

Metric Stability

The function compares the average metric scores for models between the two data sets (data1 and data2). It calculates the Pearson correlation coefficient between the average metric scores from the first run (data) and the second run (data_2nd_run) for each metric.

rel_cor = metric_eval.metric_stability(data,data_2nd_run)
print(rel_cor)

Metric Consistency

Metric Consistency describes how the metric score fluctuates within a benchmark dataset, i.e., across data points. Metric consistency estimates (alphas) and the standard error of measurement (sems) of each metric given N randomly samples. Defult N (-1) refers all avaliable samples in the dataset.

alphas, sems = metric_eval.metric_consistency(data, N = -1)
print(alphas)
print(sems)

MTMM Table

The MTMM table presents a way to scrutinize whether observed metric scores act in concert with theory on what they intend to measure, when two or more constructs are measured using two or more methods. By convention, an MTMM table reports the pairwise correlations of the observed metric scores across raters and traits on the off-diagonals and the reliability coefficients of each score on the diagonals.

metric_names = data.columns[2:14].tolist()
trait = ['COH', 'CON', 'FLU', 'REL'] * 3
method = ['Expert_1'] * 4 + ['Expert_2'] * 4 + ['Expert_3'] * 4

# Create the MTMM_design DataFrame
MTMM_design = pd.DataFrame({
    'trait': trait,
    'method': method,
    'metric': metric_names
})

MTMM_result = metric_eval.MTMM(data, MTMM_design, method = 'pearson')
print(MTMM_result)

Metric Concurrent Validity

The function computes the concurrent validity for each criterion variable by calculating the Kendall's Tau correlation coefficient between the criterion variable and each metric.

criterion = ['Expert.1.COH','Expert.1.CON','Expert.1.FLU','Expert.1.REL']
concurrent_val_table = metric_eval.concurrent_validity(data, criterion)
metric_eval.print_concurrent_validity_table(concurrent_val_table)

Get Involved

We welcome contributions from the community! Please create a GitHub issue if you have any questions, suggestions, requests or bug-reports. If you would like to contribute to the codebase, please create a pull request.

Contact

If you have any questions, please contact Ziang Xiao at ziang dot xiao at jhu dot edu or Susu Zhang at szhan105 at illinois dot edu.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

metric-eval-1.0.2.tar.gz (6.0 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page