gec-metrics

A library for evaluation of Grammatical Error Correction.

Install

pip install git+https://github.com/gotutiyan/gec-metrics
python -m spacy download en_core_web_sm

Or,

git clone git@github.com:gotutiyan/gec-metrics.git
cd gec-metrics
pip install -e ./
python -m spacy download en_core_web_sm

Common Usage

API

gec_metrics.get_metric() supports ['errant', 'gleu', 'gleuofficial', 'green', 'gotoscorer', 'impara', 'some', 'scribendi'].

from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
refs = [
    ['This sentence contains a grammatical error .'],
    ['This sentence contains grammatical errors .']
] # (num_refs, num_sents)

# Corpus-level score
# If the metric is reference-free, the argument `references=` is not needed.
corpus_score: float = metric.score_corpus(
    sources=srcs,
    hypotheses=hyps,
    references=refs
)
# Sentence-level scores
sent_scores: list[float] = metric.score_sentence(
    sources=srcs,
    hypotheses=hyps,
    references=refs
)

CLI

  • Because the available configuration options differ by metric, they are specified in a YAML file. If no YAML file is provided, the default configuration is used.
  • --metric supports ['errant', 'gleu', 'gleuofficial', 'green', 'gotoscorer', 'impara', 'some', 'scribendi'].
  • You can input multiple hypothesis files (see the concrete example below).
gecmetrics-eval \
    --src <sources file> \
    --hyps <hypotheses file 1> <hypotheses file 2> ... \
    --refs <references file 1> <references file 2> ... \
    --metric <metric id> \
    --config config.yaml

# The output will be:
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 1>
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 2>
# ...
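
For example, with hypothetical file names (placeholders, not files shipped with the library), two system outputs can be scored with GLEU like this:

gecmetrics-eval \
    --src src.txt \
    --hyps hyp_system1.txt hyp_system2.txt \
    --refs ref0.txt ref1.txt \
    --metric gleu \
    --config config.yaml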

A config.yaml populated with the default values can be generated via gecmetrics-gen-config.

gecmetrics-gen-config > config.yaml

Metrics

gec-metrics supports the following metrics.
The arguments shown in the following examples are the default values.

Reference-based

M2

To be added.

GLEU+ [Napoles+ 15] [Napoles+ 16]

from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config(
    iter=500,  # The number of iterations 
    n=4,  # max n-gram
    unit='word'  # 'word' or 'char'
))

We also provide a reproduction of the official implementation as GLEUOfficial.
The official implementation ignores n-gram frequency differences when computing the difference set between source and reference (see the sketch after the following config).

from gec_metrics import get_metric
metric_cls = get_metric('gleuofficial')
metric = metric_cls(metric_cls.Config(
    iter=500,  # The number of iterations 
    n=4,  # max n-gram
    unit='word'  # 'word' or 'char'
))
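
To make this distinction concrete, here is a minimal, self-contained sketch (illustrative only, not the library's code) contrasting a frequency-aware multiset difference with the frequency-ignoring set difference over source and reference unigrams:

from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

src_counts = ngram_counts('the the cat sat'.split(), 1)
ref_counts = ngram_counts('the cat sat'.split(), 1)

# Frequency-aware: the extra occurrence of "the" survives the difference.
print(src_counts - ref_counts)            # Counter({('the',): 1})
# Frequency-ignoring (the official behaviour): every source unigram also
# appears in the reference, so the difference set is empty.
print(set(src_counts) - set(ref_counts))  # set()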

ERRANT [Felice+ 16] [Bryant+ 17]

from gec_metrics import get_metric
metric_cls = get_metric('errant')
metric = metric_cls(metric_cls.Config(
    beta=0.5,  # The beta for F-beta score
    language='en'  # Language for SpaCy.
))
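
beta=0.5 weights precision more heavily than recall. As a reminder, the generic F-beta formula (not code from the library) behaves like this:

def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Generic F-beta score; beta < 1 emphasizes precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

print(f_beta(0.8, 0.4))  # ~0.67: high precision is rewarded
print(f_beta(0.4, 0.8))  # ~0.44: high recall alone helps less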

GoToScorer [Gotou+ 20]

from gec_metrics import get_metric
metric_cls = get_metric('gotoscorer')
metric = metric_cls(metric_cls.Config(
    beta=0.5,  # The beta for F-beta score
    ref_id=0,  # The reference id
    no_weight=False,  # If True, all weights are 1.0
    weight_file=''  # It is required if no_weight=False
))

You can generate a weight file via gecmetrics-gen-gotoscorer-weight.
The output is a JSON file.

gecmetrics-gen-gotoscorer-weight \
    --src <raw text file> \
    --ref <raw text file> \
    --hyp <raw text file 1> <raw text file 2> ... <raw text file N> \
    --out weight.json

PT-M2 [Gong+ 22]

To be added.

PT-ERRANT [Gong+ 22]

from gec_metrics import get_metric
metric_cls = get_metric('pterrant')
weight_model_id = 'bertscore'
weight_model_cls = get_metric(weight_model_id)
metric = metric_cls(metric_cls.Config(
    beta=0.5,
    weight_model_name=weight_model_id,
    weight_model_config=weight_model_cls.Config(  # Optional: you can pass config
        score_type='f',
        rescale_with_baseline=True
    )
))

CLEME [Ye+ 23]

To be added.

GREEN [Koyama+ 24]

from gec_metrics import get_metric
metric_cls = get_metric('green')
metric = metric_cls(metric_cls.Config(
    n=4,  # Max n of ngram
    beta=2.0,  # The beta for F-beta
    unit='word'  # 'word' or 'char'. Choose word-level or character-level
))

Reference-based (but source-free)

These metrics are intended to be used as components of PT-{M2, ERRANT}, but they are also exposed through the API.

BERTScore [Zhang+ 19]

The default config follows the default setting of [Gong+ 22].

from gec_metrics import get_metric
metric_cls = get_metric('bertscore')
metric = metric_cls(metric_cls.Config(
    model_type='bert-base-uncased',
    num_layers=None,
    batch_size=64,
    nthreads=4,
    all_layers=False,
    idf=False,
    idf_sents=None,
    lang='en',
    rescale_with_baseline=True,
    baseline_path=None,
    use_fast_tokenizer=False,
    score_type='f'
))

BARTScore [Yuan+ 21]

To be added.

Reference-free

SOME [Yoshimura+ 20]

Download pre-trained models in advance from here.

from gec_metrics import get_metric
metric_cls = get_metric('some')
metric = metric_cls(metric_cls.Config(
    model_g='gfm-models/grammer',
    model_f='gfm-models/fluency',
    model_m='gfm-models/meaning',
    weight_f=0.55,
    weight_g=0.43,
    weight_m=0.02,
    batch_size=32
))
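
As noted in Common Usage, reference-free metrics use the same scoring API but the references= argument can be omitted. A minimal sketch, assuming the pre-trained models have been downloaded to gfm-models/ as configured above:

srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
# No `references=` argument is needed for reference-free metrics.
corpus_score: float = metric.score_corpus(sources=srcs, hypotheses=hyps)
sent_scores: list[float] = metric.score_sentence(sources=srcs, hypotheses=hyps)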

Scribendi [Islam+ 21]

from gec_metrics import get_metric
metric_cls = get_metric('scribendi')
metric = metric_cls(metric_cls.Config(
    model='gpt2',  # The model name or path to the language model to compute perplexity
    threshold=0.8  # The threshold for the maximum of token-sort-ratio and Levenshtein distance ratio
))
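
The threshold gates how similar a hypothesis must remain to its source. The sketch below is only a conceptual illustration of that gating, using difflib as a stand-in for the token-sort-ratio and Levenshtein-ratio computations; it is not the library's implementation:

from difflib import SequenceMatcher

def char_ratio(a: str, b: str) -> float:
    """Stand-in for a character-level (Levenshtein-like) similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a: str, b: str) -> float:
    """Stand-in for token-sort-ratio: compare the sentences with their tokens sorted."""
    return char_ratio(' '.join(sorted(a.split())), ' '.join(sorted(b.split())))

src = 'This sentences contain grammatical error .'
hyp = 'This sentence contains a grammatical error .'
similarity = max(char_ratio(src, hyp), token_sort_ratio(src, hyp))
# Hypotheses whose maximum similarity falls below the threshold are treated as too dissimilar.
print(similarity >= 0.8)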

IMPARA [Maeda+ 22]

By default, IMPARA uses an unofficial pretrained QE model: [gotutiyan/IMPARA-QE].
Note that this unofficial model achieves correlation with the human evaluation results comparable to the original.

from gec_metrics import get_metric
metric_cls = get_metric('impara')
metric = metric_cls(metric_cls.Config(
    model_qe='gotutiyan/IMPARA-QE',  # The model name or path for quality estimation.
    model_se='bert-base-cased',  # The model name or path for similarity estimation.
    threshold=0.9  # The threshold for the similarity score.
))

Meta Evaluation

We provide scripts for performing meta-evaluation easily.

Preparation

To download the test data and human scores, run the following shell script.

gecmetrics-prepare-meta-eval
# The above is the same as:
# bash src/gec_metrics/meta_eval/prepare_meta_eval.sh

This script creates the meta_eval_data/ directory, which contains the SEEDA dataset and the CoNLL14 official submissions.

meta_eval_data/
├── GJG15
│   └── judgments.xml
├── conll14
│   ├── official_submissions
│   │   ├── AMU
│   │   ├── CAMB
│   │   ├── ...
│   ├── REF0
│   └── REF1
└── SEEDA
    ├── outputs
    │   ├── all
    │   │   ├── ...
    │   └── subset
    │       ├── ...
    ├── scores
    │   ├── human
    │   │   ├── ...
    │   └── ...

SEEDA: [Kobayashi+ 24]

The examples below use GLEU as the metric, but any other metric based on gec_metrics.metrics.MetricBase can also be used.

  • ew_* uses Expected Wins human evaluation scores and ts_* uses TrueSkill.
  • *_edit and *_sent correspond to SEEDA-E and SEEDA-S, respectively.
from gec_metrics import get_meta_eval
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta_cls = get_meta_eval('seeda')
meta_seeda = meta_cls(
    meta_cls.Config(system='base')
)
# System correlation
results = meta_seeda.corr_system(metric)
# Output:
# SEEDASystemCorrOutput(ew_edit=Corr(pearson=0.9007842791853424,
#                                    spearman=0.9300699300699302,
#                                    accuracy=None,
#                                    kendall=None),
#                       ew_sent=Corr(pearson=0.8749437873537543,
#                                    spearman=0.9090909090909092,
#                                    accuracy=None,
#                                    kendall=None),
#                       ts_edit=Corr(pearson=0.9123732084071973,
#                                    spearman=0.9440559440559443,
#                                    accuracy=None,
#                                    kendall=None),
#                       ts_sent=Corr(pearson=0.8856173179230024,
#                                    spearman=0.9020979020979022,
#                                    accuracy=None,
#                                    kendall=None))

# Sentence correlation
results = meta_seeda.corr_sentence(metric)
# Output:
# SEEDASentenceCorrOutput(sent=Corr(pearson=None,
#                                   spearman=None,
#                                   accuracy=0.6715701950751519,
#                                   kendall=0.3431403901503038),
#                         edit=Corr(pearson=None,
#                                   spearman=None,
#                                   accuracy=0.6734561494551116,
#                                   kendall=0.3469122989102231))

Window analysis can be performed with window_analysis_system().

  • ew_* uses Expected Wins human evaluation scores and ts_* uses TrueSkill.
  • *_edit and *_sent correspond to SEEDA-E and SEEDA-S, respectively.
  • Each is a dictionary: {(start_rank, end_rank): MetaEvalSEEDA.Corr}.
from gec_metrics import get_meta_eval
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta_cls = get_meta_eval('seeda')
meta_seeda = meta_cls(
    meta_cls.Config(system='base')
)
results = meta_seeda.window_analysis_system(metric, window=4)
assert results.ew_edit is not None
assert results.ew_sent is not None
assert results.ts_edit is not None
assert results.ts_sent is not None

for k, v in results.ts_sent.items():
    print(f'From {k[0]} to {k[1]}: {v.pearson=}, {v.spearman=}')

GJG15: [Grundkiewicz+ 15]

This dataset is referred to as GJG15 in the SEEDA paper.
Basically, the TrueSkill ranking is used to compute the correlation.

from gec_metrics import get_meta_eval
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta_cls = get_meta_eval('gjg')
meta_gjg = meta_cls(meta_cls.Config())
# System correlation
results = meta_gjg.corr_system(metric)
# Output:
# GJGSystemCorrOutput(ew=Corr(pearson=0.601290729078602,
#                             spearman=0.5934065934065934,
#                             accuracy=None,
#                             kendall=None),
#                     ts=Corr(pearson=0.6633835644883472,
#                             spearman=0.6868131868131868,
#                             accuracy=None,
#                             kendall=None))

results = meta_gjg.corr_sentence(metric)
# Output:
# GJGSentenceCorrOutput(corr=Corr(pearson=None,
#                                 spearman=None,
#                                 accuracy=0.6729157079690282,
#                                 kendall=0.34583141593805644))
