
gec-metrics

A library for evaluation of Grammatical Error Correction.


[API Docs] [Demo]

Install

pip install gec-metrics

To install the latest version:

pip install git+https://github.com/gotutiyan/gec-metrics
python -m spacy download en_core_web_sm

Common Usage

API

Valid IDs for get_metric() can be found with get_metric_ids().
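
For example, you can list the available metric IDs (a minimal sketch, assuming get_metric_ids() is importable from the gec_metrics package like get_metric(); the exact list depends on the installed version):

from gec_metrics import get_metric_ids
# Print all valid metric IDs that can be passed to get_metric().
print(get_metric_ids())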

from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
refs = [
    ['This sentence contains a grammatical error .'],
    ['This sentence contains grammatical errors .']
] # (num_refs, num_sents)

# Corpus-level score
# If the metric is reference-free, the argument `references=` is not needed.
corpus_score: float = metric.score_corpus(
    sources=srcs,
    hypotheses=hyps,
    references=refs
)
# Sentence-level scores
sent_scores: list[float] = metric.score_sentence(
    sources=srcs,
    hypotheses=hyps,
    references=refs
)

CLI

  • Because the configuration options differ between metrics, they are described in a YAML file. If no YAML file is provided, the default configuration is used.
  • You can input multiple hypotheses.
gecmetrics-eval \
    --src <sources file> \
    --hyps <hypotheses file 1> <hypotheses file 2> ... \
    --refs <references file 1> <references file 2> ... \
    --metric <metric id> \
    --config config.yaml

# The output will be:
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 1>
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 2>
# ...

A config.yaml populated with the default values can be generated via gecmetrics-gen-config.

gecmetrics-gen-config > config.yaml

Metrics

gec-metrics supports the following metrics.
All arguments in the examples below show their default values.

Reference-based

GLEU+ [Napoles+ 15] [Napoles+ 16]

from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config(
    iter=500,  # The number of iterations 
    n=4,  # max n-gram
    unit='word'  # 'word' or 'char'
))

We also provide a reproduction of the official implementation as GLEUOfficial.
The official implementation ignores n-gram frequency differences when computing the difference set between the source and the reference.

from gec_metrics import get_metric
metric_cls = get_metric('gleuofficial')
metric = metric_cls(metric_cls.Config(
    iter=500,  # The number of iterations 
    n=4,  # max n-gram
    unit='word'  # 'word' or 'char'
))

ERRANT [Felice+ 16] [Bryant+ 17]

from gec_metrics import get_metric
metric_cls = get_metric('errant')
metric = metric_cls(metric_cls.Config(
    beta=0.5,  # The beta for F-beta score
    language='en'  # Language for SpaCy.
))
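
A minimal end-to-end sketch, reusing the sample sentences and the common scoring API from the Common Usage section:

from gec_metrics import get_metric
metric_cls = get_metric('errant')
metric = metric_cls(metric_cls.Config(beta=0.5))
srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
refs = [['This sentence contains a grammatical error .']]
# Corpus-level F0.5 over extracted edits; higher is better.
corpus_score = metric.score_corpus(sources=srcs, hypotheses=hyps, references=refs)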

GoToScorer [Gotou+ 20]

from gec_metrics import get_metric
metric_cls = get_metric('gotoscorer')
metric = metric_cls(metric_cls.Config(
    beta=0.5,  # The beta for F-beta score
    ref_id=0,  # The reference id
    no_weight=False,  # If True, all weights are 1.0
    weight_file=''  # It is required if no_weight=False
))

You can generate a weight file via gecmetrics-gen-gotoscorer-weight.
The output is a JSON file.

gecmetrics-gen-gotoscorer-weight \
    --src <raw text file> \
    --ref <raw text file> \
    --hyp <raw text file 1> <raw text file 2> ... <raw text file N> \
    --out weight.json
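
The generated weight.json can then be passed to the metric (a sketch based on the configuration shown above):

from gec_metrics import get_metric
metric_cls = get_metric('gotoscorer')
metric = metric_cls(metric_cls.Config(
    beta=0.5,
    no_weight=False,
    weight_file='weight.json'  # Produced by gecmetrics-gen-gotoscorer-weight.
))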

PT-ERRANT [Gong+ 22]

from gec_metrics import get_metric
metric_cls = get_metric('pterrant')
weight_model_id = 'bertscore'
weight_model_cls = get_metric(weight_model_id)
metric = metric_cls(metric_cls.Config(
    beta=0.5,
    weight_model_name=weight_model_id,
    weight_model_config=weight_model_cls.Config(  # Optional: you can pass config
        score_type='f',
        rescale_with_baseline=True
    )
))
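
With this configuration, PT-ERRANT weights ERRANT's edits using the BERTScore weight model; scoring then follows the common API (a minimal sketch with the Common Usage sample data):

srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
refs = [['This sentence contains a grammatical error .']]
corpus_score = metric.score_corpus(sources=srcs, hypotheses=hyps, references=refs)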

GREEN [Koyama+ 24]

from gec_metrics import get_metric
metric_cls = get_metric('green')
metric = metric_cls(metric_cls.Config(
    n=4,  # Max n-gram order
    beta=2.0,  # The beta for F-beta
    unit='word'  # 'word' or 'char': word-level or character-level
))

Reference-based (but sources-free)

These metrics are intended to be used as components of PT-{M2, ERRANT}, but they are also exposed via the API.

BERTScore [Zhang+ 19]

The default config follows the default setting of [Gong+ 22].

from gec_metrics import get_metric
metric_cls = get_metric('bertscore')
metric = metric_cls(metric_cls.Config(
    model_type='bert-base-uncased',
    num_layers=None,
    batch_size=64,
    nthreads=4,
    all_layers=False,
    idf=False,
    idf_sents=None,
    lang='en',
    rescale_with_baseline=True,
    baseline_path=None,
    use_fast_tokenizer=False,
    score_type='f'
))

Reference-free

SOME [Yoshimura+ 20]

Download the pre-trained models in advance from here.

from gec_metrics import get_metric
metric_cls = get_metric('some')
metric = metric_cls(metric_cls.Config(
    model_g='gfm-models/grammer',
    model_f='gfm-models/fluency',
    model_m='gfm-models/meaning',
    weight_f=0.55,
    weight_g=0.43,
    weight_m=0.02,
    batch_size=32
))
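
Since SOME is reference-free, no references= argument is needed (see the note in Common Usage). A minimal sketch, assuming the pre-trained models above have been downloaded:

srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
corpus_score = metric.score_corpus(sources=srcs, hypotheses=hyps)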

Scribendi [Islam+ 21]

from gec_metrics import get_metric
metric_cls = get_metric('scribendi')
metric = metric_cls(metric_cls.Config(
    model='gpt2',  # The model name or path to the language model to compute perplexity
    threshold=0.8  # The threshold for the maximum of the token-sort-ratio and the Levenshtein distance ratio
))

IMPARA [Maeda+ 22]

Note that the QE model is an unofficial one that achieves correlations with human evaluation comparable to the original.
By default, the unofficial pretrained QE model [gotutiyan/IMPARA-QE] is used.

from gec_metrics import get_metric
metric_cls = get_metric('impara')
metric = metric_cls(metric_cls.Config(
    model_qe='gotutiyan/IMPARA-QE',  # The model name or path for quality estimation.
    model_se='bert-base-cased',  # The model name or path for similarity estimation.
    threshold=0.9  # The threshold for the similarity score.
))
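
IMPARA is likewise reference-free (a minimal sketch with the Common Usage sample data):

srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
# Sentence-level scores from the QE model; the similarity model and
# threshold above guard against meaning-changing corrections.
sent_scores = metric.score_sentence(sources=srcs, hypotheses=hyps)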

LLM-S, LLM-E [Kobayashi+ 24]

  • llmkobayashi24 is a common prefix.
  • llmkobayashi24hfsent and llmkobayashi24hfedit are the Hugging Face model-based LLM-S and LLM-E.
  • llmkobayashi24openaisent and llmkobayashi24openaiedit are the OpenAI model-based LLM-S and LLM-E.
from gec_metrics import get_metric
metric_cls = get_metric('llmkobayashi24hfsent')
metric = metric_cls(metric_cls.Config(
    model='meta-llama/Llama-2-13b-chat-hf',  # The model name or path for a language model.
))

from gec_metrics import get_metric
metric_cls = get_metric('llmkobayashi24openaisent')
metric = metric_cls(metric_cls.Config(
    model='gpt-4o-mini-2024-07-18',
    organization='<Organization key>',
    api_key='<API key>',
    base_url=None,  # Set this when using an OpenAI-compatible endpoint such as Gemini.
))

Meta Evaluation

To perform meta evaluation easily, we provide meta-evaluation scripts.

Preparation

To obtain the test data and human scores, download the datasets using the provided shell script.

gecmetrics-prepare-meta-eval
# The above is the same as:
# bash src/gec_metrics/meta_eval/prepare_meta_eval.sh

This script creates a meta_eval_data/ directory containing the SEEDA dataset and the CoNLL-14 official submissions.

meta_eval_data/
├── GJG15
│   └── judgments.xml
├── conll14
│   ├── official_submissions
│   │   ├── AMU
│   │   ├── CAMB
│   │   └── ...
│   ├── REF0
│   └── REF1
└── SEEDA
    ├── outputs
    │   ├── all
    │   │   └── ...
    │   └── subset
    │       └── ...
    └── scores
        └── human
            └── ...

Common Usage

gec_metrics.get_meta_eval() supports ['gjg', 'seeda'].

  • .corr_system() performs system-level meta-evaluation.
  • .corr_sentence() performs sentence-level meta-evaluation.
from gec_metrics import get_meta_eval
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta_cls = get_meta_eval('seeda')
meta = meta_cls(
    meta_cls.Config(system='base')
)
# System correlation
results = meta.corr_system(metric)
# Output:
# SEEDASystemCorrOutput(ew_edit=Corr(pearson=0.9007842791853424,
#                                    spearman=0.9300699300699302,
#                                    accuracy=None,
#                                    kendall=None),
#                       ew_sent=Corr(pearson=0.8749437873537543,
#                                    spearman=0.9090909090909092,
#                                    accuracy=None,
#                                    kendall=None),
#                       ts_edit=Corr(pearson=0.9123732084071973,
#                                    spearman=0.9440559440559443,
#                                    accuracy=None,
#                                    kendall=None),
#                       ts_sent=Corr(pearson=0.8856173179230024,
#                                    spearman=0.9020979020979022,
#                                    accuracy=None,
#                                    kendall=None))

# Sentence correlation
results = meta.corr_sentence(metric)
# Output:
# SEEDASentenceCorrOutput(sent=Corr(pearson=None,
#                                   spearman=None,
#                                   accuracy=0.6715701950751519,
#                                   kendall=0.3431403901503038),
#                         edit=Corr(pearson=None,
#                                   spearman=None,
#                                   accuracy=0.6734561494551116,
#                                   kendall=0.3469122989102231))

SEEDA: [Kobayashi+ 24]

from gec_metrics import get_meta_eval
meta_cls = get_meta_eval('seeda')
meta = meta_cls(
    meta_cls.Config(system='base')
)

The .corr_system() returns a gec_metrics.meta_eval.meta_eval.SEEDASystemCorrOutput instance. This is a dataclass containing ew_sent, ew_edit, ts_sent, and ts_edit (see the sketch after the following list).

  • ew_* means the ExpectedWins human evaluation scores are used; ts_* means TrueSkill.
  • *_edit and *_sent correspond to SEEDA-E and SEEDA-S, respectively.
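
For example, a single correlation value can be pulled out of the returned dataclass (a sketch based on the output shown in Common Usage):

results = meta.corr_system(metric)
# Correlations of the metric against the TrueSkill / SEEDA-E human ranking.
print(results.ts_edit.pearson, results.ts_edit.spearman)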

The .corr_sentence() returns a gec_metrics.meta_eval.meta_eval.SEEDASentenceCorrOutput instance. This is a dataclass containing sent and edit.

  • edit and sent correspond to SEEDA-E and SEEDA-S, respectively.

The .window_analysis_system() performs a window analysis.

  • This returns a SEEDAWindowAnalysisSystemCorrOutput instance containing the same attributes as .corr_system(). Each attribute is a dict[tuple, MetaEvalSEEDA.Corr], where the tuple holds the start and end ranks of the human evaluation.
  • window_analysis_plot() can be used for visualization.
    # An example of window-analysis visualization.
    from gec_metrics.metrics import ERRANT
    from gec_metrics.meta_eval import MetaEvalSEEDA
    import matplotlib.pyplot as plt
    metric = ERRANT(ERRANT.Config(beta=0.5))
    meta = MetaEvalSEEDA(
        MetaEvalSEEDA.Config(system='base')
    )
    window_results = meta.window_analysis_system(metric)
    fig = meta.window_analysis_plot(window_results.ts_edit)
    plt.savefig('window-errant.png')

GJG15: [Grundkiewicz+ 15]

This is referred to as GJG15 in the SEEDA paper.
By default, the TrueSkill ranking is used to compute the correlation.

from gec_metrics import get_meta_eval
meta_cls = get_meta_eval('gjg')
meta = meta_cls()
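
As with SEEDA, the metric is passed to the correlation methods (a minimal sketch reusing the GLEU metric from Common Usage):

from gec_metrics import get_meta_eval, get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta = get_meta_eval('gjg')()
# System-level correlation against the GJG15 human evaluation.
results = meta.corr_system(metric)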
