
gec-metrics

A library for evaluation of Grammatical Error Correction.


[API Docs] [Demo]

Install

pip install gec-metrics

To install the latest version:

pip install git+https://github.com/gotutiyan/gec-metrics
python -m spacy download en_core_web_sm

Common Usage

API

Valid IDs for get_metric() can be found with get_metric_ids().
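
For example, you can list the available metric IDs (a minimal sketch, assuming get_metric_ids() is importable from the gec_metrics package like get_metric(); the exact list depends on the installed version):

from gec_metrics import get_metric_ids
# Print all valid metric IDs that can be passed to get_metric().
print(get_metric_ids())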

from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
refs = [
    ['This sentence contains a grammatical error .'],
    ['This sentence contains grammatical errors .']
] # (num_refs, num_sents)

# Corpus-level score
# If the metric is reference-free, the argument `references=` is not needed.
corpus_score: float = metric.score_corpus(
    sources=srcs,
    hypotheses=hyps,
    references=refs
)
# Sentence-level scores
sent_scores: list[float] = metric.score_sentence(
    sources=srcs,
    hypotheses=hyps,
    references=refs
)

CLI

  • Because the configuration options differ between metrics, they are described in a YAML file. If no YAML file is provided, the default configuration is used.
  • You can input multiple hypotheses.
gecmetrics-eval \
    --src <sources file> \
    --hyps <hypotheses file 1> <hypotheses file 2> ... \
    --refs <references file 1> <references file 2> ... \
    --metric <metric id> \
    --config config.yaml

# The output will be:
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 1>
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 2>
# ...

A config.yaml populated with the default values can be generated via gecmetrics-gen-config.

gecmetrics-gen-config > config.yaml

Metrics

gec-metrics supports the following metrics.
All arguments in the examples below show their default values.

Reference-based

GLEU+ [Napoles+ 15] [Napoles+ 16]

from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config(
    iter=500,  # The number of iterations 
    n=4,  # max n-gram
    unit='word'  # 'word' or 'char'
))

We also provide a reproduction of the official implementation as GLEUOfficial.
The official implementation ignores n-gram frequency differences when computing the difference set between the source and the reference.

from gec_metrics import get_metric
metric_cls = get_metric('gleuofficial')
metric = metric_cls(metric_cls.Config(
    iter=500,  # The number of iterations 
    n=4,  # max n-gram
    unit='word'  # 'word' or 'char'
))

ERRANT [Felice+ 16] [Bryant+ 17]

from gec_metrics import get_metric
metric_cls = get_metric('errant')
metric = metric_cls(metric_cls.Config(
    beta=0.5,  # The beta for F-beta score
    language='en'  # Language for SpaCy.
))
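
A minimal end-to-end sketch, reusing the sample sentences and the common scoring API from the Common Usage section:

from gec_metrics import get_metric
metric_cls = get_metric('errant')
metric = metric_cls(metric_cls.Config(beta=0.5))
srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
refs = [['This sentence contains a grammatical error .']]
# Corpus-level F0.5 over extracted edits; higher is better.
corpus_score = metric.score_corpus(sources=srcs, hypotheses=hyps, references=refs)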

GoToScorer [Gotou+ 20]

from gec_metrics import get_metric
metric_cls = get_metric('gotoscorer')
metric = metric_cls(metric_cls.Config(
    beta=0.5,  # The beta for F-beta score
    ref_id=0,  # The reference id
    no_weight=False,  # If True, all weights are 1.0
    weight_file=''  # It is required if no_weight=False
))

You can generate a weight file via gecmetrics-gen-gotoscorer-weight.
The output is a JSON file.

gecmetrics-gen-gotoscorer-weight \
    --src <raw text file> \
    --ref <raw text file> \
    --hyp <raw text file 1> <raw text file 2> ... <raw text file N> \
    --out weight.json
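
The generated weight.json can then be passed to the metric (a sketch based on the configuration shown above):

from gec_metrics import get_metric
metric_cls = get_metric('gotoscorer')
metric = metric_cls(metric_cls.Config(
    beta=0.5,
    no_weight=False,
    weight_file='weight.json'  # Produced by gecmetrics-gen-gotoscorer-weight.
))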

PT-ERRANT [Gong+ 22]

from gec_metrics import get_metric
metric_cls = get_metric('pterrant')
weight_model_id = 'bertscore'
weight_model_cls = get_metric(weight_model_id)
metric = metric_cls(metric_cls.Config(
    beta=0.5,
    weight_model_name=weight_model_id,
    weight_model_config=weight_model_cls.Config(  # Optional: you can pass config
        score_type='f',
        rescale_with_baseline=True
    )
))
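
With this configuration, PT-ERRANT weights ERRANT's edits using the BERTScore weight model; scoring then follows the common API (a minimal sketch with the Common Usage sample data):

srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
refs = [['This sentence contains a grammatical error .']]
corpus_score = metric.score_corpus(sources=srcs, hypotheses=hyps, references=refs)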

GREEN [Koyama+ 24]

from gec_metrics import get_metric
metric_cls = get_metric('green')
metric = metric_cls(metric_cls.Config(
    n=4,  # Max n-gram order
    beta=2.0,  # The beta for F-beta
    unit='word'  # 'word' or 'char': word-level or character-level
))

Reference-based (but sources-free)

These metrics are intended to be used as components of PT-{M2, ERRANT}, but they are also exposed via the API.

BERTScore [Zhang+ 19]

The default config follows the default setting of [Gong+ 22].

from gec_metrics import get_metric
metric_cls = get_metric('bertscore')
metric = metric_cls(metric_cls.Config(
    model_type='bert-base-uncased',
    num_layers=None,
    batch_size=64,
    nthreads=4,
    all_layers=False,
    idf=False,
    idf_sents=None,
    lang='en',
    rescale_with_baseline=True,
    baseline_path=None,
    use_fast_tokenizer=False,
    score_type='f'
))

Reference-free

SOME [Yoshimura+ 20]

Download the pre-trained models in advance from here.

from gec_metrics import get_metric
metric_cls = get_metric('some')
metric = metric_cls(metric_cls.Config(
    model_g='gfm-models/grammer',
    model_f='gfm-models/fluency',
    model_m='gfm-models/meaning',
    weight_f=0.55,
    weight_g=0.43,
    weight_m=0.02,
    batch_size=32
))
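
Since SOME is reference-free, no references= argument is needed (see the note in Common Usage). A minimal sketch, assuming the pre-trained models above have been downloaded:

srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
corpus_score = metric.score_corpus(sources=srcs, hypotheses=hyps)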

Scribendi [Islam+ 21]

from gec_metrics import get_metric
metric_cls = get_metric('scribendi')
metric = metric_cls(metric_cls.Config(
    model='gpt2',  # The model name or path to the language model to compute perplexity
    threshold=0.8  # The threshold for the maximum of the token-sort-ratio and the Levenshtein distance ratio
))

IMPARA [Maeda+ 22]

Note that the QE model is an unofficial one that achieves correlations with human evaluation comparable to the original.
By default, the unofficial pretrained QE model [gotutiyan/IMPARA-QE] is used.

from gec_metrics import get_metric
metric_cls = get_metric('impara')
metric = metric_cls(metric_cls.Config(
    model_qe='gotutiyan/IMPARA-QE',  # The model name or path for quality estimation.
    model_se='bert-base-cased',  # The model name or path for similarity estimation.
    threshold=0.9  # The threshold for the similarity score.
))
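
IMPARA is likewise reference-free (a minimal sketch with the Common Usage sample data):

srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
# Sentence-level scores from the QE model; the similarity model and
# threshold above guard against meaning-changing corrections.
sent_scores = metric.score_sentence(sources=srcs, hypotheses=hyps)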

LLM-S, LLM-E [Kobayashi+ 24]

  • llmkobayashi24 is a common prefix.
  • llmkobayashi24hfsent and llmkobayashi24hfedit are the Hugging Face model-based LLM-S and LLM-E.
  • llmkobayashi24openaisent and llmkobayashi24openaiedit are the OpenAI model-based LLM-S and LLM-E.
from gec_metrics import get_metric
metric_cls = get_metric('llmkobayashi24hfsent')
metric = metric_cls(metric_cls.Config(
    model='meta-llama/Llama-2-13b-chat-hf',  # The model name or path for a language model.
))

from gec_metrics import get_metric
metric_cls = get_metric('llmkobayashi24openaisent')
metric = metric_cls(metric_cls.Config(
    model='gpt-4o-mini-2024-07-18',
    organization='<Organization key>',
    api_key='<API key>',
    base_url=None,  # Set this when using an OpenAI-compatible endpoint such as Gemini.
))

Meta Evaluation

To perform meta evaluation easily, we provide meta-evaluation scripts.

Preparation

To obtain the test data and human scores, download the datasets using the provided shell script.

gecmetrics-prepare-meta-eval
# The above is the same as:
# bash src/gec_metrics/meta_eval/prepare_meta_eval.sh

This script creates a meta_eval_data/ directory containing the SEEDA dataset and the CoNLL-14 official submissions.

meta_eval_data/
├── GJG15
│   └── judgments.xml
├── conll14
│   ├── official_submissions
│   │   ├── AMU
│   │   ├── CAMB
│   │   └── ...
│   ├── REF0
│   └── REF1
└── SEEDA
    ├── outputs
    │   ├── all
    │   │   └── ...
    │   └── subset
    │       └── ...
    └── scores
        └── human
            └── ...

Common Usage

gec_metrics.get_meta_eval() supports ['gjg', 'seeda'].

  • .corr_system() performs system-level meta-evaluation.
  • .corr_sentence() performs sentence-level meta-evaluation.
from gec_metrics import get_meta_eval
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta_cls = get_meta_eval('seeda')
meta = meta_cls(
    meta_cls.Config(system='base')
)
# System correlation
results = meta.corr_system(metric)
# Output:
# SEEDASystemCorrOutput(ew_edit=Corr(pearson=0.9007842791853424,
#                                    spearman=0.9300699300699302,
#                                    accuracy=None,
#                                    kendall=None),
#                       ew_sent=Corr(pearson=0.8749437873537543,
#                                    spearman=0.9090909090909092,
#                                    accuracy=None,
#                                    kendall=None),
#                       ts_edit=Corr(pearson=0.9123732084071973,
#                                    spearman=0.9440559440559443,
#                                    accuracy=None,
#                                    kendall=None),
#                       ts_sent=Corr(pearson=0.8856173179230024,
#                                    spearman=0.9020979020979022,
#                                    accuracy=None,
#                                    kendall=None))

# Sentence correlation
results = meta.corr_sentence(metric)
# Output:
# SEEDASentenceCorrOutput(sent=Corr(pearson=None,
#                                   spearman=None,
#                                   accuracy=0.6715701950751519,
#                                   kendall=0.3431403901503038),
#                         edit=Corr(pearson=None,
#                                   spearman=None,
#                                   accuracy=0.6734561494551116,
#                                   kendall=0.3469122989102231))

SEEDA: [Kobayashi+ 24]

from gec_metrics import get_meta_eval
meta_cls = get_meta_eval('seeda')
meta = meta_cls(
    meta_cls.Config(system='base')
)

The .corr_system() returns a gec_metrics.meta_eval.meta_eval.SEEDASystemCorrOutput instance. This is a dataclass containing ew_sent, ew_edit, ts_sent, and ts_edit (see the sketch after the following list).

  • ew_* means the ExpectedWins human evaluation scores are used; ts_* means TrueSkill.
  • *_edit and *_sent correspond to SEEDA-E and SEEDA-S, respectively.
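
For example, a single correlation value can be pulled out of the returned dataclass (a sketch based on the output shown in Common Usage):

results = meta.corr_system(metric)
# Correlations of the metric against the TrueSkill / SEEDA-E human ranking.
print(results.ts_edit.pearson, results.ts_edit.spearman)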

The .corr_sentence() returns a gec_metrics.meta_eval.meta_eval.SEEDASentenceCorrOutput instance. This is a dataclass containing sent and edit.

  • edit and sent correspond to SEEDA-E and SEEDA-S, respectively.

The .window_analysis_system() performs a window analysis.

  • This returns a SEEDAWindowAnalysisSystemCorrOutput instance containing the same attributes as .corr_system(). Each attribute is a dict[tuple, MetaEvalSEEDA.Corr], where the tuple holds the start and end ranks of the human evaluation.
  • window_analysis_plot() can be used for visualization.
    # An example of window-analysis visualization.
    from gec_metrics.metrics import ERRANT
    from gec_metrics.meta_eval import MetaEvalSEEDA
    import matplotlib.pyplot as plt
    metric = ERRANT(ERRANT.Config(beta=0.5))
    meta = MetaEvalSEEDA(
        MetaEvalSEEDA.Config(system='base')
    )
    window_results = meta.window_analysis_system(metric)
    fig = meta.window_analysis_plot(window_results.ts_edit)
    plt.savefig('window-errant.png')

GJG15: [Grundkiewicz+ 15]

This is referred to as GJG15 in the SEEDA paper.
By default, the TrueSkill ranking is used to compute the correlation.

from gec_metrics import get_meta_eval
meta_cls = get_meta_eval('gjg')
meta = meta_cls()
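
As with SEEDA, the metric is passed to the correlation methods (a minimal sketch reusing the GLEU metric from Common Usage):

from gec_metrics import get_meta_eval, get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta = get_meta_eval('gjg')()
# System-level correlation against the GJG15 human evaluation.
results = meta.corr_system(metric)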
