gec-metrics

A library for evaluation of Grammatical Error Correction.

Install

pip install git+https://github.com/gotutiyan/gec-metrics
python -m spacy download en_core_web_sm

Or,

git clone git@github.com:gotutiyan/gec-metrics.git
cd gec-metrics
pip install -e ./
python -m spacy download en_core_web_sm

Common Usage

API

gec_metrics.get_metric() supports ['errant', 'gleu', 'gleuofficial', 'green', 'gotoscorer', 'impara', 'some', 'scribendi'].

from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
refs = [
    ['This sentence contains a grammatical error .'],
    ['This sentence contains grammatical errors .']
] # (num_refs, num_sents)

# Corpus-level score
# If the metric is reference-free, the argument `references=` is not needed.
corpus_score: float = metric.score_corpus(
    sources=srcs,
    hypotheses=hyps,
    references=refs
)
# Sentence-level scores
sent_scores: list[float] = metric.score_sentence(
    sources=srcs,
    hypotheses=hyps,
    references=refs
)

CLI

  • Because the available configuration options differ by metric, they are specified in a YAML file. If no YAML file is provided, the default configuration is used.
  • --metric supports ['errant', 'gleu', 'gleuofficial', 'green', 'gotoscorer', 'impara', 'some', 'scribendi'].
  • You can input multiple hypothesis files (see the concrete example below).
gecmetrics-eval \
    --src <sources file> \
    --hyps <hypotheses file 1> <hypotheses file 2> ... \
    --refs <references file 1> <references file 2> ... \
    --metric <metric id> \
    --config config.yaml

# The output will be:
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 1>
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 2>
# ...
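
For example, with hypothetical file names (placeholders, not files shipped with the library), two system outputs can be scored with GLEU like this:

gecmetrics-eval \
    --src src.txt \
    --hyps hyp_system1.txt hyp_system2.txt \
    --refs ref0.txt ref1.txt \
    --metric gleu \
    --config config.yaml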

A config.yaml populated with the default values can be generated via gecmetrics-gen-config.

gecmetrics-gen-config > config.yaml

Metrics

gec-metrics supports the following metrics.
The arguments shown in the following examples are the default values.

Reference-based

M2

To be added.

GLEU+ [Napoles+ 15] [Napoles+ 16]

from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config(
    iter=500,  # The number of iterations 
    n=4,  # max n-gram
    unit='word'  # 'word' or 'char'
))

We also provide a reproduction of the official implementation as GLEUOfficial.
The official implementation ignores n-gram frequency differences when computing the difference set between source and reference (see the sketch after the following config).

from gec_metrics import get_metric
metric_cls = get_metric('gleuofficial')
metric = metric_cls(metric_cls.Config(
    iter=500,  # The number of iterations 
    n=4,  # max n-gram
    unit='word'  # 'word' or 'char'
))
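
To make this distinction concrete, here is a minimal, self-contained sketch (illustrative only, not the library's code) contrasting a frequency-aware multiset difference with the frequency-ignoring set difference over source and reference unigrams:

from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

src_counts = ngram_counts('the the cat sat'.split(), 1)
ref_counts = ngram_counts('the cat sat'.split(), 1)

# Frequency-aware: the extra occurrence of "the" survives the difference.
print(src_counts - ref_counts)            # Counter({('the',): 1})
# Frequency-ignoring (the official behaviour): every source unigram also
# appears in the reference, so the difference set is empty.
print(set(src_counts) - set(ref_counts))  # set()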

ERRANT [Felice+ 16] [Bryant+ 17]

from gec_metrics import get_metric
metric_cls = get_metric('errant')
metric = metric_cls(metric_cls.Config(
    beta=0.5,  # The beta for F-beta score
    language='en'  # Language for SpaCy.
))
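
beta=0.5 weights precision more heavily than recall. As a reminder, the generic F-beta formula (not code from the library) behaves like this:

def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Generic F-beta score; beta < 1 emphasizes precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

print(f_beta(0.8, 0.4))  # ~0.67: high precision is rewarded
print(f_beta(0.4, 0.8))  # ~0.44: high recall alone helps less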

GoToScorer [Gotou+ 20]

from gec_metrics import get_metric
metric_cls = get_metric('gotoscorer')
metric = metric_cls(metric_cls.Config(
    beta=0.5,  # The beta for F-beta score
    ref_id=0,  # The reference id
    no_weight=False,  # If True, all weights are 1.0
    weight_file=''  # It is required if no_weight=False
))

You can generate a weight file via gecmetrics-gen-gotoscorer-weight.
The output is a JSON file.

gecmetrics-gen-gotoscorer-weight \
    --src <raw text file> \
    --ref <raw text file> \
    --hyp <raw text file 1> <raw text file 2> ... <raw text file N> \
    --out weight.json

PT-M2 [Gong+ 22]

To be added.

PT-ERRANT [Gong+ 22]

from gec_metrics import get_metric
metric_cls = get_metric('pterrant')
weight_model_id = 'bertscore'
weight_model_cls = get_metric(weight_model_id)
metric = metric_cls(metric_cls.Config(
    beta=0.5,
    weight_model_name=weight_model_id,
    weight_model_config=weight_model_cls.Config(  # Optional: you can pass config
        score_type='f',
        rescale_with_baseline=True
    )
))

CLEME [Ye+ 23]

To be added.

GREEN [Koyama+ 24]

from gec_metrics import get_metric
metric_cls = get_metric('green')
metric = metric_cls(metric_cls.Config(
    n=4,  # Max n of ngram
    beta=2.0,  # The beta for F-beta
    unit='word'  # 'word' or 'char'. Choose word-level or character-level
))

Reference-based (but source-free)

These metrics are intended to be used as components of PT-{M2, ERRANT}, but they are also exposed through the API.

BERTScore [Zhang+ 19]

The default config follows the default setting of [Gong+ 22].

from gec_metrics import get_metric
metric_cls = get_metric('bertscore')
metric = metric_cls(metric_cls.Config(
    model_type='bert-base-uncased',
    num_layers=None,
    batch_size=64,
    nthreads=4,
    all_layers=False,
    idf=False,
    idf_sents=None,
    lang='en',
    rescale_with_baseline=True,
    baseline_path=None,
    use_fast_tokenizer=False,
    score_type='f'
))

BARTScore [Yuan+ 21]

To be added.

Reference-free

SOME [Yoshimura+ 20]

Download pre-trained models in advance from here.

from gec_metrics import get_metric
metric_cls = get_metric('some')
metric = metric_cls(metric_cls.Config(
    model_g='gfm-models/grammer',
    model_f='gfm-models/fluency',
    model_m='gfm-models/meaning',
    weight_f=0.55,
    weight_g=0.43,
    weight_m=0.02,
    batch_size=32
))
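
As noted in Common Usage, reference-free metrics use the same scoring API but the references= argument can be omitted. A minimal sketch, assuming the pre-trained models have been downloaded to gfm-models/ as configured above:

srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
# No `references=` argument is needed for reference-free metrics.
corpus_score: float = metric.score_corpus(sources=srcs, hypotheses=hyps)
sent_scores: list[float] = metric.score_sentence(sources=srcs, hypotheses=hyps)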

Scribendi [Islam+ 21]

from gec_metrics import get_metric
metric_cls = get_metric('scribendi')
metric = metric_cls(metric_cls.Config(
    model='gpt2',  # The model name or path to the language model to compute perplexity
    threshold=0.8  # The threshold for the maximum of token-sort-ratio and Levenshtein distance ratio
))
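
The threshold gates how similar a hypothesis must remain to its source. The sketch below is only a conceptual illustration of that gating, using difflib as a stand-in for the token-sort-ratio and Levenshtein-ratio computations; it is not the library's implementation:

from difflib import SequenceMatcher

def char_ratio(a: str, b: str) -> float:
    """Stand-in for a character-level (Levenshtein-like) similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(a: str, b: str) -> float:
    """Stand-in for token-sort-ratio: compare the sentences with their tokens sorted."""
    return char_ratio(' '.join(sorted(a.split())), ' '.join(sorted(b.split())))

src = 'This sentences contain grammatical error .'
hyp = 'This sentence contains a grammatical error .'
similarity = max(char_ratio(src, hyp), token_sort_ratio(src, hyp))
# Hypotheses whose maximum similarity falls below the threshold are treated as too dissimilar.
print(similarity >= 0.8)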

IMPARA [Maeda+ 22]

By default, IMPARA uses an unofficial pretrained QE model: [gotutiyan/IMPARA-QE].
Note that this unofficial model achieves correlation with the human evaluation results comparable to the original.

from gec_metrics import get_metric
metric_cls = get_metric('impara')
metric = metric_cls(metric_cls.Config(
    model_qe='gotutiyan/IMPARA-QE',  # The model name or path for quality estimation.
    model_se='bert-base-cased',  # The model name or path for similarity estimation.
    threshold=0.9  # The threshold for the similarity score.
))

Meta Evaluation

We provide scripts for performing meta-evaluation easily.

Preparation

To download the test data and human scores, run the following shell script.

gecmetrics-prepare-meta-eval
# The above is the same as:
# bash src/gec_metrics/meta_eval/prepare_meta_eval.sh

This script creates the meta_eval_data/ directory, which contains the SEEDA dataset and the CoNLL14 official submissions.

meta_eval_data/
├── GJG15
│   └── judgments.xml
├── conll14
│   ├── official_submissions
│   │   ├── AMU
│   │   ├── CAMB
│   │   ├── ...
│   ├── REF0
│   └── REF1
└── SEEDA
    ├── outputs
    │   ├── all
    │   │   ├── ...
    │   └── subset
    │       ├── ...
    ├── scores
    │   ├── human
    │   │   ├── ...
    │   └── ...

SEEDA: [Kobayashi+ 24]

The examples below use GLEU as the metric, but any other metric based on gec_metrics.metrics.MetricBase can also be used.

  • ew_* uses Expected Wins human evaluation scores and ts_* uses TrueSkill.
  • *_edit and *_sent correspond to SEEDA-E and SEEDA-S, respectively.
from gec_metrics import get_meta_eval
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta_cls = get_meta_eval('seeda')
meta_seeda = meta_cls(
    meta_cls.Config(system='base')
)
# System correlation
results = meta_seeda.corr_system(metric)
# Output:
# SEEDASystemCorrOutput(ew_edit=Corr(pearson=0.9007842791853424,
#                                    spearman=0.9300699300699302,
#                                    accuracy=None,
#                                    kendall=None),
#                       ew_sent=Corr(pearson=0.8749437873537543,
#                                    spearman=0.9090909090909092,
#                                    accuracy=None,
#                                    kendall=None),
#                       ts_edit=Corr(pearson=0.9123732084071973,
#                                    spearman=0.9440559440559443,
#                                    accuracy=None,
#                                    kendall=None),
#                       ts_sent=Corr(pearson=0.8856173179230024,
#                                    spearman=0.9020979020979022,
#                                    accuracy=None,
#                                    kendall=None))

# Sentence correlation
results = meta_seeda.corr_sentence(metric)
# Output:
# SEEDASentenceCorrOutput(sent=Corr(pearson=None,
#                                   spearman=None,
#                                   accuracy=0.6715701950751519,
#                                   kendall=0.3431403901503038),
#                         edit=Corr(pearson=None,
#                                   spearman=None,
#                                   accuracy=0.6734561494551116,
#                                   kendall=0.3469122989102231))

Window analysis can be performed with window_analysis_system().

  • ew_* uses Expected Wins human evaluation scores and ts_* uses TrueSkill.
  • *_edit and *_sent correspond to SEEDA-E and SEEDA-S, respectively.
  • Each is a dictionary: {(start_rank, end_rank): MetaEvalSEEDA.Corr}.
from gec_metrics import get_meta_eval
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta_cls = get_meta_eval('seeda')
meta_seeda = meta_cls(
    meta_cls.Config(system='base')
)
results = meta_seeda.window_analysis_system(metric, window=4)
assert results.ew_edit is not None
assert results.ew_sent is not None
assert results.ts_edit is not None
assert results.ts_sent is not None

for k, v in results.ts_sent.items():
    print(f'From {k[0]} to {k[1]}: {v.pearson=}, {v.spearman=}')

GJG15: [Grundkiewicz+ 15]

This dataset is referred to as GJG15 in the SEEDA paper.
Basically, the TrueSkill ranking is used to compute the correlation.

from gec_metrics import get_meta_eval
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta_cls = get_meta_eval('gjg')
meta_gjg = meta_cls(meta_cls.Config())
# System correlation
results = meta_gjg.corr_system(metric)
# Output:
# GJGSystemCorrOutput(ew=Corr(pearson=0.601290729078602,
#                             spearman=0.5934065934065934,
#                             accuracy=None,
#                             kendall=None),
#                     ts=Corr(pearson=0.6633835644883472,
#                             spearman=0.6868131868131868,
#                             accuracy=None,
#                             kendall=None))

results = meta_gjg.corr_sentence(metric)
# Output:
# GJGSentenceCorrOutput(corr=Corr(pearson=None,
#                                 spearman=None,
#                                 accuracy=0.6729157079690282,
#                                 kendall=0.34583141593805644))
