gec-metrics
A library for the evaluation of Grammatical Error Correction (GEC).
Install
pip install gec-metrics
To install the latest version:
pip install git+https://github.com/gotutiyan/gec-metrics
python -m spacy download en_core_web_sm
Common Usage
API
Valid IDs for get_metric() can be found with get_metric_ids().
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
srcs = ['This sentences contain grammatical error .']
hyps = ['This sentence contains a grammatical error .']
refs = [
['This sentence contains a grammatical error .'],
['This sentence contains grammatical errors .']
] # (num_refs, num_sents)
# Corpus-level score
# If the metric is reference-free, the argument `references=` is not needed.
corpus_score: float = metric.score_corpus(
sources=srcs,
hypotheses=hyps,
references=refs
)
# Sentence-level scores
sent_scores: list[float] = metric.score_sentence(
sources=srcs,
hypotheses=hyps,
references=refs
)
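The valid IDs can also be listed programmatically via get_metric_ids() mentioned above (the printed list below is illustrative and depends on the installed version):
from gec_metrics import get_metric_ids
# List all valid metric IDs for get_metric().
print(get_metric_ids())
# e.g. ['gleu', 'gleuofficial', 'errant', ...]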
CLI
- Because the configuration options differ depending on the metric, they are described and passed in a YAML file. If no YAML file is provided, the default configuration is used.
- You can input multiple hypotheses.
gecmetrics-eval \
--src <sources file> \
--hyps <hypotheses file 1> <hypotheses file 2> ... \
--refs <references file 1> <references file 2> ... \
--metric <metric id> \
--config config.yaml
# The output will be:
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 1>
# Score=XXXXX | Metric=<metric id> | hyp_file=<hypotheses file 2>
# ...
The config.yaml with default values can be generated via gecmetrics-gen-config.
gecmetrics-gen-config > config.yaml
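The generated file lists each metric's Config fields with their default values. As an illustration, an excerpt for GLEU might look like the following (the exact key layout is an assumption; check the file produced by gecmetrics-gen-config):
gleu:        # assumed layout: one top-level key per metric ID
  iter: 500
  n: 4
  unit: word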
Metrics
gec-metrics supports the following metrics.
All arguments in the following examples are shown with their default values.
Reference-based
GLEU+ [Napoles+ 15] [Napoles+ 16]
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config(
iter=500, # The number of iterations
n=4, # max n-gram
unit='word' # 'word' or 'char'
))
We also provide a reproduction of the official implementation as GLEUOfficial.
The official implementation ignores n-gram frequency differences when calculating the difference set between the source and the reference.
from gec_metrics import get_metric
metric_cls = get_metric('gleuofficial')
metric = metric_cls(metric_cls.Config(
iter=500, # The number of iterations
n=4, # max n-gram
unit='word' # 'word' or 'char'
))
ERRANT [Felice+ 16] [Bryant+ 17]
from gec_metrics import get_metric
metric_cls = get_metric('errant')
metric = metric_cls(metric_cls.Config(
beta=0.5, # The beta for F-beta score
language='en' # The language for spaCy.
))
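The configured metric is then scored through the common API; a minimal sketch reusing the toy sentences from Common Usage:
# Corpus-level score (F0.5 by default).
score = metric.score_corpus(
    sources=['This sentences contain grammatical error .'],
    hypotheses=['This sentence contains a grammatical error .'],
    references=[['This sentence contains a grammatical error .']]
)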
GoToScorer [Gotou+ 20]
from gec_metrics import get_metric
metric_cls = get_metric('gotoscorer')
metric = metric_cls(metric_cls.Config(
beta=0.5, # The beta for F-beta score
ref_id=0, # The reference id
no_weight=False, # If True, all weights are 1.0
weight_file='' # It is required if no_weight=False
))
You can generate a weight file via gecmetrics-gen-gotoscorer-weight.
The output is a JSON file.
gecmetrics-gen-gotoscorer-weight \
--src <raw text file> \
--ref <raw text file> \
--hyp <raw text file 1> <raw text file 2> ... <raw text file N> \
--out weight.json
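The generated weight.json can then be passed to the metric via the weight_file option:
from gec_metrics import get_metric
metric_cls = get_metric('gotoscorer')
metric = metric_cls(metric_cls.Config(
    beta=0.5,
    weight_file='weight.json'
))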
PT-ERRANT [Gong+ 22]
from gec_metrics import get_metric
metric_cls = get_metric('pterrant')
weight_model_id = 'bertscore'
weight_model_cls = get_metric(weight_model_id)
metric = metric_cls(metric_cls.Config(
beta=0.5,
weight_model_name=weight_model_id,
weight_model_config=weight_model_cls.Config( # Optional: you can pass config
score_type='f',
rescale_with_baseline=True
)
))
GREEN [Koyama+ 24]
from gec_metrics import get_metric
metric_cls = get_metric('green')
metric = metric_cls(metric_cls.Config(
n=4, # max n-gram
beta=2.0, # The beta for F-beta score
unit='word' # 'word' or 'char' for word-level or character-level
))
Reference-based (but source-free)
These metrics are intended to be used as components of PT-{M2, ERRANT}, but they are also exposed through the API.
BERTScore [Zhang+ 19]
The default config follows the default setting of [Gong+ 22].
from gec_metrics import get_metric
metric_cls = get_metric('bertscore')
metric = metric_cls(metric_cls.Config(
model_type='bert-base-uncased',
num_layers=None,
batch_size=64,
nthreads=4,
all_layers=False,
idf=False,
idf_sents=None,
lang='en',
rescale_with_baseline=True,
baseline_path=None,
use_fast_tokenizer=False,
score_type='f'
))
Reference-free
SOME [Yoshimura+ 20]
Download pre-trained models in advance from here.
from gec_metrics import get_metric
metric_cls = get_metric('some')
metric = metric_cls(metric_cls.Config(
model_g='gfm-models/grammer',
model_f='gfm-models/fluency',
model_m='gfm-models/meaning',
weight_f=0.55,
weight_g=0.43,
weight_m=0.02,
batch_size=32
))
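Since SOME is reference-free, the references= argument can be omitted when scoring (see Common Usage):
# Sentence-level scores; no references are needed.
sent_scores = metric.score_sentence(
    sources=['This sentences contain grammatical error .'],
    hypotheses=['This sentence contains a grammatical error .'],
)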
Scribendi [Islam+ 21]
from gec_metrics import get_metric
metric_cls = get_metric('scribendi')
metric = metric_cls(metric_cls.Config(
model='gpt2', # The model name or path of the language model used to compute perplexity
threshold=0.8 # The threshold for the maximum of the token-sort-ratio and the Levenshtein distance ratio
))
IMPARA [Maeda+ 22]
By default, IMPARA uses an unofficial pre-trained QE model, [gotutiyan/IMPARA-QE], which achieves correlation with the human evaluation results comparable to the original.
from gec_metrics import get_metric
metric_cls = get_metric('impara')
metric = metric_cls(metric_cls.Config(
model_qe='gotutiyan/IMPARA-QE', # The model name or path for quality estimation.
model_se='bert-base-cased', # The model name or path for similarity estimation.
threshold=0.9 # The threshold for the similarity score.
))
LLM-S, LLM-E [Kobayashi+ 24]
llmkobayashi24 is a common prefix. llmkobayashi24hfsent and llmkobayashi24hfedit are the Hugging Face model-based LLM-S and LLM-E, and llmkobayashi24openaisent and llmkobayashi24openaiedit are the OpenAI model-based LLM-S and LLM-E.
from gec_metrics import get_metric
metric_cls = get_metric('llmkobayashi24hfsent')
metric = metric_cls(metric_cls.Config(
model='meta-llama/Llama-2-13b-chat-hf', # The model name or path for a language model.
))
from gec_metrics import get_metric
metric_cls = get_metric('llmkobayashi24openaisent')
metric = metric_cls(metric_cls.Config(
model='gpt-4o-mini-2024-07-18',
organization='<Organization key>',
api_key='<API key>',
base_url=None, # Set this when using Gemini
))
Meta Evaluation
To perform meta-evaluation easily, we provide meta-evaluation scripts.
Preparation
To download the test data and human scores, run the following script:
gecmetrics-prepare-meta-eval
# The above is the same as:
# bash src/gec_metrics/meta_eval/prepare_meta_eval.sh
This script creates the meta_eval_data/ directory, which contains the SEEDA dataset, the CoNLL-14 official submissions, and the GJG15 judgments:
meta_eval_data/
├── GJG15
│   └── judgments.xml
├── conll14
│   ├── official_submissions
│   │   ├── AMU
│   │   ├── CAMB
│   │   └── ...
│   ├── REF0
│   └── REF1
└── SEEDA
    ├── outputs
    │   ├── all
    │   │   └── ...
    │   └── subset
    │       └── ...
    └── scores
        ├── human
        │   └── ...
        └── ...
Common Usage
gec_metrics.get_meta_eval() supports ['gjg', 'seeda'].
- .corr_system() performs system-level meta-evaluation.
- .corr_sentence() performs sentence-level meta-evaluation.
from gec_metrics import get_meta_eval
from gec_metrics import get_metric
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta_cls = get_meta_eval('seeda')
meta = meta_cls(
meta_cls.Config(system='base')
)
# System correlation
results = meta.corr_system(metric)
# Output:
# SEEDASystemCorrOutput(ew_edit=Corr(pearson=0.9007842791853424,
# spearman=0.9300699300699302,
# accuracy=None,
# kendall=None),
# ew_sent=Corr(pearson=0.8749437873537543,
# spearman=0.9090909090909092,
# accuracy=None,
# kendall=None),
# ts_edit=Corr(pearson=0.9123732084071973,
# spearman=0.9440559440559443,
# accuracy=None,
# kendall=None),
# ts_sent=Corr(pearson=0.8856173179230024,
# spearman=0.9020979020979022,
# accuracy=None,
# kendall=None))
# Sentence correlation
results = meta.corr_sentence(metric)
# Output:
# SEEDASentenceCorrOutput(sent=Corr(pearson=None,
# spearman=None,
# accuracy=0.6715701950751519,
# kendall=0.3431403901503038),
# edit=Corr(pearson=None,
# spearman=None,
# accuracy=0.6734561494551116,
# kendall=0.3469122989102231))
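The individual correlations can be read directly from the returned dataclasses:
# Fields follow the outputs shown above.
print(results.edit.accuracy)        # 0.6734...
sys_results = meta.corr_system(metric)
print(sys_results.ts_edit.spearman) # 0.9440...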
SEEDA: [Kobayashi+ 24]
from gec_metrics import get_meta_eval
meta_cls = get_meta_eval('seeda')
meta = meta_cls(
meta_cls.Config(system='base')
)
The .corr_system() method returns a gec_metrics.meta_eval.meta_eval.SEEDASystemCorrOutput instance. This is a dataclass containing ew_sent, ew_edit, ts_sent, and ts_edit.
ew_* means using Expected Wins human evaluation scores and ts_* means using TrueSkill. *_edit and *_sent mean SEEDA-E and SEEDA-S, respectively.
The .corr_sentence() method returns a gec_metrics.meta_eval.meta_eval.SEEDASentenceCorrOutput instance. This is a dataclass containing sent and edit.
edit and sent mean SEEDA-E and SEEDA-S, respectively.
The window_analysis_system() performs the window analysis.
- This returns a SEEDAWindowAnalysisSystemCorrOutput instance containing the same attributes as .corr_system(). Each attribute is a dict[tuple, MetaEvalSEEDA.Corr], where the tuple holds the start and end ranks of the human evaluation.
- window_analysis_plot() can be used for visualization, as in the example below.
# An example of window-analysis visualization.
from gec_metrics.metrics import ERRANT
from gec_metrics.meta_eval import MetaEvalSEEDA
import matplotlib.pyplot as plt
metric = ERRANT(ERRANT.Config(beta=0.5))
meta = MetaEvalSEEDA(
    MetaEvalSEEDA.Config(system='base')
)
window_results = meta.window_analysis_system(metric)
fig = meta.window_analysis_plot(window_results.ts_edit)
plt.savefig('window-errant.png')
GJG15: [Grundkiewicz+ 15]
This is referred to as GJG15 in the SEEDA paper.
Basically, the TrueSkill ranking is used to compute the correlation.
from gec_metrics import get_meta_eval
meta_cls = get_meta_eval('gjg')
meta = meta_cls()
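The interface is the same as for SEEDA; for example, system-level correlation for a metric:
from gec_metrics import get_metric, get_meta_eval
metric_cls = get_metric('gleu')
metric = metric_cls(metric_cls.Config())
meta = get_meta_eval('gjg')()
results = meta.corr_system(metric)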