Evaluation toolkit for neural language generation.
Jury
A simple toolkit for evaluating NLG (Natural Language Generation) outputs, offering various automated metrics. Jury provides a smooth and easy-to-use interface. It uses datasets for the underlying metric computation, so adding a custom metric is as easy as adopting datasets.Metric.
The main advantages that Jury offers are:
- Easy to use for any NLG system.
- Calculate many metrics at once.
- Metric calculations are handled concurrently to save processing time (see the sketch below).
- It supports evaluating multiple predictions.
To see more, check the official Jury blog post.
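As a quick illustration, the snippet below scores a small batch with several metrics in a single call. The run_concurrent flag is an assumption about the constructor signature (check the Jury API for the exact parameter name), so treat this as a sketch rather than the definitive usage.

from jury import Jury

predictions = [["the cat sat on the mat"]]
references = [["the cat is on the mat"]]

# Hypothetical flag: assumed to toggle concurrent metric computation.
scorer = Jury(metrics=["bleu", "meteor", "rouge"], run_concurrent=True)
scores = scorer(predictions=predictions, references=references)
print(scores)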
Installation
Through pip,
pip install jury
or build from source,
git clone https://github.com/obss/jury.git
cd jury
python setup.py install
Usage
API Usage
It takes only two lines of code to evaluate generated outputs.
from jury import Jury
scorer = Jury()
predictions = [
    ["the cat is on the mat", "There is cat playing on the mat"],
    ["Look! a wonderful day."]
]
references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."],
    ["Today is a wonderful day", "The weather outside is wonderful."]
]
scores = scorer(predictions=predictions, references=references)
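The returned scores can then be inspected directly; a minimal sketch, assuming the result behaves like a plain dictionary keyed by metric name:

for metric_name, result in scores.items():
    print(metric_name, result)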
Specify metrics you want to use on instantiation.
scorer = Jury(metrics=["bleu", "meteor"])
scores = scorer(predictions, references)
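The CLI's --reduce_fn option (described below) has an API counterpart; a minimal sketch, assuming the evaluator call accepts a reduce_fn argument controlling how scores over multiple candidate predictions are aggregated:

scorer = Jury(metrics=["bleu", "meteor"])
# Assumption: "max" keeps the best score among multiple candidate predictions.
scores = scorer(predictions=predictions, references=references, reduce_fn="max")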
Using Metrics Standalone
You can directly import metrics from jury.metrics as classes, then instantiate and use them as desired.
from jury.metrics import Bleu
bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references)
Additional parameters can either be specified on compute():
from jury.metrics import Bleu
bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references, max_order=4)
or, alternatively, on instantiation:
bleu = Bleu.construct(compute_kwargs={"max_order": 1})
Note that you can seamlessly access both jury and datasets metrics through jury.load_metric.
import jury
bleu = jury.load_metric("bleu")
bleu_1 = jury.load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1})
# metrics not available in `jury` but in `datasets`
wer = jury.load_metric("wer") # It falls back to `datasets` package with a warning
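Metrics loaded this way can also be handed to the evaluator; a short sketch, assuming Jury accepts a list of metric objects in its metrics argument:

from jury import Jury

metrics = [
    jury.load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1}),
    jury.load_metric("meteor"),
]
scorer = Jury(metrics=metrics)
scores = scorer(predictions=predictions, references=references)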
CLI Usage
You can pass the paths of a predictions file and a references file and get the resulting scores. The two files should be line-aligned, i.e. each line of the predictions file is paired with the corresponding line of the references file.
jury eval --predictions /path/to/predictions.txt --references /path/to/references.txt --reduce_fn max
If you do not want to use the default metrics, specify the ones you want under the metrics key of a JSON config file.
{
    "predictions": "/path/to/predictions.txt",
    "references": "/path/to/references.txt",
    "reduce_fn": "max",
    "metrics": [
        "bleu",
        "meteor"
    ]
}
Then, you can call jury eval with the config argument.
jury eval --config path/to/config.json
Custom Metrics
You can use custom metrics by inheriting from jury.metrics.Metric; the metrics currently implemented in Jury can be found under jury/metrics. For metrics that are not yet supported by Jury, it falls back to the datasets implementation; the metrics available there are listed under datasets/metrics.
Jury itself uses datasets.Metric as a base class to derive its own base class, jury.metrics.Metric. The interface is similar; however, Jury makes the metrics take a unified input type by handling the inputs for each metric, and supports several input types:
- single prediction & single reference
- single prediction & multiple references
- multiple predictions & multiple references
Either base class can be used for a custom metric; however, we strongly recommend jury.metrics.Metric, as it has several advantages, such as supporting computation for all of the input types above and unifying the input type.
from jury.metrics import MetricForTask

class CustomMetric(MetricForTask):
    def _compute_single_pred_single_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_single_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    def _compute_multi_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError
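As a concrete, hypothetical example, here is a sketch of a unigram-precision metric built on the skeleton above. The assumptions that each _compute_* hook returns a plain dict of floats and that reduce_fn aggregates the scores of multiple candidate predictions are ours; adapt the sketch to the actual base class contract.

from jury.metrics import MetricForTask

class UnigramPrecision(MetricForTask):
    # Hypothetical helper: fraction of predicted tokens found in the reference.
    def _unigram_precision(self, prediction, reference):
        pred_tokens = prediction.split()
        ref_tokens = set(reference.split())
        if not pred_tokens:
            return 0.0
        return sum(t in ref_tokens for t in pred_tokens) / len(pred_tokens)

    def _compute_single_pred_single_ref(self, predictions, references, reduce_fn=None, **kwargs):
        scores = [self._unigram_precision(p, r) for p, r in zip(predictions, references)]
        return {"score": sum(scores) / len(scores)}

    def _compute_single_pred_multi_ref(self, predictions, references, reduce_fn=None, **kwargs):
        # Best score against any of the available references.
        scores = [
            max(self._unigram_precision(p, r) for r in refs)
            for p, refs in zip(predictions, references)
        ]
        return {"score": sum(scores) / len(scores)}

    def _compute_multi_pred_multi_ref(self, predictions, references, reduce_fn=None, **kwargs):
        # Assumption: reduce_fn (e.g. max) collapses scores from multiple candidate predictions.
        reduce_fn = reduce_fn or max
        scores = [
            reduce_fn([max(self._unigram_precision(p, r) for r in refs) for p in preds])
            for preds, refs in zip(predictions, references)
        ]
        return {"score": sum(scores) / len(scores)}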
For more details, have a look at the base metric implementation, jury.metrics.Metric.
Contributing
PRs are always welcome :)
Installation
git clone https://github.com/obss/jury.git
cd jury
pip install -e .[dev]
Tests
To run the tests, simply run:
python tests/run_tests.py
Code Style
To check code style,
python tests/run_code_style.py check
To format codebase,
python tests/run_code_style.py format
License
Licensed under the MIT License.