Evaluation toolkit for neural language generation.

Jury

A simple toolkit for evaluating NLP experiments that offers a range of automated metrics. Jury provides a smooth, easy-to-use interface. It uses the datasets package for the underlying metric computation, so adding a custom metric is as easy as extending the proper class.

The main advantages that Jury offers are:

  • Easy to use for any NLG system.
  • Calculates many metrics at once.
  • Metric calculations are handled concurrently to save processing time.
  • Supports evaluating multiple predictions seamlessly.

To see more, check the official Jury blog post.
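
To illustrate what computing several metrics concurrently means, here is a minimal sketch that uses only the standard library. It is not Jury's implementation: the two toy metrics and the helper evaluate_concurrently are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def exact_match(predictions, references):
    """Toy metric: fraction of predictions identical to their reference."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)


def length_ratio(predictions, references):
    """Toy metric: total predicted tokens divided by total reference tokens."""
    pred_len = sum(len(p.split()) for p in predictions)
    ref_len = sum(len(r.split()) for r in references)
    return pred_len / ref_len


def evaluate_concurrently(metrics, predictions, references):
    """Submit every metric to a worker pool and collect the results by name."""
    with ThreadPoolExecutor(max_workers=len(metrics)) as pool:
        futures = {
            name: pool.submit(fn, predictions, references)
            for name, fn in metrics.items()
        }
        return {name: future.result() for name, future in futures.items()}


toy_metrics = {"exact_match": exact_match, "length_ratio": length_ratio}
toy_scores = evaluate_concurrently(
    toy_metrics,
    predictions=["the cat sat", "a wonderful day"],
    references=["the cat sat", "today is a wonderful day"],
)
print(toy_scores)
```

Each metric runs independently of the others, which is what makes this kind of evaluation easy to parallelize.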

Installation

Through pip,

pip install jury

or build from source,

git clone https://github.com/obss/jury.git
cd jury
python setup.py install

Usage

API Usage

It takes only two lines of code to evaluate generated outputs.

from jury import Jury

scorer = Jury()
predictions = [
    ["the cat is on the mat", "There is cat playing on the mat"], 
    ["Look!    a wonderful day."]
]
references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."], 
    ["Today is a wonderful day", "The weather outside is wonderful."]
]
scores = scorer(predictions=predictions, references=references)

Specify the metrics you want to use on instantiation.

scorer = Jury(metrics=["bleu", "meteor"])
scores = scorer(predictions, references)

Use of Metrics standalone

You can directly import metrics from jury.metrics as classes, then instantiate and use them as desired.

from jury.metrics import Bleu

bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references)

Additional parameters can be specified on compute():

from jury.metrics import Bleu

bleu = Bleu.construct()
score = bleu.compute(predictions=predictions, references=references, max_order=4)

Alternatively, they can be specified on instantiation:

from jury.metrics import Bleu

bleu = Bleu.construct(compute_kwargs={"max_order": 1})
score = bleu.compute(predictions=predictions, references=references)

Note that you can seamlessly access both jury and datasets metrics through jury.load_metric.

import jury

bleu = jury.load_metric("bleu")
bleu_1 = jury.load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1})
# metrics not available in `jury` but in `datasets`
wer = jury.load_metric("wer") # It falls back to `datasets` package with a warning

CLI Usage

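You can specify the paths of a predictions file and a references file to get the resulting scores. The two files must be aligned line by line, so that each line of the predictions file is paired with the same line of the references file. Optionally, you can provide a reduce function and an export path to which the results are written.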

jury eval --predictions /path/to/predictions.txt --references /path/to/references.txt --reduce_fn max --export /path/to/export.txt
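
The line-by-line pairing described above can be sketched as follows. This is an illustration only, not Jury's file reader: read_paired_lines is a hypothetical helper, and the sketch assumes exactly one prediction and one reference per line, since the encoding of multiple references on a line is not described here.

```python
import tempfile
from pathlib import Path


def read_paired_lines(predictions_path, references_path):
    """Return (prediction, reference) tuples, one per line; fail on a length mismatch."""
    preds = Path(predictions_path).read_text(encoding="utf-8").splitlines()
    refs = Path(references_path).read_text(encoding="utf-8").splitlines()
    if len(preds) != len(refs):
        raise ValueError(f"{len(preds)} predictions but {len(refs)} references")
    return list(zip(preds, refs))


# Write two small aligned files to a temporary folder and pair them.
with tempfile.TemporaryDirectory() as tmp:
    pred_file = Path(tmp) / "predictions.txt"
    ref_file = Path(tmp) / "references.txt"
    pred_file.write_text(
        "the cat is on the mat\nLook! a wonderful day.\n", encoding="utf-8"
    )
    ref_file.write_text(
        "the cat is playing on the mat.\nToday is a wonderful day\n", encoding="utf-8"
    )
    pairs = read_paired_lines(pred_file, ref_file)

print(pairs)
```

Checking the line counts before evaluation is a cheap way to catch misaligned files early.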

You can also provide a predictions folder and a references folder to evaluate multiple experiments. In this setup, however, the prediction and reference files that should be evaluated as a pair must have the same file name; files are paired by these common names.

jury eval --predictions /path/to/predictions_folder --references /path/to/references_folder --reduce_fn max --export /path/to/export.txt
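
To check in advance which files would be paired by name, a small standalone sketch such as the one below may help. It is not part of Jury: pair_by_file_name is a hypothetical helper, and silently skipping unmatched files is an assumption of this sketch, not documented Jury behavior.

```python
import tempfile
from pathlib import Path


def pair_by_file_name(predictions_dir, references_dir):
    """Pair files that share the same name in both folders (assumed pairing rule)."""
    pred_files = {p.name: p for p in Path(predictions_dir).iterdir() if p.is_file()}
    ref_files = {p.name: p for p in Path(references_dir).iterdir() if p.is_file()}
    common = sorted(pred_files.keys() & ref_files.keys())
    return [(pred_files[name], ref_files[name]) for name in common]


# Build two temporary folders: only exp1.txt exists in both.
with tempfile.TemporaryDirectory() as tmp:
    pred_dir = Path(tmp) / "predictions_folder"
    ref_dir = Path(tmp) / "references_folder"
    pred_dir.mkdir()
    ref_dir.mkdir()
    for name in ("exp1.txt", "exp2.txt"):
        (pred_dir / name).write_text("a prediction\n", encoding="utf-8")
    for name in ("exp1.txt", "other.txt"):
        (ref_dir / name).write_text("a reference\n", encoding="utf-8")
    paired_names = [pred.name for pred, _ in pair_by_file_name(pred_dir, ref_dir)]

print(paired_names)
```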

If you want to specify metrics rather than use the defaults, list them under the metrics key of a config file (JSON).

{
  "predictions": "/path/to/predictions.txt",
  "references": "/path/to/references.txt",
  "reduce_fn": "max",
  "metrics": [
    "bleu",
    "meteor"
  ]
}

Then, you can call jury eval with the config argument.

jury eval --config path/to/config.json
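
If you prefer to generate the config file from a script, the sketch below writes the same keys with the standard json module and builds the matching command line. It only prepares the file and the command string; it does not invoke jury, and the temporary location is an arbitrary choice for the example.

```python
import json
import shlex
import tempfile
from pathlib import Path

# Same keys as in the JSON config shown above.
config = {
    "predictions": "/path/to/predictions.txt",
    "references": "/path/to/references.txt",
    "reduce_fn": "max",
    "metrics": ["bleu", "meteor"],
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    # Read the file back to confirm it is valid JSON with the expected content.
    loaded = json.loads(config_path.read_text(encoding="utf-8"))
    command = ["jury", "eval", "--config", str(config_path)]
    command_line = shlex.join(command)

print(command_line)
```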

Custom Metrics

You can create custom metrics by inheriting from jury.metrics.Metric; the metrics currently implemented in Jury are listed under jury/metrics. For metrics that it does not currently support, Jury falls back to the datasets implementation; the metrics available in datasets are listed under datasets/metrics.

Jury itself uses datasets.Metric as a base class from which it derives its own base class, jury.metrics.Metric. The interface is similar; however, Jury makes metrics take a unified input type by handling the inputs for each metric, and it supports several input types:

  • single prediction & single reference
  • single prediction & multiple references
  • multiple predictions & multiple references
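
The idea of a unified input type can be illustrated with a small standalone sketch. It is not Jury's internal logic: to_nested and describe_input are hypothetical helpers, and classifying any item with more than one prediction as the multiple-predictions case is an assumption of this example.

```python
def to_nested(items):
    """Wrap each plain string in a list so every item is a list of strings."""
    return [[item] if isinstance(item, str) else list(item) for item in items]


def describe_input(predictions, references):
    """Name which of the three input types a predictions/references pair is."""
    multi_pred = any(not isinstance(p, str) and len(p) > 1 for p in predictions)
    multi_ref = any(not isinstance(r, str) and len(r) > 1 for r in references)
    if multi_pred:
        return "multiple predictions & multiple references"
    if multi_ref:
        return "single prediction & multiple references"
    return "single prediction & single reference"


example_predictions = [
    ["the cat is on the mat", "There is cat playing on the mat"],
    ["Look! a wonderful day."],
]
example_references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."],
    ["Today is a wonderful day", "The weather outside is wonderful."],
]
input_type = describe_input(example_predictions, example_references)
print(input_type)
print(to_nested(["a single prediction", ["two", "candidates"]]))
```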

Both base classes can be used for a custom metric; however, we strongly recommend jury.metrics.Metric, as it has several advantages, such as supporting computations for the input types above and unifying the input type.

from jury.metrics import MetricForTask

class CustomMetric(MetricForTask):
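    # Called when each item has one prediction and one reference.
    def _compute_single_pred_single_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    # Called when each item has one prediction and several references.
    def _compute_single_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError

    # Called when each item has several predictions and several references.
    def _compute_multi_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        raise NotImplementedError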

For more details, have a look at the base metric implementation, jury.metrics.Metric.
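
As a rough idea of what the three methods might compute, here is a toy exact-match metric written as a plain class so that it runs without Jury installed. Several things are assumptions of this sketch and are not taken from Jury: the class name ExactMatchSketch, the {"score": ...} return format, the use of max when reduce_fn is None, and the exact shape of the inputs each method receives. A real metric would subclass MetricForTask and may need further methods required by the base class.

```python
class ExactMatchSketch:
    """Standalone stand-in; a real metric would subclass MetricForTask."""

    def _compute_single_pred_single_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        # Assumed input: one prediction string and one reference string per item.
        hits = [float(p == r) for p, r in zip(predictions, references)]
        return {"score": sum(hits) / len(hits)}

    def _compute_single_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        # Assumed input: one prediction string and a list of references per item.
        reduce_fn = reduce_fn or max  # assumption: default to the best match
        per_item = [
            reduce_fn([float(pred == ref) for ref in refs])
            for pred, refs in zip(predictions, references)
        ]
        return {"score": sum(per_item) / len(per_item)}

    def _compute_multi_pred_multi_ref(
        self, predictions, references, reduce_fn=None, **kwargs
    ):
        # Assumed input: a list of predictions and a list of references per item.
        reduce_fn = reduce_fn or max  # assumption: default to the best match
        per_item = []
        for preds, refs in zip(predictions, references):
            # Score each candidate against all references, then reduce over candidates.
            pred_scores = [reduce_fn([float(p == r) for r in refs]) for p in preds]
            per_item.append(reduce_fn(pred_scores))
        return {"score": sum(per_item) / len(per_item)}


metric = ExactMatchSketch()
result = metric._compute_multi_pred_multi_ref(
    predictions=[["a cat", "the cat"], ["hello"]],
    references=[["the cat", "one cat"], ["hi", "hey"]],
)
print(result)
```

In this example the first item matches one of its references and the second does not, so the reduced score is the mean of 1.0 and 0.0.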

Contributing

PRs are always welcome :)

Installation

git clone https://github.com/obss/jury.git
cd jury
pip install -e .[dev]

Tests

To run the tests, simply run:

python tests/run_tests.py

Code Style

To check code style,

python tests/run_code_style.py check

To format codebase,

python tests/run_code_style.py format

License

Licensed under the MIT License.
