Compute metrics for machine translation

These details have not been verified by PyPI

Project description

cyzil

Description

Cyzil provides tools for quick, in-depth analysis of sequence generation models such as machine translation models. It contains a Cython module for fast computation of standard metrics. It currently covers edit distance (Levenshtein distance) and the BLEU score proposed by Papineni et al. (2002).

Command-line tool
Python API
User guide

Requirements

Python 3.7+

Command-line tool

User Guide

With cyzil, you can compute BLEU score and edit distance from your terminal. All you have to do is specify the path to a reference file (correct translations) and a candidate file (translations generated by a machine translation model). The reference and candidate sentences should be stored one per line, e.g. sentence 1\n sentence 2\n ... sentence k\n, so that each line of the candidate file pairs with the same line of the reference file. Please see examples here. Sentences can be tokenized by white space or with the nltk tokenizer; by default, they are tokenized by white space.
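The file layout described above can be sketched in Python as follows. This is an illustration only, not cyzil's own loader: `parse_parallel` and `read_parallel` are hypothetical helper names, and the sketch covers only white-space tokenization (the default), not the nltk option.

```python
from pathlib import Path
from typing import List, Tuple

Tokens = List[str]


def parse_parallel(ref_text: str, can_text: str) -> Tuple[List[Tokens], List[Tokens]]:
    """Split two line-aligned texts into white-space tokenized sentences."""
    ref_lines = ref_text.splitlines()
    can_lines = can_text.splitlines()
    # Line i of the candidate text must pair with line i of the reference text.
    if len(ref_lines) != len(can_lines):
        raise ValueError(
            f"line count mismatch: {len(ref_lines)} references, "
            f"{len(can_lines)} candidates"
        )
    references = [line.split() for line in ref_lines]
    candidates = [line.split() for line in can_lines]
    return references, candidates


def read_parallel(ref_path: str, can_path: str) -> Tuple[List[Tokens], List[Tokens]]:
    """Read a reference file and a candidate file (hypothetical helper)."""
    ref_text = Path(ref_path).read_text(encoding="utf-8")
    can_text = Path(can_path).read_text(encoding="utf-8")
    return parse_parallel(ref_text, can_text)
```

The resulting lists of token lists have the same shape as the arguments used in the Python API examples, e.g. `references, candidates = read_parallel("data/ref.en", "data/can.en")`.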

Usage

The following shows an example for corpus-level BLEU score. It prints the precision, the brevity penalty, and the BLEU score.

> cyzil-bleu-corpus \
    --reference data/ref.en \
    --candidate data/can.en \
    --ngram 4 \
    --tokenizer nltk
[0.9041149616241455, 1.0, 0.9041149616241455]
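To make the three printed numbers concrete, here is a minimal Python sketch of corpus-level BLEU as defined by Papineni et al. (2002): clipped n-gram precisions combined by a geometric mean, multiplied by a brevity penalty. It is not cyzil's Cython implementation. The names `ngram_counts` and `bleu_corpus_sketch` are hypothetical, and the sketch assumes one reference per candidate, uniform n-gram weights, and no smoothing. The sample output above is consistent with BLEU being the precision multiplied by the brevity penalty, which is what the sketch returns.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a tokenized sentence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_corpus_sketch(
    references: List[Sequence[str]],
    candidates: List[Sequence[str]],
    max_ngram: int = 4,
) -> Tuple[float, float, float]:
    """Return (precision, brevity penalty, BLEU) for a corpus (sketch only)."""
    matches = [0] * max_ngram
    totals = [0] * max_ngram
    ref_len = can_len = 0
    for ref, can in zip(references, candidates):
        ref_len += len(ref)
        can_len += len(can)
        for n in range(1, max_ngram + 1):
            ref_counts = ngram_counts(ref, n)
            can_counts = ngram_counts(can, n)
            # Clip each candidate n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in can_counts.items())
            totals[n - 1] += sum(can_counts.values())

    if min(matches) == 0:
        # Without smoothing, one empty n-gram order zeroes the geometric mean.
        precision = 0.0
    else:
        log_sum = sum(math.log(m / t) for m, t in zip(matches, totals))
        precision = math.exp(log_sum / max_ngram)

    if can_len == 0:
        brevity_penalty = 0.0
    elif can_len > ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - ref_len / can_len)

    return precision, brevity_penalty, precision * brevity_penalty
```

A candidate that is identical to its reference scores 1.0 on all three values; a candidate that is a correct but shorter prefix keeps a precision of 1.0 and is penalized only through the brevity penalty.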

The following is an example for corpus-level edit distance.

> cyzil-edit-distance-corpus \
    --reference data/ref.en \
    --candidate data/can.en \
    --tokenizer nltk
[0.5, 0.04545454680919647]

Computing Score for Each Pair

Cyzil also computes the metric for each reference-candidate pair to allow in-depth analysis of sequence generation models. The output can be stored in a csv file, where each row corresponds to one reference-candidate pair.

Here is an example for BLEU score. The first column of the output is the precision, the second is the brevity penalty and the last column is the BLEU score.

> cyzil-bleu-points \
    --reference data/ref.en \
    --candidate data/can.en \
    --ngram 4 \
    --tokenizer nltk \
    --output output.csv
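Once the per-pair scores are on disk, a short script can surface the weakest translations. The sketch below is an assumption-laden illustration: it assumes the csv file is comma-separated, has no header row, and holds precision, brevity penalty, and BLEU in that order, as described above. The function name `lowest_bleu_rows` and the sample numbers are made up for the example and are not cyzil output.

```python
import csv
import io
from typing import Iterable, List, Tuple

Row = Tuple[int, float, float, float]


def lowest_bleu_rows(csv_lines: Iterable[str], k: int = 3) -> List[Row]:
    """Return the k rows with the lowest BLEU as (row index, precision, BP, BLEU)."""
    rows: List[Row] = []
    for index, row in enumerate(csv.reader(csv_lines)):
        if not row:
            continue  # skip blank lines
        precision, brevity_penalty, bleu = (float(value) for value in row[:3])
        rows.append((index, precision, brevity_penalty, bleu))
    # Sort by the BLEU column so the weakest pairs come first.
    return sorted(rows, key=lambda r: r[3])[:k]


# Illustrative data only; these numbers are not produced by cyzil.
sample = io.StringIO("0.90,1.0,0.90\n0.25,0.8,0.20\n1.0,1.0,1.0\n")
worst = lowest_bleu_rows(sample, k=2)
print(worst)
```

With a real file, pass an open file handle instead of the sample, e.g. `with open("output.csv", newline="") as f: lowest_bleu_rows(f)`. The row index corresponds to the line number (counting from zero) of the sentence pair in the reference and candidate files.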

Edit distance can be computed as follows. The first column of the output is edit distance and the second column is normalized edit distance.

> cyzil-edit-distance-points \
    --reference data/ref.en \
    --candidate data/can.en \
    --tokenizer nltk \
    --output output.csv
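For reference, token-level edit distance (Levenshtein distance) can be sketched in pure Python with the standard dynamic-programming recurrence. This is not cyzil's Cython code, and `edit_distance_sketch` is a hypothetical name. The normalization shown here, dividing by the reference length, is an assumption; cyzil may normalize differently.

```python
from typing import Sequence, Tuple


def edit_distance_sketch(
    reference: Sequence[str], candidate: Sequence[str]
) -> Tuple[int, float]:
    """Return (edit distance, normalized edit distance) for one pair (sketch only)."""
    # previous[j] is the distance between the reference prefix processed so far
    # and the first j candidate tokens.
    previous = list(range(len(candidate) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, can_token in enumerate(candidate, start=1):
            substitution = previous[j - 1] + (ref_token != can_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    distance = previous[-1]
    # Assumption: normalize by reference length (guarding against empty input).
    normalized = distance / max(len(reference), 1)
    return distance, normalized
```

For example, comparing the reference `['this', 'is', 'a', 'test']` with the candidate `['this', 'is', 'test']` gives a distance of 1 (one missing token) and, under the assumed normalization, 0.25.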

For more details, please refer to the help of each command, e.g. cyzil-bleu-corpus -h.

Python API

Cyzil can be imported as a Python module into your program. The following shows examples of API calls. For more details, please refer to the User Guide.

import cyzil

# Each sentence is passed as a list of tokens.
reference = ['this', 'is', 'a', 'test']
candidate = ['this', 'is', 'a', 'test']

# Sentence-level functions take a single reference-candidate pair.
cyzil.bleu_sentence(reference, candidate, max_ngram=4)

# Corpus-level and per-pair functions take lists of sentences.
cyzil.bleu_corpus([reference], [candidate], max_ngram=4)

cyzil.bleu_points([reference], [candidate], max_ngram=4)

cyzil.edit_distance_sentence(reference, candidate)

cyzil.edit_distance_corpus([reference], [candidate])

cyzil.edit_distance_points([reference], [candidate])

Testing

git clone this repository.
cd into the repository.
Run pytest. If you don't have pytest, run pip install pytest first.

Release history

0.3.1 (May 6, 2020)
0.3.0 (May 6, 2020)
0.2.5 (Apr 23, 2020)
0.2.4 (Apr 23, 2020)
0.2.3 (Apr 22, 2020)
0.2.2 (Apr 22, 2020), this version
0.2.1 (Apr 22, 2020)

Download files

Source Distribution

cyzil-0.2.2.tar.gz (88.5 kB)

Uploaded Apr 22, 2020 Source

Hashes for cyzil-0.2.2.tar.gz

Algorithm	Hash digest
SHA256	`f53eec7116e7cc0a674b99bace0294d28026024eb230abe32468784f438556dd`
MD5	`c84f6e6b567261f826fb9389391906bc`
BLAKE2b-256	`4fa0ad59f2896750f8c08c67d28b1fbec8455b37bdc89257b0a8b1ad97108d34`