
Translation Edit Rate on the character level


CharacTER

CharacTER: Translation Edit Rate on Character Level

CharacTER (cer) is a novel character-level metric inspired by the widely used translation edit rate (TER). It is defined as the minimum number of character edits required to transform a hypothesis until it exactly matches the reference, normalized by the length of the hypothesis sentence. CharacTER computes the edit distance on the character level while performing the shift edit on the word level. Unlike the strict matching criterion in TER, a hypothesis word is considered to match a reference word, and can therefore be shifted, if the edit distance between them is below a threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is then computed on the character level. Normalizing by the length of the hypothesis rather than the reference counters the tendency of shorter translations to achieve lower TER.
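To make the definition above more concrete, here is a minimal sketch of the character-level part of the idea. It is not the package's implementation: it leaves out the word-level shift step entirely, it assumes that the spaces between words count as characters, and the names levenshtein, words_match, simplified_char_edit_rate and the threshold default are hypothetical choices made for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def words_match(hyp_word: str, ref_word: str, threshold: int = 2) -> bool:
    """Relaxed word matching: words match if their edit distance is below
    a threshold. The default value is an assumption, not the package's."""
    return levenshtein(hyp_word, ref_word) < threshold


def simplified_char_edit_rate(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """Character edits divided by hypothesis length, without any shifts.

    Assumption: words are joined with single spaces and the spaces are
    counted as characters.
    """
    hyp_text = " ".join(hyp)
    ref_text = " ".join(ref)
    if not hyp_text:
        raise ValueError("the hypothesis must contain at least one character")
    return levenshtein(hyp_text, ref_text) / len(hyp_text)


print(words_match("bag", "bags"))    # one edit apart, so below the threshold
print(words_match("your", "their"))  # four edits apart, so not a match
print(simplified_char_edit_rate(["i", "like", "your", "bag"],
                                ["i", "like", "their", "bags"]))
```

Because the shift step is missing, this sketch will generally differ from the real metric whenever the hypothesis and reference order their words differently.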

Modifications by Bram Vanroy

Bram Vanroy made some changes to this package that do not affect the result of the metric but should improve usability. The code has been rewritten to avoid the need for custom C++ code (a C implementation of Levenshtein distance is used together with an LRU cache instead), to make functions more accessible and readable, and to add type hints. Packaging has also been improved so that the package can be uploaded to PyPI easily. As a result, the package can now be installed via pip:

pip install cer
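The combination of a Levenshtein routine with an LRU cache mentioned above can be illustrated with the standard library. This is only a sketch of the caching idea: the package relies on a C implementation of Levenshtein, whereas the pure-Python cached_levenshtein below is a hypothetical stand-in, and the unbounded cache size is an assumption.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # assumption: unbounded cache, for illustration only
def cached_levenshtein(a: str, b: str) -> int:
    """Pure-Python Levenshtein distance whose results are memoized.

    Repeated word pairs, which are common in translation output, are then
    looked up instead of recomputed.
    """
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


cached_levenshtein("kitten", "sitting")   # computed on the first call
cached_levenshtein("kitten", "sitting")   # served from the cache
print(cached_levenshtein.cache_info())    # reports hits and misses
```

Strings are hashable, so they can be used directly as cache keys; token lists would first have to be converted to tuples.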

The main functions are calculate_cer and calculate_cer_corpus, which both expect tokenized input. The first argument contains the hypotheses and the second the references.

from cer import calculate_cer

cer_score = calculate_cer(["i", "like", "your", "bag"], ["i", "like", "their", "bags"])
cer_score
0.3333333333333333

calculate_cer_corpus is similar, but it expects a sequence of sequences of words, that is, a corpus of tokenized sentences. It reports summary statistics over the sentence-level CER scores that were calculated.

from cer import calculate_cer_corpus

hyps = ["this week the saudis denied information published in the new york times",
        "this is in fact an estimate"]
refs = ["saudi arabia denied this week information published in the american new york times",
        "this is actually an estimate"]

hyps = [sent.split() for sent in hyps]
refs = [sent.split() for sent in refs]

cer_corpus_score = calculate_cer_corpus(hyps, refs)
cer_corpus_score
{
    'count': 2,
    'mean': 0.3127282211789254,
    'median': 0.3127282211789254,
    'std': 0.07561653111280243,
    'min': 0.25925925925925924,
    'max': 0.36619718309859156
}
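The relationship between the sentence-level scores and the reported statistics can be sketched with the standard library. The helper summarize_scores is hypothetical and not part of the package. With only two sentences, the 'min' and 'max' values above are the two sentence-level scores, and the sample standard deviation (statistics.stdev) is consistent with the 'std' value shown. Returning 0.0 for a single score is an assumption of this sketch.

```python
import statistics
from typing import Dict, Sequence, Union


def summarize_scores(scores: Sequence[float]) -> Dict[str, Union[int, float]]:
    """Summary statistics over sentence-level scores (hypothetical helper)."""
    return {
        "count": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Sample standard deviation; it is undefined for a single score,
        # so this sketch falls back to 0.0 in that case (an assumption).
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# The two sentence-level scores taken from the 'min' and 'max' fields above.
sentence_scores = [0.25925925925925924, 0.36619718309859156]
print(summarize_scores(sentence_scores))
```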

In addition to the Python interface, a command-line entry point, calculate-cer, is installed. It calculates aggregate corpus-level scores (similar to calculate_cer_corpus) from two input files, one with hypotheses and one with references, each containing one sentence per line. Results are written to stdout.

usage: calculate-cer [-h] [-r] fhyp fref

CharacTER: Character Level Translation Edit Rate

positional arguments:
  fhyp                Path to file containing hypothesis sentences. One per line.
  fref                Path to file containing reference sentences. One per line.

optional arguments:
  -h, --help          show this help message and exit
  -r, --per_sentence  Whether to output CER scores per ref/hyp pair in addition to corpus-level statistics
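For readers who prefer to stay in Python, the file-based workflow can be approximated as follows. This is a sketch under assumptions, not a description of what the command-line tool does internally: the helper load_parallel_files is hypothetical, and it assumes UTF-8 files that are tokenized by splitting on whitespace.

```python
from pathlib import Path
from typing import List, Tuple, Union


def load_parallel_files(
    fhyp: Union[str, Path], fref: Union[str, Path]
) -> Tuple[List[List[str]], List[List[str]]]:
    """Read one sentence per line from each file and split on whitespace."""
    hyp_lines = Path(fhyp).read_text(encoding="utf-8").splitlines()
    ref_lines = Path(fref).read_text(encoding="utf-8").splitlines()
    if len(hyp_lines) != len(ref_lines):
        raise ValueError(
            f"line count mismatch: {len(hyp_lines)} hypotheses "
            f"vs. {len(ref_lines)} references"
        )
    hyps = [line.split() for line in hyp_lines]
    refs = [line.split() for line in ref_lines]
    return hyps, refs


# Usage with the installed package (file names are placeholders):
#   from cer import calculate_cer_corpus
#   hyps, refs = load_parallel_files("hyp.txt", "ref.txt")
#   print(calculate_cer_corpus(hyps, refs))
```

Checking the line counts up front catches misaligned files early, since every hypothesis needs exactly one reference.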

License

GPLv3
