
a tool for evaluation


# SLTev

SLTev is an open-source tool for assessing the quality of spoken language translation (SLT) in a comprehensive way. Based on a timestamped golden transcript and a reference translation into a target language, SLTev reports the quality, delay and stability of a given SLT candidate output.

SLTev can also evaluate the intermediate steps alone: the output of automatic speech recognition (ASR) and machine translation (MT).

You can see our short presentation at EACL 2021 - System Demonstration here: Full details in the paper (bibtex below):

## Requirements

  • python3.6 or higher
  • some pip-installed modules:
      - sacrebleu, sacremoses
      - gitpython, gitdir, filelock
  • mwerSegmenter

## File Naming Convention

Depending on whether your system produces (spoken language) translation (SLT) or just speech recognition (ASR), use the following naming templates for your input and output files.

### Golden Transcripts: .OSt, .OStt

  • <file-name> . <language> . <OSt/OStt>
  • e.g. kaccNlwi6lUCEM.en.OSt, kaccNlwi6lUCEM.cs.OStt

### Word Alignment for Better Estimation: .align

  • <file-name> . <source-language> . <target-language> . <align>
  • e.g.

### System Outputs from Translation: .slt, .mt

  • <file-name> . <source-language> . <target-language> . <slt/mt>
  • e.g.,

### System Outputs from ASR: .asr, .asrt

  • <file-name> . <source-language> . <source-language> . <asr/asrt>
  • e.g. kaccNlwi6lUCEM.en.en.asr
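The naming templates above can be checked mechanically. The following is a minimal sketch, not part of SLTev: the helper `parse_sltev_name` and the `SLTevName` tuple are hypothetical names introduced here, and the grouping of suffixes into one-language and two-language templates is read off the templates above.

```python
from typing import NamedTuple, Optional

# Suffixes taken from the naming templates above. Splitting them into
# "one language" and "two languages" groups is an assumption of this sketch.
ONE_LANGUAGE = {"OSt", "OStt"}
TWO_LANGUAGES = {"align", "slt", "mt", "asr", "asrt"}


class SLTevName(NamedTuple):
    stem: str              # <file-name>
    source: str            # <language> or <source-language>
    target: Optional[str]  # second language field, None for golden transcripts
    kind: str              # file-type suffix, e.g. "OSt" or "asr"


def parse_sltev_name(filename: str) -> SLTevName:
    """Split a file name into the fields of the templates above (hypothetical helper)."""
    parts = filename.split(".")
    kind = parts[-1]
    if kind in ONE_LANGUAGE and len(parts) >= 3:
        return SLTevName(".".join(parts[:-2]), parts[-2], None, kind)
    if kind in TWO_LANGUAGES and len(parts) >= 4:
        return SLTevName(".".join(parts[:-3]), parts[-3], parts[-2], kind)
    raise ValueError(f"{filename!r} does not match any naming template")


if __name__ == "__main__":
    print(parse_sltev_name("kaccNlwi6lUCEM.en.OSt"))
    print(parse_sltev_name("kaccNlwi6lUCEM.en.en.asr"))
```

For ASR outputs both language fields hold the source language, so `source` and `target` are equal in that case.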

## Installation

Install the Python module (Python 3 only)

```
pip3 install SLTev
```

Also, you can install from the source:

```
python3 setup.py install
```

## Package Overview

  • SLTev: Contains scripts for running SLTev
  • sample-data: Contains sample input and output files
  • test: Test files

## Evaluating

SLTev scoring relies on reference outputs (golden transcript for ASR, reference translation for MT and SLT).

You can run SLTev and provide it with your custom reference outputs, or you can pick the easier option: use our provided test set (elitr-testset) to evaluate your system on our inputs. The added benefit of elitr-testset scoring is that it makes your results comparable to others (subject to SLTev and test set versions, of course).

### Evaluating on elitr-testset

SLTev works best if you want to evaluate your system on files provided in elitr-testset.

The procedure is simple:

  1. Choose an “index”, i.e. a subset of files that you want to test on, here: We illustrate the rest with SLTev-sample as the index.

  2. Ask SLTev to provide you with the current version of input files:

```
SLTev -g SLTev-sample --outdir my-evaluation-run-1
# To use your existing checkout of elitr-testset, add -T /PATH/TO/YOUR/elitr-testset
# To populate elitr-testset links, add ELITR_CONFIDENTIAL_PASSWORD=<password> before SLTev,
#   e.g.: ELITR_CONFIDENTIAL_PASSWORD=myPass SLTev -g SLTev-sample --outdir my-evaluation-run-1
```

  3. Run your models on files in my-evaluation-run-1 and put the outputs into the same directory, with filename suffixes as described above.

  4. Run SLTev to get the scores:

```
SLTev -e my-evaluation-run-1/
# To aggregate scores instead of producing score files, add --aggregate
# To reduce the number of scores, add --simple
```
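If you want to script the fetch and scoring steps, the command lines can be assembled in Python before being handed to `subprocess.run`. This is only a sketch: `fetch_command` and `score_command` are hypothetical helper names, the flags are the ones listed in the steps above, and whether `--aggregate` and `--simple` can be combined is an assumption.

```python
import shlex
from typing import List, Optional


def fetch_command(index: str, outdir: str, testset_checkout: Optional[str] = None) -> List[str]:
    """Build the 'SLTev -g' command from step 2 (hypothetical helper)."""
    cmd = ["SLTev", "-g", index, "--outdir", outdir]
    if testset_checkout:
        # Reuse an existing checkout of elitr-testset.
        cmd += ["-T", testset_checkout]
    return cmd


def score_command(outdir: str, aggregate: bool = False, simple: bool = False) -> List[str]:
    """Build the 'SLTev -e' command from step 4 (hypothetical helper)."""
    cmd = ["SLTev", "-e", outdir]
    if aggregate:
        cmd.append("--aggregate")
    if simple:
        # Assumption: --simple may be given together with --aggregate.
        cmd.append("--simple")
    return cmd


if __name__ == "__main__":
    for cmd in (fetch_command("SLTev-sample", "my-evaluation-run-1"),
                score_command("my-evaluation-run-1/", simple=True)):
        # Print a shell-safe rendering; pass the list to subprocess.run to execute it.
        print(" ".join(shlex.quote(arg) for arg in cmd))
```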

### Evaluating with Your Custom Reference Files

To evaluate a hypothesis with custom files, use the MTeval, SLTeval and ASReval commands as follows. Each of them takes a list of input file paths (-i or --input) and a list of the formats of the input files, in the same order (-f or --file-formats). The input file formats can be chosen from the following items:

  • ost: original speech transcribed, i.e. the golden transcript
  • ref: reference translation
  • ostt: timestamped golden transcript
  • slt: timestamped online MT hypothesis, with partial outputs
  • mt: finalized MT hypothesis (i.e. one segment per line; segmentation can differ from the reference one)
  • align: alignment files (output of MGIZA)
  • asrt: timestamped ASR hypothesis, with partial outputs
  • asr: finalized ASR hypothesis (i.e. one segment per line; segmentation can differ from the golden one)

Please note that a candidate file must come either before or after its input files, never between them. In the following examples, 1 and 2 are correct and 3 is not.

  1. SLTeval -i slt_path ostt_path ref_path -f slt ostt ref
  2. SLTeval -i ostt_path ref_path slt_path -f ostt ref slt
  3. SLTeval -i ostt_path slt_path ref_path -f ostt slt ref
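The ordering rule can be expressed as a small check on the -f list. This is a sketch of the rule as stated above, not SLTev's own validation; `candidate_position_ok` is a hypothetical name, and treating slt, mt, asr and asrt as the candidate formats follows the format list above.

```python
from typing import Sequence

# Hypothesis ("candidate") formats, per the format list above.
CANDIDATE_FORMATS = {"slt", "mt", "asr", "asrt"}


def candidate_position_ok(formats: Sequence[str]) -> bool:
    """Return True if the single candidate is first or last in its group of formats."""
    positions = [i for i, fmt in enumerate(formats) if fmt in CANDIDATE_FORMATS]
    if len(positions) != 1:
        raise ValueError("this sketch expects exactly one candidate format per group")
    return positions[0] in (0, len(formats) - 1)


if __name__ == "__main__":
    for formats in (["slt", "ostt", "ref"], ["ostt", "ref", "slt"], ["ostt", "slt", "ref"]):
        print(formats, candidate_position_ok(formats))
```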

#### Evaluating MT

To evaluate the output of a machine translation system without any timing information, use the following command.

Note that SLTev is not intended for the basic case where MT output segments correspond one-to-one to the reference segments; SLTev will always resegment in some way.

```
MTeval -i file1 file2 ... -f file1_format file2_format ...
# To reduce the number of scores, add --simple
```

Demo example:

```
git clone
cd SLTev
MTeval -i sample-data/ sample-data/sample.cs.OSt -f mt ref
```

Should give you output like this:

```
Evaluating the file  sample-data/  in terms of translation quality against  sample-data/sample.cs.OSt
P ... considering Partial segments in delay and quality calculation (in addition to Complete segments)
T ... considering source Timestamps supplied with MT output
W ... segmenting by mWER segmenter (i.e. not segmenting by MT source timestamps)
A ... considering word alignment (by GIZA) to relax word delay (i.e. relaxing more than just linear delay calculation)
------------------------------------------------------------------------------------------------------------
--       TokenCount    reference1             37
avg      TokenCount    reference*             37
--       SentenceCount reference1             4
avg      SentenceCount reference*             4
tot      sacreBLEU     docAsAWhole            32.786
avg      sacreBLEU     mwerSegmenter          25.850
```

#### Evaluating SLT

Spoken language translation evaluation assesses “machine translation in time”: a time-stamped MT output (slt) is compared with the reference translation (non-timed, ref) and the timing of the golden transcript (ostt).

```
SLTeval -i file1 file2 ... -f file1_format file2_format ...
# To reduce the number of scores, add --simple
```

Demo example:

```
# get sample-data as in the MT example above
SLTeval -i sample-data/sample.en.cs.slt sample-data/sample.cs.OSt sample-data/sample.en.OStt -f slt ref ostt
```

Should give you:

```
Evaluating the file  sample-data/sample.en.cs.slt  in terms of translation quality against  sample-data/sample.cs.OSt
...
tot      Delay         PW                     336.845
...
tot      Flicker       count_changed_content  23
...
tot      sacreBLEU     docAsAWhole            32.786
...
```

#### Evaluating ASR

In basic speech recognition evaluation, timing is ignored. For this type of evaluation, use the following command and provide ASR output (asr) and the golden transcript without timestamps (ost):

```
ASReval -i file1 file2 ... -f file1_format file2_format ...
# To reduce the number of scores, add --simple
```

Demo example:

```
# get sample-data as in the MT example above
ASReval -i sample-data/sample.en.en.asr sample-data/sample.en.OSt -f asr ost
```

Should give you:

```
Evaluating the file  sample-data/sample.en.en.asr  in terms of  WER score against  sample-data/sample.en.OSt
-------------------------------------------------------------
L ... lowercasing
P ... removing punctuation
C ... concatenating all sentences
W ... using mwersegmemter
M ... using Moses tokenizer
-------------------------------------------------------------
LPC    0.265
LPW    0.274
WM     0.323
```

Here we learn that the WER score (lower is better) for this sample file varies between 0.265 and 0.323 depending on the pre-processing technique. In ASR research, the most common pre-processing strategy is what we call LPW, i.e. lowercase, remove punctuation and use mWERsegmenter to mimic the segmentation of the reference transcript. If we consider casing and punctuation (labelled WM), the score naturally gets worse.
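To see why pre-processing changes the score, the sketch below computes a word error rate on whitespace tokens with optional lowercasing and punctuation removal, roughly in the spirit of the L and P options applied to a concatenated document (C). It is an illustration only and not SLTev's implementation: it does not use mwerSegmenter or the Moses tokenizer, and `normalize` and `wer` are hypothetical helpers.

```python
import string
from typing import List, Sequence


def normalize(text: str, lowercase: bool = True, strip_punct: bool = True) -> List[str]:
    """Optionally lowercase and drop ASCII punctuation, then split on whitespace."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def wer(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, 1):
        cur = [i]
        for j, hyp_word in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = cur
    return prev[-1] / len(reference)


if __name__ == "__main__":
    ref = "Hello, world. How are you?"
    hyp = "hello world how is you"
    print(wer(normalize(ref), normalize(hyp)))
    print(wer(normalize(ref, False, False), normalize(hyp, False, False)))
```

With normalization only the substituted word counts as an error; without it, every token that differs in casing or attached punctuation counts as well, which is why the stricter setting scores worse.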

#### Evaluating ASR with timing (ASRT)

ASRT is like SLT but in the source language, i.e. evaluating the time-stamped output of an ASR system (asrt) against the golden transcript which has to be provided twice: without timestamps (ost) and with timing and partial segments (ostt). All the files are in the same language and the ost file must have the exact same number of segments as there are “C”omplete segments in the ostt file.
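The segment-count requirement can be verified before running the evaluation. The sketch below assumes that each ostt line starts with a `P` or `C` marker as its first whitespace-separated token; that layout, the sample lines and the helper names are assumptions made for illustration and should be checked against your own files.

```python
from typing import Iterable


def count_complete_segments(ostt_lines: Iterable[str]) -> int:
    """Count lines whose first token is 'C' (assumed marker of a Complete segment)."""
    return sum(1 for line in ostt_lines if line.split()[:1] == ["C"])


def ost_matches_ostt(ost_lines: Iterable[str], ostt_lines: Iterable[str]) -> bool:
    """True if the number of non-empty ost segments equals the number of Complete ostt segments."""
    ost_count = sum(1 for line in ost_lines if line.strip())
    return ost_count == count_complete_segments(ostt_lines)


if __name__ == "__main__":
    # Hypothetical sample content; the timestamp columns are made up.
    ost = ["Hello world.", "How are you?"]
    ostt = [
        "P 0 120 Hello",
        "C 0 250 Hello world.",
        "P 250 400 How are",
        "C 250 520 How are you?",
    ]
    print(ost_matches_ostt(ost, ostt))
```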

```
ASReval -i file1 file2 ... -f file1_format file2_format ...
# To reduce the number of scores, add --simple
```

Demo example:

```
ASReval -i sample-data/sample.en.en.asrt sample-data/sample.en.OSt sample-data/sample.en.OStt -f asrt ost ostt
```

#### Notes

  1. The asrt and slt files have timestamps; mt and asr files do not.
  2. To use the MTeval, SLTeval and ASReval commands, you do not need to follow the naming templates; the -f parameter specifies how each file is used.
  3. You can evaluate several hypotheses at once. You can also use short file formats. For example, the following commands are equivalent:

```
MTeval -i file1 hypo1 file2 hypo2 -f ref mt ref mt
```

OR

```
MTeval -i file1 hypo1 file2 hypo2 -f ref mt
```

  4. You can pipe the input file paths to the command instead of using the -i parameter. For example, the following commands are equivalent:

```
MTeval -i file1 hypo1 file2 hypo2 -f ref mt
```

OR

```
echo "file1 hypo1" |  MTeval -f ref mt
```
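One way to read the short file formats in note 3 is that the format list is repeated cyclically until it covers all input files. The sketch below implements that reading; it is an interpretation consistent with the example in the note, not SLTev's actual argument handling, and `expand_formats` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def expand_formats(inputs: Sequence[str], formats: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each input path with a format, repeating the format list as needed."""
    if not formats or len(inputs) % len(formats) != 0:
        raise ValueError("number of inputs must be a multiple of the number of formats")
    repeats = len(inputs) // len(formats)
    return list(zip(inputs, list(formats) * repeats))


if __name__ == "__main__":
    inputs = ["file1", "hypo1", "file2", "hypo2"]
    # Both spellings from note 3 yield the same pairing under this reading.
    print(expand_formats(inputs, ["ref", "mt"]))
    print(expand_formats(inputs, ["ref", "mt", "ref", "mt"]))
```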

## Terminology and Abbreviations

  • OSt … original speech manually transcribed (i.e. golden transcript)
  • OStt … original speech manually transcribed with word-level timestamps
  • mt … the unrevised output of text-based translation; the source of MT can be .asr (machine-transcribed OS) or .OSt (human-transcribed OS)
  • slt … timestamped online MT hypothesis, i.e. the output of an MT system run in online mode, with timestamps recorded
  • asr … the unrevised output of a speech recognition system
  • asrt … the unrevised output of a speech recognition system, timestamped at the word level


