OmniSTEval
A tool for evaluating simultaneous speech/text translation systems — both shortform (segment-level) and longform (document-level with resegmentation).
For longform systems, OmniSTEval re-segments the translation outputs to match reference segmentation, enabling segment-level quality (BLEU, chrF, COMET) and latency (YAAL, LongYAAL, etc.) evaluation.
Implements YAAL, LongYAAL, and the SoftSegmenter alignment algorithm from "Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation" (Polák et al., 2025; see Citation below).
How It Works
Shortform evaluation
For the shortform case (i.e., when the input speech is already segmented at sentence boundaries), OmniSTEval evaluates quality and latency metrics directly on the per-segment outputs, without any resegmentation.
In addition to the standard quality metrics (BLEU, chrF, COMET) and latency metrics (YAAL, AL, LAAL, AP, DAL), shortform evaluation computes a set of degeneracy diagnostics that help detect whether a system has learned a degenerate simultaneous policy — see Shortform Degeneracy Diagnostics below.
Longform evaluation (with resegmentation)
Simultaneous speech translation systems (e.g., those evaluated via SimulEval or StreamEval) produce a single long-form output per audio recording. However, most evaluation metrics (like BLEU and YAAL) are designed for segmented inputs. OmniSTEval takes the reference speech segmentation and aligns the hypothesis words (including their emission timestamps) to it, effectively re-segmenting the hypothesis according to the reference segments.
The pipeline consists of the following steps:
- Tokenization — Reference and hypothesis words are optionally tokenized using the Moses tokenizer. For Chinese/Japanese (or when --lang is not set), tokenization is skipped and character-level units are used instead.
- Alignment — A dynamic programming algorithm (similar to Needleman-Wunsch / DTW, but without gap penalties) aligns hypothesis words to reference words; see the sketch after this list. The alignment maximizes a Jaccard-based character-set similarity score at the word level, or exact match at the character level. Punctuation is prevented from aligning with non-punctuation tokens, which helps to mitigate segmentation errors around sentence boundaries.
- Re-segmentation — Aligned hypothesis words are grouped by their assigned reference segment IDs, producing one hypothesis segment per reference segment.
- Evaluation — Each re-segmented instance is scored with the enabled quality metrics (BLEU, chrF, COMET) and latency metrics (LongYAAL, LongAL, LongLAAL, LongAP, LongDAL).
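The following is a minimal sketch of the alignment idea, not the actual SoftSegmenter implementation (function names and the traceback are illustrative; the real algorithm additionally handles the punctuation constraint and character-level matching):

```python
# Sketch: monotonic DP alignment without gap penalties, maximizing
# Jaccard character-set similarity between word pairs.
def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

def align(hyp, ref):
    """Return (hyp_index, ref_index) pairs on the best monotonic path."""
    n, m = len(hyp), len(ref)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j], back[i][j] = max(
                (score[i - 1][j - 1] + jaccard(hyp[i - 1], ref[j - 1]), (i - 1, j - 1)),
                (score[i - 1][j], (i - 1, j)),  # skip a hypothesis word, no penalty
                (score[i][j - 1], (i, j - 1)),  # skip a reference word, no penalty
            )
    # Trace back and keep only the matched (diagonal) steps.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```

For example, align("hello there friends".split(), "hallo there , friend".split()) matches each hypothesis word to its most similar reference word while preserving word order.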
Shortform Degeneracy Diagnostics
When running omnisteval shortform, three additional percent-scale metrics and a boolean flag are computed to help identify systems that have learned a degenerate simultaneous policy — i.e., a policy that appears simultaneous (outputs arrive before end-of-segment) without actually being driven by source content.
| Metric | Display name | Description |
|---|---|---|
| swf | Simultaneous Words Fraction (%) | Fraction of output words emitted before the end-of-segment signal. Corpus micro-average: 100 × Σ \|{d < src_len}\| / Σ \|delays\|. |
| efsw | Expected Simul. Words Fraction (%) | Expected fraction of the segment duration that precedes the average latency per segment: 100 × Σ max(0, src_len − YAAL_i) / Σ src_len. |
| dsptv | Degeneracy Test Value | Signed difference EFSW − SWF. A large positive value means the system emits a few words with significantly lower latency while translating the rest of the segment after the end-of-segment signal. |
| degenerate_policy | Likely Degenerate Simultaneous Policy | YES when \|DSPTV\| > 20, NO otherwise. When YES, an additional warning banner is also printed. |
Interpretation:
- Both SWF and EFSW measure the "simultaneous-ness" of the system, but from different angles: SWF is purely empirical; EFSW is derived from YAAL.
- A well-behaved simultaneous system should have SWF ≈ EFSW, giving a DSPTV near 0.
- A large |DSPTV| (> 20 pp) suggests the system's emission pattern is inconsistent with its latency profile — a common signature of degenerate strategies such as outputting all words at the very start or end of each segment.
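As an illustration, the diagnostics can be computed from the formulas above as follows (a sketch, not OmniSTEval's internal code; the function name and input layout are assumptions):

```python
# Per-segment inputs: word emission delays (ms), source lengths (ms),
# and YAAL values (ms). Returns (SWF, EFSW, DSPTV, degenerate?).
def degeneracy_diagnostics(delays, src_lens, yaals, threshold=20.0):
    # SWF: corpus micro-average fraction of words emitted before end-of-segment
    simul = sum(sum(d < L for d in ds) for ds, L in zip(delays, src_lens))
    total = sum(len(ds) for ds in delays)
    swf = 100.0 * simul / total
    # EFSW: expected simultaneous fraction implied by per-segment YAAL
    efsw = 100.0 * sum(max(0.0, L - y) for L, y in zip(src_lens, yaals)) / sum(src_lens)
    dsptv = efsw - swf
    return swf, efsw, dsptv, abs(dsptv) > threshold
```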
Example report output (degenerate case):
================================================================
OmniSTEval v0.1.7 | Shortform evaluation
================================================================
Settings
----------------------------------------------------------------
Hypothesis instances.log
Reference references.txt
BLEU tokenizer 13a
Char-level no
Fix CA emissions no
Metrics quality, latency
Version 0.1.7
Scores
----------------------------------------------------------------
BLEU 18.2271
chrF 44.5324
YAAL (CU) 1135.6097
AL (CU) 1803.9192
LAAL (CU) 1857.7128
AP (CU) 0.7948
DAL (CU) 3532.4812
YAAL (CA) 1272.7485
AL (CA) 2021.1781
LAAL (CA) 2071.7031
AP (CA) 0.8903
DAL (CA) 3883.0303
Simultaneous Words Fraction (%) 32.9060
Expected Simul. Words Fraction (%) 81.1100
Degeneracy Test Value 48.2040
Likely Degenerate Simultaneous Policy YES
*** Likely Degenerate Simultaneous Policy ***
================================================================
Installation
pip install OmniSTEval
Or install from source:
git clone https://github.com/pe-trik/omnisteval.git
cd omnisteval
pip install -e .
For COMET scoring support:
pip install OmniSTEval[comet]
For SimulStream log support:
pip install OmniSTEval[simulstream]
Requirements
- Python 3.8+
- mosestokenizer>=1.2.1
- PyYAML>=6.0.3
- sacrebleu>=2.5.1
- (Optional) unbabel-comet — for COMET scoring
- (Optional) simulstream — for reading SimulStream logs
Usage
OmniSTEval provides two subcommands: shortform and longform.
Shortform evaluation
Evaluate a segment-level (shortform) SimulEval output directly:
omnisteval shortform \
--hypothesis_file instances.log \
--ref_sentences_file reference_sentences.txt \
--bleu_tokenizer 13a \
--output_folder evaluation_output
Longform evaluation with speech resegmentation
Re-segment a long-form hypothesis to match the reference speech segmentation, then evaluate:
omnisteval longform \
--speech_segmentation ref_segments.yaml \
--ref_sentences_file reference_sentences.txt \
--hypothesis_file simuleval_instance_file.log \
--lang en \
--bleu_tokenizer 13a \
--output_folder segmentation_output
Longform evaluation with text resegmentation
Re-segment based on text-level document/segment IDs (no latency metrics):
omnisteval longform \
--text_segmentation text_segmentation.txt \
--ref_sentences_file reference_sentences.txt \
--hypothesis_file hypotheses.txt \
--hypothesis_format text \
--lang en \
--output_folder segmentation_output
Longform evaluation with SimulStream logs
If you have a SimulStream logfile (streaming outputs), use --hypothesis_format simulstream and
provide the SimulStream evaluation config file with --simulstream_config_file.
omnisteval longform \
--speech_segmentation ref_segments.yaml \
--ref_sentences_file references.txt \
--hypothesis_file simulstream_log.jsonl \
--simulstream_config_file cfg.yaml \
--hypothesis_format simulstream \
--lang de \
--bleu_tokenizer 13a \
--output_folder segmentation_output
Evaluate a pre-resegmented log
If you already have a resegmented JSONL file (e.g., from a previous longform run), you can evaluate it directly:
omnisteval longform \
--resegmented_hypothesis instances.resegmented.jsonl \
--bleu_tokenizer 13a \
--output_folder evaluation_output
With COMET scoring
omnisteval longform \
--speech_segmentation ref_segments.yaml \
--ref_sentences_file reference_sentences.txt \
--hypothesis_file simuleval_instance_file.log \
--source_sentences_file source_sentences.txt \
--comet \
--lang en \
--output_folder segmentation_output
Custom emission timestamp field names
If your JSONL hypothesis uses different keys for emission timestamps:
omnisteval longform \
--speech_segmentation ref_segments.yaml \
--ref_sentences_file reference_sentences.txt \
--hypothesis_file hypothesis.log \
--emission_cu_key my_delays \
--emission_ca_key my_elapsed \
--lang en \
--output_folder segmentation_output
Arguments
Common arguments (both subcommands)
| Argument | Required | Default | Description |
|---|---|---|---|
| --output_folder | No | — | Directory where output files will be written. When omitted, the evaluation report is printed to stdout only — no files are saved. |
| --char_level | One of these | False | Use character-level alignment and scoring instead of word-level. |
| --word_level | One of these | False | Use word-level alignment and scoring instead of character-level. |
| --no_quality | No | False | Disable quality metrics (BLEU, chrF, COMET). |
| --no_latency | No | False | Disable latency metrics (YAAL). Automatically set for text-only hypotheses. |
| --comet | No | False | Enable COMET scoring. Requires --source_sentences_file and unbabel-comet. |
| --comet_model | No | Unbabel/wmt22-comet-da | COMET model name. |
| --bleu_tokenizer | No | 13a | Tokenizer for SacreBLEU (e.g., 13a, intl, ja-mecab, zh). |
| --source_sentences_file | No | — | Path to source sentences file (one per segment, for COMET scoring). |
| --emission_cu_key | No | delays | JSON key for computation-unaware emission timestamps in JSONL hypothesis. |
| --emission_ca_key | No | elapsed | JSON key for computation-aware emission timestamps in JSONL hypothesis. |
| --fix_simuleval_emission_ca | No | False | Fix computation-aware emission timestamps for CA-YAAL. |
shortform arguments
| Argument | Required | Default | Description |
|---|---|---|---|
| --hypothesis_file | Yes | — | Path to the JSONL hypothesis file (one JSON object per line with prediction, delays, etc.). |
| --ref_sentences_file | Yes | — | Path to the reference sentences file (one sentence per line). |
longform arguments
| Argument | Required | Default | Description |
|---|---|---|---|
| --speech_segmentation | One of these | — | Path to a YAML/JSON speech segmentation file. Mutually exclusive with --text_segmentation and --resegmented_hypothesis. |
| --text_segmentation | One of these | — | Path to a text segmentation file (docid=DOC_ID,segid=SEG_ID format). Mutually exclusive with --speech_segmentation and --resegmented_hypothesis. |
| --resegmented_hypothesis | One of these | — | Path to a pre-resegmented JSONL file. Mutually exclusive with the segmentation inputs. |
| --ref_sentences_file | For reseg. | — | Path to the reference sentences file. Required for resegmentation mode. |
| --hypothesis_file | For reseg. | — | Path to the hypothesis file. Required for resegmentation mode. |
| --hypothesis_format | No | jsonl | Format of the hypothesis file: jsonl (SimulEval/JSONL output), text, or simulstream (SimulStream logfile; requires the simulstream package). |
| --simulstream_config_file | No | — | Path to a SimulStream evaluation config YAML file. Required when --hypothesis_format=simulstream. |
| --lang | No | None | Language code for the Moses tokenizer (e.g., en, de). |
| --offset_delays | No | False | Offset delays relative to the first segment of each recording. |
Input Formats
Text Segmentation
A plain-text file with one entry per line in the format docid=DOC_ID,segid=SEG_ID, where DOC_ID and SEG_ID are 0-based integers. One line per reference sentence. The number of unique document IDs must equal the number of hypothesis lines.
docid=0,segid=0
docid=0,segid=1
docid=0,segid=2
docid=1,segid=0
docid=1,segid=1
- docid — 0-based document index (maps to hypothesis line number)
- segid — 0-based segment index within the document
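For example, the segmentation above (two documents with three and two segments) could be generated like this (a sketch; the output filename is illustrative):

```python
# Write a text segmentation file: one docid/segid entry per reference line.
doc_sizes = [3, 2]  # number of segments in each document
with open("text_segmentation.txt", "w") as f:
    for docid, n_segs in enumerate(doc_sizes):
        for segid in range(n_segs):
            f.write(f"docid={docid},segid={segid}\n")
```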
Speech Segmentation (YAML/JSON)
A list of segments, each with the following fields:
- {wav: recording.wav, offset: 2.433, duration: 9.05, speaker_id: spk1}
- {wav: recording.wav, offset: 15.003, duration: 9.675, speaker_id: spk1}
- wav — audio filename (used to group segments by recording)
- offset — segment start time in seconds
- duration — segment duration in seconds
- speaker_id — (optional) speaker identifier
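Such a file can be produced with PyYAML (already a declared dependency); a minimal sketch mirroring the example above, with an assumed output filename (PyYAML emits block style rather than the inline flow style shown, but both parse identically):

```python
import yaml  # PyYAML

segments = [
    {"wav": "recording.wav", "offset": 2.433, "duration": 9.05, "speaker_id": "spk1"},
    {"wav": "recording.wav", "offset": 15.003, "duration": 9.675, "speaker_id": "spk1"},
]
with open("ref_segments.yaml", "w") as f:
    yaml.safe_dump(segments, f, sort_keys=False)
```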
Reference Sentences
One sentence per line, aligned 1:1 with the segmentation entries:
Hello, this is Elena and I will present our work.
We will discuss what lexical borrowing is.
Hypothesis File
JSONL format (--hypothesis_format jsonl, default) — One JSON object per line (SimulEval output format):
{"source": "recording.wav", "prediction": "Hello this is Elena ...", "delays": [4067.0, 4067.0, ...], "elapsed": [4100.0, 4200.0, ...], "source_length": 220000}
- source — audio filename as a string (e.g., "recording.wav"), or an array with the recording name as the first element (e.g., ["recording.wav"]) for backward compatibility with SimulEval logs
- prediction — the full hypothesis text
- delays — (optional) per-token computation-unaware emission timestamps (in ms); length must match the number of words (or characters if --char_level) in prediction
- elapsed — (optional) per-token computation-aware emission timestamps (in ms)
- source_length — (optional, but highly recommended for reliable YAAL/LongYAAL latency evaluation) total recording length in ms
The key names for delays and elapsed can be customized with --emission_cu_key and --emission_ca_key.
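For instance, a single hypothesis line could be written like this (a sketch with made-up values; the ten delays match the ten words of the prediction):

```python
import json

instance = {
    "source": "recording.wav",
    "prediction": "Hello this is Elena and I will present our work.",
    "delays": [4067.0] * 10,                           # CU timestamps, ms, one per word
    "elapsed": [4100.0 + 100 * i for i in range(10)],  # CA timestamps, ms
    "source_length": 220000,                           # recording length, ms
}
with open("instances.log", "w", encoding="utf-8") as f:
    f.write(json.dumps(instance) + "\n")
```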
Text format (--hypothesis_format text) — One hypothesis per line, matched by order to recordings in the segmentation file. Latency metrics are not available in this mode.
Hello this is Elena and I will present our work.
We will discuss what lexical borrowing is.
SimulStream format (--hypothesis_format simulstream) — pass the SimulStream evaluation config with --simulstream_config_file. OmniSTEval uses the SimulStream LogReader to extract the final text and per-unit latencies.

What is read from the SimulStream log:

- final_text — the final hypothesis string for the recording (normalized and tokenized according to --char_level).
- ideal_delays — per-unit computation-unaware delays (stored in seconds inside SimulStream). These are converted to milliseconds by OmniSTEval (cu * 1000) and used as emission_cu.
- computational_aware_delays — per-unit computation-aware delays (seconds). These are converted to ms and used as emission_ca.

Notes:

- Install the simulstream package to enable this mode: pip install simulstream.
- The LogReader is called with latency_unit='char' when --char_level is set, otherwise latency_unit='word'. Make sure tokenization in your config matches the evaluation settings so the number of units equals the number of delays.
- Provide the SimulStream evaluation config file via --simulstream_config_file so the reader can parse latencies correctly.
Source Sentences (for COMET)
One source-language sentence per line, aligned 1:1 with the segmentation entries (same count as reference sentences):
Hola, soy Elena y presentaré nuestro trabajo.
Discutiremos qué es el préstamo léxico.
Output
Each run prints a human-readable evaluation report to stdout:
================================================================
OmniSTEval v0.1.7 | Longform evaluation (with resegmentation)
================================================================
Settings
----------------------------------------------------------------
Hypothesis simulstream_log.jsonl
Hypothesis format simulstream
Reference references.txt
Segmentation ref_segments.yaml
Seg. type speech
Language de
BLEU tokenizer 13a
Char-level no
Offset delays no
Fix CA emissions no
Metrics quality, latency
Version 0.1.7
Scores
----------------------------------------------------------------
BLEU 26.9845
chrF 55.2428
LongYAAL (CU) 2194.0496
LongAL (CU) 2500.9289
LongLAAL (CU) 2566.8508
LongAP (CU) 3.8295
LongDAL (CU) 3194.7830
LongYAAL (CA) 2466.8468
LongAL (CA) 2800.4354
LongLAAL (CA) 2860.0414
LongAP (CA) 4.0568
LongDAL (CA) 3496.1945
Instance-level Details
----------------------------------------------------------------
Empty Predictions: 15
Total Instances: 2641
For long-form evaluation, empty predictions may naturally occur for segments with short references
due to resegmentation or segments containing non-speech content such as music or silence.
However, if many segments or segments with substantial references have empty predictions,
this may indicate an issue with SimulST system or resegmentation.
Instances with empty predictions:
----------------------------------------------------------------
Instance 115 with reference 'Vielen Dank.' has an empty prediction.
Instance 118 with reference 'AB: Vielen Dank.' has an empty prediction.
Instance 150 with reference 'MS: 1,5 Millionen. (DC: Okay.)' has an empty prediction.
Instance 159 with reference 'MK: Ich kann helfen.' has an empty prediction.
Instance 224 with reference 'Mann 3: Ich bin der Schmusetyp.' has an empty prediction.
Instance 315 with reference 'June Cohen: Nun, Morgan, im Namen der Transparenz dies: Was ist nun genau mit den $7100 passiert?' has an empty prediction.
Instance 319 with reference '(Applaus)' has an empty prediction.
Instance 612 with reference '(Musik)' has an empty prediction.
Instance 631 with reference 'Sie können das die ganze Zeit machen.' has an empty prediction.
Instance 1131 with reference '(Musik)' has an empty prediction.
Instance 1133 with reference 'Ich danke Ihnen.' has an empty prediction.
Instance 2005 with reference '♫ ♫ Everybody 's looking forward to the weekend, weekend.' has an empty prediction.
Instance 2006 with reference '♫ ♫ Friday, Friday.' has an empty prediction.
Instance 2040 with reference '(Video) Bear Vasquez: Was bedeutet das?' has an empty prediction.
Instance 2628 with reference '(klirren)' has an empty prediction.
================================================================
If --output_folder is provided, three files are also written:
evaluation_report.txt
The same human-readable report shown on stdout, saved to a file for archival. Contains the version, all settings used, and the metric scores — sufficient to reproduce the reported values.
scores.tsv
A tab-separated file with one score per line — useful for scripting:
metric value
bleu 22.9418
chrf 52.1659
yaal 2969.2223
long_al 3000.5570
long_laal 3102.8672
long_ap 1.0506
long_dal 4130.1372
ca_yaal 5905.4040
ca_long_al 6100.6042
ca_long_laal 6159.5048
ca_long_ap 1.7176
ca_long_dal 7409.7906
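Since this is a plain TSV, it is easy to consume from scripts; for example (a sketch, assuming the file shown above):

```python
# Load scores.tsv into a {metric: value} dict.
scores = {}
with open("scores.tsv") as f:
    next(f)  # skip the "metric value" header line
    for line in f:
        metric, value = line.rstrip("\n").split("\t")
        scores[metric] = float(value)
print(scores["bleu"], scores["yaal"])
```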
instances.resegmented.jsonl (longform resegmentation only)
A JSONL file with one re-segmented instance per line (one per reference segment):
{"index": 0, "docid": 0, "segid": 0, "prediction": "Hello , this is Elena and I will present our work .", "reference": "Hello, this is Elena and I will present our work.", "source_length": 9050.0, "emission_cu": [4067.0, 4067.0, ...], "emission_ca": [4100.0, 4200.0, ...], "time_to_recording_end": 220000.0}
{"index": 1, "docid": 0, "segid": 1, ...}
Each entry contains the hypothesis words assigned to that segment, with emission timestamps offset relative to the segment start. In text-only mode, emission_cu, emission_ca, and time_to_recording_end fields are omitted.
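A quick way to inspect the re-segmented instances, e.g., the mean first-word CU emission delay (a sketch; field names follow the example above):

```python
import json

with open("instances.resegmented.jsonl", encoding="utf-8") as f:
    instances = [json.loads(line) for line in f]

first = [inst["emission_cu"][0] for inst in instances if inst.get("emission_cu")]
print(f"mean first-word CU delay: {sum(first) / len(first):.1f} ms")
```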
Score columns
Quality (both modes)
- BLEU — corpus-level SacreBLEU score
- chrF — corpus-level chrF score
- COMET — COMET score (only if --comet is enabled)
Latency (shortform — is_longform=False)
- YAAL (CU) / YAAL (CA) — Yet Another Average Lagging (computation-unaware / computation-aware)
- AL (CU) / AL (CA) — Average Lagging (adapted from SimulEval)
- LAAL (CU) / LAAL (CA) — Length-Adaptive Average Lagging (adapted from SimulEval)
- AP (CU) / AP (CA) — Average Proportion (adapted from SimulEval)
- DAL (CU) / DAL (CA) — Differentiable Average Lagging (adapted from SimulEval)
Latency (longform — is_longform=True)
Same metrics as above, prefixed with Long: LongYAAL (CU), LongAL (CU), LongLAAL (CU), LongAP (CU), LongDAL (CU), and their (CA) counterparts.
Shortform degeneracy diagnostics (shortform only)
- Simultaneous Words Fraction (%) — see Shortform Degeneracy Diagnostics
- Expected Simul. Words Fraction (%) — see above
- Degeneracy Test Value — signed difference EFSW − SWF
- Likely Degenerate Simultaneous Policy — YES / NO
Examples
See the examples/ directory for sample input files and expected output:
- Shortform evaluation (with degeneracy diagnostics): examples/short_form_degenerate_policy/
- Speech resegmentation: examples/speech_resegmentation_example/
- Text resegmentation: examples/text_resegmentation_example/
- SimulStream resegmentation: examples/simulstream_example/
Run an example with:
cd examples/speech_resegmentation_example
bash resegment.sh
Or evaluate a shortform system:
cd examples/short_form_degenerate_policy_example
bash evaluate.sh
Citation
If you use this tool in your research, please cite it as follows:
@article{polak2025better,
title={Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation},
author={Pol{\'a}k, Peter and Papi, Sara and Bentivogli, Luisa and Bojar, Ond{\v{r}}ej},
journal={arXiv preprint arXiv:2509.17349},
year={2025}
}
Contributing
Contributions are welcome! Please open an issue or submit a pull request.