Evaluate simultaneous speech/text translation systems (shortform and longform) with quality (BLEU, chrF, COMET) and latency (YAAL) metrics. For longform, re-segments outputs to match reference segmentation.

OmniSTEval

A tool for evaluating simultaneous speech/text translation systems — both shortform (segment-level) and longform (document-level with resegmentation).

For longform systems, OmniSTEval re-segments the translation outputs to match reference segmentation, enabling segment-level quality (BLEU, chrF, COMET) and latency (YAAL, LongYAAL, etc.) evaluation.

Implements YAAL, LongYAAL, and the SoftSegmenter alignment algorithm from the paper Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation.

How It Works

Shortform evaluation

For the shortform case (i.e., when the input speech is pre-segmented at sentence boundaries into individual segments), OmniSTEval evaluates quality and latency metrics directly on the per-segment outputs, without any resegmentation.

In addition to the standard quality metrics (BLEU, chrF, COMET) and latency metrics (YAAL, AL, LAAL, AP, DAL), shortform evaluation computes a set of degeneracy diagnostics that help detect whether a system has learned a degenerate simultaneous policy — see Shortform Degeneracy Diagnostics below.
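
Quality scores come from SacreBLEU. The snippet below is not OmniSTEval's internal code, just a minimal standalone equivalent of the corpus-level calls (the tokenizer corresponds to --bleu_tokenizer):

import sacrebleu

hypotheses = ["Hello this is Elena and I will present our work ."]
references = [["Hello, this is Elena and I will present our work."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="13a")
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU {bleu.score:.4f}  chrF {chrf.score:.4f}")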

Longform evaluation (with resegmentation)

Simultaneous speech translation systems (e.g., those evaluated via SimulEval or StreamEval) produce a single long-form output per audio recording. However, most evaluation metrics (like BLEU and YAAL) are designed for segmented inputs. OmniSTEval takes the reference speech segmentation and aligns the hypothesis words (including their emission timestamps) to it, effectively re-segmenting the hypothesis according to the reference segments.

The pipeline consists of the following steps:

  1. Tokenization — Reference and hypothesis words are optionally tokenized using the Moses tokenizer. For Chinese/Japanese (or when --lang is not set), tokenization is skipped and character-level units are used instead.
  2. Alignment — A dynamic programming algorithm (similar to Needleman-Wunsch / DTW, but without gap penalties) aligns hypothesis words to reference words, maximizing a Jaccard-based character-set similarity at word level, or exact match at character level. Punctuation is prevented from aligning with non-punctuation tokens, which helps mitigate segmentation errors around sentence boundaries (see the sketch after this list).
  3. Re-segmentation — Aligned hypothesis words are grouped by their assigned reference segment IDs, producing one hypothesis segment per reference segment.
  4. Evaluation — Each re-segmented instance is scored with the enabled metrics:
    • Quality: BLEU and chrF via SacreBLEU, optionally COMET
    • Latency: LongYAAL (Yet Another Average Lagging) — both computation-aware and computation-unaware variants, plus LongAL, LongLAAL, LongAP, and LongDAL (adapted from SimulEval)
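
A minimal sketch of the alignment idea in step 2 (illustrative only: the actual SoftSegmenter also enforces the punctuation constraint and supports character-level matching, both omitted here):

def jaccard(a, b):
    # Character-set similarity between two words.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa or sb else 0.0

def align(hyp, ref):
    # Monotone DP alignment maximizing total similarity; skipping a word
    # on either side incurs no gap penalty.
    n, m = len(hyp), len(ref)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j], back[i][j] = max(
                (score[i - 1][j], (i - 1, j)),  # skip hypothesis word
                (score[i][j - 1], (i, j - 1)),  # skip reference word
                (score[i - 1][j - 1] + jaccard(hyp[i - 1], ref[j - 1]),
                 (i - 1, j - 1)),               # align the two words
            )
    pairs, (i, j) = [], (n, m)
    while i > 0 and j > 0:
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):
            pairs.append((i - 1, j - 1))  # hyp index <-> ref index
        i, j = pi, pj
    return pairs[::-1]

print(align("hello this is elena".split(), "Hello , this is Elena".split()))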

Shortform Degeneracy Diagnostics

When running omnisteval shortform, three additional percent-scale metrics and a boolean flag are computed to help identify systems that have learned a degenerate simultaneous policy — i.e., a policy that appears simultaneous (outputs arrive before end-of-segment) without actually being driven by source content. A computation sketch follows the interpretation notes below.

Metric | Display name | Description
swf | Simultaneous Words Fraction (%) | Fraction of output words emitted before the end-of-segment signal. Corpus micro-average: 100 × Σ_i |{d ∈ delays_i : d < src_len_i}| / Σ_i |delays_i|.
efsw | Expected Simul. Words Fraction (%) | Expected fraction of output words emitted before the end-of-segment signal, assuming each segment's words lag the source by its average latency (YAAL_i): 100 × Σ_i max(0, src_len_i − YAAL_i) / Σ_i src_len_i.
dsptv | Degeneracy Test Value | Signed difference EFSW − SWF. A large positive value means the system emits a few words with very low latency while translating the rest of the segment after the end-of-segment signal.
degenerate_policy | Likely Degenerate Simultaneous Policy | YES when |DSPTV| > 20, NO otherwise. When YES, an additional warning banner is printed.

Interpretation:

  • Both SWF and EFSW measure how simultaneous the system really is, but from different angles: SWF is purely empirical, while EFSW is derived from YAAL.
  • A well-behaved simultaneous system should have SWF ≈ EFSW, giving a DSPTV near 0.
  • A large |DSPTV| (> 20 pp) suggests the system's emission pattern is inconsistent with its latency profile — a common signature of degenerate strategies such as outputting all words at the very start or end of each segment.
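
The three diagnostics can be reproduced with a short sketch following the formulas above (field names here are illustrative, not OmniSTEval's internal ones):

def degeneracy_diagnostics(instances):
    # instances: per-segment dicts with emission delays (ms),
    # source length (ms), and the segment-level YAAL (ms).
    simul_words = total_words = 0
    expected_ms = total_ms = 0.0
    for inst in instances:
        delays, src_len = inst["delays"], inst["src_len"]
        simul_words += sum(1 for d in delays if d < src_len)  # SWF numerator
        total_words += len(delays)
        expected_ms += max(0.0, src_len - inst["yaal"])       # EFSW numerator
        total_ms += src_len
    swf = 100.0 * simul_words / total_words
    efsw = 100.0 * expected_ms / total_ms
    dsptv = efsw - swf                        # Degeneracy Test Value
    return swf, efsw, dsptv, abs(dsptv) > 20  # flag: likely degenerate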

Example report output (degenerate case):

================================================================
OmniSTEval v0.1.7  |  Shortform evaluation
================================================================

Settings
----------------------------------------------------------------
  Hypothesis        instances.log
  Reference         references.txt
  BLEU tokenizer    13a
  Char-level        no
  Fix CA emissions  no
  Metrics           quality, latency
  Version           0.1.7

Scores
----------------------------------------------------------------
  BLEU                                   18.2271
  chrF                                   44.5324
  YAAL (CU)                              1135.6097
  AL (CU)                                1803.9192
  LAAL (CU)                              1857.7128
  AP (CU)                                0.7948
  DAL (CU)                               3532.4812
  YAAL (CA)                              1272.7485
  AL (CA)                                2021.1781
  LAAL (CA)                              2071.7031
  AP (CA)                                0.8903
  DAL (CA)                               3883.0303
  Simultaneous Words Fraction (%)        32.9060
  Expected Simul. Words Fraction (%)     81.1100
  Degeneracy Test Value                  48.2040
  Likely Degenerate Simultaneous Policy  YES

  *** Likely Degenerate Simultaneous Policy ***

================================================================

Installation

pip install OmniSTEval

Or install from source:

git clone https://github.com/pe-trik/omnisteval.git
cd omnisteval
pip install -e .

For COMET scoring support:

pip install OmniSTEval[comet]

For SimulStream log support:

pip install OmniSTEval[simulstream]

Requirements

  • Python 3.8+
  • mosestokenizer>=1.2.1
  • PyYAML>=6.0.3
  • sacrebleu>=2.5.1
  • (Optional) unbabel-comet — for COMET scoring
  • (Optional) simulstream — for reading SimulStream logs

Usage

OmniSTEval provides two subcommands: shortform and longform.

Shortform evaluation

Evaluate a segment-level (shortform) SimulEval output directly:

omnisteval shortform \
  --hypothesis_file instances.log \
  --ref_sentences_file reference_sentences.txt \
  --bleu_tokenizer 13a \
  --output_folder evaluation_output

Longform evaluation with speech resegmentation

Re-segment a long-form hypothesis to match the reference speech segmentation, then evaluate:

omnisteval longform \
  --speech_segmentation ref_segments.yaml \
  --ref_sentences_file reference_sentences.txt \
  --hypothesis_file simuleval_instance_file.log \
  --lang en \
  --bleu_tokenizer 13a \
  --output_folder segmentation_output

Longform evaluation with text resegmentation

Re-segment based on text-level document/segment IDs (no latency metrics):

omnisteval longform \
  --text_segmentation text_segmentation.txt \
  --ref_sentences_file reference_sentences.txt \
  --hypothesis_file hypotheses.txt \
  --hypothesis_format text \
  --lang en \
  --output_folder segmentation_output

Longform evaluation with SimulStream logs

If you have a SimulStream logfile (streaming outputs), use --hypothesis_format simulstream and provide the SimulStream evaluation config file with --simulstream_config_file.

omnisteval longform \
  --speech_segmentation ref_segments.yaml \
  --ref_sentences_file references.txt \
  --hypothesis_file simulstream_log.jsonl \
  --simulstream_config_file cfg.yaml \
  --hypothesis_format simulstream \
  --lang de \
  --bleu_tokenizer 13a \
  --output_folder segmentation_output

Evaluate a pre-resegmented log

If you already have a resegmented JSONL file (e.g., from a previous longform run), you can evaluate it directly:

omnisteval longform \
  --resegmented_hypothesis instances.resegmented.jsonl \
  --bleu_tokenizer 13a \
  --output_folder evaluation_output

With COMET scoring

omnisteval longform \
  --speech_segmentation ref_segments.yaml \
  --ref_sentences_file reference_sentences.txt \
  --hypothesis_file simuleval_instance_file.log \
  --source_sentences_file source_sentences.txt \
  --comet \
  --lang en \
  --output_folder segmentation_output

Custom emission timestamp field names

If your JSONL hypothesis uses different keys for emission timestamps:

omnisteval longform \
  --speech_segmentation ref_segments.yaml \
  --ref_sentences_file reference_sentences.txt \
  --hypothesis_file hypothesis.log \
  --emission_cu_key my_delays \
  --emission_ca_key my_elapsed \
  --lang en \
  --output_folder segmentation_output

Arguments

Common arguments (both subcommands)

Argument | Required | Default | Description
--output_folder | No | — | Directory where output files are written. When omitted, the evaluation report is printed to stdout only and no files are saved.
--char_level | one of these two | False | Use character-level alignment and scoring instead of word-level.
--word_level | one of these two | False | Use word-level alignment and scoring instead of character-level.
--no_quality | No | False | Disable quality metrics (BLEU, chrF, COMET).
--no_latency | No | False | Disable latency metrics (YAAL). Automatically set for text-only hypotheses.
--comet | No | False | Enable COMET scoring. Requires --source_sentences_file and unbabel-comet.
--comet_model | No | Unbabel/wmt22-comet-da | COMET model name.
--bleu_tokenizer | No | 13a | Tokenizer for SacreBLEU (e.g., 13a, intl, ja-mecab, zh).
--source_sentences_file | No | — | Path to the source sentences file (one per segment, for COMET scoring).
--emission_cu_key | No | delays | JSON key for computation-unaware emission timestamps in the JSONL hypothesis.
--emission_ca_key | No | elapsed | JSON key for computation-aware emission timestamps in the JSONL hypothesis.
--fix_simuleval_emission_ca | No | False | Fix computation-aware emission timestamps for CA-YAAL.

shortform arguments

Argument | Required | Default | Description
--hypothesis_file | Yes | — | Path to the JSONL hypothesis file (one JSON object per line with prediction, delays, etc.).
--ref_sentences_file | Yes | — | Path to the reference sentences file (one sentence per line).

longform arguments

Argument | Required | Default | Description
--speech_segmentation | one of these three | — | Path to a YAML/JSON speech segmentation file. Mutually exclusive with --text_segmentation and --resegmented_hypothesis.
--text_segmentation | one of these three | — | Path to a text segmentation file (docid=DOC_ID,segid=SEG_ID format). Mutually exclusive with --speech_segmentation and --resegmented_hypothesis.
--resegmented_hypothesis | one of these three | — | Path to a pre-resegmented JSONL file. Mutually exclusive with the segmentation inputs.
--ref_sentences_file | for resegmentation | — | Path to the reference sentences file. Required in resegmentation mode.
--hypothesis_file | for resegmentation | — | Path to the hypothesis file. Required in resegmentation mode.
--hypothesis_format | No | jsonl | Format of the hypothesis file: jsonl (SimulEval/JSONL output), text, or simulstream (SimulStream logfile; requires the simulstream package).
--simulstream_config_file | No | — | Path to a SimulStream evaluation config YAML file. Required when --hypothesis_format=simulstream.
--lang | No | None | Language code for the Moses tokenizer (e.g., en, de).
--offset_delays | No | False | Offset delays relative to the first segment of each recording.

Input Formats

Text Segmentation

A plain-text file with one entry per line in the format docid=DOC_ID,segid=SEG_ID, where DOC_ID and SEG_ID are 0-based integers. One line per reference sentence. The number of unique document IDs must equal the number of hypothesis lines.

docid=0,segid=0
docid=0,segid=1
docid=0,segid=2
docid=1,segid=0
docid=1,segid=1
  • docid — 0-based document index (maps to hypothesis line number)
  • segid — 0-based segment index within the document
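
A hypothetical snippet that writes a matching pair of files in this format (document and sentence contents are illustrative):

docs = [
    ["Hello, this is Elena and I will present our work.",
     "We will discuss what lexical borrowing is."],
    ["Second talk starts here.",
     "And continues."],
]
# One hypothesis line will be expected per document (docid).
with open("text_segmentation.txt", "w", encoding="utf-8") as seg, \
     open("reference_sentences.txt", "w", encoding="utf-8") as ref:
    for docid, sentences in enumerate(docs):
        for segid, sentence in enumerate(sentences):
            seg.write(f"docid={docid},segid={segid}\n")
            ref.write(sentence + "\n")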

Speech Segmentation (YAML/JSON)

A list of segments, each with the following fields:

- {wav: recording.wav, offset: 2.433, duration: 9.05, speaker_id: spk1}
- {wav: recording.wav, offset: 15.003, duration: 9.675, speaker_id: spk1}
  • wav — audio filename (used to group segments by recording)
  • offset — segment start time in seconds
  • duration — segment duration in seconds
  • speaker_id — (optional) speaker identifier
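
A minimal loading sketch (PyYAML is already a dependency); it mirrors how segments are grouped per recording, though it is not the tool's internal code:

from collections import defaultdict
import yaml

with open("ref_segments.yaml", encoding="utf-8") as f:
    segments = yaml.safe_load(f)

# Group segments by recording; one longform hypothesis is expected per wav.
by_recording = defaultdict(list)
for seg in segments:
    by_recording[seg["wav"]].append((seg["offset"], seg["duration"]))

for wav, spans in by_recording.items():
    print(wav, len(spans), "segments")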

Reference Sentences

One sentence per line, aligned 1:1 with the segmentation entries:

Hello, this is Elena and I will present our work.
We will discuss what lexical borrowing is.

Hypothesis File

JSONL format (--hypothesis_format jsonl, default) — One JSON object per line (SimulEval output format):

{"source": "recording.wav", "prediction": "Hello this is Elena ...", "delays": [4067.0, 4067.0, ...], "elapsed": [4100.0, 4200.0, ...], "source_length": 220000}
  • source — audio filename as a string (e.g., "recording.wav"), or an array with the recording name as the first element (e.g., ["recording.wav"]) for backward compatibility with SimulEval logs
  • prediction — the full hypothesis text
  • delays — (optional) per-token computation-unaware emission timestamps (in ms); length must match the number of words (or characters if --char_level) in prediction
  • elapsed — (optional) per-token computation-aware emission timestamps (in ms)
  • source_length — (optional, but highly recommended for reliable YAAL/LongYAAL latency evaluation) total recording length in ms

The key names for delays and elapsed can be customized with --emission_cu_key and --emission_ca_key.
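
A small sanity check you can run before evaluation, assuming the default key names; each line should carry as many delays as there are prediction words (or characters when --char_level is used):

import json

with open("instances.log", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        inst = json.loads(line)
        n_words = len(inst["prediction"].split())
        delays = inst.get("delays")  # or your --emission_cu_key
        if delays is not None and len(delays) != n_words:
            print(f"line {lineno}: {len(delays)} delays vs {n_words} words")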

Text format (--hypothesis_format text) — One hypothesis per line, matched by order to recordings in the segmentation file. Latency metrics are not available in this mode.

Hello this is Elena and I will present our work.
We will discuss what lexical borrowing is.

SimulStream format (--hypothesis_format simulstream) — Provide the SimulStream evaluation config with --simulstream_config_file. OmniSTEval uses the SimulStream LogReader to extract the final text and per-unit latencies.

  • What is read from the SimulStream log:

    • final_text — the final hypothesis string for the recording (normalized and tokenized according to --char_level).
    • ideal_delays — per-unit computation-unaware delays (SimulStream stores these in seconds). OmniSTEval converts them to milliseconds (cu * 1000) and uses them as emission_cu.
    • computational_aware_delays — per-unit computation-aware delays (seconds), likewise converted to ms and used as emission_ca (see the snippet after these notes).
  • Notes:

    • Install the simulstream package to enable this mode: pip install simulstream.
    • The LogReader is called with latency_unit='char' when --char_level is set, otherwise latency_unit='word'. Make sure tokenization in your config matches the evaluation settings so the number of units equals the number of delays.
    • Provide the SimulStream evaluation config file via --simulstream_config_file so the reader can parse latencies correctly.
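
For illustration only, the seconds-to-milliseconds conversion described above; record stands in for the fields extracted via SimulStream's LogReader, whose API is not reproduced here:

# Hypothetical record with the fields listed above (values illustrative).
record = {
    "final_text": "Hello this is Elena ...",
    "ideal_delays": [4.067, 4.067],            # seconds
    "computational_aware_delays": [4.1, 4.2],  # seconds
}
emission_cu = [cu * 1000.0 for cu in record["ideal_delays"]]
emission_ca = [ca * 1000.0 for ca in record["computational_aware_delays"]]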

Source Sentences (for COMET)

One source-language sentence per line, aligned 1:1 with the segmentation entries (same count as reference sentences):

Hola, soy Elena y presentaré nuestro trabajo.
Discutiremos qué es el préstamo léxico.

Output

Each run prints a human-readable evaluation report to stdout:

================================================================
OmniSTEval v0.1.7  |  Longform evaluation (with resegmentation)
================================================================

Settings
----------------------------------------------------------------
  Hypothesis         simulstream_log.jsonl
  Hypothesis format  simulstream
  Reference          references.txt
  Segmentation       ref_segments.yaml
  Seg. type          speech
  Language           de
  BLEU tokenizer     13a
  Char-level         no
  Offset delays      no
  Fix CA emissions   no
  Metrics            quality, latency
  Version            0.1.7

Scores
----------------------------------------------------------------
  BLEU           26.9845
  chrF           55.2428
  LongYAAL (CU)  2194.0496
  LongAL (CU)    2500.9289
  LongLAAL (CU)  2566.8508
  LongAP (CU)    3.8295
  LongDAL (CU)   3194.7830
  LongYAAL (CA)  2466.8468
  LongAL (CA)    2800.4354
  LongLAAL (CA)  2860.0414
  LongAP (CA)    4.0568
  LongDAL (CA)   3496.1945

Instance-level Details
----------------------------------------------------------------
Empty Predictions: 15
Total Instances:   2641

For long-form evaluation, empty predictions may naturally occur for segments with short references
due to resegmentation, or for segments containing non-speech content such as music or silence.
However, if many segments, or segments with substantial references, have empty predictions,
this may indicate an issue with the SimulST system or the resegmentation.


Instances with empty predictions:
----------------------------------------------------------------
Instance 115 with reference 'Vielen Dank.' has an empty prediction.
Instance 118 with reference 'AB: Vielen Dank.' has an empty prediction.
Instance 150 with reference 'MS: 1,5 Millionen. (DC: Okay.)' has an empty prediction.
Instance 159 with reference 'MK: Ich kann helfen.' has an empty prediction.
Instance 224 with reference 'Mann 3: Ich bin der Schmusetyp.' has an empty prediction.
Instance 315 with reference 'June Cohen: Nun, Morgan, im Namen der Transparenz dies: Was ist nun genau mit den $7100 passiert?' has an empty prediction.
Instance 319 with reference '(Applaus)' has an empty prediction.
Instance 612 with reference '(Musik)' has an empty prediction.
Instance 631 with reference 'Sie können das die ganze Zeit machen.' has an empty prediction.
Instance 1131 with reference '(Musik)' has an empty prediction.
Instance 1133 with reference 'Ich danke Ihnen.' has an empty prediction.
Instance 2005 with reference '♫ ♫ Everybody 's looking forward to the weekend, weekend.' has an empty prediction.
Instance 2006 with reference '♫ ♫ Friday, Friday.' has an empty prediction.
Instance 2040 with reference '(Video) Bear Vasquez: Was bedeutet das?' has an empty prediction.
Instance 2628 with reference '(klirren)' has an empty prediction.
================================================================

If --output_folder is provided, three files are also written:

evaluation_report.txt

The same human-readable report shown on stdout, saved to a file for archival. Contains the version, all settings used, and the metric scores — sufficient to reproduce the reported values.

scores.tsv

A tab-separated file with one score per line — useful for scripting:

metric	value
bleu	22.9418
chrf	52.1659
yaal	2969.2223
long_al	3000.5570
long_laal	3102.8672
long_ap	1.0506
long_dal	4130.1372
ca_yaal	5905.4040
ca_long_al	6100.6042
ca_long_laal	6159.5048
ca_long_ap	1.7176
ca_long_dal	7409.7906
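
For example, a minimal way to load it in Python:

import csv

with open("scores.tsv", encoding="utf-8") as f:
    rows = csv.reader(f, delimiter="\t")
    next(rows)  # skip the "metric <tab> value" header row
    scores = {metric: float(value) for metric, value in rows}

print(scores["bleu"], scores["ca_yaal"])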

instances.resegmented.jsonl (longform resegmentation only)

A JSONL file with one re-segmented instance per line (one per reference segment):

{"index": 0, "docid": 0, "segid": 0, "prediction": "Hello , this is Elena and I will present our work .", "reference": "Hello, this is Elena and I will present our work.", "source_length": 9050.0, "emission_cu": [4067.0, 4067.0, ...], "emission_ca": [4100.0, 4200.0, ...], "time_to_recording_end": 220000.0}
{"index": 1, "docid": 0, "segid": 1, ...}

Each entry contains the hypothesis words assigned to that segment, with emission timestamps offset relative to the segment start. In text-only mode, emission_cu, emission_ca, and time_to_recording_end fields are omitted.
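
A minimal sketch that reproduces the report's empty-prediction listing from this file (field names as in the example line above):

import json

empty = []
with open("instances.resegmented.jsonl", encoding="utf-8") as f:
    for inst in map(json.loads, f):
        if not inst["prediction"].strip():
            empty.append((inst["index"], inst["reference"]))

print(f"Empty Predictions: {len(empty)}")
for index, reference in empty:
    print(f"Instance {index} with reference '{reference}' has an empty prediction.")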

Score columns

Quality (both modes)

  • BLEU — corpus-level SacreBLEU score
  • chrF — corpus-level chrF score
  • COMET — COMET score (only if --comet is enabled)

Latency (shortform — is_longform=False)

  • YAAL (CU) / YAAL (CA) — Yet Another Average Lagging (computation-unaware / computation-aware)
  • AL (CU) / AL (CA) — Average Lagging (adapted from SimulEval; see the sketch after this list)
  • LAAL (CU) / LAAL (CA) — Length-Adaptive Average Lagging (adapted from SimulEval)
  • AP (CU) / AP (CA) — Average Proportion (adapted from SimulEval)
  • DAL (CU) / DAL (CA) — Differentiable Average Lagging (adapted from SimulEval)
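
A hedged sketch of the standard Average Lagging definition these columns build on (OmniSTEval adapts SimulEval's implementation; YAAL's refinements are described in the cited paper):

def average_lagging(delays, src_len_ms):
    # delays: per-word emission times in ms; src_len_ms: source duration in ms.
    tgt_len = len(delays)
    # tau: index (1-based) of the first word emitted at or after source end.
    tau = next((k + 1 for k, d in enumerate(delays) if d >= src_len_ms), tgt_len)
    oracle = src_len_ms / tgt_len  # ideal per-word lag step
    return sum(d - k * oracle for k, d in enumerate(delays[:tau])) / tau

print(average_lagging([500.0, 1200.0, 2500.0, 3100.0], 3000.0))  # 700.0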

Latency (longform — is_longform=True)

Same metrics as above, prefixed with Long: LongYAAL (CU), LongAL (CU), LongLAAL (CU), LongAP (CU), LongDAL (CU), and their (CA) counterparts.

Shortform degeneracy diagnostics (shortform only)

  • Simultaneous Words Fraction (%) — see Shortform Degeneracy Diagnostics
  • Expected Simul. Words Fraction (%) — see above
  • Degeneracy Test Value — signed difference EFSW − SWF
  • Likely Degenerate Simultaneous Policy — YES / NO

Examples

See the examples/ directory for sample input files and expected output:

  • Shortform evaluation (with degeneracy diagnostics): examples/short_form_degenerate_policy_example/
  • Speech resegmentation: examples/speech_resegmentation_example/
  • Text resegmentation: examples/text_resegmentation_example/
  • SimulStream resegmentation example: examples/simulstream_example/

Run an example with:

cd examples/speech_resegmentation_example
bash resegment.sh

Or evaluate a shortform system:

cd examples/short_form_degenerate_policy_example
bash evaluate.sh

Citation

If you use this tool in your research, please cite it as follows:

@article{polak2025better,
  title={Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation},
  author={Pol{\'a}k, Peter and Papi, Sara and Bentivogli, Luisa and Bojar, Ond{\v{r}}ej},
  journal={arXiv preprint arXiv:2509.17349},
  year={2025}
}

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
