Evaluate simultaneous speech/text translation systems (shortform and longform) with quality (BLEU, chrF, COMET) and latency (YAAL) metrics. For longform, re-segments outputs to match reference segmentation.

OmniSTEval

A tool for evaluating simultaneous speech/text translation systems — both shortform (segment-level) and longform (document-level with resegmentation).

For longform systems, OmniSTEval re-segments the translation outputs to match reference segmentation, enabling segment-level quality (BLEU, chrF, COMET) and latency (YAAL, LongYAAL, etc.) evaluation.

Implements YAAL, LongYAAL, and the SoftSegmenter alignment algorithm from the paper Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation.

How It Works

Shortform evaluation

For the shortform case (i.e., when the input speech is pre-segmented at sentence boundaries into individual segments), OmniSTEval evaluates quality and latency metrics directly on the per-segment outputs, without any resegmentation.

In addition to the standard quality metrics (BLEU, chrF, COMET) and latency metrics (YAAL, AL, LAAL, AP, DAL), shortform evaluation computes a set of degeneracy diagnostics that help detect whether a system has learned a degenerate simultaneous policy — see Shortform Degeneracy Diagnostics below.
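
Quality scores come from SacreBLEU. The snippet below is not OmniSTEval's internal code, just a minimal standalone equivalent of the corpus-level calls (the tokenizer corresponds to --bleu_tokenizer):

import sacrebleu

hypotheses = ["Hello this is Elena and I will present our work ."]
references = [["Hello, this is Elena and I will present our work."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="13a")
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU {bleu.score:.4f}  chrF {chrf.score:.4f}")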

Longform evaluation (with resegmentation)

Simultaneous speech translation systems (e.g., those evaluated via SimulEval or StreamEval) produce a single long-form output per audio recording. However, most evaluation metrics (like BLEU and YAAL) are designed for segmented inputs. OmniSTEval takes the reference speech segmentation and aligns the hypothesis words (including their emission timestamps) to it, effectively re-segmenting the hypothesis according to the reference segments.

The pipeline consists of the following steps:

  1. Tokenization — Reference and hypothesis words are optionally tokenized using the Moses tokenizer. For Chinese/Japanese (or when --lang is not set), tokenization is skipped and character-level units are used instead.
  2. Alignment — A dynamic programming algorithm (similar to Needleman-Wunsch / DTW, but without gap penalties) aligns hypothesis words to reference words, maximizing a Jaccard-based character-set similarity at word level, or exact match at character level. Punctuation is prevented from aligning with non-punctuation tokens, which helps mitigate segmentation errors around sentence boundaries (see the sketch after this list).
  3. Re-segmentation — Aligned hypothesis words are grouped by their assigned reference segment IDs, producing one hypothesis segment per reference segment.
  4. Evaluation — Each re-segmented instance is scored with the enabled metrics:
    • Quality: BLEU and chrF via SacreBLEU, optionally COMET
    • Latency: LongYAAL (Yet Another Average Lagging) — both computation-aware and computation-unaware variants, plus LongAL, LongLAAL, LongAP, and LongDAL (adapted from SimulEval)
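
A minimal sketch of the alignment idea in step 2 (illustrative only: the actual SoftSegmenter also enforces the punctuation constraint and supports character-level matching, both omitted here):

def jaccard(a, b):
    # Character-set similarity between two words.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa or sb else 0.0

def align(hyp, ref):
    # Monotone DP alignment maximizing total similarity; skipping a word
    # on either side incurs no gap penalty.
    n, m = len(hyp), len(ref)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j], back[i][j] = max(
                (score[i - 1][j], (i - 1, j)),  # skip hypothesis word
                (score[i][j - 1], (i, j - 1)),  # skip reference word
                (score[i - 1][j - 1] + jaccard(hyp[i - 1], ref[j - 1]),
                 (i - 1, j - 1)),               # align the two words
            )
    pairs, (i, j) = [], (n, m)
    while i > 0 and j > 0:
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):
            pairs.append((i - 1, j - 1))  # hyp index <-> ref index
        i, j = pi, pj
    return pairs[::-1]

print(align("hello this is elena".split(), "Hello , this is Elena".split()))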

Shortform Degeneracy Diagnostics

When running omnisteval shortform, three additional percent-scale metrics and a boolean flag are computed to help identify systems that have learned a degenerate simultaneous policy — i.e., a policy that appears simultaneous (outputs arrive before end-of-segment) without actually being driven by source content. A computation sketch follows the interpretation notes below.

Metric | Display name | Description
swf | Simultaneous Words Fraction (%) | Fraction of output words emitted before the end-of-segment signal. Corpus micro-average: 100 × Σ_i |{d ∈ delays_i : d < src_len_i}| / Σ_i |delays_i|.
efsw | Expected Simul. Words Fraction (%) | Expected fraction of output words emitted before the end-of-segment signal, assuming each segment's words lag the source by its average latency (YAAL_i): 100 × Σ_i max(0, src_len_i − YAAL_i) / Σ_i src_len_i.
dsptv | Degeneracy Test Value | Signed difference EFSW − SWF. A large positive value means the system emits a few words with very low latency while translating the rest of the segment after the end-of-segment signal.
degenerate_policy | Likely Degenerate Simultaneous Policy | YES when |DSPTV| > 20, NO otherwise. When YES, an additional warning banner is printed.

Interpretation:

  • Both SWF and EFSW measure how simultaneous the system really is, but from different angles: SWF is purely empirical, while EFSW is derived from YAAL.
  • A well-behaved simultaneous system should have SWF ≈ EFSW, giving a DSPTV near 0.
  • A large |DSPTV| (> 20 pp) suggests the system's emission pattern is inconsistent with its latency profile — a common signature of degenerate strategies such as outputting all words at the very start or end of each segment.
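
The three diagnostics can be reproduced with a short sketch following the formulas above (field names here are illustrative, not OmniSTEval's internal ones):

def degeneracy_diagnostics(instances):
    # instances: per-segment dicts with emission delays (ms),
    # source length (ms), and the segment-level YAAL (ms).
    simul_words = total_words = 0
    expected_ms = total_ms = 0.0
    for inst in instances:
        delays, src_len = inst["delays"], inst["src_len"]
        simul_words += sum(1 for d in delays if d < src_len)  # SWF numerator
        total_words += len(delays)
        expected_ms += max(0.0, src_len - inst["yaal"])       # EFSW numerator
        total_ms += src_len
    swf = 100.0 * simul_words / total_words
    efsw = 100.0 * expected_ms / total_ms
    dsptv = efsw - swf                        # Degeneracy Test Value
    return swf, efsw, dsptv, abs(dsptv) > 20  # flag: likely degenerate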

Example report output (degenerate case):

================================================================
OmniSTEval v0.1.7  |  Shortform evaluation
================================================================

Settings
----------------------------------------------------------------
  Hypothesis        instances.log
  Reference         references.txt
  BLEU tokenizer    13a
  Char-level        no
  Fix CA emissions  no
  Metrics           quality, latency
  Version           0.1.7

Scores
----------------------------------------------------------------
  BLEU                                   18.2271
  chrF                                   44.5324
  YAAL (CU)                              1135.6097
  AL (CU)                                1803.9192
  LAAL (CU)                              1857.7128
  AP (CU)                                0.7948
  DAL (CU)                               3532.4812
  YAAL (CA)                              1272.7485
  AL (CA)                                2021.1781
  LAAL (CA)                              2071.7031
  AP (CA)                                0.8903
  DAL (CA)                               3883.0303
  Simultaneous Words Fraction (%)        32.9060
  Expected Simul. Words Fraction (%)     81.1100
  Degeneracy Test Value                  48.2040
  Likely Degenerate Simultaneous Policy  YES

  *** Likely Degenerate Simultaneous Policy ***

================================================================

Installation

pip install OmniSTEval

Or install from source:

git clone https://github.com/pe-trik/omnisteval.git
cd omnisteval
pip install -e .

For COMET scoring support:

pip install OmniSTEval[comet]

For SimulStream log support:

pip install OmniSTEval[simulstream]

Requirements

  • Python 3.8+
  • mosestokenizer>=1.2.1
  • PyYAML>=6.0.3
  • sacrebleu>=2.5.1
  • (Optional) unbabel-comet — for COMET scoring
  • (Optional) simulstream — for reading SimulStream logs

Usage

OmniSTEval provides two subcommands: shortform and longform.

Shortform evaluation

Evaluate a segment-level (shortform) SimulEval output directly:

omnisteval shortform \
  --hypothesis_file instances.log \
  --ref_sentences_file reference_sentences.txt \
  --bleu_tokenizer 13a \
  --output_folder evaluation_output

Longform evaluation with speech resegmentation

Re-segment a long-form hypothesis to match the reference speech segmentation, then evaluate:

omnisteval longform \
  --speech_segmentation ref_segments.yaml \
  --ref_sentences_file reference_sentences.txt \
  --hypothesis_file simuleval_instance_file.log \
  --lang en \
  --bleu_tokenizer 13a \
  --output_folder segmentation_output

Longform evaluation with text resegmentation

Re-segment based on text-level document/segment IDs (no latency metrics):

omnisteval longform \
  --text_segmentation text_segmentation.txt \
  --ref_sentences_file reference_sentences.txt \
  --hypothesis_file hypotheses.txt \
  --hypothesis_format text \
  --lang en \
  --output_folder segmentation_output

Longform evaluation with SimulStream logs

If you have a SimulStream logfile (streaming outputs), use --hypothesis_format simulstream and provide the SimulStream evaluation config file with --simulstream_config_file.

omnisteval longform \
  --speech_segmentation ref_segments.yaml \
  --ref_sentences_file references.txt \
  --hypothesis_file simulstream_log.jsonl \
  --simulstream_config_file cfg.yaml \
  --hypothesis_format simulstream \
  --lang de \
  --bleu_tokenizer 13a \
  --output_folder segmentation_output

Evaluate a pre-resegmented log

If you already have a resegmented JSONL file (e.g., from a previous longform run), you can evaluate it directly:

omnisteval longform \
  --resegmented_hypothesis instances.resegmented.jsonl \
  --bleu_tokenizer 13a \
  --output_folder evaluation_output

With COMET scoring

omnisteval longform \
  --speech_segmentation ref_segments.yaml \
  --ref_sentences_file reference_sentences.txt \
  --hypothesis_file simuleval_instance_file.log \
  --source_sentences_file source_sentences.txt \
  --comet \
  --lang en \
  --output_folder segmentation_output

Custom emission timestamp field names

If your JSONL hypothesis uses different keys for emission timestamps:

omnisteval longform \
  --speech_segmentation ref_segments.yaml \
  --ref_sentences_file reference_sentences.txt \
  --hypothesis_file hypothesis.log \
  --emission_cu_key my_delays \
  --emission_ca_key my_elapsed \
  --lang en \
  --output_folder segmentation_output

Arguments

Common arguments (both subcommands)

Argument | Required | Default | Description
--output_folder | No | — | Directory where output files are written. When omitted, the evaluation report is printed to stdout only and no files are saved.
--char_level | one of these two | False | Use character-level alignment and scoring instead of word-level.
--word_level | one of these two | False | Use word-level alignment and scoring instead of character-level.
--no_quality | No | False | Disable quality metrics (BLEU, chrF, COMET).
--no_latency | No | False | Disable latency metrics (YAAL). Automatically set for text-only hypotheses.
--comet | No | False | Enable COMET scoring. Requires --source_sentences_file and unbabel-comet.
--comet_model | No | Unbabel/wmt22-comet-da | COMET model name.
--bleu_tokenizer | No | 13a | Tokenizer for SacreBLEU (e.g., 13a, intl, ja-mecab, zh).
--source_sentences_file | No | — | Path to the source sentences file (one per segment, for COMET scoring).
--emission_cu_key | No | delays | JSON key for computation-unaware emission timestamps in the JSONL hypothesis.
--emission_ca_key | No | elapsed | JSON key for computation-aware emission timestamps in the JSONL hypothesis.
--fix_simuleval_emission_ca | No | False | Fix computation-aware emission timestamps for CA-YAAL.

shortform arguments

Argument | Required | Default | Description
--hypothesis_file | Yes | — | Path to the JSONL hypothesis file (one JSON object per line with prediction, delays, etc.).
--ref_sentences_file | Yes | — | Path to the reference sentences file (one sentence per line).

longform arguments

Argument | Required | Default | Description
--speech_segmentation | one of these three | — | Path to a YAML/JSON speech segmentation file. Mutually exclusive with --text_segmentation and --resegmented_hypothesis.
--text_segmentation | one of these three | — | Path to a text segmentation file (docid=DOC_ID,segid=SEG_ID format). Mutually exclusive with --speech_segmentation and --resegmented_hypothesis.
--resegmented_hypothesis | one of these three | — | Path to a pre-resegmented JSONL file. Mutually exclusive with the segmentation inputs.
--ref_sentences_file | for resegmentation | — | Path to the reference sentences file. Required in resegmentation mode.
--hypothesis_file | for resegmentation | — | Path to the hypothesis file. Required in resegmentation mode.
--hypothesis_format | No | jsonl | Format of the hypothesis file: jsonl (SimulEval/JSONL output), text, or simulstream (SimulStream logfile; requires the simulstream package).
--simulstream_config_file | No | — | Path to a SimulStream evaluation config YAML file. Required when --hypothesis_format=simulstream.
--lang | No | None | Language code for the Moses tokenizer (e.g., en, de).
--offset_delays | No | False | Offset delays relative to the first segment of each recording.

Input Formats

Text Segmentation

A plain-text file with one entry per line in the format docid=DOC_ID,segid=SEG_ID, where DOC_ID and SEG_ID are 0-based integers. One line per reference sentence. The number of unique document IDs must equal the number of hypothesis lines.

docid=0,segid=0
docid=0,segid=1
docid=0,segid=2
docid=1,segid=0
docid=1,segid=1
  • docid — 0-based document index (maps to hypothesis line number)
  • segid — 0-based segment index within the document
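
A hypothetical snippet that writes a matching pair of files in this format (document and sentence contents are illustrative):

docs = [
    ["Hello, this is Elena and I will present our work.",
     "We will discuss what lexical borrowing is."],
    ["Second talk starts here.",
     "And continues."],
]
# One hypothesis line will be expected per document (docid).
with open("text_segmentation.txt", "w", encoding="utf-8") as seg, \
     open("reference_sentences.txt", "w", encoding="utf-8") as ref:
    for docid, sentences in enumerate(docs):
        for segid, sentence in enumerate(sentences):
            seg.write(f"docid={docid},segid={segid}\n")
            ref.write(sentence + "\n")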

Speech Segmentation (YAML/JSON)

A list of segments, each with the following fields:

- {wav: recording.wav, offset: 2.433, duration: 9.05, speaker_id: spk1}
- {wav: recording.wav, offset: 15.003, duration: 9.675, speaker_id: spk1}
  • wav — audio filename (used to group segments by recording)
  • offset — segment start time in seconds
  • duration — segment duration in seconds
  • speaker_id — (optional) speaker identifier
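
A minimal loading sketch (PyYAML is already a dependency); it mirrors how segments are grouped per recording, though it is not the tool's internal code:

from collections import defaultdict
import yaml

with open("ref_segments.yaml", encoding="utf-8") as f:
    segments = yaml.safe_load(f)

# Group segments by recording; one longform hypothesis is expected per wav.
by_recording = defaultdict(list)
for seg in segments:
    by_recording[seg["wav"]].append((seg["offset"], seg["duration"]))

for wav, spans in by_recording.items():
    print(wav, len(spans), "segments")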

Reference Sentences

One sentence per line, aligned 1:1 with the segmentation entries:

Hello, this is Elena and I will present our work.
We will discuss what lexical borrowing is.

Hypothesis File

JSONL format (--hypothesis_format jsonl, default) — One JSON object per line (SimulEval output format):

{"source": "recording.wav", "prediction": "Hello this is Elena ...", "delays": [4067.0, 4067.0, ...], "elapsed": [4100.0, 4200.0, ...], "source_length": 220000}
  • source — audio filename as a string (e.g., "recording.wav"), or an array with the recording name as the first element (e.g., ["recording.wav"]) for backward compatibility with SimulEval logs
  • prediction — the full hypothesis text
  • delays — (optional) per-token computation-unaware emission timestamps (in ms); length must match the number of words (or characters if --char_level) in prediction
  • elapsed — (optional) per-token computation-aware emission timestamps (in ms)
  • source_length — (optional, but highly recommended for reliable YAAL/LongYAAL latency evaluation) total recording length in ms

The key names for delays and elapsed can be customized with --emission_cu_key and --emission_ca_key.
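
A small sanity check you can run before evaluation, assuming the default key names; each line should carry as many delays as there are prediction words (or characters when --char_level is used):

import json

with open("instances.log", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        inst = json.loads(line)
        n_words = len(inst["prediction"].split())
        delays = inst.get("delays")  # or your --emission_cu_key
        if delays is not None and len(delays) != n_words:
            print(f"line {lineno}: {len(delays)} delays vs {n_words} words")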

Text format (--hypothesis_format text) — One hypothesis per line, matched by order to recordings in the segmentation file. Latency metrics are not available in this mode.

Hello this is Elena and I will present our work.
We will discuss what lexical borrowing is.

SimulStream format (--hypothesis_format simulstream) — Provide the SimulStream evaluation config with --simulstream_config_file. OmniSTEval uses the SimulStream LogReader to extract the final text and per-unit latencies.

  • What is read from the SimulStream log:

    • final_text — the final hypothesis string for the recording (normalized and tokenized according to --char_level).
    • ideal_delays — per-unit computation-unaware delays (SimulStream stores these in seconds). OmniSTEval converts them to milliseconds (cu * 1000) and uses them as emission_cu.
    • computational_aware_delays — per-unit computation-aware delays (seconds), likewise converted to ms and used as emission_ca (see the snippet after these notes).
  • Notes:

    • Install the simulstream package to enable this mode: pip install simulstream.
    • The LogReader is called with latency_unit='char' when --char_level is set, otherwise latency_unit='word'. Make sure tokenization in your config matches the evaluation settings so the number of units equals the number of delays.
    • Provide the SimulStream evaluation config file via --simulstream_config_file so the reader can parse latencies correctly.
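
For illustration only, the seconds-to-milliseconds conversion described above; record stands in for the fields extracted via SimulStream's LogReader, whose API is not reproduced here:

# Hypothetical record with the fields listed above (values illustrative).
record = {
    "final_text": "Hello this is Elena ...",
    "ideal_delays": [4.067, 4.067],            # seconds
    "computational_aware_delays": [4.1, 4.2],  # seconds
}
emission_cu = [cu * 1000.0 for cu in record["ideal_delays"]]
emission_ca = [ca * 1000.0 for ca in record["computational_aware_delays"]]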

Source Sentences (for COMET)

One source-language sentence per line, aligned 1:1 with the segmentation entries (same count as reference sentences):

Hola, soy Elena y presentaré nuestro trabajo.
Discutiremos qué es el préstamo léxico.

Output

Each run prints a human-readable evaluation report to stdout:

================================================================
OmniSTEval v0.1.7  |  Longform evaluation (with resegmentation)
================================================================

Settings
----------------------------------------------------------------
  Hypothesis         simulstream_log.jsonl
  Hypothesis format  simulstream
  Reference          references.txt
  Segmentation       ref_segments.yaml
  Seg. type          speech
  Language           de
  BLEU tokenizer     13a
  Char-level         no
  Offset delays      no
  Fix CA emissions   no
  Metrics            quality, latency
  Version            0.1.7

Scores
----------------------------------------------------------------
  BLEU           26.9845
  chrF           55.2428
  LongYAAL (CU)  2194.0496
  LongAL (CU)    2500.9289
  LongLAAL (CU)  2566.8508
  LongAP (CU)    3.8295
  LongDAL (CU)   3194.7830
  LongYAAL (CA)  2466.8468
  LongAL (CA)    2800.4354
  LongLAAL (CA)  2860.0414
  LongAP (CA)    4.0568
  LongDAL (CA)   3496.1945

Instance-level Details
----------------------------------------------------------------
Empty Predictions: 15
Total Instances:   2641

For long-form evaluation, empty predictions may naturally occur for segments with short references
due to resegmentation, or for segments containing non-speech content such as music or silence.
However, if many segments, or segments with substantial references, have empty predictions,
this may indicate an issue with the SimulST system or the resegmentation.


Instances with empty predictions:
----------------------------------------------------------------
Instance 115 with reference 'Vielen Dank.' has an empty prediction.
Instance 118 with reference 'AB: Vielen Dank.' has an empty prediction.
Instance 150 with reference 'MS: 1,5 Millionen. (DC: Okay.)' has an empty prediction.
Instance 159 with reference 'MK: Ich kann helfen.' has an empty prediction.
Instance 224 with reference 'Mann 3: Ich bin der Schmusetyp.' has an empty prediction.
Instance 315 with reference 'June Cohen: Nun, Morgan, im Namen der Transparenz dies: Was ist nun genau mit den $7100 passiert?' has an empty prediction.
Instance 319 with reference '(Applaus)' has an empty prediction.
Instance 612 with reference '(Musik)' has an empty prediction.
Instance 631 with reference 'Sie können das die ganze Zeit machen.' has an empty prediction.
Instance 1131 with reference '(Musik)' has an empty prediction.
Instance 1133 with reference 'Ich danke Ihnen.' has an empty prediction.
Instance 2005 with reference '♫ ♫ Everybody 's looking forward to the weekend, weekend.' has an empty prediction.
Instance 2006 with reference '♫ ♫ Friday, Friday.' has an empty prediction.
Instance 2040 with reference '(Video) Bear Vasquez: Was bedeutet das?' has an empty prediction.
Instance 2628 with reference '(klirren)' has an empty prediction.
================================================================

If --output_folder is provided, three files are also written:

evaluation_report.txt

The same human-readable report shown on stdout, saved to a file for archival. Contains the version, all settings used, and the metric scores — sufficient to reproduce the reported values.

scores.tsv

A tab-separated file with one score per line — useful for scripting:

metric	value
bleu	22.9418
chrf	52.1659
yaal	2969.2223
long_al	3000.5570
long_laal	3102.8672
long_ap	1.0506
long_dal	4130.1372
ca_yaal	5905.4040
ca_long_al	6100.6042
ca_long_laal	6159.5048
ca_long_ap	1.7176
ca_long_dal	7409.7906
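
For example, a minimal way to load it in Python:

import csv

with open("scores.tsv", encoding="utf-8") as f:
    rows = csv.reader(f, delimiter="\t")
    next(rows)  # skip the "metric <tab> value" header row
    scores = {metric: float(value) for metric, value in rows}

print(scores["bleu"], scores["ca_yaal"])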

instances.resegmented.jsonl (longform resegmentation only)

A JSONL file with one re-segmented instance per line (one per reference segment):

{"index": 0, "docid": 0, "segid": 0, "prediction": "Hello , this is Elena and I will present our work .", "reference": "Hello, this is Elena and I will present our work.", "source_length": 9050.0, "emission_cu": [4067.0, 4067.0, ...], "emission_ca": [4100.0, 4200.0, ...], "time_to_recording_end": 220000.0}
{"index": 1, "docid": 0, "segid": 1, ...}

Each entry contains the hypothesis words assigned to that segment, with emission timestamps offset relative to the segment start. In text-only mode, emission_cu, emission_ca, and time_to_recording_end fields are omitted.
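
A minimal sketch that reproduces the report's empty-prediction listing from this file (field names as in the example line above):

import json

empty = []
with open("instances.resegmented.jsonl", encoding="utf-8") as f:
    for inst in map(json.loads, f):
        if not inst["prediction"].strip():
            empty.append((inst["index"], inst["reference"]))

print(f"Empty Predictions: {len(empty)}")
for index, reference in empty:
    print(f"Instance {index} with reference '{reference}' has an empty prediction.")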

Score columns

Quality (both modes)

  • BLEU — corpus-level SacreBLEU score
  • chrF — corpus-level chrF score
  • COMET — COMET score (only if --comet is enabled)

Latency (shortform — is_longform=False)

  • YAAL (CU) / YAAL (CA) — Yet Another Average Lagging (computation-unaware / computation-aware)
  • AL (CU) / AL (CA) — Average Lagging (adapted from SimulEval; see the sketch after this list)
  • LAAL (CU) / LAAL (CA) — Length-Adaptive Average Lagging (adapted from SimulEval)
  • AP (CU) / AP (CA) — Average Proportion (adapted from SimulEval)
  • DAL (CU) / DAL (CA) — Differentiable Average Lagging (adapted from SimulEval)
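
A hedged sketch of the standard Average Lagging definition these columns build on (OmniSTEval adapts SimulEval's implementation; YAAL's refinements are described in the cited paper):

def average_lagging(delays, src_len_ms):
    # delays: per-word emission times in ms; src_len_ms: source duration in ms.
    tgt_len = len(delays)
    # tau: index (1-based) of the first word emitted at or after source end.
    tau = next((k + 1 for k, d in enumerate(delays) if d >= src_len_ms), tgt_len)
    oracle = src_len_ms / tgt_len  # ideal per-word lag step
    return sum(d - k * oracle for k, d in enumerate(delays[:tau])) / tau

print(average_lagging([500.0, 1200.0, 2500.0, 3100.0], 3000.0))  # 700.0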

Latency (longform — is_longform=True)

Same metrics as above, prefixed with Long: LongYAAL (CU), LongAL (CU), LongLAAL (CU), LongAP (CU), LongDAL (CU), and their (CA) counterparts.

Shortform degeneracy diagnostics (shortform only)

  • Simultaneous Words Fraction (%) — see Shortform Degeneracy Diagnostics
  • Expected Simul. Words Fraction (%) — see above
  • Degeneracy Test Value — signed difference EFSW − SWF
  • Likely Degenerate Simultaneous Policy — YES / NO

Examples

See the examples/ directory for sample input files and expected output:

  • Shortform evaluation (with degeneracy diagnostics): examples/short_form_degenerate_policy_example/
  • Speech resegmentation: examples/speech_resegmentation_example/
  • Text resegmentation: examples/text_resegmentation_example/
  • SimulStream resegmentation example: examples/simulstream_example/

Run an example with:

cd examples/speech_resegmentation_example
bash resegment.sh

Or evaluate a shortform system:

cd examples/short_form_degenerate_policy_example
bash evaluate.sh

Citation

If you use this tool in your research, please cite it as follows:

@article{polak2025better,
  title={Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation},
  author={Pol{\'a}k, Peter and Papi, Sara and Bentivogli, Luisa and Bojar, Ond{\v{r}}ej},
  journal={arXiv preprint arXiv:2509.17349},
  year={2025}
}

Contributing

Contributions are welcome! Please open an issue or submit a pull request.
