Open speech and translation benchmarking toolkit supporting MT, ASR, TTS, SimulST, VC, and paralinguistics with optimized CJK language support
OpenSTBench
English | Chinese
OpenSTBench is an evaluation toolkit centered on translation and speech translation. It provides a unified way to score text translation quality, speech output quality, preservation-related properties, and streaming latency.
What It Can Be Used For
This project is best suited for these directions:
- MT or S2TT text-side evaluation with BLEU, chrF++, COMET, and BLEURT
- S2ST evaluation by combining text quality, speech quality, speaker similarity, and latency
- Streaming or simultaneous speech translation latency evaluation with a custom agent
- Preservation analysis for speech translation outputs, including speaker similarity, emotion, and paralinguistic similarity
- Temporal consistency analysis for speech translation or dubbing outputs, including duration compliance and duration error
Core Modules
| Module | Main Use | Typical Metrics |
|---|---|---|
| `TranslationEvaluator` | Text-side translation quality | sacreBLEU, chrF++, COMET, BLEURT |
| `SpeechQualityEvaluator` | Naturalness and text-speech consistency | UTMOS, WER_Consistency, CER_Consistency |
| `SpeakerSimilarityEvaluator` | Speaker preservation | wavlm_similarity, resemblyzer_similarity |
| `EmotionEvaluator` | Emotion preservation or classification accuracy | Emotion2Vec_Cosine_Similarity, Audio_Emotion_Accuracy |
| `ParalinguisticEvaluator` | Non-verbal and paralinguistic preservation | Paralinguistic_Fidelity_Cosine, Acoustic_Event_Preservation_Rate, Acoustic_Event_Preservation_Macro_F1, Acoustic_Event_Preservation_Macro_Recall, Event_Aligned_Preservation_Rate, Conditional_Relative_Onset_Error |
| `TemporalConsistencyEvaluator` | Source-target temporal structure consistency | Duration_Consistency_SLC_0.2, Duration_Consistency_SLC_0.4 |
| `LatencyEvaluator` | Streaming / simultaneous translation latency | StartOffset, ATD, CustomATD, RTF, Model_Generate_RTF |
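The `WER_Consistency` and `CER_Consistency` metrics compare an ASR transcript of the generated speech against the target text, so their core is an ordinary word- or character-level error rate. A minimal sketch of that computation (the toolkit's own implementation may normalize text differently):

```python
def edit_distance(ref, hyp):
    # Classic single-row dynamic-programming Levenshtein distance over token lists.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: edit distance over whitespace tokens.
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    # Character-level variant, better suited to unsegmented zh/ja/ko text.
    ref, hyp = list(reference.replace(" ", "")), list(hypothesis.replace(" ", ""))
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(wer("this is a test", "this is the test"))  # 0.25
```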
Installation
Basic install:

```bash
pip install OpenSTBench
```

Optional extras:

```bash
pip install "OpenSTBench[comet]"
pip install "OpenSTBench[whisper]"
pip install "OpenSTBench[speech_quality]"
pip install "OpenSTBench[emotion]"
pip install "OpenSTBench[paralinguistics]"
pip install "OpenSTBench[all]"
```

If you need BLEURT:

```bash
pip install git+https://github.com/lucadiliello/bleurt-pytorch.git
```
Import
- PyPI package name: `OpenSTBench`
- Python import name: `openstbench`
Example:
```python
from openstbench import (
    TranslationEvaluator,
    SpeechQualityEvaluator,
    TemporalConsistencyEvaluator,
)
```
Quick Start
Quick-start scripts live under examples/.
Python examples:
- `examples/python/translation_eval.py`
- `examples/python/speech_quality_eval.py`
- `examples/python/speaker_similarity_eval.py`
- `examples/python/emotion_eval.py`
- `examples/python/paralinguistic_eval.py`
- `examples/python/paralinguistic_identity_baseline.py`
- `examples/python/temporal_consistency_eval.py`
- `examples/python/latency_eval.py`

Shell examples:
- `examples/bash/install_extras.sh`
- `examples/bash/run_latency_cli.sh`
Minimal temporal consistency example:
```python
from openstbench import TemporalConsistencyEvaluator

evaluator = TemporalConsistencyEvaluator(
    thresholds=(0.2, 0.4),
)

results, diagnostics = evaluator.evaluate_all(
    source_audio="./source_wavs",
    target_audio="./generated_wavs",
    sample_ids=["sample_1", "sample_2"],
    return_diagnostics=True,
)
```
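The `Duration_Consistency_SLC_*` results are thresholded duration-compliance scores. One plausible reading of such a metric, shown as a self-contained sketch (the package's exact formula may differ, and the durations here are made up):

```python
def slc(source_durations, target_durations, threshold):
    # Fraction of samples whose relative duration error |t - s| / s
    # stays within the given threshold.
    ok = sum(
        abs(t - s) / s <= threshold
        for s, t in zip(source_durations, target_durations)
    )
    return ok / len(source_durations)

src = [2.0, 3.0, 4.0]   # source utterance durations in seconds (illustrative)
tgt = [2.2, 3.9, 4.1]   # generated target durations in seconds (illustrative)

print(slc(src, tgt, 0.2), slc(src, tgt, 0.4))
```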
Latency output distinguishes two RTF variants:
- `Real_Time_Factor_(RTF)`: system-level RTF. This includes agent policy overhead, pre/post-processing, and other runtime costs around model inference.
- `Model_Generate_RTF`: model-level RTF. This is reported only when the agent explicitly records model inference time via `record_model_inference_time(...)` or returns it in `Segment.config["model_inference_time"]`.
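Both variants are the usual ratio of processing time to source-audio duration; the split can be illustrated with a small calculation (all timings below are made up):

```python
audio_duration_s = 10.0   # length of the source audio
wall_clock_s = 4.0        # total system time: policy + pre/post-processing + inference
model_inference_s = 2.5   # time spent inside model generation only

system_rtf = wall_clock_s / audio_duration_s      # corresponds to Real_Time_Factor_(RTF)
model_rtf = model_inference_s / audio_duration_s  # corresponds to Model_Generate_RTF

# An RTF below 1.0 means audio is processed faster than real time.
print(system_rtf, model_rtf)  # 0.4 0.25
```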
Input Conventions
Common text inputs support:
- Python `List[str]`
- `.txt` files with one sample per line
- `.json` files

Common audio inputs support:
- folder path
- Python `List[str]`
- `.txt` files
- `.json` files
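For instance, the `.txt` convention (one sample per line) can be prepared and read back with plain file I/O; the file name and contents below are illustrative:

```python
from pathlib import Path
import tempfile

# Write references in the one-sample-per-line .txt convention.
tmp = Path(tempfile.mkdtemp())
refs_path = tmp / "refs.txt"
refs_path.write_text(
    "This is the first reference.\nThis is the second reference.\n",
    encoding="utf-8",
)

# A consumer of a .txt input reads it back into a List[str] like this.
references = refs_path.read_text(encoding="utf-8").splitlines()
print(references)
```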
Notes
- For `zh`/`ja`/`ko`, the toolkit uses CJK-aware handling for text-side evaluation.
- `SpeechQualityEvaluator` returns `CER_Consistency` for `zh`/`ja`/`ko`, and `WER_Consistency` for most other languages.
- `ParalinguisticEvaluator` always supports `Paralinguistic_Fidelity_Cosine`, a continuous CLAP-based audio similarity score between source and target speech.
- `TemporalConsistencyEvaluator` supports `List[str]`, audio folders, `.txt` path lists, and `.json` path lists for both `source_audio` and `target_audio`.
- `TemporalConsistencyEvaluator` reports thresholded duration compliance metrics (`Duration_Consistency_SLC_*`).
- The discrete preservation branch is an utterance-level single-label task. With source-side gold labels, it reports `Acoustic_Event_Preservation_Rate`, `Acoustic_Event_Preservation_Macro_F1`, and `Acoustic_Event_Preservation_Macro_Recall`.
- If `source_onsets_ms` are available, the evaluator can also report alignment-aware metrics: `Event_Aligned_Preservation_Rate` and `Conditional_Relative_Onset_Error`.
- Alignment is computed on relative onset position, not absolute wall-clock time. This makes it suitable for cross-lingual S2ST, where source and target utterance durations naturally differ.
- If target-side onset timestamps are not provided, the default localizer estimates them with CLAP sliding-window scoring conditioned on the target event label.
- These alignment metrics should be interpreted as weak, coarse-grained alignment signals rather than timestamp-accurate event localization benchmarks.
- If source-side gold labels are not available, the evaluator can still run in prediction-only mode and report `Predicted_Event_Consistency_Rate`, `Predicted_Event_Consistency_Macro_F1`, and `Predicted_Event_Consistency_Macro_Recall`.
- The default discrete predictor is a closed-set CLAP classifier over `candidate_labels`. Users may replace it with any custom predictor object that implements `predict(audio_paths, candidate_labels)`.
- The default event localizer is also replaceable. Custom localizers only need to implement `localize(audio_paths, labels, candidate_labels)`.
- Dataset-specific label mapping is intentionally outside the core package. Pass `candidate_labels` and `label_normalizer` at call time so the same evaluator works across datasets without changing core code.
- For offline environments, `clap_model_path` accepts either a Hugging Face repo id or a local model directory or snapshot.
- Model-loading parameters such as `clap_model_path`, `wavlm_model_path`, `whisper_model`, `e2v_model_path`, `comet_model`, and `bleurt_path` use a consistent local-first rule: if the supplied local path exists, it is used; otherwise the evaluator falls back to the default remote model id.
- In S2S latency evaluation, alignment prefers the model's native transcript when available. If the model is audio-only, the evaluator can optionally use ASR fallback to prepare alignment text.
- For S2S forced alignment, pass language-appropriate MFA models through `alignment_acoustic_model` and `alignment_dictionary_model`. The defaults are English.
- Some modules rely on optional dependencies or local model paths in offline environments.
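The pluggable predictor contract described above is easy to satisfy. Here is a minimal stand-in that always predicts the first candidate label; a real replacement would score each candidate against the audio with a classifier:

```python
class FirstLabelPredictor:
    """Toy predictor satisfying the predict(audio_paths, candidate_labels) contract."""

    def predict(self, audio_paths, candidate_labels):
        # Return one predicted label per input audio file.
        # This toy version ignores the audio entirely.
        return [candidate_labels[0] for _ in audio_paths]

predictor = FirstLabelPredictor()
labels = predictor.predict(["a.wav", "b.wav"], ["laughter", "cough", "none"])
print(labels)  # ['laughter', 'laughter']
```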
License
MIT License