Open speech and translation benchmarking toolkit supporting MT, ASR, TTS, SimulST, VC, and paralinguistics with optimized CJK language support
OpenSTBench
English | Chinese
OpenSTBench is an evaluation toolkit centered on translation and speech translation. It provides a unified way to score text translation quality, speech output quality, preservation-related properties, and streaming latency.
What It Can Be Used For
This project is best suited for these directions:
- MT or S2TT text-side evaluation with BLEU, chrF++, COMET, and BLEURT
- S2ST evaluation by combining text quality, speech quality, speaker similarity, and latency
- Streaming or simultaneous speech translation latency evaluation with a custom agent
- Preservation analysis for speech translation outputs, including speaker similarity, emotion, and paralinguistic similarity
- Temporal consistency analysis for speech translation or dubbing outputs, including duration compliance and duration error
Core Modules
| Module | Main Use | Typical Metrics |
|---|---|---|
| `TranslationEvaluator` | Text-side translation quality | sacreBLEU, chrF++, COMET, BLEURT |
| `SpeechQualityEvaluator` | Naturalness and text-speech consistency | UTMOS, WER_Consistency, CER_Consistency |
| `SpeakerSimilarityEvaluator` | Speaker preservation | wavlm_similarity, resemblyzer_similarity |
| `EmotionEvaluator` | Emotion preservation or classification accuracy | Emotion2Vec_Cosine_Similarity, Audio_Emotion_Accuracy |
| `ParalinguisticEvaluator` | Non-verbal and paralinguistic preservation | Paralinguistic_Fidelity_Cosine, Acoustic_Event_Preservation_Rate, Acoustic_Event_Preservation_Macro_F1, Acoustic_Event_Preservation_Macro_Recall, Event_Aligned_Preservation_Rate, Conditional_Relative_Onset_Error |
| `TemporalConsistencyEvaluator` | Source-target temporal structure consistency | Duration_Consistency_SLC_0.2, Duration_Consistency_SLC_0.4 |
| `LatencyEvaluator` | Streaming / simultaneous translation latency | StartOffset, ATD, CustomATD, RTF, Model_Generate_RTF |
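The `WER_Consistency` and `CER_Consistency` metrics compare an ASR transcript of the generated speech against the target text, so their core is an ordinary word- or character-level error rate. A minimal sketch of that computation (the toolkit's own implementation may normalize text differently):

```python
def edit_distance(ref, hyp):
    # Classic single-row dynamic-programming Levenshtein distance over token lists.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: edit distance over whitespace tokens.
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    # Character-level variant, better suited to unsegmented zh/ja/ko text.
    ref, hyp = list(reference.replace(" ", "")), list(hypothesis.replace(" ", ""))
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(wer("this is a test", "this is the test"))  # 0.25
```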
Installation
Basic install:

```bash
pip install OpenSTBench
```

Optional extras:

```bash
pip install "OpenSTBench[comet]"
pip install "OpenSTBench[whisper]"
pip install "OpenSTBench[speech_quality]"
pip install "OpenSTBench[emotion]"
pip install "OpenSTBench[paralinguistics]"
pip install "OpenSTBench[all]"
```

If you need BLEURT:

```bash
pip install git+https://github.com/lucadiliello/bleurt-pytorch.git
```
Import
- PyPI package name: `OpenSTBench`
- Python import name: `openstbench`
Example:
```python
from openstbench import (
    TranslationEvaluator,
    SpeechQualityEvaluator,
    TemporalConsistencyEvaluator,
)
```
Quick Start
Quick-start scripts live under examples/.
Python examples:
- `examples/python/translation_eval.py`
- `examples/python/speech_quality_eval.py`
- `examples/python/speaker_similarity_eval.py`
- `examples/python/emotion_eval.py`
- `examples/python/paralinguistic_eval.py`
- `examples/python/paralinguistic_identity_baseline.py`
- `examples/python/temporal_consistency_eval.py`
- `examples/python/latency_eval.py`

Shell examples:
- `examples/bash/install_extras.sh`
- `examples/bash/run_latency_cli.sh`
Minimal temporal consistency example:
```python
from openstbench import TemporalConsistencyEvaluator

evaluator = TemporalConsistencyEvaluator(
    thresholds=(0.2, 0.4),
)

results, diagnostics = evaluator.evaluate_all(
    source_audio="./source_wavs",
    target_audio="./generated_wavs",
    sample_ids=["sample_1", "sample_2"],
    return_diagnostics=True,
)
```
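The `Duration_Consistency_SLC_*` results are thresholded duration-compliance scores. One plausible reading of such a metric, shown as a self-contained sketch (the package's exact formula may differ, and the durations here are made up):

```python
def slc(source_durations, target_durations, threshold):
    # Fraction of samples whose relative duration error |t - s| / s
    # stays within the given threshold.
    ok = sum(
        abs(t - s) / s <= threshold
        for s, t in zip(source_durations, target_durations)
    )
    return ok / len(source_durations)

src = [2.0, 3.0, 4.0]   # source utterance durations in seconds (illustrative)
tgt = [2.2, 3.9, 4.1]   # generated target durations in seconds (illustrative)

print(slc(src, tgt, 0.2), slc(src, tgt, 0.4))
```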
Latency output distinguishes two RTF variants:
- `Real_Time_Factor_(RTF)`: system-level RTF. This includes agent policy overhead, pre/post-processing, and other runtime costs around model inference.
- `Model_Generate_RTF`: model-level RTF. This is reported only when the agent explicitly records model inference time via `record_model_inference_time(...)` or returns it in `Segment.config["model_inference_time"]`.
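Both variants are the usual ratio of processing time to source-audio duration; the split can be illustrated with a small calculation (all timings below are made up):

```python
audio_duration_s = 10.0   # length of the source audio
wall_clock_s = 4.0        # total system time: policy + pre/post-processing + inference
model_inference_s = 2.5   # time spent inside model generation only

system_rtf = wall_clock_s / audio_duration_s      # corresponds to Real_Time_Factor_(RTF)
model_rtf = model_inference_s / audio_duration_s  # corresponds to Model_Generate_RTF

# An RTF below 1.0 means audio is processed faster than real time.
print(system_rtf, model_rtf)  # 0.4 0.25
```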
Input Conventions
Common text inputs support:
- Python `List[str]`
- `.txt` files with one sample per line
- `.json` files

Common audio inputs support:
- folder path
- Python `List[str]`
- `.txt` files
- `.json` files
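For instance, the `.txt` convention (one sample per line) can be prepared and read back with plain file I/O; the file name and contents below are illustrative:

```python
from pathlib import Path
import tempfile

# Write references in the one-sample-per-line .txt convention.
tmp = Path(tempfile.mkdtemp())
refs_path = tmp / "refs.txt"
refs_path.write_text(
    "This is the first reference.\nThis is the second reference.\n",
    encoding="utf-8",
)

# A consumer of a .txt input reads it back into a List[str] like this.
references = refs_path.read_text(encoding="utf-8").splitlines()
print(references)
```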
Notes
- For `zh`/`ja`/`ko`, the toolkit uses CJK-aware handling for text-side evaluation.
- `SpeechQualityEvaluator` returns `CER_Consistency` for `zh`/`ja`/`ko`, and `WER_Consistency` for most other languages.
- `ParalinguisticEvaluator` always supports `Paralinguistic_Fidelity_Cosine`, a continuous CLAP-based audio similarity score between source and target speech.
- `TemporalConsistencyEvaluator` supports `List[str]`, audio folders, `.txt` path lists, and `.json` path lists for both `source_audio` and `target_audio`.
- `TemporalConsistencyEvaluator` reports thresholded duration compliance metrics (`Duration_Consistency_SLC_*`).
- The discrete preservation branch is an utterance-level single-label task. With source-side gold labels, it reports `Acoustic_Event_Preservation_Rate`, `Acoustic_Event_Preservation_Macro_F1`, and `Acoustic_Event_Preservation_Macro_Recall`.
- If `source_onsets_ms` are available, the evaluator can also report alignment-aware metrics: `Event_Aligned_Preservation_Rate` and `Conditional_Relative_Onset_Error`.
- Alignment is computed on relative onset position, not absolute wall-clock time. This makes it suitable for cross-lingual S2ST, where source and target utterance durations naturally differ.
- If target-side onset timestamps are not provided, the default localizer estimates them with CLAP sliding-window scoring conditioned on the target event label.
- These alignment metrics should be interpreted as weak, coarse-grained alignment signals rather than timestamp-accurate event localization benchmarks.
- If source-side gold labels are not available, the evaluator can still run in prediction-only mode and report `Predicted_Event_Consistency_Rate`, `Predicted_Event_Consistency_Macro_F1`, and `Predicted_Event_Consistency_Macro_Recall`.
- The default discrete predictor is a closed-set CLAP classifier over `candidate_labels`. Users may replace it with any custom predictor object that implements `predict(audio_paths, candidate_labels)`.
- The default event localizer is also replaceable. Custom localizers only need to implement `localize(audio_paths, labels, candidate_labels)`.
- Dataset-specific label mapping is intentionally outside the core package. Pass `candidate_labels` and `label_normalizer` at call time so the same evaluator works across datasets without changing core code.
- For offline environments, `clap_model_path` accepts either a Hugging Face repo id or a local model directory or snapshot.
- Model-loading parameters such as `clap_model_path`, `wavlm_model_path`, `whisper_model`, `e2v_model_path`, `comet_model`, and `bleurt_path` use a consistent local-first rule: if the supplied local path exists, it is used; otherwise the evaluator falls back to the default remote model id.
- In S2S latency evaluation, alignment prefers the model's native transcript when available. If the model is audio-only, the evaluator can optionally use ASR fallback to prepare alignment text.
- For S2S forced alignment, pass language-appropriate MFA models through `alignment_acoustic_model` and `alignment_dictionary_model`. The defaults are English.
- Some modules rely on optional dependencies or local model paths in offline environments.
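The pluggable predictor contract described above is easy to satisfy. Here is a minimal stand-in that always predicts the first candidate label; a real replacement would score each candidate against the audio with a classifier:

```python
class FirstLabelPredictor:
    """Toy predictor satisfying the predict(audio_paths, candidate_labels) contract."""

    def predict(self, audio_paths, candidate_labels):
        # Return one predicted label per input audio file.
        # This toy version ignores the audio entirely.
        return [candidate_labels[0] for _ in audio_paths]

predictor = FirstLabelPredictor()
labels = predictor.predict(["a.wav", "b.wav"], ["laughter", "cough", "none"])
print(labels)  # ['laughter', 'laughter']
```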
License
MIT License