# MultiMetric-Eval

English | 中文

Multi-metric evaluation toolkit supporting MT, ASR, TTS, SimulST, VC, and Paralinguistics, with optimized CJK language support.
MultiMetric-Eval is an evaluation toolkit centered on translation and speech translation. It provides a unified way to score text translation quality, speech output quality, preservation-related properties, and streaming latency.
## What It Can Be Used For

This project is best suited for the following use cases:

- MT or S2TT text-side evaluation with `BLEU`, `chrF++`, `COMET`, and `BLEURT`
- S2ST evaluation by combining text quality, speech quality, speaker similarity, and latency
- Streaming or simultaneous speech translation latency evaluation with a custom agent
- Preservation analysis for speech translation outputs, including speaker similarity, emotion, and paralinguistic similarity
## Capability Boundary
MultiMetric-Eval is an evaluator, not a model training or inference framework.
It is a good fit when you already have model outputs and want to score them in a consistent way.
It is not designed to be:
- a general-purpose ASR toolkit
- a general-purpose TTS toolkit
- a model serving framework
- a replacement for task-specific toolkits in unrelated speech domains
## Core Modules

| Module | Main Use | Typical Metrics |
|---|---|---|
| `TranslationEvaluator` | Text-side translation quality | sacreBLEU, chrF++, COMET, BLEURT |
| `SpeechQualityEvaluator` | Naturalness and text-speech consistency | UTMOS, WER_Consistency, CER_Consistency |
| `SpeakerSimilarityEvaluator` | Speaker preservation | wavlm_similarity, resemblyzer_similarity |
| `EmotionEvaluator` | Emotion preservation or classification accuracy | Emotion2Vec_Cosine_Similarity, Audio_Emotion_Accuracy |
| `ParalinguisticEvaluator` | Non-verbal and paralinguistic similarity | Paralinguistic_Fidelity_Cosine, Discrete_Acoustic_Event_F1_Strict, Discrete_Acoustic_Event_F1_Relaxed |
| `LatencyEvaluator` | Streaming / simultaneous translation latency | StartOffset, ATD, CustomATD, RTF, Model_Generate_RTF |
## Installation

Basic install:

```bash
pip install multimetriceval
```

Optional extras:

```bash
pip install "multimetriceval[comet]"
pip install "multimetriceval[whisper]"
pip install "multimetriceval[emotion]"
pip install "multimetriceval[paralinguistics]"
pip install "multimetriceval[all]"
```
If you need BLEURT:

```bash
pip install git+https://github.com/lucadiliello/bleurt-pytorch.git
```
## Import

PyPI package name: `multimetriceval`

Python import name: `multimetric_eval`

Example:

```python
from multimetric_eval import TranslationEvaluator, SpeechQualityEvaluator
```
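A minimal usage sketch follows; the constructor arguments and `evaluate(...)` call shape here are assumptions for illustration, not the exact API. See the scripts under `examples/` for the real entry points.

```python
from multimetric_eval import TranslationEvaluator

# Hypothetical call shape; see examples/python/translation_eval.py
# for the actual constructor arguments and method names.
evaluator = TranslationEvaluator(metrics=["bleu", "chrf"])
scores = evaluator.evaluate(
    hypotheses=["Hello world"],      # system outputs
    references=["Hello, world!"],    # reference translations
)
print(scores)
```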
## Quick Start

Quick-start scripts live under `examples/`.

Python examples:

- `examples/python/translation_eval.py`
- `examples/python/speech_quality_eval.py`
- `examples/python/speaker_similarity_eval.py`
- `examples/python/emotion_eval.py`
- `examples/python/paralinguistic_eval.py`
- `examples/python/paralinguistic_identity_baseline.py`
- `examples/python/latency_eval.py`

Shell examples:

- `examples/bash/install_extras.sh`
- `examples/bash/run_latency_cli.sh`
Latency output now distinguishes two RTF variants:
- `Real_Time_Factor_(RTF)`: system-level RTF. This includes agent policy overhead, pre/post-processing, and other runtime costs around model inference.
- `Model_Generate_RTF`: model-level RTF. This is reported only when the agent explicitly records model inference time via `record_model_inference_time(...)` or returns it in `Segment.config["model_inference_time"]`; a recording sketch follows this list.
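The sketch below shows how a custom agent might report model-level time. Only `record_model_inference_time(...)` and `Segment.config["model_inference_time"]` are named by the toolkit; the surrounding agent structure, and where exactly the recording hook is exposed, are assumptions.

```python
import time

def generate_with_timing(agent, segment):
    """Illustrative helper: time only the model call so the evaluator
    can report Model_Generate_RTF separately from system-level RTF."""
    start = time.perf_counter()
    output = agent.model.generate(segment.content)  # model inference only
    elapsed = time.perf_counter() - start

    # Option A: the explicit recording hook named above (assumed to be
    # provided by the agent/evaluator integration):
    agent.record_model_inference_time(elapsed)

    # Option B: attach the timing to the segment the agent returns:
    segment.config["model_inference_time"] = elapsed
    return output
```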
## Examples

Examples have been moved into the `examples/` directory. The paralinguistic examples above cover:

- strict event F1 with timestamped `source_event_annotations` (see the annotation sketch below)
- relaxed event F1 with utterance-level labels
- identity-audio baselines for measuring the evaluator itself without a translation model
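For orientation, the two annotation styles might look like the following. Only the name `source_event_annotations` comes from the toolkit; the field names inside are illustrative, so check `examples/python/paralinguistic_eval.py` for the schema the evaluator actually expects.

```python
# Strict F1: events carry timestamps (field names are assumptions).
source_event_annotations = [
    {"label": "laughter", "start": 1.20, "end": 2.05},
    {"label": "cough",    "start": 4.80, "end": 5.10},
]

# Relaxed F1: utterance-level labels only, no timestamps.
source_event_labels = ["laughter", "cough"]
```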
## Full Evaluation Pipelines

For larger end-to-end evaluation scripts, see `test/`:

- `test/run_full_eval_seamless.py`
- `test/run_full_eval_vallex.py`
- `test/run_full_eval_simulmega.py`
- `test/run_full_eval_cascade.py`
## Input Conventions

Common text inputs (see the sketch after these lists):

- Python `List[str]`
- `.txt` files with one sample per line
- `.json` files

Common audio inputs:

- folder path
- Python `List[str]`
- `.txt` files
- `.json` files
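A sketch of the accepted input shapes; the keyword argument names (`hypotheses`, `references`, `source_audio`, `target_audio`) are assumptions for illustration, not the documented API.

```python
from multimetric_eval import TranslationEvaluator, SpeakerSimilarityEvaluator

# Text inputs: an in-memory list, a .txt file (one sample per line),
# or a .json file are all accepted (parameter names are illustrative).
text_eval = TranslationEvaluator()
text_eval.evaluate(
    hypotheses=["first output", "second output"],  # Python List[str]
    references="refs.txt",                         # path to a .txt file
)

# Audio inputs can additionally point at a folder of audio files:
speaker_eval = SpeakerSimilarityEvaluator()
speaker_eval.evaluate(
    source_audio="data/source_wavs/",  # folder path
    target_audio="data/target_wavs/",  # folder path
)
```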
## Notes

- For `zh`/`ja`/`ko`, the toolkit uses CJK-aware handling for text-side evaluation.
- `SpeechQualityEvaluator` returns `CER_Consistency` for `zh`/`ja`/`ko`, and `WER_Consistency` for most other languages.
- `ParalinguisticEvaluator` reports `Paralinguistic_Fidelity_Cosine` through CLAP and can also report discrete event preservation with `Discrete_Acoustic_Event_F1_Strict` and `Discrete_Acoustic_Event_F1_Relaxed`. `Discrete_Acoustic_Event_F1_Strict` requires timestamps on both source and target annotations; `Discrete_Acoustic_Event_F1_Relaxed` works with utterance-level labels.
- If the source side has only utterance-level labels and no target-side annotations are provided, the evaluator falls back to `clap_label_matching`. In that branch only the relaxed metric is produced, and detector checkpoints are not used.
- The built-in detector loads any `transformers` `AutoModelForAudioClassification` checkpoint that exposes `id2label`. Users can pass a local path or Hugging Face repo id through `beats_model_path` or `detector_model_path`; otherwise the evaluator tries BEATs-compatible defaults (see the configuration sketch after these notes).
- `allowed_labels` restricts both detector outputs and CLAP candidate labels.
- For discrete event F1, source-side event labels are expected to be canonical. `event_label_mapping` is applied to target-side predicted labels so users can adapt different datasets or label ontologies.
- Samples with no events or labels on both sides contribute zero counts to the aggregate instead of being treated as a special case.
- In S2S latency evaluation, alignment prefers the model's native transcript when available. If the model is audio-only, the evaluator can optionally use an ASR fallback to prepare alignment text.
- For S2S forced alignment, pass language-appropriate MFA models through `alignment_acoustic_model` and `alignment_dictionary_model`; the defaults are English.
- Some modules rely on optional dependencies or local model paths in offline environments.
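A configuration sketch pulling the detector-related options above together. The parameter names `detector_model_path`, `allowed_labels`, and `event_label_mapping` come from the notes; the constructor shape and the checkpoint id are assumptions.

```python
from multimetric_eval import ParalinguisticEvaluator

# Constructor shape is illustrative; the keyword names below are the
# ones documented in the notes above.
evaluator = ParalinguisticEvaluator(
    detector_model_path="my-org/beats-audio-classifier",  # hypothetical repo id
    allowed_labels=["laughter", "cough", "applause"],
    # Map a dataset's label ontology onto the canonical source labels:
    event_label_mapping={"laugh": "laughter", "clapping": "applause"},
)
```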
## License
MIT License