CASE Benchmark
Carrier-Agnostic Speaker Verification Evaluation
Why this exists: I wanted to build a system that indexes spoken conversations, automatically identifying speakers across Discord, phone calls, and in-person meetings. But I hit a wall: the same person produces different embeddings depending on the channel the audio passed through. Current models degrade by up to 19× on real-world audio. Humans don't have this problem; we recognize voices regardless of medium. This benchmark measures that gap. Read the full story →
Can Your Model Handle Real-World Audio?
State-of-the-art speaker verification models achieve <1% EER on clean benchmarks. But what happens when audio passes through a phone codec or a cheap webcam, or is replayed through a speaker and re-recorded?
The CASE Benchmark answers this question—and the results are eye-opening.
The Problem
| Condition | Typical SOTA Performance |
|---|---|
| Clean Audio | 0.6-1.7% EER ✅ |
| Phone Codec | 2-4% EER |
| Laptop Microphone | 0.6-1.8% EER |
| Room Reverb | 5-8% EER |
| Playback Chain | 9-13% EER ❌ |
That's up to 19× worse performance under realistic conditions.
What is a "Playback Chain"?
The hardest scenario: audio is encoded, played through a speaker, and re-recorded (a rough simulation sketch follows the list below):
Voice → [Codec] → [Speaker] → [Room Acoustics] → [Microphone] → Recording
This happens when:
- Voice messages played back and re-recorded
- Conference calls with speaker playback
- Smart speaker interactions
- Any voice replay attack scenario
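To make the chain concrete, here is a rough, illustrative simulation in Python. It is not the benchmark's actual pipeline: the band-pass filter stands in for a codec, a high-pass filter for a small speaker driver, and a synthetic decaying impulse response for the room. The real benchmark uses actual codecs and measured RIRs.

import numpy as np
from scipy.signal import butter, lfilter, fftconvolve

def simulate_playback_chain(voice: np.ndarray, sr: int = 16000) -> np.ndarray:
    rng = np.random.default_rng(0)
    # [Codec] stand-in: telephone-style band-limit (300-3400 Hz)
    b, a = butter(4, [300 / (sr / 2), 3400 / (sr / 2)], btype="bandpass")
    x = lfilter(b, a, voice)
    # [Speaker] stand-in: small drivers reproduce little energy below ~150 Hz
    b, a = butter(2, 150 / (sr / 2), btype="highpass")
    x = lfilter(b, a, x)
    # [Room Acoustics]: convolve with a synthetic, exponentially decaying RIR
    n = int(0.3 * sr)  # ~300 ms reverb tail
    rir = rng.standard_normal(n) * np.exp(-np.arange(n) / (0.05 * sr))
    x = fftconvolve(x, rir)[: len(x)]
    # [Microphone] stand-in: small noise floor from the re-recording device
    x = x + 1e-3 * rng.standard_normal(len(x))
    return x / np.max(np.abs(x))  # peak-normalize to avoid clipping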
Quick Start
Installation
pip install case-benchmark
# Install with model support
pip install case-benchmark[speechbrain] # SpeechBrain ECAPA-TDNN
pip install case-benchmark[all-models] # All supported models
Download Benchmark Data
case-benchmark download --output-dir ./benchmark_data
Evaluate Your Model
# Using built-in model wrappers
case-benchmark evaluate \
--model speechbrain \
--benchmark-dir ./benchmark_data \
--output-dir ./results
# Or programmatically
from case_benchmark import CASEBenchmark, load_model
benchmark = CASEBenchmark("./benchmark_data")
model = load_model("speechbrain")
results = benchmark.evaluate(model)
results.print_summary()
# Clean EER: 0.56%, Degradation: +2.49%
Benchmark Results
Leaderboard
| Rank | Model | Absolute EER | Degradation | Clean EER |
|---|---|---|---|---|
| 1 | WeSpeaker ResNet34 | 3.01% | +2.43% | 0.58% |
| 2 | SpeechBrain ECAPA-TDNN | 3.05% | +2.49% | 0.56% |
| 3 | CASE HF v2-512 | 3.53% | +2.31% | 1.22% |
| 4 | NeMo TitaNet-L | 4.05% | +3.39% | 0.66% |
| 5 | pyannote Embedding | 4.47% | +2.79% | 1.68% |
| 6 | Resemblyzer | 10.49% | +5.65% | 4.84% |
Key Finding: The CASE HF model achieves the lowest degradation factor (+2.31%), validating its carrier-agnostic design.
Context: VoxCeleb1-O SOTA
For reference, current SOTA on VoxCeleb1-O (clean-clean only):
- ResNet293 + VoxBlink2: 0.17% EER (arXiv:2407.11510)
- ERes2NetV2: 0.61% EER (3D-Speaker)
Our benchmark targets production-ready, easily accessible models. The SOTA systems listed above rely on specialized training data (VoxBlink2, 100K+ speakers) and post-processing (AS-Norm, QMF) that are rarely used in deployment.
Category Breakdown (WeSpeaker ResNet34)
| Category | Avg EER | vs Clean |
|---|---|---|
| Clean | 0.58% | baseline |
| Codec | 1.73% | +1.15% |
| Mic | 0.59% | +0.01% |
| Noise | 0.73% | +0.15% |
| Reverb | 5.88% | +5.30% |
| Playback | 8.57% | +7.99% |
Key Insight: Playback Chains Remain Challenging
All models show significant degradation on playback scenarios (codec→speaker→room→mic chains), though carrier-aware training reduces this gap substantially.
Evaluation Protocols
The benchmark includes 24 protocols across 6 categories:
| Category | Protocols | Description |
|---|---|---|
| Clean | 1 | Baseline (clean vs clean) |
| Codec | 7 | GSM, G.711, Opus, MP3 |
| Mic | 7 | Webcam, laptop, phone, headset |
| Noise | 5 | SNR 5-25 dB |
| Reverb | 1 | Simulated room acoustics |
| Playback | 3 | Full codec→speaker→room→mic chain |
Each protocol has 10,000 trials (5,000 target + 5,000 impostor).
Metrics
Two metrics together describe a model's carrier robustness:
1. Clean EER (Baseline)
Clean EER = EER on clean_clean protocol
- Measures baseline performance under ideal conditions
- Lower is better (e.g., 0.58% is excellent)
2. Degradation Factor (Robustness)
Degradation = Absolute EER − Clean EER
- Measures robustness: how much performance is lost due to carrier effects
- Lower is better (e.g., +2.31% means minimal degradation)
- Independent of baseline—directly measures carrier susceptibility
A model with low Clean EER and low Degradation is ideal. Some models (like CASE HF) trade baseline performance for better robustness.
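For concreteness, here is a minimal sketch of both metrics. The EER routine is a standard implementation, not necessarily the package's own; the aggregation assumes Absolute EER is the unweighted mean of the per-category average EERs, which reproduces the WeSpeaker leaderboard row above.

import numpy as np

def compute_eer(target_scores: np.ndarray, impostor_scores: np.ndarray) -> float:
    # EER: the operating point where false-accept rate equals false-reject rate.
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])    # targets rejected
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # impostors accepted
    i = int(np.argmin(np.abs(far - frr)))
    return 100.0 * (far[i] + frr[i]) / 2  # as a percentage

# Per-category average EERs for WeSpeaker ResNet34 (table above), in %:
category_eer = {"clean": 0.58, "codec": 1.73, "mic": 0.59,
                "noise": 0.73, "reverb": 5.88, "playback": 8.57}
absolute_eer = sum(category_eer.values()) / len(category_eer)  # ≈ 3.01
degradation = absolute_eer - category_eer["clean"]             # ≈ +2.43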
Note: An earlier "CASE-Score v1" metric used normalized ratios (EER_degraded / EER_clean), but this can misleadingly reward models with poor baselines: a model going from 4% clean to 8% degraded scores a 2× ratio, while a model going from 0.5% to 3% scores 6×, even though the second is better under every condition. See Metrics for full details.
Supported Models
Built-in wrappers for popular models:
| Model | Install | Status |
|---|---|---|
| SpeechBrain ECAPA-TDNN | pip install case-benchmark[speechbrain] | ✅ Supported |
| WeSpeaker ResNet34/CAM++ | pip install case-benchmark[wespeaker] | ✅ Supported |
| pyannote embedding | pip install case-benchmark[pyannote] | ✅ Supported |
| NVIDIA NeMo TitaNet | pip install case-benchmark[nemo] | ✅ Supported |
| Resemblyzer | pip install case-benchmark[resemblyzer] | ✅ Supported |
| CASE HF v2-512 | pip install case-benchmark[case-hf] | ✅ Supported |
Custom Models
Implement the EmbeddingModel interface:
from case_benchmark.models.base import EmbeddingModel
import numpy as np
from pathlib import Path

class MyModel(EmbeddingModel):
    def load(self, device: str = "cpu") -> None:
        # Load your model's weights onto the requested device.
        # load_my_model is a placeholder for your own loader.
        self.model = load_my_model(device)
        self._loaded = True

    def extract_embedding(self, audio_path: Path) -> np.ndarray:
        # Return a 1-D speaker embedding for a single utterance.
        # load_audio is a placeholder for your own audio reader.
        audio = load_audio(audio_path)
        return self.model.encode(audio).numpy()

    @property
    def embedding_dim(self) -> int:
        return 192  # dimensionality of the embeddings above

    @property
    def name(self) -> str:
        return "My Custom Model"
Data
Source
- VoxCeleb1-O: 40 speakers, ~400 utterances (official test set)
- LibriSpeech test-clean: 40 speakers, ~392 utterances
- Total: 80 speakers across both datasets
- Sample rate: 16kHz mono
Degradations Applied
- Codecs: GSM, G.711 (μ-law, A-law), Opus (6k/12k/24k), MP3
- Microphones: Simulated FIR filters for webcam, laptop, phone, etc.
- Noise: DEMAND corpus at various SNR levels (mixing sketch below)
- Reverb: Real RIRs from OpenSLR-28 + BUT ReverbDB
- Playback: Full codec→speaker→room→mic chain
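As a sketch of how SNR-controlled noise mixing typically works (the benchmark's exact routine may differ), the noise is scaled so the speech-to-noise power ratio hits the target:

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it.
    noise = np.resize(noise, speech.shape)  # loop or trim the noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12   # avoid divide-by-zero on silence
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise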
Avoiding Data Leakage
Important: The benchmark uses different data sources than typical training pipelines to ensure proper train/eval separation.
| Component | Benchmark Source | Recommended for Training |
|---|---|---|
| Noise | DEMAND | MUSAN ✓ |
| Reverb | OpenSLR-28 + BUT ReverbDB (real RIRs) | pyroomacoustics or OpenSLR-26 (simulated) ✓ |
If you train with MUSAN noise and pyroomacoustics/OpenSLR-26 RIRs, your training data is properly separated from the benchmark. See docs/methodology.md for details.
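If you need simulated RIRs for training-side augmentation, a minimal pyroomacoustics sketch looks like this (the room geometry, absorption, and positions here are arbitrary placeholders):

import pyroomacoustics as pra

# Build a simple shoebox room and compute the source-to-mic impulse response.
room = pra.ShoeBox(
    [5.0, 4.0, 3.0],              # room dimensions in meters (placeholder)
    fs=16000,
    materials=pra.Material(0.3),  # uniform energy absorption (placeholder)
    max_order=10,                 # image-source reflection order
)
room.add_source([1.0, 1.0, 1.5])      # talker position
room.add_microphone([3.0, 2.0, 1.2])  # microphone position
room.compute_rir()
rir = room.rir[0][0]  # impulse response: mic 0, source 0
# Convolve training speech with `rir` to simulate the room.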
Download
# From HuggingFace
case-benchmark download --output-dir ./benchmark_data
# Or using huggingface_hub directly
from huggingface_hub import snapshot_download
snapshot_download("bigstorm/case-benchmark", local_dir="./benchmark_data")
Documentation
- Why This Exists - The problem this benchmark is trying to solve
- Methodology - Benchmark design and technical approach
- Protocols - Detailed protocol descriptions
- Metrics - EER, Degradation Factor, and how to compare models
- Findings - Key results and analysis
- Submission Guide - How to submit to leaderboard
Citation
If you use the CASE Benchmark in your research, please cite:
@misc{gitter2026case,
  title={CASE Benchmark: Carrier-Agnostic Speaker Verification Evaluation},
  author={Gitter, Ben},
  year={2026},
  howpublished={\url{https://github.com/gittb/case-benchmark}}
}
License
- Code: MIT License
- Data: CC BY-NC 4.0 (non-commercial research only; contact Ben Gitter for a commercial license)
The benchmark audio is derived from VoxCeleb and LibriSpeech, which have their own license terms.
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
- Report issues on GitHub Issues
- Submit model results via Pull Request
Acknowledgments
- VoxCeleb for source audio data (VoxCeleb1-O test set)
- LibriSpeech for source audio data (test-clean subset)
- DEMAND for noise samples used in the benchmark
- OpenSLR-28 and BUT ReverbDB for real room impulse responses