
CASE Benchmark

Carrier-Agnostic Speaker Verification Evaluation

Badges: MIT License (code) · CC BY-NC 4.0 (data) · Python 3.10+


Why this exists: I wanted to build a system that indexes spoken conversations—automatically identifying speakers across Discord, phone calls, and in-person meetings. But I hit a wall: the same person produces different embeddings depending on how I encountered them. Current models degrade up to 19× on real-world audio. Humans don't have this problem—we recognize voices regardless of medium. This benchmark measures that gap. Read the full story →


Can Your Model Handle Real-World Audio?

State-of-the-art speaker verification models achieve <1% EER on clean benchmarks. But what happens when audio passes through phone codecs, cheap webcams, or is replayed through speakers?

The CASE Benchmark answers this question—and the results are eye-opening.

The Problem

| Condition | Typical SOTA Performance |
|-----------|--------------------------|
| Clean Audio | 0.6-1.7% EER |
| Phone Codec | 2-4% EER |
| Laptop Microphone | 0.6-1.8% EER |
| Room Reverb | 5-8% EER |
| Playback Chain | 9-13% EER |

That's up to 19× worse performance on realistic conditions.

What is a "Playback Chain"?

The hardest scenario: audio encoded, played through a speaker, and re-recorded:

Voice → [Codec] → [Speaker] → [Room Acoustics] → [Microphone] → Recording

This happens in:

  • Voice messages played back and re-recorded
  • Conference calls with speakerphone playback
  • Smart speaker interactions
  • Any voice replay-attack scenario
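To make the chain concrete, here is a minimal, self-contained simulation of one playback pass. All three stages are crude stand-ins of my own choosing (μ-law companding for the codec, convolution with a synthetic RIR for the room, a one-pole low-pass for the mic); the benchmark's actual pipeline uses real codecs and measured impulse responses.

```python
import numpy as np

def simulate_playback_chain(audio, rir, mu=255, levels=256):
    """Crude codec -> room -> mic chain: mu-law quantization,
    RIR convolution, then a one-pole low-pass 'microphone'."""
    x = np.clip(audio, -1.0, 1.0)
    # "Codec": mu-law compress, quantize to `levels` steps, expand.
    comp = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    q = np.round((comp + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
    x = np.sign(q) * ((1 + mu) ** np.abs(q) - 1) / mu
    # "Room": convolve with the impulse response, keep the original length.
    x = np.convolve(x, rir)[: len(audio)]
    # "Mic": one-pole low-pass as a stand-in for a cheap capsule.
    y = np.empty_like(x)
    acc = 0.0
    for i, s in enumerate(x):
        acc = 0.7 * acc + 0.3 * s
        y[i] = acc
    return y

# A 440 Hz tone through a synthetic exponentially decaying "room".
audio = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
rir = np.exp(-np.arange(200) / 40.0)
rir /= rir.sum()
degraded = simulate_playback_chain(audio, rir)
```

Even this toy chain visibly smears and distorts the waveform, which is why embeddings drift so far from the clean recording.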

Quick Start

Installation

pip install case-benchmark

# Install with model support
pip install case-benchmark[speechbrain]  # SpeechBrain ECAPA-TDNN
pip install case-benchmark[all-models]   # All supported models

Download Benchmark Data

case-benchmark download --output-dir ./benchmark_data

Evaluate Your Model

# Using built-in model wrappers
case-benchmark evaluate \
    --model speechbrain \
    --benchmark-dir ./benchmark_data \
    --output-dir ./results

# Or programmatically
from case_benchmark import CASEBenchmark, load_model

benchmark = CASEBenchmark("./benchmark_data")
model = load_model("speechbrain")

results = benchmark.evaluate(model)
results.print_summary()
# Clean EER: 0.56%, Degradation: +2.49%

Benchmark Results

Leaderboard

| Rank | Model | Absolute EER | Degradation | Clean EER |
|------|-------|--------------|-------------|-----------|
| 1 | WeSpeaker ResNet34 | 3.01% | +2.43% | 0.58% |
| 2 | SpeechBrain ECAPA-TDNN | 3.05% | +2.49% | 0.56% |
| 3 | CASE HF v2-512 | 3.53% | +2.31% | 1.22% |
| 4 | NeMo TitaNet-L | 4.05% | +3.39% | 0.66% |
| 5 | pyannote Embedding | 4.47% | +2.79% | 1.68% |
| 6 | Resemblyzer | 10.49% | +5.65% | 4.84% |

Key Finding: The CASE HF model achieves the lowest degradation factor (+2.31%), validating its carrier-agnostic design.

Context: VoxCeleb1-O SOTA

For reference, current SOTA systems on VoxCeleb1-O (clean-clean only) rely on specialized training (the VoxBlink2 dataset, 100K+ speakers) and post-processing (AS-Norm, QMF) not typically used in deployment. This benchmark instead tests production-ready models that are easily accessible.

Category Breakdown (WeSpeaker ResNet34)

| Category | Avg EER | vs Clean |
|----------|---------|----------|
| Clean | 0.58% | baseline |
| Codec | 1.73% | +1.15% |
| Mic | 0.59% | +0.01% |
| Noise | 0.73% | +0.15% |
| Reverb | 5.88% | +5.30% |
| Playback | 8.57% | +7.99% |

Key Insight: Playback Chains Remain Challenging

All models show significant degradation on playback scenarios (codec→speaker→room→mic chains), though carrier-aware training reduces this gap substantially.


Evaluation Protocols

The benchmark includes 24 protocols across 6 categories:

| Category | Protocols | Description |
|----------|-----------|-------------|
| Clean | 1 | Baseline (clean vs clean) |
| Codec | 7 | GSM, G.711, Opus, MP3 |
| Mic | 7 | Webcam, laptop, phone, headset |
| Noise | 5 | SNR 5-25 dB |
| Reverb | 1 | Simulated room acoustics |
| Playback | 3 | Full codec→speaker→room→mic chain |

Each protocol has 10,000 trials (5,000 target + 5,000 impostor).
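Each trial scores a pair of utterance embeddings, typically by cosine similarity. A minimal sketch of trial scoring (the trial-list format and the `score_trials` helper are illustrative, not the package's API; the "embeddings" are random toy vectors):

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two embeddings (higher = more likely same speaker)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_trials(trials, embeddings):
    """trials: (enroll_id, test_id, is_target) triples; embeddings: id -> vector.
    Returns parallel score and label arrays, ready for EER computation."""
    scores = np.array([cosine_score(embeddings[e], embeddings[t]) for e, t, _ in trials])
    labels = np.array([is_target for _, _, is_target in trials], dtype=bool)
    return scores, labels

# Toy data: two "speakers", two utterances each (noisy copies of a base vector).
rng = np.random.default_rng(0)
base = {s: rng.normal(size=64) for s in "ab"}
emb = {f"{s}{i}": base[s] + 0.05 * rng.normal(size=64) for s in "ab" for i in range(2)}
trials = [("a0", "a1", True), ("b0", "b1", True), ("a0", "b0", False), ("a1", "b1", False)]
scores, labels = score_trials(trials, emb)
```

With well-separated speakers, every target score exceeds every impostor score; degradations shrink that margin.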


Metrics

Two metrics together describe a model's carrier robustness:

1. Clean EER (Baseline)

Clean EER = EER on clean_clean protocol
  • Measures baseline performance under ideal conditions
  • Lower is better (e.g., 0.58% is excellent)

2. Degradation Factor (Robustness)

Degradation = Absolute EER − Clean EER

Here "Absolute EER" is the unweighted mean of the six category-average EERs.
  • Measures robustness: how much performance is lost to carrier effects
  • Lower is better (e.g., +2.31% means minimal degradation)
  • Independent of the baseline: directly measures carrier susceptibility

A model with low Clean EER and low Degradation is ideal. Some models (like CASE HF) trade baseline performance for better robustness.

Note: An earlier "CASE-Score v1" metric used normalized ratios (EER_degraded / EER_clean), but this can misleadingly reward models with poor baselines. See Metrics for full details.
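As a sketch, both numbers can be computed from trial scores like so (function names are mine; "Absolute EER" here follows the leaderboard's apparent convention of averaging the six category-average EERs, which reproduces WeSpeaker's +2.43%):

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal Error Rate: the operating point where false-accept and
    false-reject rates meet (approximated on the observed score grid)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])    # false rejects
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false accepts
    i = int(np.argmin(np.abs(frr - far)))
    return float((frr[i] + far[i]) / 2)

def degradation_factor(category_avg_eers, clean_eer):
    """Degradation = Absolute EER - Clean EER, where Absolute EER is the
    unweighted mean of the category-average EERs (percentage points)."""
    absolute = float(np.mean(list(category_avg_eers.values())))
    return absolute - clean_eer

# Well-separated toy scores give EER = 0.
eer = compute_eer(np.array([0.9, 0.8, 0.7]), np.array([0.1, 0.2, 0.3]))

# WeSpeaker's category averages from the breakdown table reproduce its +2.43%.
wespeaker = {"clean": 0.58, "codec": 1.73, "mic": 0.59,
             "noise": 0.73, "reverb": 5.88, "playback": 8.57}
deg = degradation_factor(wespeaker, clean_eer=0.58)
```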


Supported Models

Built-in wrappers for popular models:

| Model | Install | Status |
|-------|---------|--------|
| SpeechBrain ECAPA-TDNN | `pip install case-benchmark[speechbrain]` | ✅ Supported |
| WeSpeaker ResNet34/CAM++ | `pip install case-benchmark[wespeaker]` | ✅ Supported |
| pyannote embedding | `pip install case-benchmark[pyannote]` | ✅ Supported |
| NVIDIA NeMo TitaNet | `pip install case-benchmark[nemo]` | ✅ Supported |
| Resemblyzer | `pip install case-benchmark[resemblyzer]` | ✅ Supported |
| CASE HF v2-512 | `pip install case-benchmark[case-hf]` | ✅ Supported |

Custom Models

Implement the EmbeddingModel interface:

from case_benchmark.models.base import EmbeddingModel
import numpy as np
from pathlib import Path

class MyModel(EmbeddingModel):
    def load(self, device: str = "cpu") -> None:
        self.model = load_my_model(device)  # placeholder: your model-loading code
        self._loaded = True

    def extract_embedding(self, audio_path: Path) -> np.ndarray:
        audio = load_audio(audio_path)  # placeholder: your audio I/O (16 kHz mono)
        return self.model.encode(audio).numpy()

    @property
    def embedding_dim(self) -> int:
        return 192

    @property
    def name(self) -> str:
        return "My Custom Model"

Data

Source

  • VoxCeleb1-O: 40 speakers, ~400 utterances (official test set)
  • LibriSpeech test-clean: 40 speakers, ~392 utterances
  • Total: 80 speakers across both datasets
  • Sample rate: 16kHz mono

Degradations Applied

  • Codecs: GSM, G.711 (μ-law, A-law), Opus (6k/12k/24k), MP3
  • Microphones: Simulated FIR filters for webcam, laptop, phone, etc.
  • Noise: DEMAND corpus at various SNR levels
  • Reverb: Real RIRs from OpenSLR-28 + BUT ReverbDB
  • Playback: Full codec→speaker→room→mic chain
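For the noise category, mixing at a target SNR is straightforward to sketch (a generic recipe of my own, not the benchmark's exact code):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech-power / noise-power hits the target SNR (dB),
    tiling or trimming the noise to the speech length."""
    noise = np.resize(noise, speech.shape)  # np.resize tiles when noise is shorter
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# A 440 Hz tone plus Gaussian "noise" mixed at 10 dB SNR.
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = np.random.default_rng(0).normal(size=8000)
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

Real pipelines draw `noise` from a corpus (here, DEMAND) rather than a random generator, but the power-matching arithmetic is the same.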

Avoiding Data Leakage

Important: The benchmark uses different data sources than typical training pipelines to ensure proper train/eval separation.

| Component | Benchmark Source | Recommended for Training |
|-----------|------------------|--------------------------|
| Noise | DEMAND | MUSAN |
| Reverb | OpenSLR-28 + BUT ReverbDB (real RIRs) | pyroomacoustics or OpenSLR-26 (simulated) |

If you train with MUSAN noise and pyroomacoustics/OpenSLR-26 RIRs, your training data is properly separated from the benchmark. See docs/methodology.md for details.

Download

# From HuggingFace
case-benchmark download --output-dir ./benchmark_data

# Or using huggingface_hub directly
from huggingface_hub import snapshot_download
snapshot_download("bigstorm/case-benchmark", local_dir="./benchmark_data")

Documentation


Citation

If you use the CASE Benchmark in your research, please cite:

@misc{gitter2026case,
  title={CASE Benchmark: Carrier-Agnostic Speaker Verification Evaluation},
  author={Gitter, Ben},
  year={2026},
  howpublished={\url{https://github.com/gittb/case-benchmark}}
}

License

  • Code: MIT License
  • Data: CC BY-NC 4.0 (non-commercial research only; contact Ben Gitter for a commercial license)

The benchmark audio is derived from VoxCeleb and LibriSpeech, which have their own license terms.


Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.


Acknowledgments
