WavLM based diarization with MSDD

WavLMMSDD

This repository combines WavLM, a speech representation model from Microsoft, with MSDD (Multi-Scale Diarization Decoder), a state-of-the-art speaker diarization approach from NVIDIA. By pairing WavLM’s feature extraction with MSDD’s clustering and segmentation, the project identifies multiple speakers in audio streams, especially in challenging, noisy, or overlapping speech scenarios.

In particular, this setup uses the Diarization MSDD Telephonic model (diar_msdd_telephonic in NeMo), making it well-suited for telephony or call center environments where speech overlap and background noise are common. Use this repository as a starting point for projects that demand robust speaker diarization in environments where speech overlap or varied audio conditions are critical factors.

Note: If you would like to contribute to this repository, please read CONTRIBUTING.md first.

Architecture

WavLMMSDD Architecture


Features

Key Capabilities

  • WavLM-Based Embeddings: Leverages WavLM to generate high-quality speech representations, improving speaker identification.
  • Multi-Scale Diarization (MSDD): Applies multi-scale inference for precise speaker segmentation, even with overlapping speech.
  • Scalable Pipeline: Modular design allows easy integration and customization for various diarization tasks or research experiments.

Models

  • WavLM-Base-Plus for Speaker Verification
  • WavLM-Base for Speaker Verification
  • NVIDIA NeMo Diarization MSDD Telephonic

Reports

Benchmark

Below is an example benchmark comparing diarization metrics for two models on the AMI Corpus. Both use the Multi-Scale Diarization Decoder (MSDD), but with different embedding backbones: TitaNet and WavLM.

INFO

  • Experiments were conducted on an NVIDIA GeForce RTX 3060 using CUDA 12.6 (Driver Version 560.35.03).
  • We randomly selected 10 samples from the AMI Corpus (Array1-01.tar.gz), each 60 seconds long.
  • MSDD + TitaNet and MSDD + WavLMBasePlus refer to MSDD with TitaNet and WavLM-Base-Plus, respectively, as the speaker-embedding model.
  • For a detailed Jupyter notebook demonstrating how this benchmark was performed, see:
    > notebook/benchmark.ipynb

| Model | DER | FA | MISS | CER | Duration (s) |
|---|---|---|---|---|---|
| MSDD + TitaNet | 0.9963 | 0.0010 | 0.9946 | 0.0015 | 644 |
| MSDD + WavLMBasePlus | 0.9961 | 0.0010 | 0.9946 | 0.0016 | 18 |

  • DER: Diarization Error Rate
  • FA: False Alarm Rate
  • MISS: Missed Detection Rate
  • CER: Confusion Error Rate
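
For orientation, DER is conventionally defined as the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The sketch below illustrates only that conventional definition. The function name and the example durations are hypothetical, and the benchmark notebook may compute the metric differently (for example, with a collar or special handling of overlapping speech).

```python
def diarization_error_rate(
    false_alarm: float,
    missed: float,
    confusion: float,
    total_reference: float,
) -> float:
    """Conventional DER: total error time divided by reference speech time.

    All arguments are durations in seconds.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


# Hypothetical durations in seconds, not taken from the benchmark above.
der = diarization_error_rate(
    false_alarm=1.5, missed=3.0, confusion=0.5, total_reference=50.0
)
print(f"DER = {der:.2f}")
```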

Installation

The Python Package Index (PyPI)
pip install wavlmmsdd
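
To confirm that the install succeeded, the installed version can be read from the package metadata. This is a minimal sketch using only the standard library. The helper name `installed_version` is an illustrative assumption and is not part of wavlmmsdd.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, else a short notice.
print(installed_version("wavlmmsdd") or "wavlmmsdd is not installed")
```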

Usage

# Standard library imports
from typing import Annotated

# Local imports
from wavlmmsdd.audio.diarization.diarize import Diarizer
from wavlmmsdd.audio.feature.embedding import WavLMSV
from wavlmmsdd.audio.preprocess.resample import Resample
from wavlmmsdd.audio.preprocess.convert import Convert
from wavlmmsdd.audio.utils.utils import Build

def main() -> Annotated[None, "No return value"]:
    """
    Demonstrate the audio processing workflow from a WAV file
    to a diarization result.

    This function performs the following steps:
    1. Resamples the audio to 16 kHz.
    2. Converts the audio to mono.
    3. Builds a manifest file.
    4. Obtains embeddings.
    5. Runs diarization.

    Returns
    -------
    None

    Examples
    --------
    >>> main()
    No direct output is produced, but the specified audio file is
    processed and the results are saved or printed as logs.
    """
    
    # Audio Path
    audio_path = "audio.wav"

    # Resample to 16 kHz
    resampler = Resample(audio_file=audio_path)
    wave_16k, sr_16k = resampler.to_16k()

    # Convert to Mono
    converter = Convert(waveform=wave_16k, sample_rate=sr_16k)
    converter.to_mono()
    saved_path = converter.save()

    # Build Manifest File
    builder = Build(saved_path)
    manifest_path = builder.manifest()

    # Embedding
    embedder = WavLMSV()

    # Diarization
    diarizer = Diarizer(embedding=embedder, manifest_path=manifest_path)
    diarizer.run()

if __name__ == "__main__":
    main()
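
Step 3 builds a manifest because NeMo's diarization pipeline reads its inputs from a JSON-lines manifest with one entry per audio file. The sketch below writes such an entry by hand, using the field names described in NeMo's speaker diarization documentation. It illustrates the format under that assumption and is not the implementation of `Build.manifest()`. The helper name `write_manifest` and its default output path are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def write_manifest(
    audio_path: str,
    manifest_path: str = "manifest.json",
    num_speakers: Optional[int] = None,
) -> Path:
    """Write a one-entry JSON-lines manifest and return its path."""
    entry = {
        "audio_filepath": str(Path(audio_path).resolve()),
        "offset": 0,
        "duration": None,              # None means "use the whole file"
        "label": "infer",
        "text": "-",
        "num_speakers": num_speakers,  # None lets the pipeline estimate it
        "rttm_filepath": None,         # no reference annotation at inference
        "uem_filepath": None,
    }
    out = Path(manifest_path)
    out.write_text(json.dumps(entry) + "\n", encoding="utf-8")
    return out


if __name__ == "__main__":
    # Write to a temporary directory so the demo leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_manifest("audio.wav", str(Path(tmp) / "manifest.json"))
        print(path.read_text(encoding="utf-8"))
```

Writing one JSON object per line keeps the file appendable, so a batch of recordings can share a single manifest.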

File Structure

.
├── .data
│   └── example
│       └── ae.wav
├── .docs
│   ├── documentation
│   │   ├── CONTRIBUTING.md
│   │   └── RESOURCES.md
│   └── img
│       └── architecture
│           ├── WavLMMSDDArchitecture.drawio
│           └── WavLMMSDDArchitecture.gif
├── environment.yaml
├── .github
│   ├── CODEOWNERS
│   └── workflows
│       └── pypi.yaml
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── notebook
│   └── benchmark.ipynb
├── pyproject.toml
├── README.md
├── requirements.txt
└── src
    └── wavlmmsdd
        ├── audio
        │   ├── config
        │   │   ├── config.yaml
        │   │   ├── diar_infer_telephonic.yaml
        │   │   └── schema.py
        │   ├── diarization
        │   │   └── diarize.py
        │   ├── feature
        │   │   └── embedding.py
        │   ├── preprocess
        │   │   ├── convert.py
        │   │   └── resample.py
        │   └── utils
        │       └── utils.py
        └── main.py

18 directories, 24 files

Version Control System

Releases
Branches

Upcoming

  • WavLM Large: Integrate the WavLM Large model.

Documentation


Licence


Links


Team


Contact


Citation

@software{WavLMMSDD,
  author       = {Bunyamin Ergen},
  title        = {{WavLMMSDD}},
  year         = {2025},
  month        = {02},
  url          = {https://github.com/bunyaminergen/WavLMMSDD},
  version      = {v0.1.0},
}
