WavLM-based diarization with MSDD

WavLMMSDD

This repository combines WavLM, a speech representation model from Microsoft, with MSDD (Multi-Scale Diarization Decoder), a speaker diarization model from NVIDIA. WavLM supplies the speaker embeddings and MSDD handles clustering and segmentation, with the goal of identifying multiple speakers accurately in audio streams, especially in noisy or overlapping speech.

The setup uses NeMo's Diarization MSDD Telephonic model (diar_msdd_telephonic), which targets telephony and call center audio where overlapping speech and background noise are common. Use this repository as a starting point for diarization projects that must handle such conditions.

Note: If you would like to contribute to this repository, please read CONTRIBUTING.md first.

Architecture
Features
Reports
Installation
Usage
File Structure
Version Control System
Upcoming
Documentations
License
Links
Team
Contact
Citation

Architecture

WavLMMSDD Architecture

Features

Key Capabilities

WavLM-Based Embeddings: Leverages WavLM to generate high-quality speech representations, improving speaker identification.
Multi-Scale Diarization (MSDD): Applies multi-scale inference for precise speaker segmentation, even with overlapping speech.
Scalable Pipeline: Modular design allows easy integration and customization for various diarization tasks or research experiments.
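
To make the "multi-scale" idea concrete, the sketch below cuts the same audio span into overlapping windows at several window lengths. It is only an illustration under stated assumptions: the helper `segment`, the 3-second duration, and the three (window, shift) pairs are hypothetical and are not taken from this package or its configuration files.

```python
def segment(duration, window, shift):
    """Return (start, end) pairs covering `duration` seconds with a sliding window."""
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        segments.append((round(start, 3), round(end, 3)))
        if end >= duration:
            # The last window reached the end of the audio.
            break
        start += shift
    return segments


# Hypothetical (window, shift) pairs in seconds, chosen for illustration only.
scales = [(1.5, 0.75), (1.0, 0.5), (0.5, 0.25)]

# One list of segments per scale: long windows give stable speaker embeddings,
# short windows give finer time resolution.
multi_scale = {window: segment(3.0, window, shift) for window, shift in scales}

for window, segments in multi_scale.items():
    print(f"window={window}s -> {len(segments)} segments")
```

A multi-scale decoder then weighs the embeddings from all scales when deciding which speaker is active in each short step, rather than relying on a single window length.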

Models

WavLM-Base-Plus for Speaker Verification
WavLM-Base for Speaker Verification
Nvidia NeMo Diarization MSDD Telephonic

Reports

Benchmark

Below is an example benchmark comparing two models on the AMI Corpus. Both use the Multi-Scale Diarization Decoder (MSDD) but with different embedding backbones: TitaNet and WavLM.

INFO

Experiments were conducted on an NVIDIA GeForce RTX 3060 using CUDA 12.6 (Driver Version 560.35.03).

We randomly selected 10 samples from the AMI Corpus (Array1-01.tar.gz), each 60 seconds long.

MSDD + TitaNet and MSDD + WavLMBasePlus denote MSDD with TitaNet and with WavLM-Base-Plus, respectively, as the speaker-embedding model.

For a detailed Jupyter notebook demonstrating how this benchmark was performed, see:
> notebook/benchmark.ipynb

Model	DER	FA	MISS	CER	Duration(sec)
MSDD + TitaNet	0.9963	0.0010	0.9946	0.0015	644
MSDD + WavLMBasePlus	0.9961	0.0010	0.9946	0.0016	18

DER: Diarization Error Rate
FA: False Alarm Rate
MISS: Missed Detection Rate
CER: Confusion Error Rate
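
The table can be read with a little arithmetic. The sketch below copies the reported values, computes the run-time ratio and the DER difference between the two backbones, and includes a helper for the conventional DER definition (error durations divided by total reference speech time). The helper `diarization_error_rate` is a hypothetical illustration; how the benchmark notebook normalizes each rate is not stated here, so it should not be read as the notebook's scoring code.

```python
# Values copied from the benchmark table above.
results = {
    "MSDD + TitaNet": {"DER": 0.9963, "FA": 0.0010, "MISS": 0.9946, "CER": 0.0015, "seconds": 644},
    "MSDD + WavLMBasePlus": {"DER": 0.9961, "FA": 0.0010, "MISS": 0.9946, "CER": 0.0016, "seconds": 18},
}


def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Conventional DER: summed error durations over reference speech duration (all in seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


titanet = results["MSDD + TitaNet"]
wavlm = results["MSDD + WavLMBasePlus"]

# How many times longer the TitaNet run took than the WavLM run.
runtime_ratio = titanet["seconds"] / wavlm["seconds"]

# Positive means the WavLM run had the lower (better) DER.
der_gap = titanet["DER"] - wavlm["DER"]

print(f"Run-time ratio: {runtime_ratio:.1f}x")
print(f"DER difference: {der_gap:.4f}")
```

On these numbers the two backbones are nearly tied on accuracy (a DER difference of 0.0002), while the reported run time differs by a factor of roughly 36.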

Installation

The Python Package Index (PyPI)

pip install wavlmmsdd

Usage

# Standard library imports
from typing import Annotated

# Local imports
from wavlmmsdd.audio.diarization.diarize import Diarizer
from wavlmmsdd.audio.feature.embedding import WavLMSV
from wavlmmsdd.audio.preprocess.resample import Resample
from wavlmmsdd.audio.preprocess.convert import Convert
from wavlmmsdd.audio.utils.utils import Build

def main() -> Annotated[None, "No return value"]:
    """
    Demonstrate the audio processing workflow from a WAV file
    to a diarization result.

    This function performs the following steps:
    1. Resamples the audio to 16 kHz.
    2. Converts the audio to mono.
    3. Builds a manifest file.
    4. Obtains embeddings.
    5. Runs diarization.

    Returns
    -------
    None

    Notes
    -----
    Nothing is returned or printed directly. The specified audio file
    is processed and the results are saved or written to the logs.
    """
    
    # Audio Path
    audio_path = "audio.wav"

    # Resample to 16 kHz
    resampler = Resample(audio_file=audio_path)
    wave_16k, sr_16k = resampler.to_16k()

    # Convert to Mono
    converter = Convert(waveform=wave_16k, sample_rate=sr_16k)
    converter.to_mono()
    saved_path = converter.save()

    # Build Manifest File
    builder = Build(saved_path)
    manifest_path = builder.manifest()

    # Embedding
    embedder = WavLMSV()

    # Diarization
    diarizer = Diarizer(embedding=embedder, manifest_path=manifest_path)
    diarizer.run()

if __name__ == "__main__":
    main()
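
For readers unfamiliar with step 3, a NeMo diarization manifest is a JSON-lines file with one entry per audio file. The sketch below writes such an entry using only the standard library. The field names follow NeMo's documented diarization manifest format, but `build_manifest` is a hypothetical helper, and whether `Build.manifest()` writes exactly these values is an assumption.

```python
import json
from pathlib import Path


def build_manifest(audio_path, manifest_path="manifest.json"):
    """Write a one-line NeMo-style diarization manifest and return its path."""
    entry = {
        "audio_filepath": str(Path(audio_path).resolve()),
        "offset": 0,
        "duration": None,       # None lets the reader use the full file
        "label": "infer",
        "text": "-",
        "num_speakers": None,   # unknown number of speakers
        "rttm_filepath": None,  # no reference annotation for inference
        "uem_filepath": None,
    }
    manifest = Path(manifest_path)
    with manifest.open("w", encoding="utf-8") as handle:
        # One JSON object per line, as manifest readers expect.
        handle.write(json.dumps(entry) + "\n")
    return manifest
```

Calling `build_manifest("audio.wav")` would write `manifest.json` in the working directory; that path plays the same role as `manifest_path` in the usage example above.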

File Structure

.
├── .data
│   └── example
│       └── ae.wav
├── .docs
│   ├── documentation
│   │   ├── CONTRIBUTING.md
│   │   └── RESOURCES.md
│   └── img
│       └── architecture
│           ├── WavLMMSDDArchitecture.drawio
│           └── WavLMMSDDArchitecture.gif
├── environment.yaml
├── .github
│   ├── CODEOWNERS
│   └── workflows
│       └── pypi.yaml
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── notebook
│   └── benchmark.ipynb
├── pyproject.toml
├── README.md
├── requirements.txt
└── src
    └── wavlmmsdd
        ├── audio
        │   ├── config
        │   │   ├── config.yaml
        │   │   ├── diar_infer_telephonic.yaml
        │   │   └── schema.py
        │   ├── diarization
        │   │   └── diarize.py
        │   ├── feature
        │   │   └── embedding.py
        │   ├── preprocess
        │   │   ├── convert.py
        │   │   └── resample.py
        │   └── utils
        │       └── utils.py
        └── main.py

18 directories, 24 files

Version Control System

Releases

v0.1.0 .zip
v0.1.0 .tar.gz

Branches

main
develop

Upcoming

WavLM Large: Integrate the WavLM Large model.

Documentations

Citation

@software{WavLMMSDD,
  author       = {Bunyamin Ergen},
  title        = {{WavLMMSDD}},
  year         = {2025},
  month        = {02},
  url          = {https://github.com/bunyaminergen/WavLMMSDD},
  version      = {v0.1.0},
}

This version

1.0.0

Feb 14, 2025

Download files

Source Distribution

wavlmmsdd-1.0.0.tar.gz (53.8 kB view details)

Uploaded Feb 14, 2025 Source

Built Distribution

wavlmmsdd-1.0.0-py3-none-any.whl (43.8 kB view details)

Uploaded Feb 14, 2025 Python 3

File details

Details for the file wavlmmsdd-1.0.0.tar.gz.

File metadata

Download URL: wavlmmsdd-1.0.0.tar.gz
Upload date: Feb 14, 2025
Size: 53.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for wavlmmsdd-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`7a1d2e358ee98bc0af87251aae0c9247ad725693d8fc68ab39d980ea5b0e85d7`
MD5	`8cfbd068c8657b036a6c15b90ecfe62b`
BLAKE2b-256	`12dd0768ae2f2c416379affe9f75df940366b5080b9ade438686efc60f834da4`

File details

Details for the file wavlmmsdd-1.0.0-py3-none-any.whl.

File metadata

Download URL: wavlmmsdd-1.0.0-py3-none-any.whl
Upload date: Feb 14, 2025
Size: 43.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for wavlmmsdd-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cb5e8d07d7f539afc9e5ccb6f989771579bccb10bb8bb714ef172fb4bbf7384d`
MD5	`5dc8e38d58a3195d5c600f1edd9b2504`
BLAKE2b-256	`4144d34f2c1d142e949a567498ca382eb2198ba81484d8ad4159088ebd32cefc`
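
A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch; `file_sha256` and `matches_sha256` are hypothetical helper names, and the file is assumed to be in the current directory.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case and whitespace."""
    return file_sha256(path) == expected_hex.strip().lower()


# Example call for the source distribution listed above:
# matches_sha256(
#     "wavlmmsdd-1.0.0.tar.gz",
#     "7a1d2e358ee98bc0af87251aae0c9247ad725693d8fc68ab39d980ea5b0e85d7",
# )
```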
