
Project description

Introduction

This package is derived from https://github.com/modelscope/3D-Speaker. The core of that project lives in the speakerlab folder, which is not a standard installable package, so this project packages and distributes speakerlab separately.




3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are accessible on ModelScope. Furthermore, we present a large-scale speech corpus, also called 3D-Speaker-Dataset, to facilitate research into speech representation disentanglement.

Benchmark

The EER results on VoxCeleb, CNCeleb and 3D-Speaker datasets for fully-supervised speaker verification.

| Model | Params | VoxCeleb1-O | CNCeleb | 3D-Speaker |
| --- | --- | --- | --- | --- |
| Res2Net | 4.03 M | 1.56% | 7.96% | 8.03% |
| ResNet34 | 6.34 M | 1.05% | 6.92% | 7.29% |
| ECAPA-TDNN | 20.8 M | 0.86% | 8.01% | 8.87% |
| ERes2Net-base | 6.61 M | 0.84% | 6.69% | 7.21% |
| CAM++ | 7.2 M | 0.65% | 6.78% | 7.75% |
| ERes2NetV2 | 17.8 M | 0.61% | 6.14% | 6.52% |
| ERes2Net-large | 22.46 M | 0.52% | 6.17% | 6.34% |
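
For context, EER (equal error rate) is the operating point at which the false-acceptance rate equals the false-rejection rate over a set of verification trials. A minimal, self-contained sketch of the metric (illustrative only; this is not the toolkit's scoring code):

```python
def eer(scores, labels):
    """Equal error rate: the point where false-acceptance rate (FAR)
    equals false-rejection rate (FRR), approximated here as the minimum
    over thresholds of max(FAR, FRR).
    scores: trial similarity scores; labels: 1 = same speaker, 0 = different."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fa, fr = 0, n_pos          # accept nothing: all positives rejected
    best = 1.0
    # Sweep the threshold downward, accepting one more trial per step.
    for _, lab in sorted(zip(scores, labels), reverse=True):
        if lab == 1:
            fr -= 1            # a genuine trial is now accepted
        else:
            fa += 1            # an impostor trial is now accepted
        best = min(best, max(fa / n_neg, fr / n_pos))
    return best
```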

The DER results on public and internal multi-speaker datasets for speaker diarization.

| Test set | 3D-Speaker | pyannote.audio | DiariZen_WavLM |
| --- | --- | --- | --- |
| Aishell-4 | 10.30% | 12.2% | 11.7% |
| Alimeeting | 19.73% | 24.4% | 17.6% |
| AMI_SDM | 21.76% | 22.4% | 15.4% |
| VoxConverse | 11.75% | 11.3% | 28.39% |
| Meeting-CN_ZH-1 | 18.91% | 22.37% | 32.66% |
| Meeting-CN_ZH-2 | 12.78% | 17.86% | 18% |
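
DER (diarization error rate) is the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by total reference speech time. A toy frame-level sketch (illustrative only; it assumes one speaker per frame and hypothesis labels already mapped onto reference labels, whereas real scoring uses an optimal speaker mapping and a forgiveness collar):

```python
def frame_der(ref, hyp, non_speech=None):
    """Frame-level diarization error rate:
    (missed speech + false alarm + speaker confusion) / reference speech."""
    miss = fa = conf = speech = 0
    for r, h in zip(ref, hyp):
        if r != non_speech:
            speech += 1
            if h == non_speech:
                miss += 1          # speech frame labeled as silence
            elif h != r:
                conf += 1          # speech attributed to the wrong speaker
        elif h != non_speech:
            fa += 1                # silence labeled as speech
    return (miss + fa + conf) / speech
```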

Quickstart

Install 3D-Speaker

```sh
git clone https://github.com/modelscope/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt
```

Running experiments

```sh
# Speaker verification: ERes2NetV2 on the 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2netv2/
bash run.sh
# Speaker verification: CAM++ on the 3D-Speaker dataset
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Speaker verification: ECAPA-TDNN on the 3D-Speaker dataset
cd egs/3dspeaker/sv-ecapa/
bash run.sh
# Self-supervised speaker verification: SDPN on the VoxCeleb dataset
cd egs/voxceleb/sv-sdpn/
bash run.sh
# Audio-only and multimodal speaker diarization
cd egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh
# Language identification
cd egs/3dspeaker/language-idenitfication
bash run.sh
```

Inference using pretrained models from ModelScope

All pretrained models are released on ModelScope.

```sh
# Install modelscope
pip install modelscope
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# ERes2NetV2 trained on 200k labeled speakers
model_id=iic/speech_eres2netv2_sv_zh-cn_16k-common
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id
# Run batch inference
python speakerlab/bin/infer_sv_batch.py --model_id $model_id --wavs $wav_list
```
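
Verification scoring of this kind typically reduces to extracting one fixed-size embedding per utterance and comparing the pair by cosine similarity. A stdlib-only sketch with made-up toy embeddings (illustrative; the toolkit's actual embedding dimensions and scoring may differ):

```python
import math

def cosine_score(emb1, emb2):
    """Cosine similarity between two speaker embeddings:
    a higher score means the utterances more likely share a speaker."""
    dot = sum(a * b for a, b in zip(emb1, emb2))
    norm1 = math.sqrt(sum(a * a for a in emb1))
    norm2 = math.sqrt(sum(b * b for b in emb2))
    return dot / (norm1 * norm2)

# Toy 3-d embeddings (real ones are much higher-dimensional vectors
# produced by the model); a same/different decision thresholds this score.
same = cosine_score([0.2, 0.9, 0.1], [0.25, 0.85, 0.12])
diff = cosine_score([0.2, 0.9, 0.1], [0.9, -0.1, 0.4])
```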

```sh
# SDPN trained on VoxCeleb
model_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k
# Run SDPN inference
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id
```

```sh
# Run diarization inference
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir
# Enable overlap detection
python speakerlab/bin/infer_diarization.py --wav [wav_list OR wav_path] --out_dir $out_dir --include_overlap --hf_access_token $hf_access_token
```
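
A diarization result is a set of speaker-attributed time segments. As an illustration of the final post-processing step, here is a sketch (not toolkit code; the frame shift and label format are assumptions) that collapses per-frame speaker labels into (start, end, speaker) segments, the shape of a typical RTTM line:

```python
def frames_to_segments(labels, frame_shift=0.01, non_speech=None):
    """Collapse per-frame speaker labels into (start_sec, end_sec, speaker)
    segments; frame_shift is the frame duration in seconds."""
    segments = []
    start, cur = 0, non_speech
    for i, lab in enumerate(labels):
        if lab != cur:
            if cur != non_speech:   # close the running segment
                segments.append((round(start * frame_shift, 3),
                                 round(i * frame_shift, 3), cur))
            start, cur = i, lab
    if cur != non_speech:           # flush the final segment
        segments.append((round(start * frame_shift, 3),
                         round(len(labels) * frame_shift, 3), cur))
    return segments
```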

Overview of Content

What's new :fire:

Contact

If you have any comments or questions about 3D-Speaker, please contact us by:

  • email: {yfchen97, wanghuii}@mail.ustc.edu.cn, {dengchong.d, zsq174630, shuli.cly}@alibaba-inc.com

License

3D-Speaker is released under the Apache License 2.0.

Acknowledgements

3D-Speaker contains third-party components and code modified from several open-source repositories, including:
Speechbrain, Wespeaker, D-TDNN, DINO, Vicreg, TalkNet-ASD, Ultra-Light-Fast-Generic-Face-Detector-1MB, and pyannote.audio.

Citations

If you find this repository useful, please consider giving a star :star: and citation :t-rex::

@inproceedings{chen20243d,
  title={3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  booktitle={ICASSP},
  year={2025}
}

Built Distribution

speakerlab-0.0.3-py3-none-any.whl (123.5 kB, Python 3; uploaded via twine/6.1.0 on CPython/3.12.11; Trusted Publishing: no). No source distribution is available for this release.

Hashes for speakerlab-0.0.3-py3-none-any.whl:

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 19968d8d695bbcbb7eb7d8c3b3877992756d543369bd4c3525e6ba4ba695eaa9 |
| MD5 | ccc39f7937c7f8304ce70bdb66236bee |
| BLAKE2b-256 | 2c31999cd9919a0cb372a6e39d0c9a6acfdb944b8dc2331b29c36d9e83ff6226 |