A unified inference platform for speech processing tasks
ClearVoice
👉🏻HuggingFace Space Demo👈🏻 | 👉🏻ModelScope Space Demo👈🏻
1. Introduction
ClearVoice offers a unified inference platform for speech enhancement, speech separation, speech super-resolution, and audio-visual target speaker extraction. It is designed to simplify the adoption of our pre-trained models for your speech processing purposes or for integration into your projects. Currently, we provide the following pre-trained models:
| Tasks (Sampling rate) | Models (HuggingFace Links) |
|---|---|
| Speech Enhancement (16kHz & 48kHz) | MossFormer2_SE_48K (HuggingFace), FRCRN_SE_16K (HuggingFace), MossFormerGAN_SE_16K (HuggingFace) |
| Speech Separation (16kHz) | MossFormer2_SS_16K (HuggingFace) |
| Speech Super-Resolution (48kHz) | MossFormer2_SR_48K (HuggingFace) |
| Audio-Visual Target Speaker Extraction (16kHz) | AV_MossFormer2_TSE_16K (HuggingFace) |
You don't need to manually download the pre-trained models—they are automatically fetched from HuggingFace during inference. If the models are not downloaded successfully to ./clearvoice/checkpoints, you can manually download them from ModelScope.
2. Usage
Install via PyPI
- Install ClearVoice via PyPI:

  ```bash
  pip install clearvoice
  ```

- In your Python code:

  ```python
  from clearvoice import ClearVoice
  ```
Install FFmpeg (optional)
ClearVoice relies on FFmpeg for audio format conversion. If you're only working with .wav files, FFmpeg is not required. For all other formats, please follow the instructions below to install FFmpeg.
- On Ubuntu/Debian:

  ```bash
  sudo apt update
  sudo apt install ffmpeg
  ```

- On macOS (with Homebrew):

  ```bash
  brew install ffmpeg
  ```

- On Windows: Download a static build from https://ffmpeg.org/download.html, unzip it, and add the bin folder (which contains ffprobe.exe) to your System PATH.
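Before processing non-.wav files, it can save a confusing failure later to confirm that FFmpeg is actually visible on the PATH. A minimal standard-library sketch (the helper name `ffmpeg_available` is ours, not part of ClearVoice):

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg binary is discoverable on the system PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("FFmpeg not found on PATH; only .wav input will work.")
```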
Install from GitHub
- Clone the GitHub repository and install the requirements:

  ```bash
  git clone https://github.com/modelscope/ClearerVoice-Studio.git
  cd ClearerVoice-Studio/clearvoice
  pip install --editable .
  ```

- In your Python code:

  ```python
  from clearvoice import ClearVoice
  ```

- Run a demo script:

  ```bash
  cd ClearerVoice-Studio/clearvoice
  python demo.py
  ```

  or

  ```bash
  cd ClearerVoice-Studio/clearvoice
  python demo_with_more_comments.py
  ```

  or

  ```bash
  cd ClearerVoice-Studio/clearvoice
  python demo_Numpy2Numpy.py
  ```
- You may activate each demo case by setting it to True in demo.py, demo_with_more_comments.py, and demo_Numpy2Numpy.py.
- In demo_Numpy2Numpy.py, we added a new interface for ClearVoice that supports NumPy input to NumPy output instead of file I/O.
- Supported audio formats: .flac, .wav
- Supported video formats: .avi, .mp4, .mov, .webm
Sample Python Script
Use the MossFormer2_SE_48K model for the fullband (48 kHz) speech enhancement task:

```python
from clearvoice import ClearVoice

myClearVoice = ClearVoice(task='speech_enhancement', model_names=['MossFormer2_SE_48K'])

# Process a single wave file
output_wav = myClearVoice(input_path='samples/input.wav', online_write=False)
myClearVoice.write(output_wav, output_path='samples/output_MossFormer2_SE_48K.wav')

# Process a directory of wave files
myClearVoice(input_path='samples/path_to_input_wavs', online_write=True, output_path='samples/path_to_output_wavs')

# Process a wave list file
myClearVoice(input_path='samples/scp/audio_samples.scp', online_write=True, output_path='samples/path_to_output_wavs_scp')
```
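For the list-file input, we assume here that a `.scp` file is plain text with one audio path per line (check the repository's `samples/scp/audio_samples.scp` for the authoritative format). `write_scp` is a hypothetical helper, not part of the ClearVoice API, for generating such a list from a directory:

```python
from pathlib import Path

def write_scp(wav_dir: str, scp_path: str) -> int:
    """Write one absolute audio-file path per line; return the number of files listed."""
    exts = {".wav", ".flac"}  # audio formats listed as supported above
    paths = sorted(p for p in Path(wav_dir).rglob("*") if p.suffix.lower() in exts)
    with open(scp_path, "w") as f:
        f.writelines(f"{p.resolve()}\n" for p in paths)
    return len(paths)
```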
Parameter Description:
- `task`: Choose one of the three tasks `speech_enhancement`, `speech_separation`, and `target_speaker_extraction`
- `model_names`: List of model names; choose one or more models for the task
- `input_path`: Path to the input audio/video file, an input audio/video directory, or a list file (`.scp`)
- `online_write`: Set to `True` to save the enhanced/separated audio/video directly to local files during processing; otherwise, the enhanced/separated audio is returned. (Only supports `False` for `speech_enhancement` and `speech_separation` when processing a single wave file)
- `output_path`: Path to a file or a directory to save the enhanced/separated audio/video file
A detailed tutorial in Chinese is available here: https://stable-learn.com/zh/clearvoice-studio-tutorial
3. Model Performance
Speech enhancement models:
We evaluated our released speech enhancement models on two popular benchmarks: the VoiceBank+DEMAND testset (16kHz & 48kHz) and the DNS-Challenge-2020 (Interspeech) testset (non-reverb, 16kHz). Unlike most published papers, which tailor each model to each test set, our evaluation uses unified models on both test sets. The evaluation metrics are generated by SpeechScore.
VoiceBank+DEMAND testset (tested on 16kHz)
| Model | PESQ | NB_PESQ | CBAK | COVL | CSIG | STOI | SISDR | SNR | SRMR | SSNR | P808_MOS | SIG | BAK | OVRL | ISR | SAR | SDR | FWSEGSNR | LLR | LSD | MCD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | 1.97 | 3.32 | 2.79 | 2.70 | 3.32 | 0.92 | 8.44 | 9.35 | 7.81 | 6.13 | 3.05 | 3.37 | 3.32 | 2.79 | 28.11 | 8.53 | 8.44 | 14.77 | 0.78 | 1.40 | 4.15 |
| FRCRN_SE_16K | 3.23 | 3.86 | 3.47 | 3.83 | 4.29 | 0.95 | 19.22 | 19.16 | 9.21 | 7.60 | 3.59 | 3.46 | 4.11 | 3.20 | 12.66 | 21.16 | 11.71 | 20.76 | 0.37 | 0.98 | 0.56 |
| MossFormerGAN_SE_16K | 3.47 | 3.96 | 3.50 | 3.73 | 4.40 | 0.96 | 19.45 | 19.36 | 9.07 | 9.09 | 3.57 | 3.50 | 4.09 | 3.23 | 25.98 | 21.18 | 19.42 | 20.20 | 0.34 | 0.79 | 0.70 |
| MossFormer2_SE_48K | 3.16 | 3.77 | 3.32 | 3.58 | 4.14 | 0.95 | 19.38 | 19.22 | 9.61 | 6.86 | 3.53 | 3.50 | 4.07 | 3.22 | 12.05 | 21.84 | 11.47 | 16.69 | 0.57 | 1.72 | 0.62 |
DNS-Challenge-2020 testset (tested on 16kHz)
| Model | PESQ | NB_PESQ | CBAK | COVL | CSIG | STOI | SISDR | SNR | SRMR | SSNR | P808_MOS | SIG | BAK | OVRL | ISR | SAR | SDR | FWSEGSNR | LLR | LSD | MCD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | 1.58 | 2.16 | 2.66 | 2.06 | 2.72 | 0.91 | 9.07 | 9.95 | 6.13 | 9.35 | 3.15 | 3.39 | 2.61 | 2.48 | 34.57 | 9.09 | 9.06 | 15.87 | 1.07 | 1.88 | 6.42 |
| FRCRN_SE_16K | 3.24 | 3.66 | 3.76 | 3.63 | 4.31 | 0.98 | 19.99 | 19.89 | 8.77 | 7.60 | 4.03 | 3.58 | 4.15 | 3.33 | 8.90 | 20.14 | 7.93 | 22.59 | 0.50 | 1.69 | 0.97 |
| MossFormerGAN_SE_16K | 3.57 | 3.88 | 3.93 | 3.92 | 4.56 | 0.98 | 20.60 | 20.44 | 8.68 | 14.03 | 4.05 | 3.58 | 4.18 | 3.36 | 8.88 | 20.81 | 7.98 | 21.62 | 0.45 | 1.65 | 0.89 |
| MossFormer2_SE_48K | 2.94 | 3.45 | 3.36 | 2.94 | 3.47 | 0.97 | 17.75 | 17.65 | 9.26 | 11.86 | 3.92 | 3.51 | 4.13 | 3.26 | 8.55 | 18.40 | 7.48 | 16.10 | 0.98 | 3.02 | 1.15 |
VoiceBank+DEMAND testset (tested on 48 kHz; we also evaluated other open-source models using SpeechScore)
| Model | PESQ | NB_PESQ | CBAK | COVL | CSIG | STOI | SISDR | SNR | SRMR | SSNR | P808_MOS | SIG | BAK | OVRL | ISR | SAR | SDR | FWSEGSNR | LLR | LSD | MCD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | 1.97 | 2.87 | 2.79 | 2.70 | 3.32 | 0.92 | 8.39 | 9.30 | 7.81 | 6.13 | 3.07 | 3.35 | 3.12 | 2.69 | 33.75 | 8.42 | 8.39 | 13.98 | 0.75 | 1.45 | 5.41 |
| MossFormer2_SE_48K | 3.15 | 3.77 | 3.33 | 3.64 | 4.23 | 0.95 | 19.36 | 19.22 | 9.61 | 7.03 | 3.53 | 3.41 | 4.10 | 3.15 | 4.08 | 21.23 | 4.06 | 14.45 | NA | 1.86 | 0.53 |
| Resemble_enhance | 2.84 | 3.58 | 3.14 | NA | NA | 0.94 | 12.42 | 12.79 | 9.08 | 7.07 | 3.53 | 3.42 | 3.99 | 3.12 | 13.62 | 12.66 | 10.31 | 14.56 | 1.50 | 1.66 | 1.54 |
| DeepFilterNet | 3.03 | 3.71 | 3.29 | 3.55 | 4.20 | 0.94 | 15.71 | 15.66 | 9.66 | 7.19 | 3.47 | 3.40 | 4.00 | 3.10 | 28.01 | 16.20 | 15.79 | 15.69 | 0.55 | 0.94 | 1.77 |
- Resemble_enhance (GitHub) is an open-source 44.1 kHz speech enhancement platform released by Resemble-AI in 2023; we resampled its output to 48 kHz before evaluation.
- DeepFilterNet (GitHub) is a low-complexity speech enhancement framework for full-band audio (48 kHz) based on deep filtering.
Note: We observed anomalies in two speech metrics, LLR and LSD, after processing with the 48 kHz models. We will further investigate the issue to identify the cause.
Speech separation models:
We evaluated our speech separation model MossFormer2_SS_16K on the popular benchmark testsets: LRS2_2Mix (16 kHz), WSJ0-2Mix (8 kHz), Libri2Mix (8 kHz), and WHAM! (8 kHz). We compare our model with the following state-of-the-art models: Conv-TasNet, DualPathRNN, DPTNet, SepFormer, TDANet, TF-GridNet, and SPMamba. The testing results are taken from the TDANet and SPMamba GitHub repos. SI-SNRi (SI-SNR improvement) is used as the performance metric for all evaluations.
| Model | LRS2_2Mix (16 kHz) | WSJ0-2Mix (8 kHz) | Libri2Mix (8kHz) | WHAM! (8 kHz) |
|---|---|---|---|---|
| Conv-TasNet | 10.6 | 15.3 | 12.2 | 12.7 |
| DualPathRNN | 12.7 | 18.8 | 16.1 | 13.7 |
| DPTNet | 13.3 | 20.2 | 16.7 | 14.9 |
| SepFormer | 13.5 | 20.4 | 17.0 | 14.4 |
| TDANet Large | 14.2 | 18.5 | 17.4 | 15.2 |
| TF-GridNet | - | 22.8 | 19.8 | 16.9 |
| SPMamba | - | 22.5 | 19.9 | 17.4 |
| MossFormer2_SS_16K | 15.5 | 22.0 | 16.7 | 17.4 |
Note: The MossFormer2_SS_16K results presented are from our unified model, evaluated without retraining on individual datasets. This 16 kHz model was used for speech separation on the 16 kHz test set, with scores then calculated on the downsampled 8 kHz audio. All comparison models were trained and tested separately on each dataset.
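SI-SNRi, as reported in the table above, is the SI-SNR of the separated signal minus the SI-SNR of the unprocessed mixture, both measured against the clean target. The sketch below implements the standard definition with NumPy (function names are ours; SpeechScore or the cited repos may differ in details such as epsilon handling):

```python
import numpy as np

def si_snr(est, target, eps=1e-8):
    """Scale-invariant SNR in dB: project est onto target, compare to the residual."""
    est = est - np.mean(est)
    target = target - np.mean(target)
    s_target = (np.dot(est, target) / (np.dot(target, target) + eps)) * target
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(est, mix, target):
    """SI-SNRi: gain of the estimate over the raw mixture, both vs. the clean target."""
    return si_snr(est, target) - si_snr(mix, target)
```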
Speech super-resolution model:
We demonstrated the effectiveness of our speech super-resolution model, MossFormer2_SR_48K, using the VoiceBank+DEMAND 48 kHz test set. For the super-resolution evaluation, the test set was downsampled to 16 kHz, 24 kHz, and 32 kHz. The Log Spectral Distance (LSD) and PESQ metrics were used for evaluation. Recognizing that speech quality is impacted by both lower sampling rates and background noise, we also incorporated our speech enhancement model, MossFormer2_SE_48K, to reduce noise prior to super-resolution processing. Results are presented in the following table.
| Model | LSD (16 kHz) | LSD (24 kHz) | LSD (32 kHz) | LSD (48 kHz) | PESQ |
|---|---|---|---|---|---|
| Origin | 2.80 | 2.60 | 2.29 | 1.46 | 1.97 |
| Enhanced | 1.93 | 1.52 | 1.50 | 1.42 | 3.15 |
For the 48 kHz case, speech super-resolution was not applied. The final two columns show that MossFormer2_SE_48K significantly improves the 16 kHz PESQ score but only marginally improves LSD. Therefore, LSD improvements at 16 kHz, 24 kHz, and 32 kHz are primarily attributed to MossFormer2_SR_48K.
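LSD compares log-magnitude spectra frame by frame; exact implementations vary (FFT size, log base, power vs. magnitude spectra), so the sketch below shows one common variant, not necessarily the one used for the table above:

```python
import numpy as np

def lsd(ref, est, n_fft=1024, hop=256, eps=1e-8):
    """Log-spectral distance: RMS over frequency of the log power-spectrum gap, averaged over frames."""
    def power_spec(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2

    s_ref, s_est = power_spec(ref), power_spec(est)
    d = (np.log10(s_ref + eps) - np.log10(s_est + eps)) ** 2
    return float(np.mean(np.sqrt(np.mean(d, axis=1))))
```

Identical signals give an LSD of 0; larger values indicate a bigger spectral mismatch, which is why lower numbers are better in the table.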
File details
Details for the file clearvoice-0.1.2.tar.gz.
File metadata
- Download URL: clearvoice-0.1.2.tar.gz
- Upload date:
- Size: 154.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `810345298445bfbdfaf135ef3862b7ad1c74675ae180f88fe22fbf0c12b370b7` |
| MD5 | `4753aadce635b01299dbc0b65cccbf10` |
| BLAKE2b-256 | `00a17c340ab0501d023f7f19aff7b3ec0c630e741b7a2f9e922a84b1620367d6` |
File details
Details for the file clearvoice-0.1.2-py3-none-any.whl.
File metadata
- Download URL: clearvoice-0.1.2-py3-none-any.whl
- Upload date:
- Size: 183.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e92f300a6b41b5c0d2bb4a0455db100680581eb62f34a67db0ab5bfad09aa88a` |
| MD5 | `7e4fe55664f615cace481aca2bfb8067` |
| BLAKE2b-256 | `bd54bfb9f87e3d1e42aec149bc4216732a2b620a04f5b23753b7b44263bd7304` |