A unified inference platform for speech processing tasks
ClearVoice
👉🏻HuggingFace Space Demo👈🏻 | 👉🏻ModelScope Space Demo👈🏻
1. Introduction
ClearVoice offers a unified inference platform for speech enhancement, speech separation, speech super-resolution, and audio-visual target speaker extraction. It is designed to simplify the adoption of our pre-trained models for your speech processing purposes or for integration into your projects. Currently, we provide the following pre-trained models:
| Tasks (Sampling rate) | Models (HuggingFace Links) |
|---|---|
| Speech Enhancement (16kHz & 48kHz) | MossFormer2_SE_48K (HuggingFace), FRCRN_SE_16K (HuggingFace), MossFormerGAN_SE_16K (HuggingFace) |
| Speech Separation (16kHz) | MossFormer2_SS_16K (HuggingFace) |
| Speech Super-Resolution (48kHz) | MossFormer2_SR_48K (HuggingFace) |
| Audio-Visual Target Speaker Extraction (16kHz) | AV_MossFormer2_TSE_16K (HuggingFace) |
You don't need to manually download the pre-trained models—they are automatically fetched from HuggingFace during inference. If the models are not downloaded successfully to ./clearvoice/checkpoints, you can manually download them from ModelScope.
2. Usage
Install via PyPI
- Install ClearVoice via PyPI:

  ```bash
  pip install clearvoice
  ```

- In your Python code:

  ```python
  from clearvoice import ClearVoice
  ```
Install FFmpeg (optional)
ClearVoice relies on FFmpeg for audio format conversion. If you're only working with .wav files, FFmpeg is not required. For all other formats, please follow the instructions below to install FFmpeg.
- On Ubuntu/Debian:

  ```bash
  sudo apt update
  sudo apt install ffmpeg
  ```

- On macOS (with Homebrew):

  ```bash
  brew install ffmpeg
  ```

- On Windows: Download a static build from https://ffmpeg.org/download.html, unzip it, and add the bin folder (which contains ffprobe.exe) to your System PATH.
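Before processing non-.wav files, it can save a confusing failure later to confirm that FFmpeg is actually visible on the PATH. A minimal standard-library sketch (the helper name `ffmpeg_available` is ours, not part of ClearVoice):

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg binary is discoverable on the system PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("FFmpeg not found on PATH; only .wav input will work.")
```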
Install from GitHub
- Clone the GitHub repository and install the requirements:

  ```bash
  git clone https://github.com/modelscope/ClearerVoice-Studio.git
  cd ClearerVoice-Studio/clearvoice
  pip install --editable .
  ```

- In your Python code:

  ```python
  from clearvoice import ClearVoice
  ```

- Run a demo script:

  ```bash
  cd ClearerVoice-Studio/clearvoice
  python demo.py
  ```

  or

  ```bash
  cd ClearerVoice-Studio/clearvoice
  python demo_with_more_comments.py
  ```

  or

  ```bash
  cd ClearerVoice-Studio/clearvoice
  python demo_Numpy2Numpy.py
  ```
- You may activate each demo case by setting it to True in demo.py, demo_with_more_comments.py, and demo_Numpy2Numpy.py.
- In demo_Numpy2Numpy.py, we added a new interface for ClearVoice that supports NumPy input to NumPy output instead of file I/O.
- Supported audio formats: .flac, .wav
- Supported video formats: .avi, .mp4, .mov, .webm
Sample Python Script
Use the MossFormer2_SE_48K model for the fullband (48 kHz) speech enhancement task:

```python
from clearvoice import ClearVoice

myClearVoice = ClearVoice(task='speech_enhancement', model_names=['MossFormer2_SE_48K'])

# Process a single wave file
output_wav = myClearVoice(input_path='samples/input.wav', online_write=False)
myClearVoice.write(output_wav, output_path='samples/output_MossFormer2_SE_48K.wav')

# Process a directory of wave files
myClearVoice(input_path='samples/path_to_input_wavs', online_write=True, output_path='samples/path_to_output_wavs')

# Process a wave list file
myClearVoice(input_path='samples/scp/audio_samples.scp', online_write=True, output_path='samples/path_to_output_wavs_scp')
```
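For the list-file input, we assume here that a `.scp` file is plain text with one audio path per line (check the repository's `samples/scp/audio_samples.scp` for the authoritative format). `write_scp` is a hypothetical helper, not part of the ClearVoice API, for generating such a list from a directory:

```python
from pathlib import Path

def write_scp(wav_dir: str, scp_path: str) -> int:
    """Write one absolute audio-file path per line; return the number of files listed."""
    exts = {".wav", ".flac"}  # audio formats listed as supported above
    paths = sorted(p for p in Path(wav_dir).rglob("*") if p.suffix.lower() in exts)
    with open(scp_path, "w") as f:
        f.writelines(f"{p.resolve()}\n" for p in paths)
    return len(paths)
```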
Parameter Description:
- `task`: Choose one of the three tasks `speech_enhancement`, `speech_separation`, and `target_speaker_extraction`
- `model_names`: List of model names; choose one or more models for the task
- `input_path`: Path to the input audio/video file, an input audio/video directory, or a list file (`.scp`)
- `online_write`: Set to `True` to save the enhanced/separated audio/video directly to local files during processing; otherwise, the enhanced/separated audio is returned. (Only supports `False` for `speech_enhancement` and `speech_separation` when processing a single wave file)
- `output_path`: Path to a file or a directory to save the enhanced/separated audio/video file
A detailed tutorial in Chinese is available here: https://stable-learn.com/zh/clearvoice-studio-tutorial
3. Model Performance
Speech enhancement models:
We evaluated our released speech enhancement models on two popular benchmarks: the VoiceBank+DEMAND testset (16kHz & 48kHz) and the DNS-Challenge-2020 (Interspeech) testset (non-reverb, 16kHz). Unlike most published papers, which tailor each model to each test set, our evaluation uses unified models on both test sets. The evaluation metrics are generated by SpeechScore.
VoiceBank+DEMAND testset (tested on 16kHz)
| Model | PESQ | NB_PESQ | CBAK | COVL | CSIG | STOI | SISDR | SNR | SRMR | SSNR | P808_MOS | SIG | BAK | OVRL | ISR | SAR | SDR | FWSEGSNR | LLR | LSD | MCD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | 1.97 | 3.32 | 2.79 | 2.70 | 3.32 | 0.92 | 8.44 | 9.35 | 7.81 | 6.13 | 3.05 | 3.37 | 3.32 | 2.79 | 28.11 | 8.53 | 8.44 | 14.77 | 0.78 | 1.40 | 4.15 |
| FRCRN_SE_16K | 3.23 | 3.86 | 3.47 | 3.83 | 4.29 | 0.95 | 19.22 | 19.16 | 9.21 | 7.60 | 3.59 | 3.46 | 4.11 | 3.20 | 12.66 | 21.16 | 11.71 | 20.76 | 0.37 | 0.98 | 0.56 |
| MossFormerGAN_SE_16K | 3.47 | 3.96 | 3.50 | 3.73 | 4.40 | 0.96 | 19.45 | 19.36 | 9.07 | 9.09 | 3.57 | 3.50 | 4.09 | 3.23 | 25.98 | 21.18 | 19.42 | 20.20 | 0.34 | 0.79 | 0.70 |
| MossFormer2_SE_48K | 3.16 | 3.77 | 3.32 | 3.58 | 4.14 | 0.95 | 19.38 | 19.22 | 9.61 | 6.86 | 3.53 | 3.50 | 4.07 | 3.22 | 12.05 | 21.84 | 11.47 | 16.69 | 0.57 | 1.72 | 0.62 |
DNS-Challenge-2020 testset (tested on 16kHz)
| Model | PESQ | NB_PESQ | CBAK | COVL | CSIG | STOI | SISDR | SNR | SRMR | SSNR | P808_MOS | SIG | BAK | OVRL | ISR | SAR | SDR | FWSEGSNR | LLR | LSD | MCD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | 1.58 | 2.16 | 2.66 | 2.06 | 2.72 | 0.91 | 9.07 | 9.95 | 6.13 | 9.35 | 3.15 | 3.39 | 2.61 | 2.48 | 34.57 | 9.09 | 9.06 | 15.87 | 1.07 | 1.88 | 6.42 |
| FRCRN_SE_16K | 3.24 | 3.66 | 3.76 | 3.63 | 4.31 | 0.98 | 19.99 | 19.89 | 8.77 | 7.60 | 4.03 | 3.58 | 4.15 | 3.33 | 8.90 | 20.14 | 7.93 | 22.59 | 0.50 | 1.69 | 0.97 |
| MossFormerGAN_SE_16K | 3.57 | 3.88 | 3.93 | 3.92 | 4.56 | 0.98 | 20.60 | 20.44 | 8.68 | 14.03 | 4.05 | 3.58 | 4.18 | 3.36 | 8.88 | 20.81 | 7.98 | 21.62 | 0.45 | 1.65 | 0.89 |
| MossFormer2_SE_48K | 2.94 | 3.45 | 3.36 | 2.94 | 3.47 | 0.97 | 17.75 | 17.65 | 9.26 | 11.86 | 3.92 | 3.51 | 4.13 | 3.26 | 8.55 | 18.40 | 7.48 | 16.10 | 0.98 | 3.02 | 1.15 |
VoiceBank+DEMAND testset (tested on 48 kHz; we also evaluated other open-source models using SpeechScore)
| Model | PESQ | NB_PESQ | CBAK | COVL | CSIG | STOI | SISDR | SNR | SRMR | SSNR | P808_MOS | SIG | BAK | OVRL | ISR | SAR | SDR | FWSEGSNR | LLR | LSD | MCD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | 1.97 | 2.87 | 2.79 | 2.70 | 3.32 | 0.92 | 8.39 | 9.30 | 7.81 | 6.13 | 3.07 | 3.35 | 3.12 | 2.69 | 33.75 | 8.42 | 8.39 | 13.98 | 0.75 | 1.45 | 5.41 |
| MossFormer2_SE_48K | 3.15 | 3.77 | 3.33 | 3.64 | 4.23 | 0.95 | 19.36 | 19.22 | 9.61 | 7.03 | 3.53 | 3.41 | 4.10 | 3.15 | 4.08 | 21.23 | 4.06 | 14.45 | NA | 1.86 | 0.53 |
| Resemble_enhance | 2.84 | 3.58 | 3.14 | NA | NA | 0.94 | 12.42 | 12.79 | 9.08 | 7.07 | 3.53 | 3.42 | 3.99 | 3.12 | 13.62 | 12.66 | 10.31 | 14.56 | 1.50 | 1.66 | 1.54 |
| DeepFilterNet | 3.03 | 3.71 | 3.29 | 3.55 | 4.20 | 0.94 | 15.71 | 15.66 | 9.66 | 7.19 | 3.47 | 3.40 | 4.00 | 3.10 | 28.01 | 16.20 | 15.79 | 15.69 | 0.55 | 0.94 | 1.77 |
- Resemble_enhance (GitHub) is an open-source 44.1 kHz speech enhancement platform released by Resemble-AI in 2023; we resampled its output to 48 kHz before evaluation.
- DeepFilterNet (GitHub) is a low-complexity speech enhancement framework for full-band audio (48 kHz) based on deep filtering.
Note: We observed anomalies in two speech metrics, LLR and LSD, after processing with the 48 kHz models. We will further investigate the issue to identify the cause.
Speech separation models:
We evaluated our speech separation model MossFormer2_SS_16K on the popular benchmark testsets: LRS2_2Mix (16 kHz), WSJ0-2Mix (8 kHz), Libri2Mix (8 kHz), and WHAM! (8 kHz). We compare our model with the following state-of-the-art models: Conv-TasNet, DualPathRNN, DPTNet, SepFormer, TDANet, TF-GridNet, and SPMamba. The testing results are taken from the TDANet and SPMamba GitHub repos. SI-SNRi (SI-SNR improvement) is used as the performance metric for all evaluations.
| Model | LRS2_2Mix (16 kHz) | WSJ0-2Mix (8 kHz) | Libri2Mix (8kHz) | WHAM! (8 kHz) |
|---|---|---|---|---|
| Conv-TasNet | 10.6 | 15.3 | 12.2 | 12.7 |
| DualPathRNN | 12.7 | 18.8 | 16.1 | 13.7 |
| DPTNet | 13.3 | 20.2 | 16.7 | 14.9 |
| SepFormer | 13.5 | 20.4 | 17.0 | 14.4 |
| TDANet Large | 14.2 | 18.5 | 17.4 | 15.2 |
| TF-GridNet | - | 22.8 | 19.8 | 16.9 |
| SPMamba | - | 22.5 | 19.9 | 17.4 |
| MossFormer2_SS_16K | 15.5 | 22.0 | 16.7 | 17.4 |
Note: The MossFormer2_SS_16K results presented are from our unified model, evaluated without retraining on individual datasets. This 16 kHz model was used for speech separation on the 16 kHz test set, with scores then calculated on the downsampled 8 kHz audio. All comparison models were trained and tested separately on each dataset.
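SI-SNRi, as reported in the table above, is the SI-SNR of the separated signal minus the SI-SNR of the unprocessed mixture, both measured against the clean target. The sketch below implements the standard definition with NumPy (function names are ours; SpeechScore or the cited repos may differ in details such as epsilon handling):

```python
import numpy as np

def si_snr(est, target, eps=1e-8):
    """Scale-invariant SNR in dB: project est onto target, compare to the residual."""
    est = est - np.mean(est)
    target = target - np.mean(target)
    s_target = (np.dot(est, target) / (np.dot(target, target) + eps)) * target
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(est, mix, target):
    """SI-SNRi: gain of the estimate over the raw mixture, both vs. the clean target."""
    return si_snr(est, target) - si_snr(mix, target)
```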
Speech super-resolution model:
We demonstrated the effectiveness of our speech super-resolution model, MossFormer2_SR_48K, using the VoiceBank+DEMAND 48 kHz test set. For the super-resolution evaluation, the test set was downsampled to 16 kHz, 24 kHz, and 32 kHz. The Log Spectral Distance (LSD) and PESQ metrics were used for evaluation. Recognizing that speech quality is impacted by both lower sampling rates and background noise, we also incorporated our speech enhancement model, MossFormer2_SE_48K, to reduce noise prior to super-resolution processing. Results are presented in the following table.
| Model | LSD (16 kHz) | LSD (24 kHz) | LSD (32 kHz) | LSD (48 kHz) | PESQ |
|---|---|---|---|---|---|
| Origin | 2.80 | 2.60 | 2.29 | 1.46 | 1.97 |
| Enhanced | 1.93 | 1.52 | 1.50 | 1.42 | 3.15 |
For the 48 kHz case, speech super-resolution was not applied. The final two columns show that MossFormer2_SE_48K significantly improves the 16 kHz PESQ score but only marginally improves LSD. Therefore, LSD improvements at 16 kHz, 24 kHz, and 32 kHz are primarily attributed to MossFormer2_SR_48K.
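LSD compares log-magnitude spectra frame by frame; exact implementations vary (FFT size, log base, power vs. magnitude spectra), so the sketch below shows one common variant, not necessarily the one used for the table above:

```python
import numpy as np

def lsd(ref, est, n_fft=1024, hop=256, eps=1e-8):
    """Log-spectral distance: RMS over frequency of the log power-spectrum gap, averaged over frames."""
    def power_spec(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2

    s_ref, s_est = power_spec(ref), power_spec(est)
    d = (np.log10(s_ref + eps) - np.log10(s_est + eps)) ** 2
    return float(np.mean(np.sqrt(np.mean(d, axis=1))))
```

Identical signals give an LSD of 0; larger values indicate a bigger spectral mismatch, which is why lower numbers are better in the table.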
File details
Details for the file clearvoice-0.1.2.tar.gz.
File metadata
- Download URL: clearvoice-0.1.2.tar.gz
- Upload date:
- Size: 154.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `810345298445bfbdfaf135ef3862b7ad1c74675ae180f88fe22fbf0c12b370b7` |
| MD5 | `4753aadce635b01299dbc0b65cccbf10` |
| BLAKE2b-256 | `00a17c340ab0501d023f7f19aff7b3ec0c630e741b7a2f9e922a84b1620367d6` |
File details
Details for the file clearvoice-0.1.2-py3-none-any.whl.
File metadata
- Download URL: clearvoice-0.1.2-py3-none-any.whl
- Upload date:
- Size: 183.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e92f300a6b41b5c0d2bb4a0455db100680581eb62f34a67db0ab5bfad09aa88a` |
| MD5 | `7e4fe55664f615cace481aca2bfb8067` |
| BLAKE2b-256 | `bd54bfb9f87e3d1e42aec149bc4216732a2b620a04f5b23753b7b44263bd7304` |