A Speech-to-Text toolkit with VAD, punctuation, and emotion classification
DSpeech: A Command-line Speech Processing Toolkit
DSpeech is an advanced command-line toolkit designed for speech processing tasks such as transcription, voice activity detection (VAD), punctuation addition, and emotion classification. It is built on top of state-of-the-art models and provides an easy-to-use interface for handling various speech processing jobs.
1. Installation
1.1 Prerequisites
- Python 3.6 or later
- PyTorch 1.7 or later
- torchaudio
- rich
- soundfile
- funasr (A lightweight AutoModel library for speech processing)
1.2 Installation Steps
- Clone the repository:

  ```bash
  git clone https://gitee.com/iint/dspeech.git
  cd dspeech
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Alternatively, install dspeech directly from PyPI:

  ```bash
  pip install dspeech
  ```

- Set the `DSPEECH_HOME` environment variable to the directory where your models are stored:

  ```bash
  export DSPEECH_HOME=/path/to/dspeech/models
  ```

- Download the necessary models and place them in the `DSPEECH_HOME` directory. You can download them with the following commands (replace `<model_id>` with the actual model ID):

  ```bash
  export HF_ENDPOINT=https://hf-mirror.com
  huggingface-cli download --resume-download <model_id> --local-dir $DSPEECH_HOME/<model_name>
  ```

- (Optional) Install Dguard if you want to perform speaker diarization:

  ```bash
  pip install dguard==0.1.20
  export DGUARD_MODEL_PATH=<path to dguard model home>
  dguard_info
  ```

- Print the help message to see the available commands:

  ```bash
  dspeech help
  ```

You should see a list of available commands and options:
```
DSpeech: A Command-line Speech Processing Toolkit
Usage: dspeech

Commands:
  help              Show this help message
  transcribe        Transcribe an audio file
  vad               Perform VAD on an audio file
  punc              Add punctuation to a text
  emo               Perform emotion classification on an audio file
  clone             Clone speaker's voice and generate audio
  clone_with_emo    Clone speaker's voice with emotion and generate audio

Options (for asr and emotion classify):
  --model                  Model name (default: sensevoicesmall)
  --vad-model              VAD model name (default: fsmn-vad)
  --punc-model             Punctuation model name (default: ct-punc)
  --emo-model              Emotion model name (default: emotion2vec_plus_large)
  --device                 Device to run the models on (default: cuda)
  --file                   Audio file path for transcribing, VAD, or emotion classification
  --text                   Text to process with punctuation model
  --start                  Start time in seconds for processing audio files (default: 0)
  --end                    End time in seconds for processing audio files (default: end of file)
  --sample-rate            Sample rate of the audio file (default: 16000)

Options (for tts):
  --ref_audio              Reference audio file path for voice cloning
  --ref_text               Reference text for voice cloning
  --speaker_folder         Speaker folder path for emotional voice cloning
  --text                   Text to generate audio
  --audio_save_path        Path to save the audio
  --spectrogram_save_path  [Optional] Path to save the spectrogram
  --speed                  Speed of the audio
  --sample_rate            Sample rate of the audio file (default: 16000)

Example: dspeech transcribe --file audio.wav
```
2. Features
DSpeech offers the following functionalities:
- Transcription: Convert audio files to text using state-of-the-art speech recognition models.
- Voice Activity Detection (VAD): Detect and segment speech regions in an audio file.
- Punctuation Addition: Add punctuation to raw text transcriptions to improve readability.
- Emotion Classification: Classify the emotional content of an audio file into various categories.
- Voice Cloning: Clone a voice from a given audio file using a text-to-speech (TTS) model.
- Emotion TTS: Generate emotional speech using a text-to-speech (TTS) model.
3. Introduction to the dspeech.STT Module
To use DSpeech in a Python script, you can import the `STT` class and create an instance with the desired models:
```python
from dspeech.stt import STT

# Initialize the STT handler with the specified models
handler = STT(model_name="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc", emo_model="emotion2vec_plus_large")

# Transcribe an audio file
transcription = handler.transcribe_file("audio.wav")
print(transcription)

# Perform VAD on an audio file
vad_result = handler.vad_file("audio.wav")
print(vad_result)

# Add punctuation to a text
punctuation_text = handler.punc_result("this is a test")
print(punctuation_text)

# Perform emotion classification on an audio file
emotion_result = handler.emo_classify_file("audio.wav")
print(emotion_result)
```
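If a transcription comes back without punctuation, it can be passed straight to the punctuation model. A minimal sketch reusing the `handler` created above (assuming both methods take and return plain strings):

```python
# Transcribe the audio, then restore punctuation on the raw text.
raw_text = handler.transcribe_file("audio.wav")   # assumed to return a plain string
punctuated = handler.punc_result(raw_text)
print(punctuated)
```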
4. Introduction to the dspeech.TTS Module
4.1 Initialization
To initialize the `TTS` module, create a `TTS` handler object, specifying the target device (CPU or GPU) and the sample rate for generated audio.
```python
from dspeech import TTS
import torch

# Initialize TTS handler
tts_handler = TTS(
    device="cuda",             # Use "cpu" if no GPU is available
    target_sample_rate=24000   # Target sample rate for output audio
)
```
The `device` parameter can be set to "cuda" for GPU usage or "cpu" for running on the CPU.
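If you would rather not hard-code the device, it can be chosen at runtime. A small sketch, assuming the same `TTS` constructor as above:

```python
import torch
from dspeech import TTS

# Fall back to the CPU automatically when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
tts_handler = TTS(device=device, target_sample_rate=24000)
```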
4.2 Basic Voice Cloning
In basic voice cloning, you provide a reference audio clip and its transcription, and the system generates new speech that mimics the voice in the reference audio while speaking the provided text.
```python
import torchaudio

# Load reference audio using torchaudio
ref_audio, sample_rate = torchaudio.load("tests/a.wav")

# Clone voice based on reference audio and text
r = tts_handler.clone(
    ref_audio=(ref_audio, sample_rate),   # Reference audio as (Tensor, int) or a file path
    ref_text="Reference text",            # Transcription of the reference audio
    gen_text_batches=["Hello, my name is Xiao Ming", "I am an AI", "I can speak Chinese"],  # Text to generate speech for
    speed=1,                              # Speech speed (1 is normal speed)
    channel=-1,                           # Merge all channels (-1) or specify one channel
    remove_silence=True,                  # Remove silence from the reference audio
    wave_path="tests/tts_output.wav",     # Path to save the generated audio
    spectrogram_path="tests/tts_output.png",  # Path to save the spectrogram of the generated audio
    concat=True                           # Merge all generated audio into a single output file
)
```
Parameters:
- `ref_audio`: The reference audio, either as a `(Tensor, int)` pair or as a file path.
- `ref_text`: The transcription of the reference audio.
- `gen_text_batches`: A list of text strings that you want to convert into speech.
- `speed`: Adjusts the speed of the generated speech (default is 1, for normal speed).
- `remove_silence`: Whether to remove silence from the reference audio (boolean).
- `wave_path`: Path to save the generated audio file.
- `spectrogram_path`: Path to save the spectrogram image of the generated audio.
4.3 Extracting Speaker Information
For complex voice cloning with multiple speakers and emotions, you need to extract speaker information from a directory containing multiple audio files for different speakers and emotions.
The directory should have the following structure:
```
<path>/
├── speaker1/
│   ├── happy.wav
│   ├── happy.txt
│   ├── neutral.wav
│   ├── neutral.txt
```
Each subdirectory represents a speaker, and each audio file should have an accompanying text file.
```python
# Extract speaker information from the folder
spk_info = tts_handler.get_speaker_info("tests/speaker")
print(spk_info)
```
This function returns a dictionary of speaker information, which will be used for advanced cloning tasks.
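Before calling `get_speaker_info`, it can help to verify that every emotion recording has a matching transcription. A minimal sketch of such a check, following the layout above (the `check_speaker_folder` helper is illustrative and not part of dspeech):

```python
from pathlib import Path

def check_speaker_folder(root: str) -> None:
    """Warn about .wav files that lack a matching .txt transcription."""
    for speaker_dir in Path(root).iterdir():
        if not speaker_dir.is_dir():
            continue
        for wav in speaker_dir.glob("*.wav"):
            txt = wav.with_suffix(".txt")
            if not txt.exists():
                print(f"Missing transcription for {wav}")

check_speaker_folder("tests/speaker")
```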
4.4 Voice Cloning with Emotions
To clone a voice with emotional expression, you can use the `clone_with_emo` method. The generated text should contain emotional markers, e.g. `[[zhaosheng_angry]]`, where `zhaosheng` is the speaker and `angry` is the emotion.
```python
r = tts_handler.clone_with_emo(
    gen_text_batches=[
        "[[zhaosheng_angry]] How could you talk to me like that? It's too much!",
        "[[zhaosheng_whisper]] Be careful, don't let anyone hear, it's a secret.",
        "[[zhaosheng_sad]] I'm really sad, things are out of my control."
    ],
    speaker_info=spk_info,           # Dictionary of speaker information
    speed=1,                         # Speech speed
    channel=-1,                      # Merge all channels
    remove_silence=True,             # Remove silence in the generated output
    wave_path="tests/tts_output_emo.wav",          # Path to save output audio with emotions
    spectrogram_path="tests/tts_output_emo.png"    # Path to save spectrogram with emotions
)
```
4.5 Multi-Speaker and Multi-Emotion Dialogues
For generating dialogues between multiple speakers with different emotions, make sure the directory `tests/speaker` contains a subdirectory for each speaker, and that the corresponding audio and text files exist for each emotion.
```python
# Extract speaker information for multiple speakers
spk_info = tts_handler.get_speaker_info("tests/speaker")

# Generate a multi-speaker, multi-emotion dialogue
r = tts_handler.clone_with_emo(
    gen_text_batches=[
        "[[zhaosheng_angry]] How could you talk to me like that? It's too much!",
        "[[duanyibo_whisper]] Be careful, don't let anyone hear, it's a secret.",
        "[[zhaosheng_sad]] I'm really sad, things are out of my control."
    ],
    speaker_info=spk_info,           # Speaker information extracted from the directory
    speed=1,                         # Speech speed
    channel=-1,                      # Merge all channels
    remove_silence=True,             # Remove silence from the reference audio
    wave_path="tests/tts_output_emo.wav",          # Path to save the generated audio
    spectrogram_path="tests/tts_output_emo.png"    # Path to save the generated spectrogram
)
```
This method will generate a single audio file containing speech from multiple speakers with different emotional expressions.
4.6 Output Files
- Wave Path (`wave_path`): Specifies where to save the generated audio output. If `concat=True`, all `gen_text_batches` will be concatenated into one audio file.
- Spectrogram Path (`spectrogram_path`): Specifies where to save the spectrogram image of the generated speech. This is useful for visual analysis of the audio.
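To sanity-check the file written to `wave_path`, you can reload it and inspect its sample rate and duration. A short sketch using torchaudio (already used in the examples above), with the output path from section 4.2:

```python
import torchaudio

# Reload the generated audio and report its sample rate, duration, and channel count.
waveform, sample_rate = torchaudio.load("tests/tts_output.wav")
duration = waveform.shape[1] / sample_rate
print(f"{sample_rate} Hz, {duration:.2f} s, {waveform.shape[0]} channel(s)")
```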
Command-line Interface
DSpeech provides a command-line interface for quick and easy access to its functionalities. To see the available commands, run:
```bash
dspeech help
```

```
DSpeech: A Command-line Speech Processing Toolkit
Usage: dspeech

Commands:
  transcribe    Transcribe an audio file
  vad           Perform VAD on an audio file
  punc          Add punctuation to a text
  emo           Perform emotion classification on an audio file

Options:
  --model          Model name (default: sensevoicesmall)
  --vad-model      VAD model name (default: fsmn-vad)
  --punc-model     Punctuation model name (default: ct-punc)
  --emo-model      Emotion model name (default: emotion2vec_plus_large)
  --device         Device to run the models on (default: cuda)
  --file           Audio file path for transcribing, VAD, or emotion classification
  --text           Text to process with punctuation model
  --start          Start time in seconds for processing audio files (default: 0)
  --end            End time in seconds for processing audio files (default: end of file)
  --sample-rate    Sample rate of the audio file (default: 16000)

Example: dspeech transcribe --file audio.wav
```
Usage Examples
- Transcribe an audio file:

  ```bash
  dspeech transcribe --file audio.wav
  ```

- Perform VAD on an audio file:

  ```bash
  dspeech vad --file audio.wav
  ```

- Add punctuation to a text:

  ```bash
  dspeech punc --text "this is a test"
  ```

- Perform emotion classification on an audio file:

  ```bash
  dspeech emo --file audio.wav
  ```
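The same commands can be scripted from Python with the standard library. A minimal sketch, assuming `dspeech` is on the PATH and using only flags documented in the help output above:

```python
import subprocess

# Run the documented CLI command and capture its text output.
result = subprocess.run(
    ["dspeech", "transcribe", "--file", "audio.wav", "--device", "cpu"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```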
License
DSpeech is licensed under the MIT License. See the LICENSE file for more details.