Unified speech tokenizer for speech language model

SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models

Introduction

This is the code for SpeechTokenizer, presented in the paper SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models. SpeechTokenizer is a unified speech tokenizer for speech language models that adopts an encoder-decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across the RVQ layers. Specifically, the code indices output by the first RVQ quantizer can be treated as semantic tokens, while the outputs of the remaining quantizers mainly carry timbre information and supplement what the first quantizer loses. We provide the following models:

  • A model operating at 16 kHz on monophonic speech, trained on LibriSpeech with the average representation across all HuBERT layers as the semantic teacher.
  • A model with Snake activation operating at 16 kHz on monophonic speech, trained on LibriSpeech and Common Voice with the average representation across all HuBERT layers as the semantic teacher.
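
The layer-by-layer idea behind RVQ can be illustrated with a toy example. The sketch below is not SpeechTokenizer's implementation: the codebooks are tiny hand-written lists, and the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are made up for illustration. It only shows that the first layer gives a coarse approximation and each later layer quantizes the residual left by the previous ones.

```python
# Toy residual vector quantization (RVQ) in pure Python.
# The codebooks are hand-written examples, not SpeechTokenizer's learned ones.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    dists = [sum((c - v) ** 2 for c, v in zip(entry, vec)) for entry in codebook]
    return dists.index(min(dists))

def rvq_encode(codebooks, vec):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next layer sees what is still missing.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Sum the selected entries of the first len(indices) layers."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],     # layer 1: coarse approximation
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # layer 2: refines the residual
]
x = [1.25, 1.0]
codes = rvq_encode(codebooks, x)
coarse = rvq_decode(codebooks, codes[:1])  # first layer only
full = rvq_decode(codebooks, codes)        # all layers

print("codes:", codes)
print("first layer only:", coarse)
print("all layers:", full)
```

In this toy setting, decoding from the first index alone recovers only the coarse part of `x`, and adding the second layer restores the rest. This loosely mirrors how the first quantizer's tokens are described as semantic and the remaining ones as supplementary.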


Overview


The SpeechTokenizer framework.


You are welcome to try our SLMTokBench, and we will also open source our USLM!

Quick Link

Release

  • [2024/6/9] 🔥 We released the training code of SpeechTokenizer.
  • [2024/3] 🔥 We released a checkpoint of SpeechTokenizer with Snake activation trained on LibriSpeech and Common Voice.
  • [2023/9/11] 🔥 We released code of soundstorm_speechtokenizer.
  • [2023/9/10] 🔥 We released code and checkpoints of USLM.
  • [2023/9/1] 🔥 We released code and checkpoints of SpeechTokenizer. Check out the paper and demo.

Samples

Samples are provided on our demo page.

Installation

SpeechTokenizer requires Python>=3.8 and a reasonably recent version of PyTorch. To install SpeechTokenizer, you can run:

pip install -U speechtokenizer

# or you can clone the repo and install locally
git clone https://github.com/ZhangXInFD/SpeechTokenizer.git
cd SpeechTokenizer
pip install .
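
After installing, a quick environment check can confirm the Python version and whether the package is visible to the interpreter. This is a minimal sketch using only the standard library. The helper name `check_environment` is made up for illustration, and the distribution name `speechtokenizer` is taken from the `pip install` command above.

```python
import sys
from importlib import metadata

def check_environment(package="speechtokenizer", min_python=(3, 8)):
    """Return (python_ok, installed_version_or_None) for the given package."""
    python_ok = sys.version_info >= min_python
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this environment
    return python_ok, version

python_ok, version = check_environment()
print("Python >= 3.8:", python_ok)
print("speechtokenizer version:", version or "not installed")
```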

Model List

Model | Dataset | Description
speechtokenizer_hubert_avg | LibriSpeech | Adopt average representation across all HuBERT layers as semantic teacher
speechtokenizer_snake | LibriSpeech + Common Voice | Snake activation, average representation across all HuBERT layers

Usage

Load model

from speechtokenizer import SpeechTokenizer

config_path = '/path/config.json'
ckpt_path = '/path/SpeechTokenizer.pt'
model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()

Extracting discrete representations

import torchaudio
import torch

# Load and pre-process speech waveform
wav, sr = torchaudio.load('<SPEECH_FILE_PATH>')

# Keep only the first channel if the audio is not mono
if wav.shape[0] > 1:
    wav = wav[:1, :]

if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)

wav = wav.unsqueeze(0)

# Extract discrete codes from SpeechTokenizer
with torch.no_grad():
    codes = model.encode(wav) # codes: (n_q, B, T)

RVQ_1 = codes[:1, :, :] # Contain content info, can be considered as semantic tokens
RVQ_supplement = codes[1:, :, :] # Contain timbre info, complete info lost by the first quantizer

Decoding discrete representations

# Concatenating semantic tokens (RVQ_1) and supplementary timbre tokens and then decoding
wav = model.decode(torch.cat([RVQ_1, RVQ_supplement], axis=0))

# Decoding RVQ-i:j tokens, i.e. the tokens from the i-th through the j-th quantizer
wav = model.decode(codes[i: (j + 1)], st=i)
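
The partial decode above leaves `i` and `j` for the caller to choose. The sketch below wraps that call with a bounds check. It assumes `i` and `j` are 0-based indices into the first dimension of `codes`, as the slicing suggests. `decode_range` and `EchoModel` are hypothetical names used for illustration; `EchoModel` only stands in for a loaded SpeechTokenizer so the snippet runs without a checkpoint.

```python
def decode_range(model, codes, i, j):
    """Decode using only quantizer layers i..j (inclusive, assumed 0-based).

    Mirrors the call shown above: model.decode(codes[i:j + 1], st=i).
    """
    n_q = len(codes)  # the first dimension of codes is the number of quantizers
    if not 0 <= i <= j < n_q:
        raise ValueError(f"need 0 <= i <= j < {n_q}, got i={i}, j={j}")
    return model.decode(codes[i:j + 1], st=i)

class EchoModel:
    """Hypothetical stand-in for a loaded SpeechTokenizer (demo only)."""
    def decode(self, codes, st=0):
        # Report what it was asked to decode instead of producing audio.
        return {"n_layers": len(codes), "st": st}

# Nested lists standing in for a tensor of shape (n_q=4, B=1, T=2).
codes = [[[11, 12]], [[21, 22]], [[31, 32]], [[41, 42]]]
result = decode_range(EchoModel(), codes, 1, 2)
print(result)
```

With a real model, `codes` would be the tensor returned by `model.encode(wav)`, and the return value would be the decoded waveform.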

Train SpeechTokenizer

In the following section, we describe how to train a SpeechTokenizer model by using our trainer.

Data Preprocess

To train SpeechTokenizer, the first step is to extract semantic teacher representations from raw audio waveforms. We provide an example of how to extract HuBERT representations in scripts/hubert_rep_extract.sh. The arguments are explained below:

  • --config: Config file path. An example is provided in config/spt_base_cfg.json. You can modify the semantic_model_path and semantic_model_layer parameters in this file to change the HuBERT model and the target layer.
  • --audio_dir: The path to the folder containing all audio files.
  • --rep_dir: The path to the folder storing all semantic representation files.
  • --exts: The file extension of the audio files. Use ',' to separate multiple extensions if they exist.
  • --split_seed: Random seed for splitting training set and validation set.
  • --valid_set_size: The size of validation set. When this number is between 0 and 1, it represents the proportion of the total dataset used for the validation set.
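
The interaction between --split_seed and --valid_set_size can be sketched as follows. This is an assumption about how such a split could work, not the logic of the repository's extraction script: `split_train_valid` is a made-up helper, and reading values of 1 or more as an absolute file count is an inference from the description above.

```python
import random

def split_train_valid(paths, valid_set_size, split_seed):
    """Shuffle paths with a fixed seed and split off a validation set.

    A valid_set_size strictly between 0 and 1 is read as a fraction of the
    dataset; any other value is read as an absolute number of files (assumed).
    """
    paths = sorted(paths)                     # stable order before shuffling
    random.Random(split_seed).shuffle(paths)  # seeded, reproducible shuffle
    if 0 < valid_set_size < 1:
        n_valid = int(len(paths) * valid_set_size)
    else:
        n_valid = int(valid_set_size)
    n_valid = min(n_valid, len(paths))
    return paths[n_valid:], paths[:n_valid]

# Hypothetical file names, for illustration only.
files = [f"audio_{k:03d}.flac" for k in range(100)]
train, valid = split_train_valid(files, valid_set_size=0.1, split_seed=0)
print(len(train), len(valid))
```

Using the same seed again reproduces the same split, which is the point of exposing the seed as an argument.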

Train

You can use SpeechTokenizerTrainer to train a SpeechTokenizer as follows:

from speechtokenizer import SpeechTokenizer, SpeechTokenizerTrainer
from speechtokenizer.discriminators import MultiPeriodDiscriminator, MultiScaleDiscriminator, MultiScaleSTFTDiscriminator
import json


# Load model and trainer config
with open('<CONFIG_FILE_PATH>') as f:
    cfg = json.load(f)

# Initialize SpeechTokenizer
generator = SpeechTokenizer(cfg)

# Initialize the discriminators. You can add any discriminator that is not yet implemented in this repository, as long as the output format remains consistent with the discriminators in `speechtokenizer.discriminators`.
discriminators = {'mpd':MultiPeriodDiscriminator(), 'msd':MultiScaleDiscriminator(), 'mstftd':MultiScaleSTFTDiscriminator(32)}

# Initialize Trainer
trainer = SpeechTokenizerTrainer(generator=generator,
                                discriminators=discriminators,
                                cfg=cfg)

# Start training
trainer.train()

# Continue training from checkpoints
trainer.continue_train()

We provide example training scripts in scripts/train_example.sh. All arguments for SpeechTokenizerTrainer are defined in config/spt_base_cfg.json. Below, we explain some of the important arguments:

  • train_files and valid_files: Training file path and validation file path. These files should be text files listing the paths of all audio files and their corresponding semantic representation files in the training/validation set. Each line should follow the format: "<audio_file_path>\t<semantic_file_path>". If you use scripts/hubert_rep_extract.sh to extract semantic representations, these two files will be generated automatically.
  • distill_type: Use "d_axis" for D-axis distillation loss and "t_axis" for T-axis distillation loss, as mentioned in the paper.
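
If you build train_files or valid_files yourself, the tab-separated line format described above can be written and checked with a few lines of Python. This is a minimal sketch: the helper names and the example paths (including the `.hubert.npy` suffix) are hypothetical and are not taken from the repository.

```python
def format_manifest_line(audio_path, semantic_path):
    """Build one '<audio_file_path>\\t<semantic_file_path>' line."""
    return f"{audio_path}\t{semantic_path}"

def parse_manifest(text):
    """Parse manifest text into (audio_path, semantic_path) pairs."""
    pairs = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated fields")
        pairs.append((fields[0], fields[1]))
    return pairs

# Hypothetical paths, for illustration only.
manifest = "\n".join([
    format_manifest_line("data/audio/a.flac", "data/rep/a.hubert.npy"),
    format_manifest_line("data/audio/b.flac", "data/rep/b.hubert.npy"),
])
pairs = parse_manifest(manifest)
print(pairs)
```

Validating each line before training catches a common mistake, such as separating the two paths with spaces instead of a tab.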

Quick Start

If you want to fully follow our experimental setup, set semantic_model_path in config/spt_base_cfg.json, set AUDIO_DIR, REP_DIR, and EXTS (plus any optional arguments) in scripts/hubert_rep_extract.sh, then execute the following code:

cd SpeechTokenizer

# Extract semantic representations
bash scripts/hubert_rep_extract.sh

# Train
bash scripts/train_example.sh

Citation

If you use this code or these results in your paper, please cite our work as:

@misc{zhang2023speechtokenizer,
      title={SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models}, 
      author={Xin Zhang and Dong Zhang and Shimin Li and Yaqian Zhou and Xipeng Qiu},
      year={2023},
      eprint={2308.16692},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

The code in this repository is released under the Apache 2.0 license as found in the LICENSE file.
