
Unified speech tokenizer for speech language models

Project description

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

Introduction

This is the code for the SpeechTokenizer model presented in SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models. SpeechTokenizer is a unified speech tokenizer for speech large language models that adopts an encoder-decoder architecture with residual vector quantization (RVQ). By unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across the RVQ layers: the code indices output by the first RVQ quantizer can be considered semantic tokens, while the outputs of the remaining quantizers can be regarded as acoustic tokens, which supplement the information lost by the first quantizer. We provide our models:

  • A model operating at 16 kHz on monophonic speech, trained on LibriSpeech, with the average representation across all HuBERT layers as the semantic teacher.


Overview


The SpeechTokenizer framework.


You are welcome to try our SLMTokBench, and we will also open-source our USLM!

Samples

Samples are provided on our demo page.

Installation

SpeechTokenizer requires Python >= 3.8 and a reasonably recent version of PyTorch. You can install SpeechTokenizer either from PyPI or from source from this repository:

# Install from PyPI
pip install -U speechtokenizer

# Or install from source
git clone https://github.com/ZhangXInFD/SpeechTokenizer.git
cd SpeechTokenizer
pip install .
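
A quick way to confirm the installation is to import the tokenizer class used throughout the examples below; this is just a minimal sanity check:

# Minimal sanity check: the import should succeed without errors
from speechtokenizer import SpeechTokenizer
print(SpeechTokenizer)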

Usage

Model storage

Model list

Load model

from speechtokenizer import SpeechTokenizer

config_path = '/path/config.json'
ckpt_path = '/path/SpeechTokenizer.pt'
model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()
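
If the checkpoints are hosted on the Hugging Face Hub (see the model list above), they can also be fetched programmatically. The sketch below assumes a repository id of fnlp/SpeechTokenizer and the file names shown; both are assumptions, so substitute the entries from the model list:

# Hypothetical sketch: fetch the config and checkpoint from the Hugging Face Hub.
# The repo id and file names below are assumptions; use the ones from the model list.
from huggingface_hub import hf_hub_download
from speechtokenizer import SpeechTokenizer

config_path = hf_hub_download(repo_id='fnlp/SpeechTokenizer',
                              filename='speechtokenizer_hubert_avg/config.json')        # assumed path
ckpt_path = hf_hub_download(repo_id='fnlp/SpeechTokenizer',
                            filename='speechtokenizer_hubert_avg/SpeechTokenizer.pt')   # assumed path

model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()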

Extracting discrete representations

import torchaudio
import torch

# Load and pre-process the speech waveform
wav, sr = torchaudio.load('<SPEECH_FILE_PATH>')
# The model operates on monophonic speech, so keep a single channel
if wav.shape[0] > 1:
    wav = wav[:1, :]
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
wav = wav.unsqueeze(0)  # (B, 1, T)

# Extract discrete codes from SpeechTokenizer
with torch.no_grad():
    codes = model.encode(wav) # codes: (n_q, B, T)

semantic_tokens = codes[0, :, :]   # first RVQ layer: semantic tokens
acoustic_tokens = codes[1:, :, :]  # remaining RVQ layers: acoustic tokens

Decoding discrete representations

# Decoding from the first quantizer up to the i-th quantizer
wav = model.decode(codes[:(i + 1)]) # wav: (B, 1, T)

# Decoding from the i-th quantizer up to the j-th quantizer
wav = model.decode(codes[i: (j + 1)], st=i)

# Concatenating semantic tokens and acoustic tokens and then decoding
semantic_tokens = ... # (..., B, T)
acoustic_tokens = ... # (..., B, T)
wav = model.decode(torch.cat([semantic_tokens, acoustic_tokens], dim=0))
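
Putting the pieces together, a minimal end-to-end round trip (load the model, encode, decode with all quantizers, and save the reconstruction) could look like the sketch below; the paths are placeholders:

# End-to-end sketch: tokenize a speech file and reconstruct it.
# Paths are placeholders; the API calls mirror the snippets above.
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

model = SpeechTokenizer.load_from_checkpoint('/path/config.json', '/path/SpeechTokenizer.pt')
model.eval()

wav, sr = torchaudio.load('<SPEECH_FILE_PATH>')
if sr != model.sample_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
wav = wav.unsqueeze(0)                      # (B, 1, T)

with torch.no_grad():
    codes = model.encode(wav)               # (n_q, B, T)
    recon = model.decode(codes)             # reconstruct from all quantizers

torchaudio.save('reconstruction.wav', recon.squeeze(0), model.sample_rate)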

Citation

If you use this code or its results in your paper, please cite our work as:

License

The code in this repository is released under the Apache 2.0 license as found in the LICENSE file.
