Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice

These details have not been verified by PyPI

Project links

Homepage

Project description

Reverse Engineering of S3Tokenizer

Supervised Semantic Speech Tokenizer (S3Tokenizer)

S3Tokenizer was initially introduced in CosyVoice [Paper] [Repo]. It is a Supervised Semantic Speech Tokenizer based on the pre-trained SenseVoice-Large model. It strengthens the semantic relationship between the extracted tokens and textual and paralinguistic information, is robust to data noise, and reduces the reliance on clean data collection, which allows a broader range of data to be used for model training.

However, as indicated in this [issue], the authors do not intend to open-source the PyTorch implementation of the S3Tokenizer and only plan to release an ONNX file. As a result, users who want to fine-tune CosyVoice must extract speech codes offline with the batch size restricted to 1, which is notably time-consuming (refer to [cosyvoice/tools/extract_speech_token.py]).

This repository reverse-engineers the S3Tokenizer and offers:

A pure PyTorch implementation of S3Tokenizer, compatible with initializing weights from the released ONNX file.
High-throughput batch inference, achieving a ??x speedup compared to the original inference pipeline in [cosyvoice/tools/extract_speech_token.py].
The capability to perform online speech code extraction during SpeechLLM training.

Setup

pip install s3tokenizer

Usage-1: Offline batch inference

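import torch
import s3tokenizer

# Load the tokenizer and move it to the GPU
tokenizer = s3tokenizer.load_model("speech_tokenizer_v1").cuda()

# Placeholder paths; replace them with any number of real audio files
wav_paths = ["path_to_wav1", "path_to_wav2", "path_to_wavn"]

# Compute one log-mel spectrogram per utterance
mels = []
for wav_path in wav_paths:
    audio = s3tokenizer.load_audio(wav_path)
    mels.append(s3tokenizer.log_mel_spectrogram(audio))

# Pad to a common length, then quantize the whole batch at once
mels, mels_lens = s3tokenizer.padding(mels)
codes, codes_lens = tokenizer.quantize(mels.cuda(), mels_lens.cuda())

# Drop the padded positions of each utterance before printing
for i in range(len(wav_paths)):
    print(codes[i, :codes_lens[i].item()])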
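
For file lists too large to fit in a single batch, the same steps can be applied to fixed-size slices of `wav_paths`. The following is a minimal, dependency-free sketch of that slicing only; the helper name `iter_batches` and the file names are hypothetical and not part of s3tokenizer.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical file list used only for illustration
wav_paths = [f"utt_{i:03d}.wav" for i in range(70)]
batches = list(iter_batches(wav_paths, batch_size=32))
print([len(b) for b in batches])  # [32, 32, 6]
```

Each yielded slice would then go through `load_audio`, `log_mel_spectrogram`, `padding`, and `quantize` exactly as in the example above.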

Usage-2: Online speech code extraction (TODO)

Before (extract code offline)

class SpeechLLM(nn.Module):
    ...
    def __init__(self, ...):
        ...

    def forward(self, speech_codes: Tensor, text_ids: Tensor, ...):
        ...

After (extract code online)

import s3tokenizer

class SpeechLLM(nn.Module):
    ...
    def __init__(self, ...):
        ...
        self.speech_tokenizer = s3tokenizer.load_model("speech_tokenizer_v1")

    def forward(self, speech: Tensor, speech_lens: Tensor, text_ids: Tensor, ...):
        ...
        speech_codes = self.speech_tokenizer(speech, speech_lens)
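
When a tokenizer lives inside the model being trained, one common precaution is to freeze it so that it is neither updated by the optimizer nor switched into training mode. The sketch below shows that pattern under stated assumptions: `DummyTokenizer` is a hypothetical stand-in for the model returned by `s3tokenizer.load_model`, and the sizes (128 mel bins, 4096 codes, 16-dim embedding) are illustrative values. Unlike a real tokenizer, the stand-in does not downsample along time.

```python
import torch
from torch import Tensor, nn


class DummyTokenizer(nn.Module):
    """Hypothetical stand-in for the tokenizer; maps mel frames to code indices."""

    def __init__(self, n_mels: int = 128, codebook_size: int = 4096):
        super().__init__()
        self.proj = nn.Linear(n_mels, codebook_size)

    def forward(self, mels: Tensor, mels_lens: Tensor) -> Tensor:
        # (B, n_mels, T) -> (B, T, codebook_size) -> (B, T) integer codes
        return self.proj(mels.transpose(1, 2)).argmax(dim=-1)


class SpeechLLM(nn.Module):
    def __init__(self, tokenizer: nn.Module):
        super().__init__()
        self.speech_tokenizer = tokenizer
        # Freeze the tokenizer so the optimizer never updates its weights
        for p in self.speech_tokenizer.parameters():
            p.requires_grad_(False)
        self.embed = nn.Embedding(4096, 16)

    def train(self, mode: bool = True):
        super().train(mode)
        self.speech_tokenizer.eval()  # tokenizer always stays in eval mode
        return self

    def forward(self, speech: Tensor, speech_lens: Tensor) -> Tensor:
        with torch.no_grad():  # no gradients flow through code extraction
            speech_codes = self.speech_tokenizer(speech, speech_lens)
        return self.embed(speech_codes)


model = SpeechLLM(DummyTokenizer())
model.train()
speech = torch.randn(2, 128, 50)      # (batch, n_mels, frames)
speech_lens = torch.tensor([50, 40])
out = model(speech, speech_lens)
print(out.shape)
```

Overriding `train` keeps the tokenizer in eval mode even after the surrounding training loop calls `model.train()`.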

Usage-3: Command-line (TODO)

s3tokenizer --wav_scp "xxx.scp" --device "cuda:0" --output "yyy.list" --batch_size 32
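
The layout of `xxx.scp` is not specified here. Assuming the Kaldi-style convention of one `<utt_id> <wav_path>` pair per line, a list file could be generated as in the sketch below; the helper `write_wav_scp` is hypothetical and the demo uses empty placeholder files.

```python
import tempfile
from pathlib import Path


def write_wav_scp(wav_dir: Path, scp_path: Path) -> int:
    """Write one `<utt_id> <absolute_wav_path>` line per .wav file; return the count."""
    wavs = sorted(wav_dir.rglob("*.wav"))
    with scp_path.open("w", encoding="utf-8") as f:
        for wav in wavs:
            # Assumed format: file stem as utterance id, then the resolved path
            f.write(f"{wav.stem} {wav.resolve()}\n")
    return len(wavs)


# Demo on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.wav", "a.wav", "notes.txt"):
        (root / name).touch()
    n = write_wav_scp(root, root / "xxx.scp")
    lines = (root / "xxx.scp").read_text(encoding="utf-8").splitlines()

print(n, [line.split()[0] for line in lines])
```

Non-audio files such as `notes.txt` are skipped, and entries are sorted so the list is reproducible across runs.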

Project details

Release history

Version	Release date
0.0.6	Sep 27, 2024
0.0.5	Sep 18, 2024
0.0.4	Sep 11, 2024
0.0.3	Sep 11, 2024
0.0.2	Sep 10, 2024
0.0.1 (this version)	Sep 10, 2024

Download files

Source Distribution

s3tokenizer-0.0.1.tar.gz (15.2 kB)

Uploaded Sep 10, 2024 Source

Hashes for s3tokenizer-0.0.1.tar.gz

Algorithm	Hash digest
SHA256	`791101496c2d699d0928e33203a219de2767a29efea9d0c6cb318142d05dd82b`
MD5	`f6e70cb89995ec91c84c4a3e1e31f578`
BLAKE2b-256	`9e350108b594c47ae69f2c099c80b22da704c208f6435f47937c13d3d16a8713`
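
The digests listed above can be checked locally after downloading the archive. A minimal sketch using only the standard library follows; the helper name `sha256_of` is illustrative, and the demo hashes a throwaway file so that it runs without the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 digest listed for s3tokenizer-0.0.1.tar.gz
EXPECTED_SHA256 = "791101496c2d699d0928e33203a219de2767a29efea9d0c6cb318142d05dd82b"

# Demo on a temporary file; the real check would hash the downloaded archive
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    sample_digest = sha256_of(sample)

print(sample_digest == hashlib.sha256(b"abc").hexdigest())
```

For the downloaded file, compare `sha256_of(Path("s3tokenizer-0.0.1.tar.gz"))` against `EXPECTED_SHA256` and discard the archive if they differ.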