Stable Codec: A series of codec models for speech and audio
This repository contains training and inference scripts for models in the Stable Codec series, starting with stable-codec-speech-16k, introduced in the paper Scaling Transformers for Low-bitrate High-Quality Speech Coding.
Paper: https://arxiv.org/abs/2411.19842
Sound demos: https://stability-ai.github.io/stable-codec-demo/
Additional training
In addition to the training described in the paper, the released weights have also undergone 500k steps of finetuning with force-aligned phoneme data from LibriSpeech and the English portion of Multilingual LibriSpeech. This was performed using a CTC head to predict the phoneme categories from pre-bottleneck latents. We found that this additional training significantly boosted the applicability of the codec tokens to downstream tasks such as TTS.
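For intuition, this auxiliary objective can be pictured as a small projection head trained with CTC on top of the encoder's pre-bottleneck activations. The following is a minimal sketch, not the repository's implementation; the latent dimension and all names are illustrative, while the class count and blank index match the CTC settings documented later in this README:

```python
import torch
import torch.nn as nn

NUM_PHONE_CLASSES = 81   # 80 phoneme categories + 1 CTC blank (index 80)
LATENT_DIM = 1024        # hypothetical pre-bottleneck latent dimension

proj_head = nn.Linear(LATENT_DIM, NUM_PHONE_CLASSES)
ctc = nn.CTCLoss(blank=80, zero_infinity=True)

def ctc_auxiliary_loss(latents, phone_targets, target_lengths):
    """latents: (batch, time, dim) pre-bottleneck encoder activations."""
    log_probs = proj_head(latents).log_softmax(dim=-1)  # (B, T, C)
    log_probs = log_probs.transpose(0, 1)               # CTCLoss expects (T, B, C)
    input_lengths = torch.full(
        (latents.size(0),), latents.size(1), dtype=torch.long
    )
    return ctc(log_probs, phone_targets, input_lengths, target_lengths)
```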
Install
The model itself is defined in the stable-audio-tools package.
To install stable-codec:
```bash
pip install -e .
pip install -U flash-attn --no-build-isolation
```
IMPORTANT NOTE: This model currently has a hard requirement on FlashAttention due to its use of sliding-window attention. Inference quality without FlashAttention will likely be greatly degraded.
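A quick way to confirm the dependency is present before loading the model (a minimal sketch; `flash_attn` is the import name used by the flash-attn wheel):

```python
# Sanity check: the codec's sliding-window attention needs FlashAttention.
try:
    import flash_attn  # noqa: F401
    print("FlashAttention is available.")
except ImportError:
    print("flash-attn is not installed; codec output will likely be degraded.")
```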
Encoding and decoding
To encode audio or decode tokens, the StableCodec class provides a convenient wrapper for the model. It can be used with a local checkpoint and config as follows:
```python
import torchaudio

from stable_codec.model import StableCodec

model = StableCodec(
    model_config_path="<path-to-model-config>",
    ckpt_path="<path-to-checkpoint>",  # optional, can be `None`
)

audiopath = "audio.wav"

# Encode to continuous latents and discrete tokens, then decode the tokens back to audio.
latents, tokens = model.encode(audiopath)
decoded_audio = model.decode(tokens)

torchaudio.save("decoded.wav", decoded_audio, model.sample_rate)
```
To download the model weights automatically from HuggingFace, simply provide the model name:
```python
model = StableCodec(
    pretrained_model="stabilityai/stable-codec-speech-16k"
)
```
Posthoc bottleneck configuration
Most use cases will benefit from replacing the training-time FSQ bottleneck with a post-hoc FSQ bottleneck, as described in the paper. This reduces the token dictionary size to a level reasonable for modern language models. It is enabled by calling the set_posthoc_bottleneck function and passing a flag to the encode/decode calls:
```python
model.set_posthoc_bottleneck("2x15625_700bps")
latents, tokens = model.encode(audiopath, posthoc_bottleneck=True)
decoded_audio = model.decode(tokens, posthoc_bottleneck=True)
```
set_posthoc_bottleneck takes a string argument that selects one of several recommended preset settings for the bottleneck:
| Bottleneck Preset | Number of Tokens per Step | Dictionary Size | Bits Per Second (bps) |
|---|---|---|---|
| `1x46656_400bps` | 1 | 46656 | 400 |
| `2x15625_700bps` | 2 | 15625 | 700 |
| `4x729_1000bps` | 4 | 729 | 1000 |
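For reference, each preset's bitrate follows from its token count and dictionary size. Assuming a 25 Hz token rate (an inference from the preset bitrates, not stated in this README), the bps column can be reproduced as follows:

```python
import math

def bits_per_second(tokens_per_step, dict_size, token_rate_hz=25.0):
    # Each token carries log2(dict_size) bits; tokens_per_step tokens per frame.
    return tokens_per_step * math.log2(dict_size) * token_rate_hz

print(bits_per_second(1, 46656))  # ~387.7, matching the 400 bps preset
print(bits_per_second(2, 15625))  # ~696.6, ~700 bps
print(bits_per_second(4, 729))    # ~951.0, ~1000 bps
```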
Alternatively, the bottleneck stages can be specified directly; the expected format can be seen in the definition of the StableCodec class in model.py.
Normalization
The model is trained on utterances normalized to -20 LUFS. The encode function applies this normalization by default, but it can be disabled by passing normalize=False to the call.
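If you disable the built-in normalization, you can reproduce it yourself before encoding. Below is a minimal sketch using the third-party pyloudnorm and soundfile packages (an assumption; they are not dependencies of this repository):

```python
import pyloudnorm as pyln
import soundfile as sf

# Load audio and measure integrated loudness (BS.1770).
data, rate = sf.read("audio.wav")
meter = pyln.Meter(rate)
loudness = meter.integrated_loudness(data)

# Normalize to -20 LUFS, matching the codec's training condition.
normalized = pyln.normalize.loudness(data, loudness, -20.0)
sf.write("audio_normalized.wav", normalized, rate)

latents, tokens = model.encode("audio_normalized.wav", normalize=False)
```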
Finetune
To finetune a model given its config and checkpoint, run the train.py script:
```bash
python train.py \
    --project "stable-codec" \
    --name "finetune" \
    --config-file "defaults.ini" \
    --save-dir "<ckpt-save-dir>" \
    --model-config "<path-to-config.json>" \
    --dataset-config "<dataset-config.json>" \
    --val-dataset-config "<dataset-config.json>" \
    --pretrained-ckpt-path "<pretrained-model-ckpt.ckpt>" \
    --ckpt-path "$CKPT_PATH" \
    --num-nodes $SLURM_JOB_NUM_NODES \
    --num-workers 16 --batch-size 10 --precision "16-mixed" \
    --checkpoint-every 10000 \
    --logger "wandb"
```
For dataset configuration, refer to the stable-audio-tools dataset docs; a minimal example is sketched below.
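As an illustration only (the authoritative schema lives in the stable-audio-tools dataset docs), a local-directory dataset config might look like the following; note that CTC training (next section) requires the WebDataset format instead:

```json
{
    "dataset_type": "audio_dir",
    "datasets": [
        {
            "id": "my_audio",
            "path": "/path/to/audio/"
        }
    ],
    "random_crop": true
}
```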
Using CTC loss
To use CTC loss during training, enable it both in the training configuration file and in the training dataset configuration (a consolidated config sketch follows the list below).
- Modifying the training configuration:

  - Enable the CTC projection head and set its hidden dimension:

    ```python
    config["model"]["use_proj_head"] = True
    config["model"]["proj_head_dim"] = 81
    ```

  - Enable CTC in the training part of the config:

    ```python
    config["training"]["use_ctc"] = True
    ```

  - And set its loss config:

    ```python
    config["training"]["loss_configs"]["ctc"] = {
        "blank_idx": 80,
        "decay": 1.0,
        "weights": {"ctc": 1.0},
    }
    ```

  - Optionally, you can enable computation of the Phone Error Rate (PER) during validation:

    ```python
    config["training"]["eval_loss_configs"]["per"] = {}
    ```

- Configuring the dataset (only the WebDataset format is supported for CTC):

  - The dataset configuration needs one additional field set (see the dataset docs for other options):

    ```python
    config["force_align_text"] = True
    ```

  - The JSON metadata file for each sample should contain the force-aligned transcript under a force_aligned_text entry (alongside other metadata), in the format below, where transcript is a list of word-level alignments whose start and end fields give each word's time range in seconds:

    ```json
    {
        "normalized_text": "and i feel",
        "force_aligned_text": {
            "transcript": [
                {"word": "and", "start": 0.2202, "end": 0.3403},
                {"word": "i", "start": 0.4604, "end": 0.4804},
                {"word": "feel", "start": 0.5204, "end": 0.7006}
            ]
        }
    }
    ```
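Putting the pieces above together, the relevant additions to a training config might look like the following fragment. This is a sketch assembled from the settings listed above; the surrounding keys come from your existing stable-audio-tools model/training config:

```json
{
    "model": {
        "use_proj_head": true,
        "proj_head_dim": 81
    },
    "training": {
        "use_ctc": true,
        "loss_configs": {
            "ctc": {"blank_idx": 80, "decay": 1.0, "weights": {"ctc": 1.0}}
        },
        "eval_loss_configs": {
            "per": {}
        }
    }
}
```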