Skip to main content

Stable Codec: A series of codec models for speech and audio

Project description

Stable Codec

This repository contains training and inference scripts for models in the Stable Codec series, starting with stable-codec-speech-16k - introduced in the paper titled Scaling Transformers for Low-bitrate High-Quality Speech Coding.

Paper: https://arxiv.org/abs/2411.19842

Sound demos: https://stability-ai.github.io/stable-codec-demo/

Model weights: https://huggingface.co/stabilityai/stable-codec-speech-16k

Note that whilst this code is MIT licensed, the model weights are covered by the Stability AI Community License

Variants

The model is currently available in two variants:

  • stable-codec-speech-16k-base is the weights corresponding to the results in our publication, provided for reproducibility.
  • stable-codec-speech-16k is an improved finetune, with boosted latent semantics. It should be used in 99% of use-cases.

Additional Training

In addition to the training described in the paper, the weights for stable-codec-speech-16k have undergone 500k steps of finetuning with force-aligned data from LibriLight and the English portion Multilingual LibriSpeech. This was performed by using a CTC head to regress the force-aligned phoneme tags from pre-bottleneck latents. We found that this additional training significantly boosted the applicability of the codec tokens to downstream tasks like TTS, at a small cost to objective reconstruction metrics.

Install

The model itself is defined in stable-audio-tools package.

To install stable-codec:

pip install stable-codec
pip install -U flash-attn --no-build-isolation

IMPORTANT NOTE: This model currently has a hard requirement for FlashAttention due to its use of sliding window attention. Inference without FlashAttention will likely be greatly degraded. This also means that the model currently does not support CPU inference. We will relax the dependency on FlashAttention in the future.

Encoding and decoding

To encode audio or decode tokens, the StableCodec class provides a convenient wrapper for the model. It can be used with a local checkpoint and config as follows:

import torch
import torchaudio
from stable_codec import StableCodec

model = StableCodec(
    model_config_path="<path-to-model-config>",
    ckpt_path="<path-to-checkpoint>", # optional, can be `None`,
    device = torch.device("cuda")
)

audiopath = "audio.wav"

latents, tokens = model.encode(audiopath)
decoded_audio = model.decode(tokens)

torchaudio.save("decoded.wav", decoded_audio, model.sample_rate)

To download the model weights automatically from HuggingFace, simply provide the model name:

model = StableCodec(
    pretrained_model = 'stabilityai/stable-codec-speech-16k'
)

Posthoc bottleneck configuration

Most usecases will benefit from replacing the training-time FSQ bottleneck with a post-hoc FSQ bottleneck, as described in the paper. This allows token dictionary size to be reduced to a reasonable level for modern language models. This is achieved by calling the set_posthoc_bottleneck function, and setting a flag to the encode/decode calls:

model.set_posthoc_bottleneck("2x15625_700bps")
latents, tokens = model.encode(audiopath, posthoc_bottleneck = True)
decoded_audio = model.decode(tokens, posthoc_bottleneck = True)

set_posthoc_bottleneck can take a string as argument, which allows selection a number of recommended preset settings for the bottleneck:

Bottleneck Preset Number of Tokens per step Dictionary Size Bits Per Second (bps)
1x46656_400bps 1 46656 400
2x15625_700bps 2 15625 700
4x729_1000bps 4 729 1000

Alternatively, the bottleneck stages can be specified directly. The format for specifying this can be seen in the definition of the StableCodec class in model.py.

Normalization

The model is trained with utterances normalized to -20 +-5 LUFS. The encode function normalizes to -20 LUFS by default, but it can be disabled by setting normalize = False when calling the function.

Finetune

To finetune a model given its config and checkpoint, execute train.py file:

python train.py \
    --project "stable-codec" \
    --name "finetune" \
    --config-file "defaults.ini" \
    --save-dir "<ckpt-save-dir>" \
    --model-config "<path-to-config.json>" \
    --dataset-config "<dataset-config.json>" \
    --val-dataset-config "<dataset-config.json>" \
    --pretrained-ckpt-path "<pretrained-model-ckpt.ckpt>" \
    --ckpt-path "$CKPT_PATH" \
    --num-nodes $SLURM_JOB_NUM_NODES \
    --num-workers 16 --batch-size 10 --precision "16-mixed" \
    --checkpoint-every 10000 \
    --logger "wandb"

For dataset configuration, refer to stable-audio-tools dataset docs.

Using CTC loss

To use CTC loss during training you have to enable it in the training configuration file and in the training dataset configuration.

  1. Modifying training configuration:

    • Enable CTC projection head and set its hidden dimension:
      config["model"]["use_proj_head"] = True
      config["model"]["proj_head_dim"] = 81
      
    • Enable CTC in the training part of the config:
      config["training"]["use_ctc"] = True
      
    • And set its loss config:
      config["training"]["loss_configs"]["ctc"] = {
        "blank_idx": 80,
        "decay": 1.0,
        "weights": {"ctc": 1.0}
      }
      
    • Optionally, you can enable computation of the Phone-Error-Rate (PER) during validation:
      config["training"]["eval_loss_configs"]["per"] = {}
      
  2. Configuring dataset (only WebDataset format is supported for CTC):

    • The dataset configuration should have one additional field set to it (see dataset docs for other options):
      config["force_align_text"] = True
      
    • And the JSON metadata file for each sample should contain force aligned transcript under force_aligned_text entry in the format specified below (besides other metadata). Where transcript is a list of word-level alignments with start and end fields specifying range in seconds of each word.
      "normalized_text":"and i feel"
      "force_aligned_text":{
       "transcript":[
          {
             "word":"and",
             "start":0.2202,
             "end":0.3403
          },
          {
             "word":"i",
             "start":0.4604,
             "end":0.4804
          },
          {
             "word":"feel",
             "start":0.5204,
             "end":0.7006
          }
        ]
      }
      

Objective Metrics

Model SI-SDR Mel Dis STFT Dis PESQ STOI
stable-codec-speech-16k-base 4.73 0.86 1.26 3.09 0.92
stable-codec-speech-16k 3.58 0.90 1.30 3.01 0.90

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

stable_codec-0.1.2.tar.gz (20.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

stable_codec-0.1.2-py3-none-any.whl (19.5 kB view details)

Uploaded Python 3

File details

Details for the file stable_codec-0.1.2.tar.gz.

File metadata

  • Download URL: stable_codec-0.1.2.tar.gz
  • Upload date:
  • Size: 20.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.16

File hashes

Hashes for stable_codec-0.1.2.tar.gz
Algorithm Hash digest
SHA256 6b9ab03c763dd78db831a391882a0739f2af7c86e502fc34ed0bf8d697e64190
MD5 b3f03e6e4686ed3f816322cfcbc33f98
BLAKE2b-256 4e73b9fa2ce561dddb829dec3960aab3ac30ce47a0cd6cc51974c6da5d771f26

See more details on using hashes here.

File details

Details for the file stable_codec-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: stable_codec-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 19.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.16

File hashes

Hashes for stable_codec-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 255239a2a7d5f96b081f9ca617aaefc1bd3d195632fb5f1611e8e76e6385bbb4
MD5 7628c34d0cbea50acaff2662a9373481
BLAKE2b-256 32bb75d0b2e7484177869f96e0a7a8eb9e5c5f717a069fd687b78091d54da9eb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page