Skip to main content

The DualCodec neural audio codec.

Project description

DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation

About

Installation

pip install dualcodec

News

  • 2025-01-17: DualCodec inference code is released!

Available models

Model_ID Frame Rate RVQ Quantizers Semantic Codebook Size (RVQ-1 Size) Acoustic Codebook Size (RVQ-rest Size) Training Data
12hz_v1 12.5Hz Any from 1-8 (maximum 8) 16384 4096 100K hours Emilia
25hz_v1 25Hz Any from 1-12 (maximum 12) 16384 1024 100K hours Emilia

How to inference DualCodec

1. Download checkpoints to local:

# export HF_ENDPOINT=https://hf-mirror.com      # uncomment this to use huggingface mirror if you're in China
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
huggingface-cli download amphion/dualcodec dualcodec_12hz_16384_4096.safetensors dualcodec_25hz_16384_1024.safetensors w2vbert2_mean_var_stats_emilia.pt --local-dir dualcodec_ckpts

The second command downloads the two DualCodec model (12hz_v1 and 25hz_v1) checkpoints and a w2v-bert-2 mean and variance statistics to the local directory dualcodec_ckpts.

2. To inference an audio in a python script:

import dualcodec

w2v_path = "./w2v-bert-2.0" # your downloaded path
dualcodec_model_path = "./dualcodec_ckpts" # your downloaded path
model_id = "12hz_v1" # select from available Model_IDs, "12hz_v1" or "25hz_v1"

dualcodec_model = dualcodec.get_model(model_id, dualcodec_model_path)
inference = dualcodec.Inference(dualcodec_model=dualcodec_model, dualcodec_path=dualcodec_model_path, w2v_path=w2v_path, device="cuda")

# do inference for your wav
import torchaudio
audio, sr = torchaudio.load("YOUR_WAV.wav")
# resample to 24kHz
audio = torchaudio.functional.resample(audio, sr, 24000)
audio = audio.reshape(1,1,-1)
# extract codes, for example, using 8 quantizers here:
semantic_codes, acoustic_codes = inference.encode(audio, n_quantizers=8)
# semantic_codes shape: torch.Size([1, 1, T])
# acoustic_codes shape: torch.Size([1, n_quantizers-1, T])

# produce output audio
out_audio = dualcodec_model.decode_from_codes(semantic_codes, acoustic_codes)

# save output audio
torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)

See "example.ipynb" for a running example.

DualCodec-based TTS models

DualCodec-based TTS

Benchmark results

DualCodec audio quality

DualCodec-based TTS

Finetuning DualCodec

  1. Install other necessary components for training:
pip install "dualcodec[train]"
  1. Clone this repository and cd to project root folder.

  2. Get discriminator checkpoints:

huggingface-cli download amphion/dualcodec --local-dir dualcodec_ckpts
  1. To run example training on Emilia German data (streaming, no need to download files. Need to access Huggingface):
accelerate launch train.py --config-name=dualcodec_ft_12hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000

This trains from scratch a 12hz_v1 model with a training batch size of 3. (typically you need larger batch sizes)

To finetune a 25Hz_V1 model:

accelerate launch train.py --config-name=dualcodec_ft_25hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000

Training DualCodec from scratch

  1. Install other necessary components for training:
pip install dualcodec[train]
  1. Clone this repository and cd to project root folder.

  2. To run example training on example Emilia German data:

accelerate launch train.py --config-name=codec_train \
model=dualcodec_12hz_16384_4096_8vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000

This trains from scratch a dualcodec_12hz_16384_4096_8vq model with a training batch size of 3. (typically you need larger batch sizes)

To train a 25Hz model:

accelerate launch train.py --config-name=codec_train \
model=dualcodec_25hz_16384_1024_12vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000

Citation

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dualcodec-0.3.0.tar.gz (1.3 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dualcodec-0.3.0-py3-none-any.whl (49.2 kB view details)

Uploaded Python 3

File details

Details for the file dualcodec-0.3.0.tar.gz.

File metadata

  • Download URL: dualcodec-0.3.0.tar.gz
  • Upload date:
  • Size: 1.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for dualcodec-0.3.0.tar.gz
Algorithm Hash digest
SHA256 b9e91cb7a2ff6ee2a223839a1a1018a9d121bf18f06a6bd2bfc12069bfbdd080
MD5 15f25df027079c4a1d5af1d11f024abe
BLAKE2b-256 2e778defd60b7cd01cccdff15e4f7b1310238338c8856002c1c4da5b68a7d64f

See more details on using hashes here.

File details

Details for the file dualcodec-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: dualcodec-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 49.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for dualcodec-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1e7987071bd3464ed99621a9811ab7eb4cc63a72370dec3c25851ff8c6f70999
MD5 b7542633add1406e6b8506c3bac07a64
BLAKE2b-256 854b83da28d761102df01854fb74c1ddee113b642e6456e5b96f6601eb4cd8d4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page