The DualCodec neural audio codec.
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
About
Installation
pip install dualcodec
News
- 2025-01-17: DualCodec inference code is released!
Available models
| Model_ID | Frame Rate | RVQ Quantizers | Semantic Codebook Size (RVQ-1 Size) | Acoustic Codebook Size (RVQ-rest Size) | Training Data |
|---|---|---|---|---|---|
| 12hz_v1 | 12.5 Hz | Any number from 1 to 8 | 16384 | 4096 | 100K hours of Emilia |
| 25hz_v1 | 25 Hz | Any number from 1 to 12 | 16384 | 1024 | 100K hours of Emilia |
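As a sanity check on the table above, the effective bitrate can be estimated from the frame rate and codebook sizes. This is a back-of-the-envelope sketch assuming each frame costs log2(codebook size) bits per quantizer; the actual serialized bitrate may differ:

```python
import math

def estimated_bitrate(frame_rate, n_quantizers, semantic_size, acoustic_size):
    """Rough bits-per-second estimate: one semantic code (RVQ-1) plus
    (n_quantizers - 1) acoustic codes per frame."""
    bits_per_frame = math.log2(semantic_size) + (n_quantizers - 1) * math.log2(acoustic_size)
    return frame_rate * bits_per_frame

# 12hz_v1 with all 8 quantizers: 12.5 * (14 + 7 * 12) = 1225 bps
print(estimated_bitrate(12.5, 8, 16384, 4096))
# 25hz_v1 with all 12 quantizers: 25 * (14 + 11 * 10) = 3100 bps
print(estimated_bitrate(25, 12, 16384, 1024))
```

Using fewer quantizers lowers the bitrate proportionally, since only the first (semantic) codebook is mandatory.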
How to run inference with DualCodec
1. Download checkpoints to local:
# export HF_ENDPOINT=https://hf-mirror.com # uncomment this to use huggingface mirror if you're in China
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
huggingface-cli download amphion/dualcodec dualcodec_12hz_16384_4096.safetensors dualcodec_25hz_16384_1024.safetensors w2vbert2_mean_var_stats_emilia.pt --local-dir dualcodec_ckpts
The second command downloads the two DualCodec model checkpoints (12hz_v1 and 25hz_v1) and the w2v-bert-2 mean and variance statistics file to the local directory dualcodec_ckpts.
2. To run inference on an audio file in a Python script:
import dualcodec
w2v_path = "./w2v-bert-2.0" # your downloaded path
dualcodec_model_path = "./dualcodec_ckpts" # your downloaded path
model_id = "12hz_v1" # select from available Model_IDs, "12hz_v1" or "25hz_v1"
dualcodec_model = dualcodec.get_model(model_id, dualcodec_model_path)
inference = dualcodec.Inference(dualcodec_model=dualcodec_model, dualcodec_path=dualcodec_model_path, w2v_path=w2v_path, device="cuda")
# do inference for your wav
import torchaudio
audio, sr = torchaudio.load("YOUR_WAV.wav")
# resample to 24kHz
audio = torchaudio.functional.resample(audio, sr, 24000)
audio = audio.reshape(1,1,-1)
# extract codes, for example, using 8 quantizers here:
semantic_codes, acoustic_codes = inference.encode(audio, n_quantizers=8)
# semantic_codes shape: torch.Size([1, 1, T])
# acoustic_codes shape: torch.Size([1, n_quantizers-1, T])
# produce output audio
out_audio = dualcodec_model.decode_from_codes(semantic_codes, acoustic_codes)
# save output audio
torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
See "example.ipynb" for a running example.
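The low frame rate matters for downstream TTS sequence lengths. As a rough guide, here is a small helper (my own sketch, not part of the dualcodec API) for estimating how many code frames and flattened tokens to expect for a clip:

```python
def expected_codes(num_samples, n_quantizers, sample_rate=24000, frame_rate=12.5):
    """Approximate code-frame count and flattened token count for a clip.
    Exact counts depend on the model's internal padding, so treat this
    as an estimate rather than a guarantee."""
    frames = int(num_samples / sample_rate * frame_rate)
    return frames, frames * n_quantizers

# 10 seconds of 24 kHz audio through 12hz_v1 with 8 quantizers:
frames, tokens = expected_codes(10 * 24000, n_quantizers=8)
print(frames, tokens)  # 125 frames, 1000 flattened tokens
```

At 25 Hz (25hz_v1) the same clip would produce twice as many frames, which is why the 12.5 Hz variant is attractive for autoregressive TTS.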
DualCodec-based TTS models
DualCodec-based TTS
Benchmark results
DualCodec audio quality
DualCodec-based TTS
Finetuning DualCodec
- Install other necessary components for training:
pip install "dualcodec[train]"
- Clone this repository and `cd` to the project root folder.
- Get the discriminator checkpoints:
huggingface-cli download amphion/dualcodec --local-dir dualcodec_ckpts
- To run example training on Emilia German data (streamed, so there is no need to download files; requires access to Hugging Face):
accelerate launch train.py --config-name=dualcodec_ft_12hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
This finetunes a 12hz_v1 model with a training batch size of 3 (you typically need a larger batch size).
To finetune a 25hz_v1 model:
accelerate launch train.py --config-name=dualcodec_ft_25hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
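For reference, `data.segment_speech.segment_length=24000` appears to be a sample count: at the model's 24 kHz sample rate it corresponds to a 1-second training segment (this interpretation is an assumption based on the inference sample rate, not something confirmed by the config):

```python
def segment_seconds(segment_length, sample_rate=24000):
    """Training segment duration in seconds, assuming segment_length is in samples."""
    return segment_length / sample_rate

print(segment_seconds(24000))  # 1.0 second per training segment
print(segment_seconds(48000))  # 2.0 seconds
```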
Training DualCodec from scratch
- Install other necessary components for training:
pip install "dualcodec[train]"
- Clone this repository and `cd` to the project root folder.
- To run example training on the example Emilia German data:
accelerate launch train.py --config-name=codec_train \
model=dualcodec_12hz_16384_4096_8vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
This trains a dualcodec_12hz_16384_4096_8vq model from scratch with a training batch size of 3 (you typically need a larger batch size).
To train a 25Hz model:
accelerate launch train.py --config-name=codec_train \
model=dualcodec_25hz_16384_1024_12vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
Citation