The DualCodec neural audio codec.
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
About
DualCodec is a low-frame-rate (12.5 Hz or 25 Hz), semantically enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation.
You can check out our paper and our demo page. An overview of the DualCodec system is shown in the following figure:
Installation
pip install dualcodec
News
- 2025-01-22: Added training and finetuning instructions for DualCodec, as well as a Gradio interface. The version is v0.3.0.
- 2025-01-16: Finished releasing the DualCodec inference code; the version is v0.1.0. The latest versions are synced to PyPI.
Available models
| Model_ID | Frame Rate | RVQ Quantizers | Semantic Codebook Size (RVQ-1 Size) | Acoustic Codebook Size (RVQ-rest Size) | Training Data |
|---|---|---|---|---|---|
| 12hz_v1 | 12.5 Hz | 1–8 (maximum 8) | 16384 | 4096 | 100K hours of Emilia |
| 25hz_v1 | 25 Hz | 1–12 (maximum 12) | 16384 | 1024 | 100K hours of Emilia |
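As a rough guide, the table above implies the following bitrates. This is a back-of-the-envelope sketch under the assumption that each code costs log2(codebook size) bits per frame; these numbers are derived here, not quoted from the paper:

```python
import math

# Rough bitrate sketch (assumption: each code costs log2(codebook size) bits per frame).
def bitrate_bps(frame_rate, semantic_size, acoustic_size, n_quantizers):
    # RVQ-1 uses the semantic codebook; the remaining quantizers use the acoustic codebook.
    bits_per_frame = math.log2(semantic_size) + (n_quantizers - 1) * math.log2(acoustic_size)
    return frame_rate * bits_per_frame

print(bitrate_bps(12.5, 16384, 4096, 8))  # 12hz_v1 with all 8 quantizers -> 1225.0 bps
print(bitrate_bps(25, 16384, 1024, 12))   # 25hz_v1 with all 12 quantizers -> 3100.0 bps
```

Using fewer quantizers lowers the bitrate proportionally, at some cost in reconstruction quality.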
How to run DualCodec inference
1. Download checkpoints to local:
# export HF_ENDPOINT=https://hf-mirror.com # uncomment this to use huggingface mirror if you're in China
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
huggingface-cli download amphion/dualcodec dualcodec_12hz_16384_4096.safetensors dualcodec_25hz_16384_1024.safetensors w2vbert2_mean_var_stats_emilia.pt --local-dir dualcodec_ckpts
The second command downloads the two DualCodec model checkpoints (12hz_v1 and 25hz_v1) and the w2v-bert-2 mean and variance statistics to the local directory dualcodec_ckpts.
2. Programmatic usage:
import torchaudio
import dualcodec

w2v_path = "./w2v-bert-2.0"  # path to the downloaded w2v-bert-2.0 checkpoint
dualcodec_model_path = "./dualcodec_ckpts"  # path to the downloaded DualCodec checkpoints
model_id = "12hz_v1"  # choose from the available Model_IDs: "12hz_v1" or "25hz_v1"

dualcodec_model = dualcodec.get_model(model_id, dualcodec_model_path)
inference = dualcodec.Inference(
    dualcodec_model=dualcodec_model,
    dualcodec_path=dualcodec_model_path,
    w2v_path=w2v_path,
    device="cuda",
)

# Run inference on your wav file
audio, sr = torchaudio.load("YOUR_WAV.wav")
# Resample to 24 kHz, the rate DualCodec operates on
audio = torchaudio.functional.resample(audio, sr, 24000)
audio = audio.reshape(1, 1, -1)

# Extract codes, for example using 8 quantizers here:
semantic_codes, acoustic_codes = inference.encode(audio, n_quantizers=8)
# semantic_codes shape: torch.Size([1, 1, T])
# acoustic_codes shape: torch.Size([1, n_quantizers - 1, T])

# Decode the codes back to audio
out_audio = dualcodec_model.decode_from_codes(semantic_codes, acoustic_codes)

# Save the output audio
torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
See "example.ipynb" for a running example.
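To sanity-check the shapes returned above, the code length T scales with the clip duration and the model's frame rate. The helper below is a sketch under the assumption that T ≈ duration × frame rate; the exact value may differ by a frame or two depending on padding:

```python
# Sketch: approximate number of code frames (T in the shapes above) for a clip.
# Assumption: T is roughly duration_seconds * frame_rate; padding may shift it slightly.
def approx_num_frames(num_samples: int, sample_rate: int = 24000, frame_rate: float = 12.5) -> int:
    return int(num_samples / sample_rate * frame_rate)

# 10 seconds of 24 kHz audio with the 12.5 Hz model:
print(approx_num_frames(240000))                 # -> 125
# The same clip with the 25 Hz model yields twice as many frames:
print(approx_num_frames(240000, frame_rate=25))  # -> 250
```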
3. Gradio interface:
If you want to use the Gradio interface, you can run the following command:
python -m dualcodec.app
This will launch an app that allows you to upload a wav file and get the output wav file.
DualCodec-based TTS models
We're releasing DualCodec-based TTS models. Stay tuned!
Finetuning DualCodec
- Install other necessary components for training:
pip install "dualcodec[train]"
- Clone this repository and `cd` to the project root folder (the folder that contains this README).
- Get the discriminator checkpoints:
huggingface-cli download amphion/dualcodec --local-dir dualcodec_ckpts
- To run example training on Emilia German data (streamed, so there is no need to download files; Huggingface access is required):
accelerate launch train.py --config-name=dualcodec_ft_12hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
This finetunes a 12hz_v1 model with a training batch size of 3 (typically you need a larger batch size, such as 10).
To finetune a 25hz_v1 model:
accelerate launch train.py --config-name=dualcodec_ft_25hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
Training DualCodec from scratch
- Install other necessary components for training:
pip install "dualcodec[train]"
- Clone this repository and `cd` to the project root folder (the folder that contains this README).
- To run example training on the example Emilia German data:
accelerate launch train.py --config-name=dualcodec_train \
model=dualcodec_12hz_16384_4096_8vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
This trains a 12hz_v1 model from scratch with a training batch size of 3 (typically you need a larger batch size, such as 10).
To train a 25hz_v1 model:
accelerate launch train.py --config-name=dualcodec_train \
model=dualcodec_25hz_16384_1024_12vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
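The `data.segment_speech.segment_length=24000` override in the commands above is easiest to read in seconds. The sketch below assumes segment_length counts audio samples at DualCodec's 24 kHz rate:

```python
# Sketch: what data.segment_speech.segment_length=24000 means in practice.
# Assumption: segment_length is a sample count at DualCodec's 24 kHz rate.
segment_length = 24000  # samples, as in the training commands above
sample_rate = 24000     # DualCodec operates on 24 kHz audio
duration_s = segment_length / sample_rate
print(duration_s)              # 1.0 second of audio per training segment
print(int(duration_s * 12.5))  # ~12 semantic frames per segment for the 12.5 Hz model
```

Longer segments give each training step more context at the cost of memory, so segment_length and trainer.batch_size are usually tuned together.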
Citation
File details
Details for the file dualcodec-0.3.2.tar.gz.
File metadata
- Download URL: dualcodec-0.3.2.tar.gz
- Upload date:
- Size: 1.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `409d28a0dee44a847b4dc0b96b0f2f749fee3c7b1282247e78b72500ef22a626` |
| MD5 | `d20ddf6dda0fbe4e97da3fcf4589f8f6` |
| BLAKE2b-256 | `a1f90313997fa05a8b4e8791d9621d2f3ff1a493aee52e1b6dc1c07e8aabaf24` |
File details
Details for the file dualcodec-0.3.2-py3-none-any.whl.
File metadata
- Download URL: dualcodec-0.3.2-py3-none-any.whl
- Upload date:
- Size: 51.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0e2d31c328e450fd7f464d4660566b260b82fe974975fe8bf28dcef466e05c3f` |
| MD5 | `630c35e307d9b66fa595428a35a19d52` |
| BLAKE2b-256 | `9b6f1dd2ad718c5070f8ee34122599e9571b3a9e16e38edc8c92b74101082a1f` |