The DualCodec neural audio codec.
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
About
DualCodec is a low-frame-rate (12.5 Hz or 25 Hz), semantically enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation.
You can check out our paper and our demo page. An overview of the DualCodec system is shown in the following figure:
Installation
pip install dualcodec
News
- 2025-01-22: Added training and finetuning instructions for DualCodec (v0.3.0).
- 2025-01-16: Released the DualCodec inference code (v0.1.0). The latest versions are synced to PyPI.
Available models
| Model_ID | Frame Rate | RVQ Quantizers | Semantic Codebook Size (RVQ-1 Size) | Acoustic Codebook Size (RVQ-rest Size) | Training Data |
|---|---|---|---|---|---|
| 12hz_v1 | 12.5Hz | Any from 1-8 (maximum 8) | 16384 | 4096 | 100K hours Emilia |
| 25hz_v1 | 25Hz | Any from 1-12 (maximum 12) | 16384 | 1024 | 100K hours Emilia |
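As a rough back-of-the-envelope illustration (my own arithmetic, not a figure from the paper): each codebook of size K contributes log2(K) bits per frame, so the table above implies the following approximate bitrates when all quantizers are used:

```python
import math

def bitrate_bps(frame_rate_hz, semantic_size, acoustic_size, n_quantizers):
    """Approximate bitrate: RVQ-1 draws from the semantic codebook,
    the remaining quantizers from the acoustic codebook."""
    bits_per_frame = math.log2(semantic_size) + (n_quantizers - 1) * math.log2(acoustic_size)
    return frame_rate_hz * bits_per_frame

# 12hz_v1, all 8 quantizers: 12.5 * (14 + 7 * 12) = 1225.0 bps
print(bitrate_bps(12.5, 16384, 4096, 8))
# 25hz_v1, all 12 quantizers: 25 * (14 + 11 * 10) = 3100.0 bps
print(bitrate_bps(25, 16384, 1024, 12))
```

Using fewer quantizers lowers the bitrate proportionally, since only the selected RVQ levels are transmitted.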
How to run inference with DualCodec
1. Download the checkpoints to a local directory:
# export HF_ENDPOINT=https://hf-mirror.com # uncomment this to use huggingface mirror if you're in China
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
huggingface-cli download amphion/dualcodec dualcodec_12hz_16384_4096.safetensors dualcodec_25hz_16384_1024.safetensors w2vbert2_mean_var_stats_emilia.pt --local-dir dualcodec_ckpts
The second command downloads the two DualCodec model checkpoints (12hz_v1 and 25hz_v1) and the w2v-bert-2 mean and variance statistics to the local directory dualcodec_ckpts.
2. To run inference on an audio file in a Python script:
import dualcodec
w2v_path = "./w2v-bert-2.0" # your downloaded path
dualcodec_model_path = "./dualcodec_ckpts" # your downloaded path
model_id = "12hz_v1" # select from available Model_IDs, "12hz_v1" or "25hz_v1"
dualcodec_model = dualcodec.get_model(model_id, dualcodec_model_path)
inference = dualcodec.Inference(dualcodec_model=dualcodec_model, dualcodec_path=dualcodec_model_path, w2v_path=w2v_path, device="cuda")
# do inference for your wav
import torchaudio
audio, sr = torchaudio.load("YOUR_WAV.wav")
# resample to 24kHz
audio = torchaudio.functional.resample(audio, sr, 24000)
audio = audio.reshape(1, 1, -1)  # (batch, channel, samples); assumes mono input
# extract codes, for example, using 8 quantizers here:
semantic_codes, acoustic_codes = inference.encode(audio, n_quantizers=8)
# semantic_codes shape: torch.Size([1, 1, T])
# acoustic_codes shape: torch.Size([1, n_quantizers-1, T])
# produce output audio
out_audio = dualcodec_model.decode_from_codes(semantic_codes, acoustic_codes)
# save output audio
torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
See "example.ipynb" for a running example.
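The snippet above assumes a mono waveform before the reshape to (1, 1, -1). If your file is stereo, a simple mixdown keeps the shapes consistent (prepare_for_encode is my own convenience helper, not part of the dualcodec API):

```python
import torch

def prepare_for_encode(audio: torch.Tensor) -> torch.Tensor:
    """Mix a (channels, samples) waveform down to mono and reshape it
    to the (batch, channel, samples) layout that encode() expects."""
    if audio.dim() != 2:
        raise ValueError("expected a (channels, samples) tensor")
    mono = audio.mean(dim=0, keepdim=True)  # average channels -> (1, samples)
    return mono.reshape(1, 1, -1)           # -> (1, 1, samples)

# stereo, 1 second at 24 kHz
stereo = torch.randn(2, 24000)
print(prepare_for_encode(stereo).shape)  # torch.Size([1, 1, 24000])
```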
DualCodec-based TTS models
We're releasing DualCodec-based TTS models. Stay tuned!
Finetuning DualCodec
- Install other necessary components for training:
pip install "dualcodec[train]"
- Clone this repository and cd to the project root folder.
- Get the discriminator checkpoints:
huggingface-cli download amphion/dualcodec --local-dir dualcodec_ckpts
- To run an example training on Emilia German data (streamed, so no files need to be downloaded; Hugging Face access is required):
accelerate launch train.py --config-name=dualcodec_ft_12hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
This finetunes a 12hz_v1 model with a training batch size of 3 (you will typically need a larger batch size).
To finetune a 25hz_v1 model:
accelerate launch train.py --config-name=dualcodec_ft_25hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
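The segment_length override in these commands is in samples, so 24000 corresponds to one second per training segment at the 24 kHz rate the inference example resamples to (segment_length_for is a hypothetical helper of mine, assuming that 24 kHz rate also applies in training):

```python
SAMPLE_RATE = 24000  # assumed: DualCodec operates on 24 kHz audio

def segment_length_for(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a desired training-segment duration to the
    data.segment_speech.segment_length value (in samples)."""
    return int(seconds * sample_rate)

print(segment_length_for(1.0))  # 24000, as in the example command
print(segment_length_for(2.5))  # 60000
```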
Training DualCodec from scratch
- Install other necessary components for training:
pip install "dualcodec[train]"
- Clone this repository and cd to the project root folder.
- To run an example training on the example Emilia German data:
accelerate launch train.py --config-name=codec_train \
model=dualcodec_12hz_16384_4096_8vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
This trains a 12hz_v1 model from scratch with a training batch size of 3 (you will typically need a larger batch size).
To train a 25hz_v1 model:
accelerate launch train.py --config-name=codec_train \
model=dualcodec_25hz_16384_1024_12vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
Citation
Download files
Source Distribution
Built Distribution
File details
Details for the file dualcodec-0.3.1.tar.gz.
File metadata
- Download URL: dualcodec-0.3.1.tar.gz
- Upload date:
- Size: 1.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cab482ba8dec941a83804d4f990ef1377e6deffa9bd1a2c209e44131d3d1afcd |
| MD5 | b38f2404784d6946ebc9997e4f45a858 |
| BLAKE2b-256 | 1cf5d079b436ffb3972cad6bb802eff62e2ea6664a6807d526c69f8fb241c06c |
File details
Details for the file dualcodec-0.3.1-py3-none-any.whl.
File metadata
- Download URL: dualcodec-0.3.1-py3-none-any.whl
- Upload date:
- Size: 49.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c270352c5cb3bfc40f73992e2f77f0c2cb93e93ab14b9f26fa5ed6e2fe547235 |
| MD5 | c9e7ff86de28a275bb9df6245b4ff9bb |
| BLAKE2b-256 | e2388cbaa53d52c3530ca091f7cf440ad22b233cfa890a35c59ed7f5e5e8c00d |