The DualCodec neural audio codec.
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
About
DualCodec is a low-frame-rate (12.5 Hz or 25 Hz), semantically enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation.
You can check out our paper and our demo page. An overview of the DualCodec system is shown in the following figure:
Installation
pip install dualcodec
News
- 2025-01-22: Added training and finetuning instructions for DualCodec, as well as a Gradio interface. The version is v0.3.0.
- 2025-01-16: Finished releasing the DualCodec inference code; the version is v0.1.0. The latest versions are synced to PyPI.
Available models
| Model_ID | Frame Rate | RVQ Quantizers | Semantic Codebook Size (RVQ-1 Size) | Acoustic Codebook Size (RVQ-rest Size) | Training Data |
|---|---|---|---|---|---|
| 12hz_v1 | 12.5 Hz | 1–8 (maximum 8) | 16384 | 4096 | 100K hours of Emilia |
| 25hz_v1 | 25 Hz | 1–12 (maximum 12) | 16384 | 1024 | 100K hours of Emilia |
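As a rough guide, the table above implies the following bitrates. This is a back-of-the-envelope sketch under the assumption that each code costs log2(codebook size) bits per frame; these numbers are derived here, not quoted from the paper:

```python
import math

# Rough bitrate sketch (assumption: each code costs log2(codebook size) bits per frame).
def bitrate_bps(frame_rate, semantic_size, acoustic_size, n_quantizers):
    # RVQ-1 uses the semantic codebook; the remaining quantizers use the acoustic codebook.
    bits_per_frame = math.log2(semantic_size) + (n_quantizers - 1) * math.log2(acoustic_size)
    return frame_rate * bits_per_frame

print(bitrate_bps(12.5, 16384, 4096, 8))  # 12hz_v1 with all 8 quantizers -> 1225.0 bps
print(bitrate_bps(25, 16384, 1024, 12))   # 25hz_v1 with all 12 quantizers -> 3100.0 bps
```

Using fewer quantizers lowers the bitrate proportionally, at some cost in reconstruction quality.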
How to run DualCodec inference
1. Download checkpoints to local:
# export HF_ENDPOINT=https://hf-mirror.com # uncomment this to use huggingface mirror if you're in China
huggingface-cli download facebook/w2v-bert-2.0 --local-dir w2v-bert-2.0
huggingface-cli download amphion/dualcodec dualcodec_12hz_16384_4096.safetensors dualcodec_25hz_16384_1024.safetensors w2vbert2_mean_var_stats_emilia.pt --local-dir dualcodec_ckpts
The second command downloads the two DualCodec model checkpoints (12hz_v1 and 25hz_v1) and the w2v-bert-2 mean and variance statistics to the local directory dualcodec_ckpts.
2. Programmatic usage:
import torchaudio
import dualcodec

w2v_path = "./w2v-bert-2.0"  # path to the downloaded w2v-bert-2.0 checkpoint
dualcodec_model_path = "./dualcodec_ckpts"  # path to the downloaded DualCodec checkpoints
model_id = "12hz_v1"  # choose from the available Model_IDs: "12hz_v1" or "25hz_v1"

dualcodec_model = dualcodec.get_model(model_id, dualcodec_model_path)
inference = dualcodec.Inference(
    dualcodec_model=dualcodec_model,
    dualcodec_path=dualcodec_model_path,
    w2v_path=w2v_path,
    device="cuda",
)

# Run inference on your wav file
audio, sr = torchaudio.load("YOUR_WAV.wav")
# Resample to 24 kHz, the rate DualCodec operates on
audio = torchaudio.functional.resample(audio, sr, 24000)
audio = audio.reshape(1, 1, -1)

# Extract codes, for example using 8 quantizers here:
semantic_codes, acoustic_codes = inference.encode(audio, n_quantizers=8)
# semantic_codes shape: torch.Size([1, 1, T])
# acoustic_codes shape: torch.Size([1, n_quantizers - 1, T])

# Decode the codes back to audio
out_audio = dualcodec_model.decode_from_codes(semantic_codes, acoustic_codes)

# Save the output audio
torchaudio.save("out.wav", out_audio.cpu().squeeze(0), 24000)
See "example.ipynb" for a running example.
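To sanity-check the shapes returned above, the code length T scales with the clip duration and the model's frame rate. The helper below is a sketch under the assumption that T ≈ duration × frame rate; the exact value may differ by a frame or two depending on padding:

```python
# Sketch: approximate number of code frames (T in the shapes above) for a clip.
# Assumption: T is roughly duration_seconds * frame_rate; padding may shift it slightly.
def approx_num_frames(num_samples: int, sample_rate: int = 24000, frame_rate: float = 12.5) -> int:
    return int(num_samples / sample_rate * frame_rate)

# 10 seconds of 24 kHz audio with the 12.5 Hz model:
print(approx_num_frames(240000))                 # -> 125
# The same clip with the 25 Hz model yields twice as many frames:
print(approx_num_frames(240000, frame_rate=25))  # -> 250
```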
3. Gradio interface:
If you want to use the Gradio interface, you can run the following command:
python -m dualcodec.app
This will launch an app that allows you to upload a wav file and get the output wav file.
DualCodec-based TTS models
We're releasing DualCodec-based TTS models. Stay tuned!
Finetuning DualCodec
- Install other necessary components for training:
pip install "dualcodec[train]"
- Clone this repository and `cd` to the project root folder (the folder that contains this README).
- Get the discriminator checkpoints:
huggingface-cli download amphion/dualcodec --local-dir dualcodec_ckpts
- To run example training on Emilia German data (streamed, so there is no need to download files; Huggingface access is required):
accelerate launch train.py --config-name=dualcodec_ft_12hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
This finetunes a 12hz_v1 model with a training batch size of 3 (typically you need a larger batch size, such as 10).
To finetune a 25hz_v1 model:
accelerate launch train.py --config-name=dualcodec_ft_25hzv1 \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
Training DualCodec from scratch
- Install other necessary components for training:
pip install "dualcodec[train]"
- Clone this repository and `cd` to the project root folder (the folder that contains this README).
- To run example training on the example Emilia German data:
accelerate launch train.py --config-name=dualcodec_train \
model=dualcodec_12hz_16384_4096_8vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
This trains a 12hz_v1 model from scratch with a training batch size of 3 (typically you need a larger batch size, such as 10).
To train a 25hz_v1 model:
accelerate launch train.py --config-name=dualcodec_train \
model=dualcodec_25hz_16384_1024_12vq \
trainer.batch_size=3 \
data.segment_speech.segment_length=24000
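The `data.segment_speech.segment_length=24000` override in the commands above is easiest to read in seconds. The sketch below assumes segment_length counts audio samples at DualCodec's 24 kHz rate:

```python
# Sketch: what data.segment_speech.segment_length=24000 means in practice.
# Assumption: segment_length is a sample count at DualCodec's 24 kHz rate.
segment_length = 24000  # samples, as in the training commands above
sample_rate = 24000     # DualCodec operates on 24 kHz audio
duration_s = segment_length / sample_rate
print(duration_s)              # 1.0 second of audio per training segment
print(int(duration_s * 12.5))  # ~12 semantic frames per segment for the 12.5 Hz model
```

Longer segments give each training step more context at the cost of memory, so segment_length and trainer.batch_size are usually tuned together.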
Citation
File details
Details for the file dualcodec-0.3.2.tar.gz.
File metadata
- Download URL: dualcodec-0.3.2.tar.gz
- Upload date:
- Size: 1.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `409d28a0dee44a847b4dc0b96b0f2f749fee3c7b1282247e78b72500ef22a626` |
| MD5 | `d20ddf6dda0fbe4e97da3fcf4589f8f6` |
| BLAKE2b-256 | `a1f90313997fa05a8b4e8791d9621d2f3ff1a493aee52e1b6dc1c07e8aabaf24` |
File details
Details for the file dualcodec-0.3.2-py3-none-any.whl.
File metadata
- Download URL: dualcodec-0.3.2-py3-none-any.whl
- Upload date:
- Size: 51.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0e2d31c328e450fd7f464d4660566b260b82fe974975fe8bf28dcef466e05c3f` |
| MD5 | `630c35e307d9b66fa595428a35a19d52` |
| BLAKE2b-256 | `9b6f1dd2ad718c5070f8ee34122599e9571b3a9e16e38edc8c92b74101082a1f` |