Fourier-based neural vocoder for high-quality audio synthesis
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
Audio samples | Paper [abs] [pdf]
Installation
To use Vocos only in inference mode, install it using:
pip install vocos
If you wish to train the model, install it with additional dependencies:
pip install vocos[train]
Usage
Reconstruct audio from mel-spectrogram
import torch
from vocos import Vocos
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
mel = torch.randn(1, 100, 256) # B, C, T
with torch.no_grad():
    audio = vocos.decode(mel)
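The time axis `T` of the mel-spectrogram determines how much audio is produced. As a rough sketch of that arithmetic: the 24 kHz sample rate comes from the model name, while the hop length of 256 samples per frame is an assumption that this README does not state, so check it against the model config before relying on it. The helper name `approx_duration_seconds` is hypothetical and not part of Vocos.

```python
SAMPLE_RATE = 24_000  # from the model name "vocos-mel-24khz"
HOP_LENGTH = 256      # ASSUMPTION: samples per mel frame, not stated in this README


def approx_duration_seconds(num_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Approximate audio duration for a mel-spectrogram with `num_frames` frames.

    Ignores edge effects from STFT padding, so the true length may differ
    by up to about one frame.
    """
    return num_frames * hop_length / sample_rate
```

Under these assumptions, the 256-frame random input above would correspond to a little under three seconds of audio.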
Copy-synthesis from a file:
import torchaudio
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
# the pretrained model operates on 24 kHz audio
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
with torch.no_grad():
    y_hat = vocos(y)
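The mono mixdown in the snippet above averages the channels sample by sample. A minimal plain-Python sketch of the same arithmetic, for illustration only; `mix_to_mono` is a hypothetical helper and not part of Vocos or torchaudio:

```python
def mix_to_mono(channels):
    """Average per-channel sample lists into a single mono sample list.

    `channels` is a list of equally long lists, one per audio channel,
    mirroring a (channels, samples) tensor.
    """
    num_channels = len(channels)
    return [sum(frame) / num_channels for frame in zip(*channels)]
```

A single-channel input passes through unchanged, which is why the snippet above only mixes when there is more than one channel.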
Reconstruct audio from EnCodec
Additionally, you need to provide a bandwidth_id, the index into the list of supported bandwidths [1.5, 3.0, 6.0, 12.0] (in kbps), which selects the corresponding lookup embedding.
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
quantized_features = torch.randn(1, 128, 256)
bandwidth_id = torch.tensor([3]) # 12 kbps
with torch.no_grad():
    audio = vocos.decode(quantized_features, bandwidth_id=bandwidth_id)
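Because bandwidth_id is a position in the bandwidth list rather than a bitrate, it is easy to pass the wrong value. A small sketch of a lookup that converts a bitrate in kbps to the index; the helper name `bandwidth_to_id` is hypothetical and not part of the Vocos API:

```python
SUPPORTED_BANDWIDTHS_KBPS = [1.5, 3.0, 6.0, 12.0]  # list given in this README


def bandwidth_to_id(kbps):
    """Return the index of `kbps` in the supported bandwidth list."""
    try:
        return SUPPORTED_BANDWIDTHS_KBPS.index(kbps)
    except ValueError:
        raise ValueError(
            f"Unsupported bandwidth {kbps} kbps; "
            f"choose one of {SUPPORTED_BANDWIDTHS_KBPS}"
        ) from None
```

The result can then be wrapped as in the example above, e.g. `torch.tensor([bandwidth_to_id(12.0)])`.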
Copy-synthesis from a file. This extracts and quantizes features with EnCodec, then reconstructs the audio with Vocos in a single forward pass:
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
with torch.no_grad():
    y_hat = vocos(y, bandwidth_id=bandwidth_id)
Pre-trained models
The provided models were trained up to 2.5 million generator iterations, which resulted in slightly better objective scores compared to those reported in the paper.
Model Name | Dataset | Training Iterations | Parameters |
---|---|---|---|
charactr/vocos-mel-24khz | LibriTTS | 2.5 M | 13.5 M |
charactr/vocos-encodec-24khz | DNS Challenge | 2.5 M | 7.9 M |
Training
Prepare a filelist of audio files for the training and validation set:
find "$TRAIN_DATASET_DIR" -name "*.wav" > filelist.train
find "$VAL_DATASET_DIR" -name "*.wav" > filelist.val
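If `find` is not available (for example on Windows), the same filelists can be built from Python. This is a sketch using only the standard library; `write_filelist` is a hypothetical helper, not something shipped with Vocos, and it assumes the filelist format is simply one audio path per line, as produced by the commands above.

```python
from pathlib import Path


def write_filelist(dataset_dir, output_path):
    """Write every .wav path under `dataset_dir` to `output_path`, one per line.

    Paths are sorted for reproducibility (unlike `find`, whose order is
    filesystem dependent). Returns the number of files listed.
    """
    paths = sorted(str(p) for p in Path(dataset_dir).rglob("*.wav"))
    Path(output_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)


# Example usage (directory names are placeholders):
# write_filelist("data/train", "filelist.train")
# write_filelist("data/val", "filelist.val")
```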
Fill a config file, e.g. vocos.yaml, with your filelist paths and start training with:
python train.py -c configs/vocos.yaml
Refer to the PyTorch Lightning documentation for details about customizing the training pipeline.
Citation
If this code contributes to your research, please cite our work:
@article{siuzdak2023vocos,
title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
author={Siuzdak, Hubert},
journal={arXiv preprint arXiv:2306.00814},
year={2023}
}
License
The code in this repository is released under the MIT license as found in the LICENSE file.