Quasi-Periodic Parallel WaveGAN (QPPWG)
This is the official QPPWG PyTorch implementation. QPPWG is a non-autoregressive neural speech generation model based on Parallel WaveGAN (PWG) and a quasi-periodic (QP) structure.
In this repo, we provide an example of training and testing QPPWG as a vocoder for WORLD acoustic features. More details can be found on our demo page.
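The core of the QP structure is the pitch-dependent dilated convolution: each time step's dilation follows the local pitch period instead of being a fixed power of two. Below is a minimal, illustrative PyTorch sketch of that idea; it is not the repo's implementation, and the function names and the `dense_factor=4` default are assumptions.

```python
import torch

def pitch_dilations(f0, fs=22050, dense_factor=4):
    """Per-step dilation d_t = round(fs / (f0_t * dense_factor)); 1 when unvoiced.
    f0: (B, T) per-sample f0 (frame-level f0 upsampled to the waveform rate)."""
    d = torch.where(f0 > 0, fs / (f0 * dense_factor), torch.ones_like(f0))
    return d.round().long().clamp(min=1)

def pitch_shifted(x, d):
    """Gather x[:, :, t - d_t] along time (edge-clamped): the past tap of a
    dilated convolution whose dilation tracks the pitch.
    x: (B, C, T) hidden features, d: (B, T) per-step dilations."""
    B, C, T = x.shape
    idx = torch.arange(T, device=x.device).unsqueeze(0) - d  # (B, T)
    idx = idx.clamp(min=0).unsqueeze(1).expand(B, C, T)      # clamp = repeat edge
    return x.gather(2, idx)
```

A QP residual block would then mix the current features with this pitch-shifted tap via 1x1 convolutions, so voiced segments see a receptive field aligned with their pitch period.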
News
- 2020/7/22: Released v0.1.2.
- 2020/6/27: Released mel-spectrogram feature extraction and the pre-trained models of the vcc20 corpus.
- 2020/6/26: Released the pre-trained models of the vcc18 corpus.
- 2020/5/20: Released the first version (v0.1.1).
Requirements
This repository is tested on Ubuntu 16.04 with a Titan V GPU.
- Python 3.6+
- CUDA 10.0
- cuDNN 7+
- PyTorch 1.0.1+
Environment setup
The code works with both anaconda and virtualenv. The following example uses anaconda.
$ conda create -n venvQPPWG python=3.6
$ source activate venvQPPWG
$ git clone https://github.com/bigpon/QPPWG.git
$ cd QPPWG
$ pip install -e .
Please refer to the PWG repo for more details.
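A quick sanity check that the installed PyTorch matches the requirements and sees the GPU:

```python
import torch

print(torch.__version__)          # expect 1.0.1 or later
print(torch.cuda.is_available())  # expect True on a CUDA 10.0 setup
```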
Folder architecture
- egs: The folder for projects.
- egs/vcc18: The folder of the VCC2018 project.
- egs/vcc18/exp: The folder for trained models.
- egs/vcc18/conf: The folder for configs.
- egs/vcc18/data: The folder for corpus related files (wav, feature, list ...).
- qppwg: The folder of the source code.
Run
Corpus and path setup
- Modify the corresponding CUDA paths in `egs/vcc18/run.py`.
- Download the Voice Conversion Challenge 2018 (VCC2018) corpus to run the QPPWG example:
$ cd egs/vcc18
# Download training and validation corpus
$ wget -o train.log -O data/wav/train.zip https://datashare.is.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_training.zip
# Download evaluation corpus
$ wget -o eval.log -O data/wav/eval.zip https://datashare.is.ed.ac.uk/bitstream/handle/10283/3061/vcc2018_database_evaluation.zip
# Unzip the corpus
$ unzip data/wav/train.zip -d data/wav/
$ unzip data/wav/eval.zip -d data/wav/
- Training wav list: `data/scp/vcc18_train_22kHz.scp`
- Validation wav list: `data/scp/vcc18_valid_22kHz.scp`
- Testing wav list: `data/scp/vcc18_eval_22kHz.scp` (a sketch for preparing such lists for a new corpus follows)
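An `.scp` file is just a plain-text list with one wav path per line, so preparing lists for your own corpus is straightforward. A minimal sketch (directory and file names are hypothetical):

```python
from pathlib import Path

# Write one wav path per line into a .scp list (names are illustrative).
wav_dir = Path("data/wav/my_corpus")
scp_path = Path("data/scp/my_corpus_train_22kHz.scp")
scp_path.parent.mkdir(parents=True, exist_ok=True)
with scp_path.open("w") as f:
    for wav in sorted(wav_dir.glob("**/*.wav")):
        f.write(f"{wav}\n")
```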
Preprocessing
# Extract WORLD acoustic features and statistics of training and testing data
$ bash run.sh --stage 0 --config PWG_30
- WORLD-related settings can be changed in `egs/vcc18/conf/vcc18.PWG_30.yaml`.
- If you want to use another corpus, please create a corresponding config and a file of power thresholds and f0 ranges like `egs/vcc18/data/pow_f0_dict.yml`.
- More details about feature extraction can be found in the QPNet repo; a rough extraction sketch also follows the lists below.
- The lists of auxiliary features will be automatically generated:
  - Training aux list: `data/scp/vcc18_train_22kHz.list`
  - Validation aux list: `data/scp/vcc18_valid_22kHz.list`
  - Testing aux list: `data/scp/vcc18_eval_22kHz.list`
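For intuition, here is a rough sketch of WORLD feature extraction with pyworld and pysptk. The repo's stage-0 script does more (power-threshold and f0-range handling, statistics, hdf5 output), so treat this as an approximation; the mel-cepstrum order and all-pass constant are assumptions.

```python
import numpy as np
import pyworld
import pysptk
import soundfile as sf

# Rough WORLD analysis of one utterance (file name is hypothetical).
x, fs = sf.read("data/wav/sample1.wav")
x = x.astype(np.float64)
f0, t = pyworld.harvest(x, fs, frame_period=5.0)  # frame_period matches "shiftms: 5"
sp = pyworld.cheaptrick(x, f0, t, fs)             # spectral envelope
ap = pyworld.d4c(x, f0, t, fs)                    # aperiodicity
uv = (f0 > 0).astype(np.float64)                  # voiced/unvoiced flag
mcep = pysptk.sp2mc(sp, order=34, alpha=0.455)    # mel-cepstrum; alpha for 22.05 kHz
```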
QPPWG/PWG training
# Train a QPPWG model with the 'QPPWGaf_20' config and the 'vcc18_train_22kHz' and 'vcc18_valid_22kHz' sets.
$ bash run.sh --gpu 0 --stage 1 --conf QPPWGaf_20 \
--trainset vcc18_train_22kHz --validset vcc18_valid_22kHz
- The GPU ID can be set by --gpu GPU_ID (default: 0).
- The model architecture can be set by --conf CONFIG (default: PWG_30).
- Training can be resumed from a saved checkpoint by --resume NUM (default: None); a checkpoint-inspection sketch follows.
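As a hedged illustration of what resuming involves, the sketch below only loads a checkpoint file and inspects its contents; the dict layout is repo-specific, and the 100000-step file name is assumed from the 400000-step naming pattern seen later in this page.

```python
import torch

# Load a released checkpoint and inspect what it stores (illustrative only;
# the trainer handles this internally when you pass --resume NUM).
ckpt = torch.load(
    "exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/checkpoint-100000steps.pkl",
    map_location="cpu")
print(sorted(ckpt.keys()))  # the exact keys are repo-specific
```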
QPPWG/PWG testing
# QPPWG/PWG decoding w/ natural acoustic features
$ bash run.sh --gpu 0 --stage 2 --conf QPPWGaf_20 \
--iter 400000 --trainset vcc18_train_22kHz --evalset vcc18_eval_22kHz
# QPPWG/PWG decoding w/ scaled f0 (e.g., halved f0)
$ bash run.sh --gpu 0 --stage 3 --conf QPPWGaf_20 --scaled 0.50 \
--iter 400000 --trainset vcc18_train_22kHz --evalset vcc18_eval_22kHz
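Conceptually, --scaled 0.50 multiplies the per-frame f0 track by a constant before synthesis; unvoiced frames (f0 = 0) are unaffected. A hedged sketch (the hdf5 dataset name "f0" is an assumption):

```python
import h5py

# Scale the f0 track of one extracted feature file by a constant factor.
with h5py.File("data/hdf5/sample1.h5", "r") as f:  # hypothetical file
    f0 = f["f0"][()]
f0_scaled = f0 * 0.50  # halved pitch; unvoiced frames stay 0
```

Because the QP dilations are derived from f0, the generator's receptive field adapts to the scaled pitch, which is the motivation for the pitch-dependent structure.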
Monitor training progress
$ tensorboard --logdir exp
- The training time of PWG_30 with a TITAN V is around 3 days.
- The training time of QPPWGaf_20 with a TITAN V is around 5 days.
Inference speed (RTF)
- Vanilla PWG (PWG_30)
# On CPU (Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz 32 threads)
[decode]: 100%|███████████| 140/140 [04:50<00:00, 2.08s/it, RTF=0.771]
2020-05-26 12:30:27,273 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.579).
# On GPU (TITAN V)
[decode]: 100%|███████████| 140/140 [00:09<00:00, 14.89it/s, RTF=0.0155]
2020-05-26 12:32:26,160 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.016).
- PWG w/ only 20 blocks (PWG_20)
# On CPU (Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz 32 threads)
[decode]: 100%|███████████| 140/140 [03:57<00:00, 1.70s/it, RTF=0.761]
2020-05-30 13:50:20,438 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.474).
# On GPU (TITAN V)
[decode]: 100%|███████████| 140/140 [00:08<00:00, 16.55it/s, RTF=0.0105]
2020-05-30 13:43:50,793 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.011).
- QPPWG (QPPWGaf_20)
# On CPU (Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz 32 threads)
[decode]: 100%|███████████| 140/140 [04:12<00:00, 1.81s/it, RTF=0.455]
2020-05-26 12:38:15,982 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.512).
# On GPU (TITAN V)
[decode]: 100%|███████████| 140/140 [00:11<00:00, 12.57it/s, RTF=0.0218]
2020-05-26 12:33:32,469 (decode:156) INFO: Finished generation of 140 utterances (RTF = 0.020).
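For reference, RTF (real-time factor) is synthesis time divided by the duration of the generated audio; RTF < 1 means faster than real time. A trivial illustration with made-up numbers:

```python
# Real-time factor: compute time / audio duration (RTF < 1 = faster than
# real time). The numbers below are made up for illustration.
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

print(rtf(9.0, 560.0))  # ~0.016
```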
Models and results
- The pre-trained models and generated utterances are released.
- You can download the whole folder of each corpus and put it at `egs/[corpus]` to run speech generation with the pre-trained models.
- Alternatively, you can download only the `[corpus]/data` folder and the desired pre-trained model, then put the `data` folder in `egs/[corpus]` and the model folder in `egs/[corpus]/exp`.
- Both models with 100,000 iterations (trained w/ only STFT loss) and 400,000 iterations (trained w/ STFT and GAN losses) are released.
- The generated utterances are in the `wav` folder of each model's folder.
| Corpus | Lang | Fs [Hz] | Feature | Model | Conf |
|---|---|---|---|---|---|
| vcc18 | EN | 22050 | WORLD (uv + f0 + mcep + ap) (shiftms: 5) | PWG_20 | link |
| | | | | PWG_30 | link |
| | | | | QPPWGaf_20 | link |
| vcc20 | EN, FI, DE, ZH | 24000 | melf0h128 (uv + f0 + mel-spc) (hop_size: 128) | PWG_20 | link |
| | | | | PWG_30 | link |
| | | | | QPPWGaf_20 | link |
Usage of pre-trained models
Analysis-synthesis
The minimal commands for performing analysis and synthesis are shown below.
# Make sure you have installed `qppwg`
# If not, install it via pip
$ pip install qppwg
# Take "vcc18" corpus as an example
# Download the whole folder of "vcc18"
$ ls vcc18
data exp
# Change directory to `vcc18` folder
$ cd vcc18
# Put audio files in `data/wav/` directory
$ ls data/wav/
sample1.wav sample2.wav
# Create a list `data/scp/sample.scp` of the audio files
$ tail data/scp/sample.scp
data/wav/sample1.wav
data/wav/sample2.wav
# Extract acoustic features
$ qppwg-preprocess \
--audio data/scp/sample.scp \
--indir wav \
--outdir hdf5 \
--config exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/config.yml
# The extracted features are in `data/hdf5/`
# The feature list `data/scp/sample.list` of the feature files will be automatically generated
$ ls data/hdf5/
sample1.h5 sample2.h5
$ ls data/scp/
sample.scp sample.list
# Synthesis
$ qppwg-decode \
--eval_feat data/scp/sample.list \
--stats data/stats/vcc18_train_22kHz.joblib \
--indir data/hdf5/ \
--outdir exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/ \
--checkpoint exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/checkpoint-400000steps.pkl
# Synthesis w/ halved F0
$ qppwg-decode \
--f0_factor 0.50 \
--eval_feat data/scp/sample.list \
--stats data/stats/vcc18_train_22kHz.joblib \
--indir data/hdf5/ \
--outdir exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/ \
--checkpoint exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/checkpoint-400000steps.pkl
# The generated utterances can be found in `exp/[model]/wav/400000/`
$ ls exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/
sample1.wav sample1_f0.50.wav sample2.wav sample2_f0.50.wav
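As a quick, hedged sanity check of the halved-f0 output, you can compare the voiced-frame median f0 of the two generated files; the ratio should be close to 0.5. File names are taken from the listing above:

```python
import numpy as np
import pyworld
import soundfile as sf

def median_f0(path):
    """Median f0 over voiced frames of a mono wav file."""
    x, fs = sf.read(path)
    f0, _ = pyworld.harvest(x.astype(np.float64), fs)
    return np.median(f0[f0 > 0])

out = "exp/qppwg_vcc18_train_22kHz_QPPWGaf_20/wav/400000/"
print(median_f0(out + "sample1_f0.50.wav") / median_f0(out + "sample1.wav"))
```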
References
The QPPWG repository is developed based on the PWG repository (kan-bayashi/ParallelWaveGAN) and the QPNet repository (bigpon/QPNet).
Citation
If you find the code helpful, please cite the following article.
@article{wu2020qppwg,
title={Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation},
author={Wu, Yi-Chiao and Hayashi, Tomoki and Okamoto, Takuma and Kawai, Hisashi and Toda, Tomoki},
journal={arXiv preprint arXiv:2005.08654},
year={2020}
}
Authors
Development:
Yi-Chiao Wu @ Nagoya University (@bigpon)
E-mail: yichiao.wu@g.sp.m.is.nagoya-u.ac.jp
Advisor:
Tomoki Toda @ Nagoya University
E-mail: tomoki@icts.nagoya-u.ac.jp