Voice100 is a small TTS for English and Japanese.
Voice100
Voice100 includes neural TTS/ASR models. Inference is low cost because the models are tiny and depend only on CNNs, without recurrence.
Objectives
- Don't depend on non-commercially licensed datasets.
- Small enough to run on normal PCs, Raspberry Pi and smartphones.
Sample synthesis
- Sample synthesis 1: beginnings are apt to be determinative and when reinforced by continuous applications of similar influence
- Sample synthesis 2: which had restored the courage of noirtier for ever since he had conversed with the priest his violent despair had yielded to a calm resignation which surprised all who knew his excessive affection
- Sample synthesis 1: また、東寺のように五大明王と呼ばれる主要な明王の中央に配されることも多い。
- Sample synthesis 2: ニューイングランド風は牛乳をベースとした白いクリームスープでありボストンクラムチャウダーとも呼ばれる
Architecture
TTS
The TTS model is divided into two sub-models: the align model and the audio model. The align model predicts text alignments from a text. An aligned text is generated from the text and the text alignments. The audio model predicts WORLD features (F0, spectral envelope, coded aperiodicity) from the aligned text.
Alignment network
graph TD
A[Input text] -->|hello| B(Embedding)
B --> C(1D inverted residual x4)
C --> D(Convolution)
D -->|h:0,1 e:0,2 l:1,1 l:1,1 o:1,2| E[Alignment]
Audio network
graph TD
A[Aligned text] -->|_hee_l_l_oo| B(Embedding)
B --> C(1D inverted residual x4)
C --> D(1D transpose convolution)
D --> E(1D inverted residual x3)
E --> F(Convolution)
F --> G[WORLD parameters]
TTS align model
| Name | Type | Params
-----------------------------------------
0 | embedding | Embedding | 14.8 K
1 | layers | Sequential | 8.6 M
-----------------------------------------
8.6 M Trainable params
0 Non-trainable params
8.6 M Total params
17.137 Total estimated model params size (MB)
TTS audio model
| Name | Type | Params
-------------------------------------------
0 | embedding | Embedding | 14.8 K
1 | decoder | VoiceDecoder | 11.0 M
2 | norm | WORLDNorm | 518
3 | criterion | WORLDLoss | 0
-------------------------------------------
11.1 M Trainable params
518 Non-trainable params
11.1 M Total params
22.120 Total estimated model params size (MB)
Align model pre-processing
The input of the align model is the sequence of tokens of the input text. The input text is lowercased, tokenized into characters and encoded by the text encoder. The text encoder has 28 characters in its vocabulary: the lowercase letters, a space and an apostrophe. Characters not found in the vocabulary are removed.
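The pre-processing steps above can be sketched as follows. This is a minimal illustration, not Voice100's actual text encoder: the ordering of `VOCAB`, the resulting token ids and the function name `encode_text` are assumptions made for the example.

```python
import string
from typing import List

# Hypothetical vocabulary layout: 26 lowercase letters, a space and an apostrophe.
# The id assignment used by the real text encoder may differ.
VOCAB = string.ascii_lowercase + " '"
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def encode_text(text: str) -> List[int]:
    """Lowercase the text, drop out-of-vocabulary characters, map the rest to ids."""
    return [CHAR_TO_ID[ch] for ch in text.lower() if ch in CHAR_TO_ID]


print(len(VOCAB))  # 28
print(encode_text("Hello, World!"))  # the comma and "!" are removed
```

Under this assumed layout, punctuation and digits simply disappear from the token sequence, so the number of tokens can be smaller than the number of characters typed.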
Align model post-processing
The output of the align model is a sequence of timing pairs whose length equals the number of input tokens. Each pair holds two values: the number of frames before the token and the number of frames for the token. One frame is 20 ms. The aligned text is generated from the input text and the timing pairs, and its length is the total number of audio frames.
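A minimal sketch of how timing pairs could be expanded into an aligned text is shown here. The function name `make_aligned_text` and the `_` spacing symbol are assumptions for illustration; the timings are the ones shown in the alignment network diagram. The audio network diagram shows an extra leading spacing token, which this sketch does not model.

```python
from typing import List, Tuple

SPACING = "_"  # assumed symbol for frames that belong to no token
FRAME_MS = 20  # one align-model frame is 20 ms


def make_aligned_text(text: str, timings: List[Tuple[int, int]]) -> str:
    """Expand each token into `before` spacing frames followed by `length` token frames."""
    if len(text) != len(timings):
        raise ValueError("need exactly one (before, length) pair per token")
    frames = []
    for token, (before, length) in zip(text, timings):
        frames.append(SPACING * before)
        frames.append(token * length)
    return "".join(frames)


timings = [(0, 1), (0, 2), (1, 1), (1, 1), (1, 2)]
aligned = make_aligned_text("hello", timings)
print(aligned)  # hee_l_l_oo
print(len(aligned) * FRAME_MS, "ms")  # 200 ms
```

Each character of the result corresponds to one 20 ms frame, which is why the string length gives the duration directly.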
Audio model pre-processing
The input of the audio model is the encoded aligned text. It is encoded in the same way as in the align model pre-processing, except that the vocabulary has one added token for spacing between the tokens of the original text.
Audio model post-processing
The output of the audio model is a sequence of F0, F0 existence, log spectral envelope and coded aperiodicity values. F0 existence is a boolean that is true when F0 is available and false otherwise. F0 is forced to 0 when F0 existence is false. One frame is 10 ms. Since input frames are 20 ms, the output is twice as long as the input.
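The F0 masking and the frame-count relationship can be sketched as below. The helper names `mask_f0` and `output_frames` and the sample values are hypothetical; plain Python lists stand in for whatever tensors the real model produces.

```python
from typing import List, Sequence


def mask_f0(f0: Sequence[float], f0_exists: Sequence[bool]) -> List[float]:
    """Force F0 to 0 in every frame where F0 existence is false."""
    return [value if exists else 0.0 for value, exists in zip(f0, f0_exists)]


def output_frames(input_frames: int) -> int:
    """Input frames are 20 ms and output frames are 10 ms, so the count doubles."""
    return input_frames * 2


# Made-up values: the third frame is unvoiced, so its F0 is zeroed.
f0 = mask_f0([110.0, 112.5, 3.2, 108.0], [True, True, False, True])
print(f0)  # [110.0, 112.5, 0.0, 108.0]
print(output_frames(10), "frames,", output_frames(10) * 10, "ms")  # 20 frames, 200 ms
```

The total duration is unchanged by the audio model: 10 input frames of 20 ms and 20 output frames of 10 ms both cover 200 ms.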
ASR
The ASR model is a 9-layer MobileNet-like inverted residual network trained with CTC loss.
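For readers unfamiliar with CTC, the following sketch shows generic greedy CTC decoding of per-frame logits: take the best class per frame, merge consecutive repeats, then drop blanks. It is the textbook procedure, not necessarily the decoding Voice100 uses; the blank id of 0 and the toy logits are assumptions.

```python
from typing import List, Sequence

BLANK = 0  # assumption: class 0 is the CTC blank


def ctc_greedy_decode(logits: Sequence[Sequence[float]], blank: int = BLANK) -> List[int]:
    """Pick the best class per frame, merge consecutive repeats, then drop blanks."""
    best = [max(range(len(frame)), key=lambda i: frame[i]) for frame in logits]
    decoded = []
    previous = None
    for token in best:
        if token != previous and token != blank:
            decoded.append(token)
        previous = token
    return decoded


# Toy logits for 5 frames and 3 classes; the per-frame argmax is 1, 1, 0, 1, 2.
toy_logits = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
print(ctc_greedy_decode(toy_logits))  # [1, 1, 2]
```

The blank between the second and fourth frames is what allows the same token to appear twice in a row in the decoded output.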
ASR network
graph TD
A[Mel spectrogram] --> B(1D inverted residual x 12)
B --> C(Convolution)
C --> G[Logits of aligned text]
| Name | Type | Params
----------------------------------------------------------------
0 | encoder | ConvVoiceEncoder | 11.6 M
1 | decoder | LinearCharDecoder | 14.9 K
2 | loss_fn | CTCLoss | 0
3 | batch_augment | BatchSpectrogramAugumentation | 0
----------------------------------------------------------------
11.6 M Trainable params
0 Non-trainable params
11.6 M Total params
23.243 Total estimated model params size (MB)
Align model
The align model is a 2-layer bidirectional LSTM trained to predict aligned texts from MFCC audio features. It is used to prepare the aligned texts in the dataset for training the TTS models.
| Name | Type | Params
----------------------------------------------------------------
0 | conv | Conv1d | 24.7 K
1 | lstm | LSTM | 659 K
2 | dense | Linear | 7.5 K
3 | loss_fn | CTCLoss | 0
4 | batch_augment | BatchSpectrogramAugumentation | 0
----------------------------------------------------------------
691 K Trainable params
0 Non-trainable params
691 K Total params
1.383 Total estimated model params size (MB)
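The size figures in the model summaries above are consistent with 2 bytes per parameter, which matches the 16-bit precision used in the training commands. The sketch below recovers the approximate parameter counts from the reported sizes. The helper name `params_in_millions` and the 2-byte assumption are illustrative and not taken from the Voice100 code.

```python
def params_in_millions(size_mb: float, bytes_per_param: int = 2) -> float:
    """Convert an estimated model size in MB into a parameter count in millions."""
    return size_mb / bytes_per_param


# Sizes copied from the four model summaries above.
reported_sizes_mb = {
    "TTS align model": 17.137,
    "TTS audio model": 22.120,
    "ASR model": 23.243,
    "Align model (LSTM)": 1.383,
}

for name, size_mb in reported_sizes_mb.items():
    print(f"{name}: about {params_in_millions(size_mb):.1f} M parameters")
```

For example, 17.137 MB at 2 bytes per parameter gives roughly 8.57 million parameters, in line with the 8.6 M total reported for the TTS align model.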
Training
Train ASR model
voice100-prepare-dataset \
--dataset ljspeech \
--language en \
--use_phone
voice100-prepare-dataset \
--dataset librispeech \
--language en \
--use_phone
voice100 fit \
--config config/asr_en_phone_base.yaml \
--trainer.accelerator gpu \
--trainer.devices 1 \
--trainer.precision 16 \
--trainer.default_root_dir ./outputs/asr_en_phone_base
Align text with small ASR model
This generates the aligned text as data/${DATASET}-phone-align.txt.
voice100-align-text \
--batch_size 4 \
--dataset ljspeech \
--language en \
--use_phone \
--checkpoint asr_en_phone_small-20230309.ckpt
Train TTS align model
voice100 fit --config voice100/config/align_en_phone_base.yaml \
--trainer.accelerator gpu \
--trainer.devices 1 \
--trainer.precision 16 \
--trainer.default_root_dir=./outputs/align_en_phone_base
Compute audio statistics
This generates the statistics as data/${DATASET}-stat.pt.
voice100-calc-stat \
--dataset ljspeech \
--language en \
--output data/audio-stat.pt
Train TTS audio model
voice100 fit --config voice100/config/tts_en_phone_base.yaml \
--trainer.accelerator gpu \
--trainer.devices 1 \
--trainer.precision 16 \
--trainer.default_root_dir=./outputs/tts_en_phone_base
Exporting to ONNX
voice100-export-onnx \
--checkpoint model/${MODEL}/lightning_logs/version_0/checkpoints/last.ckpt
CMU models
CMU models use the output of G2p_en as the text representation instead of raw text.
Training CMU models
These commands convert texts in the dataset into ./data/[dataset]-phone-[split].txt. Then run voice100-train-[model] with --use-phone.
voice100-prepare-dataset \
--dataset ljspeech
voice100-prepare-dataset \
--dataset librispeech \
--split train
voice100-prepare-dataset \
--dataset librispeech \
--split val
CMU multi-task model
The CMU multi-task model is a variant of the TTS audio model whose input is an aligned text and whose outputs are WORLD vocoder parameters and CMU phonemes. Training it requires alignment data for both English text and CMU phonemes.
./data/ljspeech-align-train.txt
./data/ljspeech-phone-align-train.txt
Then run
MODEL=ttsaudio_en_mt_conv_base
voice100-train-ttsaudio-mt \
--gpus 1 \
--dataset ${DATASET} \
--language ${LANGUAGE} \
--batch_size 32 \
--precision 16 \
--max_epochs 150 \
--default_root_dir ./model/${MODEL}
Inference
Use the Voice100 runtime with the exported ONNX files.
Pretrained models
Name | Model Class | Dataset | Download |
---|---|---|---|
asr_en_small-20230225 | AudioToAlignText | LibriSpeech, LJ Speech 1.1 | download |
asr_en_base-20230319 | AudioToAlignText | LibriSpeech, LJ Speech 1.1 | download |
asr_en_phone_small-20230309 | AudioToAlignText | LibriSpeech, LJ Speech 1.1 | download |
asr_en_phone_base-20230314 | AudioToAlignText | LibriSpeech, LJ Speech 1.1 | download |
asr_ja_phone_small-20230104 | AudioToAlignText | Common Voice 12.0 ja | download |
asr_ja_phone_base-20230104 | AudioToAlignText | Common Voice 12.0 ja | download |
align_en_base-20230401 | TextToAlignText | LJ Speech 1.1 | download |
tts_en_base-20230407 | AlignTextToAudio | LJ Speech 1.1 | download |
align_en_phone_base-20230407 | TextToAlignText | LJ Speech 1.1 | download |
tts_en_phone_base-20230401 | AlignTextToAudio | LJ Speech 1.1 | download |
align_ja_phone_base-20230203 | TextToAlignText | Kokoro Speech v1.2 large | download |
tts_ja_phone_base-20230204 | AlignTextToAudio | Kokoro Speech v1.2 large | download |
asr_en_base-20210628 (deprecated) | AudioAlignCTC | LJ Speech 1.1 | download |
align_en_lstm_base_ctc-20210628 (deprecated) | AudioAlignCTC | LJ Speech 1.1 | download |
align_en_phone_lstm_base_ctc-20220103 (deprecated) | AudioAlignCTC | LJ Speech 1.1 | download |
align_ja_lstm_base_ctc-20211116 (deprecated) | AudioAlignCTC | Kokoro Speech v1.1 small | download |
align_ja_phone_lstm_base_ctc-20221230 (deprecated) | AudioAlignCTC | Kokoro Speech v1.1 small | download |
ttsalign_en_conv_base-20220409 (deprecated) | TextToAlignTextModel | LJ Speech 1.1 | download |
ttsalign_en_phone_conv_base-20220409 (deprecated) | TextToAlignTextModel | LJ Speech 1.1 | download |
ttsalign_ja_conv_base-20220411 (deprecated) | TextToAlignTextModel | Kokoro Speech v1.1 small | download |
ttsaudio_en_conv_base-20220107 (deprecated) | AlignTextToAudioModel | Kokoro Speech v1.1 small | download |
ttsaudio_en_phone_conv_base-20220105 (deprecated) | AlignTextToAudioModel | LJ Speech 1.1 | download |
ttsaudio_ja_conv_base-20220416 (deprecated) | AlignTextToAudioModel | Kokoro Speech v1.1 small | download |
ttsaudio_en_mt_conv_base-20220316 (deprecated) | AlignTextToAudioMultiTaskModel | LJ Speech 1.1 | download |
asr_en_conv_base_ctc-20220126 (deprecated) | AudioToTextCTC | LibriSpeech | download |
asr_en_phone_conv_base_ctc-20220107 (deprecated) | AudioToTextCTC | LibriSpeech | download |
stt_ja_conv_base_ctc-20211127 (deprecated) | AudioToTextCTC | Common Voice 6.1 ja | download |
asr_ja_phone_conv_base_ctc-20221225 (deprecated) | AudioToTextCTC | Common Voice 6.1 ja | download |