ESPnet: end-to-end speech processing toolkit

Docs | Example | Docker | Notebook | Tutorial (2019)

ESPnet is an end-to-end speech processing toolkit that focuses mainly on end-to-end speech recognition and end-to-end text-to-speech. ESPnet uses Chainer and PyTorch as its main deep learning engines and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

Key Features

Kaldi style complete recipe

  • Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, etc.)
  • Supports a number of TTS recipes in a manner similar to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
  • Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
  • Supports a number of MT recipes (IWSLT'16, the above ST recipes, etc.)
  • Supports a speech separation and recognition recipe (WSJ-2mix)
  • Supports a voice conversion recipe (VCC2020 baseline) (new!)

ASR: Automatic Speech Recognition

  • State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
  • Hybrid CTC/attention based end-to-end ASR
    • Fast/accurate training with CTC/attention multitask training
    • CTC/attention joint decoding to boost monotonic alignment decoding
    • Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU) or Transformer
  • Attention: Dot product, location-aware attention, variants of multihead
  • Incorporate RNNLM/LSTMLM/TransformerLM trained only with text data
  • Batch GPU decoding
  • Transducer based end-to-end ASR
    • Available: RNN-Transducer, Transformer-Transducer, Transformer/RNN-Transducer
    • Support attention extension and VGG-Transformer (encoder)
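
The hybrid CTC/attention idea listed above can be illustrated with a minimal sketch. It assumes the interpolation described in references [2] and [3]: a weight between 0 and 1 mixes the CTC and attention terms, both for the training loss and for hypothesis scores at decoding time. The function names, the `ctc_weight` parameter, and the toy scores are hypothetical and are not taken from the ESPnet code. Real joint decoding combines the scores inside beam search, whereas this sketch only rescores complete hypotheses.

```python
def multitask_loss(ctc_loss, att_loss, ctc_weight):
    """Interpolate the CTC and attention losses (ctc_weight is in [0, 1])."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(ctc_logp, att_logp, ctc_weight):
    """Interpolate CTC and attention log-probabilities of one hypothesis."""
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp


def best_hypothesis(hyps, ctc_weight):
    """Pick the hypothesis with the highest joint score.

    hyps maps a hypothesis string to a (ctc_logp, att_logp) pair.
    """
    return max(hyps, key=lambda h: joint_score(hyps[h][0], hyps[h][1], ctc_weight))


# Toy scores (made up for illustration): the attention decoder prefers
# "hello word", while CTC strongly prefers "hello world".
toy_hyps = {
    "hello word": (-5.0, -1.0),
    "hello world": (-2.0, -1.5),
}

print(best_hypothesis(toy_hyps, ctc_weight=0.0))  # attention score only
print(best_hypothesis(toy_hyps, ctc_weight=0.5))  # joint CTC/attention score
```

With a weight of 0 only the attention score is used; raising the weight lets the CTC term penalize hypotheses that align poorly with the input.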

TTS: Text-to-speech

  • Tacotron2 based end-to-end TTS
  • Transformer based end-to-end TTS
  • Feed-forward Transformer (a.k.a. FastSpeech) based end-to-end TTS (new!)

ST: Speech Translation & MT: Machine Translation

  • State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
  • Transformer based end-to-end ST (new!)
  • Transformer based end-to-end MT (new!)

VC: Voice conversion

  • End-to-end VC based on cascaded ASR+TTS (new!)
  • Baseline system for Voice Conversion Challenge 2020!

DNN Framework

  • Flexible network architecture thanks to Chainer and PyTorch
  • Flexible front-end processing thanks to kaldiio and HDF5 support
  • Tensorboard based monitoring

Installation

See https://espnet.github.io/espnet/installation.html

Usage

See https://espnet.github.io/espnet/tutorial.html

Docker Container

Go to docker/ and follow the instructions.

Contribution

Thank you for taking the time to contribute to ESPnet! Contributions of any kind are welcome, and feel free to post questions or requests in the issues. If this is your first contribution to ESPnet, please follow the contribution guide.

Results and demo

You can find useful tutorials and demos in the Interspeech 2019 Tutorial.

ASR results

We list the character error rate (CER) and word error rate (WER) of major ASR tasks.

Task | CER (%) | WER (%) | Pretrained model
Aishell dev | 6.0 | N/A | link
Aishell test | 6.7 | N/A | same as above
Common Voice dev | 1.7 | 2.2 | link
Common Voice test | 1.8 | 2.3 | same as above
CSJ eval1 | 5.7 | N/A | link
CSJ eval2 | 3.8 | N/A | same as above
CSJ eval3 | 4.2 | N/A | same as above
HKUST dev | 23.5 | N/A | link
Librispeech dev_clean | N/A | 2.1 | link
Librispeech dev_other | N/A | 5.3 | same as above
Librispeech test_clean | N/A | 2.5 | same as above
Librispeech test_other | N/A | 5.5 | same as above
TEDLIUM2 dev | N/A | 9.3 | link
TEDLIUM2 test | N/A | 8.1 | same as above
TEDLIUM3 dev | N/A | 9.7 | link
TEDLIUM3 test | N/A | 8.0 | same as above
WSJ dev93 | 3.2 | 7.0 | N/A
WSJ eval92 | 2.1 | 4.7 | N/A

Note that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using a wide network (#units = 1024) and, where necessary, large subword units, as reported by RWTH.

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/asr1/RESULTS.md.
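
As a reminder of what the CER and WER columns measure, here is a minimal sketch of an edit-distance-based error rate. It is not the scoring tool used by the recipes. The example sentences, the whitespace tokenization, and the choice to drop spaces before computing CER are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent: edit distance divided by the reference length."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


ref = "the cat sat on the mat"
hyp = "the cat sit on mat"

# WER compares word sequences; CER compares character sequences.
wer = error_rate(ref.split(), hyp.split())
cer = error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))
print(f"WER = {wer:.1f}%, CER = {cer:.1f}%")
```

Because the denominator is the reference length, an error rate can exceed 100% when the hypothesis contains many insertions.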

ASR demo

You can recognize speech in a WAV file using pretrained models. Go to a recipe directory and run utils/recog_wav.sh as follows:

cd egs/tedlium2/asr1
../../../utils/recog_wav.sh --models tedlium2.transformer.v1 example.wav

where example.wav is a WAV file to be recognized. The sampling rate must match that of the data used in training.
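
Since a mismatched sampling rate is an easy mistake to make, a quick check before running the demo may help. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files. The helper names and the 16000 Hz value in the demo are illustrative assumptions; use the rate your chosen model was actually trained on.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sampling rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as f:
        return f.getframerate()


def check_sample_rate(path, expected_hz):
    """Raise ValueError if the file's sampling rate differs from expected_hz."""
    actual = wav_sample_rate(path)
    if actual != expected_hz:
        raise ValueError(
            f"{path}: {actual} Hz, expected {expected_hz} Hz; resample before decoding"
        )
    return actual


def write_silence(path, rate_hz, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (used only for this demo)."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(rate_hz)
        f.writeframes(b"\x00\x00" * int(rate_hz * seconds))


# Demo on a generated file; replace the path with your own example.wav.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "example.wav")
    write_silence(demo_path, 16000)
    print(check_sample_rate(demo_path, 16000))
```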

The pretrained models available in the demo script are listed below.

Model | Notes
tedlium2.rnn.v1 | Streaming decoding based on CTC-based VAD
tedlium2.rnn.v2 | Streaming decoding based on CTC-based VAD (batch decoding)
tedlium2.transformer.v1 | Joint-CTC attention Transformer trained on Tedlium 2
tedlium3.transformer.v1 | Joint-CTC attention Transformer trained on Tedlium 3
librispeech.transformer.v1 | Joint-CTC attention Transformer trained on Librispeech
commonvoice.transformer.v1 | Joint-CTC attention Transformer trained on CommonVoice
csj.transformer.v1 | Joint-CTC attention Transformer trained on CSJ

ST results

We list 4-gram BLEU of major ST tasks.

end-to-end system

Task | BLEU | Pretrained model
Fisher-CallHome Spanish fisher_test (Es->En) | 48.39 | link
Fisher-CallHome Spanish callhome_evltest (Es->En) | 18.67 | link
Libri-trans test (En->Fr) | 16.70 | link
How2 dev5 (En->Pt) | 45.68 | link
Must-C tst-COMMON (En->De) | 22.91 | link
Mboshi-French dev (Fr->Mboshi) | 6.18 | N/A

cascaded system

Task | BLEU | Pretrained model
Fisher-CallHome Spanish fisher_test (Es->En) | 42.16 | N/A
Fisher-CallHome Spanish callhome_evltest (Es->En) | 19.82 | N/A
Libri-trans test (En->Fr) | 16.96 | N/A
How2 dev5 (En->Pt) | 44.90 | N/A
Must-C tst-COMMON (En->De) | 23.65 | N/A

If you want to check the results of the other recipes, please check egs/<name_of_recipe>/st1/RESULTS.md.
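
For readers unfamiliar with the metric in the tables above, the following is a minimal sketch of corpus-level 4-gram BLEU with a single reference per hypothesis: clipped n-gram precisions for n = 1 to 4, their geometric mean, and a brevity penalty. It is not the scoring script used by the recipes, and the whitespace tokenization and lack of smoothing are simplifying assumptions, so its numbers are not comparable with the reported ones.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU in percent, one reference string per hypothesis string."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref_tok, n)
            hyp_counts = ngrams(hyp_tok, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty lowers the score when the hypotheses are shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


refs = ["the quick brown fox jumps over the lazy dog"]
hyps = ["the quick brown fox jumps over the lazy"]
print(f"BLEU = {corpus_bleu(refs, hyps):.2f}")
```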

ST demo

(New!) We made a new real-time E2E-ST + TTS demonstration in Google Colab. Please access the notebook from the following button and enjoy the real-time speech-to-speech translation!

Open In Colab


You can translate speech in a WAV file using pretrained models. Go to a recipe directory and run utils/translate_wav.sh as follows:

cd egs/fisher_callhome_spanish/st1/
wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
../../../utils/translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav

where test.wav is a WAV file to be translated. The sampling rate must match that of the data used in training.

The pretrained models available in the demo script are listed below.

Model | Notes
fisher_callhome_spanish.transformer.v1 | Transformer-ST trained on Fisher-CallHome Spanish Es->En

MT results

Task | BLEU | Pretrained model
Fisher-CallHome Spanish fisher_test (Es->En) | 61.45 | link
Fisher-CallHome Spanish callhome_evltest (Es->En) | 29.86 | link
Libri-trans test (En->Fr) | 18.09 | link
How2 dev5 (En->Pt) | 58.61 | link
Must-C tst-COMMON (En->De) | 27.63 | link
IWSLT'14 test2014 (En->De) | 24.70 | link
IWSLT'14 test2014 (De->En) | 29.22 | link
IWSLT'16 test2014 (En->De) | 24.05 | link
IWSLT'16 test2014 (De->En) | 29.13 | link

TTS results

You can listen to our samples on the demo page espnet-tts-sample. Here we list some notable ones:

You can download all of the pretrained models and generated samples:

Note that the generated samples use the following vocoders: Griffin-Lim (GL), WaveNet vocoder (WaveNet), Parallel WaveGAN (ParallelWaveGAN), and MelGAN (MelGAN). The neural vocoders are based on the following repositories.

If you want to build your own neural vocoder, please check the above repositories.

Here we list all of the pretrained neural vocoders. Please download and enjoy the generation of high quality speech!

Model | Lang | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Model type
ljspeech.wavenet.softmax.ns.v1 | EN | 22.05k | None | 1024 / 256 / None | Softmax WaveNet
ljspeech.wavenet.mol.v1 | EN | 22.05k | None | 1024 / 256 / None | MoL WaveNet
ljspeech.parallel_wavegan.v1 | EN | 22.05k | None | 1024 / 256 / None | Parallel WaveGAN
ljspeech.wavenet.mol.v2 | EN | 22.05k | 80-7600 | 1024 / 256 / None | MoL WaveNet
ljspeech.parallel_wavegan.v2 | EN | 22.05k | 80-7600 | 1024 / 256 / None | Parallel WaveGAN
ljspeech.melgan.v1 (EXPERIMENTAL) | EN | 22.05k | 80-7600 | 1024 / 256 / None | MelGAN
ljspeech.melgan.v3 (EXPERIMENTAL) | EN | 22.05k | 80-7600 | 1024 / 256 / None | MelGAN
libritts.wavenet.mol.v1 | EN | 24k | None | 1024 / 256 / None | MoL WaveNet
jsut.wavenet.mol.v1 | JP | 24k | 80-7600 | 2048 / 300 / 1200 | MoL WaveNet
jsut.parallel_wavegan.v1 | JP | 24k | 80-7600 | 2048 / 300 / 1200 | Parallel WaveGAN
csmsc.wavenet.mol.v1 | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | MoL WaveNet
csmsc.parallel_wavegan.v1 | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | Parallel WaveGAN

If you want to use the above pretrained vocoders, make sure your feature settings match theirs exactly.
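
One way to read this requirement is to compare the feature columns (Fs, Mel range, FFT / Shift / Win) of a TTS model with those of a vocoder. The sketch below encodes a few rows copied from the tables in this document and checks them for equality. The `FeatureSetting` type and the `compatible` helper are hypothetical names, and the check covers only the listed columns, not language or speaker.

```python
from typing import NamedTuple, Optional, Tuple


class FeatureSetting(NamedTuple):
    fs_hz: int                               # sampling rate
    mel_range_hz: Optional[Tuple[int, int]]  # (fmin, fmax), or None as in the tables
    n_fft: int                               # FFT size in points
    shift: int                               # frame shift in points
    win: Optional[int]                       # window length in points, or None


# Rows copied from the vocoder table above.
VOCODERS = {
    "ljspeech.wavenet.mol.v1": FeatureSetting(22050, None, 1024, 256, None),
    "ljspeech.wavenet.mol.v2": FeatureSetting(22050, (80, 7600), 1024, 256, None),
    "jsut.parallel_wavegan.v1": FeatureSetting(24000, (80, 7600), 2048, 300, 1200),
}

# Rows copied from the TTS model table in the TTS demo section.
TTS_MODELS = {
    "ljspeech.tacotron2.v1": FeatureSetting(22050, None, 1024, 256, None),
    "jsut.tacotron2": FeatureSetting(24000, (80, 7600), 2048, 300, 1200),
}


def compatible(tts_model, vocoder):
    """True if every listed feature setting of the two models is identical."""
    return TTS_MODELS[tts_model] == VOCODERS[vocoder]


for tts_name in TTS_MODELS:
    matching = [v for v in VOCODERS if compatible(tts_name, v)]
    print(tts_name, "->", matching)
```

In this subset, ljspeech.tacotron2.v1 matches ljspeech.wavenet.mol.v1 but not ljspeech.wavenet.mol.v2, because the Mel range differs.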

TTS demo

(New!) We made a new real-time E2E-TTS demonstration in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis!

Open In Colab


You can synthesize speech in a TXT file using pretrained models. Go to a recipe directory and run utils/synth_wav.sh as follows:

cd egs/ljspeech/tts1
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
../../../utils/synth_wav.sh example.txt

You can change the pretrained model as follows:

../../../utils/synth_wav.sh --models ljspeech.fastspeech.v1 example.txt

Waveform synthesis is performed with the Griffin-Lim algorithm or with neural vocoders (WaveNet and ParallelWaveGAN). You can change the pretrained vocoder model as follows:

../../../utils/synth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt

Note that the WaveNet vocoder produces very high quality speech, but generation takes time.

Available pretrained models in the demo script are listed as follows:

Model | Lang | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Input | R | Model type
ljspeech.tacotron2.v1 | EN | 22.05k | None | 1024 / 256 / None | char | 2 | Tacotron 2
ljspeech.tacotron2.v2 | EN | 22.05k | None | 1024 / 256 / None | char | 1 | Tacotron 2 + forward attention
ljspeech.tacotron2.v3 | EN | 22.05k | None | 1024 / 256 / None | char | 1 | Tacotron 2 + guided attention loss
ljspeech.transformer.v1 | EN | 22.05k | None | 1024 / 256 / None | char | 1 | Deep Transformer
ljspeech.transformer.v2 | EN | 22.05k | None | 1024 / 256 / None | char | 3 | Shallow Transformer
ljspeech.transformer.v3 | EN | 22.05k | None | 1024 / 256 / None | phn | 1 | Deep Transformer
ljspeech.fastspeech.v1 | EN | 22.05k | None | 1024 / 256 / None | char | 1 | FF-Transformer
ljspeech.fastspeech.v2 | EN | 22.05k | None | 1024 / 256 / None | char | 1 | FF-Transformer + CNN in FFT block
ljspeech.fastspeech.v3 | EN | 22.05k | None | 1024 / 256 / None | phn | 1 | FF-Transformer + CNN in FFT block + postnet
libritts.tacotron2.v1 | EN | 24k | 80-7600 | 1024 / 256 / None | char | 2 | Multi-speaker Tacotron 2
libritts.transformer.v1 | EN | 24k | 80-7600 | 1024 / 256 / None | char | 2 | Multi-speaker Transformer
jsut.tacotron2 | JP | 24k | 80-7600 | 2048 / 300 / 1200 | phn | 2 | Tacotron 2
jsut.transformer | JP | 24k | 80-7600 | 2048 / 300 / 1200 | phn | 3 | Shallow Transformer
csmsc.transformer.v1 | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | pinyin | 1 | Deep Transformer
csmsc.fastspeech.v3 | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | pinyin | 1 | FF-Transformer + CNN in FFT block + postnet

Available pretrained vocoder models in the demo script are listed as follows:

Model | Lang | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Model type
ljspeech.wavenet.softmax.ns.v1 | EN | 22.05k | None | 1024 / 256 / None | Softmax WaveNet
ljspeech.wavenet.mol.v1 | EN | 22.05k | None | 1024 / 256 / None | MoL WaveNet
ljspeech.parallel_wavegan.v1 | EN | 22.05k | None | 1024 / 256 / None | Parallel WaveGAN
libritts.wavenet.mol.v1 | EN | 24k | None | 1024 / 256 / None | MoL WaveNet
jsut.wavenet.mol.v1 | JP | 24k | 80-7600 | 2048 / 300 / 1200 | MoL WaveNet
jsut.parallel_wavegan.v1 | JP | 24k | 80-7600 | 2048 / 300 / 1200 | Parallel WaveGAN
csmsc.wavenet.mol.v1 | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | MoL WaveNet
csmsc.parallel_wavegan.v1 | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | Parallel WaveGAN

VC results

The Voice Conversion Challenge 2020 (VCC2020) adopts ESPnet to build an end-to-end baseline system. The objective in VCC2020 is intra-/cross-lingual nonparallel VC, for which a cascade of ASR and TTS was developed.
You can download converted samples here.

References

[1] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-End Speech Processing Toolkit," Proc. Interspeech'18, pp. 2207-2211 (2018)

[2] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," Proc. ICASSP'17, pp. 4835-4839 (2017)

[3] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey and Tomoki Hayashi, "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, Dec. 2017

Citations

@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={ESPnet: End-to-End Speech Processing Toolkit},
  year=2018,
  booktitle={Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@misc{hayashi2019espnettts,
    title={ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit},
    author={Tomoki Hayashi and Ryuichi Yamamoto and Katsuki Inoue and Takenori Yoshimura and Shinji Watanabe and Tomoki Toda and Kazuya Takeda and Yu Zhang and Xu Tan},
    year={2019},
    eprint={1910.10909},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

espnet-0.6.3.tar.gz (308.1 kB)

Uploaded Source

Built Distribution

espnet-0.6.3-py3-none-any.whl (375.8 kB)

Uploaded Python 3

File details

Details for the file espnet-0.6.3.tar.gz.

File metadata

  • Download URL: espnet-0.6.3.tar.gz
  • Upload date:
  • Size: 308.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0.post20200127 requests-toolbelt/0.9.1 tqdm/4.42.0 CPython/3.7.4

File hashes

Hashes for espnet-0.6.3.tar.gz
Algorithm | Hash digest
SHA256 | de51faf43a26132d51c6c278106ea52b34eed25292d48f123e831752b7b963f6
MD5 | 3c2d3ad7a5a7a3e1ef1b58a0df2f8606
BLAKE2b-256 | a9814abfd674a03e7191d7609a6d71736d3e670c94661de11fd2c6626d20664b

See more details on using hashes here.
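
The digests listed on this page can be used to check a downloaded file. The following is a minimal sketch using Python's standard `hashlib`. The expected SHA256 values are copied from this page, while the helper names and the assumption that the files sit in the current directory are illustrative.

```python
import hashlib
import os

# SHA256 digests copied from the "File hashes" tables on this page.
EXPECTED_SHA256 = {
    "espnet-0.6.3.tar.gz": "de51faf43a26132d51c6c278106ea52b34eed25292d48f123e831752b7b963f6",
    "espnet-0.6.3-py3-none-any.whl": "547d647b45cc5f6fca5f248a9de3ad4deab0d5457b58de5565c265119d97ac5d",
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.lower()


# Check whichever of the listed files exist in the current directory.
for name, expected in EXPECTED_SHA256.items():
    if os.path.exists(name):
        print(name, "OK" if verify(name, expected) else "MISMATCH")
    else:
        print(name, "not found, skipped")
```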

File details

Details for the file espnet-0.6.3-py3-none-any.whl.

File metadata

  • Download URL: espnet-0.6.3-py3-none-any.whl
  • Upload date:
  • Size: 375.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0.post20200127 requests-toolbelt/0.9.1 tqdm/4.42.0 CPython/3.7.4

File hashes

Hashes for espnet-0.6.3-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 547d647b45cc5f6fca5f248a9de3ad4deab0d5457b58de5565c265119d97ac5d
MD5 | 3156eec623798555391c1c951e66108d
BLAKE2b-256 | 5f0ff8678e88f71532b10abc9400e8b9e46f8cd1bf2bbad32334c1caf7c8339f

See more details on using hashes here.
