
An Open Source text-to-speech system built by inverting Whisper

Project description


If you have questions or you want to help you can find us in the #audio-generation channel on the LAION Discord server.

SPEAR-TTS

An unofficial PyTorch implementation of SPEAR-TTS.

We are not targeting an exact copy – to speed up training we want to use existing Open Source models as bases: the Whisper encoder to generate semantic tokens and EnCodec for acoustic modeling.

Following Google Brain, we'll train on the LibriLight dataset. Ultimately we want to target multiple languages (Whisper and EnCodec are both multilingual).

Progress updates

UPDATE 2023-04-13: We have trained a preliminary T->S model and a new 3 kbps S->A model which improves the speech quality. Both models are still far from perfect, but we are clearly moving in the right direction (to the moon 🚀🌖!).

End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search). See https://github.com/collabora/spear-tts-pytorch/issues/9 for more details:

(don't forget to unmute the video)

https://user-images.githubusercontent.com/107984/231755045-e7d55a7a-6d97-4a0a-a8cf-1bc7f54c9217.mp4

Ground truth:

https://user-images.githubusercontent.com/107984/231755210-7150636b-18c2-4dff-a8f4-9db0b932ad5f.mp4

UPDATE 2023-04-03: We have trained a working S->A model. It does not sound amazing yet, but that is mostly due to the limited EnCodec quality at 1.5 kbps.

Validation set ground truth (don't forget to unmute):

https://user-images.githubusercontent.com/107984/229439299-3aca954c-f044-4270-a4e5-4f847fd5d929.mov

The generated output from the S->A model (multinomial sampling, temperature 0.8):

https://user-images.githubusercontent.com/107984/229439418-92575be4-a892-40bb-97f7-5bfda5b2bf1d.mov
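
Both updates above generate tokens with plain multinomial sampling at a fixed temperature (0.7 and 0.8), with no beam search. A minimal pure-Python sketch of that sampling step (the function name and toy logits are illustrative, not taken from the codebase):

```python
import math
import random

def sample_multinomial(logits, temperature=0.7):
    """Sample one token id from raw logits after temperature scaling.

    Dividing the logits by T < 1 sharpens the distribution (T -> 0
    approaches argmax); T > 1 flattens it. After scaling we apply a
    numerically stable softmax and draw one categorical sample.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy usage: three candidate tokens, the first strongly preferred.
token = sample_multinomial([2.0, 1.0, 0.1], temperature=0.7)
```

At generation time this is applied independently at every step of the autoregressive decoder, so lower temperatures trade diversity for stability.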

Roadmap

Architecture

Whisper for modeling semantic tokens

Using Whisper for semantic token extraction diagram

Pros:

  • Whisper training should be a lot better at extracting semantic information than a masked language model with contrastive loss (w2v-BERT)
  • it's pretrained on 680k hours of multilingual speech (vs. 60k for the w2v-BERT used in the paper)
  • freely available

Cons:

  • 2x higher "symbol rate" (50 vec/s) than w2v-BERT (25 vec/s) which means training the semantic->acoustic transformer may take longer (this turned out not to matter in practice – there are only 1500 semantic tokens for 30 seconds of audio vs. 4500 acoustic tokens)
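
The token counts quoted above are simple arithmetic; the acoustic figure assumes EnCodec's 75 frames/s with two residual codebooks at 1.5 kbps, which is our reading of its published configuration:

```python
# Whisper's encoder emits 50 semantic vectors per second (w2v-BERT: 25).
SEMANTIC_RATE = 50          # tokens/s
CLIP_SECONDS = 30

semantic_tokens = SEMANTIC_RATE * CLIP_SECONDS  # 50 * 30 = 1500

# The 24 kHz EnCodec model runs at 75 frames/s; at 1.5 kbps it keeps
# 2 residual codebooks per frame, so the acoustic stream is denser.
ENCODEC_FRAME_RATE = 75     # frames/s
CODEBOOKS_AT_1_5_KBPS = 2

acoustic_tokens = ENCODEC_FRAME_RATE * CODEBOOKS_AT_1_5_KBPS * CLIP_SECONDS

print(semantic_tokens, acoustic_tokens)  # 1500 4500
```

So even at 2x the w2v-BERT rate, the semantic sequence stays 3x shorter than the acoustic one, which is why the S->A transformer dominates the cost either way.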

EnCodec for modeling acoustic tokens

EnCodec block diagram

Pros:

  • High-quality pretrained model is available

Cons:

  • Comparing the speech samples with SPEAR-TTS's, EnCodec needs 6 kbps to reach the same quality (a SoundStream retrained on speech only seems to work at 1.5 kbps)
  • CC-BY-NC license
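
As a rough sanity check on those bitrates: each EnCodec residual codebook has 1024 entries (10 bits) and the 24 kHz model emits 75 frames per second, so every extra codebook adds 0.75 kbps. The constants below assume that published configuration:

```python
BITS_PER_CODEBOOK = 10      # log2(1024 codebook entries)
FRAME_RATE = 75             # frames/s for the 24 kHz EnCodec model

def bitrate_kbps(n_codebooks):
    """Bitrate implied by keeping n residual codebooks per frame."""
    return n_codebooks * BITS_PER_CODEBOOK * FRAME_RATE / 1000

print(bitrate_kbps(2))      # 1.5 kbps -- the low-quality setting above
print(bitrate_kbps(8))      # 6.0 kbps -- needed to match SPEAR-TTS quality
```

Matching SPEAR-TTS quality thus means modeling 4x as many acoustic tokens per second, which is part of the motivation for a speech-only codec.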

We may switch to the Open Source SoundStream re-implementation or train a new speech-only model.

Appreciation


This work would not be possible without the generous sponsorships from:

  • Collabora – code development and model training
  • LAION – community building and datasets

We are available to help you with both Open Source and proprietary AI projects. You can reach us via the Collabora website or on Discord.

Citations

@article{SpearTTS,
  title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
  url = {https://arxiv.org/abs/2302.03540},
  author = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
  publisher = {arXiv},
  year = {2023},
}
@article{Whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  publisher = {arXiv},
  year = {2022},
}
@article{EnCodec,
  title = {High Fidelity Neural Audio Compression},
  url = {https://arxiv.org/abs/2210.13438},
  author = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  publisher = {arXiv},
  year = {2022},
}

Download files

Download the file for your platform.

Source Distribution

WhisperSpeech-0.0.3.tar.gz (27.6 kB)

Uploaded Source

Built Distribution

WhisperSpeech-0.0.3-py3-none-any.whl (35.4 kB)

Uploaded Python 3

File details

Details for the file WhisperSpeech-0.0.3.tar.gz.

File metadata

  • Download URL: WhisperSpeech-0.0.3.tar.gz
  • Size: 27.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for WhisperSpeech-0.0.3.tar.gz:

  • SHA256: 130525609d302b9f797b21b2a64cbb48acf8457d0dd8b927e1535f2448903dbb
  • MD5: ad617e86d42580d553480623dce19b00
  • BLAKE2b-256: fb587eca37f4106b9fc2c768bd65c6b47c6a02dc5ab2d8fa8935f320980e6888


File details

Details for the file WhisperSpeech-0.0.3-py3-none-any.whl.

File hashes

Hashes for WhisperSpeech-0.0.3-py3-none-any.whl:

  • SHA256: d3fe12f838edc2a5526772e5a0be8360795717ffe36a0c6f1b7b5e8c1dc6fe9e
  • MD5: 43224bddb3c44e0b729c39c3d4cbeae2
  • BLAKE2b-256: 194d9e938b6b692ada349e677f89f3845ea007244b8bf57120b8acd271dfbfc2

