
WhisperSpeech

Test it out yourself in Colab
If you have questions or want to help, you can find us in the #audio-generation channel on the LAION Discord server.

An Open Source text-to-speech system built by inverting Whisper. Previously known as spear-tts-pytorch.

We want this model to be like Stable Diffusion but for speech – both powerful and easily customizable.

We are working only with properly licensed speech recordings, and all the code is Open Source, so the model will always be safe to use for commercial applications.

Currently the models are trained on the English LibriLight dataset. In the next release we want to target multiple languages (Whisper and EnCodec are both multilingual).

Progress update [2023-07-14]

We have trained a new pair of models, added support for multiple speakers, and integrated the Vocos vocoder to deliver a big overall quality boost. And this is not our last word: we are doing hyperparameter tuning to train bigger, higher-quality models with a better training setup and higher-quality data (see Roadmap).

An end-to-end generation example, inspired by one famous president’s speech (don’t forget to unmute the videos):

Female voice:

https://github.com/collabora/WhisperSpeech/assets/107984/ea592497-19aa-4a69-b492-18dd907b307f

Male voice:

https://github.com/collabora/WhisperSpeech/assets/107984/8a012d31-0d72-42cd-a2e5-ab728ad3ca8c

We have streamlined the inference pipeline, and you can now test the model yourself on Google Colab: Open In Colab
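
If you prefer to run the model locally, inference looks roughly like the sketch below. It follows the whisperspeech package’s Pipeline interface; treat the exact class and method names as assumptions, since the API is still evolving.

import torch
from whisperspeech.pipeline import Pipeline

# Load the pre-trained models (downloaded from HuggingFace on first use).
pipe = Pipeline()

# Synthesize a sentence and save the result as a WAV file.
pipe.generate_to_file("hello.wav", "This is a test of WhisperSpeech.")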

Older progress updates are archived here

Downloads

We encourage you to start with the Google Colab link above or run the provided notebook locally. If you want to download the models manually or train them from scratch, both the WhisperSpeech pre-trained models and the converted datasets are available on HuggingFace.
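
For manual downloads, the huggingface_hub client is a convenient option. A minimal sketch follows; the repository id and checkpoint name below are illustrative assumptions, so check the HuggingFace pages for the actual artifact names.

from huggingface_hub import hf_hub_download

# Fetch a single checkpoint into the local HuggingFace cache.
path = hf_hub_download(
    repo_id="collabora/whisperspeech",  # assumed model repository id
    filename="s2a-up-acc.model",        # hypothetical checkpoint name
)
print(path)  # local path of the cached file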

Roadmap

Architecture

The general architecture is similar to AudioLM, SPEAR TTS from Google, and MusicGen from Meta. We avoided the NIH (not invented here) syndrome and built it on top of powerful Open Source models: Whisper from OpenAI to generate semantic tokens and perform transcription, EnCodec from Meta for acoustic modeling, and Vocos from Charactr Inc. as the high-quality vocoder.

Whisper for modeling semantic tokens

We utilize the OpenAI Whisper encoder block to generate embeddings, which we then quantize with a small 2-layer model to get semantic tokens.

If the language is already supported by Whisper then this process requires only audio files (without ground truth transcriptions).

Using Whisper for semantic token extraction diagram
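
A rough sketch of this step using the openai-whisper package is shown below. The nearest-neighbour codebook lookup stands in for the small 2-layer quantizer mentioned above, which is a separate trained model and not part of Whisper; the random codebook here is purely for shape illustration.

import torch
import whisper

# Load Whisper and compute continuous encoder embeddings for an audio clip.
model = whisper.load_model("base.en")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    embeddings = model.encoder(mel.unsqueeze(0))  # (1, n_frames, n_state)

# Stand-in quantizer: map each embedding to the id of its nearest codebook
# entry. The real quantizer is a trained model, not a random codebook.
codebook = torch.randn(1024, embeddings.shape[-1])
distances = torch.cdist(embeddings, codebook.unsqueeze(0))  # (1, n_frames, 1024)
semantic_tokens = distances.argmin(dim=-1)  # discrete token ids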

EnCodec for modeling acoustic tokens

We use EnCodec to model the audio waveform. Out of the box it delivers reasonable quality at 1.5 kbps, and we can bring this up to high quality by using Vocos – a vocoder pretrained on EnCodec tokens.

EnCodec block diagram
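
A minimal sketch of this round trip, using the published encodec and vocos packages: encode audio into discrete tokens at 1.5 kbps, then decode the tokens with Vocos instead of the EnCodec decoder. The input file name is a placeholder, and the bandwidth_id mapping follows the Vocos documentation.

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from vocos import Vocos

# Encode a waveform into discrete acoustic tokens at 1.5 kbps.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(1.5)
wav, sr = torchaudio.load("sample.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([c for c, _ in frames], dim=-1)[0]  # (n_codebooks, n_frames)

# Decode the same tokens with Vocos for higher quality than the EnCodec
# decoder; bandwidth_id 0 selects the 1.5 kbps setting.
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
features = vocos.codes_to_features(codes)
audio = vocos.decode(features, bandwidth_id=torch.tensor([0]))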

Appreciation


This work would not be possible without the generous sponsorships from:

  • Collabora – code development and model training
  • LAION – community building and datasets

We are available to help you with both Open Source and proprietary AI projects. You can reach us via the Collabora website or on Discord.

Citations

We rely on many amazing Open Source projects and research papers:

@article{SpearTTS,
  title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
  url = {https://arxiv.org/abs/2302.03540},
  author = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
  publisher = {arXiv},
  year = {2023},
}
@article{MusicGen,
  title = {Simple and Controllable Music Generation},
  url = {https://arxiv.org/abs/2306.05284},
  author = {Copet, Jade and Kreuk, Felix and Gat, Itai and Remez, Tal and Kant, David and Synnaeve, Gabriel and Adi, Yossi and Défossez, Alexandre},
  publisher = {arXiv},
  year = {2023},
}
@article{Whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  publisher = {arXiv},
  year = {2022},
}
@article{EnCodec,
  title = {High Fidelity Neural Audio Compression},
  url = {https://arxiv.org/abs/2210.13438},
  author = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  publisher = {arXiv},
  year = {2022},
}
@article{Vocos,
  title = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  url = {https://arxiv.org/abs/2306.00814},
  author = {Siuzdak, Hubert},
  publisher = {arXiv},
  year = {2023},
}
