An Open Source text-to-speech system built by inverting Whisper
WhisperSpeech
If you have questions or want to help, you can find us in the
#audio-generation channel on the LAION Discord server.
An Open Source text-to-speech system built by inverting Whisper. Previously known as spear-tts-pytorch.
We want this model to be like Stable Diffusion but for speech – both powerful and easily customizable.
We work only with properly licensed speech recordings and all the code is Open Source, so the model will always be safe to use for commercial applications.
Currently the models are trained on the English LibriLight dataset. In the next release we want to target multiple languages (Whisper and EnCodec are both multilingual).
Progress update [2024-01-10]
We’ve pushed a new SD S2A model that is a lot faster while still generating high-quality speech. We’ve also added an example of voice cloning based on a reference audio file.
As always, you can check out our Colab to try it yourself!
Progress update [2023-12-10]
Another trio of models, this time with support for multiple languages (English and Polish). Here are two new samples for a sneak peek. You can check out our Colab to try it yourself!
English speech, female voice (transferred from a Polish language dataset):
https://github.com/collabora/WhisperSpeech/assets/107984/aa5a1e7e-dc94-481f-8863-b022c7fd7434
A Polish sample, male voice:
https://github.com/collabora/WhisperSpeech/assets/107984/4da14b03-33f9-4e2d-be42-f0fcf1d4a6ec
Older progress updates are archived here
Downloads
We encourage you to start with the Google Colab link above or run the provided notebook locally. If you want to download the models manually or train them from scratch, both the WhisperSpeech pre-trained models and the converted datasets are available on HuggingFace.
Roadmap
- Gather a bigger emotive speech dataset
- Figure out a way to condition the generation on emotions and prosody
- Create a community effort to gather freely licensed speech in multiple languages
- Train final multi-language models
Architecture
The general architecture is similar to AudioLM, SPEAR TTS from Google and MusicGen from Meta. We avoided the NIH syndrome and built it on top of powerful Open Source models: Whisper from OpenAI to generate semantic tokens and perform transcription, EnCodec from Meta for acoustic modeling and Vocos from Charactr Inc as the high-quality vocoder.
Whisper for modeling semantic tokens
We utilize the OpenAI Whisper encoder block to generate embeddings which we then quantize to get semantic tokens.
If the language is already supported by Whisper then this process requires only audio files (without ground truth transcriptions).
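The quantization step can be illustrated with a toy nearest-neighbour vector quantizer. This is only a sketch under stated assumptions: the 2-dimensional "embeddings" and the 4-entry codebook are made-up values, and the names `squared_distance` and `quantize` are hypothetical. The real system works on Whisper encoder embeddings and its quantizer may differ in important details; the example shows only the basic idea of mapping each embedding frame to the index of its closest codebook entry.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each embedding frame to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        nearest = min(
            range(len(codebook)),
            key=lambda i: squared_distance(frame, codebook[i]),
        )
        tokens.append(nearest)
    return tokens


# Toy values, chosen for illustration only (not real Whisper embeddings).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
FRAMES = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7), (0.95, 0.1)]

semantic_tokens = quantize(FRAMES, CODEBOOK)
print(semantic_tokens)  # prints [0, 3, 2, 1]
```

Each output integer stands in for one semantic token, so a sequence of continuous embedding frames becomes a sequence of discrete symbols.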
EnCodec for modeling acoustic tokens
We use EnCodec to model the audio waveform. Out of the box it delivers reasonable quality at 1.5 kbps, and we can bring this up to high quality by using Vocos – a vocoder pretrained on EnCodec tokens.
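The 1.5 kbps figure can be checked with a little arithmetic. The sketch below assumes the configuration described in the EnCodec paper for the 24 kHz model (75 token frames per second, 1024-entry codebooks, two codebooks at the lowest bandwidth setting). These numbers are not stated in this README, and the helper name `bitrate_bps` is hypothetical.

```python
import math


def bitrate_bps(frame_rate_hz: int, codebook_size: int, num_codebooks: int) -> int:
    """Bits per second of a token stream with several codebooks per frame."""
    bits_per_token = int(math.log2(codebook_size))  # 1024 entries -> 10 bits
    return frame_rate_hz * bits_per_token * num_codebooks


# Assumed EnCodec 24 kHz settings (taken from the EnCodec paper, not this README).
FRAME_RATE_HZ = 75
CODEBOOK_SIZE = 1024
NUM_CODEBOOKS = 2

rate = bitrate_bps(FRAME_RATE_HZ, CODEBOOK_SIZE, NUM_CODEBOOKS)
print(f"{rate / 1000:.1f} kbps")  # prints "1.5 kbps"
```

Under these assumptions, each second of audio is described by 150 tokens of 10 bits each, which is why the acoustic token stream is so compact.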
Appreciation
This work would not be possible without the generous sponsorships from:
- Collabora – code development and model training
- LAION – community building and datasets (special thanks to
- Jülich Supercomputing Centre - JUWELS Booster supercomputer
We gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding part of this work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC), with access to compute provided via LAION cooperation on foundation models research.
We’d like to also thank individual contributors for their great help in building this model:
- inevitable-2031 (qwerty_qwer on Discord) for dataset curation
Consulting
We are available to help you with both Open Source and proprietary AI projects. You can reach us via the Collabora website or on Discord.
Citations
We rely on many amazing Open Source projects and research papers:
@article{SpearTTS,
title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
url = {https://arxiv.org/abs/2302.03540},
author = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
publisher = {arXiv},
year = {2023},
}
@article{MusicGen,
title={Simple and Controllable Music Generation},
url = {https://arxiv.org/abs/2306.05284},
author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
publisher={arXiv},
year={2023},
}
@article{Whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
url = {https://arxiv.org/abs/2212.04356},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
publisher = {arXiv},
year = {2022},
}
@article{EnCodec,
title = {High Fidelity Neural Audio Compression},
url = {https://arxiv.org/abs/2210.13438},
author = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
publisher = {arXiv},
year = {2022},
}
@article{Vocos,
title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
url = {https://arxiv.org/abs/2306.00814},
author={Hubert Siuzdak},
publisher={arXiv},
year={2023},
}
File details
Details for the file WhisperSpeech-0.5.1.tar.gz.
File metadata
- Download URL: WhisperSpeech-0.5.1.tar.gz
- Upload date:
- Size: 85.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 619c33ef23887eaa7c3b59bb20672c3cf1e0f865fb994eac33636adbb2c94e73
MD5 | 88bd7f8cb6f2cb3d5d6fde7dd754cdd0
BLAKE2b-256 | af9c481bd900962690f1d946f236f67a83837679b9ab29d95b89e5a3c829b87b
File details
Details for the file WhisperSpeech-0.5.1-py3-none-any.whl.
File metadata
- Download URL: WhisperSpeech-0.5.1-py3-none-any.whl
- Upload date:
- Size: 133.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7e667e0f0b744b4a5207d5d9b89f33a80dcdea9d647d65ae79e1a31f4166be30
MD5 | 7894a977248bb35d7bbf50495da616bb
BLAKE2b-256 | 56c44af42750b7ea2f0c6e73b112c9ea55170b045c3dbbef7e8fb19a396f5349
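A downloaded file can be checked against the digests listed above before installing it. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `file_digest_hex` and `verify` and the local file path are assumptions for illustration; the expected value is the SHA256 digest listed for the wheel.

```python
import hashlib
from pathlib import Path


def file_digest_hex(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    if algorithm == "blake2b-256":
        hasher = hashlib.blake2b(digest_size=32)  # BLAKE2b with a 256-bit digest
    else:
        hasher = hashlib.new(algorithm)  # e.g. "sha256" or "md5"
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(path: Path, expected_hex: str, algorithm: str = "sha256") -> bool:
    """True when the file's digest matches the expected hex string."""
    return file_digest_hex(path, algorithm) == expected_hex.lower()


# SHA256 digest listed for WhisperSpeech-0.5.1-py3-none-any.whl.
EXPECTED_SHA256 = "7e667e0f0b744b4a5207d5d9b89f33a80dcdea9d647d65ae79e1a31f4166be30"

if __name__ == "__main__":
    wheel = Path("WhisperSpeech-0.5.1-py3-none-any.whl")  # assumed download location
    if wheel.exists():
        print("OK" if verify(wheel, EXPECTED_SHA256) else "MISMATCH")
```

The same helper works for the MD5 and BLAKE2b-256 rows by passing `"md5"` or `"blake2b-256"` as the `algorithm` argument.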