Simple, hackable text-to-speech with PyTorch or MLX.

Nanospeech

A simple, hackable text-to-speech system in PyTorch and MLX

Nanospeech is a research-oriented project to build a minimal, easy-to-understand text-to-speech system that scales to any level of compute. It supports voice matching from a reference speech sample and comes with several built-in voices.

An 82M-parameter pretrained model (English-only) is available. It was trained on a single H100 GPU in a few days using only public domain data. The model is intentionally small so that it can serve as a reproducible baseline and allow fast inference. On recent M-series Apple Silicon or Nvidia GPUs, speech can be generated roughly 3-5x faster than realtime.
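As a rough worked example of what a 3-5x realtime speedup means, the arithmetic can be sketched as below. The function name and the 30-second clip length are illustrative assumptions; actual speed depends on hardware and settings.

```python
def generation_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of speech
    when generation runs `speedup` times faster than realtime."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


clip = 30.0  # seconds of generated audio (illustrative value)
slow = generation_time_seconds(clip, 3.0)  # lower end of the quoted range
fast = generation_time_seconds(clip, 5.0)  # upper end of the quoted range
print(f"{clip:.0f}s of speech takes about {fast:.0f}-{slow:.0f}s to generate")
```

So a 30-second clip would take on the order of 6 to 10 seconds at the quoted speeds.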

All code and pretrained models are available under the MIT license, so you can modify and/or distribute them as you'd like.

Details

Nanospeech is based on a current line of research in text-to-speech systems that jointly learn text alignment and waveform generation. It's designed to use minimal input data (just audio and text) and to avoid auxiliary models such as forced aligners or phonemizers.
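To illustrate what phonemizer-free text input can look like, here is a minimal sketch that assumes raw UTF-8 bytes are used as token IDs, in the style of character-based models such as E2 TTS. The function name, the +1 offset, and the padding value are assumptions made for illustration, not Nanospeech's actual tokenizer.

```python
from typing import List

PAD_ID = 0  # assumed padding/filler token; not taken from Nanospeech


def text_to_byte_tokens(text: str, length: int) -> List[int]:
    """Encode text as UTF-8 byte IDs (shifted by 1 so 0 stays free for
    padding) and right-pad the sequence to `length`."""
    ids = [b + 1 for b in text.encode("utf-8")]
    if len(ids) > length:
        raise ValueError("text is longer than the target length")
    return ids + [PAD_ID] * (length - len(ids))


tokens = text_to_byte_tokens("Hi!", 6)
print(tokens)  # [73, 106, 34, 0, 0, 0]
```

No pronunciation dictionary or alignment step is involved here; in this family of models the network is left to learn the mapping from characters to audio on its own.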

There are two single-file implementations, one in PyTorch and one in MLX, which are kept nearly line-for-line equivalent where possible, making them easy to experiment with and modify. Each implementation is around 1,500 lines of code.

Quick Start

pip install -U nanospeech

To include MLX support, install with the mlx extra (quoted so that shells such as zsh don't interpret the brackets):

pip install -U "nanospeech[mlx]"

To generate speech:

python -m nanospeech.generate --text "The quick brown fox jumps over the lazy dog."

Voices

Use the --voice parameter to select the voice used for speech:

celeste

luna

nash

orion

rhea

Note that these voices are all based on samples from the LibriTTS-R dataset.
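If you want to drive the command line from a script, a minimal sketch might look like the following. It only uses the --text and --voice flags shown in this README; the helper name build_generate_command is hypothetical, and the voice check simply mirrors the list above.

```python
import sys
from typing import List, Optional

# Built-in voice names as listed in this README.
VOICES = ("celeste", "luna", "nash", "orion", "rhea")


def build_generate_command(text: str, voice: Optional[str] = None) -> List[str]:
    """Build the argv list for `python -m nanospeech.generate`."""
    if voice is not None and voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    cmd = [sys.executable, "-m", "nanospeech.generate", "--text", text]
    if voice is not None:
        cmd += ["--voice", voice]
    return cmd


cmd = build_generate_command(
    "The quick brown fox jumps over the lazy dog.", voice="luna"
)
print(" ".join(cmd[1:]))

# To actually run it (requires nanospeech to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.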

Voice Matching

You can also provide a speech sample and its transcript to match a specific voice, although the pretrained model has limited voice matching capabilities. See python -m nanospeech.generate --help for a full list of options to customize the voice.

Training a Model

Nanospeech includes a PyTorch-based trainer using Accelerate, and is compatible with DistributedDataParallel for multi-GPU training.

It supports streaming from any WebDataset, but it should be straightforward to swap in your own dataloader as well. An ideal dataset consists of high-quality speech paired with clean transcriptions.
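For readers unfamiliar with the format, a WebDataset shard is a tar archive in which files that share a basename form one sample and the extension names each field. The sketch below builds and reads such a shard with only the standard library; the wav/txt field names and the utt0001-style keys are assumptions for illustration, and the field names the Nanospeech trainer expects may differ.

```python
import io
import tarfile
from collections import defaultdict
from typing import Dict


def write_shard(samples: Dict[str, Dict[str, bytes]]) -> bytes:
    """Pack {key: {extension: payload}} into an in-memory tar shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_shard(shard: bytes) -> Dict[str, Dict[str, bytes]]:
    """Group tar members back into samples by their shared basename."""
    grouped: Dict[str, Dict[str, bytes]] = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        for member in tar:
            # Split at the first dot: "utt0001.wav" -> ("utt0001", "wav")
            key, ext = member.name.split(".", 1)
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


shard = write_shard({
    "utt0001": {"wav": b"<audio bytes>", "txt": b"Hello there."},
    "utt0002": {"wav": b"<audio bytes>", "txt": b"General greeting."},
})
samples = read_shard(shard)
print(sorted(samples), samples["utt0001"]["txt"].decode())
```

Each sample pairs audio with its transcript, which is the shape of data the paragraph above describes.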

See the guide for an example of training both the base model and the duration predictor on the LibriTTS-R dataset.

Limitations

As a research project, the pretrained model that comes with Nanospeech isn't designed for production usage. It may mispronounce words, has limited capability to match out-of-distribution voices, and can't generate very long speech samples.
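One possible workaround for the length limit is to split long text into sentence-sized chunks and generate each chunk separately. This is a minimal sketch of that idea, not something Nanospeech provides; the function name and the 200-character default are arbitrary assumptions.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars`.

    A single sentence longer than `max_chars` is kept whole rather than
    being cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_into_chunks(
    "First sentence. Second one! A third, longer sentence?", max_chars=30
)
print(chunks)
# ['First sentence. Second one!', 'A third, longer sentence?']
```

Each chunk could then be passed to the generator in turn and the resulting audio concatenated, at the cost of possible prosody breaks at chunk boundaries.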

However, the underlying architecture should scale up well to significantly more compute and larger datasets, so if training your own model is attractive, you can extend it to perform high-quality voice matching, multilingual speech generation, emotional expression, etc.

Citations

@article{tang2025diffusion,
    title     = {Diffusion Models without Classifier-free Guidance},
    author    = {Tang, Zhicong and Bao, Jianmin and Chen, Dong and Guo, Baining},
    year      = {2025},
    url       = {https://api.semanticscholar.org/CorpusID:276421312}
}

@article{chen-etal-2024-f5tts,
    title     = {F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, 
    author    = {Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
    year      = {2024},
    url       = {https://api.semanticscholar.org/CorpusID:273228169}
}

@inproceedings{Eskimez2024E2TE,
    title     = {E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS},
    author    = {Sefik Emre Eskimez and Xiaofei Wang and Manthan Thakker and Canrun Li and Chung-Hsien Tsai and Zhen Xiao and Hemin Yang and Zirun Zhu and Min Tang and Xu Tan and Yanqing Liu and Sheng Zhao and Naoyuki Kanda},
    year      = {2024},
    url       = {https://api.semanticscholar.org/CorpusID:270738197}
}

@article{Le2023VoiceboxTM,
    title     = {Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale},
    author    = {Matt Le and Apoorv Vyas and Bowen Shi and Brian Karrer and Leda Sari and Rashel Moritz and Mary Williamson and Vimal Manohar and Yossi Adi and Jay Mahadeokar and Wei-Ning Hsu},
    year      = {2023},
    url       = {https://api.semanticscholar.org/CorpusID:259275061}
}

@article{tong2023generalized,
    title     = {Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport},
    author    = {Alexander Tong and Joshua Fan and Ricky T. Q. Chen and Jesse Bettencourt and David Duvenaud},
    year      = {2023},
    url       = {https://api.semanticscholar.org/CorpusID:259847293}
}

@article{peebles2022scalable,
    title     = {Scalable Diffusion Models with Transformers},
    author    = {Peebles, William and Xie, Saining},
    year      = {2022},
    url       = {https://api.semanticscholar.org/CorpusID:254854389}
}

@article{lipman2022flow,
    title     = {Flow Matching for Generative Modeling},
    author    = {Yaron Lipman and Ricky T. Q. Chen and Heli Ben-Hamu and Maximilian Nickel and Matt Le},
    year      = {2022},
    url       = {https://api.semanticscholar.org/CorpusID:252734897}
}

@article{koizumi2023librittsr,
    title     = {LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus},
    author    = {Yuma Koizumi and Heiga Zen and Shigeki Karita and Yifan Ding and Kohei Yatabe and Nobuyuki Morioka and Michiel Bacchiani and Yu Zhang and Wei Han and Ankur Bapna},
    year      = {2023},
    url       = {https://api.semanticscholar.org/CorpusID:258967444}
}

License

The code in this repository is released under the MIT license as found in the LICENSE file.

Release history

0.0.6 (this version): Feb 23, 2025
0.0.5: Feb 14, 2025
0.0.4: Feb 11, 2025
0.0.3: Feb 10, 2025
0.0.2: Feb 9, 2025
0.0.1: Feb 9, 2025

Download files

Source Distribution

nanospeech-0.0.6.tar.gz (261.3 kB), uploaded Feb 23, 2025, Source

Built Distribution

nanospeech-0.0.6-py3-none-any.whl (259.1 kB), uploaded Feb 23, 2025, Python 3

File details

Details for the file nanospeech-0.0.6.tar.gz.

File metadata

Download URL: nanospeech-0.0.6.tar.gz
Upload date: Feb 23, 2025
Size: 261.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.21

File hashes

Hashes for nanospeech-0.0.6.tar.gz
Algorithm	Hash digest
SHA256	`93668e211079748717dc8641c32413e295f84b693b4a7275ce225bd5fef1efe6`
MD5	`9996eecbd4e20d978d67cbd7c6dd3bdc`
BLAKE2b-256	`eba43514ad24eb2511a268444e9b5964ebd226440e30240a36bba38458cefe0c`

File details

Details for the file nanospeech-0.0.6-py3-none-any.whl.

File metadata

Download URL: nanospeech-0.0.6-py3-none-any.whl
Upload date: Feb 23, 2025
Size: 259.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.21

File hashes

Hashes for nanospeech-0.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ebc69f8412cd13d774c7dd110437e468567ae0cbc3f9e8914158e17fdc9e58ab`
MD5	`a43a5601b8ebf8badee6b74013bb0fca`
BLAKE2b-256	`09454d04ab338e40617334a2886c9beb02ccf76dd76708e2439045d1bd640bcc`
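
A downloaded file can be checked against the SHA256 digests listed above with a few lines of standard-library Python. This is a minimal sketch; the helper names sha256_of and matches_published are hypothetical, and the file is assumed to sit in the current directory under its original name.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the hash tables above.
PUBLISHED_SHA256 = {
    "nanospeech-0.0.6.tar.gz":
        "93668e211079748717dc8641c32413e295f84b693b4a7275ce225bd5fef1efe6",
    "nanospeech-0.0.6-py3-none-any.whl":
        "ebc69f8412cd13d774c7dd110437e468567ae0cbc3f9e8914158e17fdc9e58ab",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published(path: Path) -> bool:
    """True if the file's name is known and its digest matches."""
    expected = PUBLISHED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected


# Example (assumes the file has been downloaded already):
# print(matches_published(Path("nanospeech-0.0.6.tar.gz")))
```

A mismatch means the file is corrupted or is not the artifact that was published, and it should not be installed.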
