Text-to-speech using neural audio codec and causal language models
Reason this release was yanked: test version
Kani-TTS
A simple and efficient text-to-speech library using neural audio codecs and causal language models.
Features
- Simple, intuitive API
- Built on Hugging Face Transformers and NVIDIA NeMo
- High-quality audio generation using neural codecs
- GPU acceleration support
Installation
From PyPI (once published)
pip install kani-tts
From source
git clone https://github.com/yourusername/kani-tts.git
cd kani-tts
pip install -e .
Optional dependencies
For saving audio files:
pip install "kani-tts[audio]"
For development:
pip install "kani-tts[dev]"
Quick Start
from kani_tts import KaniTTS
# Initialize model (replace with your model name)
model = KaniTTS('your-model-name-here')
# Generate audio from text
audio, text = model("Hello, world!")
# Save to file (requires soundfile)
model.save_audio(audio, "output.wav")
Advanced Usage
Custom Configuration
from kani_tts import KaniTTS
model = KaniTTS(
    'your-model-name',
    temperature=0.7,         # Control randomness (default: 0.6)
    top_p=0.9,               # Nucleus sampling (default: 0.95)
    max_new_tokens=2000,     # Max audio length (default: 1800)
    repetition_penalty=1.2,  # Prevent repetition (default: 1.1)
)
audio, text = model("Your text here")
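To make the sampling parameters more concrete, the sketch below shows how temperature scaling and nucleus (top-p) filtering are commonly applied to a vector of logits. It is a generic NumPy illustration, not Kani-TTS's internal code: the function name `sample_top_p` and the toy logits are hypothetical, and `repetition_penalty` is not covered.

```python
import numpy as np

def sample_top_p(logits, temperature=0.6, top_p=0.95, rng=None):
    """Sample one token id using temperature scaling and top-p filtering."""
    rng = rng or np.random.default_rng()
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    # Sort token ids by probability, highest first.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

# Toy example: token 0 dominates, so a small top_p keeps only that token.
token = sample_top_p([10.0, 0.0, 0.0, 0.0], top_p=0.5)
```

Lower `top_p` values shrink the candidate set, which tends to make output more conservative. Higher `temperature` values spread probability across more tokens.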
Working with Audio Output
The generated audio is a NumPy array sampled at 22.05 kHz (22050 Hz):
import numpy as np
import soundfile as sf
audio, text = model("Generate speech from this text")
# Audio is a numpy array
print(audio.shape) # (num_samples,)
print(audio.dtype) # float32/float64
# Save using soundfile
sf.write('output.wav', audio, 22050)
# Or use the built-in method
model.save_audio(audio, 'output.wav', sample_rate=22050)
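As a small illustration of working with the returned array, this sketch computes a clip's duration and converts float samples to 16-bit PCM. A synthetic sine wave stands in for model output so the snippet runs without a model. `to_int16` is a hypothetical helper, and it assumes the samples lie roughly in the range [-1, 1].

```python
import numpy as np

SAMPLE_RATE = 22050  # Hz, matching the sample rate used above

def to_int16(audio):
    """Clip float samples to [-1, 1] and scale them to 16-bit PCM."""
    clipped = np.clip(audio, -1.0, 1.0)
    return (clipped * 32767).astype(np.int16)

# Stand-in for model output: one second of a 440 Hz tone at half amplitude.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
audio = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)

duration_s = audio.shape[0] / SAMPLE_RATE  # number of samples / samples per second
pcm = to_int16(audio)
print(f"{duration_s:.2f} s, dtype={pcm.dtype}")
```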
Batch Processing
texts = [
    "First sentence to synthesize.",
    "Second sentence to synthesize.",
    "Third sentence to synthesize."
]
for i, text in enumerate(texts):
    audio, _ = model(text)
    model.save_audio(audio, f"output_{i}.wav")
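If a single file is preferable to one file per sentence, the clips can be joined with a short pause between them. The following is a minimal sketch using plain NumPy. `join_with_silence` is a hypothetical helper, not part of the library, and constant arrays stand in for generated clips.

```python
import numpy as np

SAMPLE_RATE = 22050  # Hz

def join_with_silence(clips, pause_s=0.3, sample_rate=SAMPLE_RATE):
    """Concatenate 1-D audio clips, inserting a short silence between them."""
    if not clips:
        return np.zeros(0, dtype=np.float32)
    pause = np.zeros(round(pause_s * sample_rate), dtype=np.float32)
    pieces = []
    for i, clip in enumerate(clips):
        if i > 0:
            pieces.append(pause)  # silence goes between clips, not at the ends
        pieces.append(np.asarray(clip, dtype=np.float32))
    return np.concatenate(pieces)

# Stand-ins for three generated clips of different lengths.
clips = [np.zeros(1000), np.ones(2000), np.zeros(500)]
combined = join_with_silence(clips, pause_s=0.5)
```

The combined array can then be written once with `sf.write` or `model.save_audio`.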
Architecture
Kani-TTS uses a two-stage architecture:
- Text → Audio Tokens: A causal language model generates audio token sequences from text
- Audio Tokens → Waveform: NVIDIA NeMo's nano codec decodes tokens into audio waveforms
The system uses special tokens to mark different segments:
- Text boundaries (start/end of text)
- Speech boundaries (start/end of speech)
- Speaker turns (human/AI)
Audio tokens are organized into 4 codebook channels, with each channel representing a different aspect of the audio signal.
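To picture the 4-channel layout, the sketch below regroups a flat token sequence into per-frame rows of four codebook entries. It assumes the language model emits tokens interleaved frame by frame (channel 0, 1, 2, 3, then the next frame), which is not confirmed here. `to_frames` and the toy token values are hypothetical.

```python
import numpy as np

NUM_CODEBOOKS = 4  # from the description above

def to_frames(flat_tokens, num_codebooks=NUM_CODEBOOKS):
    """Reshape a flat token sequence into (num_frames, num_codebooks).

    Assumes frame-by-frame interleaving; trailing tokens that do not
    fill a complete frame are dropped.
    """
    tokens = np.asarray(flat_tokens)
    usable = (tokens.shape[0] // num_codebooks) * num_codebooks
    return tokens[:usable].reshape(-1, num_codebooks)

# Nine toy tokens: two complete frames plus one leftover token.
frames = to_frames([11, 21, 31, 41, 12, 22, 32, 42, 13])
print(frames.shape)  # (2, 4)
```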
Requirements
- Python 3.10 or higher
- CUDA-capable GPU (recommended) or CPU
- PyTorch 2.0 or higher
- Transformers library
- NeMo Toolkit
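A quick way to check these requirements before loading a model is sketched below using only the standard library. The import names `torch`, `transformers`, and `nemo` are assumed to correspond to the packages listed above, `check_environment` is a hypothetical helper, and the PyTorch version itself is not checked.

```python
import importlib.util
import sys

def check_environment(modules=("torch", "transformers", "nemo")):
    """Report whether the Python version and required packages look usable."""
    report = {"python>=3.10": sys.version_info >= (3, 10)}
    for name in modules:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report

for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'missing'}")
```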
Model Compatibility
This library works with causal language models trained for TTS with the following characteristics:
- Extended vocabulary including audio tokens
- Special tokens for speech/text boundaries
- Compatible with NeMo nano codec (22 kHz, 0.6 kbps, 12.5 fps)
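The codec figures above can be related to the 4 codebook channels described in the Architecture section with a little arithmetic. This sketch assumes the bitrate is split evenly across codebooks and that each frame contributes exactly one token per codebook. Neither assumption is confirmed here, and special tokens would reduce the real audio duration.

```python
BITRATE_BPS = 600      # 0.6 kbps
FRAME_RATE_HZ = 12.5   # codec frames per second
NUM_CODEBOOKS = 4      # from the Architecture section

bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ        # 48.0
bits_per_codebook = bits_per_frame / NUM_CODEBOOKS  # 12.0, assuming an even split
entries_per_codebook = 2 ** int(bits_per_codebook)  # 4096
tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS   # 50.0

def max_duration_s(max_new_tokens):
    """Upper bound on audio length if every generated token were an audio token."""
    return max_new_tokens / tokens_per_second

print(max_duration_s(1800))  # 36.0
```

Under these assumptions, the default `max_new_tokens=1800` corresponds to at most about 36 seconds of audio.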
License
MIT License - see LICENSE file for details
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Citation
If you use Kani-TTS in your research, please cite:
@software{kani_tts,
title = {Kani-TTS: Text-to-Speech using Neural Audio Codec},
author = {Your Name},
year = {2024},
url = {https://github.com/yourusername/kani-tts}
}
Acknowledgments
- Built on Hugging Face Transformers
- Uses NVIDIA NeMo audio codec
- Powered by PyTorch
File details
Details for the file kani_tts-0.0.1.tar.gz.
File metadata
- Download URL: kani_tts-0.0.1.tar.gz
- Upload date:
- Size: 7.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 629839bce7dbc19ae9c95ca983b646ed2a8d83af9ae7acc902bdc8668c80b23d |
| MD5 | 604f82647fa712bc13f2f3eddfb848f6 |
| BLAKE2b-256 | 8073e09a87bdf8ab9951a52881459f99e52c621d0d67881800db3021edc5626b |
File details
Details for the file kani_tts-0.0.1-py3-none-any.whl.
File metadata
- Download URL: kani_tts-0.0.1-py3-none-any.whl
- Upload date:
- Size: 7.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bcf926365e97ef408a29aa35516665289b1e16c51552733959b8c0a79c0d0e05 |
| MD5 | b63dcb32f6b8303d210dddddb3db99f1 |
| BLAKE2b-256 | 73875e3c2c108a86958f2a4a5058093485adbf409b59fd7babb8eb2bcb4bbb4f |