Text-to-speech using neural audio codec and causal language models

Kani-TTS

A simple and efficient text-to-speech library using neural audio codecs and causal language models.

Features

  • Simple, intuitive API with flexible generation parameters
  • Built on Hugging Face Transformers and NVIDIA NeMo
  • High-quality audio generation using neural codecs
  • GPU acceleration support
  • Multi-speaker model support with easy speaker selection
  • Per-generation parameter control for maximum flexibility

Installation

From PyPI

pip install kani-tts
pip install -U transformers  # upgrade transformers; required for LFM2 support

Quick Start

from kani_tts import KaniTTS

# Initialize model (replace with your model name)
model = KaniTTS('your-model-name-here')

# Generate audio from text
audio, text = model("Hello, world!")

# Save to file (requires soundfile)
model.save_audio(audio, "output.wav")

Advanced Usage

Working with Multi-Speaker Models

Some models support multiple speakers. You can check whether your model is multi-speaker and select a specific voice:

from kani_tts import KaniTTS

model = KaniTTS('your-multispeaker-model-name')

# Check if model supports multiple speakers
print(f"Model type: {model.status}")  # 'singlspeaker' or 'multispeaker'

# Display available speakers (pretty formatted)
model.show_speakers()

# Or access the speaker list directly
print(model.speaker_list)  # ['Speaker1', 'Speaker2', ...]

# Generate audio with a specific speaker
audio, text = model.generate("Hello, world!", speaker_id="Speaker1")
model.save_audio(audio, "speaker1_output.wav")

# Or using the shorthand call syntax
audio, text = model("Hello, world!", speaker_id="Speaker1")
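When comparing voices, it can help to validate speaker names before generating. The helper below is a minimal sketch, not part of Kani-TTS: `generate_for_speakers` is a hypothetical name, and it only assumes that the model object exposes `speaker_list` and accepts `speaker_id` in the call, as shown above.

```python
def generate_for_speakers(model, text, speakers=None):
    """Generate `text` once per speaker and return {speaker: audio}.

    Hypothetical helper (not a Kani-TTS API). `model` is assumed to expose
    `speaker_list` and to be callable as model(text, speaker_id=...),
    returning an (audio, text) tuple.
    """
    available = list(getattr(model, "speaker_list", None) or [])
    requested = list(speakers) if speakers is not None else available

    # Fail early on misspelled names instead of passing them to the model.
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"Unknown speaker(s) {unknown}; available: {available}")

    results = {}
    for speaker in requested:
        audio, _ = model(text, speaker_id=speaker)
        results[speaker] = audio
    return results
```

Calling `generate_for_speakers(model, "Hello, world!")` would return one audio array per entry in `model.speaker_list`, which can then be saved with `model.save_audio`.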

Custom Configuration

from kani_tts import KaniTTS

# Initialize model with model-level parameters
model = KaniTTS(
    'your-model-name',
    max_new_tokens=3000,       # Max audio length (default: 3000)
    suppress_logs=True,        # Suppress library logs (default: True)
    show_info=True,            # Show model info on init (default: True)
)

# Generate with custom sampling parameters per call
audio, text = model(
    "Your text here",
    temperature=0.7,           # Control randomness (default: 1.0)
    top_p=0.9,                 # Nucleus sampling (default: 0.95)
    repetition_penalty=1.2,    # Prevent repetition (default: 1.1)
)

# You can use different parameters for each generation
audio2, text2 = model(
    "Another text",
    temperature=1.2,
    top_p=0.85,
)

API Change: Generation parameters (temperature, top_p, repetition_penalty) are now passed per-generation call instead of during initialization. This allows you to experiment with different sampling strategies without reloading the model.
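Because sampling parameters are passed per call, a small parameter sweep can reuse one loaded model. The following is a sketch under that assumption; `sweep_sampling` is a hypothetical helper, and it only relies on the model being callable with `temperature` and `top_p` keyword arguments as in the example above.

```python
from itertools import product


def sweep_sampling(model, text, temperatures, top_ps):
    """Call `model` once per (temperature, top_p) pair.

    Hypothetical helper (not a Kani-TTS API). Returns a list of
    (params, audio) tuples in the order the combinations were tried.
    """
    runs = []
    for temperature, top_p in product(temperatures, top_ps):
        params = {"temperature": temperature, "top_p": top_p}
        audio, _ = model(text, **params)  # the model is loaded only once
        runs.append((params, audio))
    return runs
```

For example, `sweep_sampling(model, "Your text here", [0.7, 1.0], [0.85, 0.95])` produces four generations that can be saved under different file names and compared by ear.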

On initialization, Kani-TTS prints a banner with model information:

╔════════════════════════════════════════════════════════════╗
║                                                            ║
║                   N I N E N I N E S I X  😼                ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

              /\_/\
             ( o.o )
              > ^ <

──────────────────────────────────────────────────────────────
  Model: your-model-name
  Device: GPU (CUDA)
  Mode: Multi-speaker (5 speakers)

  Configuration:
    • Sample Rate: 22050 Hz
    • Max Tokens: 3000
──────────────────────────────────────────────────────────────

  Ready to generate speech! 🎵

You can disable this banner by setting show_info=False, or show it again anytime with model.show_model_info().

Controlling Logging Output

By default, Kani-TTS suppresses all logging output from transformers, NeMo, and PyTorch to keep your console clean. Only your print() statements will be visible.

from kani_tts import KaniTTS

# Default behavior - logs are suppressed
model = KaniTTS('your-model-name')

# To see all library logs (for debugging)
model = KaniTTS('your-model-name', suppress_logs=False)

# You can also manually suppress logs at any time
from kani_tts import suppress_all_logs
suppress_all_logs()
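If you only want quiet output around a single block of code, the standard `logging` module can do this temporarily. This is a sketch of the general technique, not how `suppress_all_logs` is implemented; `quiet_loggers` is a hypothetical name, and the logger names you pass in (for example `"transformers"`) are assumptions you should check against the libraries you have installed.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_loggers(names, level=logging.ERROR):
    """Temporarily raise the level of the named loggers, then restore it.

    Hypothetical helper (not a Kani-TTS API).
    """
    loggers = [logging.getLogger(name) for name in names]
    previous = [logger.level for logger in loggers]
    for logger in loggers:
        logger.setLevel(level)
    try:
        yield
    finally:
        # Restore the original levels even if the body raised.
        for logger, old_level in zip(loggers, previous):
            logger.setLevel(old_level)
```

Used as `with quiet_loggers(["transformers"]): ...`, messages below `ERROR` from those loggers are dropped inside the block, and the previous levels apply again afterwards.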

Working with Audio Output

The generated audio is a NumPy array sampled at 22.05 kHz (22,050 Hz):

import numpy as np
import soundfile as sf

audio, text = model("Generate speech from this text")

# Audio is a numpy array
print(audio.shape)  # (num_samples,)
print(audio.dtype)  # float32/float64

# Save using soundfile
sf.write('output.wav', audio, 22050)

# Or use the built-in method
model.save_audio(audio, 'output.wav', sample_rate=22050)
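If `soundfile` is not available, a mono array can also be written with Python's standard library. The sketch below assumes the samples are floats in the range [-1.0, 1.0] (values outside that range are clipped); `write_wav_pcm16` is a hypothetical helper, not part of Kani-TTS.

```python
import array
import sys
import wave


def write_wav_pcm16(path, samples, sample_rate=22050):
    """Write mono float samples as a 16-bit PCM WAV file.

    Hypothetical helper. ASSUMPTION: samples lie in [-1.0, 1.0];
    anything outside is clipped. Returns the duration in seconds.
    """
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))

    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        pcm.byteswap()

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())

    return len(pcm) / sample_rate
```

The function iterates sample by sample, so it is slower than `sf.write` for long clips, but it works on any sequence of floats, including a NumPy array.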

Playing Audio in Jupyter Notebooks

You can listen to generated audio directly in Jupyter notebooks or IPython:

from kani_tts import KaniTTS
from IPython.display import Audio as aplay

model = KaniTTS('your-model-name')
audio, text = model("Hello, world!")

# Play audio in notebook
aplay(audio, rate=model.sample_rate)

Architecture

Kani-TTS uses a two-stage architecture:

  1. Text → Audio Tokens: A causal language model generates audio token sequences from text
  2. Audio Tokens → Waveform: NVIDIA NeMo's NanoCodec decodes tokens into audio waveforms

The system uses special tokens to mark different segments:

  • Text boundaries (start/end of text)
  • Speech boundaries (start/end of speech)
  • Speaker turns (human/AI)

Audio tokens are organized into four codebook channels, each representing a different aspect of the audio signal.
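To make the four-channel layout concrete, the sketch below regroups a flat token sequence into per-codebook channels. The frame-major interleaving (`[c0, c1, c2, c3, c0, ...]`) is an assumption made for illustration; the source does not specify how Kani-TTS orders tokens internally, and `split_into_codebooks` is a hypothetical helper.

```python
def split_into_codebooks(flat_tokens, num_codebooks=4):
    """Group a flat token list into `num_codebooks` channels.

    ASSUMPTION (not confirmed by the library): tokens are stored frame by
    frame, one token per codebook, i.e. [c0, c1, c2, c3, c0, c1, ...].
    A trailing partial frame is dropped.
    """
    usable = len(flat_tokens) - len(flat_tokens) % num_codebooks
    return [
        list(flat_tokens[channel:usable:num_codebooks])
        for channel in range(num_codebooks)
    ]


channels = split_into_codebooks([10, 20, 30, 40, 11, 21, 31, 41])
print(channels)  # [[10, 11], [20, 21], [30, 31], [40, 41]]
```

Each inner list then holds one codebook's tokens across frames, which is the shape a codec decoder typically expects: one row per codebook, one column per frame.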

Requirements

  • Python 3.10 or higher
  • CUDA-capable GPU (recommended) or CPU
  • PyTorch 2.0 or higher
  • Transformers library
  • NeMo Toolkit

Model Compatibility

This library works with causal language models trained for TTS with the following characteristics:

  • Extended vocabulary including audio tokens
  • Special tokens for speech/text boundaries
  • Compatible with the NeMo NanoCodec (22 kHz, 0.6 kbps, 12.5 fps)
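The codec figures above imply a simple token budget. The arithmetic below is a rough sketch: it takes the nominal 0.6 kbps and 12.5 fps at face value, uses the four codebook channels described under Architecture, and assumes one generated token per codebook per frame, so the resulting duration is an approximate upper bound rather than a documented limit.

```python
BITRATE_BPS = 600        # 0.6 kbps (nominal)
FRAME_RATE_FPS = 12.5    # codec frames per second
NUM_CODEBOOKS = 4        # codebook channels per frame

bits_per_frame = BITRATE_BPS / FRAME_RATE_FPS        # 48.0 bits
bits_per_codebook = bits_per_frame / NUM_CODEBOOKS   # 12.0 bits


def max_audio_seconds(max_new_tokens,
                      num_codebooks=NUM_CODEBOOKS,
                      frame_rate=FRAME_RATE_FPS):
    """Approximate audio length for a given token budget.

    ASSUMPTION: every generated token is an audio token and each frame
    uses exactly `num_codebooks` tokens. Special tokens are ignored.
    """
    frames = max_new_tokens // num_codebooks
    return frames / frame_rate


print(max_audio_seconds(3000))  # 60.0
```

Under these assumptions, the default `max_new_tokens=3000` corresponds to roughly one minute of audio; longer texts would need a larger budget or splitting into chunks.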

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

@inproceedings{emilialarge,
  author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
  title={Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation},
  booktitle={arXiv:2501.15907},
  year={2025}
}
@article{emonet_voice_2025,
  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Friedrich, Felix and Kraus, Maurice and Nadi, Kourosh and Nguyen, Huu and Kersting, Kristian and Auer, Sören},
  title={EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection},
  journal={arXiv preprint arXiv:2506.09827},
  year={2025}
}
