
Text-to-speech using neural audio codec and causal language models

Reason this release was yanked:

old version of nemo

Project description

Kani-TTS

A simple and efficient text-to-speech library using neural audio codecs and causal language models.

Features

  • Simple, intuitive API
  • Built on Hugging Face Transformers and NVIDIA NeMo
  • High-quality audio generation using neural codecs
  • GPU acceleration support
  • Multi-speaker model support with easy speaker selection

Installation

From PyPI

pip install kani-tts
pip install -U "transformers==4.57.1"  # pin this version for LFM2
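Because the install step pins a specific Transformers version, it can help to confirm what is actually installed before loading a model. The following is a minimal sketch using only the standard library; the helper name `check_pin` is hypothetical and not part of kani-tts.

```python
from importlib import metadata


def check_pin(package: str, expected: str) -> str:
    """Return 'ok', 'mismatch', or 'missing' for an installed package version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if installed == expected else "mismatch"


if __name__ == "__main__":
    # 4.57.1 is the version pinned in the install command above.
    print("transformers:", check_pin("transformers", "4.57.1"))
```

A result of `mismatch` suggests re-running the pinned `pip install -U` command.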

Quick Start

from kani_tts import KaniTTS

# Initialize model (replace with your model name)
model = KaniTTS('your-model-name-here')

# Generate audio from text
audio, text = model("Hello, world!")

# Save to file (requires soundfile)
model.save_audio(audio, "output.wav")

Advanced Usage

Working with Multi-Speaker Models

Some models support multiple speakers. You can check if your model supports speakers and select a specific voice:

from kani_tts import KaniTTS

model = KaniTTS('your-multispeaker-model-name')

# Check if model supports multiple speakers
print(f"Model type: {model.status}")  # 'singlspeaker' or 'multispeaker'

# Display available speakers (pretty formatted)
model.show_speakers()

# Or access the speaker list directly
print(model.speaker_list)  # ['Speaker1', 'Speaker2', ...]

# Generate audio with a specific speaker
audio, text = model.generate("Hello, world!", speaker_id="Speaker1")
model.save_audio(audio, "speaker1_output.wav")

# Or using the shorthand call syntax
audio, text = model("Hello, world!", speaker_id="Speaker1")
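When the same script has to work with different models, a small fallback for speaker selection can avoid hard-coding a name that a given model does not provide. This is a sketch only: `choose_speaker` is a hypothetical helper, not part of kani-tts, and it assumes `model.speaker_list` is a list of names (or empty/None for single-speaker models).

```python
def choose_speaker(speaker_list, preferred=None):
    """Pick a speaker name, falling back to the first available one.

    Returns None when no speakers are listed (e.g. a single-speaker model).
    """
    if not speaker_list:
        return None
    if preferred in speaker_list:
        return preferred
    return speaker_list[0]


# Example with a made-up speaker list:
print(choose_speaker(["Speaker1", "Speaker2"], preferred="Speaker2"))
```

The returned name could then be passed as `speaker_id`; whether a single-speaker model accepts `speaker_id=None` is an assumption worth checking before relying on it.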

Custom Configuration

from kani_tts import KaniTTS

model = KaniTTS(
    'your-model-name',
    temperature=0.7,           # Control randomness (default: 1.0)
    top_p=0.9,                 # Nucleus sampling (default: 0.95)
    max_new_tokens=2000,       # Max audio length (default: 1200)
    repetition_penalty=1.2,    # Prevent repetition (default: 1.1)
    suppress_logs=True,        # Suppress library logs (default: True)
    show_info=True,            # Show model info on init (default: True)
)

audio, text = model("Your text here")
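To make the effect of `temperature` and `top_p` more concrete, here is a pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy list of logits. It illustrates the general technique only; it is not the code Kani-TTS or Transformers runs internally, and `top_p_filter` is a hypothetical name.

```python
import math


def top_p_filter(logits, temperature=1.0, top_p=0.95):
    """Return {token_index: probability} for the nucleus kept by top-p sampling."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalise over the kept tokens so the probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -5.0]
print(sorted(top_p_filter(toy_logits, temperature=1.0, top_p=0.9)))
```

Lowering `top_p` shrinks the set of candidate tokens, and lowering `temperature` concentrates probability on the most likely ones, which generally makes output more stable and less varied.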

When initialized, Kani-TTS prints a banner with model information:

╔════════════════════════════════════════════════════════════╗
║                                                            ║
║                   N I N E N I N E S I X  😼                ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

              /\_/\
             ( o.o )
              > ^ <

──────────────────────────────────────────────────────────────
  Model: your-model-name
  Device: GPU (CUDA)
  Mode: Multi-speaker (5 speakers)

  Configuration:
    • Sample Rate: 22050 Hz
    • Temperature: 1.0
    • Top-p: 0.95
    • Max Tokens: 1200
    • Repetition Penalty: 1.1
──────────────────────────────────────────────────────────────

  Ready to generate speech! 🎵

You can disable this banner by setting show_info=False, or show it again anytime with model.show_model_info().

Controlling Logging Output

By default, Kani-TTS suppresses all logging output from transformers, NeMo, and PyTorch to keep your console clean. Only your print() statements will be visible.

from kani_tts import KaniTTS

# Default behavior - logs are suppressed
model = KaniTTS('your-model-name')

# To see all library logs (for debugging)
model = KaniTTS('your-model-name', suppress_logs=False)

# You can also manually suppress logs at any time
from kani_tts import suppress_all_logs
suppress_all_logs()
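For finer-grained control than all-or-nothing suppression, the standard `logging` module can raise the threshold of individual loggers. The sketch below is not how `suppress_all_logs` is implemented; the function `quiet_loggers` is hypothetical, and the logger names are assumptions about what each library registers.

```python
import logging


def quiet_loggers(names=("transformers", "nemo_logger", "torch"), level=logging.ERROR):
    """Raise the threshold of the named loggers so lower-severity records are dropped."""
    for name in names:
        logging.getLogger(name).setLevel(level)


# Keep errors visible but hide warnings and info messages.
quiet_loggers()
```

Passing `level=logging.WARNING` would let warnings through again while still hiding info and debug output.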

Working with Audio Output

The generated audio is a NumPy array sampled at 22.05 kHz (22050 Hz):

import soundfile as sf
from kani_tts import KaniTTS

model = KaniTTS('your-model-name')
audio, text = model("Generate speech from this text")

# Audio is a numpy array
print(audio.shape)  # (num_samples,)
print(audio.dtype)  # float32/float64

# Save using soundfile
sf.write('output.wav', audio, 22050)

# Or use the built-in method
model.save_audio(audio, 'output.wav', sample_rate=22050)
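If `soundfile` is not available, a mono float array can also be written as 16-bit PCM with Python's standard `wave` module. This is a sketch under the assumption that samples lie in the range [-1.0, 1.0]; `write_wav_int16` is a hypothetical helper, and a synthetic tone stands in for generated speech.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_int16(path, samples, sample_rate=22050):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Stand-in for generated audio: one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
path = os.path.join(tempfile.gettempdir(), "kani_tts_tone_demo.wav")
write_wav_int16(path, tone)
```

The same function should accept a one-dimensional NumPy array in place of the list, since it only iterates over the samples.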

Playing Audio in Jupyter Notebooks

You can listen to generated audio directly in Jupyter notebooks or IPython:

from kani_tts import KaniTTS
from IPython.display import Audio as aplay

model = KaniTTS('your-model-name')
audio, text = model("Hello, world!")

# Play audio in notebook
aplay(audio, rate=model.sample_rate)

Architecture

Kani-TTS uses a two-stage architecture:

  1. Text → Audio Tokens: A causal language model generates audio token sequences from text
  2. Audio Tokens → Waveform: NVIDIA NeMo's NanoCodec decodes tokens into audio waveforms

The system uses special tokens to mark different segments:

  • Text boundaries (start/end of text)
  • Speech boundaries (start/end of speech)
  • Speaker turns (human/AI)

Audio tokens are organized in 4-channel codebooks, with each channel representing different aspects of the audio signal.
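As a rough illustration of the 4-codebook layout, the sketch below groups a flat token sequence into frames of four and estimates audio duration from a token count. It assumes tokens are interleaved frame by frame and that every generated token is an audio token; the 12.5 frames-per-second rate is taken from the codec spec listed under Model Compatibility. Both helper names are hypothetical.

```python
CODEBOOKS = 4       # channels per frame, from the architecture description
FRAME_RATE = 12.5   # frames per second, from the codec spec


def group_into_frames(tokens, codebooks=CODEBOOKS):
    """Split a flat token list into per-frame groups, dropping an incomplete tail."""
    usable = len(tokens) - len(tokens) % codebooks
    return [tokens[i:i + codebooks] for i in range(0, usable, codebooks)]


def estimated_seconds(num_tokens, codebooks=CODEBOOKS, frame_rate=FRAME_RATE):
    """Approximate audio duration for a given number of audio tokens."""
    return (num_tokens // codebooks) / frame_rate


print(group_into_frames(list(range(10))))  # two full frames; the last 2 tokens are dropped
print(estimated_seconds(1200))             # 24.0
```

Under these assumptions, the default `max_new_tokens=1200` corresponds to roughly 24 seconds of audio, which gives a starting point for sizing that parameter to longer or shorter texts.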

Requirements

  • Python 3.10 or higher
  • CUDA-capable GPU (recommended) or CPU
  • PyTorch 2.0 or higher
  • Transformers library
  • NeMo Toolkit

Model Compatibility

This library works with causal language models trained for TTS with the following characteristics:

  • Extended vocabulary including audio tokens
  • Special tokens for speech/text boundaries
  • Compatible with NeMo NanoCodec (22 kHz, 0.6 kbps, 12.5 fps)
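The codec figures above can be related to each other with some simple arithmetic. The sketch below assumes the bitrate is split evenly across the four codebooks mentioned in the Architecture section; the resulting codebook size is an upper bound implied by the quoted numbers, not a confirmed property of the codec.

```python
SAMPLE_RATE = 22050   # Hz
BITRATE_BPS = 600     # 0.6 kbps
FRAME_RATE = 12.5     # frames per second
CODEBOOKS = 4         # from the Architecture section

samples_per_frame = SAMPLE_RATE / FRAME_RATE           # audio samples covered by one frame
bits_per_frame = BITRATE_BPS / FRAME_RATE              # bits available per frame
bits_per_codebook = bits_per_frame / CODEBOOKS         # bits per codebook entry
max_codes_per_codebook = 2 ** int(bits_per_codebook)   # largest codebook that fits

print(samples_per_frame, bits_per_frame, bits_per_codebook, max_codes_per_codebook)
# 1764.0 48.0 12.0 4096
```

In other words, each frame of four tokens stands in for 1764 waveform samples, which is why a language model can generate speech with comparatively short token sequences.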

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

@inproceedings{emilialarge,
  author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
  title={Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation},
  booktitle={arXiv:2501.15907},
  year={2025}
}
@article{emonet_voice_2025,
  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Friedrich, Felix and Kraus, Maurice and Nadi, Kourosh and Nguyen, Huu and Kersting, Kristian and Auer, Sören},
  title={EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection},
  journal={arXiv preprint arXiv:2506.09827},
  year={2025}
}
