Text-to-speech using neural audio codec and causal language models

Kani-TTS

A simple and efficient text-to-speech library using neural audio codecs and causal language models.

Features

Simple, intuitive API
Built on Hugging Face Transformers and NVIDIA NeMo
High-quality audio generation using neural codecs
GPU acceleration support
Multi-speaker model support with easy speaker selection

Installation

From PyPI

pip install kani-tts
pip install -U "transformers==4.57.1"  # pin Transformers for LFM2 support

Quick Start

from kani_tts import KaniTTS

# Initialize model (replace with your model name)
model = KaniTTS('your-model-name-here')

# Generate audio from text
audio, text = model("Hello, world!")

# Save to file (requires soundfile)
model.save_audio(audio, "output.wav")

Advanced Usage

Working with Multi-Speaker Models

Some models support multiple speakers. You can check whether your model is multi-speaker, list the available voices, and select one by name:

from kani_tts import KaniTTS

model = KaniTTS('your-multispeaker-model-name')

# Check if model supports multiple speakers
print(f"Model type: {model.status}")  # 'singlespeaker' or 'multispeaker'

# Display available speakers (pretty formatted)
model.show_speakers()

# Or access the speaker list directly
print(model.speaker_list)  # ['Speaker1', 'Speaker2', ...]

# Generate audio with a specific speaker
audio, text = model.generate("Hello, world!", speaker_id="Speaker1")
model.save_audio(audio, "speaker1_output.wav")

# Or using the shorthand call syntax
audio, text = model("Hello, world!", speaker_id="Speaker1")
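
When the speaker name comes from user input or a config file, it can help to validate it against the speaker list before generating. The helper below is a minimal sketch and not part of Kani-TTS: `resolve_speaker` is a hypothetical name, and it assumes only that `model.speaker_list` is a list of strings as shown above.

```python
def resolve_speaker(requested, speaker_list):
    """Return the entry of speaker_list matching `requested`, ignoring case.

    Hypothetical helper (not part of Kani-TTS). Raises ValueError when the
    name is unknown, so a typo fails before any audio is generated.
    """
    lookup = {name.lower(): name for name in speaker_list}
    try:
        return lookup[requested.strip().lower()]
    except KeyError:
        available = ", ".join(speaker_list) or "none"
        raise ValueError(
            f"Unknown speaker {requested!r}; available: {available}"
        ) from None


# Usage with a loaded multi-speaker model:
# speaker = resolve_speaker("speaker1", model.speaker_list)
# audio, text = model("Hello, world!", speaker_id=speaker)
```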

Custom Configuration

from kani_tts import KaniTTS

model = KaniTTS(
    'your-model-name',
    temperature=0.7,           # Control randomness (default: 1.0)
    top_p=0.9,                 # Nucleus sampling (default: 0.95)
    max_new_tokens=2000,       # Max audio length (default: 1200)
    repetition_penalty=1.2,    # Prevent repetition (default: 1.1)
    suppress_logs=True,        # Suppress library logs (default: True)
    show_info=True,            # Show model info on init (default: True)
)

audio, text = model("Your text here")
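
To give a feel for what `temperature` and `top_p` do, here is a generic, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering on a toy next-token distribution. It illustrates the standard technique only; it is not the code Kani-TTS or Transformers runs, and `filter_distribution` is a hypothetical name.

```python
import math


def filter_distribution(logits, temperature=1.0, top_p=0.95):
    """Apply temperature scaling, then nucleus (top-p) filtering.

    Returns a dict {token_index: probability} over the kept tokens,
    renormalised to sum to 1. Generic illustration only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: divide logits before the softmax. Values below 1 sharpen
    # the distribution; values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the most probable tokens until their cumulative
    # probability reaches top_p, then drop the rest.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]
print(filter_distribution(logits, temperature=1.0, top_p=0.95))
print(filter_distribution(logits, temperature=0.7, top_p=0.9))
```

Lower `temperature` and lower `top_p` both concentrate sampling on the most likely tokens, which tends to trade variety for stability.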

When initialized, Kani-TTS prints a banner with the model name, device, speaker mode, and generation settings:

╔════════════════════════════════════════════════════════════╗
║                                                            ║
║                   N I N E N I N E S I X  😼                ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

              /\_/\
             ( o.o )
              > ^ <

──────────────────────────────────────────────────────────────
  Model: your-model-name
  Device: GPU (CUDA)
  Mode: Multi-speaker (5 speakers)

  Configuration:
    • Sample Rate: 22050 Hz
    • Temperature: 1.0
    • Top-p: 0.95
    • Max Tokens: 1200
    • Repetition Penalty: 1.1
──────────────────────────────────────────────────────────────

  Ready to generate speech! 🎵

You can disable this banner by setting show_info=False, or show it again anytime with model.show_model_info().

Controlling Logging Output

By default, Kani-TTS suppresses all logging output from transformers, NeMo, and PyTorch to keep your console clean. Only your print() statements will be visible.

from kani_tts import KaniTTS

# Default behavior - logs are suppressed
model = KaniTTS('your-model-name')

# To see all library logs (for debugging)
model = KaniTTS('your-model-name', suppress_logs=False)

# You can also manually suppress logs at any time
from kani_tts import suppress_all_logs
suppress_all_logs()
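
If you only want to silence specific libraries for part of a script, the standard `logging` module is enough. The context manager below is a sketch under that assumption: `quiet_loggers` is a hypothetical helper, and it does not describe how `suppress_all_logs` is implemented.

```python
import logging
from contextlib import contextmanager


@contextmanager
def quiet_loggers(names, level=logging.ERROR):
    """Temporarily raise the level of the named loggers, then restore it."""
    loggers = [logging.getLogger(name) for name in names]
    previous = [lg.level for lg in loggers]
    for lg in loggers:
        lg.setLevel(level)
    try:
        yield
    finally:
        for lg, old in zip(loggers, previous):
            lg.setLevel(old)


# "transformers" is the root logger name used by Hugging Face Transformers;
# check the logger names of other libraries before adding them here.
with quiet_loggers(["transformers"]):
    logging.getLogger("transformers").info("hidden while the block runs")
```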

Working with Audio Output

The generated audio is a one-dimensional NumPy array sampled at 22.05 kHz (22,050 Hz):

import soundfile as sf
from kani_tts import KaniTTS

model = KaniTTS('your-model-name')
audio, text = model("Generate speech from this text")

# Audio is a numpy array
print(audio.shape)  # (num_samples,)
print(audio.dtype)  # float32/float64

# Save using soundfile
sf.write('output.wav', audio, 22050)

# Or use the built-in method
model.save_audio(audio, 'output.wav', sample_rate=22050)
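
If `soundfile` is not installed, a mono float array can also be written as 16-bit PCM with the standard-library `wave` module. This is a sketch, not part of Kani-TTS: `write_wav_int16` is a hypothetical helper, and it assumes the samples are floats in the range [-1.0, 1.0], which is not stated above, so out-of-range values are clipped.

```python
import struct
import wave


def write_wav_int16(path, samples, sample_rate=22050):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM.

    Hypothetical helper using only the standard library. `samples` can be
    any iterable of floats, including a 1-D NumPy array. Returns the clip
    duration in seconds.
    """
    pcm = []
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip out-of-range values
        pcm.append(int(round(s * 32767)))

    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))

    return len(pcm) / sample_rate
```

The return value is simply the number of samples divided by the sample rate, so `write_wav_int16("output.wav", audio)` also tells you how long the clip is.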

Playing Audio in Jupyter Notebooks

You can listen to generated audio directly in Jupyter notebooks or IPython:

from kani_tts import KaniTTS
from IPython.display import Audio as aplay

model = KaniTTS('your-model-name')
audio, text = model("Hello, world!")

# Play audio in notebook
aplay(audio, rate=model.sample_rate)

Architecture

Kani-TTS uses a two-stage architecture:

Text → Audio Tokens: A causal language model generates audio token sequences from text
Audio Tokens → Waveform: NVIDIA NeMo's NanoCodec decodes tokens into audio waveforms

The system uses special tokens to mark different segments:

Text boundaries (start/end of text)
Speech boundaries (start/end of speech)
Speaker turns (human/AI)

Audio tokens are organized in 4-channel codebooks, with each channel representing different aspects of the audio signal.
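
To make the codebook layout concrete, the sketch below groups a flat token stream into frames of four and then splits it into one sequence per channel. It assumes the language model emits the four codebook tokens interleaved frame by frame, which the description above does not spell out; `group_into_frames` and `split_channels` are hypothetical names, not Kani-TTS functions.

```python
def group_into_frames(tokens, num_codebooks=4):
    """Split a flat token list into frames of `num_codebooks` tokens.

    Assumes an interleaved layout (one token per codebook, frame after
    frame); a trailing partial frame is dropped because it cannot be decoded.
    """
    usable = len(tokens) - len(tokens) % num_codebooks
    return [
        tuple(tokens[i:i + num_codebooks])
        for i in range(0, usable, num_codebooks)
    ]


def split_channels(frames):
    """Transpose frames into one token sequence per codebook channel."""
    return [list(channel) for channel in zip(*frames)]


frames = group_into_frames([10, 11, 12, 13, 20, 21, 22, 23, 30])
channels = split_channels(frames)
print(frames)    # [(10, 11, 12, 13), (20, 21, 22, 23)]
print(channels)  # [[10, 20], [11, 21], [12, 22], [13, 23]]
```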

Requirements

Python 3.10 or higher
CUDA-capable GPU (recommended) or CPU
PyTorch 2.0 or higher
Transformers library
NeMo Toolkit

Model Compatibility

This library works with causal language models trained for TTS with the following characteristics:

Extended vocabulary including audio tokens
Special tokens for speech/text boundaries
Compatible with NeMo NanoCodec (22 kHz, 0.6 kbps, 12.5 fps)
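
The codec figures can be cross-checked with a little arithmetic. The sketch below assumes one token per codebook per frame (four tokens per frame) and treats the token budget as audio tokens only, so the durations are upper bounds rather than exact lengths.

```python
SAMPLE_RATE_HZ = 22_050   # "22kHz" codec, matching the 22050 Hz sample rate
FRAME_RATE_FPS = 12.5
BITRATE_BPS = 600         # 0.6 kbps
NUM_CODEBOOKS = 4         # from the Architecture section

tokens_per_second = FRAME_RATE_FPS * NUM_CODEBOOKS      # 50.0
bits_per_token = BITRATE_BPS / tokens_per_second        # 12.0
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_FPS     # 1764.0


def max_audio_seconds(num_audio_tokens):
    """Upper bound on audio length if every token were an audio token."""
    return num_audio_tokens / tokens_per_second


print(max_audio_seconds(1200))  # 24.0 (default max_new_tokens)
print(max_audio_seconds(2000))  # 40.0
```

Under these assumptions, raising `max_new_tokens` from 1200 to 2000 lifts the ceiling from about 24 to about 40 seconds of speech per call.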

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

@inproceedings{emilialarge,
  author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
  title={Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation},
  booktitle={arXiv:2501.15907},
  year={2025}
}

@article{emonet_voice_2025,
  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Friedrich, Felix and Kraus, Maurice and Nadi, Kourosh and Nguyen, Huu and Kersting, Kristian and Auer, Sören},
  title={EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection},
  journal={arXiv preprint arXiv:2506.09827},
  year={2025}
}

Acknowledgments

Built on Hugging Face Transformers
Uses NVIDIA NeMo audio codec
Powered by PyTorch

Release history

1.0.1 (Jan 21, 2026)
0.0.4 (Nov 3, 2025)
0.0.3 (Nov 2, 2025), this version, yanked: old version of nemo
0.0.1 (Nov 1, 2025), yanked: test version

Download files

Source Distribution

kani_tts-0.0.3.tar.gz (11.3 kB), uploaded Nov 2, 2025, Source

Built Distribution

kani_tts-0.0.3-py3-none-any.whl (9.3 kB), uploaded Nov 2, 2025, Python 3

File details

Details for the file kani_tts-0.0.3.tar.gz.

File metadata

Download URL: kani_tts-0.0.3.tar.gz
Upload date: Nov 2, 2025
Size: 11.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for kani_tts-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`2c13677d8169902ce932f34f3f2dfa89dcb83d7873a4e5dadaac7dd8a4745c2d`
MD5	`c7c8624939f4c4e8f6bb20dffcec3f23`
BLAKE2b-256	`3064b9afbba8e8cfcdce5fcc2b37c4e19a20d7e3480c0230f9002277724a178a`

File details

Details for the file kani_tts-0.0.3-py3-none-any.whl.

File metadata

Download URL: kani_tts-0.0.3-py3-none-any.whl
Upload date: Nov 2, 2025
Size: 9.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for kani_tts-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a33efadbf3c9cdd658b76e1b9cdef3f34cd72802c4619597da93d6dc08a01749`
MD5	`f221c356d32522132403057b24adb192`
BLAKE2b-256	`488e9f706306eebcd50dd7ca14a89b0babd872050f2d3d1238cd65579bb7a21f`