Text-to-speech using neural audio codec and causal language models

Reason this release was yanked:

test version

Kani-TTS

A simple and efficient text-to-speech library using neural audio codecs and causal language models.

Features

  • Simple, intuitive API
  • Built on Hugging Face Transformers and NVIDIA NeMo
  • High-quality audio generation using neural codecs
  • GPU acceleration support

Installation

From PyPI (once published)

pip install kani-tts

From source

git clone https://github.com/yourusername/kani-tts.git
cd kani-tts
pip install -e .

Optional dependencies

For saving audio files:

pip install "kani-tts[audio]"

For development:

pip install "kani-tts[dev]"

Quick Start

from kani_tts import KaniTTS

# Initialize model (replace with your model name)
model = KaniTTS('your-model-name-here')

# Generate audio from text
audio, text = model("Hello, world!")

# Save to file (requires soundfile)
model.save_audio(audio, "output.wav")

Advanced Usage

Custom Configuration

from kani_tts import KaniTTS

model = KaniTTS(
    'your-model-name',
    temperature=0.7,           # Control randomness (default: 0.6)
    top_p=0.9,                 # Nucleus sampling (default: 0.95)
    max_new_tokens=2000,       # Max audio length (default: 1800)
    repetition_penalty=1.2,    # Prevent repetition (default: 1.1)
)

audio, text = model("Your text here")
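The temperature and top_p settings share their names with the standard Hugging Face generation parameters. Whether Kani-TTS forwards them unchanged is an assumption. The following is a minimal, library-independent sketch of what the two settings do to a single sampling step; `sample_top_p` is a hypothetical helper for illustration and not part of the Kani-TTS API.

```python
import math
import random


def sample_top_p(logits, temperature=0.6, top_p=0.95, rng=random):
    """Pick one index from `logits` with temperature scaling and nucleus filtering.

    Illustrative helper only; it is not how Kani-TTS is known to sample.
    """
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus filtering: keep the most likely tokens until their
    # cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept tokens, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

Lower temperature and lower top_p both narrow the set of tokens that can be drawn, which tends to make the output more stable and less varied.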

Working with Audio Output

The generated audio is a NumPy array sampled at 22.05 kHz (22,050 Hz):

import numpy as np
import soundfile as sf

audio, text = model("Generate speech from this text")

# Audio is a numpy array
print(audio.shape)  # (num_samples,)
print(audio.dtype)  # float32/float64

# Save using soundfile
sf.write('output.wav', audio, 22050)

# Or use the built-in method
model.save_audio(audio, 'output.wav', sample_rate=22050)
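If soundfile is not installed, the same array can be written with Python's standard library. This is a sketch that assumes the audio is mono and that its float samples lie roughly in [-1.0, 1.0]. `write_wav_pcm16` is a hypothetical helper, not part of Kani-TTS.

```python
import struct
import wave


def write_wav_pcm16(path, samples, sample_rate=22050):
    """Write mono float samples as a 16-bit PCM WAV file; return duration in seconds."""
    frames = bytearray()
    for s in samples:
        # Clip to [-1.0, 1.0] (assumed range) before scaling to int16.
        s = max(-1.0, min(1.0, float(s)))
        frames += struct.pack("<h", int(round(s * 32767)))

    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))

    # Two bytes per sample, so the sample count is len(frames) // 2.
    return (len(frames) // 2) / sample_rate
```

Calling `write_wav_pcm16("output.wav", audio)` accepts the NumPy array directly, since the function only iterates over the samples.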

Batch Processing

texts = [
    "First sentence to synthesize.",
    "Second sentence to synthesize.",
    "Third sentence to synthesize."
]

for i, text in enumerate(texts):
    audio, _ = model(text)
    model.save_audio(audio, f"output_{i}.wav")
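Because max_new_tokens caps the length of each generation, long passages may need to be split before they go through the loop above. The sketch below groups sentences into chunks under a character budget. `split_into_chunks` and its `max_chars` limit are hypothetical, and the right budget for a given model is not stated in this README.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to `model(chunk)` in place of the entries in `texts`.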

Architecture

Kani-TTS uses a two-stage architecture:

  1. Text → Audio Tokens: A causal language model generates audio token sequences from text
  2. Audio Tokens → Waveform: NVIDIA NeMo's nano codec decodes tokens into audio waveforms

The system uses special tokens to mark different segments:

  • Text boundaries (start/end of text)
  • Speech boundaries (start/end of speech)
  • Speaker turns (human/AI)

Audio tokens are organized into four codebook channels, with each channel representing a different aspect of the audio signal.
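To make the codebook layout concrete, here is a sketch of how a flat stream of audio tokens could be regrouped into frames and per-codebook channels. It assumes the tokens are interleaved frame by frame, four per frame. That layout is an assumption for illustration, and `to_frames` and `to_codebook_channels` are hypothetical helpers rather than Kani-TTS functions.

```python
NUM_CODEBOOKS = 4  # number of codebook channels stated in this README


def to_frames(flat_tokens, num_codebooks=NUM_CODEBOOKS):
    """Group a flat token list into frames of `num_codebooks` tokens.

    An incomplete trailing frame is dropped.
    """
    usable = len(flat_tokens) - len(flat_tokens) % num_codebooks
    return [
        flat_tokens[i:i + num_codebooks]
        for i in range(0, usable, num_codebooks)
    ]


def to_codebook_channels(frames):
    """Transpose frames into one token list per codebook channel."""
    return [list(channel) for channel in zip(*frames)]
```

For example, `to_frames([1, 2, 3, 4, 5, 6, 7, 8])` gives two frames, and `to_codebook_channels` turns them into four channels of two tokens each, which is the shape a codec decoder with four codebooks would typically expect.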

Requirements

  • Python 3.10 or higher
  • CUDA-capable GPU (recommended) or CPU
  • PyTorch 2.0 or higher
  • Transformers library
  • NeMo Toolkit

Model Compatibility

This library works with causal language models trained for TTS with the following characteristics:

  • Extended vocabulary including audio tokens
  • Special tokens for speech/text boundaries
  • Compatible with NeMo nano codec (22.05 kHz, 0.6 kbps, 12.5 fps)
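The codec figures above imply a few useful quantities. The arithmetic below takes the sample rate as 22,050 Hz (as in the usage examples) and uses the four codebooks described under Architecture. The even split of bits across codebooks and the idea that every generated token is an audio token are simplifying assumptions, so the duration estimate is an upper bound at best.

```python
SAMPLE_RATE_HZ = 22050   # 22.05 kHz, as used in the examples
FRAME_RATE_FPS = 12.5    # codec frames per second
BITRATE_BPS = 600        # 0.6 kbps
NUM_CODEBOOKS = 4        # codebook channels per frame

# Audio samples covered by one codec frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_FPS

# Bits spent on each frame, and on each codebook if split evenly (assumption).
bits_per_frame = BITRATE_BPS / FRAME_RATE_FPS
bits_per_codebook = bits_per_frame / NUM_CODEBOOKS


def max_seconds(max_new_tokens, num_codebooks=NUM_CODEBOOKS, fps=FRAME_RATE_FPS):
    """Rough upper bound on audio length, assuming every new token is an
    audio token and `num_codebooks` tokens make one frame."""
    return max_new_tokens / num_codebooks / fps


print(samples_per_frame)   # 1764.0
print(bits_per_frame)      # 48.0
print(bits_per_codebook)   # 12.0
print(max_seconds(1800))   # 36.0
```

Under these assumptions the default max_new_tokens of 1800 corresponds to at most about 36 seconds of audio. Special tokens in the output would reduce that figure.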

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

If you use Kani-TTS in your research, please cite:

@software{kani_tts,
  title = {Kani-TTS: Text-to-Speech using Neural Audio Codec},
  author = {Your Name},
  year = {2024},
  url = {https://github.com/yourusername/kani-tts}
}
