rho-tts

Multi-provider text-to-speech library with voice cloning, accent drift detection, and STT validation.

Features

Multi-provider TTS — Swap between Qwen3-TTS and Chatterbox with a single parameter
Voice cloning — Clone any voice from a short reference audio sample
Accent drift detection — ML classifier catches when the generated voice drifts from your target accent
STT validation — Whisper-based transcription check ensures the model actually said what you asked it to
Speaker similarity — Cosine similarity scoring between generated and reference voice embeddings
Audio post-processing — Silence trimming, crossfading, DC offset removal, fade-in/out
Batch processing — Generate multiple audio files efficiently with memory management
Cooperative cancellation — Thread-safe cancellation tokens for long-running generation tasks
Extensible — Register custom TTS providers via TTSFactory.register_provider()
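The post-processing steps named in the feature list can be pictured with a few plain-Python helpers. This is only an illustrative sketch on lists of float samples: the function names, the 0.01 silence threshold, and the linear crossfade curve are assumptions, and rho-tts's own implementation may differ.

```python
def remove_dc_offset(samples):
    """Subtract the mean so the waveform is centred on zero."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]


def crossfade(a, b, overlap):
    """Join two clips, linearly blending the last `overlap` samples of a into b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of clip b rises across the overlap
        blended.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return a[:len(a) - overlap] + blended + b[overlap:]
```

A fade-in or fade-out is the same blend applied against silence instead of a second clip.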

Installation

# Core only (installs torch, torchaudio, numpy, pydub)
pip install rho-tts

# With Qwen3-TTS provider (the quotes stop shells such as zsh from expanding the brackets)
pip install "rho-tts[qwen]"

# With Chatterbox provider
pip install "rho-tts[chatterbox]"

# With validation (accent drift, STT, speaker similarity)
pip install "rho-tts[validation]"

# Everything
pip install "rho-tts[all]"

System Dependencies

ffmpeg — Required by pydub for audio file joining
CUDA — GPU recommended for reasonable generation speed (CPU works but is slow)

# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

Hardware Requirements

Model	VRAM	Notes
Qwen3-TTS 0.6B	~8 GB	Smaller, faster
Qwen3-TTS 1.7B	~16 GB	Higher quality
Chatterbox	~6 GB	Good for single segments
Validation (Whisper)	~1 GB	Runs on CPU by default

Quick Start

from rho_tts import TTSFactory

# Create a TTS instance (requires a reference audio for voice cloning)
tts = TTSFactory.get_tts_instance(
    provider="qwen",
    reference_audio="my_voice.wav",
    reference_text="Transcript of my voice sample.",
)

# Generate a single file
result = tts.generate("Hello world!", "output.wav")

# Generate without saving to disk (in-memory only)
result = tts.generate("Hello world!")
print(result.audio, result.duration_sec)

# Generate a batch
results = tts.generate(
    texts=["First sentence.", "Second sentence."],
    output_path="batch_output",
)

Providers

Qwen3-TTS (default)

Best for batch generation with validation. Supports voice cloning via reference audio + text.

tts = TTSFactory.get_tts_instance(
    provider="qwen",
    reference_audio="voice.wav",
    reference_text="What the voice says in the audio file.",
    model_path="Qwen/Qwen3-TTS-12Hz-1.7B-Base",  # or local path
    batch_size=5,
    max_iterations=10,
    accent_drift_threshold=0.17,
    text_similarity_threshold=0.85,
)
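To make the batch_size setting concrete, here is a minimal sketch of splitting a list of texts into groups of at most batch_size items. The chunked helper is hypothetical and only illustrates the idea; it is not rho-tts's internal batching code.

```python
def chunked(texts, batch_size=5):
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Seven texts with batch_size=3 give batches of 3, 3, and 1.
batches = list(chunked(["a", "b", "c", "d", "e", "f", "g"], batch_size=3))
```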

Chatterbox

Best for single-segment regeneration with comprehensive validation loops.

tts = TTSFactory.get_tts_instance(
    provider="chatterbox",
    reference_audio="voice.wav",
    implementation="faster",  # rsxdalv optimizations
    max_iterations=50,
    accent_drift_threshold=0.17,
    text_similarity_threshold=0.75,
    speaker_similarity_threshold=0.85,
)
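The role of max_iterations in a validation loop can be sketched as a generic retry helper. The names generate_with_retries, generate, and validate are hypothetical, and returning None when every attempt fails is an assumption; the library may handle exhausted retries differently.

```python
def generate_with_retries(generate, validate, max_iterations=10):
    """Call generate() until validate(candidate) passes or attempts run out.

    Returns (candidate, attempts_used), or (None, max_iterations) on failure.
    """
    for attempt in range(1, max_iterations + 1):
        candidate = generate()
        if validate(candidate):
            return candidate, attempt
    return None, max_iterations


# Stand-in "generations" are just drift scores; 0.17 mirrors the example threshold.
scores = iter([0.40, 0.30, 0.10])
result, attempts = generate_with_retries(
    lambda: next(scores),
    lambda drift: drift <= 0.17,
    max_iterations=5,
)
```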

Configuration

All thresholds and parameters can be set via constructor keyword arguments, which TTSFactory.get_tts_instance() accepts as in the provider examples above:

Parameter	Default	Description
`device`	`"cuda"`	`"cuda"` or `"cpu"`
`seed`	`789`	Random seed for reproducibility
`deterministic`	`False`	Deterministic CUDA ops (slower)
`phonetic_mapping`	`{}`	Word-to-pronunciation overrides
`max_iterations`	`10` (Qwen) / `50` (Chatterbox)	Max validation retry loops
`accent_drift_threshold`	`0.17`	Max accent drift probability
`text_similarity_threshold`	`0.85` (Qwen) / `0.75` (Chatterbox)	Min STT text match score
`sound_decay_threshold`	`0.3`	Max RMS decay ratio (final vs first third)
`max_decay_retries`	`3`	Full-regeneration attempts on sound decay
`batch_size`	`5`	Texts per batch (Qwen only)
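The sound_decay_threshold row compares loudness at the end of a clip with loudness at the start. A minimal sketch of such a check follows, assuming the ratio is the RMS of the final third divided by the RMS of the first third and that a ratio below the threshold counts as decay. Both the direction of the comparison and the helper names are assumptions, not confirmed library behaviour.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def decay_ratio(samples):
    """RMS of the final third divided by RMS of the first third."""
    third = len(samples) // 3
    if third == 0:
        raise ValueError("need at least three samples")
    first = rms(samples[:third])
    last = rms(samples[-third:])
    if first == 0.0:
        return 1.0  # silent opening: treat as no measurable decay
    return last / first


def has_sound_decay(samples, threshold=0.3):
    """Assumed rule: flag the clip when the ratio falls below the threshold."""
    return decay_ratio(samples) < threshold
```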

Custom Providers

from rho_tts import BaseTTS, TTSFactory

class MyTTS(BaseTTS):
    def _generate_audio(self, text, **kwargs):
        # Your model inference here — return a torch.Tensor
        ...

    @property
    def sample_rate(self):
        return 24000

TTSFactory.register_provider("my_tts", MyTTS)
tts = TTSFactory.get_tts_instance(provider="my_tts")
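For readers unfamiliar with the pattern, a provider registry of this kind can be as small as a dictionary from names to classes. The ProviderRegistry and EchoTTS classes below are hypothetical stand-ins written for illustration; they do not reproduce TTSFactory's actual code.

```python
class ProviderRegistry:
    """Maps provider names to classes and builds instances on request."""

    def __init__(self):
        self._providers = {}

    def register_provider(self, name, cls):
        self._providers[name] = cls

    def create(self, provider, **kwargs):
        try:
            cls = self._providers[provider]
        except KeyError:
            raise ValueError(f"unknown provider {provider!r}") from None
        return cls(**kwargs)


class EchoTTS:
    """Toy provider that only records its sample rate."""

    def __init__(self, sample_rate=24000):
        self.sample_rate = sample_rate


registry = ProviderRegistry()
registry.register_provider("echo", EchoTTS)
tts = registry.create("echo", sample_rate=16000)
```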

Validation Pipeline

When the validation dependencies are installed (pip install "rho-tts[validation]"), generated audio goes through three checks:

Accent drift detection — A trained classifier predicts the probability that the voice has drifted from the target accent. Samples exceeding the threshold are regenerated.
STT text matching — Whisper transcribes the audio and compares it against the intended text using fuzzy matching with number normalization. The normalizer handles word numbers, ordinals, dates, currency, and times (e.g. "five dollars and ninety nine cents" → "$5.99", "march twenty second" → "march 22") via NeMo inverse text normalization.
Speaker similarity — Cosine similarity between the generated audio's speaker embedding and the reference voice embedding.
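The two similarity scores above can be illustrated with the standard library alone. This sketch uses difflib for fuzzy text matching and a hand-written cosine similarity; it skips the number normalization step, and the function names and the choice of difflib are assumptions rather than a description of what rho-tts does internally.

```python
import math
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def text_similarity(expected, transcribed):
    """Fuzzy match score in [0, 1] between intended and transcribed text."""
    return SequenceMatcher(None, normalize(expected), normalize(transcribed)).ratio()


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

A sample would then pass when text_similarity meets text_similarity_threshold and cosine_similarity meets speaker_similarity_threshold.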

Training the Accent Drift Classifier

Prepare a dataset with good/ and bad/ subdirectories containing .wav files, then train a classifier. Models can be trained globally or per-voice.

Per-voice models (recommended)

Each voice can have its own classifier, stored at ~/.rho_tts/models/{voice_id}_classifier.pkl. This gives better accuracy since accent drift patterns differ between voices.

# CLI
python -m rho_tts.validation.classifier.trainer \
    --dataset-dir /path/to/dataset \
    --voice-id my_voice

# Library
from rho_tts.validation.classifier.trainer import train

train(dataset_dir="/path/to/dataset", voice_id="my_voice")

During generation, set voice_id on the TTS instance to use the per-voice model automatically:

from rho_tts import TTSFactory

tts = TTSFactory.get_tts_instance(provider="qwen", reference_audio="voice.wav", reference_text="...")
tts.voice_id = "my_voice"
texts = ["First sentence.", "Second sentence."]
tts.generate(texts, "output")  # uses ~/.rho_tts/models/my_voice_classifier.pkl

Global model

A global model is used as a fallback when no per-voice model exists.

# Train a global model
python -m rho_tts.validation.classifier.trainer --dataset-dir /path/to/dataset

# Or specify an explicit output path
python -m rho_tts.validation.classifier.trainer \
    --dataset-dir /path/to/dataset \
    --output /path/to/voice_quality_model.pkl

Auto-sorting samples

During generation, samples can be automatically sorted into good/ and bad/ folders based on their drift score — building your training dataset as you generate.

from rho_tts import TTSFactory

tts = TTSFactory.get_tts_instance(provider="qwen", reference_audio="voice.wav", reference_text="...")
tts.voice_id = "my_voice"

# Set the target directories
tts.auto_sort_good_dir = "/path/to/dataset/good"
tts.auto_sort_bad_dir = "/path/to/dataset/bad"

# Set the thresholds (drift probability 0-1)
tts.auto_sort_good_threshold = 0.10  # below this → good/
tts.auto_sort_bad_threshold = 0.25   # above this → bad/

# Samples between 0.10 and 0.25 are ambiguous and skipped
texts = ["First sentence.", "Second sentence."]
tts.generate(texts, "output")

Attribute	Description
`auto_sort_good_dir`	Directory to copy low-drift samples to
`auto_sort_bad_dir`	Directory to copy high-drift samples to
`auto_sort_good_threshold`	Drift prob below this → `good/`
`auto_sort_bad_threshold`	Drift prob above this → `bad/`

The sorted files use the same good/ / bad/ structure the trainer expects, so you can point the trainer directly at the parent directory.
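The two-threshold rule can be written as a small decision function. The sort_bucket name is hypothetical, and treating scores exactly on a threshold as ambiguous is an assumption based on the "below this" and "above this" wording.

```python
def sort_bucket(drift_prob, good_threshold=0.10, bad_threshold=0.25):
    """Return "good", "bad", or None for an ambiguous drift probability."""
    if drift_prob < good_threshold:
        return "good"
    if drift_prob > bad_threshold:
        return "bad"
    return None  # between the thresholds: skipped


labels = [sort_bucket(p) for p in (0.05, 0.17, 0.30)]
```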

Model lookup order

When predicting accent drift, the classifier checks for models in this order:

1. Per-voice model at ~/.rho_tts/models/{voice_id}_classifier.pkl
2. Explicit path passed via model_path parameter
3. RHO_TTS_CLASSIFIER_MODEL environment variable
4. Bundled global model

# Override the global model path via environment variable
export RHO_TTS_CLASSIFIER_MODEL=/path/to/voice_quality_model.pkl
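The lookup order can be expressed as a short resolver. This is a sketch under stated assumptions: resolve_classifier_model and its models_dir and environ parameters are hypothetical, and returning None to mean "use the bundled global model" is a simplification.

```python
import os
from pathlib import Path


def resolve_classifier_model(voice_id=None, model_path=None, models_dir=None, environ=None):
    """Follow the documented order and return a Path, or None for the bundled model."""
    models_dir = Path(models_dir) if models_dir else Path.home() / ".rho_tts" / "models"
    environ = os.environ if environ is None else environ

    # 1. Per-voice model, only if the file actually exists
    if voice_id:
        per_voice = models_dir / f"{voice_id}_classifier.pkl"
        if per_voice.is_file():
            return per_voice

    # 2. Explicit path
    if model_path:
        return Path(model_path)

    # 3. Environment variable
    env_path = environ.get("RHO_TTS_CLASSIFIER_MODEL")
    if env_path:
        return Path(env_path)

    # 4. Nothing found: the caller falls back to the bundled global model
    return None
```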

Web UI

A Gradio-based web interface for interactive TTS generation, voice management, and model configuration.

Installation

# From PyPI
pip install "rho-tts[ui]"

# From local source
pip install -e ".[ui]"

Launch

# CLI entry point
rho-tts-ui

# Or as a Python module
python -m rho_tts.ui

# With options
rho-tts-ui --host 0.0.0.0 --port 8080 --device cpu --share

Flag	Default	Description
`--config`	`~/.rho_tts/config.json`	Path to config JSON file
`--host`	`127.0.0.1`	Server bind address
`--port`	`7860`	Server port
`--device`	`cuda`	`cuda` or `cpu`
`--share`	off	Create a public Gradio link

The config path can also be set via the RHO_TTS_CONFIG environment variable.
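For orientation, the flags in the table map naturally onto an argparse parser. The sketch below is not the actual rho-tts-ui entry point. It assumes that an explicit --config wins over RHO_TTS_CONFIG, which in turn wins over the default; that precedence is a guess, and build_parser and resolve_config_path are hypothetical names.

```python
import argparse
import os

DEFAULT_CONFIG = "~/.rho_tts/config.json"


def build_parser():
    """Parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="rho-tts-ui")
    parser.add_argument("--config", default=None, help="Path to config JSON file")
    parser.add_argument("--host", default="127.0.0.1", help="Server bind address")
    parser.add_argument("--port", type=int, default=7860, help="Server port")
    parser.add_argument("--device", choices=["cuda", "cpu"], default="cuda")
    parser.add_argument("--share", action="store_true", help="Create a public Gradio link")
    return parser


def resolve_config_path(args, environ=None):
    """Assumed precedence: --config flag, then RHO_TTS_CONFIG, then the default."""
    environ = os.environ if environ is None else environ
    raw = args.config or environ.get("RHO_TTS_CONFIG") or DEFAULT_CONFIG
    return os.path.expanduser(raw)


args = build_parser().parse_args(["--host", "0.0.0.0", "--port", "8080", "--device", "cpu", "--share"])
```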

Tabs

The interface has five tabs: Generate, Library, Voices, Models, and Training.
Generate — Select a model and voice, enter text, and generate audio with real-time playback. Includes phonetic mapping overrides per voice/model pair.
Voices — Upload reference audio and transcripts to create reusable voice profiles (stored in ~/.rho_tts/voices/).
Models — Configure TTS providers with custom thresholds and parameters.

Cancellation

For long-running generation in web servers or UIs:

from rho_tts import CancellationToken, TTSFactory

token = CancellationToken()
texts = ["First sentence.", "Second sentence."]

# In worker thread
tts = TTSFactory.get_tts_instance(provider="qwen", reference_audio="voice.wav", reference_text="...")
result = tts.generate(texts, "output", cancellation_token=token)

# In controller thread (e.g., on user cancel button)
token.cancel()
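Cooperative cancellation generally means the worker checks a shared flag between units of work and stops when it is set. The sketch below shows the idea with threading.Event. SimpleCancellationToken and process_segments are hypothetical stand-ins, not the library's CancellationToken, whose interface beyond cancel() is not documented here.

```python
import threading


class SimpleCancellationToken:
    """Thread-safe flag that a controller sets and a worker polls."""

    def __init__(self):
        self._event = threading.Event()

    def cancel(self):
        self._event.set()

    @property
    def cancelled(self):
        return self._event.is_set()


def process_segments(segments, token):
    """Handle segments one by one, stopping early once the token is cancelled."""
    done = []
    for segment in segments:
        if token.cancelled:
            break
        done.append(segment.upper())  # placeholder for real synthesis work
    return done


token = SimpleCancellationToken()
finished = process_segments(["a", "b"], token)
token.cancel()
after_cancel = process_segments(["a", "b"], token)
```

Because the check happens only between segments, a segment already in progress finishes before the loop exits.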

License

MIT


Version 1.1.4, released Mar 28, 2026
