Echo TTS - Text-to-Speech synthesis with voice cloning

Echo-TTS

A multi-speaker text-to-speech model with speaker reference conditioning. See the blog post for technical details.

Model: jordand/echo-tts-base | Demo: echo-tts-preview

This work was made possible by the TPU Research Cloud (TRC).

Responsible Use

Don't use this model to:

Impersonate real people without their consent
Generate deceptive audio (e.g., fraud, misinformation, deepfakes)

You are responsible for complying with local laws regarding biometric data and voice cloning.

Installation

pip install -r requirements.txt

Requires Python 3.10+ and a CUDA-capable GPU with at least 8GB VRAM.

Quick Start

Gradio UI

python gradio_app.py

Python API

from functools import partial

import torchaudio

from inference import (
    load_model_from_hf,
    load_fish_ae_from_hf,
    load_pca_state_from_hf,
    load_audio,
    sample_pipeline,
    sample_euler_cfg_independent_guidances,
)

# Load models (downloads from HuggingFace on first run)
model = load_model_from_hf(delete_blockwise_modules=True)
fish_ae = load_fish_ae_from_hf()
pca_state = load_pca_state_from_hf()

# Load speaker reference (or set to None for no reference)
speaker_audio = load_audio("speaker.wav").cuda()

# Configure sampler
sample_fn = partial(
    sample_euler_cfg_independent_guidances,
    num_steps=40,
    cfg_scale_text=3.0,
    cfg_scale_speaker=8.0,
    cfg_min_t=0.5,
    cfg_max_t=1.0,
    truncation_factor=None,
    rescale_k=None,
    rescale_sigma=None,
    speaker_kv_scale=None,
    speaker_kv_max_layers=None,
    speaker_kv_min_t=None,
    sequence_length=640,  # 640 latents, about 30 seconds of audio
)

# Generate
text = "[S1] Hello, this is a test of the Echo TTS model."
audio_out, _ = sample_pipeline(
    model=model,
    fish_ae=fish_ae,
    pca_state=pca_state,
    sample_fn=sample_fn,
    text_prompt=text,
    speaker_audio=speaker_audio,
    rng_seed=0,
)

torchaudio.save("output.wav", audio_out[0].cpu(), 44100)

Low VRAM (8GB)

In gradio_app.py, adjust:

FISH_AE_DTYPE = torch.bfloat16  # instead of float32
DEFAULT_SAMPLE_LATENT_LENGTH = 576  # instead of 640; lower it further if 576 still does not fit

Tips

Generation Length

Echo is trained to generate up to 30 seconds of audio (640 latents) given text and reference audio. Because the training text always corresponded to 30 seconds of audio or less, the model tries to fit any text prompt into the 30 seconds it generates, so long prompts may result in faster speaking rates. Shorter prompts also work and produce shorter outputs, since the model generates latent padding automatically.

If "Sample Latent Length" (under Custom Shapes in the Gradio UI), or sequence_length in the Python API, is set below 640, the model attempts to generate only the prefix that fits in that length. For example, if you set it to 320 and supply about 30 seconds' worth of text, the model will likely generate the first half of the text rather than try to fit all of it into the first 15 seconds.
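
The relationship between duration and latent count can be sketched as simple arithmetic. This is a minimal illustration that assumes the ratio is linear and uses only the figures quoted above (640 latents for roughly 30 seconds); the helper names are hypothetical and are not part of inference.py.

```python
import math

# Figures quoted in this section: 640 latents correspond to roughly 30 seconds.
MAX_LATENTS = 640
MAX_SECONDS = 30.0


def seconds_to_latents(seconds: float) -> int:
    """Approximate sequence_length for a target duration, capped at 640.

    Assumes a linear latent rate (about 21.3 latents per second).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return min(MAX_LATENTS, math.ceil(seconds * MAX_LATENTS / MAX_SECONDS))


def latents_to_seconds(latents: int) -> float:
    """Approximate duration covered by a given number of latents."""
    return latents * MAX_SECONDS / MAX_LATENTS
```

Under these assumptions, seconds_to_latents(15) gives 320, which matches the half-length example above.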

Reference Audio

You can condition on up to 5 minutes of reference audio, but shorter clips (e.g., 10 seconds or shorter) work well too.
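
If a reference recording is longer than 5 minutes, one option is to cut it down before passing it in. The sketch below is a hypothetical helper, not something provided by the repo, and it works on any sliceable sequence of mono samples; whether load_audio already trims long clips is not stated here, so treat this as an optional precaution.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")

# Upper bound on reference audio stated in this section.
MAX_REFERENCE_SECONDS = 5 * 60


def trim_reference(
    samples: Sequence[T],
    sample_rate: int,
    max_seconds: float = MAX_REFERENCE_SECONDS,
) -> Sequence[T]:
    """Keep at most max_seconds of audio from the start of a mono sample sequence."""
    max_samples = int(sample_rate * max_seconds)
    return samples[:max_samples]
```

For a multi-channel tensor the slice would need to be applied along the time axis instead.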

Force Speaker (KV Scaling)

Sometimes out-of-distribution text for a given reference speaker will cause the model to generate a different speaker entirely. Enabling "Force Speaker" (which scales speaker KV for a portion of timesteps, default scale 1.5) generally fixes this. However, high values may introduce artifacts or "overconditioning." Aim for the lowest scale that produces the correct speaker: 1.0 is baseline, 1.5 is the default when enabled and will usually force the speaker, but lower values (e.g., 1.3, 1.1) may suffice.
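
The advice to use the lowest scale that works amounts to a simple ascending search. The sketch below assumes you supply your own check, here the hypothetical callback speaker_matches, which would generate audio with speaker_kv_scale set to the candidate value and decide (by ear or with a speaker-verification model of your choice) whether the voice is correct. Nothing in this block calls the Echo-TTS API.

```python
from typing import Callable, Iterable, Optional


def lowest_working_scale(
    candidates: Iterable[float],
    speaker_matches: Callable[[float], bool],
) -> Optional[float]:
    """Return the smallest candidate scale accepted by speaker_matches.

    speaker_matches is a user-supplied (hypothetical) check; returns None
    if no candidate produces the correct speaker.
    """
    for scale in sorted(candidates):
        if speaker_matches(scale):
            return scale
    return None
```

A typical call might try the values mentioned above, for example lowest_working_scale([1.1, 1.3, 1.5], my_check), and fall back to a different reference clip if it returns None.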

Text Prompt Format

Text prompts use the format from WhisperD. By default, colons, semicolons, and em dashes are normalized to commas (see tokenizer_encode in inference.py), and "[S1] " is added to the beginning of the prompt if not already present. Commas generally function as pauses. Exclamation points (and other non-bland punctuation) may increase expressiveness but can occasionally lower quality; improving controllability is an important direction for future work.
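
To preview roughly what the model will see, the normalization described above can be approximated in a few lines. This is only a sketch of the stated rules, not a copy of tokenizer_encode; the real function may treat spacing, other speaker tags, or additional characters differently, and normalize_prompt is a hypothetical name.

```python
SPEAKER_TAG = "[S1] "
EM_DASH = "\u2014"


def normalize_prompt(text: str) -> str:
    """Approximate the default prompt normalization described in this section."""
    # Colons, semicolons, and em dashes become commas.
    for mark in (":", ";", EM_DASH):
        text = text.replace(mark, ",")
    # Prepend the speaker tag only when it is not already there.
    if not text.startswith(SPEAKER_TAG):
        text = SPEAKER_TAG + text
    return text
```

Running a prompt through a helper like this before generation makes it easier to see where pauses (commas) will land.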

The included text presets are stylistically in-distribution with the WhisperD transcription style.

Blockwise Generation

inference_blockwise.py includes blockwise sampling, which allows generating audio in smaller blocks as well as producing continuations of existing audio (where the prefix and continuation total at most 30 seconds). The model released on HF is a fully fine-tuned model, not the LoRA described in the blog post. Because the S1-DAC decoder is causal, blockwise generation makes audio streaming possible, although streaming is not included in the current code. Blockwise functionality hasn't been thoroughly tested and may benefit from different (e.g., smaller) CFG scales.

License

Code in this repo is MIT‑licensed except where file headers specify otherwise (e.g., autoencoder.py is Apache‑2.0).

Regardless of our model license, audio outputs are CC-BY-NC-SA-4.0 due to the dependency on the Fish Speech S1-DAC autoencoder, which is CC-BY-NC-SA-4.0.

We have chosen to release the Echo-TTS weights under CC-BY-NC-SA-4.0.

For included audio prompts, see audio_prompts/LICENSE.

Citation

@misc{darefsky2025echo,
    author = {Darefsky, Jordan},
    title = {Echo-TTS},
    year = {2025},
    url = {https://jordandarefsky.com/blog/2025/echo/}
}

Release history

0.1.4: Dec 25, 2025
0.1.3: Dec 23, 2025
0.1.1: Dec 20, 2025
0.1.0 (this version): Dec 19, 2025

Download files

Source Distribution

echo_tts-0.1.0.tar.gz (28.7 kB)

Uploaded Dec 19, 2025 Source

Built Distribution

echo_tts-0.1.0-py3-none-any.whl (27.2 kB)

Uploaded Dec 19, 2025 Python 3

File details

Details for the file echo_tts-0.1.0.tar.gz.

File metadata

Download URL: echo_tts-0.1.0.tar.gz
Upload date: Dec 19, 2025
Size: 28.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0rc1

File hashes

Hashes for echo_tts-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`85c73956d6116468686f4546768d4f4dd5dbaeedd3064b44618800817181e57c`
MD5	`502a2c8f15abefb4dbc8f185c6534c35`
BLAKE2b-256	`fe31d3b26643bd153cd509c9c23df65ac7ec874937da8798f809edaae30d716a`
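
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. The following is a minimal sketch; the function names are illustrative, and the path you pass in is wherever you saved the download.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value would be 85c73956d6116468686f4546768d4f4dd5dbaeedd3064b44618800817181e57c, as listed in the table above.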

File details

Details for the file echo_tts-0.1.0-py3-none-any.whl.

File metadata

Download URL: echo_tts-0.1.0-py3-none-any.whl
Upload date: Dec 19, 2025
Size: 27.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0rc1

File hashes

Hashes for echo_tts-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`296a8030fa3b85fe0b780e4c8158c3c93bac748ad8e4c1fe5a4d305857bd3e52`
MD5	`544c9e7b8e52fbab9dca53aad79d075d`
BLAKE2b-256	`5ea91716398f1179010b96b163250f8d9b954cdb729f2473f98f610cda91b9c6`
