
LuxTTS MLX port for fast Apple Silicon inference.


LuxTTS


LuxTTS is a lightweight, ZipVoice-based text-to-speech model designed for high-quality voice cloning and realistic generation at speeds exceeding 150x real time.

https://github.com/user-attachments/assets/a3b57152-8d97-43ce-bd99-26dc9a145c29

Upstream sync status (2026-02-11)

  • Upstream LuxTTS commit synced: 34820963ee97f406619e5983771e572f779a600a (2026-01-28).
  • Local MLX fork is 0 commits behind upstream on master.
  • Ongoing MLX quality and parity plan: docs/MLX_PORT_PLAN.md.

Release status

  • Initial stable package release: v0.1.0 (2026-02-13).
  • Release notes: CHANGELOG.md.

The main features are:

  • Voice cloning: SOTA voice cloning on par with models 10x larger.
  • Clarity: Clear 48 kHz speech generation, unlike most TTS models, which are limited to 24 kHz.
  • Speed: Reaches 150x real time on a single GPU and faster than real time on CPUs as well.
  • Efficiency: Fits within 1 GB of VRAM, so it runs on virtually any local GPU.

Usage

You can try it locally, in Colab, or on Hugging Face Spaces.


Simple installation:

pip install LuxTTS-mlx

Required vocoder dependency (LinaCodec):

pip install git+https://github.com/ysharma3501/LinaCodec.git

Recommended for English synthesis (includes phonemizer):

pip install "LuxTTS-mlx[phonemize]" -f https://k2-fsa.github.io/icefall/piper_phonemize.html

From source:

git clone https://github.com/jishnuvenugopal/LuxTTS-mlx.git
cd LuxTTS-mlx
pip install -r requirements.txt

Load model:

from luxtts_mlx import LuxTTS

# load model on GPU
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')

# load model on CPU
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)

# load model on MPS for Macs
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')

# load model on MLX (Apple Silicon, Python 3.11+)
# lux_tts = LuxTTS('YatharthS/LuxTTS', device='mlx')

Note: On MLX, both diffusion and vocoder run in MLX by default.

Note: The MLX vocoder path uses vocos-mlx and will download the LuxTTS vocoder weights on first run.

Important: English synthesis requires piper_phonemize. Install with: pip install "LuxTTS-mlx[phonemize]" -f https://k2-fsa.github.io/icefall/piper_phonemize.html
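
Since English synthesis fails without the phonemizer, a quick import check before synthesis can save a confusing error later. A minimal sketch (assuming the extra installs a module named piper_phonemize):

try:
    import piper_phonemize  # provided by the [phonemize] extra
    print("piper_phonemize available")
except ImportError:
    print('Missing phonemizer; run: pip install "LuxTTS-mlx[phonemize]" -f https://k2-fsa.github.io/icefall/piper_phonemize.html')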

Simple inference

import soundfile as sf
from IPython.display import Audio, display

text = "Hey, what's up? I'm feeling really great if you ask me honestly!"

## change this to your reference file path, can be wav/mp3
prompt_audio = 'audio_file.wav'

## encode audio (first call takes ~10 s while librosa initializes)
encoded_prompt = lux_tts.encode_prompt(prompt_audio, rms=0.01)

## generate speech
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)

## save audio
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)

## display speech
display(Audio(final_wav, rate=48000))

Inference with sampling params:

import soundfile as sf
from IPython.display import Audio, display

text = "Hey, what's up? I'm feeling really great if you ask me honestly!"

## change this to your reference file path, can be wav/mp3
prompt_audio = 'audio_file.wav'

rms = 0.01 ## higher sounds louder (around 0.01 recommended)
t_shift = 0.9 ## sampling param; higher can sound better but worsens WER
num_steps = 4 ## sampling param; higher sounds better but takes longer (3-4 is best for efficiency)
speed = 1.0 ## sampling param; controls audio speed (lower = slower)
return_smooth = True ## sampling param; smoother/clearer default output path
ref_duration = 5 ## lower can speed up inference; set to 1000 if you find artifacts

## encode audio (first call takes ~10 s while librosa initializes)
encoded_prompt = lux_tts.encode_prompt(prompt_audio, duration=ref_duration, rms=rms)

## generate speech
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=num_steps, t_shift=t_shift, speed=speed, return_smooth=return_smooth)

## save audio
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)

## display speech
display(Audio(final_wav, rate=48000))
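
Because the prompt is encoded once and then passed into every generate_speech call, a single encoding can be reused across many lines of text. A minimal batch sketch using the API shown above (file names are illustrative):

import soundfile as sf

texts = [
    "First line to synthesize.",
    "Second line, same cloned voice.",
]

## encode the reference once, then reuse it for every generation
encoded_prompt = lux_tts.encode_prompt('audio_file.wav', rms=0.01)

for i, text in enumerate(texts):
    wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)
    sf.write(f'output_{i}.wav', wav.numpy().squeeze(), 48000)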

Tips

  • Use at least a 3-second audio file for voice cloning.
  • Best cloning quality comes from clean real speech (single speaker, no music/noise), typically 3-8 seconds.
  • return_smooth=True is the default and usually sounds clearer; use False for sharper 48 kHz output (see the comparison sketch after this list).
  • Lowering t_shift reduces the chance of pronunciation errors at some cost in quality, and vice versa.
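
To hear the return_smooth tradeoff directly, generate the same line on both paths and compare. A minimal sketch using the API above (output file names are illustrative):

import soundfile as sf

text = "A short line for comparing output paths."
encoded_prompt = lux_tts.encode_prompt('audio_file.wav', rms=0.01)

## default smooth path (usually clearer)
smooth = lux_tts.generate_speech(text, encoded_prompt, num_steps=4, return_smooth=True)
sf.write('smooth.wav', smooth.numpy().squeeze(), 48000)

## sharper 48 kHz path
sharp = lux_tts.generate_speech(text, encoded_prompt, num_steps=4, return_smooth=False)
sf.write('sharp.wav', sharp.numpy().squeeze(), 48000)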

CLI quick test (no prompt)

luxtts-mlx "Hello from MLX!" --out output.wav --device mlx

Use --no-return-smooth if you want the sharper 48 kHz path. Defaults are tuned for clarity:

  • --num-steps 5, --speed 0.92, --duration-pad-frames 16.
  • Output peak normalization is on by default: --output-peak 0.92 (set --output-peak 0 to disable). If output feels too loud on speakers/headphones, use --output-peak 0.8.
  • Prompt preprocessing is on by default: silence edge trim (--trim-prompt-silence) and safe RMS clamping (--prompt-rms-min 0.006, --prompt-rms-max 0.03).
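
For example, combining the flags above to get the sharper path with a quieter output peak (values illustrative):

luxtts-mlx "Hello from MLX!" --out output.wav --device mlx --no-return-smooth --output-peak 0.8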

CLI with prompt + optional prompt text

luxtts-mlx --text "Hello from MLX!" --prompt /path/to/prompt.wav --out output.wav --device mlx
luxtts-mlx --text "Hello from MLX!" --prompt /path/to/prompt.wav --prompt-text "Hello." --out output.wav --device mlx

Tips:

  • Providing --prompt-text skips loading the Whisper prompt transcriber, which is faster and avoids extra multiprocessing warnings.
  • For long/noisy prompt files, use --prompt-start and --ref-duration to target a clean segment (see the example below).
  • If the prompt text is mis-transcribed or repeated, always pass an explicit --prompt-text.
  • If trimming is too aggressive for your clip, pass --no-trim-prompt-silence or lower the trim sensitivity with --prompt-silence-threshold-db -48.
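
For example, to target a clean 5-second segment of a longer prompt file (assuming --prompt-start takes an offset in seconds; values illustrative):

luxtts-mlx --text "Hello from MLX!" --prompt /path/to/long_prompt.wav --prompt-start 12 --ref-duration 5 --prompt-text "Hello." --out output.wav --device mlx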

Optional fallback: torch vocoder with MLX diffusion

luxtts-mlx "Hello from MLX!" --prompt /path/to/prompt.wav --out output.wav --device mlx --vocoder torch

If you hit a Metal kernel error such as Unable to load function four_step_mem_8192..., re-run with --vocoder torch.

Automated recursive feedback loop (Torch vs MLX)

Use this to auto-run multi-round quality checks, compare backends, and tune params until thresholds pass:

./.venv/bin/python tools/feedback_loop.py \
  --text "This is a stability smoke test for LuxTTS MLX prompt preprocessing." \
  --prompt /path/to/prompt1.wav \
  --prompt /path/to/prompt2.wav \
  --prompt-text "Exact transcript of prompt clips." \
  --device mlx \
  --vocoder-set both \
  --max-rounds 4 \
  --max-candidates-per-round 10 \
  --keep-wavs final \
  --round-history none \
  --compact-report \
  --no-write-case-reports \
  --out-dir feedback-loop-runs

Outputs:

  • Storage-optimized by default:
    • only final best WAV per prompt (--keep-wavs final)
    • optional zero round history (--round-history none)
    • no per-case JSON unless enabled (--write-case-reports)
  • Aggregate summary: feedback-loop-runs/SUMMARY.md.
  • Uses ASR similarity + start/tail artifact ratios + repetition ratio to pick winners.
  • Includes prompt transcript safeguards for auto-transcribed prompt text:
    • --auto-prompt-text-policy strict (default): fail if transcript looks unstable.
    • --auto-prompt-text-policy target-fallback: use target text when auto transcript looks unstable.
    • --auto-prompt-text-policy hello-fallback: use "Hello." fallback.
  • Reuses prompt encodings across candidate variants to speed multi-round loops.

If you need full debug artifacts, use --keep-wavs all --no-compact-report --keep-asr-text.

5-scenario quality/performance matrix (recommended smoke gate)

Run a fixed short/medium/long/numeric/punctuation matrix with storage-safe outputs:

./.venv/bin/python tools/scenario_matrix.py \
  --prompt /path/to/prompt.wav \
  --prompt-text "Exact transcript of prompt clip." \
  --device mlx \
  --vocoder-set both \
  --keep-wavs best \
  --out-dir scenario-matrix-runs

Outputs:

  • scenario-matrix-runs/report.json
  • scenario-matrix-runs/SUMMARY.md
  • one best wav per scenario (or all/none via --keep-wavs)
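
The exact schema of report.json isn't documented above, so a safe first step is to load it and inspect its top-level structure. A minimal sketch:

import json

## inspect the scenario matrix report without assuming its schema
with open('scenario-matrix-runs/report.json') as f:
    report = json.load(f)

if isinstance(report, dict):
    print("top-level keys:", sorted(report.keys()))
else:
    print(f"list with {len(report)} entries")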

Info

Q: How is this different from ZipVoice?

A: LuxTTS uses the same architecture but is distilled to 4 steps with an improved sampling technique. It also uses a custom 48 kHz vocoder instead of the default 24 kHz one.

Q: Can it be even faster?

A: Yes. It currently uses float32; float16 should be significantly faster (almost 2x).

Roadmap

  • Release model and code
  • Huggingface spaces demo
  • Release MPS support (thanks to @builtbybasit)
  • Release code for float16 inference

Acknowledgments

  • ZipVoice for their excellent code and model.
  • Vocos for their great vocoder.

Final Notes

The model and code are licensed under the Apache-2.0 license. See LICENSE for details.

Stars/Likes would be appreciated, thank you.

Email: yatharthsharma350@gmail.com


