High-fidelity voice cloning TTS for Apple Silicon — powered by Qwen3-TTS and ChatterboxTurbo via MLX
voxon
High-fidelity, real-time voice cloning on Apple Silicon.
One command to install. One command to speak.
pip install voxon
voxon "Hello, this is Voxon."
Overview
voxon turns a 15-second audio clip into a permanent voice identity that you can synthesise speech through at any time. It runs entirely on-device — no cloud, no API keys, no data leaving your machine.
The synthesis engine loads into memory once and stays there as a background daemon. Subsequent calls have sub-second dispatch overhead regardless of text length.
Backends supported:
- Qwen3-TTS 1.7B (default): Alibaba's 1.7B-parameter multilingual TTS model, quantised to 8-bit for Apple Silicon via MLX
- ChatterboxTurbo: ResembleAI's fast, high-quality voice cloning model (opt-in via VOXON_MODEL=chatterbox-turbo)
Requirements: macOS 13+, Apple Silicon (M1 or later), Python 3.11+
Prerequisites
System: macOS 13+, Apple Silicon (M1 or later), Python 3.11+
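The system requirements can be checked from Python before installing. This is a minimal sketch using only the standard library; `check_requirements` is an illustrative helper, not something voxon ships. Note that a Python interpreter running under Rosetta reports `x86_64` even on Apple Silicon hardware.

```python
import platform
import sys


def check_requirements(system, machine, mac_version, py_version):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if system != "Darwin":
        problems.append("macOS is required")
    if machine != "arm64":
        problems.append("Apple Silicon (arm64) is required")
    # mac_version looks like "14.5" or "13.6.1"; compare the major part only.
    major = int(mac_version.split(".")[0]) if mac_version else 0
    if system == "Darwin" and major < 13:
        problems.append("macOS 13 or later is required")
    if tuple(py_version[:2]) < (3, 11):
        problems.append("Python 3.11 or later is required")
    return problems


if __name__ == "__main__":
    issues = check_requirements(
        platform.system(),
        platform.machine(),
        platform.mac_ver()[0],
        sys.version_info,
    )
    print("OK" if not issues else "\n".join(issues))
```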
HuggingFace account (free, one-time):
The TTS model weights are hosted on HuggingFace and require a free account and a one-click licence acceptance before the first download.
- Create a free account at huggingface.co
- Accept the model licence — one click, no payment:
  - Qwen3-TTS (default): huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base
- Generate a token at huggingface.co/settings/tokens (read-only is sufficient)
- Log in once from your terminal:
pip install huggingface_hub
huggingface-cli login
After this, model weights download automatically on first use and are cached locally — you never need to do this again.
If you skip this step, voxon will print a clear error with the exact URL to visit when you first try to synthesise.
Installation
pip install voxon
The installer pulls all required dependencies automatically, including MLX, PyTorch (CPU), faster-whisper, and the audio processing libraries. No manual configuration is needed.
First run: On the first synthesis call, model weights (~1–3 GB depending on backend) are downloaded from HuggingFace. This happens once and is cached locally.
Quick Start
Step 1 — Prepare a reference voice
Record or export approximately 15 seconds of clean speech as any audio file (WAV, MP3, FLAC, AIFF). Then run:
voxon prep my_recording.wav
This processes the audio through a five-stage pipeline:
- Loads and trims to the specified window
- Resamples to 24 kHz mono
- Applies stationary noise reduction
- Normalises peak amplitude
- Auto-transcribes using Whisper large-v3-turbo
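To make the trimming and peak-normalisation stages concrete, here is a rough sketch on a plain list of float samples. It is an illustration of the two operations only, not voxon's implementation; the function names and the `target_peak` value of 0.95 are assumptions.

```python
def trim(samples, sample_rate, start_s=0.0, duration_s=15.0):
    """Keep the window [start_s, start_s + duration_s) of a mono signal."""
    first = int(start_s * sample_rate)
    last = first + int(duration_s * sample_rate)
    return samples[first:last]


def normalise_peak(samples, target_peak=0.95):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak  # assumed target; voxon's value is not documented
    return [s * gain for s in samples]
```

The `start_s` and `duration_s` parameters mirror the `--start` and `--duration` flags of `voxon prep`.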
Critical: Open the generated .txt file in ~/.voxon/voices/ and verify every word matches the audio exactly — including hesitations and fillers. Transcript accuracy directly determines cloning fidelity.
# Use a specific time window
voxon prep recording.wav --start 30 --duration 15
# Skip noise reduction (for already-clean audio)
voxon prep recording.wav --no-denoise
# Transcribe manually (write the .txt yourself after)
voxon prep recording.wav --no-transcribe
Step 2 — Speak
voxon "Hello, this is Voxon speaking."
On the first invocation, voxon starts the synthesis daemon, waits for it to warm up (model load + Metal shader compilation), then synthesises and plays the audio. Subsequent calls skip the warm-up and go straight to synthesis.
# Use a specific voice
voxon --voice alan "Hello world"
# Save the audio to a file
voxon "Hello world" --save output.wav
# Both play and save
voxon "This sentence will be played and saved." --save sentence.wav
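If you drive the daemon from a script, the warm-up wait described in Step 2 can be approximated by polling the `/health` endpoint. The sketch below assumes `/health` answers with HTTP 200 once the daemon is ready, which this README does not spell out; `http_probe` and `wait_until_ready` are illustrative names, not part of voxon.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # daemon not listening yet, or not ready


def wait_until_ready(probe, attempts=60, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay_s)
    return False
```

A typical call would be `wait_until_ready(lambda: http_probe("http://localhost:7860/health"))`, using the default port from the Configuration section.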
Command Reference
voxon "<text>" — Synthesise
voxon [--voice NAME] [--save FILE] "<text>"
| Flag | Description |
|---|---|
| `--voice NAME` | Use a specific prepared voice. Restarts the daemon if the voice changes. |
| `--save FILE` | Save synthesised audio to this WAV file in addition to playing. |
voxon prep <file> — Prepare a voice
voxon prep <FILE> [--start SECONDS] [--duration SECONDS]
[--out-wav PATH] [--out-transcript PATH]
[--no-denoise] [--no-transcribe]
| Flag | Default | Description |
|---|---|---|
| `--start` | `0` | Start offset in seconds into the source file. |
| `--duration` | `15` | Duration to extract in seconds. |
| `--out-wav` | `~/.voxon/voices/<name>_clean.wav` | Output WAV path. |
| `--out-transcript` | `~/.voxon/voices/<name>.txt` | Output transcript path. |
| `--no-denoise` | - | Skip noise reduction. |
| `--no-transcribe` | - | Skip Whisper transcription (write the .txt manually). |
voxon voices — List voices
voxon voices
Lists all prepared voices in ~/.voxon/voices/, indicating which is the current default and which is loaded in the running daemon.
voxon status — Daemon status
voxon status
Prints the daemon's current state: online/offline, active voice, backend model, and whether voice embeddings are cached.
voxon stop — Stop the daemon
voxon stop
Sends SIGTERM to the background daemon and removes the PID file.
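The same stop sequence can be sketched in a few lines of Python. This assumes `daemon.pid` contains a single decimal PID (the file format is not documented here); `stop_daemon` is a hypothetical helper, and the `kill` parameter exists only so the function can be exercised without signalling a real process.

```python
import os
import signal
from pathlib import Path


def stop_daemon(pid_file, kill=os.kill):
    """Send SIGTERM to the PID stored in pid_file, then delete the file.

    Returns the PID that was signalled, or None if there was nothing to stop.
    """
    path = Path(pid_file)
    try:
        pid = int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None  # no PID file, or its content is not a number
    try:
        kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # stale PID file: the process is already gone
    path.unlink(missing_ok=True)
    return pid
```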
Voice Files
All runtime state is stored under ~/.voxon/:
~/.voxon/
├── config # Active voice selection and other persisted settings
├── daemon.pid # PID of the running synthesis daemon
├── daemon.log # Daemon stdout/stderr — check this on errors
└── voices/
    ├── myvoice_clean.wav # Cleaned reference audio
    ├── myvoice.txt # Exact transcript
    ├── alan_clean.wav
    └── alan.txt
To add a pre-existing voice (if you already have a clean WAV and transcript), copy the files to ~/.voxon/voices/ manually and run voxon voices to confirm detection.
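Before copying files in by hand, it can help to confirm that every WAV has a matching transcript. The sketch below relies only on the `<name>_clean.wav` / `<name>.txt` naming shown in the tree above; `find_voices` is an illustrative helper and may not match how `voxon voices` detects voices internally.

```python
from pathlib import Path


def find_voices(voices_dir):
    """Return {name: has_transcript} for every <name>_clean.wav found."""
    voices = {}
    root = Path(voices_dir).expanduser()
    for wav in sorted(root.glob("*_clean.wav")):
        name = wav.name[: -len("_clean.wav")]
        # A voice is complete only when its transcript sits next to the WAV.
        voices[name] = (root / f"{name}.txt").is_file()
    return voices
```

Calling `find_voices("~/.voxon/voices")` returns a mapping in which any `False` value marks a voice whose transcript is missing.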
Configuration
voxon respects the following environment variables:
| Variable | Default | Description |
|---|---|---|
| `VOXON_PORT` | `7860` | TCP port the synthesis daemon listens on. |
| `VOXON_MODEL` | `qwen3` | TTS backend (`qwen3` or `chatterbox-turbo`). |
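A wrapper script can resolve the two variables the same way before talking to the daemon. This sketch takes `qwen3` as the default backend, following the Overview, which lists Qwen3-TTS as the default and ChatterboxTurbo as opt-in; `load_settings` and `KNOWN_MODELS` are illustrative names, not voxon's API.

```python
import os

# Backend identifiers as they appear in this README.
KNOWN_MODELS = ("qwen3", "chatterbox-turbo")


def load_settings(env=os.environ):
    """Resolve port and backend from environment variables with defaults."""
    port = int(env.get("VOXON_PORT", "7860"))
    if not 1 <= port <= 65535:
        raise ValueError(f"VOXON_PORT out of range: {port}")
    model = env.get("VOXON_MODEL", "qwen3")  # assumed default, see lead-in
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown VOXON_MODEL: {model!r}")
    return {"port": port, "model": model}
```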
HTTP API
The daemon exposes a local HTTP API that you can query directly:
# Full WAV download
curl "http://localhost:7860/synthesize?text=Hello+world" -o out.wav
# Chunked streaming (lowest latency for long text)
curl -sN "http://localhost:7860/stream_chunked?text=Hello+world" -o out.wav
# Daemon health
curl http://localhost:7860/health
Endpoints:
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Returns daemon status and readiness. |
| GET | `/synthesize?text=` | Full synthesis, returns complete WAV. |
| GET | `/stream?text=` | Streaming WAV (synthesises first, then streams). |
| GET | `/stream_chunked?text=` | True sentence-level streaming; the first audio arrives after the first sentence. |
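The same endpoints can be called from Python with the standard library. The main thing to get right is percent-encoding the `text` parameter, since punctuation such as `&` or `,` would otherwise corrupt the query string. This is a sketch under the assumption that `/synthesize` returns raw WAV bytes in the response body, as the table suggests; `endpoint_url` and `synthesize_to_file` are illustrative names.

```python
import urllib.parse
import urllib.request


def endpoint_url(path, text, host="localhost", port=7860):
    """Build a daemon URL with the text safely percent-encoded."""
    query = urllib.parse.urlencode({"text": text})
    return f"http://{host}:{port}{path}?{query}"


def synthesize_to_file(text, out_path, port=7860, timeout=120.0):
    """GET /synthesize and write the returned bytes to out_path."""
    url = endpoint_url("/synthesize", text, port=port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(data)
    return len(data)  # number of bytes written
```

With the daemon running, `synthesize_to_file("Hello world", "out.wav")` is the counterpart of the first curl command above.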
Performance
All measurements on a single-speaker English sentence (~80 chars).
| Mac | Model load | Warmup | Synthesis (1 sentence) |
|---|---|---|---|
| M1 8 GB | ~30 s | ~5 s | ~3–5 s |
| M2 16 GB | ~20 s | ~4 s | ~2–4 s |
| M3 16 GB | ~15 s | ~3 s | ~1–3 s |
| M4 16 GB | ~10 s | ~2 s | ~1–2 s |
Model load and warmup happen once per daemon start. Voice embedding cache pre-computation (also once at startup) eliminates the dominant per-synthesis overhead on subsequent calls.
The real-time factor (RTF) is synthesis time divided by the duration of the audio produced; an RTF below 1.0 means synthesis is faster than real-time.
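To compute the real-time factor for your own machine, time a synthesis call and divide by the length of the resulting audio. A minimal helper, with `real_time_factor` as an illustrative name:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Return synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds
```

For example, taking 2 seconds to synthesise 4 seconds of speech gives an RTF of 0.5.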
Tips for Best Quality
Reference audio:
- 15 seconds, single speaker, no background music, quiet environment
- Any consistent microphone works — quality matters less than consistency
Transcript:
- Must match word-for-word, including "um", "uh", false starts, and any audible sounds
- A single wrong word or missing word will degrade output quality noticeably
Input text:
- Sentences of 15–80 characters synthesise fastest; longer inputs are auto-split
- Punctuation matters — commas and periods control synthesis rhythm
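If you prefer to control chunking yourself rather than rely on auto-splitting, long text can be pre-split on punctuation before it is sent. The sketch below is one simple approach and is not voxon's own splitter, whose rules are not documented here; `split_sentences` and the 80-character limit (taken from the tip above) are assumptions.

```python
import re


def split_sentences(text, max_chars=80):
    """Split on sentence-ending punctuation, then on commas if still long."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            chunks.append(sentence)
            continue
        # Fall back to comma boundaries for over-long sentences.
        parts = re.split(r"(?<=,)\s+", sentence)
        chunks.extend(p.strip() for p in parts if p.strip())
    return chunks
```

A chunk with no comma stays longer than `max_chars`; the sketch never cuts inside a word.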
Troubleshooting
Daemon fails to start:
cat ~/.voxon/daemon.log
Voice not found:
voxon voices # list all prepared voices
Port conflict:
VOXON_PORT=7861 voxon "Hello world"
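To find out whether something is already listening on a port before choosing another one, a quick TCP connection attempt is enough. `port_is_free` is an illustrative helper, and it only checks the IPv4 loopback address, on the assumption that the daemon binds locally.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success, an error code otherwise.
        return sock.connect_ex((host, port)) != 0
```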
Model download stalls: The first run downloads model weights from HuggingFace. Check your network connection. Progress is visible in ~/.voxon/daemon.log.
ChatterboxTurbo "reference clip too short": The clip must be strictly longer than 5 seconds. Re-run voxon prep with a longer --duration.
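You can check a prepared clip's length before switching backends. This sketch uses the standard `wave` module, which reads uncompressed PCM WAV files only, and assumes the cleaned reference clip is in that format; `wav_duration_seconds` and `long_enough` are illustrative names.

```python
import wave

MIN_SECONDS = 5.0  # threshold quoted in the troubleshooting note above


def wav_duration_seconds(path):
    """Return the clip length in seconds (frames divided by frame rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def long_enough(path, minimum=MIN_SECONDS):
    """True only if the clip is strictly longer than the minimum."""
    return wav_duration_seconds(path) > minimum
```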
License
MIT — see LICENSE.