High-fidelity voice cloning TTS for Apple Silicon — powered by Qwen3-TTS and ChatterboxTurbo via MLX

voxon

High-fidelity, real-time voice cloning on Apple Silicon.

One command to install. One command to speak.

pip install voxon
voxon "Hello, this is Voxon."

Overview

voxon turns a 15-second audio clip into a permanent voice identity that you can synthesise speech through at any time. It runs entirely on-device — no cloud, no API keys, no data leaving your machine.

The synthesis engine loads into memory once and stays there as a background daemon. Subsequent calls have sub-second dispatch overhead regardless of text length.

Backends supported:

Qwen3-TTS 1.7B (default) — Alibaba's 1.7B parameter multilingual TTS model, quantised to 8-bit for Apple Silicon via MLX
ChatterboxTurbo — ResembleAI's fast, high-quality voice cloning model (opt-in via VOXON_MODEL=chatterbox-turbo)

Requirements: macOS 13+, Apple Silicon (M1 or later), Python 3.11+

Prerequisites

System: macOS 13+, Apple Silicon (M1 or later), Python 3.11+

HuggingFace account (free, one-time):

The TTS model weights are hosted on HuggingFace and require a free account and a one-click licence acceptance before the first download.

1. Create a free account at huggingface.co
2. Accept the model licence (one click, no payment):
   - Qwen3-TTS (default): huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base
3. Generate a token at huggingface.co/settings/tokens (read-only is sufficient)
4. Log in once from your terminal:

pip install huggingface_hub
huggingface-cli login

After this, model weights download automatically on first use and are cached locally — you never need to do this again.

If you skip this step, voxon will print a clear error with the exact URL to visit when you first try to synthesise.

Installation

pip install voxon

The installer pulls all required dependencies automatically, including MLX, PyTorch (CPU), faster-whisper, and the audio processing libraries. No manual configuration is needed.

First run: On the first synthesis call, model weights (~1–3 GB depending on backend) are downloaded from HuggingFace. This happens once and is cached locally.

Quick Start

Step 1 — Prepare a reference voice

Record or export approximately 15 seconds of clean speech as any audio file (WAV, MP3, FLAC, AIFF). Then run:

voxon prep my_recording.wav

This processes the audio through a five-stage pipeline:

1. Loads and trims to the specified window
2. Resamples to 24 kHz mono
3. Applies stationary noise reduction
4. Normalises peak amplitude
5. Auto-transcribes using Whisper large-v3-turbo

Critical: Open the generated .txt file in ~/.voxon/voices/ and verify every word matches the audio exactly — including hesitations and fillers. Transcript accuracy directly determines cloning fidelity.

# Use a specific time window
voxon prep recording.wav --start 30 --duration 15

# Skip noise reduction (for already-clean audio)
voxon prep recording.wav --no-denoise

# Transcribe manually (write the .txt yourself after)
voxon prep recording.wav --no-transcribe
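
To make the trimming and peak-normalisation stages of the prep pipeline concrete, here is a minimal pure-Python sketch. It is not voxon's implementation: the function names `trim_window` and `normalise_peak` and the `target_peak` value are assumptions chosen for illustration, and resampling, noise reduction, and transcription are left out.

```python
from typing import List


def trim_window(samples: List[float], sample_rate: int,
                start_s: float = 0.0, duration_s: float = 15.0) -> List[float]:
    """Return the samples in [start_s, start_s + duration_s)."""
    first = int(start_s * sample_rate)
    last = first + int(duration_s * sample_rate)
    return samples[first:last]


def normalise_peak(samples: List[float], target_peak: float = 0.95) -> List[float]:
    """Scale so the largest absolute sample equals target_peak (assumed value)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


# Toy signal at 4 samples per second so the arithmetic is easy to follow.
signal = [0.0, 0.1, -0.2, 0.4, -0.5, 0.25, 0.0, 0.1]
window = trim_window(signal, sample_rate=4, start_s=0.5, duration_s=1.0)
cleaned = normalise_peak(window, target_peak=1.0)
```

The defaults mirror the documented `--start 0` and `--duration 15`; a real pipeline would operate on arrays read from the audio file rather than Python lists.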

Step 2 — Speak

voxon "Hello, this is Voxon speaking."

On the first invocation, voxon starts the synthesis daemon, waits for it to warm up (model load + Metal shader compilation), then synthesises and plays the audio. Subsequent calls reuse the running daemon and skip the warm-up.

# Use a specific voice
voxon --voice alan "Hello world"

# Save the audio to a WAV file (playback still happens)
voxon "Hello world" --save output.wav

# Another example: the sentence is played and written to sentence.wav
voxon "This sentence will be played and saved." --save sentence.wav
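
If you want to call voxon from a script, one option is to build the argument list in Python and hand it to `subprocess`. The sketch below only assembles the command from the `--voice` and `--save` flags shown above; `build_speak_command` is a hypothetical helper, not part of voxon.

```python
import shlex
from typing import List, Optional


def build_speak_command(text: str, voice: Optional[str] = None,
                        save: Optional[str] = None) -> List[str]:
    """Assemble a `voxon` argument list from the documented flags."""
    argv = ["voxon"]
    if voice is not None:
        argv += ["--voice", voice]
    if save is not None:
        argv += ["--save", save]
    argv.append(text)  # the text to synthesise goes last
    return argv


command = build_speak_command("Hello world", voice="alan", save="output.wav")
print(shlex.join(command))  # shell-quoted form, handy for logging

# To actually run it (requires voxon to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when the text contains quotes or other special characters.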

Command Reference

`voxon "<text>"` — Synthesise

voxon [--voice NAME] [--save FILE] "<text>"

Flag	Description
`--voice NAME`	Use a specific prepared voice. Restarts the daemon if the voice changes.
`--save FILE`	Save synthesised audio to this WAV file in addition to playing.

`voxon prep <file>` — Prepare a voice

voxon prep <FILE> [--start SECONDS] [--duration SECONDS]
                   [--out-wav PATH] [--out-transcript PATH]
                   [--no-denoise] [--no-transcribe]

Flag	Default	Description
`--start`	`0`	Start offset in seconds into the source file.
`--duration`	`15`	Duration to extract in seconds.
`--out-wav`	`~/.voxon/voices/<name>_clean.wav`	Output WAV path.
`--out-transcript`	`~/.voxon/voices/<name>.txt`	Output transcript path.
`--no-denoise`	—	Skip noise reduction.
`--no-transcribe`	—	Skip Whisper transcription (write the `.txt` manually).

`voxon voices` — List voices

voxon voices

Lists all prepared voices in ~/.voxon/voices/, indicating which is the current default and which is loaded in the running daemon.

`voxon status` — Daemon status

voxon status

Prints the daemon's current state: online/offline, active voice, backend model, and whether voice embeddings are cached.

`voxon stop` — Stop the daemon

voxon stop

Sends SIGTERM to the background daemon and removes the PID file.
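
The behaviour described above can be sketched in a few lines of Python. This is an illustration rather than voxon's code: it assumes the PID file holds the process ID as plain decimal text, and `stop_daemon` is a hypothetical name.

```python
import os
import signal
from pathlib import Path


def stop_daemon(pid_file: Path) -> bool:
    """Send SIGTERM to the PID stored in pid_file, then delete the file.

    Returns True if a signal was delivered, False if there was nothing to stop.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no PID file, or contents are not a number
    try:
        os.kill(pid, signal.SIGTERM)
        delivered = True
    except ProcessLookupError:
        delivered = False  # stale PID file: the process is already gone
    pid_file.unlink(missing_ok=True)
    return delivered
```

Under the layout described in this README, the call would be `stop_daemon(Path.home() / ".voxon" / "daemon.pid")`.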

Voice Files

All runtime state is stored under ~/.voxon/:

~/.voxon/
├── config              # Active voice selection and other persisted settings
├── daemon.pid          # PID of the running synthesis daemon
├── daemon.log          # Daemon stdout/stderr — check this on errors
└── voices/
    ├── myvoice_clean.wav   # Cleaned reference audio
    ├── myvoice.txt         # Exact transcript
    ├── alan_clean.wav
    └── alan.txt

To add a pre-existing voice (if you already have a clean WAV and transcript), copy the files to ~/.voxon/voices/ manually and run voxon voices to confirm detection.
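
Before copying files in by hand, it can help to check that each voice has both of its files. The sketch below scans a directory for `<name>_clean.wav` and `<name>.txt` pairs, following the naming shown in the tree above. `find_voices` is a hypothetical helper; how `voxon voices` detects voices internally is not documented here.

```python
from pathlib import Path
from typing import List


def find_voices(voices_dir: Path) -> List[str]:
    """Return names that have both <name>_clean.wav and <name>.txt."""
    suffix = "_clean.wav"
    names = []
    for wav in sorted(voices_dir.glob(f"*{suffix}")):
        name = wav.name[: -len(suffix)]
        if (voices_dir / f"{name}.txt").is_file():
            names.append(name)  # only complete pairs count
    return names
```

For the default location, call `find_voices(Path.home() / ".voxon" / "voices")`.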

Configuration

voxon respects the following environment variables:

Variable	Default	Description
`VOXON_PORT`	`7860`	TCP port the synthesis daemon listens on.
`VOXON_MODEL`	`qwen3`	TTS backend to load (`qwen3` or `chatterbox-turbo`).
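
A client script that talks to the daemon should honour `VOXON_PORT` in the same way. The following sketch reads the variable with the documented default of `7860` and validates it; `daemon_port` and `daemon_base_url` are hypothetical helper names, and the range check is an added safeguard rather than documented voxon behaviour.

```python
import os
from typing import Mapping


def daemon_port(env: Mapping[str, str] = os.environ) -> int:
    """Read VOXON_PORT, falling back to the documented default of 7860."""
    port = int(env.get("VOXON_PORT", "7860"))  # ValueError if not numeric
    if not 1 <= port <= 65535:
        raise ValueError(f"VOXON_PORT out of range: {port}")
    return port


def daemon_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Base URL of the local daemon for the configured port."""
    return f"http://localhost:{daemon_port(env)}"
```

Taking the environment as a parameter keeps the helpers easy to test without touching the real process environment.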

HTTP API

The daemon exposes a local HTTP API that you can query directly:

# Full WAV download
curl "http://localhost:7860/synthesize?text=Hello+world" -o out.wav

# Chunked streaming (lowest latency for long text)
curl -sN "http://localhost:7860/stream_chunked?text=Hello+world" -o out.wav

# Daemon health
curl http://localhost:7860/health

Endpoints:

Method	Path	Description
`GET`	`/health`	Returns daemon status and readiness.
`GET`	`/synthesize?text=`	Full synthesis, returns complete WAV.
`GET`	`/stream?text=`	Streaming WAV (synthesises first, then streams).
`GET`	`/stream_chunked?text=`	True sentence-level streaming — first audio arrives after the first sentence.
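
The same endpoints can be called from Python with only the standard library. The sketch below builds a correctly encoded URL for any of the `?text=` endpoints and downloads a full WAV from `/synthesize`. The helper names are hypothetical, and the download function needs a running daemon.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def endpoint_url(path: str, text: str, port: int = 7860) -> str:
    """Build a daemon URL, form-encoding the text (spaces become '+')."""
    return f"http://localhost:{port}{path}?{urlencode({'text': text})}"


def synthesize_to_file(text: str, out_path: str, port: int = 7860) -> None:
    """Fetch a complete WAV from /synthesize and write it to out_path."""
    with urlopen(endpoint_url("/synthesize", text, port)) as response, \
            open(out_path, "wb") as out:
        out.write(response.read())


print(endpoint_url("/synthesize", "Hello world"))
```

Encoding the text with `urlencode` matters once the input contains characters such as `&` or `?`, which would otherwise be read as part of the query string.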

Performance

All measurements use a single-speaker English sentence (~80 characters).

Mac	Model load	Warmup	Synthesis (1 sentence)
M1 8 GB	~30 s	~5 s	~3–5 s
M2 16 GB	~20 s	~4 s	~2–4 s
M3 16 GB	~15 s	~3 s	~1–3 s
M4 16 GB	~10 s	~2 s	~1–2 s

Model load and warmup happen once per daemon start. Voice embedding cache pre-computation (also once at startup) eliminates the dominant per-synthesis overhead on subsequent calls.

RTF (real-time factor) is synthesis time divided by the duration of the audio produced. A value below 1.0 means synthesis is faster than real time.
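
As a worked example of the real-time factor, the snippet below computes it from a synthesis time and an audio duration. The numbers are illustrative only and are not measurements from the table above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesising / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative: 2 s of compute for a 5 s clip is faster than real time.
rtf = real_time_factor(2.0, 5.0)
faster_than_real_time = rtf < 1.0
```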

Tips for Best Quality

Reference audio:

15 seconds, single speaker, no background music, quiet environment
Any consistent microphone works — quality matters less than consistency

Transcript:

Must match word-for-word, including "um", "uh", false starts, and any audible sounds
A single wrong word or missing word will degrade output quality noticeably

Input text:

Sentences of 15–80 characters synthesise fastest; longer inputs are auto-split
Punctuation matters — commas and periods control synthesis rhythm
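
To see roughly how a long input breaks down before synthesis, you can split it on sentence-ending punctuation yourself. This is a simple regex-based sketch, not voxon's actual splitting logic, which is not documented here; `split_sentences` is a hypothetical helper.

```python
import re
from typing import List

# Split on whitespace that follows ".", "!" or "?".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]


sentences = split_sentences("Hello world. How are you? Fine!")
lengths = [len(s) for s in sentences]  # compare against the 15-80 range
```

A splitter this simple will also break after abbreviations such as "Dr.", so treat it as a rough preview only.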

Troubleshooting

Daemon fails to start:

cat ~/.voxon/daemon.log

Voice not found:

voxon voices   # list all prepared voices

Port conflict:

VOXON_PORT=7861 voxon "Hello world"

Model download stalls: The first run downloads model weights from HuggingFace. Check your network connection. Progress is visible in ~/.voxon/daemon.log.

ChatterboxTurbo "reference clip too short": The clip must be strictly longer than 5 seconds. Re-run voxon prep with a longer --duration.
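
You can check a prepared clip's length before switching backends. The sketch below uses Python's standard `wave` module to measure a WAV file and applies the "strictly longer than 5 seconds" rule stated above. The function names are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def long_enough_for_chatterbox(path: str, minimum_s: float = 5.0) -> bool:
    """True only if the clip is strictly longer than minimum_s seconds."""
    return wav_duration_seconds(path) > minimum_s
```

The `wave` module reads uncompressed PCM WAV files, which should cover the cleaned reference audio that `voxon prep` writes.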

License

MIT — see LICENSE.

Release history

Version	Released
1.1.0	May 25, 2026
1.0.0	May 25, 2026
0.1.0 (this version)	May 25, 2026

Download files

Source Distribution

voxon-0.1.0.tar.gz (41.0 kB), uploaded May 25, 2026, Source

Built Distribution

voxon-0.1.0-py3-none-any.whl (40.7 kB), uploaded May 25, 2026, Python 3

File details

Details for the file voxon-0.1.0.tar.gz.

File metadata

Download URL: voxon-0.1.0.tar.gz
Upload date: May 25, 2026
Size: 41.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for voxon-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`cf94ed76d1c45aba9062f9a00b7c1dcb7a41b938c47871d71ff7f8e174738b4b`
MD5	`5f32ecfca9b8bd388b529bf39fa3b43e`
BLAKE2b-256	`69394828a6ed43dd3c4a31d2e5c0bc95b7bffeef53f5c1fce60f01dbf7d00583`

File details

Details for the file voxon-0.1.0-py3-none-any.whl.

File metadata

Download URL: voxon-0.1.0-py3-none-any.whl
Upload date: May 25, 2026
Size: 40.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for voxon-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c9e1ea4b048687c7495de4ce39bfb4a81de47e6ecc6c58c6ed035157701ae638`
MD5	`dffd5ca36a543b51574f671e09c1e674`
BLAKE2b-256	`2b003d997ec6a2e335091a367ff0040f1b7f6cd315e8c0bd8d91a02f1ab77f04`
