
High-fidelity voice cloning TTS for Apple Silicon — powered by Qwen3-TTS and ChatterboxTurbo via MLX


voxon


High-fidelity, real-time voice cloning on Apple Silicon.

One command to install. One command to speak.

pip install voxon
voxon "Hello, this is Voxon."

Overview

voxon turns a 15-second audio clip into a reusable voice identity that you can synthesise speech with at any time. It runs entirely on-device: no cloud, no API keys, no data leaving your machine.

The synthesis engine loads into memory once and stays there as a background daemon. Subsequent calls have sub-second dispatch overhead regardless of text length.

Backends supported:

  • Qwen3-TTS 1.7B (default) — Alibaba's 1.7B parameter multilingual TTS model, quantised to 8-bit for Apple Silicon via MLX
  • ChatterboxTurbo — ResembleAI's fast, high-quality voice cloning model (opt-in via VOXON_MODEL=chatterbox-turbo)

Requirements: macOS 13+, Apple Silicon (M1 or later), Python 3.11+


Prerequisites

System: macOS 13+, Apple Silicon (M1 or later), Python 3.11+

HuggingFace account (free, one-time):

The TTS model weights are hosted on HuggingFace and require a free account and a one-click licence acceptance before the first download.

  1. Create a free account at huggingface.co
  2. Accept the model licence — one click, no payment:
  3. Generate a token at huggingface.co/settings/tokens (read-only is sufficient)
  4. Log in once from your terminal:
pip install huggingface_hub
huggingface-cli login

After this, model weights download automatically on first use and are cached locally — you never need to do this again.

If you skip this step, voxon will print a clear error with the exact URL to visit when you first try to synthesise.


Installation

pip install voxon

The installer pulls all required dependencies automatically, including MLX, PyTorch (CPU), faster-whisper, and the audio processing libraries. No manual configuration is needed.

First run: On the first synthesis call, model weights (~1–3 GB depending on backend) are downloaded from HuggingFace. This happens once and is cached locally.


Quick Start

Step 1 — Prepare a reference voice

Record or export approximately 15 seconds of clean speech as an audio file (WAV, MP3, FLAC, or AIFF). Then run:

voxon prep my_recording.wav

This processes the audio through a five-stage pipeline:

  1. Loads the audio and trims it to the selected window (--start and --duration; by default the first 15 seconds)
  2. Resamples to 24 kHz mono
  3. Applies stationary noise reduction
  4. Normalises peak amplitude
  5. Auto-transcribes using Whisper large-v3-turbo
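
Two of the stages above, trimming to a window and peak normalisation, are simple enough to illustrate in a few lines. The following is a minimal sketch in plain Python, not voxon's actual implementation: the function names and the 0.95 target peak are assumptions made for illustration, and the samples are assumed to be floats in the range -1.0 to 1.0.

```python
def trim_window(samples, sample_rate, start_s=0.0, duration_s=15.0):
    """Return the samples inside [start_s, start_s + duration_s)."""
    first = int(start_s * sample_rate)
    last = first + int(duration_s * sample_rate)
    return samples[first:last]


def normalise_peak(samples, target_peak=0.95):
    """Scale samples so the largest absolute value equals target_peak.

    target_peak=0.95 is an illustrative choice, not a documented voxon setting.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Trimming before normalising means the gain is computed only from the window that ends up in the reference clip, so a loud sound elsewhere in the recording does not affect it.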

Critical: Open the generated .txt file in ~/.voxon/voices/ and verify every word matches the audio exactly — including hesitations and fillers. Transcript accuracy directly determines cloning fidelity.

# Use a specific time window
voxon prep recording.wav --start 30 --duration 15

# Skip noise reduction (for already-clean audio)
voxon prep recording.wav --no-denoise

# Skip auto-transcription (write the .txt yourself afterwards)
voxon prep recording.wav --no-transcribe

Step 2 — Speak

voxon "Hello, this is Voxon speaking."

On the first invocation, voxon starts the synthesis daemon, waits for it to warm up (model load + Metal shader compilation), then synthesises and plays the audio. Subsequent calls skip the warm-up and begin synthesising straight away.
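
A script that talks to the daemon directly may want to wait for the warm-up itself. The sketch below polls the /health endpoint listed in the HTTP API section until it answers. It assumes that an HTTP 200 response means the daemon is ready, which the documentation does not confirm (the response body format is not described), and the function names are hypothetical.

```python
import os
import time
import urllib.request

# VOXON_PORT and its default of 7860 come from the Configuration section.
PORT = int(os.environ.get("VOXON_PORT", "7860"))


def health_ok(port=PORT, timeout=2.0):
    """Return True if GET /health answers with HTTP 200 (assumed to mean ready)."""
    url = f"http://localhost:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, refused connections, and timeouts
        return False


def wait_until_ready(probe=health_ok, timeout_s=120.0, interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The probe, sleep, and clock arguments are injectable so the loop can be exercised without a running daemon.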

# Use a specific voice
voxon --voice alan "Hello world"

# Play the audio and also save it to a WAV file
voxon "Hello world" --save output.wav

# --save always plays as well as saves
voxon "This sentence will be played and saved." --save sentence.wav

Command Reference

voxon "<text>" — Synthesise

voxon [--voice NAME] [--save FILE] "<text>"
Flag          Description
--voice NAME  Use a specific prepared voice. Restarts the daemon if the voice changes.
--save FILE   Save synthesised audio to this WAV file in addition to playing.

voxon prep <file> — Prepare a voice

voxon prep <FILE> [--start SECONDS] [--duration SECONDS]
                   [--out-wav PATH] [--out-transcript PATH]
                   [--no-denoise] [--no-transcribe]
Flag              Default                           Description
--start           0                                 Start offset in seconds into the source file.
--duration        15                                Duration to extract in seconds.
--out-wav         ~/.voxon/voices/<name>_clean.wav  Output WAV path.
--out-transcript  ~/.voxon/voices/<name>.txt        Output transcript path.
--no-denoise      -                                 Skip noise reduction.
--no-transcribe   -                                 Skip Whisper transcription (write the .txt manually).

voxon voices — List voices

voxon voices

Lists all prepared voices in ~/.voxon/voices/, indicating which is the current default and which is loaded in the running daemon.

voxon status — Daemon status

voxon status

Prints the daemon's current state: online/offline, active voice, backend model, and whether voice embeddings are cached.

voxon stop — Stop the daemon

voxon stop

Sends SIGTERM to the background daemon and removes the PID file.
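
The same stop sequence can be reproduced from a script. This is a minimal sketch under the assumption that ~/.voxon/daemon.pid holds the PID as plain decimal text; the function name is hypothetical, and voxon's own implementation may differ.

```python
import os
import signal
from pathlib import Path

PID_FILE = Path.home() / ".voxon" / "daemon.pid"


def stop_daemon(pid_file=PID_FILE, kill=os.kill):
    """Send SIGTERM to the PID stored in pid_file, then delete the file.

    Returns True if the signal was delivered, False if there was no usable
    PID file or the recorded process had already exited.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no PID file, or its contents are not a number
    try:
        kill(pid, signal.SIGTERM)
        sent = True
    except ProcessLookupError:
        sent = False  # stale PID file: the process is already gone
    pid_file.unlink(missing_ok=True)
    return sent
```

SIGTERM gives the daemon a chance to shut down cleanly; removing the PID file afterwards keeps a later start from mistaking a stale PID for a live daemon.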


Voice Files

All runtime state is stored under ~/.voxon/:

~/.voxon/
├── config              # Active voice selection and other persisted settings
├── daemon.pid          # PID of the running synthesis daemon
├── daemon.log          # Daemon stdout/stderr — check this on errors
└── voices/
    ├── myvoice_clean.wav   # Cleaned reference audio
    ├── myvoice.txt         # Exact transcript
    ├── alan_clean.wav
    └── alan.txt

To add a pre-existing voice (if you already have a clean WAV and transcript), copy the files into ~/.voxon/voices/ using the <name>_clean.wav and <name>.txt naming shown above, then run voxon voices to confirm detection.
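
Before running voxon voices, a quick script can check that every reference WAV has a transcript next to it. The sketch below assumes a voice is identified by a <name>_clean.wav and <name>.txt pair, which is inferred from the directory listing and the default output paths of voxon prep rather than from a documented detection rule. The function name is hypothetical.

```python
from pathlib import Path

VOICES_DIR = Path.home() / ".voxon" / "voices"


def find_voices(voices_dir=VOICES_DIR):
    """Return {name: has_transcript} for every <name>_clean.wav in voices_dir."""
    suffix = "_clean.wav"
    voices = {}
    for wav in sorted(voices_dir.glob("*" + suffix)):
        name = wav.name[: -len(suffix)]
        voices[name] = (voices_dir / f"{name}.txt").is_file()
    return voices


def missing_transcripts(voices_dir=VOICES_DIR):
    """Return the names of voices that have a WAV but no transcript."""
    return [name for name, ok in find_voices(voices_dir).items() if not ok]
```

Any name returned by missing_transcripts still needs its .txt file written before the voice can be cloned accurately.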


Configuration

voxon respects the following environment variables:

Variable     Default  Description
VOXON_PORT   7860     TCP port the synthesis daemon listens on.
VOXON_MODEL  qwen3    Default TTS backend (qwen3 or chatterbox-turbo).

HTTP API

The daemon exposes a local HTTP API that you can query directly:

# Full WAV download
curl "http://localhost:7860/synthesize?text=Hello+world" -o out.wav

# Chunked streaming (lowest latency for long text)
curl -sN "http://localhost:7860/stream_chunked?text=Hello+world" -o out.wav

# Daemon health
curl http://localhost:7860/health

Endpoints:

Method  Path                   Description
GET     /health                Returns daemon status and readiness.
GET     /synthesize?text=      Full synthesis, returns complete WAV.
GET     /stream?text=          Streaming WAV (synthesises first, then streams).
GET     /stream_chunked?text=  True sentence-level streaming; first audio arrives after the first sentence.
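
The endpoints above can also be called from Python with only the standard library. This is a minimal client sketch: the function names are hypothetical, error handling is omitted, and it assumes the daemon is already running on the port given by VOXON_PORT (7860 if unset).

```python
import os
import urllib.parse
import urllib.request

PORT = int(os.environ.get("VOXON_PORT", "7860"))


def endpoint_url(path, text, port=PORT):
    """Build a URL such as http://localhost:7860/synthesize?text=Hello+world."""
    query = urllib.parse.urlencode({"text": text})  # percent-encodes the text
    return f"http://localhost:{port}/{path}?{query}"


def synthesize_to_file(text, out_path, port=PORT):
    """Fetch a complete WAV from /synthesize and write it to out_path."""
    with urllib.request.urlopen(endpoint_url("synthesize", text, port)) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)  # number of bytes written


def stream_chunks(text, port=PORT, chunk_size=4096):
    """Yield raw bytes from /stream_chunked as they arrive."""
    with urllib.request.urlopen(endpoint_url("stream_chunked", text, port)) as response:
        while chunk := response.read(chunk_size):
            yield chunk
```

For example, synthesize_to_file("Hello world", "out.wav") mirrors the first curl command, while iterating over stream_chunks(...) lets a caller begin processing audio before the whole text has been synthesised.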

Performance

All measurements use a single English sentence of about 80 characters.

Mac       Model load  Warmup  Synthesis (1 sentence)
M1 8 GB   ~30 s       ~5 s    ~3–5 s
M2 16 GB  ~20 s       ~4 s    ~2–4 s
M3 16 GB  ~15 s       ~3 s    ~1–3 s
M4 16 GB  ~10 s       ~2 s    ~1–2 s

Model load and warmup happen once per daemon start. The voice embedding is also pre-computed and cached once at startup, which removes the dominant per-synthesis overhead from subsequent calls.

RTF (real-time factor) is the synthesis time divided by the duration of the audio produced. An RTF below 1.0 means synthesis is faster than real time.
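
As a small worked example, RTF can be computed for a saved file by timing the synthesis call and reading the clip length from the WAV header. The helper names below are hypothetical, and the numbers in the usage note are illustrative rather than measurements.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds, read from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(synthesis_seconds, audio_seconds):
    """Return synthesis time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds
```

If synthesis took 2 seconds and produced a 4-second clip, real_time_factor(2.0, 4.0) gives 0.5, meaning the audio was generated twice as fast as it plays back.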


Tips for Best Quality

Reference audio:

  • 15 seconds, single speaker, no background music, quiet environment
  • Any consistent microphone works — quality matters less than consistency

Transcript:

  • Must match word-for-word, including "um", "uh", false starts, and any audible sounds
  • A single wrong word or missing word will degrade output quality noticeably

Input text:

  • Sentences of 15–80 characters synthesise fastest; longer inputs are auto-split
  • Punctuation matters — commas and periods control synthesis rhythm
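
To preview how a long input might be broken up, the text can be split at sentence-ending punctuation. The README does not document voxon's actual splitting rule, so the sketch below is only a naive approximation: it splits on whitespace that follows ".", "!", or "?", and it will mis-split abbreviations such as "Dr. Smith". The function name is hypothetical.

```python
import re

# Split on whitespace that directly follows sentence-ending punctuation.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the sentences in text, keeping their closing punctuation."""
    return [part for part in _SENTENCE_BREAK.split(text.strip()) if part]


def outside_fast_range(sentences, low=15, high=80):
    """Return the sentences whose length falls outside the 15-80 character range."""
    return [s for s in sentences if not low <= len(s) <= high]
```

Running outside_fast_range(split_sentences(text)) on a draft highlights sentences that are shorter or longer than the range described above as fastest to synthesise.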

Troubleshooting

Daemon fails to start:

cat ~/.voxon/daemon.log

Voice not found:

voxon voices   # list all prepared voices

Port conflict:

VOXON_PORT=7861 voxon "Hello world"

Model download stalls: The first run downloads model weights from HuggingFace. Check your network connection. Progress is visible in ~/.voxon/daemon.log.

ChatterboxTurbo "reference clip too short": The clip must be strictly longer than 5 seconds. Re-run voxon prep with a longer --duration.


License

MIT — see LICENSE.
