Pure-ONNX multi-engine voice-cloning library — no torch at runtime
voiceclonnx
Pure-ONNX voice conversion. 10 engines. Zero PyTorch at runtime.
Audio-to-audio only — voiceclonnx converts the voice in an existing speech file to sound like a reference speaker. Text-driven synthesis (text → cloned audio) is a TTS concern and is out of scope.
Why voiceclonnx
- Zero PyTorch at runtime. Every engine runs on onnxruntime, numpy, soundfile, and huggingface_hub only. No torch and no CUDA driver are required for inference.
- One install, every engine. pip install voiceclonnx activates all 10 engines immediately: no per-engine extras, no optional groups for inference.
- 10 distinct architectures, one API. kNN feature-swap, factorized codec, flow-matching, tone-color, AR codec-LM, speaker-decoupled codec, and any-to-ONE. Every engine measurably transfers the target voice, not just the words.
- STT- and speaker-verified. Each demo clip is transcribed with faster-whisper (WER, intelligibility) and scored for speaker similarity to the target voice. Both are published; see the speaker-similarity benchmark.
- INT8 quantization with measured tradeoffs. Most engines ship *_q8.onnx variants: 45–75% smaller, faster on CPU, with documented WER cost per engine.
- Documented conversion toolchain. A step-by-step guide covers export → parity → quantize → push → adapter for anyone adding a new engine.
Listen first, install later
demo/README.md — every engine converts the same sentence to two reference voices (Aria and Sonia). GitHub renders the audio players inline. Compare all 10 engines by ear, zero code required.
Install
pip install voiceclonnx
Core dependencies: onnxruntime, numpy, soundfile, huggingface_hub.
ONNX models are downloaded on first use from Hugging Face Hub.
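Because models are fetched on first use, an offline or CI machine may want to warm the cache ahead of time. The following is a minimal sketch, assuming the model repos named in the engine comparison table can be pulled with `huggingface_hub.snapshot_download` and that voiceclonnx reads from the standard Hub cache (this README does not confirm either point). `ENGINE_REPOS` and `prefetch` are hypothetical names, not part of the library.

```python
# Hypothetical helper: warm the Hugging Face cache before first use.
# Repo IDs are copied from the engine comparison table in this README.
ENGINE_REPOS = {
    "facodec": "TigreGotico/voiceclonnx-facodec",
    "openvoice": "TigreGotico/voiceclonnx-openvoice-v2",
    "knnvc": "TigreGotico/voiceclonnx-knn-vc",
}


def prefetch(engine: str) -> str:
    """Download every file of the engine's model repo and return the local path."""
    # Imported lazily so the mapping above is usable without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=ENGINE_REPOS[engine])

# Usage (needs network access): prefetch("facodec")
```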
For model conversion / export tooling, or to run the test suite:
pip install "voiceclonnx[convert]" # torch, onnx, transformers, librosa (export only)
pip install "voiceclonnx[test]" # pytest, faster-whisper, edge-tts (test suite)
Quick start
Python
from voiceclonnx import VoiceCloner
cloner = VoiceCloner(engine="facodec")
out = cloner.clone_voice("source.wav", "reference.wav", "out.wav")
print(cloner.sample_rate) # 16000
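For more than one file, the same call can be looped. This is a sketch rather than a library feature: `convert_folder` is a hypothetical helper, and it only assumes the `clone_voice(source, reference, out_path)` call shown above.

```python
from pathlib import Path


def convert_folder(cloner, src_dir, reference, out_dir):
    """Convert every .wav in src_dir to the same reference voice.

    `cloner` is any object exposing clone_voice(source, reference, out_path),
    such as the VoiceCloner from the quick start.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sort for a stable, reproducible processing order.
    for src in sorted(Path(src_dir).glob("*.wav")):
        target = out_dir / src.name
        cloner.clone_voice(str(src), str(reference), str(target))
        written.append(target)
    return written

# Usage (assumes voiceclonnx is installed):
#   from voiceclonnx import VoiceCloner
#   convert_folder(VoiceCloner(engine="facodec"), "inputs", "reference.wav", "outputs")
```

Building the cloner once and reusing it for every file avoids repeating whatever setup the constructor performs.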
CLI
# Convert a WAV file
voiceclonnx clone --engine facodec \
--audio source.wav \
--voice reference.wav \
--out converted.wav
# List all registered engines
voiceclonnx list
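The CLI can also be driven from Python when a separate process is preferable. A minimal sketch, using only the `clone` subcommand and flags shown above; `build_clone_command` and `run_clone` are hypothetical wrapper names.

```python
import subprocess


def build_clone_command(engine, audio, voice, out):
    """Return the argv list for `voiceclonnx clone` with the documented flags."""
    return [
        "voiceclonnx", "clone",
        "--engine", engine,
        "--audio", str(audio),
        "--voice", str(voice),
        "--out", str(out),
    ]


def run_clone(engine, audio, voice, out):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_clone_command(engine, audio, voice, out), check=True)
```

Passing the arguments as a list (rather than one shell string) keeps file names with spaces intact.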
Engine comparison
All engines are included in pip install voiceclonnx — no per-engine extras.
The ONNX models live in the voiceclonnx HF collection.
WER is measured with faster-whisper base.en against the source transcript
(lower is better; 0% = perfectly intelligible). Full data: demo/VERIFICATION.md.
| Engine | Family | Sample rate | WER | INT8 | Model | Best for |
|---|---|---|---|---|---|---|
| facodec | Factorized codec | 16 kHz | 0% | ✅ | TigreGotico/voiceclonnx-facodec | Best overall quality (0% WER + strong timbre) |
| openvoice | Tone-color transfer | 22 kHz | 0% | ✅ | TigreGotico/voiceclonnx-openvoice-v2 | Broadest style range, 0% WER |
| chatterbox | AR codec-LM | 24 kHz | 4–8% | ✅ (8% WER) | TigreGotico/voiceclonnx-chatterbox | Natural prosody; strongest source→target shift |
| triaan | Triple-AAN | 16 kHz | 4% | ✅ | TigreGotico/voiceclonnx-triaan-vc | Good quality, small footprint |
| cosyvoice | Flow-matching | 22 kHz | 8% | ⚠ int8 degrades | TigreGotico/voiceclonnx-cosyvoice | Cross-lingual conversion |
| bicodec | Semantic + global tokens | 16 kHz | 12% | ✅ | TigreGotico/voiceclonnx-bicodec | SparkTTS zero-shot VC |
| knnvc | kNN feature-swap | 16 kHz | 12–15% | ✅ | TigreGotico/voiceclonnx-knn-vc | Lightweight (123 MB int8), strong timbre |
| focalcodec | kNN feature-swap | 16 kHz | 15–19% | ⚠ int8 degrades | TigreGotico/voiceclonnx-focalcodec | Best timbre similarity (NeurIPS 2025) |
| lscodec | Speaker-decoupled codec | 24 kHz | ~35% | ✅ | TigreGotico/voiceclonnx-lscodec | Best timbre transfer; trades some WER (Interspeech 2025) |
| rvc | ContentVec + VITS | 40/48 kHz | 38%† | ✅ (base only) | TigreGotico/voiceclonnx-rvc | Any-to-ONE, community voices |

† rvc WER reflects a sample community model. Any-to-ONE semantics differ from all other engines; see Choosing an engine.

WER measures intelligibility, not voice similarity. Every engine is also scored for how closely its output matches the target speaker; see the speaker-similarity benchmark.
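For readers unfamiliar with the metric, WER is the word-level edit distance between the source transcript and the transcript of the converted audio, divided by the number of source words. The sketch below implements that textbook definition; it is not the project's scoring script, and any text normalization used in demo/VERIFICATION.md may differ. The function name `wer` is hypothetical.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, one wrong word in a 25-word reference gives 1/25 = 4%.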
Choosing an engine
Best all-rounders (0% WER + strong timbre): facodec, openvoice —
start here unless you have a specific constraint.
Best target-voice fidelity (speaker similarity): focalcodec, lscodec,
chatterbox, facodec, knnvc, openvoice — see the ranked
speaker-similarity benchmark. lscodec has the
strongest timbre transfer of the codec family but trades ~35% WER for it — pick
it when voice identity matters more than perfect transcription.
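This README does not say how speaker similarity is computed. One common approach, assumed here purely for illustration, is the cosine similarity between fixed-length speaker embeddings of the converted clip and the target reference. Extracting those embeddings requires a speaker-verification model and is outside this sketch; `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Values near 1 mean the two embeddings point in nearly the same direction; how a given score maps to perceived similarity depends on the embedding model.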
Highest output sample rate: rvc at up to 48 kHz (any-to-ONE); chatterbox
at 24 kHz for any-to-any.
Natural prosody / expressive style: chatterbox — AR codec-LM that
transfers speaking style along with voice timbre.
Smallest INT8 footprint: knnvc at ~123 MB.
Any-to-ONE voice models (RVC ecosystem): rvc uses a voice model rather than
a reference audio clip. reference_voice is a path to an .onnx RVC model
(local file or HF repo ID). Thousands of community-trained voices exist on HF.
# rvc: reference_voice = path to an RVC .onnx model, NOT an audio file
cloner = VoiceCloner(engine="rvc")
out = cloner.clone_voice("source.wav", "/path/to/myvoice.onnx", "out.wav")
Non-commercial only: bicodec weights are CC BY-NC-SA 4.0 — verify before
deploying commercially.
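The guidance in this section can be turned into a small filter. This is a sketch: `ENGINES` and `shortlist` are hypothetical names, the figures are copied from the engine comparison table (using the upper bound of each WER range), and rvc is omitted because it is any-to-ONE rather than reference-audio driven.

```python
# name: (sample rate in kHz, WER upper bound in %, flagged non-commercial)
# Only bicodec is flagged non-commercial in this README; check each upstream
# license before deploying.
ENGINES = {
    "facodec":    (16, 0, False),
    "openvoice":  (22, 0, False),
    "chatterbox": (24, 8, False),
    "triaan":     (16, 4, False),
    "cosyvoice":  (22, 8, False),
    "bicodec":    (16, 12, True),
    "knnvc":      (16, 15, False),
    "focalcodec": (16, 19, False),
    "lscodec":    (24, 35, False),  # listed as ~35% in the table
}


def shortlist(max_wer, min_khz=0, allow_non_commercial=True):
    """Return engine names meeting the constraints, lowest WER first."""
    hits = [
        (wer, name)
        for name, (khz, wer, non_commercial) in ENGINES.items()
        if wer <= max_wer
        and khz >= min_khz
        and (allow_non_commercial or not non_commercial)
    ]
    # Ties on WER are broken alphabetically by engine name.
    return [name for wer, name in sorted(hits)]
```

For example, `shortlist(5)` yields facodec, openvoice, and triaan, matching the all-rounder recommendation plus the next most intelligible engine.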
Quantized models
All engines except chatterbox support quantized=True, which loads *_q8.onnx
INT8 variants: 45–75% smaller on disk and faster on CPU at a measured quality cost.
cloner = VoiceCloner(engine="knnvc", quantized=True)
out = cloner.clone_voice("source.wav", "reference.wav", "out.wav")
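To check the size claim for a particular engine, compare the two model files on disk. A minimal sketch; `size_reduction_percent` is a hypothetical helper, and the paths you pass are wherever the fp32 and `*_q8.onnx` files were downloaded (this README does not specify the location).

```python
import os


def size_reduction_percent(fp32_path, int8_path):
    """Percentage by which the INT8 file is smaller than the fp32 file."""
    fp32_bytes = os.path.getsize(fp32_path)
    int8_bytes = os.path.getsize(int8_path)
    if fp32_bytes == 0:
        raise ValueError("fp32 file is empty")
    return 100.0 * (1.0 - int8_bytes / fp32_bytes)
```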
Some engines degrade significantly in INT8: focalcodec and cosyvoice should
be used in fp32 for production.
chatterbox INT8 matches fp32 quality (8% WER, 57% smaller) — we quantize
and host it at TigreGotico/voiceclonnx-chatterbox since upstream ships fp32 only.
See docs/QUANTS.md for the full WER and size comparison.
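The precision advice in this section can be encoded once so callers do not have to remember it. A sketch under the statements above: focalcodec and cosyvoice stay in fp32, and chatterbox is treated as not accepting `quantized=True`. `quantized_flag` is a hypothetical helper.

```python
# Engines this README advises running in fp32 because INT8 degrades them.
INT8_DEGRADES = {"focalcodec", "cosyvoice"}
# Engine this README lists as the exception to quantized=True support.
NO_QUANTIZED_FLAG = {"chatterbox"}


def quantized_flag(engine: str, prefer_small: bool = True) -> bool:
    """Return the value to pass as VoiceCloner(engine=..., quantized=...)."""
    if engine in INT8_DEGRADES or engine in NO_QUANTIZED_FLAG:
        return False
    return prefer_small

# Usage (assumes voiceclonnx is installed):
#   from voiceclonnx import VoiceCloner
#   cloner = VoiceCloner(engine="knnvc", quantized=quantized_flag("knnvc"))
```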
Adding an engine
- Subclass VoiceClonerBase from voiceclonnx.engines.base.
- Implement clone_voice(audio, reference_voice, out_path) -> str.
- Call register_engine(EngineEntry(alias=..., adapter_class=...)).
- Add the auto-import to voiceclonnx/__init__.py.
See docs/converting.md for the full export → parity → quantize → push → adapter workflow, and CONTRIBUTING.md for the contribution checklist.
Documentation
- demo/README.md — listen to every engine, no install
- docs/index.md — engine families, install matrix, navigation
- docs/QUANTS.md — fp32 vs INT8 WER and size comparison
- docs/api.md — VoiceCloner, VoiceClonerBase, registry
- docs/engines/ — per-engine guides (config, model, WER, CLI)
- docs/converting.md — ONNX export / parity / quantize / push toolchain
- examples/ — Python and shell examples
License
Apache 2.0 — see LICENSE.
Model weights are governed by their upstream licenses (MIT, Apache-2.0, CC BY 4.0, CC BY-NC-SA 4.0 for bicodec). See docs/converting.md for the weight-license policy (distributable vs local-only).