Pure-ONNX multi-engine voice-cloning library — no torch at runtime

voiceclonnx

Pure-ONNX voice conversion. 10 engines. Zero PyTorch at runtime.

Audio-to-audio only — voiceclonnx converts the voice in an existing speech file to sound like a reference speaker. Text-driven synthesis (text → cloned audio) is a TTS concern and is out of scope.


Why voiceclonnx

  • Zero PyTorch at runtime. Every engine runs on onnxruntime, numpy, soundfile, and huggingface_hub only. No torch, no CUDA driver required for inference.
  • One install, every engine. pip install voiceclonnx activates all 10 engines immediately — no per-engine extras, no optional groups for inference.
  • 10 distinct architectures, one API. kNN feature-swap, factorized codec, flow-matching, tone-color, AR codec-LM, speaker-decoupled codec, and any-to-ONE — every engine measurably transfers the target voice, not just the words.
  • STT- and speaker-verified. Each demo clip is transcribed with faster-whisper (WER, intelligibility) and scored for speaker similarity to the target voice. Both are published — see the speaker-similarity benchmark.
  • INT8 quantization with measured tradeoffs. Most engines ship *_q8.onnx variants: 45–75% smaller, faster on CPU, with documented WER cost per engine.
  • Documented conversion toolchain. A step-by-step guide covers export → parity → quantize → push → adapter for anyone adding a new engine.

Listen first, install later

demo/README.md — every engine converts the same sentence to two reference voices (Aria and Sonia). GitHub renders the audio players inline. Compare all 10 engines by ear, zero code required.


Install

pip install voiceclonnx

Core dependencies: onnxruntime, numpy, soundfile, huggingface_hub. ONNX models are downloaded on first use from Hugging Face Hub.
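Because models are fetched on first use, it can help to warm the cache ahead of time, for example before going offline. The sketch below is one way to do that with `huggingface_hub.snapshot_download`. The `ENGINE_REPOS` mapping and the `prefetch` helper are illustrative names, not part of voiceclonnx; the repo IDs are copied from the Model column of the engine comparison table, and the sketch assumes voiceclonnx reads from the default Hugging Face cache.

```python
# Repo IDs copied from the Model column of the engine comparison table.
ENGINE_REPOS = {
    "facodec": "TigreGotico/voiceclonnx-facodec",
    "openvoice": "TigreGotico/voiceclonnx-openvoice-v2",
    "chatterbox": "TigreGotico/voiceclonnx-chatterbox",
    "triaan": "TigreGotico/voiceclonnx-triaan-vc",
    "cosyvoice": "TigreGotico/voiceclonnx-cosyvoice",
    "bicodec": "TigreGotico/voiceclonnx-bicodec",
    "knnvc": "TigreGotico/voiceclonnx-knn-vc",
    "focalcodec": "TigreGotico/voiceclonnx-focalcodec",
    "lscodec": "TigreGotico/voiceclonnx-lscodec",
    "rvc": "TigreGotico/voiceclonnx-rvc",
}


def prefetch(engine: str) -> str:
    """Download an engine's model repo and return the local snapshot path.

    Hypothetical helper: assumes the library later finds the files in the
    default Hugging Face cache.
    """
    # Imported lazily so the mapping above is usable without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=ENGINE_REPOS[engine])
```

Calling `prefetch("facodec")` once on a connected machine should leave the files cached for later runs, provided the cache-location assumption holds.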

Optional extras for model conversion / export tooling and for the test suite:

pip install "voiceclonnx[convert]"   # torch, onnx, transformers, librosa (export only)
pip install "voiceclonnx[test]"      # pytest, faster-whisper, edge-tts (test suite)

Quick start

Python

from voiceclonnx import VoiceCloner

cloner = VoiceCloner(engine="facodec")
out = cloner.clone_voice("source.wav", "reference.wav", "out.wav")
print(cloner.sample_rate)   # 16000
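
To convert several clips to several reference voices, the single call above can be wrapped in a loop. The following is a minimal sketch that relies only on the `clone_voice(source, reference, out_path)` call shown above; `convert_batch` and its output naming scheme are assumptions made for illustration, not library features.

```python
from pathlib import Path


def convert_batch(cloner, sources, references, out_dir):
    """Convert every source clip to every reference voice.

    `cloner` is any object exposing clone_voice(source, reference, out_path),
    for example VoiceCloner(engine="facodec").
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for src in sources:
        for ref in references:
            # Name each output after its source and reference file stems.
            out_path = out_dir / f"{Path(src).stem}__{Path(ref).stem}.wav"
            cloner.clone_voice(str(src), str(ref), str(out_path))
            outputs.append(str(out_path))
    return outputs
```

Passing the cloner in as an argument means the model is loaded once and reused across all conversions.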

CLI

# Convert a WAV file
voiceclonnx clone --engine facodec \
             --audio source.wav \
             --voice reference.wav \
             --out converted.wav

# List all registered engines
voiceclonnx list
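
The CLI can also be driven from a script. This sketch builds the same `clone` invocation shown above and runs it with `subprocess`; the helper names are hypothetical, and only the flags that appear in the example (`--engine`, `--audio`, `--voice`, `--out`) are used.

```python
import subprocess


def build_clone_command(engine, audio, voice, out):
    """Return the argv list for the documented `voiceclonnx clone` call."""
    return [
        "voiceclonnx", "clone",
        "--engine", engine,
        "--audio", str(audio),
        "--voice", str(voice),
        "--out", str(out),
    ]


def run_clone(engine, audio, voice, out):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_clone_command(engine, audio, voice, out), check=True)
    return str(out)
```

Passing an argument list rather than a shell string avoids quoting problems with paths that contain spaces.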

Engine comparison

All engines are included in pip install voiceclonnx — no per-engine extras. The ONNX models live in the voiceclonnx HF collection. WER is measured with faster-whisper base.en against the source transcript (lower is better; 0% = perfectly intelligible). Full data: demo/VERIFICATION.md.

| Engine | Family | Sample rate | WER | INT8 | Model | Best for |
|---|---|---|---|---|---|---|
| facodec | Factorized codec | 16 kHz | 0% | | TigreGotico/voiceclonnx-facodec | Best overall quality (0% WER + strong timbre) |
| openvoice | Tone-color transfer | 22 kHz | 0% | | TigreGotico/voiceclonnx-openvoice-v2 | Broadest style range, 0% WER |
| chatterbox | AR codec-LM | 24 kHz | 4–8% | ✅ (8% WER) | TigreGotico/voiceclonnx-chatterbox | Natural prosody; strongest source→target shift |
| triaan | Triple-AAN | 16 kHz | 4% | | TigreGotico/voiceclonnx-triaan-vc | Good quality, small footprint |
| cosyvoice | Flow-matching | 22 kHz | 8% | ⚠ int8 degrades | TigreGotico/voiceclonnx-cosyvoice | Cross-lingual conversion |
| bicodec | Semantic + global tokens | 16 kHz | 12% | | TigreGotico/voiceclonnx-bicodec | SparkTTS zero-shot VC |
| knnvc | kNN feature-swap | 16 kHz | 12–15% | | TigreGotico/voiceclonnx-knn-vc | Lightweight (123 MB int8), strong timbre |
| focalcodec | kNN feature-swap | 16 kHz | 15–19% | ⚠ int8 degrades | TigreGotico/voiceclonnx-focalcodec | Best timbre similarity (NeurIPS 2025) |
| lscodec | Speaker-decoupled codec | 24 kHz | ~35% | | TigreGotico/voiceclonnx-lscodec | Best timbre transfer; trades some WER (Interspeech 2025) |
| rvc | ContentVec + VITS | 40/48 kHz | 38%† | ✅ (base only) | TigreGotico/voiceclonnx-rvc | Any-to-ONE, community voices |

† The rvc WER reflects a sample community model. Any-to-ONE semantics differ from all other engines; see Choosing an engine.

WER measures intelligibility, not voice similarity. Every engine is also scored for how closely its output matches the target speaker — see the speaker-similarity benchmark.
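
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the recognized transcript, divided by the number of reference words. The sketch below computes it with a standard dynamic-programming Levenshtein distance. The lowercase-and-split normalization is an assumption; the project's own scoring may normalize text differently.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

As an illustration of scale, one wrong word in a 25-word sentence gives a WER of 4%.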


Choosing an engine

Best all-rounders (0% WER + strong timbre): facodec, openvoice — start here unless you have a specific constraint.

Best target-voice fidelity (speaker similarity): focalcodec, lscodec, chatterbox, facodec, knnvc, openvoice — see the ranked speaker-similarity benchmark. lscodec has the strongest timbre transfer of the codec family but trades ~35% WER for it — pick it when voice identity matters more than perfect transcription.

Highest output sample rate: rvc at up to 48 kHz (any-to-ONE); chatterbox and lscodec at 24 kHz for any-to-any.

Natural prosody / expressive style: chatterbox — AR codec-LM that transfers speaking style along with voice timbre.

Smallest INT8 footprint: knnvc at ~123 MB.

Any-to-ONE voice models (RVC ecosystem): rvc uses a voice model rather than a reference audio clip. reference_voice is a path to an .onnx RVC model (local file or HF repo ID). Thousands of community-trained voices exist on HF.

# rvc: reference_voice = path to an RVC .onnx model, NOT an audio file
cloner = VoiceCloner(engine="rvc")
out = cloner.clone_voice("source.wav", "/path/to/myvoice.onnx", "out.wav")

Non-commercial only: bicodec weights are CC BY-NC-SA 4.0 — verify before deploying commercially.
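
The guidance in this section can be turned into a small filter. The sketch below hard-codes the sample rate and the upper end of each WER range from the engine comparison table, plus the bicodec license restriction, and returns a shortlist. The `ENGINES` table and `shortlist` function are illustrative only and are not provided by voiceclonnx.

```python
# engine: (sample rate in kHz, worst-case WER in %, non-commercial weights)
# Values are copied from the engine comparison table; rvc uses its 48 kHz
# maximum and lscodec its approximate 35% figure.
ENGINES = {
    "facodec": (16, 0, False),
    "openvoice": (22, 0, False),
    "chatterbox": (24, 8, False),
    "triaan": (16, 4, False),
    "cosyvoice": (22, 8, False),
    "bicodec": (16, 12, True),
    "knnvc": (16, 15, False),
    "focalcodec": (16, 19, False),
    "lscodec": (24, 35, False),
    "rvc": (48, 38, False),
}


def shortlist(max_wer=100, min_khz=0, commercial=False):
    """Return engines meeting the constraints, lowest worst-case WER first."""
    names = [
        name
        for name, (khz, wer, non_commercial) in ENGINES.items()
        if wer <= max_wer and khz >= min_khz and not (commercial and non_commercial)
    ]
    return sorted(names, key=lambda name: ENGINES[name][1])
```

WER alone does not capture voice similarity, and rvc expects a voice model rather than a reference clip, so a shortlist like this is a starting point rather than a final choice.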


Quantized models

All engines except chatterbox support quantized=True, which loads *_q8.onnx INT8 variants: 45–75% smaller on disk and faster on CPU at a measured quality cost.

cloner = VoiceCloner(engine="knnvc", quantized=True)
out = cloner.clone_voice("source.wav", "reference.wav", "out.wav")

Some engines degrade significantly in INT8: focalcodec and cosyvoice should be used in fp32 for production.

chatterbox INT8 matches fp32 quality (8% WER, 57% smaller) — we quantize and host it at TigreGotico/voiceclonnx-chatterbox since upstream ships fp32 only.

See docs/QUANTS.md for the full WER and size comparison.
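
One way to apply this guidance in code is to decide per engine whether to request INT8. The sketch below requests `quantized=True` only for engines that are not flagged above. The set names and the `cloner_kwargs` helper are hypothetical, and leaving chatterbox out of the `quantized=True` path is an assumption based on the statement that the flag covers all engines except chatterbox.

```python
# Engines this section recommends running in fp32 for production.
FP32_RECOMMENDED = {"focalcodec", "cosyvoice"}
# Assumption: chatterbox is excluded from the quantized=True flag.
NO_QUANTIZED_FLAG = {"chatterbox"}


def cloner_kwargs(engine, prefer_small=True):
    """Build keyword arguments for VoiceCloner, adding INT8 only when safe."""
    kwargs = {"engine": engine}
    if prefer_small and engine not in FP32_RECOMMENDED | NO_QUANTIZED_FLAG:
        kwargs["quantized"] = True
    return kwargs
```

The result can be unpacked into the constructor, for example `VoiceCloner(**cloner_kwargs("knnvc"))`.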


Adding an engine

  1. Subclass VoiceClonerBase from voiceclonnx.engines.base.
  2. Implement clone_voice(audio, reference_voice, out_path) -> str.
  3. Call register_engine(EngineEntry(alias=..., adapter_class=...)).
  4. Add the auto-import to voiceclonnx/__init__.py.
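
The steps above can be sketched with a toy engine that copies the source audio unchanged. Everything here beyond `VoiceClonerBase`, `clone_voice`, `register_engine`, and `EngineEntry` is hypothetical: the `PassthroughCloner` class, the `sample_rate` class attribute, and the fallback base class that lets the sketch run without voiceclonnx installed. The real base class may require additional methods or constructor arguments.

```python
import shutil

try:
    from voiceclonnx.engines.base import VoiceClonerBase
except ImportError:
    # Stand-in so the sketch runs without voiceclonnx installed.
    class VoiceClonerBase:
        pass


class PassthroughCloner(VoiceClonerBase):
    """Hypothetical engine: writes the source audio to out_path unchanged."""

    # Assumed attribute, mirroring cloner.sample_rate in the quick start.
    sample_rate = 16000

    def clone_voice(self, audio, reference_voice, out_path):
        # A real engine would run its ONNX sessions here; this one ignores
        # reference_voice and simply copies the input file.
        shutil.copyfile(audio, out_path)
        return str(out_path)


# Step 3 is left as a comment because the import path of register_engine
# and EngineEntry is not shown in this document:
# register_engine(EngineEntry(alias="passthrough", adapter_class=PassthroughCloner))
```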

See docs/converting.md for the full export → parity → quantize → push → adapter workflow, and CONTRIBUTING.md for the contribution checklist.


License

Apache 2.0 — see LICENSE.

Model weights are governed by their upstream licenses (MIT, Apache-2.0, CC BY 4.0, CC BY-NC-SA 4.0 for bicodec). See docs/converting.md for the weight-license policy (distributable vs local-only).
