VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

👋 Join our community for discussion and support!
Feishu | Discord

VoxCPM is a tokenizer-free Text-to-Speech system that directly generates continuous speech representations via an end-to-end diffusion autoregressive architecture, bypassing discrete tokenization to achieve highly natural and expressive synthesis.

VoxCPM2 is the latest major release: a 2B-parameter model built on a MiniCPM-4 backbone and trained on over 2 million hours of multilingual speech data. It supports 30 languages, Voice Design, Controllable Voice Cloning, and 48kHz studio-quality audio output.

✨ Highlights

  • 🌍 30-Language Multilingual: Input text in any of the 30 supported languages and synthesize directly; no language tag is needed
  • 🎨 Voice Design: Create a brand-new voice from a natural-language description alone (gender, age, tone, emotion, pace, etc.); no reference audio is required
  • 🎛️ Controllable Cloning: Clone any voice from a short reference clip, with optional style guidance to steer emotion, pace, and expression while preserving the original timbre
  • 🎙️ Ultimate Cloning: Provide both the reference audio and its transcript, and the model continues seamlessly from the reference, preserving every vocal detail: timbre, rhythm, emotion, and style (same as VoxCPM1.5)
  • 🔊 48kHz High-Quality Audio: Accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio via AudioVAE V2's asymmetric encode/decode design with built-in super-resolution; no external upsampler is needed
  • 🧠 Context-Aware Synthesis: Automatically infers appropriate prosody and expressiveness from the text content
  • ⚡ Real-Time Streaming: RTF as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated by Nano-vLLM or by vLLM-Omni (the official vLLM omni-modal serving for VoxCPM2, with PagedAttention and an OpenAI-compatible API)
  • 📜 Fully Open-Source & Commercial-Ready: Weights and code are released under the Apache-2.0 license, free for commercial use

🌍 Supported Languages (30)

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, Hokkien (Southern Min)

News

  • [2026.04] 🔥 We release VoxCPM2: 2B parameters, 30 languages, Voice Design & Controllable Voice Cloning, 48kHz audio output! Weights | Docs | Playground
  • [2025.12] 🎉 Open-source VoxCPM1.5 weights with SFT & LoRA fine-tuning. (🏆 #1 GitHub Trending)
  • [2025.09] 🔥 Release VoxCPM Technical Report.
  • [2025.09] 🎉 Open-source VoxCPM-0.5B weights (🏆 #1 HuggingFace Trending)


🚀 Quick Start

Installation

pip install voxcpm

Requirements: Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0. See Quick Start Docs for details.
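Before installing, it can help to confirm that the interpreter falls inside the stated Python range. The following is a minimal sketch using only the standard library; the function name `python_is_supported` is a hypothetical helper, not part of the `voxcpm` package, and it does not check the PyTorch or CUDA requirements.

```python
import sys

# Supported interpreter range quoted in the requirements line above.
MIN_PYTHON = (3, 10)  # inclusive lower bound
MAX_PYTHON = (3, 13)  # exclusive upper bound


def python_is_supported(version=None):
    """Return True if (major, minor) lies in [3.10, 3.13)."""
    # Fall back to the running interpreter when no version is passed in.
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON


if __name__ == "__main__":
    status = "supported" if python_is_supported() else "outside the supported range"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```

Passing an explicit tuple, for example `python_is_supported((3, 13))`, makes it easy to check a target environment other than the current one.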

Python API

🗣️ Text-to-Speech

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained(
  "openbmb/VoxCPM2",
  load_denoiser=False,
)

wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)
print("saved: demo.wav")

If you prefer downloading from ModelScope first, install it with pip install modelscope, then use:

from modelscope import snapshot_download
from voxcpm import VoxCPM
import soundfile as sf

# Download the weights into a local directory
snapshot_download("OpenBMB/VoxCPM2", local_dir="./pretrained_models/VoxCPM2")

# Load the model from the downloaded directory
model = VoxCPM.from_pretrained("./pretrained_models/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)

🎨 Voice Design

Create a voice from a natural-language description, with no reference audio needed. Put the description in parentheses at the start of text, e.g. "(your voice description)The text to synthesize.":

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
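The "(description)text" format above is easy to get wrong when prompts are assembled programmatically. Here is a small sketch of a string helper; `with_voice_description` is a hypothetical name, and rejecting parentheses inside the description is an assumption made for safety rather than a documented rule of the model.

```python
def with_voice_description(description: str, text: str) -> str:
    """Prefix `text` with a parenthesized voice description."""
    description = description.strip()
    if not description:
        return text  # nothing to prepend: plain text-to-speech
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could make the prefix ambiguous.
        raise ValueError("description must not contain parentheses")
    return f"({description}){text}"


prompt = with_voice_description(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(prompt)
```

The resulting string can be passed as the `text` argument of `model.generate(...)` exactly as in the example above.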

🎛️ Controllable Voice Cloning

Provide a reference audio clip. The model clones its timbre, and you can still add control instructions in parentheses at the start of text to adjust speed, emotion, or style.

wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="path/to/voice.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)

🎙️ Ultimate Cloning

Provide both the reference audio and its exact transcript for audio-continuation-based cloning that reproduces every vocal nuance. For maximum cloning similarity, pass the same reference clip to both reference_wav_path and prompt_wav_path, as shown below:

wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="path/to/voice.wav",  # optional, for better similarity
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
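The cloning modes differ only in which keyword arguments are passed to `model.generate`. The sketch below collects them in one place; `build_clone_kwargs` is a hypothetical helper, only the three keyword names come from the examples above, and the rule that `prompt_wav_path` and `prompt_text` must be supplied together is an assumption based on how the examples always pair them.

```python
def build_clone_kwargs(reference_wav=None, prompt_wav=None, prompt_text=None):
    """Assemble keyword arguments for model.generate() for the cloning modes above."""
    # Assumption: continuation-style cloning needs the audio and its transcript together.
    if (prompt_wav is None) != (prompt_text is None):
        raise ValueError("prompt_wav_path and prompt_text must be given together")
    kwargs = {}
    if reference_wav is not None:
        kwargs["reference_wav_path"] = reference_wav  # timbre reference
    if prompt_wav is not None:
        kwargs["prompt_wav_path"] = prompt_wav        # audio to continue from
        kwargs["prompt_text"] = prompt_text           # its exact transcript
    return kwargs


# Controllable cloning: reference audio only.
print(build_clone_kwargs(reference_wav="path/to/voice.wav"))

# Ultimate cloning: the same clip as both prompt and reference.
print(build_clone_kwargs(
    reference_wav="path/to/voice.wav",
    prompt_wav="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
))
```

The returned dictionary would then be unpacked into the call, for example `model.generate(text="...", **kwargs)`.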

🔄 Streaming API

import numpy as np

chunks = []
for chunk in model.generate_streaming(
    text="Streaming text to speech is easy with VoxCPM!",
):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)

CLI Usage

# Voice design (no reference audio needed)
voxcpm design \
  --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
  --output out.wav

# Voice design with a style description (--control)
voxcpm design \
  --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
  --control "Young female voice, warm and gentle, slightly smiling" \
  --output out.wav

# Voice cloning (reference audio)
voxcpm clone \
  --text "This is a voice cloning demo." \
  --reference-audio path/to/voice.wav \
  --output out.wav

# Ultimate cloning (prompt audio + transcript)
# --reference-audio is optional; passing it can improve similarity
voxcpm clone \
  --text "This is a voice cloning demo." \
  --prompt-audio path/to/voice.wav \
  --prompt-text "reference transcript" \
  --reference-audio path/to/voice.wav \
  --output out.wav

# Batch processing
voxcpm batch --input examples/input.txt --output-dir outs

# Help
voxcpm --help

Web Demo

python app.py --port 8808  # then open in browser: http://localhost:8808

🚢 Production Deployment (Nano-vLLM)

For high-throughput serving, use Nano-vLLM-VoxCPM, a dedicated inference engine built on Nano-vLLM with concurrent request support and an async API.

pip install nano-vllm-voxcpm

from nanovllm_voxcpm import VoxCPM
import numpy as np, soundfile as sf

server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])
chunks = list(server.generate(target_text="Hello from VoxCPM!"))
sf.write("out.wav", np.concatenate(chunks), 48000)
server.stop()

RTF as low as ~0.13 on NVIDIA RTX 4090 (vs ~0.3 with the standard PyTorch implementation), with support for batched concurrent requests and a FastAPI HTTP server. See the Nano-vLLM-VoxCPM repo for deployment details.
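RTF (real-time factor) is conventionally the synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. A minimal sketch of that calculation follows; the helper name and the timing numbers are illustrative, not measurements.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Synthesis time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("generated audio must have a positive duration")
    return synthesis_seconds / audio_seconds


# Illustrative numbers: 10 s of 48 kHz audio generated in 1.3 s of wall-clock time.
rtf = real_time_factor(synthesis_seconds=1.3, num_samples=480_000, sample_rate=48_000)
print(f"RTF = {rtf:.2f}")
```

To measure it on your own hardware, wrap the generation call with `time.perf_counter()` and pass the length of the returned waveform together with its sample rate.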

🏭 Production Serving (vLLM-Omni)

For production multi-tenant deployments, use vLLM-Omni, the official vLLM project's omni-modal extension with native VoxCPM2 support. It provides a PagedAttention KV cache, continuous batching, and a drop-in OpenAI-compatible /v1/audio/speech endpoint.

# Install from source (latest main; vllm-omni is rapidly evolving)
uv pip install vllm==0.19.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git && cd vllm-omni
uv pip install -e .

See the vLLM-Omni installation guide for other platforms (ROCm, XPU, MUSA, NPU) and Docker images.

# Launch an OpenAI-compatible TTS server (--omni enables omni-modal serving)
vllm serve openbmb/VoxCPM2 --omni --port 8000

# Call it from any OpenAI client
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"openbmb/VoxCPM2","input":"Hello from VoxCPM2 on vLLM-Omni!","voice":"default"}' \
  --output out.wav
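The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names, the timeout, and the assumption that the response body is the audio file itself are illustrative and not taken from the vLLM-Omni documentation.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="openbmb/VoxCPM2", voice="default"):
    """Build a POST request with the same JSON body as the curl example."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, path="out.wav", timeout=120):
    """Send the request and save the response body (requires a running server)."""
    with urllib.request.urlopen(build_speech_request(text), timeout=timeout) as resp:
        audio_bytes = resp.read()
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return path


# Building the request does not touch the network, so it can be inspected offline.
req = build_speech_request("Hello from VoxCPM2 on vLLM-Omni!")
print(req.get_method(), req.full_url)
```

Calling `synthesize_to_file("Hello from VoxCPM2 on vLLM-Omni!")` would perform the actual request once the server from the previous step is running.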

Built on the upstream vLLM scheduler, with batched concurrent requests, streaming chunk delivery, and multi-GPU deployment out of the box. See the VoxCPM2 example for full deployment recipes.

Full parameter reference, multi-scenario examples, and voice cloning tips → Quick Start Guide | Usage Guide | Cookbook


📦 Models & Versions

| | VoxCPM2 | VoxCPM1.5 | VoxCPM-0.5B |
|---|---|---|---|
| Status | 🟢 Latest | Stable | Legacy |
| Backbone Parameters | 2B | 0.6B | 0.5B |
| Audio Sample Rate | 48kHz | 44.1kHz | 16kHz |
| LM Token Rate | 6.25Hz | 6.25Hz | 12.5Hz |
| Languages | 30 | 2 (zh, en) | 2 (zh, en) |
| Cloning Mode | Isolated Reference & Continuation | Continuation only | Continuation only |
| Voice Design | ✅ | - | - |
| Controllable Voice Cloning | ✅ | - | - |
| SFT / LoRA | ✅ | ✅ | ✅ |
| RTF (RTX 4090) | ~0.30 | ~0.15 | ~0.17 |
| RTF in Nano-vLLM (RTX 4090) | ~0.13 | ~0.08 | ~0.10 |
| VRAM | ~8 GB | ~6 GB | ~5 GB |
| Weights | 🤗 HF / MS | 🤗 HF / MS | 🤗 HF / MS |
| Technical Report | Coming soon | - | arXiv, ICLR 2026 |
| Demo Page | Audio Samples | - | Audio Samples |
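To put the sample-rate and token-rate rows in perspective, the arithmetic below derives how many output samples correspond to one language-model token. It is only a worked calculation on the figures in the table, under the assumption that each LM token covers a fixed span of output audio; it says nothing about the internal patching of the model.

```python
# Figures copied from the comparison table above.
MODELS = {
    "VoxCPM2": {"sample_rate_hz": 48_000, "token_rate_hz": 6.25},
    "VoxCPM1.5": {"sample_rate_hz": 44_100, "token_rate_hz": 6.25},
    "VoxCPM-0.5B": {"sample_rate_hz": 16_000, "token_rate_hz": 12.5},
}


def samples_per_token(sample_rate_hz, token_rate_hz):
    """Output audio samples covered by one LM token."""
    return sample_rate_hz / token_rate_hz


def tokens_for(seconds, token_rate_hz):
    """LM tokens needed for `seconds` of audio."""
    return seconds * token_rate_hz


for name, spec in MODELS.items():
    per_token = samples_per_token(**spec)
    ten_seconds = tokens_for(10, spec["token_rate_hz"])
    print(f"{name}: {per_token:.0f} samples per token, {ten_seconds:.1f} tokens for 10 s")
```

Under that assumption, a lower token rate means fewer autoregressive steps for the same audio duration, with each step accounting for more output samples.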

VoxCPM2 is built on a tokenizer-free, diffusion autoregressive paradigm. The model operates entirely in the latent space of AudioVAE V2 and follows a four-stage pipeline, LocEnc → TSLM → RALM → LocDiT, which enables rich expressiveness and native 48kHz audio output.

[Figure: VoxCPM2 Model Architecture]

For full architectural details, VoxCPM2-specific upgrades, and a model comparison table, see the Architecture Design.


📊 Performance

VoxCPM2 achieves state-of-the-art or comparable results on public zero-shot and controllable TTS benchmarks.

Seed-TTS-eval

Seed-TTS-eval WER (↓) and SIM (↑) Results

| Model | Parameters | Open-Source | test-EN WER/% ↓ | test-EN SIM/% ↑ | test-ZH CER/% ↓ | test-ZH SIM/% ↑ | test-Hard CER/% ↓ | test-Hard SIM/% ↑ |
|---|---|---|---|---|---|---|---|---|
| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 | - | - |
| CosyVoice3 | 0.5B | ❌ | 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
| CosyVoice3 | 1.5B | ❌ | 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
| Seed-TTS | - | ❌ | 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
| MiniMax-Speech | - | ❌ | 1.65 | 69.2 | 0.83 | 78.3 | - | - |
| F5-TTS | 0.3B | ✅ | 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |
| MaskGCT | 1B | ✅ | 2.62 | 71.7 | 2.27 | 77.4 | - | - |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 | 6.83 | 72.4 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66.0 | - | - |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |
| FireRedTTS-2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 | - | - |
| Qwen2.5-Omni | 7B | ✅ | 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | 74.7 |
| Qwen3-Omni | 30B-A3B | ✅ | 1.39 | - | 1.07 | - | - | - |
| OpenAudio-s1-mini | 0.5B | ✅ | 1.94 | 55.0 | 1.18 | 68.5 | 23.37 | 64.3 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 | 7.12 | 75.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 | - | - |
| HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.50 | 74.0 | 55.07 | 65.6 |
| VoxCPM-0.5B | 0.6B | ✅ | 1.85 | 72.9 | 0.93 | 77.2 | 8.87 | 73.0 |
| VoxCPM1.5 | 0.8B | ✅ | 2.12 | 71.4 | 1.18 | 77.0 | 7.74 | 73.1 |
| MOSS-TTS | | ✅ | 1.85 | 73.4 | 1.20 | 78.8 | - | - |
| Qwen3-TTS | 1.7B | ✅ | 1.23 | 71.7 | 1.22 | 77.0 | 6.76 | 74.8 |
| FishAudio S2 | 4B | ✅ | 0.99 | - | 0.54 | - | 5.99 | - |
| LongCat-Audio-DiT | 3.5B | ✅ | 1.50 | 78.6 | 1.09 | 81.8 | 6.04 | 79.7 |
| VoxCPM2 | 2B | ✅ | 1.84 | 75.3 | 0.97 | 79.5 | 8.13 | 75.3 |
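For readers unfamiliar with the metrics, WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the ASR transcript into the reference, divided by the reference length in words or characters. The sketch below shows the textbook definition only; the text normalization and ASR models used by the benchmarks above are not described here, so this should not be read as the evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (used for languages such as Chinese)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


print(f"WER: {wer('the cat sat on the mat', 'the cat sat on mat'):.2%}")
print(f"CER: {cer('今天天气很好', '今天天汽很好'):.2%}")
```

Both toy examples contain one error in six units, so each ratio is one sixth.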

CV3-eval

CV3-eval Multilingual WER/CER (↓) Results

| Model | zh | en | hard-zh | hard-en | ja | ko | de | es | fr | it | ru |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CosyVoice2 | 4.08 | 6.32 | 12.58 | 11.96 | 9.13 | 19.7 | - | - | - | - | - |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 10.55 | 7.57 | 5.69 | 6.43 | 4.47 | 11.8 | 10.5 | 6.64 |
| Fish Audio S2 | 2.65 | 2.43 | 9.10 | 4.40 | 3.96 | 2.76 | 2.22 | 2.00 | 6.26 | 2.04 | 2.78 |
| VoxCPM2 | 3.65 | 5.00 | 8.55 | 8.48 | 5.96 | 5.69 | 4.77 | 3.80 | 9.85 | 4.25 | 5.21 |

MiniMax-Multilingual-Test

Minimax-MLS-test WER (↓) Results

| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 1.665 | 1.666 | - | 3.500 | 13.046 |
| Cantonese | 34.111 | 51.513 | - | 30.670 | 38.584 |
| Chinese | 2.252 | 16.026 | 0.928 | 0.730 | 1.136 |
| Czech | 3.875 | 2.108 | - | 2.840 | 24.132 |
| Dutch | 1.143 | 0.803 | - | 0.990 | 0.913 |
| English | 2.164 | 2.339 | 0.934 | 1.620 | 2.289 |
| Finnish | 4.666 | 2.964 | - | 3.330 | 2.632 |
| French | 4.099 | 5.216 | 2.858 | 3.050 | 4.534 |
| German | 1.906 | 0.572 | 1.235 | 0.550 | 0.679 |
| Greek | 2.016 | 0.991 | - | 5.740 | 2.844 |
| Hindi | 6.962 | 5.827 | - | 14.640 | 19.699 |
| Indonesian | 1.237 | 1.059 | - | 1.460 | 1.084 |
| Italian | 1.543 | 1.743 | 0.948 | 1.270 | 1.563 |
| Japanese | 3.519 | 10.646 | 3.823 | 2.760 | 4.628 |
| Korean | 1.747 | 1.865 | 1.755 | 1.180 | 1.962 |
| Polish | 1.415 | 0.766 | - | 1.260 | 1.141 |
| Portuguese | 1.877 | 1.331 | 1.526 | 1.140 | 1.938 |
| Romanian | 2.878 | 1.347 | - | 10.740 | 21.577 |
| Russian | 4.281 | 3.878 | 3.212 | 2.400 | 3.634 |
| Spanish | 1.029 | 1.084 | 1.126 | 0.910 | 1.438 |
| Thai | 2.701 | 73.936 | - | 4.230 | 2.961 |
| Turkish | 1.52 | 0.699 | - | 0.870 | 0.817 |
| Ukrainian | 1.082 | 0.997 | - | 2.300 | 6.316 |
| Vietnamese | 0.88 | 73.415 | - | 7.410 | 3.307 |

Minimax-MLS-test SIM (↑) Results

| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 73.6 | 70.6 | - | 75.0 | 79.1 |
| Cantonese | 77.8 | 67.0 | - | 80.5 | 83.5 |
| Chinese | 78.0 | 67.7 | 79.9 | 81.6 | 82.5 |
| Czech | 79.6 | 68.5 | - | 79.8 | 78.3 |
| Dutch | 73.8 | 68.0 | - | 73.0 | 80.8 |
| English | 75.6 | 61.3 | 77.5 | 79.7 | 85.4 |
| Finnish | 83.5 | 75.9 | - | 81.9 | 89.0 |
| French | 62.8 | 53.5 | 62.8 | 69.8 | 73.5 |
| German | 73.3 | 61.4 | 77.5 | 76.7 | 80.3 |
| Greek | 82.6 | 73.3 | - | 79.5 | 86.0 |
| Hindi | 81.8 | 73.0 | - | 82.1 | 85.6 |
| Indonesian | 72.9 | 66.0 | - | 76.3 | 80.0 |
| Italian | 69.9 | 57.9 | 81.7 | 74.7 | 78.0 |
| Japanese | 77.6 | 73.8 | 78.8 | 79.6 | 82.8 |
| Korean | 77.6 | 70.0 | 79.9 | 81.7 | 83.3 |
| Polish | 80.2 | 72.9 | - | 81.9 | 88.4 |
| Portuguese | 80.5 | 71.1 | 81.7 | 78.1 | 83.7 |
| Romanian | 80.9 | 69.9 | - | 73.3 | 79.7 |
| Russian | 76.1 | 67.6 | 79.2 | 79.0 | 81.1 |
| Spanish | 76.2 | 61.5 | 81.4 | 77.6 | 83.1 |
| Thai | 80.0 | 58.8 | - | 78.6 | 84.0 |
| Turkish | 77.9 | 59.6 | - | 83.5 | 87.1 |
| Ukrainian | 73.0 | 64.7 | - | 74.7 | 79.8 |
| Vietnamese | 74.3 | 36.9 | - | 74.0 | 80.6 |

Internal 30-Language ASR Benchmark

We additionally run an internal multilingual intelligibility benchmark with 30 languages × 500 samples. ASR transcription is evaluated via the Gemini 3.1 Flash Lite API.

Internal 30-Language ASR Benchmark

| Language | Metric | VoxCPM2 | Fish S2-Pro |
|---|---|---|---|
| ar (Arabic) | CER | 1.23% | 0.30% |
| da (Danish) | WER | 2.70% | 3.52% |
| de (German) | WER | 0.96% | 0.64% |
| el (Greek) | WER | 3.17% | 4.61% |
| en (English) | WER | 0.42% | 1.03% |
| es (Spanish) | WER | 1.33% | 0.64% |
| fi (Finnish) | WER | 2.24% | 2.80% |
| fr (French) | WER | 2.16% | 2.34% |
| he (Hebrew) | CER | 2.98% | 15.27% |
| hi (Hindi) | CER | 0.79% | 0.91% |
| id (Indonesian) | WER | 1.36% | 1.68% |
| it (Italian) | WER | 1.65% | 1.08% |
| ja (Japanese) | CER | 2.40% | 1.82% |
| km (Khmer) | CER | 2.05% | 75.15% |
| ko (Korean) | CER | 0.95% | 0.29% |
| lo (Lao) | CER | 1.90% | 87.40% |
| ms (Malay) | WER | 1.75% | 1.41% |
| my (Burmese) | CER | 1.42% | 85.27% |
| nl (Dutch) | WER | 1.25% | 1.68% |
| no (Norwegian) | WER | 2.49% | 3.76% |
| pl (Polish) | WER | 1.90% | 1.65% |
| pt (Portuguese) | WER | 1.48% | 1.49% |
| ru (Russian) | WER | 0.90% | 0.86% |
| sv (Swedish) | WER | 2.22% | 2.63% |
| sw (Swahili) | CER | 1.07% | 2.02% |
| th (Thai) | CER | 0.94% | 1.92% |
| tl (Tagalog) | WER | 2.63% | 4.00% |
| tr (Turkish) | WER | 1.65% | 1.65% |
| vi (Vietnamese) | WER | 1.56% | 5.56% |
| zh (Chinese) | CER | 0.92% | 1.02% |
| Average (30 languages) | | 1.68% | - |
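The "Average" row can be reproduced from the per-language figures. The short check below assumes the average is an unweighted mean over the 30 languages, mixing WER and CER exactly as the table does; the numbers are copied from the VoxCPM2 column above.

```python
# VoxCPM2 column of the table above, in percent (WER or CER depending on the language).
VOXCPM2_ERROR_RATES = {
    "ar": 1.23, "da": 2.70, "de": 0.96, "el": 3.17, "en": 0.42, "es": 1.33,
    "fi": 2.24, "fr": 2.16, "he": 2.98, "hi": 0.79, "id": 1.36, "it": 1.65,
    "ja": 2.40, "km": 2.05, "ko": 0.95, "lo": 1.90, "ms": 1.75, "my": 1.42,
    "nl": 1.25, "no": 2.49, "pl": 1.90, "pt": 1.48, "ru": 0.90, "sv": 2.22,
    "sw": 1.07, "th": 0.94, "tl": 2.63, "tr": 1.65, "vi": 1.56, "zh": 0.92,
}

# Unweighted (macro) average: every language counts equally.
macro_average = sum(VOXCPM2_ERROR_RATES.values()) / len(VOXCPM2_ERROR_RATES)
print(f"{len(VOXCPM2_ERROR_RATES)} languages, macro-average error rate: {macro_average:.2f}%")
```

Under that assumption the computed mean agrees with the 1.68% reported in the table.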

InstructTTSEval

Instruction-Guided Voice Design Results

| Model | ZH APS ↑ | ZH DSD ↑ | ZH RP ↑ | EN APS ↑ | EN DSD ↑ | EN RP ↑ |
|---|---|---|---|---|---|---|
| Hume | - | - | - | 83.0 | 75.3 | 54.3 |
| VoxInstruct | 47.5 | 52.3 | 42.6 | 54.9 | 57.0 | 39.3 |
| Parler-tts-mini | - | - | - | 63.4 | 48.7 | 28.6 |
| Parler-tts-large | - | - | - | 60.0 | 45.9 | 31.2 |
| PromptTTS | - | - | - | 64.3 | 47.2 | 31.4 |
| PromptStyle | - | - | - | 57.4 | 46.4 | 30.9 |
| VoiceSculptor | 75.7 | 64.7 | 61.5 | - | - | - |
| Mimo-Audio-7B-Instruct | 75.7 | 74.3 | 61.5 | 80.6 | 77.6 | 59.5 |
| Qwen3TTS-12Hz-1.7B-VD | 85.2 | 81.1 | 65.1 | 82.9 | 82.4 | 68.4 |
| VoxCPM2 | 85.2 | 71.5 | 60.8 | 84.2 | 83.2 | 71.4 |

ZH and EN denote InstructTTSEval-ZH and InstructTTSEval-EN.

⚙️ Fine-tuning

VoxCPM supports both full fine-tuning (SFT) and LoRA fine-tuning. With as little as 5 to 10 minutes of audio, you can adapt the model to a specific speaker, language, or domain.

# LoRA fine-tuning (parameter-efficient, recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml

# WebUI for training & inference
python lora_ft_webui.py   # then open http://localhost:7860

Full guide → Fine-tuning Guide (data preparation, configuration, training, LoRA hot-swapping, FAQ)


📚 Documentation

Full documentation: voxcpm.readthedocs.io

| Topic | Link |
|---|---|
| Quick Start & Installation | Quick Start |
| Usage Guide & Cookbook | User Guide |
| VoxCPM Series | Models |
| Fine-tuning (SFT & LoRA) | Fine-tuning Guide |
| FAQ & Troubleshooting | FAQ |

🌟 Ecosystem & Community

| Project | Description |
|---|---|
| Nano-vLLM | High-throughput and fast GPU serving |
| vLLM-Omni | Official vLLM omni-modal serving for VoxCPM2: PagedAttention, OpenAI-compatible API |
| VoxCPM.cpp | GGML/GGUF: CPU, CUDA, Vulkan inference |
| VoxCPM-ONNX | ONNX export for CPU inference |
| VoxCPMANE | Apple Neural Engine backend |
| voxcpm_rs | Rust re-implementation |
| ComfyUI-VoxCPM | ComfyUI node-based workflows |
| ComfyUI_RH_VoxCPM | Feature-complete ComfyUI workflow for VoxCPM 2 with multi-speaker generation, LoRA, and auto-ASR |
| ComfyUI-VoxCPMTTS | ComfyUI TTS extension |
| TTS WebUI | Browser-based TTS extension |

See the full Ecosystem in the docs. Community projects are not officially maintained by OpenBMB. Built something cool? Open an issue or PR to add it!


⚠️ Risks and Limitations

  • Potential for Misuse: VoxCPM's voice cloning can generate highly realistic synthetic speech. It is strictly forbidden to use VoxCPM for impersonation, fraud, or disinformation. We strongly recommend clearly marking any AI-generated content.
  • Controllable Generation Stability: Voice Design and Controllable Voice Cloning results can vary between runs, so you may need to generate 1 to 3 times to obtain the desired voice or style. We are actively working on improving controllability consistency.
  • Language Coverage: VoxCPM2 officially supports 30 languages. For languages not on the list, you are welcome to test directly or try fine-tuning on your own data. We plan to expand language coverage in future releases.
  • Usage: This model is released under the Apache-2.0 license. For production deployments, we recommend conducting thorough testing and safety evaluation tailored to your use case.

📖 Citation

If you find VoxCPM helpful, please consider citing our work and starring ⭐ the repository!

@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation
             and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}

📄 License

VoxCPM model weights and code are open-sourced under the Apache-2.0 license.

🙏 Acknowledgments

  • DiTAR for the diffusion autoregressive backbone
  • MiniCPM-4 for the language model foundation
  • CosyVoice for the Flow Matching-based LocDiT implementation
  • DAC for the Audio VAE backbone
  • Our community users for trying VoxCPM, reporting issues, sharing ideas, and contributing; your support helps the project keep getting better

Institutions

ModelBest | THUHCSI
