VoxCPM inference engine based on Nano-vLLM
Nano-vLLM VoxCPM HiFi
An inference engine for VoxCPM based on Nano-vLLM.
Features:
- Faster than the PyTorch implementation
- Supports concurrent requests
- Friendly async API (can be wrapped by an HTTP server; see
deployment/README.md)
This repository contains a Python package (nanovllm_voxcpm/) plus an optional FastAPI demo.
What's included
This branch is no longer just a minimal async wrapper. It now includes a production-oriented FastAPI layer and a set of deployment / benchmarking helpers for real VoxCPM2 cloning use cases.
New HTTP capabilities
The FastAPI layer now supports:
- streaming generation: POST /generate
- non-streaming generation: POST /generate_blocking (returns audio/mpeg), POST /generate_blocking_wav (returns audio/wav)
- prompt caching: POST /add_prompt, DELETE /prompts/{prompt_id}
- reference-audio caching: POST /add_reference, DELETE /references/{reference_id}
- HiFi clone bundle caching: POST /add_hifi, DELETE /hifi/{hifi_id}
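As a quick sanity check, a minimal call to the blocking MP3 endpoint might look like the sketch below (host and port are placeholders for wherever the demo server is running; all conditioning fields are omitted, i.e. zero-shot):

```bash
# Minimal zero-shot request against the blocking MP3 endpoint.
# Assumes the FastAPI demo is reachable on localhost:8000; adjust as needed.
curl -X POST http://localhost:8000/generate_blocking \
  -H "Content-Type: application/json" \
  -d '{"target_text": "Hello world"}' \
  --output out.mp3
```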
GenerateRequest now supports all of the following conditioning paths:
- zero-shot
- prompt_wav_* + prompt_text
- prompt_latents_base64 + prompt_text
- prompt_id
- ref_audio_wav_*
- ref_audio_latents_base64
- ref_audio_id
- hifi_id
Web-aligned HiFi clone semantics
The original VoxCPM2 Gradio UI's “Ultimate Cloning / 极致克隆” is not equivalent to a plain prompt_id test, and it is also not equivalent to a plain ref_audio_id test.
The web UI effectively combines the same reference audio in two roles:
- reference_wav_path → separate reference-audio condition
- prompt_wav_path + prompt_text → continuation-style conditioning
The FastAPI equivalent is therefore:
```json
{
  "target_text": "...",
  "prompt_wav_base64": "...",
  "prompt_wav_format": "wav",
  "prompt_text": "...",
  "ref_audio_wav_base64": "...",
  "ref_audio_wav_format": "wav"
}
```
To make this cheaper and reusable, this branch adds hifi_id, which internally binds one prompt_id plus one reference_id into a single reusable cache handle.
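A rough end-to-end sketch of that flow over HTTP is shown below, using httpx purely for illustration. The add_hifi request/response field names are assumptions; only the endpoints, the GenerateRequest fields, and hifi_id itself come from this README.

```python
# Hypothetical sketch of the hifi_id workflow.
# The add_hifi body/response shapes below are assumptions, not the documented API.
import httpx

BASE = "http://localhost:8000"  # placeholder host/port

with httpx.Client(base_url=BASE, timeout=120) as client:
    # Register the clone bundle once (same reference audio in both roles).
    resp = client.post("/add_hifi", json={
        "prompt_wav_base64": "...",      # prompt role (continuation-style)
        "prompt_wav_format": "wav",
        "prompt_text": "...",
        "ref_audio_wav_base64": "...",   # reference role (separate condition)
        "ref_audio_wav_format": "wav",
    })
    hifi_id = resp.json()["hifi_id"]     # assumed response field

    # Reuse the cached bundle for cheap repeated generations.
    audio = client.post("/generate_blocking", json={
        "target_text": "Hello world",
        "hifi_id": hifi_id,
    })
    with open("out.mp3", "wb") as f:
        f.write(audio.content)
```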
Included helper scripts
- tools/bench_latent_concurrency.py
- tools/bench_prompt_vs_ref_hot.py
- tools/bench_prompt_family_hot.py
- tools/bench_hifi_concurrency.py
- tools/bench_hifi_blocking_concurrency.py
- tools/post_start_warmup.sh
- tools/nanovllm_hifi_gradio.py
- tools/nanovllm-hifi-gradio.service.example
Deployment / test notes from the RTX 4090 host (2026-04-12)
Host used during validation:
- OS: Ubuntu 24.04
- GPU: NVIDIA GeForce RTX 4090
- Driver: 590.48.01
- CUDA reported by nvidia-smi: 13.1
Main deployment pitfalls encountered
- uv sync --frozen plus build isolation was not enough for flash-attn on this host.
  The reliable path here was installing the repo editable without build isolation and then compiling/installing flash-attn explicitly.
- The original VoxCPM2 / Gradio side had a torch / torchvision mismatch.
  This was fixed by moving torchvision to a version matching the installed CUDA/torch line.
- The original VoxCPM2 environment needed a newer NVIDIA driver.
  nvidia-driver-590 was required to stabilize the CUDA 13 line used by that side.
- Warmup inside the FastAPI lifespan was unsafe.
  Doing generation directly inside startup triggered scheduler.py: assert scheduled_seqs. The safe solution was a post-start warmup script executed after /health became ready.
- prompt_id and ref_audio_id should not be compared as if they were the same task.
  prompt_id is continuation-style conditioning; ref_audio_id is a separate reference-audio condition. The real web "HiFi / Ultimate Cloning (极致克隆)" path is the combined route described above.
Latest stable configuration used for HiFi testing
- NANOVLLM_SERVERPOOL_ENFORCE_EAGER=false
- NANOVLLM_SERVERPOOL_INFERENCE_TIMESTEPS=10
- NANOVLLM_QUEUE_COALESCE_MS=5
- FastAPI default cfg_value=2.0
- HiFi post-start warmup enabled via tools/post_start_warmup.sh
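In shell form, that configuration plus the post-start warmup might look like the sketch below; the server start command and port are placeholders, while the environment variables and warmup script are the ones listed above.

```bash
# Environment used for the HiFi tests on this host.
export NANOVLLM_SERVERPOOL_ENFORCE_EAGER=false
export NANOVLLM_SERVERPOOL_INFERENCE_TIMESTEPS=10
export NANOVLLM_QUEUE_COALESCE_MS=5

# Start the FastAPI service (see deployment/README.md), then warm up only
# after /health reports ready (never inside the lifespan hook).
until curl -sf http://localhost:8000/health > /dev/null; do sleep 1; done
bash tools/post_start_warmup.sh
```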
Latest measured HiFi numbers (strict warm procedure)
Warm procedure used before concurrency tests:
- service warmup
- add_hifi
- sleep 3s
- warm the same hifi_id twice
- sleep 5s
- run 5-concurrency test
- sleep 5s
- run 10-concurrency test
With NANOVLLM_QUEUE_COALESCE_MS=5:
| scenario | avg TTFB (s) | avg total (s) | P95 total (s) |
|---|---|---|---|
| HiFi warm single request | 0.185~0.220 | 0.696~0.722 | - |
| HiFi streaming, 5 concurrency | 0.246 | 1.078 | 1.158 |
| HiFi streaming, 10 concurrency | 0.401 | 1.582 | 1.682 |
| HiFi blocking MP3, 5 concurrency | - | 0.998 | 1.091 |
| HiFi blocking MP3, 10 concurrency | - | 1.296 | 1.426 |
Queue coalescing comparison under the same strict HiFi method
| NANOVLLM_QUEUE_COALESCE_MS | 5-concurrency avg total (s) | 10-concurrency avg total (s) | verdict |
|---|---|---|---|
| 2 | 1.430 | 2.013 | slower than 5 |
| 5 | 1.078 | 1.582 | best overall |
| 10 | 1.131 | 2.421 | helps 5-conc a bit, hurts 10-conc badly |
Current recommendation for HiFi on this host: NANOVLLM_QUEUE_COALESCE_MS=5.
Installation
Install from PyPI
Core package:
pip install nano-vllm-voxcpm-hifi
Or with uv:
uv pip install nano-vllm-voxcpm-hifi
Note: the optional FastAPI demo service (deployment/) is not published on PyPI.
Prerequisites
- Linux + NVIDIA GPU (CUDA)
- Python >= 3.10
- flash-attn is required (the package imports it at runtime)
The runtime is GPU-centric (Triton + FlashAttention). CPU-only execution is not supported.
Install from source (dev)
This repo uses uv and includes a lockfile (uv.lock).
uv sync --frozen
Dev deps (tests):
uv sync --frozen --dev
Note: flash-attn may require additional system CUDA tooling depending on your environment.
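If flash-attn fails to build under uv's isolated builds (as on the RTX 4090 host described above), a sketch of the workaround from the deployment notes is:

```bash
# Install the package editable without build isolation, then build/install
# flash-attn explicitly against the already-installed torch.
uv pip install -e . --no-build-isolation
uv pip install flash-attn --no-build-isolation
```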
Basic Usage
See example.py for an end-to-end async example.
Quickstart:
uv run python example.py
Load a model
VoxCPM.from_pretrained(...) accepts either:
- a local model directory path, or
- a HuggingFace repo id (it will download via huggingface_hub.snapshot_download).
The model directory is expected to contain:
- config.json
- one or more *.safetensors weight files
- audiovae.pth (VAE weights)
Generate (async)
If you call from_pretrained() inside an async event loop, it returns an AsyncVoxCPMServerPool.
```python
import asyncio
import numpy as np
from nanovllm_voxcpm import VoxCPM


async def main() -> None:
    server = VoxCPM.from_pretrained(
        model="/path/to/VoxCPM",
        devices=[0],
        max_num_batched_tokens=8192,
        max_num_seqs=16,
        gpu_memory_utilization=0.95,
    )
    await server.wait_for_ready()

    chunks = []
    async for chunk in server.generate(target_text="Hello world"):
        chunks.append(chunk)  # each chunk is a float32 numpy array
    wav = np.concatenate(chunks, axis=0)
    # Write with the model's sample rate (see your model's AudioVAE config; often 16000)
    # import soundfile as sf; sf.write("out.wav", wav, sample_rate)

    await server.stop()


if __name__ == "__main__":
    asyncio.run(main())
```
Generate (sync)
If you call from_pretrained() outside an event loop, it returns a SyncVoxCPMServerPool.
```python
import numpy as np
from nanovllm_voxcpm import VoxCPM

server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])

chunks = []
for chunk in server.generate(target_text="Hello world"):
    chunks.append(chunk)
wav = np.concatenate(chunks, axis=0)

server.stop()
```
Prompting and reference audio (optional)
The VoxCPM2 server supports these conditioning inputs:
- zero-shot: no prompt or reference audio
- prompt continuation: provide prompt_latents + prompt_text
- stored prompt: provide a prompt_id (via add_prompt) and then generate with that id
- reference audio: provide ref_audio_latents to add a separate reference-audio condition

ref_audio_latents is independent from prompt_latents:
- use prompt_latents when you want to continue from an existing audio prefix
- use ref_audio_latents when you want to provide extra reference audio without treating it as the decode prefix
See the public API in nanovllm_voxcpm/models/voxcpm2/server.py for details.
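A rough sketch of how these paths could look through the async server is below. The keyword names follow the conditioning inputs listed above, but they are assumptions; treat server.py as the source of truth for the actual signatures.

```python
# Sketch only: keyword names follow the conditioning inputs listed above;
# the authoritative signatures are in nanovllm_voxcpm/models/voxcpm2/server.py.

async def conditioning_examples(server, prompt_latents, prompt_text, ref_audio_latents):
    # Prompt continuation: continue from an existing audio prefix.
    async for _chunk in server.generate(
        target_text="...", prompt_latents=prompt_latents, prompt_text=prompt_text
    ):
        pass

    # Stored prompt: register once (assumed call shape), then reuse by id.
    prompt_id = await server.add_prompt(prompt_latents, prompt_text)
    async for _chunk in server.generate(target_text="...", prompt_id=prompt_id):
        pass

    # Reference audio: extra conditioning, not treated as the decode prefix.
    async for _chunk in server.generate(
        target_text="...", ref_audio_latents=ref_audio_latents
    ):
        pass
```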
FastAPI demo
The HTTP server demo is documented separately to keep this README focused:
deployment/README.md
If you want the deployment server dependencies too, use:
uv sync --all-packages --frozen
Benchmark
The benchmark/ directory contains an end-to-end inference benchmark that drives
the public server API and reports throughput/latency metrics.
Quick run:
uv run python benchmark/bench_inference.py --model ~/VoxCPM1.5 --devices 0 --concurrency 1 --warmup 1 --iters 5
Use a longer English prompt (~100 words) for more stable results:
uv run python benchmark/bench_inference.py --model ~/VoxCPM1.5 --devices 0 --concurrency 1 --warmup 1 --iters 5 \
--target-text-file benchmark/target_text_100w_en.txt
See benchmark/README.md for more flags.
Reference Results (RTX 4090)
All reference numbers in this section are measured on NVIDIA GeForce RTX 4090.
The benchmark reports RTF_per_req_mean, defined as the mean over requests of
(request_wall_time / request_audio_duration) under the given concurrency.
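Restated as a small helper (a minimal restatement of the definition, not the benchmark's actual code):

```python
def rtf_per_req_mean(wall_times_s: list[float], audio_durations_s: list[float]) -> float:
    """Mean over requests of (request_wall_time / request_audio_duration)."""
    return sum(w / a for w, a in zip(wall_times_s, audio_durations_s)) / len(wall_times_s)
```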
Test setup:
- GPU: NVIDIA GeForce RTX 4090
- Model: ~/VoxCPM1.5
- Benchmark: benchmark/bench_inference.py
- Runs: --warmup 1 --iters 5
Short prompt ("Hello world."):
Note: with a very short prompt, the model's stopping behavior can be noisy, so output audio duration (and thus RTF) may have high variance at higher concurrency.
| concurrency | TTFB p50 (s) | TTFB p90 (s) | RTF_per_req_mean |
|---|---|---|---|
| 1 | 0.1741 ± 0.0012 | 0.1741 ± 0.0012 | 0.1918 ± 0.0127 |
| 8 | 0.1804 ± 0.0041 | 0.1807 ± 0.0040 | 0.2353 ± 0.0162 |
| 16 | 0.1870 ± 0.0055 | 0.1878 ± 0.0054 | 0.3009 ± 0.0094 |
| 32 | 0.1924 ± 0.0052 | 0.1932 ± 0.0051 | 0.4055 ± 0.0099 |
| 64 | 0.2531 ± 0.0823 | 0.2918 ± 0.0938 | 0.6755 ± 0.0668 |
Long prompt (benchmark/target_text_100w_en.txt):
| concurrency | TTFB p50 (s) | TTFB p90 (s) | RTF_per_req_mean |
|---|---|---|---|
| 1 | 0.1909 ± 0.0102 | 0.1909 ± 0.0102 | 0.0805 ± 0.0007 |
| 8 | 0.1902 ± 0.0021 | 0.1905 ± 0.0021 | 0.1159 ± 0.0004 |
| 16 | 0.2044 ± 0.0050 | 0.2050 ± 0.0051 | 0.1825 ± 0.0007 |
| 32 | 0.2168 ± 0.0034 | 0.2185 ± 0.0032 | 0.3207 ± 0.0022 |
| 64 | 0.3235 ± 0.0063 | 0.3250 ± 0.0064 | 0.5556 ± 0.0033 |
Closed-loop users benchmark (benchmark/bench_closed_loop_users.py):
- Model: ~/VoxCPM1.5
- Command:
uv run python benchmark/bench_closed_loop_users.py \
--model ~/VoxCPM1.5 \
--num-users 60 --warmup-s 5 --duration-s 60 \
--target-text-file benchmark/target_text_100w_en.txt \
--max-generate-length 2000
Results (measured window):
| item | value |
|---|---|
| sample_rate (Hz) | 44100 |
| users | 60 |
| started | 119 |
| achieved rps | 1.98 |
| ok | 119 |
| err | 0 |
TTFB (seconds, ok requests):
| p50 | p90 | p95 | p99 | mean | stdev |
|---|---|---|---|---|---|
| 0.2634 | 0.3477 | 0.3531 | 0.3631 | 0.2884 | 0.0451 |
RTF (wall/audio, ok requests):
| p50 | p90 | p95 | p99 | mean | stdev |
|---|---|---|---|---|---|
| 0.7285 | 0.7946 | 0.8028 | 0.8255 | 0.6929 | 0.1062 |
Acknowledgments
License
MIT License
Known Issue
If you see the errors below:
ValueError: Missing parameters: ['base_lm.embed_tokens.weight', 'base_lm.layers.0.self_attn.qkv_proj.weight', ... , 'stop_proj.weight', 'stop_proj.bias', 'stop_head.weight']
[rank0]:[W1106 07:26:04.469150505 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
It's because nanovllm loads model parameters from *.safetensors, but some VoxCPM releases ship weights as .pt.
Fix:
- use a safetensors-converted checkpoint (or convert the checkpoint yourself)
- ensure the *.safetensors files live next to config.json in the model directory
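If you need to convert a .pt checkpoint yourself, a minimal sketch is below; the input filename and whether the file stores a raw state dict (or nests it under a "state_dict" key) are assumptions, so adjust for your release.

```python
# Sketch: convert a PyTorch .pt state dict to safetensors next to config.json.
import torch
from safetensors.torch import save_file

state_dict = torch.load("pytorch_model.pt", map_location="cpu")
if "state_dict" in state_dict:  # some checkpoints nest the weights
    state_dict = state_dict["state_dict"]

# safetensors requires contiguous tensors
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "model.safetensors")
```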