# Nano-vLLM-VoxCPM

An inference engine for VoxCPM based on Nano-vLLM.
Features:

- Faster than the PyTorch implementation
- Supports concurrent requests
- Friendly async API (can be wrapped by an HTTP server; see deployment/README.md)
This repository contains a Python package (`nanovllm_voxcpm/`) plus an optional FastAPI demo.
## Installation

Nano-vLLM-VoxCPM is not available on PyPI yet. Install from source.
### Prerequisites

- Linux + NVIDIA GPU (CUDA)
- Python >= 3.10
- `flash-attn` is required (the package imports it at runtime)

The runtime is GPU-centric (Triton + FlashAttention). CPU-only execution is not supported.
### Install with uv (recommended)

This repo uses uv and includes a lockfile (`uv.lock`).

```bash
uv sync --frozen
```

Dev deps (tests):

```bash
uv sync --frozen --dev
```

Note: `flash-attn` may require additional system CUDA tooling depending on your environment.
## Basic Usage

See `example.py` for an end-to-end async example.

Quickstart:

```bash
uv run python example.py
```
### Load a model

`VoxCPM.from_pretrained(...)` accepts either:

- a local model directory path, or
- a HuggingFace repo id (it will download via `huggingface_hub.snapshot_download`)

The model directory is expected to contain:

- `config.json`
- one or more `*.safetensors` weight files
- `audiovae.pth` (VAE weights)
### Generate (async)

If you call `from_pretrained()` inside an async event loop, it returns an `AsyncVoxCPMServerPool`.

```python
import asyncio

import numpy as np

from nanovllm_voxcpm import VoxCPM


async def main() -> None:
    server = VoxCPM.from_pretrained(
        model="/path/to/VoxCPM",
        devices=[0],
        max_num_batched_tokens=8192,
        max_num_seqs=16,
        gpu_memory_utilization=0.95,
    )
    await server.wait_for_ready()

    chunks = []
    async for chunk in server.generate(target_text="Hello world"):
        chunks.append(chunk)  # each chunk is a float32 numpy array
    wav = np.concatenate(chunks, axis=0)

    # Write with the model's sample rate (see your model's AudioVAE config; often 16000)
    # import soundfile as sf; sf.write("out.wav", wav, sample_rate)

    await server.stop()


if __name__ == "__main__":
    asyncio.run(main())
```
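The example above leaves `wav` as a float32 array. If you prefer not to add an audio dependency such as soundfile, one way to save the result is to convert it to 16-bit PCM and write it with the stdlib `wave` module (a sketch; the sample rate must match your model's AudioVAE config, which this README does not pin down):

```python
import wave

import numpy as np


def write_wav(path: str, wav: np.ndarray, sample_rate: int) -> None:
    """Write mono float32 samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = (np.clip(wav, -1.0, 1.0) * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)        # mono
        f.setsampwidth(2)        # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```

For example, `write_wav("out.wav", wav, 16000)` after the generation loop, substituting your model's actual sample rate.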
### Generate (sync)

If you call `from_pretrained()` outside an event loop, it returns a `SyncVoxCPMServerPool`.

```python
import numpy as np

from nanovllm_voxcpm import VoxCPM

server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])

chunks = []
for chunk in server.generate(target_text="Hello world"):
    chunks.append(chunk)
wav = np.concatenate(chunks, axis=0)

server.stop()
```
### Prompting (optional)

The VoxCPM server supports three prompt modes:

- zero-shot: no prompt
- provide `prompt_latents` + `prompt_text`
- provide a stored `prompt_id` (via `add_prompt`) and then generate with that id

See the docstrings in `nanovllm_voxcpm/models/voxcpm/server.py` for details.
## FastAPI demo

The HTTP server demo is documented separately to keep this README focused: see `deployment/README.md`.

If you also want the deployment server dependencies, use:

```bash
uv sync --all-packages --frozen
```
## Benchmark

The `benchmark/` directory contains an end-to-end inference benchmark that drives the public server API and reports throughput/latency metrics.

Quick run:

```bash
uv run python benchmark/bench_inference.py --model ~/VoxCPM1.5 --devices 0 --concurrency 1 --warmup 1 --iters 5
```

Use a longer English prompt (~100 words) for more stable results:

```bash
uv run python benchmark/bench_inference.py --model ~/VoxCPM1.5 --devices 0 --concurrency 1 --warmup 1 --iters 5 \
    --target-text-file benchmark/target_text_100w_en.txt
```

See `benchmark/README.md` for more flags.
### Reference Results (RTX 4090)

All reference numbers in this section were measured on an NVIDIA GeForce RTX 4090.

The benchmark reports `RTF_per_req_mean`, defined as the mean over requests of (request_wall_time / request_audio_duration) under the given concurrency.
Test setup:

- GPU: NVIDIA GeForce RTX 4090
- Model: `~/VoxCPM1.5`
- Benchmark: `benchmark/bench_inference.py`
- Runs: `--warmup 1 --iters 5`
Short prompt ("Hello world."):
Note: with a very short prompt, the model's stopping behavior can be noisy, so output audio duration (and thus RTF) may have high variance at higher concurrency.
| concurrency | TTFB p50 (s) | TTFB p90 (s) | RTF_per_req_mean |
|---|---|---|---|
| 1 | 0.1741 ± 0.0012 | 0.1741 ± 0.0012 | 0.1918 ± 0.0127 |
| 8 | 0.1804 ± 0.0041 | 0.1807 ± 0.0040 | 0.2353 ± 0.0162 |
| 16 | 0.1870 ± 0.0055 | 0.1878 ± 0.0054 | 0.3009 ± 0.0094 |
| 32 | 0.1924 ± 0.0052 | 0.1932 ± 0.0051 | 0.4055 ± 0.0099 |
| 64 | 0.2531 ± 0.0823 | 0.2918 ± 0.0938 | 0.6755 ± 0.0668 |
Long prompt (`benchmark/target_text_100w_en.txt`):
| concurrency | TTFB p50 (s) | TTFB p90 (s) | RTF_per_req_mean |
|---|---|---|---|
| 1 | 0.1909 ± 0.0102 | 0.1909 ± 0.0102 | 0.0805 ± 0.0007 |
| 8 | 0.1902 ± 0.0021 | 0.1905 ± 0.0021 | 0.1159 ± 0.0004 |
| 16 | 0.2044 ± 0.0050 | 0.2050 ± 0.0051 | 0.1825 ± 0.0007 |
| 32 | 0.2168 ± 0.0034 | 0.2185 ± 0.0032 | 0.3207 ± 0.0022 |
| 64 | 0.3235 ± 0.0063 | 0.3250 ± 0.0064 | 0.5556 ± 0.0033 |
Closed-loop users benchmark (`benchmark/bench_closed_loop_users.py`):

- Model: `~/VoxCPM1.5`
- Command:

```bash
uv run python benchmark/bench_closed_loop_users.py \
    --model ~/VoxCPM1.5 \
    --num-users 60 --warmup-s 5 --duration-s 60 \
    --target-text-file benchmark/target_text_100w_en.txt \
    --max-generate-length 2000
```
Results (measured window):
| item | value |
|---|---|
| sample_rate (Hz) | 44100 |
| users | 60 |
| started | 119 |
| achieved rps | 1.98 |
| ok | 119 |
| err | 0 |
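As a quick sanity check, the reported throughput is consistent with completed requests divided by the measured window (assuming that is how achieved rps is computed):

```python
# 119 ok requests over the 60 s measured window (5 s warmup excluded)
achieved_rps = 119 / 60
print(round(achieved_rps, 2))  # 1.98
```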
TTFB (seconds, ok requests):
| p50 | p90 | p95 | p99 | mean | stdev |
|---|---|---|---|---|---|
| 0.2634 | 0.3477 | 0.3531 | 0.3631 | 0.2884 | 0.0451 |
RTF (wall/audio, ok requests):
| p50 | p90 | p95 | p99 | mean | stdev |
|---|---|---|---|---|---|
| 0.7285 | 0.7946 | 0.8028 | 0.8255 | 0.6929 | 0.1062 |
## Acknowledgments

## License

MIT License

## Known Issue
If you see the errors below:

```
ValueError: Missing parameters: ['base_lm.embed_tokens.weight', 'base_lm.layers.0.self_attn.qkv_proj.weight', ... , 'stop_proj.weight', 'stop_proj.bias', 'stop_head.weight']
[rank0]:[W1106 07:26:04.469150505 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
```

it's because nanovllm loads model parameters from `*.safetensors`, but some VoxCPM releases ship weights as `.pt`.

Fix:

- use a safetensors-converted checkpoint (or convert the checkpoint yourself)
- ensure the `*.safetensors` files live next to `config.json` in the model directory
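A minimal conversion sketch (assumes the `.pt` file holds a flat state dict, or one nested under a `state_dict` key; the file names are placeholders, and it requires `torch` and `safetensors` installed):

```python
import torch
from safetensors.torch import save_file

# Load the PyTorch checkpoint on CPU
state = torch.load("pytorch_model.pt", map_location="cpu")
if isinstance(state, dict) and "state_dict" in state:
    # some checkpoints nest the weights under a "state_dict" key
    state = state["state_dict"]

# safetensors requires contiguous tensors
state = {k: v.contiguous() for k, v in state.items()}
save_file(state, "model.safetensors")
```

Afterwards, place the resulting `model.safetensors` next to `config.json` in the model directory.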