
Nano-vLLM-VoxCPM

An inference engine for VoxCPM based on Nano-vLLM.

Features:

  • Faster than the PyTorch implementation
  • Supports concurrent requests
  • Friendly async API (can be wrapped by an HTTP server; see deployment/README.md)

This repository contains a Python package (nanovllm_voxcpm/) plus an optional FastAPI demo.

Installation

Nano-vLLM-VoxCPM is not available on PyPI yet. Install from source.

Prerequisites

  • Linux + NVIDIA GPU (CUDA)
  • Python >= 3.10
  • flash-attn is required (the package imports it at runtime)

The runtime is GPU-centric (Triton + FlashAttention). CPU-only execution is not supported.

Install with uv (recommended)

This repo uses uv and includes a lockfile (uv.lock).

uv sync --frozen

Dev deps (tests):

uv sync --frozen --dev

Note: flash-attn may require additional system CUDA tooling depending on your environment.

Basic Usage

See example.py for an end-to-end async example.

Quickstart:

uv run python example.py

Load a model

VoxCPM.from_pretrained(...) accepts either:

  • a local model directory path, or
  • a Hugging Face repo id (downloaded via huggingface_hub.snapshot_download).

The model directory is expected to contain:

  • config.json
  • one or more *.safetensors weight files
  • audiovae.pth (VAE weights)

Generate (async)

If you call from_pretrained() inside an async event loop, it returns an AsyncVoxCPMServerPool.

import asyncio
import numpy as np

from nanovllm_voxcpm import VoxCPM


async def main() -> None:
    server = VoxCPM.from_pretrained(
        model="/path/to/VoxCPM",
        devices=[0],
        max_num_batched_tokens=8192,
        max_num_seqs=16,
        gpu_memory_utilization=0.95,
    )
    await server.wait_for_ready()

    chunks = []
    async for chunk in server.generate(target_text="Hello world"):
        chunks.append(chunk)  # each chunk is a float32 numpy array

    wav = np.concatenate(chunks, axis=0)
    # Write with the model's sample rate (see your model's AudioVAE config; often 16000)
    # import soundfile as sf; sf.write("out.wav", wav, sample_rate)

    await server.stop()


if __name__ == "__main__":
    asyncio.run(main())
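The commented-out write step can also be done with the standard library alone. A minimal sketch (not part of the package), assuming a mono float32 waveform in [-1, 1] and a sample rate you have read from your model's AudioVAE config:

```python
import wave

import numpy as np


def write_wav(path: str, wav: np.ndarray, sample_rate: int) -> None:
    """Write a mono float32 waveform as 16-bit PCM WAV."""
    pcm = (np.clip(wav, -1.0, 1.0) * 32767.0).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```

soundfile (as hinted in the comment above) keeps float32 precision; this wave-module route quantizes to 16-bit PCM, which is usually fine for listening tests.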

Generate (sync)

If you call from_pretrained() outside an event loop, it returns a SyncVoxCPMServerPool.

import numpy as np

from nanovllm_voxcpm import VoxCPM


server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])
chunks = []
for chunk in server.generate(target_text="Hello world"):
    chunks.append(chunk)
wav = np.concatenate(chunks, axis=0)
server.stop()

Prompting (optional)

The VoxCPM server supports three prompt modes:

  • zero-shot: no prompt
  • provide prompt_latents + prompt_text
  • provide a stored prompt_id (via add_prompt) and then generate with that id

See the docstrings in nanovllm_voxcpm/models/voxcpm/server.py for details.
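To make the three modes concrete, here is a small hypothetical helper (not part of the package) that classifies a request by which prompt arguments are set, mirroring the rules above:

```python
from typing import Any, Optional


def prompt_mode(prompt_id: Optional[str] = None,
                prompt_latents: Optional[Any] = None,
                prompt_text: Optional[str] = None) -> str:
    """Classify a generate() request by its prompt arguments."""
    if prompt_id is not None:
        return "stored"       # id previously registered via add_prompt
    if prompt_latents is not None and prompt_text is not None:
        return "inline"       # latents + transcript supplied per request
    if prompt_latents is None and prompt_text is None:
        return "zero-shot"    # no prompt at all
    raise ValueError("prompt_latents and prompt_text must be given together")
```

The actual argument names and validation live in nanovllm_voxcpm/models/voxcpm/server.py; this sketch only illustrates that the three modes are mutually exclusive.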

FastAPI demo

The HTTP server demo is documented separately to keep this README focused:

  • deployment/README.md

If you want the deployment server dependencies too, use:

uv sync --all-packages --frozen

Benchmark

The benchmark/ directory contains an end-to-end inference benchmark that drives the public server API and reports throughput/latency metrics.

Quick run:

uv run python benchmark/bench_inference.py --model ~/VoxCPM1.5 --devices 0 --concurrency 1 --warmup 1 --iters 5

Use a longer English prompt (~100 words) for more stable results:

uv run python benchmark/bench_inference.py --model ~/VoxCPM1.5 --devices 0 --concurrency 1 --warmup 1 --iters 5 \
  --target-text-file benchmark/target_text_100w_en.txt

See benchmark/README.md for more flags.

Reference Results (RTX 4090)

All reference numbers in this section were measured on an NVIDIA GeForce RTX 4090.

The benchmark reports RTF_per_req_mean, defined as the mean over requests of (request_wall_time / request_audio_duration) under the given concurrency.
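For example, a request that takes 1.2 s of wall time to produce 6.0 s of audio has an RTF of 0.2 (lower is better; below 1.0 is faster than real time). The metric averages this ratio across requests:

```python
def rtf_per_req_mean(wall_times, audio_durations):
    """Mean over requests of (request_wall_time / request_audio_duration)."""
    ratios = [w / a for w, a in zip(wall_times, audio_durations)]
    return sum(ratios) / len(ratios)
```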

Test setup:

  • GPU: NVIDIA GeForce RTX 4090
  • Model: ~/VoxCPM1.5
  • Benchmark: benchmark/bench_inference.py
  • Runs: --warmup 1 --iters 5

Short prompt ("Hello world."):

Note: with a very short prompt, the model's stopping behavior can be noisy, so output audio duration (and thus RTF) may have high variance at higher concurrency.

concurrency TTFB p50 (s) TTFB p90 (s) RTF_per_req_mean
1 0.1741 ± 0.0012 0.1741 ± 0.0012 0.1918 ± 0.0127
8 0.1804 ± 0.0041 0.1807 ± 0.0040 0.2353 ± 0.0162
16 0.1870 ± 0.0055 0.1878 ± 0.0054 0.3009 ± 0.0094
32 0.1924 ± 0.0052 0.1932 ± 0.0051 0.4055 ± 0.0099
64 0.2531 ± 0.0823 0.2918 ± 0.0938 0.6755 ± 0.0668

Long prompt (benchmark/target_text_100w_en.txt):

concurrency TTFB p50 (s) TTFB p90 (s) RTF_per_req_mean
1 0.1909 ± 0.0102 0.1909 ± 0.0102 0.0805 ± 0.0007
8 0.1902 ± 0.0021 0.1905 ± 0.0021 0.1159 ± 0.0004
16 0.2044 ± 0.0050 0.2050 ± 0.0051 0.1825 ± 0.0007
32 0.2168 ± 0.0034 0.2185 ± 0.0032 0.3207 ± 0.0022
64 0.3235 ± 0.0063 0.3250 ± 0.0064 0.5556 ± 0.0033

Closed-loop users benchmark (benchmark/bench_closed_loop_users.py):

  • Model: ~/VoxCPM1.5
  • Command:
uv run python benchmark/bench_closed_loop_users.py \
  --model ~/VoxCPM1.5 \
  --num-users 60 --warmup-s 5 --duration-s 60 \
  --target-text-file benchmark/target_text_100w_en.txt \
  --max-generate-length 2000

Results (measured window):

item value
sample_rate (Hz) 44100
users 60
started 119
achieved rps 1.98
ok 119
err 0

TTFB (seconds, ok requests):

p50 p90 p95 p99 mean stdev
0.2634 0.3477 0.3531 0.3631 0.2884 0.0451

RTF (wall/audio, ok requests):

p50 p90 p95 p99 mean stdev
0.7285 0.7946 0.8028 0.8255 0.6929 0.1062
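The percentile columns above can be reproduced with a simple nearest-rank rule; the benchmark script may use a different interpolation method, so treat this as an illustration only:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least
    p percent of all samples at or below it."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(s)))
    return s[k - 1]
```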

Acknowledgments

License

MIT License

Known Issue

If you see the errors below:

ValueError: Missing parameters: ['base_lm.embed_tokens.weight', 'base_lm.layers.0.self_attn.qkv_proj.weight', ... , 'stop_proj.weight', 'stop_proj.bias', 'stop_head.weight']
[rank0]:[W1106 07:26:04.469150505 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

This happens because nanovllm loads model parameters only from *.safetensors files, while some VoxCPM releases ship their weights as .pt.

Fix:

  • use a safetensors-converted checkpoint (or convert the checkpoint yourself)
  • ensure the *.safetensors files live next to config.json in the model directory
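A quick stdlib-only sanity check (a hypothetical helper, not part of the package) for whether a model directory has the expected layout before loading:

```python
from pathlib import Path


def check_model_dir(model_dir: str) -> list[str]:
    """Return a list of problems that would break weight loading."""
    d = Path(model_dir)
    problems = []
    if not (d / "config.json").is_file():
        problems.append("missing config.json")
    if not list(d.glob("*.safetensors")):
        if list(d.glob("*.pt")):
            problems.append("weights are .pt; convert them to *.safetensors")
        else:
            problems.append("no *.safetensors weight files found")
    if not (d / "audiovae.pth").is_file():
        problems.append("missing audiovae.pth (VAE weights)")
    return problems
```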
