# VoiceQuant

**5x More Concurrent Voice Agents on the Same GPU**

VoiceQuant applies TurboQuant KV cache compression to cut KV memory 5x, enabling ~40 concurrent voice-agent sessions on a single T4 GPU instead of ~8.
## The Problem

Voice AI inference is expensive: each concurrent caller needs its own KV cache. On a T4 (16 GB), after loading a 7B model (~4 GB quantized), roughly 12 GB remain for KV caches. At FP16, that supports only ~8 concurrent sessions at 4K context, and a 1,500-token system prompt leaves you with even fewer.

## The Solution

VoiceQuant uses TurboQuant (PolarQuant rotation + Lloyd-Max quantization + QJL residual correction) to compress KV caches from 16-bit to 4-bit at 0.99+ cosine similarity. Same GPU, ~40 concurrent sessions.
## Quick Start

```bash
pip install voicequant

# Start serving (requires GPU + vLLM)
voicequant serve --model Qwen/Qwen2.5-7B-Instruct-AWQ

# Validate compression quality
voicequant verify --model Qwen/Qwen2.5-7B-Instruct-AWQ --bits 4

# Run voice AI benchmarks
voicequant bench --all --report benchmark_report.md
```
## Benchmark Results
| Metric | FP16 | TQ4 (4-bit) | TQ3 (3-bit) | Improvement |
|---|---|---|---|---|
| Concurrent sessions (T4, 4K ctx) | ~8 | ~40 | ~55 | 5x |
| KV cache per session (4K ctx) | ~150 MB | ~30 MB | ~22 MB | 5-7x smaller |
| TTFB at 8K context | baseline | ~same | ~same | neutral |
| Key cosine similarity | 1.000 | 0.993+ | 0.985+ | - |
| Value cosine similarity | 1.000 | 0.990+ | 0.975+ | - |
| Tool calling accuracy | 100% | ~100% | ~99% | - |
## Deploy to Modal (One Command)

```bash
# Generate deployment files
voicequant deploy modal --model Qwen/Qwen2.5-7B-Instruct-AWQ --gpu T4

# Deploy
modal deploy deploy/modal_deploy.py
```

Your endpoint is now live at `https://your-workspace--voicequant.modal.run/v1`.
## Deploy to RunPod

```bash
voicequant deploy runpod --model Qwen/Qwen2.5-7B-Instruct-AWQ --gpu T4
```

## Deploy with Docker

```bash
voicequant deploy docker --model Qwen/Qwen2.5-7B-Instruct-AWQ
docker compose up --build
```
## Use with LiveKit Agents

VoiceQuant exposes an OpenAI-compatible API, so any agent framework can use it as a drop-in replacement:
```python
from livekit.agents import AgentSession, Agent, JobContext, WorkerOptions, cli
from livekit.plugins import deepgram, openai, cartesia, silero

VOICEQUANT_URL = "https://your-workspace--voicequant.modal.run/v1"


async def entrypoint(ctx: JobContext):
    await ctx.connect()
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(
            model="Qwen/Qwen2.5-7B-Instruct-AWQ",
            base_url=VOICEQUANT_URL,
            api_key="voicequant",
        ),
        tts=cartesia.TTS(model="sonic-3"),
    )
    await session.start(
        agent=Agent(instructions="You are a helpful voice assistant."),
        room=ctx.room,
    )


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
## Use with Inference Gateway

Add VoiceQuant as a provider in your gateway config:

```yaml
# gateway.yaml
models:
  llm:
    voicequant/qwen2.5-7b-tq4:
      provider: openai_compatible
      base_url: https://your-voicequant.modal.run/v1
      api_key: voicequant
```
## Supported Models
| Model | Size | AWQ Variant | Weights RAM | Voice Quality | Recommended GPU |
|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 3B | AWQ 4-bit | ~2GB | Good for simple tasks | T4 (16GB) |
| Qwen2.5-7B-Instruct | 7B | AWQ 4-bit | ~4GB | Excellent for voice | T4/A10G |
| Llama-3.1-8B-Instruct | 8B | AWQ 4-bit | ~5GB | Great all-around | T4/A10G |
| Mistral-7B-Instruct-v0.3 | 7B | AWQ 4-bit | ~4GB | Good instruction following | T4/A10G |
| Qwen2.5-14B-Instruct | 14B | AWQ 4-bit | ~8GB | Best quality in class | A10G/L4 |
## Concurrent Session Estimates
| GPU | Memory | Model Weights | Available for KV | FP16 Sessions | TQ4 Sessions |
|---|---|---|---|---|---|
| T4 | 16 GB | ~4 GB | ~12 GB | ~8 | ~40 |
| A10G | 24 GB | ~4 GB | ~20 GB | ~13 | ~65 |
| L4 | 24 GB | ~4 GB | ~20 GB | ~13 | ~65 |
| A100 | 80 GB | ~4 GB | ~76 GB | ~50 | ~250 |
| H100 | 80 GB | ~4 GB | ~76 GB | ~50 | ~250 |
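These estimates already fold in serving overhead; the raw upper bound is just available memory divided by per-session KV bytes. The sketch below uses an assumed GQA configuration similar to Qwen2.5-7B (28 layers, 4 KV heads, head dim 128 — assumptions, not values read from the model), so it yields a looser bound than the table, which accounts for activation workspace and vLLM's reserved headroom:

```python
def kv_bytes_per_token(bits, n_layers=28, n_kv_heads=4, head_dim=128):
    # Keys + values, `bits` per element (assumed GQA config, see lead-in).
    return 2 * n_layers * n_kv_heads * head_dim * bits // 8


def max_sessions(gpu_gb, weights_gb, ctx_len, bits):
    """Upper bound on concurrent sessions; ignores activation
    workspace, fragmentation, and vLLM's reserved headroom."""
    available = (gpu_gb - weights_gb) * 10**9
    return available // (kv_bytes_per_token(bits) * ctx_len)


# T4 (16 GB), ~4 GB weights, 4K context: FP16 vs. TQ4 bound.
print(max_sessions(16, 4, 4096, 16), max_sessions(16, 4, 4096, 4))
```

The FP16-to-TQ4 ratio here is exactly 4x (16 bits vs. 4); the table's ~5x also reflects the per-session overheads that hit FP16 harder.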
## How TurboQuant Works
- PolarQuant Rotation: A fixed random orthogonal matrix rotates KV cache coordinates so they become approximately Gaussian distributed.
- Lloyd-Max Quantization: Optimal scalar quantization for Gaussian data. Provably minimizes MSE for the given bit budget.
- QJL Residual Correction (keys only): Random projection of the quantization residual preserves inner product expectations, correcting bias in attention scores.
Result: 3-4 bits per element with 0.99+ cosine similarity.
- Keys: 2-bit MSE quantization + 1-bit QJL bias correction (3 bits total)
- Values: 3-bit MSE quantization
- Both compressed in a single fused kernel per attention head
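As a rough illustration of steps 1-2 only (not the fused kernel, and using a plain NumPy Lloyd quantizer rather than the production codebooks or the QJL correction), rotating first lets a single scalar codebook serve all coordinates well:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Toy "key" vectors with very uneven per-coordinate scales (non-Gaussian overall).
keys = rng.standard_normal((1000, d)) * rng.uniform(0.1, 3.0, size=d)

# Step 1: fixed random orthogonal rotation (QR of a Gaussian matrix).
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
rotated = keys @ q


# Step 2: Lloyd-Max scalar quantization at 4 bits (1-D k-means on the samples).
def lloyd_max(v, bits=4, iters=25):
    levels = np.quantile(v, np.linspace(0.0, 1.0, 2**bits + 2)[1:-1])  # init
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2  # nearest-level cell boundaries
        cell = np.searchsorted(edges, v)        # assign each sample to a cell
        for k in range(levels.size):            # centroid update
            members = v[cell == k]
            if members.size:
                levels[k] = members.mean()
    edges = (levels[:-1] + levels[1:]) / 2
    return levels[np.searchsorted(edges, v)]


quantized = lloyd_max(rotated.ravel()).reshape(rotated.shape)

# Rotate back and check per-vector reconstruction quality.
recon = quantized @ q.T
cos = np.sum(keys * recon, axis=1) / (
    np.linalg.norm(keys, axis=1) * np.linalg.norm(recon, axis=1)
)
print(f"mean cosine similarity at 4 bits: {cos.mean():.3f}")
```

Without the rotation, coordinates with large scale would dominate a shared codebook; after rotation every coordinate is approximately Gaussian with the same variance, which is exactly the regime where Lloyd-Max is MSE-optimal.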
## Voice-Specific Optimizations
- Residual window (default: 256 tokens): Recent tokens stay in FP16 for maximum quality. Older tokens (system prompt, early conversation) get compressed aggressively.
- Low max_tokens (default: 150): Voice responses should be 1-3 sentences. A 500-token response takes 10+ seconds to speak.
- Continuous batching: vLLM's continuous batching handles many concurrent short sessions efficiently.
- Streaming by default: TTFB matters more than throughput for voice AI.
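The residual-window policy from the first bullet amounts to a simple index split (a toy sketch; the real cache management lives in the vLLM integration, and `window` mirrors the documented 256-token default):

```python
def split_kv_ranges(seq_len: int, window: int = 256):
    """Split [0, seq_len) into (compressed, fp16) token ranges.

    The newest `window` tokens stay in FP16 for maximum quality;
    everything older (system prompt, early turns) is eligible for
    aggressive TQ compression.
    """
    cut = max(0, seq_len - window)
    return (0, cut), (cut, seq_len)


# A 1,500-token system prompt plus some dialogue: only the tail stays FP16.
compressed, fp16 = split_kv_ranges(2000)
print(compressed, fp16)  # → (0, 1744) (1744, 2000)
```

Short sequences fall entirely inside the window, so a session never pays a compression penalty before the window fills.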
## CLI Reference

```bash
# Start server
voicequant serve --model MODEL --tq-bits 4 --port 8000

# Run benchmarks
voicequant bench --all --report output.md
voicequant bench --scenario concurrent --max-sessions 50
voicequant bench --scenario multi_turn

# Validate quality
voicequant verify --model MODEL --bits 4 --threshold 0.99

# Deploy
voicequant deploy modal --model MODEL --gpu T4
voicequant deploy runpod --model MODEL --gpu T4
voicequant deploy docker --model MODEL
```
## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | OpenAI-compatible chat completions (streaming + non-streaming) |
| `/v1/models` | GET | List available models |
| `/v1/health` | GET | Health check with GPU memory status |
| `/v1/capacity` | GET | Estimated concurrent session capacity |
| `/v1/kv-stats` | GET | KV cache memory usage and compression ratio |
| `/metrics` | GET | Prometheus-format metrics |
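For `/v1/chat/completions`, the request body follows the standard OpenAI schema; a minimal example (the model name matches the one used throughout this README, and the values are illustrative):

```json
{
  "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
  "messages": [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "user", "content": "What are your hours?"}
  ],
  "max_tokens": 150,
  "stream": true
}
```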
## Development

```bash
pip install -e ".[all]"
pytest tests/ -v
```
## Acknowledgments
- TurboQuant — Google Research (ICLR 2026): PolarQuant rotation + Lloyd-Max + QJL residual correction
- DevTechJr/turboquant-gpu — cuTile CUDA kernels + PyTorch fallback
- Alberto-Codes/turboquant-vllm — vLLM plugin integration
- 0xSero/turboquant — Standalone implementation
- mitkox/vllm-turboquant — vLLM fork with Triton backend
## License
Apache 2.0