Skip to main content

TurboQuant KV cache compression for voice AI inference — 5x more concurrent sessions on the same GPU

Project description

VoiceQuant

5x More Concurrent Voice Agents on the Same GPU

VoiceQuant applies TurboQuant KV cache compression to cut KV memory 5x, enabling 50 concurrent voice agent sessions on a single T4 GPU instead of 10.

The Problem

Voice AI inference is expensive. Each concurrent caller needs their own KV cache. On a T4 (16GB), after loading a 7B model (~4GB), you have ~12GB for KV caches. At FP16, that's only ~8 concurrent sessions at 4K context. Add a 1500-token system prompt and you're looking at even fewer.

The Solution

VoiceQuant uses TurboQuant (PolarQuant rotation + Lloyd-Max quantization + QJL residual correction) to compress KV caches from 16-bit to 4-bit with 0.99+ cosine similarity. Same GPU, ~40 concurrent sessions.

Quick Start

pip install voicequant

# Start serving (requires GPU + vLLM)
voicequant serve --model Qwen/Qwen2.5-7B-Instruct-AWQ

# Validate compression quality
voicequant verify --model Qwen/Qwen2.5-7B-Instruct-AWQ --bits 4

# Run voice AI benchmarks
voicequant bench --all --report benchmark_report.md

Benchmark Results

Metric FP16 TQ4 (4-bit) TQ3 (3-bit) Improvement
Concurrent sessions (T4, 4K ctx) ~8 ~40 ~55 5x
KV cache per session (4K ctx) ~150 MB ~30 MB ~22 MB 5-7x smaller
TTFB at 8K context baseline ~same ~same neutral
Key cosine similarity 1.000 0.993+ 0.985+ -
Value cosine similarity 1.000 0.990+ 0.975+ -
Tool calling accuracy 100% ~100% ~99% -

Deploy to Modal (One Command)

# Generate deployment files
voicequant deploy modal --model Qwen/Qwen2.5-7B-Instruct-AWQ --gpu T4

# Deploy
modal deploy deploy/modal_deploy.py

Your endpoint is now live at https://your-workspace--voicequant.modal.run/v1.

Deploy to RunPod

voicequant deploy runpod --model Qwen/Qwen2.5-7B-Instruct-AWQ --gpu T4

Deploy with Docker

voicequant deploy docker --model Qwen/Qwen2.5-7B-Instruct-AWQ
docker compose up --build

Use with LiveKit Agents

VoiceQuant exposes an OpenAI-compatible API, so any agent framework works as a drop-in:

from livekit.agents import AgentSession, Agent, JobContext, WorkerOptions, cli
from livekit.plugins import deepgram, openai, cartesia, silero

VOICEQUANT_URL = "https://your-workspace--voicequant.modal.run/v1"

async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-3"),
        llm=openai.LLM(
            model="Qwen/Qwen2.5-7B-Instruct-AWQ",
            base_url=VOICEQUANT_URL,
            api_key="voicequant",
        ),
        tts=cartesia.TTS(model="sonic-3"),
    )

    await session.start(
        agent=Agent(instructions="You are a helpful voice assistant."),
        room=ctx.room,
    )

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

Use with Inference Gateway

Add VoiceQuant as a provider in your gateway config:

# gateway.yaml
models:
  llm:
    voicequant/qwen2.5-7b-tq4:
      provider: openai_compatible
      base_url: https://your-voicequant.modal.run/v1
      api_key: voicequant

Supported Models

Model Size AWQ Variant Weights RAM Voice Quality Recommended GPU
Qwen2.5-3B-Instruct 3B AWQ 4-bit ~2GB Good for simple tasks T4 (16GB)
Qwen2.5-7B-Instruct 7B AWQ 4-bit ~4GB Excellent for voice T4/A10G
Llama-3.1-8B-Instruct 8B AWQ 4-bit ~5GB Great all-around T4/A10G
Mistral-7B-Instruct-v0.3 7B AWQ 4-bit ~4GB Good instruction following T4/A10G
Qwen2.5-14B-Instruct 14B AWQ 4-bit ~8GB Best quality in class A10G/L4

Concurrent Session Estimates

GPU Memory Model Weights Available for KV FP16 Sessions TQ4 Sessions
T4 16 GB ~4 GB ~12 GB ~8 ~40
A10G 24 GB ~4 GB ~20 GB ~13 ~65
L4 24 GB ~4 GB ~20 GB ~13 ~65
A100 80 GB ~4 GB ~76 GB ~50 ~250
H100 80 GB ~4 GB ~76 GB ~50 ~250

How TurboQuant Works

  1. PolarQuant Rotation: A fixed random orthogonal matrix rotates KV cache coordinates so they become approximately Gaussian distributed.
  2. Lloyd-Max Quantization: Optimal scalar quantization for Gaussian data. Provably minimizes MSE for the given bit budget.
  3. QJL Residual Correction (keys only): Random projection of the quantization residual preserves inner product expectations, correcting bias in attention scores.

Result: 3-4 bits per element with 0.99+ cosine similarity.

  • Keys: 2-bit MSE quantization + 1-bit QJL bias correction (3 bits total)
  • Values: 3-bit MSE quantization
  • Both compressed in a single fused kernel per attention head

Voice-Specific Optimizations

  • Residual window (default: 256 tokens): Recent tokens stay in FP16 for maximum quality. Older tokens (system prompt, early conversation) get compressed aggressively.
  • Low max_tokens (default: 150): Voice responses should be 1-3 sentences. A 500-token response takes 10+ seconds to speak.
  • Continuous batching: vLLM's continuous batching handles many concurrent short sessions efficiently.
  • Streaming by default: TTFB matters more than throughput for voice AI.

CLI Reference

# Start server
voicequant serve --model MODEL --tq-bits 4 --port 8000

# Run benchmarks
voicequant bench --all --report output.md
voicequant bench --scenario concurrent --max-sessions 50
voicequant bench --scenario multi_turn

# Validate quality
voicequant verify --model MODEL --bits 4 --threshold 0.99

# Deploy
voicequant deploy modal --model MODEL --gpu T4
voicequant deploy runpod --model MODEL --gpu T4
voicequant deploy docker --model MODEL

API Endpoints

Endpoint Method Description
/v1/chat/completions POST OpenAI-compatible chat completions (streaming + non-streaming)
/v1/models GET List available models
/v1/health GET Health check with GPU memory status
/v1/capacity GET Estimated concurrent session capacity
/v1/kv-stats GET KV cache memory usage and compression ratio
/metrics GET Prometheus-format metrics

Development

pip install -e ".[all]"
pytest tests/ -v

Acknowledgments

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

voicequant-0.0.1.tar.gz (391.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

voicequant-0.0.1-py3-none-any.whl (96.7 kB view details)

Uploaded Python 3

File details

Details for the file voicequant-0.0.1.tar.gz.

File metadata

  • Download URL: voicequant-0.0.1.tar.gz
  • Upload date:
  • Size: 391.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.6

File hashes

Hashes for voicequant-0.0.1.tar.gz
Algorithm Hash digest
SHA256 9d65f5ee7299320a94122fefe8f3d55f1a7b6be675d5ba2db3807c923d0c7d88
MD5 600dbb5aef9ecbe510b852e7f655a98b
BLAKE2b-256 4c42456b82c44736aacafbba26a24bb5afaeb71cfb7ed37a25b1957e5194a8d4

See more details on using hashes here.

File details

Details for the file voicequant-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: voicequant-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 96.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.6

File hashes

Hashes for voicequant-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d458a73fcd2c12b9f59310b47b217ce550d810b16749cc692055fc6ac0860c38
MD5 a043234dc7efafabecc5d903347182ab
BLAKE2b-256 82c1a37171abf703ec7df6f2994035d1f85c71e357e9edee5384011efe47a6d6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page