# tqkit

Unified toolkit for benchmarking and integrating TurboQuant+ KV-cache compression across LLM inference engines.
## What this is
tqkit is a single CLI and Python package that talks to every inference engine that ships TurboQuant+ KV-cache compression:
- llama.cpp (TheTom/llama.cpp@feature/turboquant-kv-cache)
- vLLM (CUDA) (TheTom/vllm@feature/turboquant-kv-cache)
- vLLM (AMD ROCm) (TheTom/vllm@feature/turboquant-amd-noautotune)
- MLX-Swift (TheTom/mlx@feature/turboquant-plus)
- vllm-swift plugin
You bring the inference engine. tqkit autodetects what's installed, runs the canonical benchmark, and prints a reproducible KV-savings table.
## Why this exists
The KV cache is the dominant memory cost at long context. TurboQuant+ asymmetric (K=FP8, V=4-bit + metadata) shrinks it by ~62% (or ~57% accounting for the 4 boundary layers that stay FP16). The savings replicate across engines and hardware vendors. tqkit is the proof, the tool, and the install path.
For a 14B model at 1M tokens of context:
| layout | KV cache size (all-quantized) | fits on MI300X 192 GB after weights? |
|---|---|---|
| FP16 | 192 GB | no |
| TQ+ asym (`turboquant_k8v4`) | 73.5 GB headline / ~83 GB realistic with boundary skip | yes |
| TQ+ sym 4-bit (`turboquant_4bit_nc`) | 50.3 GB | yes (more headroom) |
You can verify the math yourself:

```sh
pip install tqkit
tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
tq table --model qwen2.5-14b-instruct-1m
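The table above follows from simple per-token arithmetic. A minimal sketch in plain Python, using the model shape that `tq report` prints for Qwen2.5-14B-Instruct-1M (the per-element byte costs are my reading of the layouts, not tqkit internals, and "1M" is taken as 2^20 tokens to match the round 192 GB figure):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, k_bytes, v_bytes):
    # Each layer stores one K and one V vector per KV head per token.
    return layers * kv_heads * head_dim * (k_bytes + v_bytes)

shape = dict(layers=48, kv_heads=8, head_dim=128)  # Qwen2.5-14B-Instruct-1M
CTX = 2 ** 20  # "1M" tokens, binary

fp16 = kv_bytes_per_token(**shape, k_bytes=2.0, v_bytes=2.0)  # K and V both FP16
asym = kv_bytes_per_token(**shape, k_bytes=1.0, v_bytes=0.5)  # K=FP8, V=4-bit

print(f"FP16:     {fp16 / 1024:.1f} KB/token, {fp16 * CTX / 2**30:.1f} GB @ 1M ctx")
print(f"TQ+ asym: {asym / 1024:.1f} KB/token, {asym * CTX / 2**30:.1f} GB @ 1M ctx")
print(f"savings:  {(1 - asym / fp16) * 100:.1f}%")
```

This reproduces the 72.0 KB/token, 72.0 GB, and 62.5% figures from the `tq report` output shown under Usage; the 73.5 GB headline in the table presumably includes the V-quantization metadata, which this sketch omits.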
## Install

```sh
pip install tqkit
```
## Usage

```sh
tq backends                                           # autodetect installed engines
tq report --model qwen2.5-14b-instruct-1m --ctx 32K   # KV cache size for one config
tq table --model qwen2.5-14b-instruct-1m              # full layout × ctx grid
tq integrate <backend>                                # install + serve recipe
tq bench                                              # canonical benchmark (v0.3.0)
```
Example output:

```
$ tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
[KV cache] model: Qwen/Qwen2.5-14B-Instruct-1M
[KV cache] arch: layers=48 kv_heads=8 head_dim=128
[KV cache] layout: tq+asym
[KV cache] per-token: 72.0 KB (vs 192.0 KB FP16)
[KV cache] total @ 1M ctx: 72.0 GB (vs 192.0 GB FP16, 62.5% savings)
```
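`tq backends`-style autodetection amounts to probing the environment for each engine. A minimal sketch of the idea (the probes and function name are hypothetical, not tqkit's actual implementation):

```python
import importlib.util
import shutil

def detect_backends():
    """Best-effort probe for installed inference engines (illustrative only)."""
    found = {}
    # Python engines: present if the package is importable.
    for name, module in [("vllm", "vllm"), ("mlx", "mlx")]:
        found[name] = importlib.util.find_spec(module) is not None
    # llama.cpp: look for its server binary on PATH.
    found["llama.cpp"] = shutil.which("llama-server") is not None
    return found

print(detect_backends())
```

A real bridge would also need to check that the installed build is the TurboQuant+ fork (e.g. that it accepts the `turboquant_k8v4` KV-cache dtype), not just that the engine exists.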
## Integration recipes
One-page docs for plugging TurboQuant+ into each supported backend live under docs/integrate/.
- llama.cpp — NVIDIA, Apple, AMD, CPU
- vLLM (NVIDIA CUDA) — A100, H100, RTX 4090
- vLLM (AMD ROCm) — MI300X (the only TQ+ port for AMD anywhere)
- MLX-Swift — Apple Silicon Macs + iPhone
- vllm-swift — Apple Silicon OpenAI-API server
## Docker (AMD ROCm)

```sh
docker pull thetom/vllm-turboquant:rocm-7.2
docker run --rm -it \
  --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 thetom/vllm-turboquant:rocm-7.2 \
  --model Qwen/Qwen2.5-14B-Instruct-1M --kv-cache-dtype turboquant_k8v4
```
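Once the container is up, it serves vLLM's OpenAI-compatible API on port 8000. A minimal client sketch that just builds the request body (the endpoint path and payload shape are the standard OpenAI-compatible chat API, nothing TurboQuant+-specific; sending it requires the running server):

```python
import json

# Chat-completions request for the server started above.
payload = {
    "model": "Qwen/Qwen2.5-14B-Instruct-1M",
    "messages": [{"role": "user", "content": "Summarize TurboQuant+ in one line."}],
    "max_tokens": 64,
}
body = json.dumps(payload)
print(body)
# POST this to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (via curl, requests, or urllib).
```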
See docker/README.md for build details.
## Status

v0.4.0 — alpha. Shipping today:

- KV math + `tq report` + `tq table`
- Pinned `canonical_bench.yml` + `tq config`
- Engine bridges (`tq bench`) for llama.cpp, vLLM (CUDA + AMD), MLX-Swift, vllm-swift
- Integration recipes for all 5 backends (`docs/integrate/`)
- Docker scaffold for AMD ROCm (`docker/Dockerfile.vllm-amd`)
- Models supported in `tq report`: Qwen2.5 7B/14B/32B, Qwen3-8B, Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3-Next-80B-A3B, Llama-3.1 8B/70B, Mistral-7B
- 39 tests, 92% line coverage, ≥85% gate enforced
See CHANGELOG.md for the full version history.
## License

Apache 2.0.
## File details

Details for the file `tqkit-0.4.1.tar.gz`.

File metadata:

- Download URL: tqkit-0.4.1.tar.gz
- Upload date:
- Size: 28.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `5b7bfe7c747c5a7f61d3bf345b205ae17f6aa2cf32a8c5d69d8b0c8b9931c574` |
| MD5 | `bf0f4b38d9c8561cb0e68a581eaa92ec` |
| BLAKE2b-256 | `4e1b01be7f616eb62862e9dec3e8ea312ee339c3cf8a49f9392be16583f76a04` |
## File details

Details for the file `tqkit-0.4.1-py3-none-any.whl`.

File metadata:

- Download URL: tqkit-0.4.1-py3-none-any.whl
- Upload date:
- Size: 28.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `36f8ff07c0b6d42022ac222d4f4ff22a60b2fc1e396c111c211d727b81c047ef` |
| MD5 | `337b8d6b34f2e6bc3f144ffa95339319` |
| BLAKE2b-256 | `cf00395c998e7ab80c226e8c6e28db465b9dd1169e18bbd6c6c80aa5c6108508` |