# tqkit
Unified toolkit for benchmarking and integrating TurboQuant+ KV-cache compression across LLM inference engines.
## What this is
tqkit is a single CLI and Python package that talks to every inference engine that ships TurboQuant+ KV-cache compression:
- llama.cpp (TheTom/llama.cpp@feature/turboquant-kv-cache)
- vLLM (CUDA) (TheTom/vllm@feature/turboquant-kv-cache)
- vLLM (AMD ROCm) (TheTom/vllm@feature/turboquant-amd-noautotune)
- MLX-Swift (TheTom/mlx@feature/turboquant-plus)
- vllm-swift plugin
You bring the inference engine. tqkit autodetects what's installed, runs the canonical benchmark, and prints a reproducible KV-savings table.
## Why this exists
KV cache is the dominant memory cost at long context. TurboQuant+ asymmetric (K=q8_0, V=turbo4) shrinks it ~62% with negligible accuracy loss. The savings replicate across engines and hardware vendors. tqkit is the proof, the tool, and the install path.
For a 14B model at 1M tokens of context:
| layout | KV cache size | fits on MI300X 192GB? |
|---|---|---|
| FP16 | 192 GB | no (weights take ~28 GB, leaving ~164 GB) |
| TQ+ asym (K=q8_0, V=turbo4) | 72 GB | yes |
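The arithmetic behind the table is easy to check by hand. A minimal sketch of the per-token KV-cache math, using simplified bytes-per-element figures that are an assumption based on the format names (8-bit K ≈ 1 byte/element, 4-bit V ≈ 0.5 byte/element; real quantized formats carry a small amount of extra scale metadata):

```python
# Per-token KV cache footprint: one K vector and one V vector
# per layer per KV head, each of length head_dim.

def kv_bytes_per_token(layers, kv_heads, head_dim, k_bytes, v_bytes):
    """Bytes of KV cache consumed by a single token."""
    elems = layers * kv_heads * head_dim
    return elems * (k_bytes + v_bytes)

# Qwen2.5-14B-Instruct-1M architecture (as reported by `tq report` below)
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2.0, 2.0)
tq_asym = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1.0, 0.5)  # K=q8_0, V=turbo4

print(f"FP16:     {fp16 / 1024:.1f} KB/token")          # 192.0 KB/token
print(f"TQ+ asym: {tq_asym / 1024:.1f} KB/token")       # 72.0 KB/token
print(f"savings:  {100 * (1 - tq_asym / fp16):.1f}%")   # 62.5%
```

The 62.5% figure falls straight out of the layout: K at 8 bits is half of FP16, V at 4 bits is a quarter, and the average of the two is 0.375× the FP16 footprint.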
You can verify the math yourself:

```shell
pip install tqkit
tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
tq table --model qwen2.5-14b-instruct-1m
```
## Install

```shell
pip install tqkit
```
## Usage

```shell
tq backends                                            # autodetect installed engines
tq report --model qwen2.5-14b-instruct-1m --ctx 32K    # KV cache size for one config
tq table --model qwen2.5-14b-instruct-1m               # full layout × ctx grid
tq integrate <backend>                                 # install + serve recipe
tq bench                                               # canonical benchmark (v0.3.0)
```
Example output:

```shell
$ tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
[KV cache] model: Qwen/Qwen2.5-14B-Instruct-1M
[KV cache] arch: layers=48 kv_heads=8 head_dim=128
[KV cache] layout: tq+asym
[KV cache] per-token: 72.0 KB (vs 192.0 KB FP16)
[KV cache] total @ 1M ctx: 72.0 GB (vs 192.0 GB FP16, 62.5% savings)
```
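The total-at-context line is just the per-token figure scaled by the window length. A sketch of that step, assuming "1M" means 2^20 tokens and that sizes are reported in binary units (KiB/GiB), which matches the numbers in the report:

```python
# Scale the per-token KV footprint up to a full context window.
per_token_kb = 72.0                          # TQ+ asym per-token size from the report
ctx_tokens = 1 << 20                         # "1M" context, assumed to be 2**20 tokens
total_gb = per_token_kb * ctx_tokens / 1024 ** 2   # KB -> GB in binary units
print(f"total @ 1M ctx: {total_gb:.1f} GB")  # 72.0 GB
```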
## Integration recipes
One-page docs for plugging TurboQuant+ into each supported backend live under docs/integrate/.
- llama.cpp — NVIDIA, Apple, AMD, CPU
- vLLM (NVIDIA CUDA) — A100, H100, RTX 4090
- vLLM (AMD ROCm) — MI300X (the only TQ+ port for AMD anywhere)
- MLX-Swift — Apple Silicon Macs + iPhone
- vllm-swift — Apple Silicon OpenAI-API server
## Docker (AMD ROCm)

```shell
docker pull thetom/vllm-turboquant:rocm-7.2
docker run --rm -it \
  --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 thetom/vllm-turboquant:rocm-7.2 \
  --model Qwen/Qwen2.5-14B-Instruct-1M --kv-cache-dtype turboquant_k8v4
```
See docker/README.md for build details.
## Status

v0.2.1 (alpha). KV math, the reporter, the table, the canonical bench config, and the integration docs ship today; the canonical bench runner with engine bridges lands in v0.3.0.
## License
Apache 2.0.
File details
Details for the file tqkit-0.3.0.tar.gz.
File metadata
- Download URL: tqkit-0.3.0.tar.gz
- Upload date:
- Size: 24.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `55808e48234c96db05e3d2a5162ad8a6780ef010c21e4962832e7ad3268e0a44` |
| MD5 | `66c7b9f0a62003b9daa83153d5236a4f` |
| BLAKE2b-256 | `75f1f350f9cd9c5e79f7c9792e9bbbc8b0adc5f24709e5bc38d32fe04ac9ccbd` |
File details
Details for the file tqkit-0.3.0-py3-none-any.whl.
File metadata
- Download URL: tqkit-0.3.0-py3-none-any.whl
- Upload date:
- Size: 24.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `ab3abf478119cbb1ad4f9393296ebffba5047955d113b845ae484be712f7c942` |
| MD5 | `0cca3b650f863b478d70a60ab88f2c00` |
| BLAKE2b-256 | `d78beb1bc92acb6ae7c7f37fa6ee21b64bee788c64a8e8e5b7aaf6ad587bcc44` |