tqkit
Unified toolkit for benchmarking and integrating TurboQuant+ KV-cache compression across LLM inference engines.
What this is
tqkit is a single CLI and Python package that talks to every inference engine that ships TurboQuant+ KV-cache compression:
- llama.cpp (TheTom/llama.cpp@feature/turboquant-kv-cache)
- vLLM (CUDA) (TheTom/vllm@feature/turboquant-kv-cache)
- vLLM (AMD ROCm) (TheTom/vllm@feature/turboquant-amd)
- MLX-Swift (TheTom/mlx@feature/turboquant-plus)
- vllm-swift plugin
You bring the inference engine. tqkit autodetects what's installed, runs the canonical benchmark, and prints a reproducible KV-savings table.
Why this exists
KV cache is the dominant memory cost at long context. TurboQuant+ asymmetric quantization (K=q8_0, V=turbo4) shrinks it by 62.5% with negligible accuracy loss, and the savings replicate across engines and hardware vendors. tqkit is the proof, the tool, and the install path.
For a 14B model at 1M tokens of context:
| layout | KV cache size | fits on an MI300X (192 GB)? |
|---|---|---|
| FP16 | 192 GB | no (after weights, ~28 GB free) |
| TQ+ asym (K=q8_0, V=turbo4) | 72 GB | yes |
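The table's numbers follow from straightforward arithmetic on the architecture tqkit reports for this model (48 layers, 8 KV heads, head_dim 128). A minimal sketch, assuming q8_0 costs ~1 byte/value and turbo4 ~0.5 byte/value (real quant formats add small per-block scale overhead that this ignores):

```python
# Back-of-envelope KV-cache sizing for a Qwen2.5-14B-class model.
# Architecture taken from the `tq report` output below: 48 layers,
# 8 KV heads, head_dim 128.
# ASSUMPTION (not from the tqkit source): q8_0 ~ 1 byte/value,
# turbo4 ~ 0.5 byte/value, per-block scale overhead ignored.

LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
CTX = 2**20  # "1M" tokens

def kv_bytes_per_token(k_bytes: float, v_bytes: float) -> int:
    # One K vector and one V vector per layer, each kv_heads * head_dim wide.
    return int(LAYERS * KV_HEADS * HEAD_DIM * (k_bytes + v_bytes))

for name, k, v in [("fp16", 2.0, 2.0), ("tq+asym", 1.0, 0.5)]:
    per_tok = kv_bytes_per_token(k, v)
    total_gib = per_tok * CTX / 2**30
    print(f"{name:8s} {per_tok / 1024:6.1f} KiB/token  {total_gib:6.1f} GiB @ 1M ctx")

fp16 = kv_bytes_per_token(2.0, 2.0)
tq = kv_bytes_per_token(1.0, 0.5)
print(f"savings: {(1 - tq / fp16) * 100:.1f}%")  # 62.5%
```

With 1M taken as 2^20 tokens, the KiB/token figure conveniently equals the GiB total: 192 KiB/token yields 192 GiB for FP16, and 72 KiB/token yields 72 GiB for TQ+ asym.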
You can verify the math yourself:
pip install tqkit
tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
tq table --model qwen2.5-14b-instruct-1m
Install
pip install tqkit
Usage
tq backends # autodetect installed engines
tq report --model qwen2.5-14b-instruct-1m --ctx 32K # KV cache size for one config
tq table --model qwen2.5-14b-instruct-1m # full layout × ctx grid
tq integrate <backend> # install + serve recipe
tq bench # canonical benchmark (lands in v0.3.0)
Example output:
$ tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
[KV cache] model: Qwen/Qwen2.5-14B-Instruct-1M
[KV cache] arch: layers=48 kv_heads=8 head_dim=128
[KV cache] layout: tq+asym
[KV cache] per-token: 72.0 KB (vs 192.0 KB FP16)
[KV cache] total @ 1M ctx: 72.0 GB (vs 192.0 GB FP16, 62.5% savings)
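The "asym" in the layout name means keys and values are quantized at different widths. Under the same illustrative assumption as above (q8_0 ~ 1 byte/value for K, turbo4 ~ 0.5 byte/value for V, scale overhead ignored), the 72 KB per-token figure splits cleanly between the two streams:

```python
# How the 72 KiB/token reported above splits between K and V, under
# the illustrative assumption of 1 byte/value for K (q8_0) and
# 0.5 byte/value for V (turbo4). Not derived from the tqkit source.

LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
values_per_token = LAYERS * KV_HEADS * HEAD_DIM  # per K or per V stream

k_kib = values_per_token * 1.0 / 1024  # keys at ~8 bits/value
v_kib = values_per_token * 0.5 / 1024  # values at ~4 bits/value
print(f"K: {k_kib} KiB  V: {v_kib} KiB  total: {k_kib + v_kib} KiB/token")
```

Keeping K at higher precision than V is the design choice behind the layout: attention scores are computed against K, so it is quantized less aggressively, while V tolerates the 4-bit treatment.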
Status
v0.2.0 — alpha. KV math + reporter + table work. Canonical bench runner with engine bridges lands in v0.3.0.
License
Apache 2.0.
Download files
File details
Details for the file tqkit-0.2.1.tar.gz.
File metadata
- Download URL: tqkit-0.2.1.tar.gz
- Upload date:
- Size: 15.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0480e588f8db0e816adefd791f305102d41018ace9f111c8e76f8656244ff4fa` |
| MD5 | `1891605f718062ba43d2b4609b23e812` |
| BLAKE2b-256 | `75e52e03ccd3d3e3a058e119b6685c6bf2cd1e950af8e378671364b888cc2259` |
File details
Details for the file tqkit-0.2.1-py3-none-any.whl.
File metadata
- Download URL: tqkit-0.2.1-py3-none-any.whl
- Upload date:
- Size: 14.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `641b6e0eac6f01d8e1751fd0bd731dc092c5861c56721dcacbcf9fc7a146150e` |
| MD5 | `64f52d2a1f8bb89d9d7e4ce8278b8dac` |
| BLAKE2b-256 | `af5c75a7409eb1e7b3014d3ef57214a84ebffdffa533dd150bd84010e0dd7f47` |