Unified toolkit for benchmarking and integrating TurboQuant+ KV-cache compression across inference engines (llama.cpp, vLLM, MLX).

tqkit

Unified toolkit for benchmarking and integrating TurboQuant+ KV-cache compression across LLM inference engines.

What this is

tqkit is a single CLI and Python package that talks to every inference engine that ships TurboQuant+ KV-cache compression:

llama.cpp (TheTom/llama.cpp@feature/turboquant-kv-cache)
vLLM (CUDA) (TheTom/vllm@feature/turboquant-kv-cache)
vLLM (AMD ROCm) (TheTom/vllm@feature/turboquant-amd-noautotune)
MLX-Swift (TheTom/mlx@feature/turboquant-plus)
vllm-swift plugin

You bring the inference engine. tqkit autodetects what's installed, runs the canonical benchmark, and prints a reproducible KV-savings table.

Why this exists

KV cache is the dominant memory cost at long context. TurboQuant+ asymmetric (K=FP8, V=4-bit + metadata) shrinks it by ~62%, or by ~57% once the 4 boundary layers that stay FP16 are accounted for. The savings replicate across engines and hardware vendors. tqkit is the proof, the tool, and the install path.

For a 14B model at 1M tokens of context:

Layout	KV cache size (all-quantized)	Fits on MI300X (192 GB) after weights?
FP16	192 GB	no
TQ+ asym (`turboquant_k8v4`)	73.5 GB headline / ~83 GB realistic with boundary skip	yes
TQ+ sym 4-bit (`turboquant_4bit_nc`)	50.3 GB	yes (more headroom)
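
The "realistic" figure in the TQ+ asym row can be derived from the all-quantized one with a little arithmetic. The sketch below assumes the model has 48 layers (the count `tq report` prints for this model), that KV cost is spread evenly across layers, and that exactly 4 layers stay FP16. The function name is illustrative and is not part of tqkit.

```python
def size_with_boundary_skip(fp16_gb, quantized_gb, layers, fp16_layers):
    """Blend per-layer costs when `fp16_layers` layers are left unquantized."""
    per_layer_fp16 = fp16_gb / layers
    per_layer_quant = quantized_gb / layers
    return (layers - fp16_layers) * per_layer_quant + fp16_layers * per_layer_fp16


fp16_gb = 192.0           # FP16 row of the table
all_quantized_gb = 73.5   # TQ+ asym headline figure
realistic_gb = size_with_boundary_skip(
    fp16_gb, all_quantized_gb, layers=48, fp16_layers=4
)
savings = 1 - realistic_gb / fp16_gb

print(f"{realistic_gb:.1f} GB, {savings:.1%} savings")
```

Under these assumptions the result lands at roughly 83 GB and a little under 57% savings, in line with the figures quoted above.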

You can verify the math yourself:

pip install tqkit
tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
tq table --model qwen2.5-14b-instruct-1m

Install

pip install tqkit

Usage

tq backends                                            # autodetect installed engines
tq report --model qwen2.5-14b-instruct-1m --ctx 32K    # KV cache size for one config
tq table --model qwen2.5-14b-instruct-1m               # full layout × ctx grid
tq integrate <backend>                                 # install + serve recipe
tq bench                                               # canonical benchmark (v0.3.0)

Example output:

$ tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
[KV cache] model: Qwen/Qwen2.5-14B-Instruct-1M
[KV cache] arch: layers=48 kv_heads=8 head_dim=128
[KV cache] layout: tq+asym
[KV cache] per-token: 72.0 KB (vs 192.0 KB FP16)
[KV cache] total @ 1M ctx: 72.0 GB (vs 192.0 GB FP16, 62.5% savings)
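
The numbers in this output can be reproduced by hand from the `arch:` line. The sketch below is not tqkit's implementation; it assumes K is stored at 8 bits and V at 4 bits per element with no quantization metadata (which appears to be why it arrives at 72.0 rather than the 73.5 GB headline in the table), and it assumes binary units, so 1 KB is 1024 bytes and `--ctx 1M` is 2**20 tokens.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, k_bits, v_bits):
    """Bytes of K plus V stored for one token across all layers (no metadata)."""
    elements = layers * kv_heads * head_dim  # element count of K (and of V)
    return elements * (k_bits + v_bits) / 8


arch = dict(layers=48, kv_heads=8, head_dim=128)  # from the `arch:` line
fp16 = kv_bytes_per_token(**arch, k_bits=16, v_bits=16)
asym = kv_bytes_per_token(**arch, k_bits=8, v_bits=4)

tokens = 1024 * 1024  # assumption: 1M context means 2**20 tokens
print(f"per-token: {asym / 1024:.1f} KB (vs {fp16 / 1024:.1f} KB FP16)")
print(
    f"total @ 1M ctx: {asym * tokens / 1024**3:.1f} GB "
    f"(vs {fp16 * tokens / 1024**3:.1f} GB FP16, {1 - asym / fp16:.1%} savings)"
)
```

Changing `k_bits` and `v_bits` gives the ideal size for other layouts; real layouts add per-block metadata on top of this lower bound.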

Integration recipes

One-page docs for plugging TurboQuant+ into each supported backend live under docs/integrate/.

llama.cpp — NVIDIA, Apple, AMD, CPU
vLLM (NVIDIA CUDA) — A100, H100, RTX 4090
vLLM (AMD ROCm) — MI300X (the only TQ+ port for AMD anywhere)
MLX-Swift — Apple Silicon Macs + iPhone
vllm-swift — Apple Silicon OpenAI-API server

Docker (AMD ROCm)

docker pull thetom/vllm-turboquant:rocm-7.2
docker run --rm -it \
    --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -p 8000:8000 thetom/vllm-turboquant:rocm-7.2 \
        --model Qwen/Qwen2.5-14B-Instruct-1M --kv-cache-dtype turboquant_k8v4

See docker/README.md for build details.
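
Once the container is running, a quick smoke test is to send it a completion request. The sketch below assumes the image's entrypoint starts vLLM's OpenAI-compatible server on port 8000, which the `--model` argument and port mapping suggest but which is not confirmed here. The prompt, `max_tokens` value, and helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches -p 8000:8000 in the docker run command


def build_completion_request(prompt, model="Qwen/Qwen2.5-14B-Instruct-1M", max_tokens=64):
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=120):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server up, `complete("The KV cache is")` should return generated text. A connection error usually means the model is still loading or the port mapping differs.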

Status

v0.4.0 — alpha. Shipping today:

KV math + tq report + tq table
Pinned canonical_bench.yml + tq config
Engine bridges (tq bench) for llama.cpp, vLLM (CUDA + AMD), MLX-Swift, vllm-swift
Integration recipes for all 5 backends (docs/integrate/)
Docker scaffold for AMD ROCm (docker/Dockerfile.vllm-amd)
Models supported in tq report: Qwen2.5 7B/14B/32B, Qwen3-8B, Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3-Next-80B-A3B, Llama-3.1 8B/70B, Mistral-7B
39 tests, 92% line coverage, ≥85% gate enforced

See CHANGELOG.md for the full version history.

License

Apache 2.0.

Release history

Version	Released
0.4.2	May 6, 2026
0.4.1	May 6, 2026
0.4.0 (this version)	May 6, 2026
0.3.0	May 6, 2026
0.2.1	May 6, 2026
0.2.0	May 6, 2026
0.1.0	May 6, 2026

Download files

Source Distribution

tqkit-0.4.0.tar.gz (26.1 kB)

Uploaded May 6, 2026 (Source)

Built Distribution

tqkit-0.4.0-py3-none-any.whl (25.7 kB)

Uploaded May 6, 2026 (Python 3)

File details

Details for the file tqkit-0.4.0.tar.gz.

File metadata

Download URL: tqkit-0.4.0.tar.gz
Upload date: May 6, 2026
Size: 26.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for tqkit-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`0d72eaf4750b700debb2dc04341a64225a757d6e084cfbe280bad6a7640d61a8`
MD5	`6886d701f3741e65d7c770f974db3a87`
BLAKE2b-256	`14179ddddf2145b6f652d5b65337639ace5cb99605f5ebee0aae7c13e32d9a57`
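
A downloaded archive can be checked against the SHA256 digest above with Python's standard library. This is a minimal sketch that assumes the file has been saved locally as `tqkit-0.4.0.tar.gz`; the helper names are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare a file's digest, case-insensitively, with a published one."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 listed above for tqkit-0.4.0.tar.gz
EXPECTED_SDIST = "0d72eaf4750b700debb2dc04341a64225a757d6e084cfbe280bad6a7640d61a8"
```

Calling `matches_published("tqkit-0.4.0.tar.gz", EXPECTED_SDIST)` returns `True` only when the local file is byte-for-byte identical to the published one.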

File details

Details for the file tqkit-0.4.0-py3-none-any.whl.

File metadata

Download URL: tqkit-0.4.0-py3-none-any.whl
Upload date: May 6, 2026
Size: 25.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for tqkit-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ccc021b111315b20e3c2a3d4aae529aae3c88ee7a188237353238d0d5b5798ad`
MD5	`02d8da92e0ba52d863f91bf8464eff40`
BLAKE2b-256	`e59e67af5a3aa45624fbacad2e89ac8fc215a72f2e957acb35ce936ee7cf393a`
