

tqkit

Unified toolkit for benchmarking and integrating TurboQuant+ KV-cache compression across LLM inference engines.

What this is

tqkit is a single CLI and Python package that talks to every inference engine that ships TurboQuant+ KV-cache compression: llama.cpp, vLLM (CUDA and AMD ROCm), MLX-Swift, and vllm-swift.

You bring the inference engine. tqkit autodetects what's installed, runs the canonical benchmark, and prints a reproducible KV-savings table.

Why this exists

KV cache is the dominant memory cost at long context. TurboQuant+ asymmetric (K=FP8, V=4-bit + metadata) shrinks it ~62% (or ~57% accounting for the 4 boundary layers that stay FP16). The savings replicate across engines and hardware vendors. tqkit is the proof, the tool, and the install path.

For a 14B model at 1M tokens of context:

| layout | KV cache size (all-quantized) | fits on MI300X (192 GB) after weights? |
|---|---|---|
| FP16 | 192 GB | no |
| TQ+ asym (turboquant_k8v4) | 73.5 GB headline / ~83 GB realistic with boundary skip | yes |
| TQ+ sym 4-bit (turboquant_4bit_nc) | 50.3 GB | yes (more headroom) |
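The headline and boundary-skip figures follow from the architecture alone (layers=48, kv_heads=8, head_dim=128). A minimal sketch, assuming "1M" means 2**20 tokens and guessing that the 4-bit V metadata is a 4-byte per-vector scale (an assumption, not from tqkit, though it reproduces the table's numbers):

```python
# KV-cache size arithmetic for Qwen2.5-14B-Instruct-1M.
# Assumptions (not taken from tqkit): "1M" ctx = 2**20 tokens, and the
# 4-bit V metadata is a 4-byte per-vector scale.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

FP16_HEAD = HEAD_DIM * 2 * 2                # K + V at 2 bytes/elem = 512 B
ASYM_HEAD = HEAD_DIM + (HEAD_DIM // 2 + 4)  # K=FP8 (128 B) + V=4-bit (64 B) + 4 B scale

def total_gib(quant_head_bytes, fp16_layers=0):
    """Total KV cache in GiB at 2**20 tokens of context."""
    q = LAYERS - fp16_layers
    per_token = KV_HEADS * (q * quant_head_bytes + fp16_layers * FP16_HEAD)
    return per_token * 2**20 / 2**30        # KiB per token == GiB total

print(total_gib(FP16_HEAD))                         # 192.0 (FP16 baseline)
print(total_gib(ASYM_HEAD))                         # 73.5  (headline)
print(round(total_gib(ASYM_HEAD, fp16_layers=4), 1))  # 83.4 (~83 GB, boundary skip)
```

Under these assumptions the savings match the claims above: 1 − 73.5/192 ≈ 62% headline, and 1 − 83.4/192 ≈ 57% with 4 boundary layers kept in FP16.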

You can verify the math yourself:

pip install tqkit
tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
tq table --model qwen2.5-14b-instruct-1m

Install

pip install tqkit

Usage

tq backends                                            # autodetect installed engines
tq report --model qwen2.5-14b-instruct-1m --ctx 32K    # KV cache size for one config
tq table --model qwen2.5-14b-instruct-1m               # full layout × ctx grid
tq integrate <backend>                                 # install + serve recipe
tq bench                                               # canonical benchmark (v0.3.0)

Example output:

$ tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
[KV cache] model: Qwen/Qwen2.5-14B-Instruct-1M
[KV cache] arch: layers=48 kv_heads=8 head_dim=128
[KV cache] layout: tq+asym
[KV cache] per-token: 72.0 KB (vs 192.0 KB FP16)
[KV cache] total @ 1M ctx: 72.0 GB (vs 192.0 GB FP16, 62.5% savings)
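The per-token figures in this output can be checked by hand from the reported architecture (payload bytes only, metadata aside):

```python
# Per-token KV bytes for layers=48, kv_heads=8, head_dim=128 (from the report).
VECTORS = 48 * 8                    # KV vector pairs per token
FP16 = VECTORS * 128 * 2 * 2        # K + V at 2 bytes per element
ASYM = VECTORS * (128 + 128 // 2)   # K=FP8 (1 B/elem) + V=4-bit (0.5 B/elem)

print(FP16 / 1024)                  # 192.0 KB per token
print(ASYM / 1024)                  # 72.0 KB per token
print(f"{1 - ASYM / FP16:.1%}")     # 62.5% savings
# At 2**20 ("1M") tokens, KB per token equals GB total: 72.0 GB vs 192.0 GB.
```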

Integration recipes

One-page docs for plugging TurboQuant+ into each supported backend live under docs/integrate/.

Docker (AMD ROCm)

docker pull thetom/vllm-turboquant:rocm-7.2
docker run --rm -it \
    --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -p 8000:8000 thetom/vllm-turboquant:rocm-7.2 \
        --model Qwen/Qwen2.5-14B-Instruct-1M --kv-cache-dtype turboquant_k8v4
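Once the container is up, vLLM exposes its standard OpenAI-compatible API on the published port. A minimal client sketch; the endpoint path and port follow vLLM's defaults, and nothing here is tqkit-specific:

```python
# Build a request against the container's OpenAI-compatible endpoint
# (vLLM serves /v1/chat/completions on the port published above).
import json
import urllib.request

def make_request(prompt: str, base: str = "http://localhost:8000") -> urllib.request.Request:
    payload = {
        "model": "Qwen/Qwen2.5-14B-Instruct-1M",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# Uncomment with the container running:
# with urllib.request.urlopen(make_request("Summarize TurboQuant+ in one line.")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```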

See docker/README.md for build details.

Status

v0.4.0 — alpha. Shipping today:

  • KV math + tq report + tq table
  • Pinned canonical_bench.yml + tq config
  • Engine bridges (tq bench) for llama.cpp, vLLM (CUDA + AMD), MLX-Swift, vllm-swift
  • Integration recipes for all 5 backends (docs/integrate/)
  • Docker scaffold for AMD ROCm (docker/Dockerfile.vllm-amd)
  • Models supported in tq report: Qwen2.5 7B/14B/32B, Qwen3-8B, Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3-Next-80B-A3B, Llama-3.1 8B/70B, Mistral-7B
  • 39 tests, 92% line coverage, ≥85% gate enforced

See CHANGELOG.md for the full version history.

License

Apache 2.0.
