
tqkit

Unified toolkit for benchmarking and integrating TurboQuant+ KV-cache compression across LLM inference engines.

What this is

tqkit is a single CLI and Python package that talks to every inference engine that ships TurboQuant+ KV-cache compression:

  • llama.cpp
  • vLLM
  • MLX

You bring the inference engine. tqkit autodetects what's installed, runs the canonical benchmark, and prints a reproducible KV-savings table.

Why this exists

KV cache is the dominant memory cost at long context. TurboQuant+ asymmetric (K=q8_0, V=turbo4) shrinks it ~62% with negligible accuracy loss. The savings replicate across engines and hardware vendors. tqkit is the proof, the tool, and the install path.

For a 14B model at 1M tokens of context:

layout                         KV cache size   fits on MI300X (192 GB)?
FP16                           192 GB          no (FP16 weights alone take ~28 GB, leaving ~164 GB)
TQ+ asym (K=q8_0, V=turbo4)    72 GB           yes
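The "fits" column is plain budget arithmetic. A quick sketch (assuming FP16 weights for a 14B model take 14e9 params × 2 bytes ≈ 28 GB; that estimate is ours, not a number tqkit reports):

```python
# Memory-budget check for a 14B model on a 192 GB MI300X.
# Assumption: FP16 weights ~= 14e9 params * 2 bytes = 28 GB.
hbm_gb = 192
weights_gb = 14e9 * 2 / 1e9          # 28.0 GB of weights
free_gb = hbm_gb - weights_gb        # ~164 GB left for the KV cache

print(f"free for KV: {free_gb:.0f} GB")
print("FP16 KV (192 GB) fits:", 192 <= free_gb)     # False
print("TQ+ asym KV (72 GB) fits:", 72 <= free_gb)   # True
```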

You can verify the math yourself:

pip install tqkit
tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
tq table --model qwen2.5-14b-instruct-1m

Install

pip install tqkit

Usage

tq backends                                            # autodetect installed engines
tq report --model qwen2.5-14b-instruct-1m --ctx 32K    # KV cache size for one config
tq table --model qwen2.5-14b-instruct-1m               # full layout × ctx grid
tq integrate <backend>                                 # install + serve recipe
tq bench                                               # canonical benchmark (v0.3.0)

Example output:

$ tq report --model qwen2.5-14b-instruct-1m --ctx 1M --layout tq+asym
[KV cache] model: Qwen/Qwen2.5-14B-Instruct-1M
[KV cache] arch: layers=48 kv_heads=8 head_dim=128
[KV cache] layout: tq+asym
[KV cache] per-token: 72.0 KB (vs 192.0 KB FP16)
[KV cache] total @ 1M ctx: 72.0 GB (vs 192.0 GB FP16, 62.5% savings)
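The reported numbers can be reproduced by hand from the printed architecture. A sketch of the arithmetic (not tqkit's API; it assumes q8_0 stores ~8 bits/value and turbo4 ~4 bits/value, ignoring block-scale overhead, and treats 1 GB as 10^6 KB to match the report's units):

```python
# Reproduce the KV-cache report for Qwen2.5-14B-Instruct-1M.
layers, kv_heads, head_dim = 48, 8, 128
vals = layers * kv_heads * head_dim          # values per token in K (or V): 49,152

fp16_kb = (2 + 2) * vals / 1024              # K + V at 2 bytes/value -> 192.0 KB
asym_kb = (1.0 + 0.5) * vals / 1024          # K at 1 B, V at 0.5 B   -> 72.0 KB

ctx = 1_000_000
print(f"per-token: {asym_kb:.1f} KB (vs {fp16_kb:.1f} KB FP16)")
print(f"total @ 1M ctx: {asym_kb * ctx / 1e6:.1f} GB "
      f"({100 * (1 - asym_kb / fp16_kb):.1f}% savings)")
```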

Status

v0.2.1 — alpha. KV math + reporter + table work. Canonical bench runner with engine bridges lands in v0.3.0.

License

Apache 2.0.

