fraQtl runtime — drop-in KV cache compression + INT3-resident weight loading for HuggingFace transformers.
fraQtl
5x KV cache compression. As little as +0.005 PPL. 7 models, 3B–70B. One line of code.
Runtime KV-cache compression via the Attention Importance Kernel. Protect the directions that matter. Quantize the rest. Drop-in, no retraining, production-ready.
Results (verified, 7 models)
| Model | Params | Arch | k=16 | k=32 |
|---|---|---|---|---|
| Mistral 7B | 7B | GQA-8 | +0.019 | +0.007 |
| Llama 3.2 3B | 3B | GQA-3 | +0.043 | +0.011 |
| Llama-2-7B | 7B | MHA-32 | +0.022 | +0.007 |
| Qwen 2.5 3B | 3B | GQA-2 | +0.034 | +0.010 |
| Llama 3.1 8B | 8B | GQA-8 | +0.034 | +0.025 |
| Llama-2-13B | 13B | MHA-40 | +0.019 | +0.005 |
| Llama 3.1 70B | 70B | GQA-8 | +0.079 | +0.019 |
All numbers measured at runtime on the live KV cache, using a split prefill/eval methodology and the same configuration for every model.
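A minimal sketch of the split prefill/eval idea, under my assumptions (plain numpy, synthetic logits; the actual fraqtl evaluation harness is not public): the first `prefill_len` tokens only populate the (compressed) KV cache, and perplexity is scored on the remaining tokens.

```python
import numpy as np

def split_eval_ppl(logits, targets, prefill_len):
    """Perplexity over the eval split only: tokens [prefill_len:] are scored,
    earlier tokens just populate the KV cache during prefill."""
    logits = logits[prefill_len:]
    targets = targets[prefill_len:]
    # numerically stable log-softmax over the vocab axis
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets].mean()
    return float(np.exp(nll))

# Sanity check: uniform logits over a vocab of 256 give PPL ~= 256.
rng = np.random.default_rng(0)
logits = np.zeros((128, 256))
targets = rng.integers(0, 256, size=128)
print(split_eval_ppl(logits, targets, prefill_len=64))
```

The PPL deltas in the table are then simply `split_eval_ppl(compressed) - split_eval_ppl(fp16)` on the same sequences.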
vs Competition (Llama-2-7B)
| Method | PPL Delta | Compression |
|---|---|---|
| fraQtl k=32 | +0.007 | 5x |
| fraQtl k=16 | +0.022 | 5x |
| KVQuant 2-bit | +0.27 | ~5x |
| KIVI K2V2 | +1.00 | ~5x |
Memory at Scale
| Context | KV Cache (FP16) | fraQtl 5x | Savings |
|---|---|---|---|
| 4K | 2.1 GB | 430 MB | 1.7 GB |
| 32K | 17 GB | 3.4 GB | 14 GB |
| 128K | 69 GB | 14 GB | 55 GB |
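The FP16 column is consistent with Llama-2-7B geometry, which I am assuming here (32 layers, 32 KV heads of head dimension 128, 2 bytes per element); other architectures scale accordingly:

```python
def kv_cache_bytes(ctx_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """FP16 KV cache size: K and V tensors, per layer, per head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

for ctx in (4096, 32768, 131072):
    fp16 = kv_cache_bytes(ctx)
    print(f"{ctx:>6} tokens: {fp16 / 1e9:5.1f} GB fp16, "
          f"{fp16 / 5 / 1e9:5.1f} GB at 5x")
```

For example, 4K context gives 2 x 32 x 32 x 128 x 4096 x 2 bytes = 2.1 GB, matching the first row; GQA models with fewer KV heads start from a proportionally smaller baseline.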
Install

```shell
pip install git+https://github.com/samuelsalfati/fraqtl.git
```
Quick Start

```python
import fraqtl
from transformers import AutoModelForCausalLM

# Authenticate (get a token at fraqtl.ai)
fraqtl.login("sk_fraqtl_...")

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype="float16",
    device_map="auto",
)

# calib_seqs: a small batch of tokenized calibration sequences,
# used for the one-pass importance estimate
model = fraqtl.aipress_kv(model, calib_seqs)

# That's it. Serve normally.
```
CLI

```shell
fraqtl compress --model mistralai/Mistral-7B-v0.1 --k 16 --eval
fraqtl analyze --model mistralai/Mistral-7B-v0.1
```
How It Works
- Eigenbasis — compute the Attention Importance Kernel (V^T alpha^T alpha V) from one forward pass
- Protect — top-k eigendirections at full precision
- Sacrifice — remaining directions at INT3
- Zero overhead — W_O fusion absorbs rotation into weights
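The four steps above can be sketched with numpy. This is a toy illustration under my own assumptions (random attention weights and value states, symmetric round-to-nearest INT3; the shipped kernel and the exact W_O fusion are not public):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k = 64, 32, 8                      # tokens, head dim, protected directions

V = rng.standard_normal((T, d))          # value states for one head
alpha = rng.random((T, T))               # attention weights, rows sum to 1
alpha /= alpha.sum(axis=1, keepdims=True)

# 1. Eigenbasis: attention importance kernel M = V^T a^T a V from one pass
M = V.T @ alpha.T @ alpha @ V
w, Q = np.linalg.eigh(M)                 # symmetric eigendecomposition
Q = Q[:, ::-1]                           # sort directions by importance, descending

# 2./3. Protect top-k directions at full precision, quantize the rest
C = V @ Q                                # rotate the cache into the eigenbasis
protected, rest = C[:, :k], C[:, k:]
scale = np.abs(rest).max() / 3           # symmetric low-bit grid, levels -3..3
rest_q = np.clip(np.round(rest / scale), -3, 3) * scale
C_hat = np.concatenate([protected, rest_q], axis=1)

# 4. Zero overhead: Q can be absorbed into W_O downstream, so attention
# reads the compressed cache directly: alpha @ (C_hat @ Q.T) ~= alpha @ V
err = np.linalg.norm(alpha @ (C_hat @ Q.T) - alpha @ V)
ref = np.linalg.norm(alpha @ V)
print(f"relative output error after compression: {err / ref:.3f}")
```

The protected columns survive the round trip exactly; all reconstruction error comes from the sacrificed INT3 directions, which is the point of choosing the basis from the attention-weighted kernel rather than from V alone.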
Paper
"The Right Basis, Not the Right Subspace: Downstream-Optimal Quantization for KV-Cache Compression"
Samuel Salfati, Cornell University
Patent
Patent pending (filed April 6, 2026).
License
Proprietary. Early access available at fraqtl.ai.
File details
Details for the file fraqtl_runtime-0.1.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.
- Size: 857.3 kB
- Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df98b4b4fe6361d0c29a1159e26cf6488c68365e17ccad23354bf2e568dd2f65 |
| MD5 | 934e30160268c9c4130b25d162658b57 |
| BLAKE2b-256 | c8a7d9a393cf6385fe9af13fdeba6a39adbed8d7ca3332741a0de31436c32e81 |