
Adaptive Streaming Vector Quantization KV Cache for LLMs


AdapTQ: Adaptive Streaming Vector Quantization

AdapTQ is a production-grade C++17 KV cache quantization engine for LLM inference on edge and memory-constrained systems.

The Integration Pitch: AdapTQ is an optional KV-cache backend. It runs entirely on the CPU, requires no model changes, and slots into existing inference pipelines with a thin adapter wrapper.

🚀 Quickstart

import torch
from adaptq import AdaptQAttention

# 1. Initialize the drop-in PyTorch wrapper (4-bit default)
layer = AdaptQAttention(dim=128, heads=4)

# 2. The forward pass routes B x H x D generation tensors through the quantized cache
out = layer(q=torch.randn(1, 4, 128), k=torch.randn(1, 4, 128), v=torch.randn(1, 4, 128))

📊 Real-world Benchmarks

Tested on standard AVX2 desktop hardware (4 heads, dim=128, caching up to 4096 tokens). Note: sequence lengths below 256 are excluded from these claims, because the hybrid fallback routes them to standard FP32 execution (see the routing sketch below the table).

Metric           Result (seq ≥ 256)
Latency          p50: 877.9 µs | p95: 2161.8 µs
Stable speedup   ~10.18x vs. NumPy FP32 equivalent
Throughput       ~1,139 tokens/sec
Memory           2.10 MB vs. 8.39 MB for FP16 (4.0x smaller)
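A minimal sketch of that hybrid routing policy. The function and kernel names (route_attention, fp32_attention, quantized_attention) are hypothetical illustrations, not the library API; only the 256-token cutoff comes from the text above.

HYBRID_THRESHOLD = 256  # documented cutoff; shorter sequences stay in FP32

def route_attention(seq_len, q, fp32_attention, quantized_attention):
    # Short prompts gain little from quantization and would pay the
    # codebook overhead, so they are served by the plain FP32 kernel.
    if seq_len < HYBRID_THRESHOLD:
        return fp32_attention(q)
    return quantized_attention(q)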

Quantization Fidelity (Honest Metrics)

We apply a targeted $\pm 3\sigma$ soft clip to post-FWHT distributions without altering the Max-Lloyd codebooks. Quality measured empirically against an FP32 baseline:

  • Cosine Similarity: ~0.947 (1.000 = exact match)
  • Mean Squared Error (MSE): ~1.8e-04
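For reference, a minimal NumPy sketch of what these two numbers measure, plus one possible $\pm 3\sigma$ soft clip. The exact clipping curve AdapTQ uses is not specified on this page, so soft_clip (a tanh squash) and fidelity are illustrative assumptions:

import numpy as np

def soft_clip(x, sigma_mult=3.0):
    # Assumed tanh-style soft clip: near-identity for in-range values,
    # smoothly saturating at +/- sigma_mult * std(x) for outliers.
    limit = sigma_mult * x.std()
    return limit * np.tanh(x / limit)

def fidelity(reference, reconstructed):
    # Cosine similarity (1.0 = exact match) and mean squared error
    # between the FP32 baseline and the dequantized output.
    a, b = reference.ravel(), reconstructed.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    mse = np.mean((a - b) ** 2)
    return cos, mse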

(Figure: AdapTQ benchmark results)

🏗 Architecture & Features

  • Unified SIMD Pipeline: 2, 3, and 4-bit decoding share a single, quad-unrolled branchless loop using AVX2 intrinsics. No scalar fallbacks in the hot path.
  • Fast Hadamard Rotation (HAR): an $O(d \log d)$ fully in-place rotation suppresses outliers before codebook matching (see the FWHT sketch after this list).
  • Zero Heap Allocations: Pure stack/thread-local memory buffers in the hot path.
  • Precomputed LUTs: dot products execute directly against packed indices in SIMD registers, avoiding full dequantization inside the attention kernel.
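The Hadamard rotation referenced above is the standard in-place Fast Walsh-Hadamard Transform. A minimal NumPy sketch for illustration (the production path is the AVX2 C++ kernel; fwht_inplace is a hypothetical name):

import numpy as np

def fwht_inplace(x):
    # In-place FWHT over a float array whose length is a power of two:
    # O(d log d) butterfly passes, no auxiliary buffer.
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    x /= np.sqrt(d)  # orthonormal scaling, so the rotation preserves norms

x = np.random.randn(128)
y = x.copy()
fwht_inplace(y)  # rotate
fwht_inplace(y)  # the orthonormal transform is its own inverse
assert np.allclose(x, y)

Because the orthonormal transform is its own inverse, the same kernel both spreads outlier mass before quantization and undoes the rotation on readback.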

🛠 Installation & Integration

git clone https://github.com/l3tchupkt/adaptq.git
cd adaptq

# Build the native pybind11 modules and install the Python package
pip install .
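A quick smoke test that the native extension built and imports cleanly, assuming the import path used throughout this page:

# Run after installation; uses only the documented constructor signature
from adaptq import Engine
engine = Engine(dim=128, heads=4, bits=4, capacity=2048)
print("AdapTQ OK")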

Native Python API

Use the native Python API with plain NumPy arrays, bypassing deep-learning framework tensors entirely:

import numpy as np
from adaptq import Engine

# 4-bit engine: 4 heads, head dim 128, capacity for 2048 cached tokens
engine = Engine(dim=128, heads=4, bits=4, capacity=2048)
k, v, q = np.random.randn(4, 128), np.random.randn(4, 128), np.random.randn(4, 128)

engine.append(k, v)         # quantize and cache one token's keys/values
output = engine.compute(q)  # attention over the quantized cache
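In streaming generation, the same two calls run once per token. A sketch continuing the snippet above, using only the documented append/compute API (the shapes follow the heads=4, dim=128 configuration; the 16-step loop is arbitrary):

# Append one (k, v) pair per generated token, then attend over the cache.
for _ in range(16):
    k_t = np.random.randn(4, 128)  # one row per head
    v_t = np.random.randn(4, 128)
    engine.append(k_t, v_t)        # quantized on ingest

q_t = np.random.randn(4, 128)
out = engine.compute(q_t)          # attention over all cached tokens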

llama.cpp Adapter

AdapTQ can replace the native KV cache in llama.cpp's GGML compute path. This requires hooking llm_build_kqv; see /integration/llama_cpp_patch.md for the full unified patch.

#include "adapters/adapter_llamacpp.h"
LlamaCppAdaptQAdapter adapter(n_heads, head_dim, bits, capacity, seed, v_mass, hybrid_thr);
adapter.feed_kv(head, key_array, val_array, token_pos);
adapter.attention(head, query_array, out_array);

📝 License

See the repository for the active license. Developed based on AdapTQ: Adaptive Streaming Vector Quantization for Edge-Deployed Large Language Models.
