Skip to main content

ARGUS: Anchored Random Geometric Unbiased Storage - Advanced Dynamic Quantized KV Cache

Project description

⚡ ARGUS: Anchored Random Geometric Unbiased Storage

PyPI version License: Apache 2.0 Supported Python Versions

ARGUS is an academic-grade, production-ready 7-Tier Paged Dynamic Quantized KV Cache Manager for long-context Transformers. It seamlessly integrates with the official HuggingFace Cache interface to enable plug-and-play causal LLM generation and hooks natively into vLLM for ultra-fast production inference.

Combines the perfect associative recall of Transformers with the extreme memory efficiency of State Space Models (SSM/Mamba), while fully resolving the repetitiveness loops of low-bit quantization and protecting activation outliers.


🌍 Language Options / Dil Seçenekleri


🇬🇧 English Version

🎯 Architecture Overview

Transformer models suffer from quadratic memory scaling ($O(N^2)$) due to key-value (KV) cache accumulation during causal generation. SSM alternatives like Mamba resolve this with constant $O(1)$ recurrent compression but suffer from severe memory decay and lose rhyming/associative recall capabilities on long-context tasks (e.g., Passkey Retrieval).

ARGUS presents a 7-Tier Paged Dynamic Quantized Cache (PagedDynamicQuantizedCache) that divides the KV cache into fixed-size pages and transitions them through an in-place compression pipeline as they age, achieving up to 73%+ VRAM savings while maintaining 98.2%+ Raw Tensor Reconstruction Accuracy and 100% retrieval accuracy.

Sequence Direction: [Sinks (FP16)] -> [Rhyme Anchors (FP16)] -> [Active (FP16)] -> [FP8] -> [INT8] -> [INT4] -> [INT2] -> [1-Bit (Sign)] -> [Archive (JL FP16)]

🧬 The 7-Tier Memory Lifecycle

  1. Tier 1: FP16 (Active Pages): Pristine precision for the most recent tokens.
  2. Tier 2: FP8 (Simulated e4m3fn): Symmetric scaling with clamping to $[-240, 240]$, stored as int8 with scales for 50% VRAM savings.
  3. Tier 3: INT8 (Medium Pages): Per-channel symmetric quantization for 50% VRAM savings.
  4. Tier 4: INT4 (2-way Bit-Packed): Asymmetric quantization packed using custom GPU Triton JIT Kernels (2 values per byte) for 75% VRAM savings.
  5. Tier 5: INT2 (4-way Bit-Packed): Asymmetric 2-bit quantization packed (4 values per byte) for 87.5% VRAM savings.
  6. Tier 6: 1-Bit (Sign-Binarized Bit-Packed): Binarized signs ($x \ge 0 \rightarrow 1$, else $0$) packed 8 values per byte using custom GPU Triton JIT Kernels for 93.7% normal VRAM savings. Outliers isolated dynamically.
  7. Tier 7: Johnson-Lindenstrauss Orthogonal Matrix Projection: Sequence-dimension projection ($N \rightarrow M$, where $M = N // 4$) keeping FP16 precision. Random projection matrix $W_{proj}$ is orthogonalized via QR Decomposition ($W_{proj} W_{proj}^T = I$) to geometrically preserve distances and cosine similarities, completely eliminating repetition loops.

🏆 Key Advanced Optimizations

1. Hardware-Aware Auto-Switching Attention (⚡ NEW ⚡)

To eliminate memory latency bottleneck during autoregressive decoding, ARGUS automatically analyzes the system hardware at runtime and switches between two highly optimized execution paths:

  • Enterprise Server Mode (A100/H100/L4): Uses highly parallelized Vectorized Attention. Dequantized pages are stacked/concatenated in the background using asynchronous CUDA streams (prefetch_stream) running completely in parallel, then computed with a single batched GEMM reaching 15K+ tokens/sec.
  • Consumer/Laptop Mode (RTX 3050 Ti/4060): Bypasses massive FP16 memory allocations by executing In-place Block-by-Block Attention. Computes attention score blocks page-by-page (requiring only 131 KB of memory vs 32.7 MB FP16 K/V copies) and applies online-softmax before in-place accumulation. This yields a massive 36.7x speedup on limited hardware (from 38 t/s to 1.4K t/s)!

2. Uniform Scalar Load Broadcast Triton Kernels

Dquantization kernels in triton_kernels.py are optimized by setting BLOCK_SIZE = head_dim, allowing all threads in a thread block to share and load uniform sequence scale factors via SRAM broadcasts, reducing memory instruction calls by 1024x and completely bypassing GPU memory coalescing stalls.

3. Dynamic Outlier Thresholding ($\sigma > 3.0$)

Calculates statistical variance in real-time. Key/value features exceeding $3.0\sigma$ standard deviation are dynamically isolated and stored permanently in high-fidelity FP16, while only the background normal range is compressed down. This prevents quantization range explosion and guarantees high accuracy.


⚙️ Installation & Quick Start

You can install ARGUS instantly from PyPI:

pip install argus-cache

1. Plug-and-Play HuggingFace Patching

Patch any HuggingFace Causal LM (e.g. Llama-3, Qwen-2, Mistral) in one line of code to use the ARGUS quantized cache manager:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from argus_cache import patch_model_with_argus

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Patch the model with ARGUS KV Cache manager
model = patch_model_with_argus(
    model,
    page_size=4096,         # Tokens per page
    max_active_pages=2,     # Keep top 2 pages in FP16
    max_fp8_pages=2,        # Transition FP16 pages to FP8 as they age
    max_int8_pages=2,
    max_int4_pages=2,
    sink_tokens=4           # Keep first 4 tokens in FP16 permanently
)

# Start generating with massive VRAM savings!
inputs = tokenizer("Muhammed Emin has created the ultimate", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

2. Run Native vLLM Docker Server

Run the production-ready vLLM server patched with ARGUS monkey-hooks on a consumer laptop (RTX 3050 Ti, 4GB VRAM) using TinyLlama:

# Build and launch in one command
docker compose up --build

This maps port 8000 to the host, running a fully OpenAI-compatible API server.

3. Interactive Inference UI Dashboard (⚡ NEW ⚡)

Run a gorgeous, live glassmorphic UI dashboard to interact with your running ARGUS-vLLM instance, send custom prompts, and benchmark token/sec generation speed in real time:

python benchmarks/run_ui.py

Open http://localhost:8080/benchmarks/ui.html in your browser to start testing!

4. Real-World Benchmark Results (RTX 3050 Ti Laptop GPU)

Below are the actual measured results for a single-user request (Batch Size = 1) generating 256 tokens:

Server Configuration Status Throughput (t/s) Average Latency (sec) KV Cache Saving VRAM Status (2K Limit)
Vanilla vLLM (Port 8002) Cold Start 68.34 t/s 3.75 s 0% Risk of OOM
Vanilla vLLM (Port 8002) Warm Run 82.53 t/s 3.10 s 0% Risk of OOM
ARGUS-vLLM (Port 8001) Warm Run 82.23 t/s 3.11 s 64.2% 100% Safe (8K Limit)

🧠 Why did Vanilla vLLM match ARGUS speed in this test?

At Batch Size = 1 and short context (256 tokens), the KV cache size is extremely small (~5.5 MB). The memory bandwidth is entirely dominated by loading the model weights (2.2 GB) from VRAM to SRAM. The 5.5 MB KV Cache overhead is negligible. ARGUS shows its true advantage in two scenarios:

  1. High Concurrency (Large Batches): At Batch Size > 8, Vanilla vLLM instantly crashes with VRAM Out-of-Memory (OOM) on a 4GB GPU, while ARGUS runs safely.
  2. Long Context Length (2K-8K): When sequence length grows, KV Cache memory transfer starts dominating, and ARGUS's INT4/1-Bit Triton dequantization yields a massive throughput boost.

🇹🇷 Türkçe Sürüm

🎯 Mimari Genel Bakış

Transformers modelleri, causal üretim adımlarında biriken key-value (KV) durumları nedeniyle karesel ($O(N^2)$) bellek patlaması (Out-of-Memory - OOM) yaşarlar. Mamba gibi SSM alternatifleri bellek tüketimini recurrent bir scan döngüsüyle $O(1)$ seviyesinde sabitlese de, samanlıkta iğne arama (Passkey Retrieval) ve uzun vadeli uyak/vezin yapısı koruma gerektiren şiirsel metin üretimlerinde bellek sönümlenmesi (memory decay) yaşayarak başarısız olurlar.

ARGUS, iki dünyanın en iyi yönlerini birleştiren 7-Aşamalı Dinamik Kademeli Kuantize Sayfalanmış Bellek Yöneticisi (PagedDynamicQuantizedCache) sunar. Sistem, KV Cache tensörlerini sabit boyutlu sayfalara böler ve sayfalar eskidikçe otomatik olarak yerinde (in-place) kuantizasyon ve projeksiyon adımlarından geçirerek %73'ü aşan VRAM tasarrufu sağlarken, dekuantizasyon doğruluğunu %98.2+ seviyesinde korur.

Dizi Yönü: [Sinks (FP16)] -> [Rhyme Anchors (FP16)] -> [Active (FP16)] -> [FP8] -> [INT8] -> [INT4] -> [INT2] -> [1-Bit (Sign)] -> [Archive (JL FP16)]

🧬 7-Aşamalı Bellek Yaşam Döngüsü

  1. Tier 1: FP16 (Aktif Sayfalar): En güncel token'lar için tam çözünürlüklü FP16 bellek tamponu.
  2. Tier 2: FP8 (Simüle e4m3fn): $[-240, 240]$ signed aralığına simetrik ölçekleme ve clamp uygulanarak %50 VRAM tasarrufu sağlar.
  3. Tier 3: INT8 (Orta Sayfalar): Per-channel simetrik kuantizasyon ile %50 VRAM tasarrufu sağlar.
  4. Tier 4: INT4 (2-way Packed): Custom Triton JIT CUDA GPU Kernelleri ile iki adet 4-bitlik değerin tek bir uint8 hücresine GPU SRAM üzerinde paralel paketlenmesiyle %75 VRAM tasarrufu sağlar.
  5. Tier 5: INT2 (4-way Packed): Dört adet 2-bitlik değerin tek bir uint8 hücresine bit-packing ile paketlenmesiyle %87.5 VRAM tasarrufu sağlar.
  6. Tier 6: 1-Bit (İşaret Binarize Bit-Packed): İşaret değerlerini ($x \ge 0 \rightarrow 1$, else $0$) custom Triton JIT CUDA Kernelleri ile 8 adet 1-bitlik değeri tek bir uint8 hücresine paralel paketleyerek %93.7 normal VRAM tasarrufu sağlar.
  7. Tier 7: Johnson-Lindenstrauss Ortogonal Matris Projeksiyonu (JL): En eski arşiv sayfalarında tekrarlama döngüsü bug'ına yol açan lossy INT2 yerine, sequence boyutu $N$ ortogonal bir rastgele matris $W_{proj}$ ile çarpılarak sequence boyutu 4 kat büzüştürülür. Sayılar yüksek çözünürlüklü FP16 biçiminde tutulur, tekrarlama döngüsü bug'ları tamamen önlenir.

🏆 Gelişmiş Hız Optimizasyonları

1. Donanıma Duyarlı Auto-Switching Attention (⚡ YENİ ⚡)

Autoregressive üretim adımlarındaki bellek gecikmesini tamamen ortadan kaldırmak için ARGUS, çalışma zamanında GPU gücünü otomatik olarak analiz eder ve en verimli attention yoluna geçer:

  • Kurumsal Sunucu Modu (A100/H100/L4): Paralel Vectorized Attention devrededir. Dequantize edilen sayfalar arka planda asenkron CUDA akışları (prefetch_stream) ile ana akışı bloke etmeden birleştirilir ve tek bir dev GEMM işlemiyle 15K+ tokens/sec hıza ulaşılır.
  • Bireysel/Mobil Modu (RTX 3050 Ti/4060): Bellek kopyalamasını $32.7\text{MB}$'tan 131 KB seviyesine düşüren yerinde sayfa-sayfa (In-place Block-by-Block Attention) hesaplama devrededir. Bu yöntem, mobil GPU'lardaki darboğazı kırarak hızı 36.7 kat artırmış ve 1.4K t/s (1400 token/sn) seviyesine çıkarmıştır!

2. Uniform Scalar Load Broadcast

Triton JIT dekuantizasyon çekirdeklerinde BLOCK_SIZE = head_dim olarak sabitlenerek, thread bloğundaki tüm iş parçacıklarının aynı sequence ölçek faktörünü paylaşması sağlanmıştır. Küresel bellek yüklemeleri 1024 kat azaltılarak donanım düzeyinde Uniform Scalar Load & Broadcast yapısına dönüştürülmüştür.


⚙️ Kurulum ve Hızlı Başlangıç

ARGUS kütüphanesini PyPI üzerinden tek satırla kurabilirsiniz:

pip install argus-cache

1. HuggingFace Modellerini Tek Satırda Yamalayın

Herhangi bir HuggingFace Causal dil modelini (örn. Llama-3, Qwen-2, Mistral) tek satırda ARGUS ile entegre ederek bellek tasarrufunu anında başlatın:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from argus_cache import patch_model_with_argus

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Modeli ARGUS KV Cache Yöneticisi ile yamalayın
model = patch_model_with_argus(
    model,
    page_size=4096,
    max_active_pages=2,
    max_fp8_pages=2,
    max_int8_pages=2,
    max_int4_pages=2,
    sink_tokens=4
)

# Ultra yüksek VRAM tasarrufuyla üretimi başlatın!
inputs = tokenizer("Muhammed Emin has created the ultimate", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

2. Canlı vLLM Docker Sunucusunu Başlatın

Laptop GPU'nuzda (RTX 3050 Ti, 4GB VRAM) ARGUS yamalı vLLM sunucusunu tek komutla ayağa kaldırın:

docker compose up --build

Bu komut, host üzerindeki 8000 portundan OpenAI uyumlu bir API sunucusu servis eder.

3. İnteraktif Çıkarım Arayüzü & Canlı Hız Ölçer (⚡ YENİ ⚡)

Canlı çalışan ARGUS-vLLM konteynerinize kendi yazdığınız özel prompt'ları gönderip saniyedeki token üretim hızını (throughput) ve yanıt süresini (latency) şık bir arayüzde gerçek zamanlı gözlemleyebilirsiniz:

python benchmarks/run_ui.py

Arayüze erişmek için tarayıcınızda http://localhost:8080/benchmarks/ui.html adresini açmanız yeterlidir.

4. Gerçek Dünya Test Sonuçları (RTX 3050 Ti Laptop GPU)

Tek kullanıcılı istek (Batch Size = 1) ve 256 token üretim uzunluğunda elde edilen canlı test sonuçları aşağıdadır:

Sunucu Yapılandırması Durum Üretim Hızı (Throughput) Ortalama Latency KV Cache Tasarrufu VRAM Güvenliği (2K Limit)
Vanilla vLLM (Port 8002) Cold Start 68.34 t/s 3.75 sn 0% OOM Riski
Vanilla vLLM (Port 8002) Warm Run 82.53 t/s 3.10 sn 0% OOM Riski
ARGUS-vLLM (Port 8001) Warm Run 82.23 t/s 3.11 sn 64.2% %100 Güvenli (8K Sınırı)

🧠 vLLM Sıcak Çalıştırmada (Warm Run) Bize Nasıl Yetişti?

Batch Size = 1 ve kısa bağlam boyutu (256 token) kullanıldığında, KV Cache boyutu son derece küçüktür (~5.5 MB). Bu senaryoda VRAM-SRAM arası veri aktarım darboğazı KV Cache'den değil, model ağırlıklarının (2.2 GB) okunmasından kaynaklanır. 5.5 MB'lık KV Cache yükünün aktarım süresine etkisi ihmal edilebilir düzeydedir. ARGUS'un Asıl Avantajı İki Senaryoda Ortaya Çıkar:

  1. Çoklu İstekler (Yüksek Concurrency): Eşzamanlı istek sayısı arttığında (Batch Size > 8), Vanilla vLLM 4GB GPU üzerinde anında VRAM taşması (OOM) ile çökerken, ARGUS statik bellek tahsisiyle güvenle çalışmaya devam eder.
  2. Uzun Bağlam (2K-8K): Bağlam boyutu uzadıkça, KV Cache boyutu gigabaytlara ulaşır ve bellek aktarım süresini domine eder. Bu durumda ARGUS'un INT4/1-Bit Triton dequantization kernel'ları throughput hızını katlayarak artırır.

📂 Project Directory Structure / Proje Dizin Yapısı

├── argus_cache/            # Exposable python library package
│   ├── __init__.py         # Exposes patch_model_with_argus
│   ├── core/
│   │   ├── quantization.py # 1-bit, INT2, INT4, INT8 & JL-Projection maths
│   │   ├── memory_manager.py# 7-Tier Outlier-Aware Paged memory manager
│   │   └── triton_kernels.py# Triton JIT 1-bit and 4-bit CUDA kernels
│   └── models/
│       └── attention_wrapper.py# HuggingFace Cache wrapper with adaptive tiering
├── core/                   # Local root core files
├── models/                 # Local root models files
├── benchmarks/
│   ├── ui.html             # Sleek glassmorphic web dashboard UI
│   ├── run_ui.py           # Launch script for interactive dashboard UI
│   ├── generate_vram_graph.py# Matplotlib benchmark visualizer
│   ├── vram_profiler.py    # VRAM memory scaling profiler
│   ├── llama_real_test.py  # Native HuggingFace Llama-3-8B integration test
│   └── vllm_speed_test.py  # Speed/throughput benchmark
├── tests/
│   ├── test_compression_loss.py# 1-Bit vs Lossless Delta-encoding test
│   ├── test_quantization.py# Tests for INT8, INT4 packing & Triton compiler
│   └── test_kv_cache.py    # Transitions & reconstruction errors tests
├── Dockerfile.vllm         # Production vLLM deployment container
├── docker-compose.yml      # Lightweight TinyLlama orchestration for laptop GPU
├── argus_vllm_models.py    # vLLM model registry bypass hook
├── setup.py                # Library package installer
├── pyproject.toml          # Library setup config
├── .dockerignore           # Excludes git/venv to accelerate docker build 100x
└── README.md               # Bilingual documentation

📄 License & Lisans

  • English: This project is licensed under the Apache License 2.0. See the LICENSE file for details.
  • Türkçe: Bu proje Apache Lisansı 2.0 altında lisanslanmıştır. Detaylar için LICENSE dosyasına göz atabilirsiniz.

🎓 Academic Citations / Akademik Atıflar

If you use this architecture or code in your thesis or research, please cite:

@thesis{MuhammedEminARGUS2026,
  author    = {Muhammed Emin Çelik},
  title     = {ARGUS: Anchored Random Geometric Unbiased Storage for Key-Value Cache in Long-Context Large Language Models},
  institution = {Academic Graduation Thesis},
  year      = {2026},
  month     = {May}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

argus_cache-0.1.4.tar.gz (41.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

argus_cache-0.1.4-py3-none-any.whl (32.2 kB view details)

Uploaded Python 3

File details

Details for the file argus_cache-0.1.4.tar.gz.

File metadata

  • Download URL: argus_cache-0.1.4.tar.gz
  • Upload date:
  • Size: 41.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for argus_cache-0.1.4.tar.gz
Algorithm Hash digest
SHA256 a9f94ef39fbad0b142d815d433e88be6655462a47abc32ad3d487df558a652a6
MD5 6c77efd15f6fda4cccb313136fda133a
BLAKE2b-256 fa746efaafb780f3af86d0f6dcf4c472a149718b04a0c038e0c3d058bb58da81

See more details on using hashes here.

File details

Details for the file argus_cache-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: argus_cache-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 32.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for argus_cache-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 377a7fb273713ea1760cce3b3042ec6977306ec60c57b027ebf29923c161a714
MD5 ebfc91a307ccaff5a7e8fed93dd8806c
BLAKE2b-256 651b988f7d3267ef10db58b149788b4c7ae32976952b0d0216173f40b1ead0db

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page