ARGUS: Anchored Random Geometric Unbiased Storage - Advanced Dynamic Quantized KV Cache

⚡ ARGUS: Anchored Random Geometric Unbiased Storage

ARGUS is a 7-Tier Paged Dynamic Quantized KV Cache Manager for long-context Transformers. It integrates with the HuggingFace Cache interface, so it can be plugged into existing generation code.

It aims to combine the associative recall of Transformers with memory efficiency approaching that of State Space Models (SSMs such as Mamba), while avoiding the repetition loops that low-bit quantization can cause.

🇬🇧 English Version

🎯 Architecture Overview

Transformer models accumulate a key-value (KV) cache that grows linearly with sequence length ($O(N)$) during causal generation, on top of attention compute that grows quadratically ($O(N^2)$), so long contexts quickly exhaust VRAM. SSM alternatives like Mamba avoid this with a constant-size ($O(1)$) recurrent state, but they suffer from severe memory decay and lose rhyme and associative recall on long-context tasks (e.g. Passkey Retrieval).

This project presents a 7-Tier Paged Dynamic Quantized Cache (PagedDynamicQuantizedCache) that divides the KV cache into fixed-size pages and moves them through an in-place compression pipeline as they age. It achieves more than 65% VRAM savings at 100K-token contexts while maintaining a cosine similarity of 0.90 or higher between original and dequantized tensors and 100% passkey retrieval accuracy.

Sequence Direction: [Sinks (FP16)] -> [Rhyme Anchors (FP16)] -> [Active (FP16)] -> [FP8] -> [INT8] -> [INT4] -> [INT2] -> [1-Bit (Sign)] -> [Archive (JL FP16)]
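The age-based progression above can be sketched as a simple lookup. This is a minimal illustration, not the library's scheduler: the function name `tier_for_age` and the `pages_per_tier` knob are hypothetical, and the sinks and rhyme anchors (which stay in FP16) are left out.

```python
# Tier names follow the lifecycle in this README; the aging rule itself is an assumption.
TIERS = ["FP16", "FP8", "INT8", "INT4", "INT2", "1-bit", "JL archive"]


def tier_for_age(age_in_pages: int, pages_per_tier: int = 2) -> str:
    """Return the tier for a page that sits `age_in_pages` pages behind the newest one.

    `pages_per_tier` is a hypothetical setting: how many pages each tier holds
    before older pages spill into the next, more compressed tier.
    """
    if age_in_pages < 0 or pages_per_tier < 1:
        raise ValueError("age_in_pages must be >= 0 and pages_per_tier >= 1")
    index = min(age_in_pages // pages_per_tier, len(TIERS) - 1)
    return TIERS[index]


if __name__ == "__main__":
    for age in (0, 1, 2, 5, 11, 40):
        print(age, tier_for_age(age))
```

Clamping the index means every page older than the last threshold stays in the archive tier rather than being dropped.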

🧬 The 7-Tier Memory Lifecycle

Tier 1: FP16 (Active Pages): Pristine precision for the most recent tokens.
Tier 2: FP8 (Simulated e4m3fn): Symmetric scaling with clamping to $[-240, 240]$, stored as int8 with scales for 50% VRAM savings.
Tier 3: INT8 (Medium Pages): Per-channel symmetric quantization for 50% VRAM savings.
Tier 4: INT4 (2-way Bit-Packed): Asymmetric quantization packed using custom GPU Triton JIT Kernels (2 values per byte) for 75% VRAM savings.
Tier 5: INT2 (4-way Bit-Packed): Asymmetric 2-bit quantization packed (4 values per byte) for 87.5% VRAM savings.
Tier 6: 1-Bit (Sign-Binarized Bit-Packed): Binarized signs ($x \ge 0 \rightarrow 1$, else $0$) packed 8 values per byte using custom GPU Triton JIT kernels, for 93.75% VRAM savings on normal (non-outlier) values. Outliers are isolated dynamically.
Tier 7: Johnson-Lindenstrauss Orthogonal Matrix Projection: Sequence-dimension projection ($N \rightarrow M$, where $M = N // 4$) keeping FP16 precision. The random projection matrix $W_{proj} \in \mathbb{R}^{M \times N}$ is orthogonalized via QR decomposition so that its rows are orthonormal ($W_{proj} W_{proj}^T = I$), which approximately preserves distances and cosine similarities and avoids the repetition loops seen with low-bit archival.
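As an illustration of the Tier 7 projection, the sketch below builds a random matrix with orthonormal rows via QR and applies it along the sequence axis. It uses NumPy rather than the library's own code, the function names are hypothetical, and how attention consumes the archived page is not covered here.

```python
import numpy as np


def make_orthonormal_projection(n: int, m: int, seed: int = 0) -> np.ndarray:
    """Return an (m, n) matrix W with orthonormal rows, so that W @ W.T == I_m."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((n, m))
    q, _ = np.linalg.qr(gaussian)  # q has shape (n, m) with orthonormal columns
    return q.T


def compress_page(page: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Project an (n, d) page of keys or values down to (m, d) along the sequence axis."""
    return w @ page


n, d = 64, 16          # tokens per page and head dimension (illustrative sizes)
m = n // 4             # M = N // 4, as in the tier description
w = make_orthonormal_projection(n, m)
page = np.random.default_rng(1).standard_normal((n, d)).astype(np.float32)
archived = compress_page(page, w)  # a real cache would store this as FP16

print(archived.shape)
print(np.allclose(w @ w.T, np.eye(m)))
```

Taking the transpose of the reduced QR factor is what makes the rows orthonormal; the product in the other order, `w.T @ w`, is only a rank-M projector and not the identity.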

🏆 Key Beta-Phase Upgrades

1. Library Packaging (`argus_cache`)

ARGUS has been refactored into an installable open-source library. Patching a model takes one call:

from argus_cache import patch_model_with_argus
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = patch_model_with_argus(model)  # Done. Ready for ultra-long context!

2. vLLM Production Docker & Service Orchestration

The repository includes interception tools (argus_vllm_models.py) that monkey-patch vLLM's ModelRegistry and LlamaAttention layers, a Dockerfile.vllm with the compiler toolchain, and a dual-service docker-compose.yml for serving the SaaS-GPT and Health-GPT APIs on NVIDIA L40S GPUs.

3. Triton JIT 1-Bit CUDA GPU Kernels

Custom parallel JIT CUDA kernels (triton_pack_1bit, triton_unpack_1bit) binarize values on GPU SRAM and pack/unpack 8 sign bits into a single uint8 byte, with an automatic vectorized PyTorch fallback for CPU execution.
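The sign packing can be illustrated on the CPU with NumPy. This is only a sketch of the bit layout, not the Triton kernels or the library's PyTorch fallback: the names `pack_1bit` and `unpack_1bit` are hypothetical, and mapping bits back to plain +1/-1 is an assumption (a real implementation would likely also store a scale).

```python
import numpy as np


def pack_1bit(x: np.ndarray) -> np.ndarray:
    """Binarize x (x >= 0 -> 1, else 0) and pack 8 sign bits into each uint8 byte."""
    bits = (x >= 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)  # pads the last byte with zero bits


def unpack_1bit(packed: np.ndarray, length: int) -> np.ndarray:
    """Unpack to +1.0 / -1.0 signs, trimming the padding back to `length` values."""
    bits = np.unpackbits(packed, axis=-1, count=length)
    return bits.astype(np.float32) * 2.0 - 1.0


x = np.array([[0.5, -1.2, 0.0, 3.1, -0.1, -7.0, 2.2, 0.3, -0.4, 1.0]], dtype=np.float32)
packed = pack_1bit(x)                     # 10 values fit in 2 bytes
signs = unpack_1bit(packed, x.shape[-1])  # back to 10 sign values

print(packed)
print(signs)
```

Ten FP16 values would occupy 20 bytes, so the 2 packed bytes here show where the roughly 16x reduction for this tier comes from.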

4. Dynamic Outlier Thresholding ($\sigma > 3.0$)

The cache computes the standard deviation $\sigma$ of the cached values at runtime. Values exceeding $3.0\sigma$ are isolated as outliers and kept permanently in FP16, while only the remaining normal-range values are compressed. On reconstruction, dequantized normal values are merged with the stored outliers: $$\text{Reconstructed} = \text{where}(\text{Outlier Mask}, \text{FP16 Outliers}, \text{Dequantized Normal})$$ This keeps a few large values from inflating the quantization scale and protects critical activation spikes.
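The split-and-merge step can be sketched in NumPy as follows. The helper names are hypothetical, the statistics are taken over the whole tensor (the library may compute them per channel or per page), and the outliers are stored densely here only for clarity.

```python
import numpy as np


def split_outliers(x: np.ndarray, k: float = 3.0):
    """Return (mask, outliers, normal), where mask marks |x - mean| > k * std."""
    mean, std = x.mean(), x.std()
    mask = np.abs(x - mean) > k * std
    outliers = np.where(mask, x, 0).astype(np.float16)  # kept in FP16
    normal = np.where(mask, 0, x)                       # left for low-bit compression
    return mask, outliers, normal


def reconstruct(mask, outliers, dequantized_normal):
    """Reconstructed = where(Outlier Mask, FP16 Outliers, Dequantized Normal)."""
    return np.where(mask, outliers.astype(np.float32), dequantized_normal)


rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)
x[10] = 25.0  # an artificial activation spike

mask, outliers, normal = split_outliers(x)
# Stand-in for a lossy low-bit round trip of the normal range: keep only the sign.
dequantized = np.sign(normal).astype(np.float32)
restored = reconstruct(mask, outliers, dequantized)

print(int(mask.sum()), restored[10])
```

Because the spike is removed before quantization, the range of `normal` stays close to that of the background values, which is the effect the thresholding is meant to achieve.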

📊 Empirical Evaluation Results

📈 VRAM Context Scaling Benchmark

As sequence length scales to a 100K-token context, the savings grow: the standard FP16 cache exceeds 12.2 GB, whereas ARGUS needs about 4.2 GB (a 2.9x memory reduction).

[Figure: VRAM Scaling Graph]

🔬 Passkey Retrieval: Accuracy, Memory & Speed Comparison

A rigorous long-context associative recall evaluation where a hidden passkey is buried inside random background noise, evaluated across context lengths.

| Architecture / Model | Metric | 64 tok | 256 tok | 512 tok | 1024 tok | 2048 tok |
|---|---|---|---|---|---|---|
| Standard Transformer (Exact FP16 Cache) | Accuracy | 100% | 100% | 100% | 100% | 100% |
| | Cache Memory | 16.0 KB | 64.0 KB | 128.0 KB | 256.0 KB | 512.0 KB |
| | Speed | 5,282 t/s | 4,952 t/s | 6,493 t/s | 6,394 t/s | 6,409 t/s |
| Mamba SSM (State Compression) | Accuracy | 0% | 0% | 0% | 0% | 0% |
| | Cache Memory | 0.5 KB | 0.5 KB | 0.5 KB | 0.5 KB | 0.5 KB |
| | Speed | 4,589 t/s | 7,937 t/s | 8,588 t/s | 8,031 t/s | 8,262 t/s |
| ARGUS (Ours) (Paged Dynamic Cache) | Accuracy | 100% | 100% | 100% | 100% | 100% |
| | Cache Memory | 16.0 KB | 64.0 KB | 121.9 KB | 223.4 KB | 338.2 KB |
| | Speed | 3,591 t/s | 5,912 t/s | 5,850 t/s | 5,422 t/s | 5,686 t/s |

[!IMPORTANT] VRAM Memory Savings: At 2048 tokens, ARGUS uses 338.2 KB versus 512.0 KB for the standard Transformer, a 33.9% saving, with no loss in retrieval accuracy. At 100K+ tokens, the saving exceeds 65% (as shown in the VRAM Scaling Graph above).
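The percentages quoted here follow directly from the reported figures. The short check below recomputes them; the helper name is hypothetical, and the per-token figure assumes 1 KB = 1024 bytes.

```python
def savings_percent(baseline: float, compressed: float) -> float:
    """Memory saved relative to the baseline, in percent."""
    return (1.0 - compressed / baseline) * 100.0


# 2048-token column of the passkey table (KB).
print(round(savings_percent(512.0, 338.2), 1))  # 33.9

# 100K-context figures from the scaling benchmark (GB).
print(round(savings_percent(12.2, 4.2), 1))     # 65.6
print(round(12.2 / 4.2, 1))                     # 2.9

# FP16 cache bytes per token implied by the table: 16.0 KB for 64 tokens.
print(16.0 * 1024 / 64)                         # 256.0
```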

Throughput & Speed: Because of the page manager and the dequantization overhead, the Python simulation of ARGUS is generally slower than the standard FP16 cache (5,686 t/s versus 6,409 t/s at 2048 tokens). In a native vLLM deployment with custom CUDA streams and compiled, fused Triton kernels, where pre-fetching runs asynchronously, throughput is expected to match or exceed the base Transformer while using less than half the VRAM at long contexts.

🇹🇷 Turkish Version

🎯 Architecture Overview

In Transformer models, the key-value (KV) states accumulated during causal generation grow linearly with sequence length ($O(N)$), while attention compute grows quadratically ($O(N^2)$), so long contexts end in Out-of-Memory (OOM) failures. SSM alternatives such as Mamba hold memory consumption constant at $O(1)$ with a recurrent scan loop, but they suffer memory decay and fail on needle-in-a-haystack search (Passkey Retrieval) and on poetic text generation that requires preserving long-range rhyme and meter.

This project presents a 7-Tier Dynamic Quantized Paged Memory Manager (PagedDynamicQuantizedCache) that combines the best of both worlds. The system splits KV cache tensors into fixed-size pages and, as pages age, automatically passes them through in-place quantization and projection steps. This yields VRAM savings above 65% while the dequantization cosine similarity stays at 0.90 or higher.

Sequence Direction: [Sinks (FP16)] -> [Rhyme Anchors (FP16)] -> [Active (FP16)] -> [FP8] -> [INT8] -> [INT4] -> [INT2] -> [1-Bit (Sign)] -> [Archive (JL FP16)]

🧬 The 7-Tier Memory Lifecycle

Tier 1: FP16 (Active Pages): Full-precision FP16 buffer for the most recent tokens.
Tier 2: FP8 (Simulated e4m3fn): Applies symmetric scaling and clamping to the signed range $[-240, 240]$ for 50% VRAM savings.
Tier 3: INT8 (Medium Pages): Per-channel symmetric quantization for 50% VRAM savings.
Tier 4: INT4 (2-way Packed): Custom Triton JIT CUDA GPU kernels pack two 4-bit values into a single uint8 cell in parallel on GPU SRAM, for 75% VRAM savings.
Tier 5: INT2 (4-way Packed): Four 2-bit values are bit-packed into a single uint8 cell for 87.5% VRAM savings.
Tier 6: 1-Bit (Sign-Binarized Bit-Packed): Custom Triton JIT CUDA kernels pack eight 1-bit sign values ($x \ge 0 \rightarrow 1$, else $0$) into a single uint8 cell in parallel, for 93.75% VRAM savings on normal (non-outlier) values.
Tier 7: Johnson-Lindenstrauss Orthogonal Matrix Projection (JL): For the oldest archive pages, instead of lossy INT2, which caused the repetition-loop bug, the sequence dimension $N$ is multiplied by an orthogonal random matrix $W_{proj}$, shrinking the sequence dimension by a factor of 4. Values are kept in high-precision FP16, which prevents the repetition-loop bugs.
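To make the Tier 3 step and the cosine-similarity target concrete, here is a minimal NumPy sketch of per-channel symmetric INT8 quantization. The function names are hypothetical, and treating the last axis as the channel axis is an assumption rather than the library's documented layout.

```python
import numpy as np


def quantize_int8_per_channel(x: np.ndarray):
    """Symmetric per-channel INT8: one scale per column (channel) of x."""
    max_abs = np.abs(x).max(axis=0, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0)  # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)


def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map INT8 codes back to floating point using the stored scales."""
    return q.astype(np.float32) * scale


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two tensors, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
page = rng.standard_normal((128, 64)).astype(np.float32)  # (tokens, channels)
q, scale = quantize_int8_per_channel(page)
similarity = cosine_similarity(page, dequantize_int8(q, scale))

print(q.dtype, round(similarity, 4))
```

The same similarity check can be applied after any tier's round trip to see how much fidelity that tier gives up.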

🏆 Key Beta-Phase Upgrades

1. Library Packaging (`argus_cache`)

ARGUS has been turned into a standalone Python library that can be installed with pip:

from argus_cache import patch_model_with_argus
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = patch_model_with_argus(model)  # Done!

2. vLLM Integration & Docker Compose

The following were prepared: the argus_vllm_models.py module, which hooks into vLLM's ModelRegistry mechanism to intercept LlamaAttention layers; Dockerfile.vllm, which includes the build tools; and a dual-port docker-compose.yml setup for serving the SaaS and Health GPT services on an L40S card.

3. Triton JIT 1-Bit Kernels on GPU SRAM

Custom Triton JIT 1-bit kernels (triton_pack_1bit, triton_unpack_1bit) that run in parallel directly on GPU SRAM were developed (an automatic PyTorch fallback is available for CPU).

4. Dynamic Outlier Locking ($\sigma > 3.0$)

During the forward pass, the standard deviation of the data is computed, and outlier channels that exceed the $3.0\sigma$ limit are locked permanently in FP16. The remaining normal-range data is compressed down to as low as 1 bit. This prevents the loss of quantization resolution.

📊 Empirical Evaluation Results

📈 VRAM Context Scaling Comparison

When the sequence length reaches a 100K-token context, the reduction in VRAM footprint becomes substantial: the standard FP16 cache exceeds 12.2 GB, while ARGUS brings the footprint down to about 4.2 GB (a 2.9x memory reduction).

[Figure: VRAM Scaling Graph]

🔬 Passkey Retrieval: Accuracy, Memory and Speed Comparison

A demanding needle-in-a-haystack (Passkey Retrieval) evaluation that tests recall of a hidden passkey embedded in random background noise at different context lengths.

| Architecture / Model | Metric | 64 tok | 256 tok | 512 tok | 1024 tok | 2048 tok |
|---|---|---|---|---|---|---|
| Standard Transformer (Exact FP16 Cache) | Accuracy | 100% | 100% | 100% | 100% | 100% |
| | Cache Memory | 16.0 KB | 64.0 KB | 128.0 KB | 256.0 KB | 512.0 KB |
| | Speed | 5,282 t/s | 4,952 t/s | 6,493 t/s | 6,394 t/s | 6,409 t/s |
| Mamba SSM (State Compression) | Accuracy | 0% | 0% | 0% | 0% | 0% |
| | Cache Memory | 0.5 KB | 0.5 KB | 0.5 KB | 0.5 KB | 0.5 KB |
| | Speed | 4,589 t/s | 7,937 t/s | 8,588 t/s | 8,031 t/s | 8,262 t/s |
| ARGUS (Ours) (Paged Dynamic Cache) | Accuracy | 100% | 100% | 100% | 100% | 100% |
| | Cache Memory | 16.0 KB | 64.0 KB | 121.9 KB | 223.4 KB | 338.2 KB |
| | Speed | 3,591 t/s | 5,912 t/s | 5,850 t/s | 5,422 t/s | 5,686 t/s |

[!IMPORTANT] VRAM Memory Savings: At 2048 tokens, ARGUS achieves 33.9% VRAM savings compared to the standard Transformer without sacrificing accuracy. When the context reaches 100K+ tokens, the saving rises above 65% (as shown in the VRAM Scaling Graph above).

Throughput: Because of the overhead of the custom page manager and the dynamic quantization and dequantization steps, the Python simulation speed of ARGUS is somewhat below that of the standard model. In a real vLLM integration running with asynchronous CUDA streams and compiled Triton kernels, where prefetching runs asynchronously in the background, it is expected to reach or exceed the base model's speed.

📂 Project Directory Structure

├── argus_cache/                  # Installable Python library package
│   ├── __init__.py               # Exposes patch_model_with_argus
│   ├── core/
│   │   ├── quantization.py       # 1-bit, INT2, INT4, INT8 & JL-projection math
│   │   ├── memory_manager.py     # 7-tier outlier-aware paged memory manager
│   │   └── triton_kernels.py     # Triton JIT 1-bit and 4-bit CUDA kernels
│   └── models/
│       └── attention_wrapper.py  # HuggingFace Cache wrapper with adaptive tiering
├── core/                         # Local root core files
├── models/                       # Local root models files
├── benchmarks/
│   ├── generate_vram_graph.py    # Matplotlib benchmark visualizer
│   ├── vram_profiler.py          # VRAM memory scaling profiler
│   ├── llama_real_test.py        # Native HuggingFace Llama-3-8B integration test
│   └── empirical_comparison.py   # Passkey Retrieval evaluation dataset & model
├── tests/
│   ├── test_compression_loss.py  # 1-bit vs lossless delta-encoding test
│   ├── test_quantization.py      # Tests for INT8, INT4 packing & Triton compiler
│   └── test_kv_cache.py          # Tier transition & reconstruction error tests
├── Dockerfile.vllm               # Production vLLM deployment container
├── docker-compose.yml            # Dual-service SaaS & Health orchestration
├── argus_vllm_models.py          # vLLM model registry bypass hook
├── setup.py                      # Library package installer
├── pyproject.toml                # Library setup config
├── demo_train_and_generate.py    # Poetry causal QAT training demo
└── README.md                     # Bilingual documentation

⚙️ Installation & Usage

1. Requirements

Linux OS
CUDA 11.8+ or 12.0+ (optional; when available, the Triton JIT kernels are compiled on the fly)
Python 3.10+

2. Local Library Install

# Clone the repository
git clone <repository_url>
cd "mamba fix"

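# Create a virtual environment (the commands below expect it at .venv)
python3 -m venv .venv

# Install as an editable package
.venv/bin/pip install -e .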

3. Run Automated Tests

.venv/bin/pytest tests/
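For orientation, a round-trip test of the kind pytest would collect might look like the sketch below. It is a hypothetical example written with NumPy, not the contents of tests/test_quantization.py, and all function names in it are illustrative.

```python
import numpy as np


def quantize_int4_asymmetric(x: np.ndarray):
    """Map x onto integers 0..15 using a min/max (asymmetric) range."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 15.0 if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo


def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit values into one byte each (even length assumed)."""
    return (q[0::2] << 4) | q[1::2]


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Split each byte back into its high and low 4-bit values."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out


def test_int4_roundtrip():
    x = np.linspace(-1.0, 1.0, 32, dtype=np.float32)
    q, scale, lo = quantize_int4_asymmetric(x)
    packed = pack_int4(q)
    # Two values per byte: half as many bytes as quantized values.
    assert packed.nbytes == q.size // 2
    # Packing is lossless with respect to the 4-bit codes.
    assert np.array_equal(unpack_int4(packed), q)
    # Dequantization error is bounded by half a quantization step.
    restored = unpack_int4(packed).astype(np.float32) * scale + lo
    assert np.max(np.abs(restored - x)) <= scale / 2 + 1e-5
```

Saved under tests/ in a file whose name starts with test_, a function like this would be picked up by the pytest command above.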

4. Generate the Scaling Benchmark Graph

.venv/bin/python benchmarks/generate_vram_graph.py

📄 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

🎓 Academic Citations

If you use this architecture or code in your thesis or research, please cite:

@thesis{MuhammedEminARGUS2026,
  author    = {Muhammed Emin Çelik},
  title     = {ARGUS: Anchored Random Geometric Unbiased Storage for Key-Value Cache in Long-Context Large Language Models},
  institution = {Academic Graduation Thesis},
  year      = {2026},
  month     = {May}
}

Release history

All releases are dated May 25, 2026: 0.1.5, 0.1.4.2, 0.1.4.1, 0.1.4, 0.1.3, 0.1.2, 0.1.1, and 0.1.0 (this version).

