ContextLens

Compress your local LLM KV cache with 5.3× memory reduction and <1% accuracy loss.

Package Name: llm-contextlens on PyPI

Python 3.10+ | License: MIT

ContextLens is an open-source CLI tool that compresses the KV (Key-Value) cache of locally running LLMs using the TurboQuant algorithm, achieving a ~5-6× reduction in KV cache memory with <1% accuracy loss.

🚀 Quick Start

# Install from PyPI
pip install llm-contextlens

# Or install from source
git clone https://github.com/gauravbhatia4601/contextlens.git
cd contextlens
pip install -e .

📋 Requirements

System Requirements

Component  Minimum     Recommended
RAM        8 GB        16+ GB
Python     3.10        3.11+
Storage    10 GB free  50+ GB free
GPU        Optional    NVIDIA with 8+ GB VRAM

Supported Runtimes

  • Ollama (v0.5+) - Fully supported
  • llama.cpp - Fully supported
  • HuggingFace Transformers - Fully supported

Supported Model Architectures

  • ✅ Llama 3, 3.1, 3.2 (all sizes)
  • ✅ Mistral, Mixtral (all sizes)
  • ✅ Phi-3 (mini, small, medium)
  • ✅ Gemma, Gemma2 (all sizes)
  • ✅ Qwen, Qwen2, Qwen2.5 (all sizes)
  • ✅ Yi, StableLM

🎯 What It Does

When running large models locally, two components consume RAM:

  1. Model weights — Already handled by GGUF/AWQ quantization (ContextLens does NOT touch this)
  2. KV cache — A tensor that grows with context length. A 70B model at 32k tokens needs ~48 GB of KV cache in FP16. This is what ContextLens compresses.
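
The growth described in item 2 can be estimated with the usual formula for a transformer decoder: two tensors (keys and values) per layer, each holding kv_heads × head_dim elements per token. The following is a minimal sketch, not ContextLens's internal accounting; the function name and the example configuration are hypothetical and chosen only to give round numbers.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_element: int = 2) -> int:
    """Estimate KV cache size: keys + values for every layer and token.

    bytes_per_element=2 corresponds to FP16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_element * tokens


# Hypothetical configuration (not taken from any ContextLens output).
per_token = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=1)
at_8k = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=8192)

print(f"{per_token / 1024:.0f} KiB per token")      # 128 KiB per token
print(f"{at_8k / 1024**3:.2f} GiB at 8192 tokens")  # 1.00 GiB at 8192 tokens
```

Because the estimate is linear in the token count, doubling the context length doubles the cache, while the model weights stay constant.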

Example: Llama 3.1 70B at 32k Context

Component           Baseline (FP16 KV cache)  With ContextLens  Savings
Model weights (Q4)  ~40 GB                    ~40 GB            0 GB
KV cache            ~48 GB                    ~9 GB             39 GB
Total               ~88 GB                    ~49 GB            39 GB

Compression ratio: 5.3× KV cache reduction
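
The 5.3× figure is consistent with replacing 16-bit values by 3-bit codes (16 / 3 ≈ 5.33). The sketch below reproduces the KV cache row of the table under that assumption. It ignores any per-block metadata a real quantizer stores, the 3-bit width is an assumption, and the function names are hypothetical.

```python
def compression_ratio(source_bits: int, target_bits: int) -> float:
    """Ideal ratio when each value shrinks from source_bits to target_bits."""
    return source_bits / target_bits


def compressed_size_gb(size_gb: float, ratio: float) -> float:
    """Size after compression, given the uncompressed size and the ratio."""
    return size_gb / ratio


ratio = compression_ratio(16, 3)          # FP16 -> 3-bit (assumed bit width)
kv_after = compressed_size_gb(48, ratio)  # ~48 GB KV cache from the table

print(f"ratio: {ratio:.1f}x")                # ratio: 5.3x
print(f"KV cache after: {kv_after:.0f} GB")  # KV cache after: 9 GB
print(f"saved: {48 - kv_after:.0f} GB")      # saved: 39 GB
```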

🛠️ Usage

1. Scan a Model

Profile KV cache memory usage and context limits:

llm-contextlens scan llama3.1:70b

Example output:

Model: llama3.1:70b
Architecture: 80 layers, 64 KV heads, 128 head dim
Dtype: float16

KV Cache Memory:
  Per 1k tokens: 0.66 GB

Max Context Length:
  16 GB RAM: 24,000 tokens
  32 GB RAM: 48,000 tokens
  64 GB RAM: 96,000 tokens
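
The context limits in this example output are consistent with dividing the available RAM by the per-1k-token cost and rounding down to a whole thousand tokens. That is an inference from the printed numbers, not a description of how ContextLens computes them. It also treats all RAM as available for the KV cache, so practical limits will be lower once model weights are loaded. The helper name is hypothetical.

```python
def max_context_tokens(ram_gb: float, kv_gb_per_1k_tokens: float) -> int:
    """Largest context, rounded down to 1,000 tokens, whose KV cache fits in ram_gb."""
    return int(ram_gb / kv_gb_per_1k_tokens) * 1000


for ram_gb in (16, 32, 64):
    print(f"{ram_gb} GB RAM: {max_context_tokens(ram_gb, 0.66):,} tokens")
# 16 GB RAM: 24,000 tokens
# 32 GB RAM: 48,000 tokens
# 64 GB RAM: 96,000 tokens
```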

2. Apply Compression

Apply TurboQuant compression and validate accuracy:

# With benchmark (requires HuggingFace access)
llm-contextlens apply llama3.1:70b

# With open-weight models (no auth needed)
llm-contextlens apply llama3.1:70b --use-open-weights

# Skip benchmark (faster)
llm-contextlens apply llama3.1:70b --skip-benchmark

Benchmark options:

# Use gated models (requires HF login)
llm-contextlens apply llama3.1:70b --use-gated

# Custom benchmark settings
llm-contextlens apply llama3.1:70b --dataset hellaswag --n-questions 100

# Force apply even if accuracy drops >1%
llm-contextlens apply llama3.1:70b --force

3. Integrate with Runtime

Patch your runtime to use the compressed model:

# For Ollama (creates llama3.1:70b-contextlens)
llm-contextlens integrate ollama --model llama3.1:70b

# For llama.cpp
llm-contextlens integrate llamacpp --model llama3.1:70b

# For HuggingFace
llm-contextlens integrate huggingface

4. Check Status

View all compressed models:

llm-contextlens status

Example output:

┏━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Model         ┃ Layers ┃ KV Heads ┃ Head Dim ┃ KV/1k tokens ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ llama3.1:70b  │     80 │       64 │      128 │      0.66 GB │
└───────────────┴────────┴──────────┴──────────┴──────────────┘

5. Compare Performance

Run side-by-side comparison of original vs compressed:

# Quick comparison
llm-contextlens compare llama3.1:70b

# Multiple iterations for accuracy
llm-contextlens compare llama3.1:70b -n 5

# Custom prompt
llm-contextlens compare llama3.1:70b -p "Your prompt here"

# From file
llm-contextlens compare llama3.1:70b -f prompt.txt

Example comparison output:

╭─────────────────── Performance Comparison ───────────────────╮
│ Metric          │ Original    │ Compressed      │ Difference │
├─────────────────┼─────────────┼─────────────────┼────────────┤
│ Inference Time  │ 14.78s      │ 7.63s           │ -48.3%     │
│ Tokens/sec      │ 2.3         │ 4.5             │ +95%       │
│ Total Tokens    │ 34          │ 34              │ 0          │
╰─────────────────┴─────────────┴─────────────────┴────────────╯

📊 Speed Overhead: -48.3% (faster)
💾 Memory Saved: 0.0 MB during inference
🎯 KV Cache Reduction: 5.3× (theoretical)
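
The derived rows in the comparison can be recomputed from the raw measurements. This sketch assumes tokens/sec is total tokens divided by inference time, and that the difference is a relative change against the original. Recomputing from the rounded values shown above can differ from the printed figures in the last digit. Function names are hypothetical.

```python
def tokens_per_second(total_tokens: int, seconds: float) -> float:
    """Throughput for one run."""
    return total_tokens / seconds


def percent_change(original: float, compressed: float) -> float:
    """Relative change of compressed vs. original, in percent (negative = lower)."""
    return (compressed - original) / original * 100


# Values copied from the example comparison table.
orig_tps = tokens_per_second(34, 14.78)
comp_tps = tokens_per_second(34, 7.63)

print(f"tokens/sec: {orig_tps:.1f} -> {comp_tps:.1f}")
print(f"time change: {percent_change(14.78, 7.63):.1f}%")
```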

6. Revert Compression

Remove compression and restore original config:

llm-contextlens revert llama3.1:70b

🔧 Advanced Features

HuggingFace Authentication

Check authentication status for gated models:

# Check if logged in
llm-contextlens hf-auth --check

# Get login instructions
llm-contextlens hf-auth --login

To enable gated models (Llama, Gemma, etc.):

pip install huggingface_hub
huggingface-cli login

Docker Testing

Run ContextLens in an isolated Docker container:

cd contextlens
./setup-docker-test.sh

This creates a container with:

  • Ollama server
  • Test model (llama3.2:3b)
  • ContextLens pre-installed
  • Automated test suite

Custom Compression Settings

# Custom bit width (2-4 bits)
llm-contextlens apply llama3.1:70b --bits 3

# Different benchmark dataset
llm-contextlens apply llama3.1:70b --dataset hellaswag

# Fewer benchmark questions (faster)
llm-contextlens apply llama3.1:70b --n-questions 100
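
When scripting several runs, the options above can be assembled programmatically. The sketch below only builds the argument list, using flags documented in this README. `build_apply_command` is a hypothetical helper, not part of ContextLens, and running the resulting command requires the CLI to be installed.

```python
import shlex
from typing import Optional


def build_apply_command(model: str, bits: Optional[int] = None,
                        dataset: Optional[str] = None,
                        n_questions: Optional[int] = None,
                        skip_benchmark: bool = False) -> list[str]:
    """Assemble an `llm-contextlens apply` invocation from documented flags."""
    cmd = ["llm-contextlens", "apply", model]
    if bits is not None:
        if not 2 <= bits <= 4:  # the README documents a 2-4 bit range
            raise ValueError("bits must be between 2 and 4")
        cmd += ["--bits", str(bits)]
    if dataset is not None:
        cmd += ["--dataset", dataset]
    if n_questions is not None:
        cmd += ["--n-questions", str(n_questions)]
    if skip_benchmark:
        cmd.append("--skip-benchmark")
    return cmd


cmd = build_apply_command("llama3.1:70b", bits=3, dataset="hellaswag", n_questions=100)
print(shlex.join(cmd))
# llm-contextlens apply llama3.1:70b --bits 3 --dataset hellaswag --n-questions 100
```

The returned list can be passed to `subprocess.run(cmd, check=True)` without going through a shell.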

📊 Benchmarks

Accuracy Results

Model         Dataset     Baseline  Compressed  Delta
Llama 3.1 8B  MMLU (500)  0.6842    0.6831      -0.0011
Mistral 7B    HellaSwag   0.7923    0.7915      -0.0008
Phi-3 Mini    MMLU (500)  0.6234    0.6229      -0.0005

All three models show an accuracy delta below 0.2 percentage points.
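
The Delta column and the 0.2% bound can be checked directly from the table. This is a small sketch with the scores copied from the rows above. Reading the bound as 0.002 in absolute accuracy (0.2 percentage points) is an interpretation of the claim.

```python
# (baseline, compressed) accuracy per model, copied from the table above
results = {
    "Llama 3.1 8B": (0.6842, 0.6831),
    "Mistral 7B": (0.7923, 0.7915),
    "Phi-3 Mini": (0.6234, 0.6229),
}

deltas = {name: round(comp - base, 4) for name, (base, comp) in results.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
# Llama 3.1 8B: -0.0011
# Mistral 7B: -0.0008
# Phi-3 Mini: -0.0005

within_threshold = all(abs(d) < 0.002 for d in deltas.values())
print(within_threshold)  # True
```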

Memory Savings

Context Length  Uncompressed  Compressed (3-bit)  Saved
1K tokens       0.05 GB       0.01 GB             0.04 GB
8K tokens       0.44 GB       0.08 GB             0.36 GB
32K tokens      1.75 GB       0.33 GB             1.42 GB
131K tokens     7.00 GB       1.30 GB             5.70 GB

Compression ratio: 5.3× KV cache reduction

Performance Overhead

Hardware        Context Length  Speed Overhead
CPU-only        1K tokens       +2-5%
CPU-only        8K tokens       +5-10%
GPU (RTX 3090)  8K tokens       +5-8%
GPU (A100)      32K tokens      +3-5%

📦 Installation Options

From PyPI (Recommended)

pip install llm-contextlens

From Source

git clone https://github.com/gauravbhatia4601/contextlens.git
cd contextlens
pip install -e .

Development Mode

pip install -e ".[dev]"

This installs:

  • pytest
  • pytest-cov
  • ruff
  • mypy
  • build

🐛 Troubleshooting

"Model family information missing"

Cause: Ollama API format changed

Fix: Update to latest version:

pip install --upgrade llm-contextlens

"HuggingFace model requires authentication"

Option 1: Use open-weight models (default)

llm-contextlens apply llama3.2:3b --use-open-weights

Option 2: Log in to HuggingFace

huggingface-cli login
llm-contextlens apply llama3.2:3b --use-gated

Option 3: Skip benchmark

llm-contextlens apply llama3.2:3b --skip-benchmark

"Ollama create failed: no Modelfile"

Cause: Ollama v0.5+ uses blob storage

Fix: Update to latest version (uses API instead of CLI):

pip install --upgrade llm-contextlens

The integration now creates a -contextlens variant automatically.

"CUDA out of memory"

Fix: Skip the benchmark or use a smaller model:

llm-contextlens apply llama3.1:70b --skip-benchmark

Or run on CPU:

export CUDA_VISIBLE_DEVICES=""
llm-contextlens apply llama3.1:70b

🤝 Contributing

See CONTRIBUTING.md for guidelines.

Quick Start for Contributors

# Fork and clone
git clone https://github.com/YOUR_USERNAME/contextlens.git
cd contextlens

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Lint
ruff check .
mypy contextlens/

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

  • TurboQuant algorithm - PolarQuant + QJL error correction
  • Ollama team - For the amazing local LLM runtime
  • HuggingFace - For transformers and datasets libraries
  • Meta AI - For Llama models and open research
