Compress your local LLM KV cache with 5.3× memory reduction - Install: pip install llm-contextlens

ContextLens

Compress your local LLM KV cache with 5.3× memory reduction and under 1% accuracy loss.

Package Name: llm-contextlens on PyPI

ContextLens is an open-source CLI tool that compresses the KV (Key-Value) cache of locally-running LLMs using the TurboQuant algorithm, achieving ~5-6× memory reduction with <1% accuracy loss.

🚀 Quick Start

# Install from PyPI
pip install llm-contextlens

# Or install from source
git clone https://github.com/gauravbhatia4601/contextlens.git
cd contextlens
pip install -e .

📋 Requirements

System Requirements

Component	Minimum	Recommended
RAM	8 GB	16+ GB
Python	3.10	3.11+
Storage	10 GB free	50+ GB free
GPU	Optional	NVIDIA with 8+ GB VRAM

Supported Runtimes

✅ Ollama (v0.5+) - Fully supported
✅ llama.cpp - Fully supported
✅ HuggingFace Transformers - Fully supported

Supported Model Architectures

✅ Llama 3, 3.1, 3.2 (all sizes)
✅ Mistral, Mixtral (all sizes)
✅ Phi-3 (mini, small, medium)
✅ Gemma, Gemma2 (all sizes)
✅ Qwen, Qwen2, Qwen2.5 (all sizes)
✅ Yi, StableLM

🎯 What It Does

When running large models locally, two components consume RAM:

Model weights — Already handled by GGUF/AWQ quantization (ContextLens does NOT touch this)
KV cache — A tensor that grows with context length. A 70B model at 32k tokens needs ~48 GB of KV cache in FP16. This is what ContextLens compresses.
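As a rough sketch of why the KV cache grows with context length: a standard transformer stores one key vector and one value vector per layer, per KV head, per token. The function name and the model shape below are illustrative assumptions, not ContextLens internals, and the tool's own accounting may differ from this estimate.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size in bytes.

    The leading factor of 2 counts keys and values separately;
    bytes_per_value=2 corresponds to FP16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only to illustrate the formula.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=8192)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

Because every term except `tokens` is fixed for a given model, the cache size scales linearly with context length.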

Example: Llama 3.1 70B at 32k Context

Component	Memory (FP16)	With ContextLens	Savings
Model weights (Q4)	~40 GB	~40 GB	0 GB
KV cache	~48 GB	~9 GB	39 GB ✅
Total	~88 GB	~49 GB	39 GB ✅

Compression ratio: 5.3× KV cache reduction
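The rows of the table follow from simple arithmetic on the stated figures. The sketch below only re-derives them from the ~40 GB, ~48 GB and 5.3× values above; the variable names are illustrative.

```python
weights_gb = 40.0   # Q4 model weights, unchanged by ContextLens
kv_fp16_gb = 48.0   # FP16 KV cache at 32k tokens, as stated above
ratio = 5.3         # stated KV cache compression ratio

kv_compressed_gb = kv_fp16_gb / ratio
total_before = weights_gb + kv_fp16_gb
total_after = weights_gb + kv_compressed_gb

print(f"KV cache: {kv_fp16_gb:.0f} GB -> {kv_compressed_gb:.0f} GB")
print(f"Total:    {total_before:.0f} GB -> {total_after:.0f} GB")
print(f"Saved:    {total_before - total_after:.0f} GB")
```

Only the KV cache term shrinks, so the overall saving depends on how large the cache is relative to the weights.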

🛠️ Usage

1. Scan a Model

Profile KV cache memory usage and context limits:

llm-contextlens scan llama3.1:70b

Example output:

Model: llama3.1:70b
Architecture: 80 layers, 64 KV heads, 128 head dim
Dtype: float16

KV Cache Memory:
  Per 1k tokens: 0.66 GB

Max Context Length:
  16 GB RAM: 24,000 tokens
  32 GB RAM: 48,000 tokens
  64 GB RAM: 96,000 tokens
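
The limits in the sample output are consistent with dividing the RAM budget by the per-1k-token cost and rounding down to whole thousands. The sketch below assumes exactly that, and also assumes the whole budget is available for the KV cache; `max_context_tokens` is a hypothetical helper, not part of the CLI.

```python
import math


def max_context_tokens(ram_gb: float, kv_gb_per_1k: float) -> int:
    """Largest context, in whole thousands of tokens, whose KV cache fits in ram_gb."""
    return math.floor(ram_gb / kv_gb_per_1k) * 1000


for ram in (16, 32, 64):
    print(f"{ram} GB RAM: {max_context_tokens(ram, 0.66):,} tokens")
```

In practice the model weights and the operating system also need memory, so the usable context is smaller than this upper bound.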

2. Apply Compression

Apply TurboQuant compression and validate accuracy:

# With benchmark (requires HuggingFace access)
llm-contextlens apply llama3.1:70b

# With open-weight models (no auth needed)
llm-contextlens apply llama3.1:70b --use-open-weights

# Skip benchmark (faster)
llm-contextlens apply llama3.1:70b --skip-benchmark

Benchmark options:

# Use gated models (requires HF login)
llm-contextlens apply llama3.1:70b --use-gated

# Custom benchmark settings
llm-contextlens apply llama3.1:70b --dataset hellaswag --n-questions 100

# Force apply even if accuracy drops >1%
llm-contextlens apply llama3.1:70b --force
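
The accuracy check implied by `--force` can be pictured as a simple threshold test. This is a minimal sketch under stated assumptions: the 1% limit is treated as an absolute drop of 0.01 in benchmark accuracy, and `should_apply` is a hypothetical function, not the tool's actual code.

```python
def should_apply(baseline_acc: float, compressed_acc: float,
                 max_drop: float = 0.01, force: bool = False) -> bool:
    """Return True if the compressed configuration should be kept.

    max_drop is an absolute accuracy drop (0.01 = 1 percentage point);
    that interpretation is an assumption.
    """
    drop = baseline_acc - compressed_acc
    return force or drop <= max_drop


print(should_apply(0.6842, 0.6831))          # small drop: True
print(should_apply(0.70, 0.65))              # 5-point drop: False
print(should_apply(0.70, 0.65, force=True))  # forced: True
```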

3. Integrate with Runtime

Patch your runtime to use the compressed model:

# For Ollama (creates llama3.1:70b-contextlens)
llm-contextlens integrate ollama --model llama3.1:70b

# For llama.cpp
llm-contextlens integrate llamacpp --model llama3.1:70b

# For HuggingFace
llm-contextlens integrate huggingface
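
To script the scan, apply and integrate steps together, the command lines can be assembled in Python. This sketch uses only the subcommands and flags shown above; `contextlens_pipeline` is a hypothetical helper, and it only prints the commands rather than running them.

```python
import shlex


def contextlens_pipeline(model: str, runtime: str = "ollama",
                         skip_benchmark: bool = False) -> list[list[str]]:
    """Build the scan -> apply -> integrate command lines for one model."""
    apply_cmd = ["llm-contextlens", "apply", model]
    if skip_benchmark:
        apply_cmd.append("--skip-benchmark")

    integrate_cmd = ["llm-contextlens", "integrate", runtime]
    if runtime != "huggingface":  # the huggingface example above takes no --model
        integrate_cmd += ["--model", model]

    return [["llm-contextlens", "scan", model], apply_cmd, integrate_cmd]


for cmd in contextlens_pipeline("llama3.1:70b", skip_benchmark=True):
    # To execute instead of print: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```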

4. Check Status

View all compressed models:

llm-contextlens status

Example output:

┏━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Model         ┃ Layers ┃ KV Heads ┃ Head Dim ┃ KV/1k tokens ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ llama3.1:70b  │     80 │       64 │      128 │      0.66 GB │
└───────────────┴────────┴──────────┴──────────┴──────────────┘

5. Compare Performance

Run side-by-side comparison of original vs compressed:

# Quick comparison
llm-contextlens compare llama3.1:70b

# Multiple iterations for accuracy
llm-contextlens compare llama3.1:70b -n 5

# Custom prompt
llm-contextlens compare llama3.1:70b -p "Your prompt here"

# From file
llm-contextlens compare llama3.1:70b -f prompt.txt

Example comparison output:

╭─────────────────── Performance Comparison ───────────────────╮
│ Metric          │ Original    │ Compressed      │ Difference │
├─────────────────┼─────────────┼─────────────────┼────────────┤
│ Inference Time  │ 14.78s      │ 7.63s           │ -48.3%     │
│ Tokens/sec      │ 2.3         │ 4.5             │ +95%       │
│ Total Tokens    │ 34          │ 34              │ 0          │
╰─────────────────┴─────────────┴─────────────────┴────────────╯

📊 Speed Overhead: -48.3% (faster)
💾 Memory Saved: 0.0 MB during inference
🎯 KV Cache Reduction: 5.3× (theoretical)
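
The derived rows in the sample output can be approximated from the raw timings. This is a sketch using the rounded values printed above, so the results agree with the table only to within rounding; the helper names are illustrative.

```python
def tokens_per_second(total_tokens: int, seconds: float) -> float:
    """Throughput for one generation run."""
    return total_tokens / seconds


def percent_change(original: float, compressed: float) -> float:
    """Relative change of the compressed run versus the original, in percent."""
    return (compressed - original) / original * 100


orig_s, comp_s, tokens = 14.78, 7.63, 34  # values from the sample output

print(f"Tokens/sec: {tokens_per_second(tokens, orig_s):.1f} -> "
      f"{tokens_per_second(tokens, comp_s):.1f}")
print(f"Inference time change: {percent_change(orig_s, comp_s):.1f}%")
```

A negative change in inference time means the compressed run finished sooner.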

6. Revert Compression

Remove compression and restore original config:

llm-contextlens revert llama3.1:70b

🔧 Advanced Features

HuggingFace Authentication

Check authentication status for gated models:

# Check if logged in
llm-contextlens hf-auth --check

# Get login instructions
llm-contextlens hf-auth --login

To enable gated models (Llama, Gemma, etc.):

pip install huggingface_hub
huggingface-cli login

Docker Testing

Run ContextLens in an isolated Docker container:

cd contextlens
./setup-docker-test.sh

This creates a container with:

Ollama server
Test model (llama3.2:3b)
ContextLens pre-installed
Automated test suite

Custom Compression Settings

# Custom bit width (2-4 bits)
llm-contextlens apply llama3.1:70b --bits 3

# Different benchmark dataset
llm-contextlens apply llama3.1:70b --dataset hellaswag

# Fewer benchmark questions (faster)
llm-contextlens apply llama3.1:70b --n-questions 100
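
The bit width sets an upper bound on the achievable reduction relative to FP16. The sketch below assumes a 16-bit baseline and ignores any per-block metadata such as scales, so real ratios may be somewhat lower; `ideal_ratio` is a hypothetical helper.

```python
def ideal_ratio(bits: int, baseline_bits: int = 16) -> float:
    """Upper bound on compression when each FP16 value is stored in `bits` bits."""
    return baseline_bits / bits


for bits in (2, 3, 4):
    print(f"{bits}-bit: {ideal_ratio(bits):.2f}x")
# 2-bit: 8.00x
# 3-bit: 5.33x
# 4-bit: 4.00x
```

The 3-bit case lines up with the 5.3× figure quoted elsewhere in this README.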

📊 Benchmarks

Accuracy Results

Model	Dataset	Baseline	Compressed	Delta
Llama 3.1 8B	MMLU (500)	0.6842	0.6831	-0.0011
Mistral 7B	HellaSwag	0.7923	0.7915	-0.0008
Phi-3 Mini	MMLU (500)	0.6234	0.6229	-0.0005

All three models show an accuracy delta of less than 0.2 percentage points ✅
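
The Delta column is the compressed score minus the baseline score. A short sketch that recomputes it from the table values and expresses it in percentage points:

```python
results = {
    "Llama 3.1 8B": (0.6842, 0.6831),
    "Mistral 7B": (0.7923, 0.7915),
    "Phi-3 Mini": (0.6234, 0.6229),
}

for model, (baseline, compressed) in results.items():
    delta = compressed - baseline
    print(f"{model}: {delta:+.4f} ({delta * 100:+.2f} percentage points)")

# Every delta stays within 0.2 percentage points of the baseline.
within_limit = all(abs(c - b) < 0.002 for b, c in results.values())
print(within_limit)
```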

Memory Savings

Context Length	Uncompressed	Compressed (3-bit)	Saved
1K tokens	0.05 GB	0.01 GB	0.04 GB
8K tokens	0.44 GB	0.08 GB	0.36 GB
32K tokens	1.75 GB	0.33 GB	1.42 GB
131K tokens	7.00 GB	1.30 GB	5.70 GB

Compression ratio: 5.3× KV cache reduction

Performance Overhead

Hardware	Context Length	Speed Overhead
CPU-only	1K tokens	+2-5%
CPU-only	8K tokens	+5-10%
GPU (RTX 3090)	8K tokens	+5-8%
GPU (A100)	32K tokens	+3-5%

📦 Installation Options

From PyPI (Recommended)

pip install llm-contextlens

From Source

git clone https://github.com/gauravbhatia4601/contextlens.git
cd contextlens
pip install -e .

Development Mode

pip install -e ".[dev]"

This installs:

pytest
pytest-cov
ruff
mypy
build

🐛 Troubleshooting

"Model family information missing"

Cause: Ollama API format changed

Fix: Update to latest version:

pip install --upgrade llm-contextlens

"HuggingFace model requires authentication"

Option 1: Use open-weight models (default)

llm-contextlens apply llama3.2:3b --use-open-weights

Option 2: Log in to HuggingFace

huggingface-cli login
llm-contextlens apply llama3.2:3b --use-gated

Option 3: Skip benchmark

llm-contextlens apply llama3.2:3b --skip-benchmark

"Ollama create failed: no Modelfile"

Cause: Ollama v0.5+ uses blob storage

Fix: Update to latest version (uses API instead of CLI):

pip install --upgrade llm-contextlens

The integration now creates a -contextlens variant automatically.

"CUDA out of memory"

Fix: Skip the benchmark, or use a smaller model:

llm-contextlens apply llama3.1:70b --skip-benchmark

Or run on CPU:

export CUDA_VISIBLE_DEVICES=""
llm-contextlens apply llama3.1:70b
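
When launching the CLI from a Python script, the same effect can be had by passing a modified environment to the child process instead of exporting the variable in the shell. This is a minimal sketch; `cpu_only_env` is a hypothetical helper, and the `subprocess.run` call is left commented out so the example has no side effects.

```python
import os
from typing import Dict, Optional


def cpu_only_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and hide all CUDA devices from child processes."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = ""  # an empty value exposes no GPUs
    return env


env = cpu_only_env()
# import subprocess
# subprocess.run(["llm-contextlens", "apply", "llama3.1:70b"], env=env, check=True)
print(repr(env["CUDA_VISIBLE_DEVICES"]))  # prints ''
```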

🤝 Contributing

See CONTRIBUTING.md for guidelines.

Quick Start for Contributors

# Fork and clone
git clone https://github.com/YOUR_USERNAME/contextlens.git
cd contextlens

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Lint
ruff check .
mypy contextlens/

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

TurboQuant algorithm - PolarQuant + QJL error correction
Ollama team - For the amazing local LLM runtime
HuggingFace - For transformers and datasets libraries
Meta AI - For Llama models and open research

📬 Support

Issues: https://github.com/gauravbhatia4601/contextlens/issues
Discussions: https://github.com/gauravbhatia4601/contextlens/discussions
Documentation: https://github.com/gauravbhatia4601/contextlens/wiki

Release history

Version	Released
0.4.4	Apr 2, 2026
0.4.3	Mar 31, 2026
0.4.1 (this version)	Mar 31, 2026
0.4.0	Mar 31, 2026
0.3.1	Mar 31, 2026
0.3.0	Mar 31, 2026

Download files

Source Distribution

llm_contextlens-0.4.1.tar.gz (33.9 kB)

Uploaded Mar 31, 2026 Source

File details

Details for the file llm_contextlens-0.4.1.tar.gz.

File metadata

Download URL: llm_contextlens-0.4.1.tar.gz
Upload date: Mar 31, 2026
Size: 33.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for llm_contextlens-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`865b45b2eb157fb55ef5779ccf2967be43a38e7053081528a5c6b0c14b04595e`
MD5	`c0fcda8f6a49bd7a13c4ae2455c3eefc`
BLAKE2b-256	`b1a2f0534d564b6fdd9a5ad38233ce0871c8632656e57cc4778782f8f3c6ed8c`
