AttentionEcho
Cross-request attention pattern reuse for LLM inference optimization
What is AttentionEcho?
AttentionEcho is a novel inference optimization technique that reuses attention patterns across semantically similar requests. Unlike traditional prefix caching (which caches KV tensors), AttentionEcho caches the actual attention weights and adjusts them for new queries.
Key Innovation
Standard Inference:
Request 1: "You are helpful. What is Python?" → Compute Q @ K.T (expensive)
Request 2: "You are helpful. What is Java?"   → Compute Q @ K.T (expensive)
With AttentionEcho:
Request 1: "You are helpful. What is Python?" → Compute Q @ K.T → Cache pattern
Request 2: "You are helpful. What is Java?"   → Reuse pattern (fast!)
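To make the saving concrete, here is a minimal NumPy sketch of the idea (ours, not the library's implementation): full attention must form the quadratic score matrix and its softmax, while an echoed request only needs the cheap pattern-times-values product.

import numpy as np

def full_attention(q, k, v):
    # Expensive step: the (seq x seq) score matrix and its softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pattern /= pattern.sum(axis=-1, keepdims=True)
    return pattern @ v, pattern

rng = np.random.default_rng(0)
q1, k1, v1 = (rng.normal(size=(15, 64)) for _ in range(3))
out1, cached = full_attention(q1, k1, v1)  # request 1: compute and cache pattern

v2 = rng.normal(size=(15, 64))
out2 = cached @ v2                         # request 2: echo the cached pattern

In the actual pipeline the cached pattern is first adjusted for the new query (see Echo Transform below) rather than reused verbatim.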
Features
- Semantic Matching: Uses embedding similarity (not exact token match)
- Pattern Adjustment: First-order Taylor expansion for query differences
- Cross-Request Sharing: One user's cached pattern helps another
- Framework Agnostic: Works with PyTorch, NumPy, or any tensor library
- Production Ready: Thread-safe, LRU eviction, comprehensive stats
Installation
pip install attention-echo
# With PyTorch support
pip install attention-echo[torch]
# For development
pip install attention-echo[dev]
Quick Start
Basic Usage (NumPy)
from attention_echo import AttentionEchoCache, EchoConfig
# Create cache
config = EchoConfig(
capacity=1000,
similarity_threshold=0.85
)
cache = AttentionEchoCache(config)
# First request - computes and caches
output1, meta1 = cache.attention_with_echo(
query=q1, key=k1, value=v1,
prefix_length=10,
prefix_embeddings=embeddings1
)
print(meta1) # {'echo_hit': False, 'tokens_computed': 15}
# Second request with similar prefix - reuses pattern!
output2, meta2 = cache.attention_with_echo(
query=q2, key=k2, value=v2,
prefix_length=10,
prefix_embeddings=embeddings2 # Similar to embeddings1
)
print(meta2) # {'echo_hit': True, 'similarity': 0.95, 'tokens_echoed': 10}
PyTorch Integration
import torch
from attention_echo.torch import EchoAttention
# Wrap your attention layer
attention = EchoAttention(
hidden_dim=768,
num_heads=12,
cache_capacity=1000
)
# Use like normal attention
output = attention(
query=q,
key=k,
value=v,
prefix_length=prefix_len
)
# Check stats
print(attention.cache.stats)
# {'hits': 150, 'misses': 20, 'hit_rate': 0.88}
How It Works
1. Semantic Hashing
When a request arrives, we compute a semantic hash of the prefix:
semantic_key = normalize(mean_pool(prefix_embeddings))
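A plausible reading of that one-liner in NumPy (the function name and internals here are ours, not necessarily the library's): mean-pool the prefix token embeddings into a single vector and L2-normalize it, so that cosine similarity between keys reduces to a dot product.

import numpy as np

def semantic_key(prefix_embeddings):
    # prefix_embeddings: (num_prefix_tokens, embed_dim)
    pooled = prefix_embeddings.mean(axis=0)   # mean-pool over tokens
    return pooled / np.linalg.norm(pooled)    # L2-normalize to unit length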
2. Cache Lookup
Search for similar cached patterns using cosine similarity:
for cached_key, entry in cache:  # iterate over (semantic_key, entry) pairs
    similarity = cosine_sim(query_key, cached_key)  # dot product of unit vectors
    if similarity > threshold:
        return entry  # cache hit!
3. Echo Transform
Adjust the cached pattern for the new query:
# First-order Taylor adjustment
delta_q = new_query - cached_query
pattern_adjusted = cached_pattern + alpha * delta_q @ jacobian
pattern_final = softmax(pattern_adjusted)
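Put together, a self-contained sketch of the transform (we assume the cached quantity is the pre-softmax score matrix and that jacobian has shape (d, seq); neither is confirmed by the snippet above):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def echo_transform(cached_scores, cached_query, new_query, jacobian, alpha=0.1):
    # First-order Taylor correction: pattern(q + dq) ~ softmax(scores(q) + alpha * dq @ J)
    delta_q = new_query - cached_query                      # (seq, d)
    adjusted = cached_scores + alpha * delta_q @ jacobian   # (seq, seq)
    return softmax(adjusted)                                # re-normalize each row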
Performance
| Scenario | Cache Hit Rate | Speedup |
|---|---|---|
| Chatbots (same system prompt) | 90-95% | 8-10x |
| RAG (same context) | 70-85% | 3-5x |
| Code assistants | 60-80% | 2-3x |
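The numbers above will vary with workload. A rough way to measure the hit rate on your own traffic, using the NumPy API from Quick Start (shapes are illustrative, and we assume AttentionEchoCache exposes the same stats dict shown for the PyTorch wrapper above):

import numpy as np
from attention_echo import AttentionEchoCache, EchoConfig

cache = AttentionEchoCache(EchoConfig(capacity=1000, similarity_threshold=0.85))
rng = np.random.default_rng(0)

shared = rng.normal(size=(10, 128))  # stand-in for shared system-prompt embeddings
for _ in range(100):
    q, k, v = (rng.normal(size=(15, 64)) for _ in range(3))
    noisy = shared + 0.01 * rng.normal(size=shared.shape)  # near-duplicate prefixes
    cache.attention_with_echo(query=q, key=k, value=v,
                              prefix_length=10, prefix_embeddings=noisy)

print(cache.stats)  # assumption: same dict as attention.cache.stats above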
Configuration
from attention_echo import EchoConfig
config = EchoConfig(
# Cache settings
capacity=1000, # Max cached patterns
similarity_threshold=0.85, # Min similarity for hit
# Pattern adjustment
adjustment_strength=0.1, # How much to adjust patterns
enable_jacobian=True, # Use first-order adjustment
# Semantic hashing
hash_dim=128, # Dimension of semantic keys
)
Architecture
AttentionEcho Pipeline:

Input → Embeddings → Semantic Hash → Cache Lookup
                         │
         ┌───────────────┴───────────────┐
         │                               │
       [HIT]                          [MISS]
         │                               │
   Echo Transform                Full Attention
(adjust cached pattern)            (Q @ K.T)
         │                               │
         │                        Store in cache
         │                               │
         └───────────────┬───────────────┘
                         │
                    pattern @ V
                         │
                      Output
Running Tests
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# With coverage
pytest tests/ --cov=attention_echo --cov-report=html
Examples
See the examples/ directory:
- basic_usage.py - Simple NumPy example
- torch_integration.py - PyTorch model integration
- benchmark.py - Performance benchmarking
- multi_user_serving.py - Simulated serving scenario
Contributing
Contributions are welcome! Please read our contributing guidelines first.
License
MIT License - see LICENSE for details.
Related Work
- Prefix Caching - Caches KV tensors (we cache patterns)
- EchoAtt - Shares attention across layers (we share across requests)
- AttMEMO - Memoization within sequences (we do cross-request)
Created by Dev-Forge