AttentionEcho
Cross-request attention pattern reuse for LLM inference optimization
What is AttentionEcho?
AttentionEcho is a novel inference optimization technique that reuses attention patterns across semantically similar requests. Unlike traditional prefix caching (which caches KV tensors), AttentionEcho caches the actual attention weights and adjusts them for new queries.
Key Innovation
Standard Inference:
Request 1: "You are helpful. What is Python?" → Compute Q @ K.T (expensive)
Request 2: "You are helpful. What is Java?"   → Compute Q @ K.T (expensive)
With AttentionEcho:
Request 1: "You are helpful. What is Python?" → Compute Q @ K.T → Cache pattern
Request 2: "You are helpful. What is Java?"   → Reuse pattern (fast!)
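To make the saving concrete, here is a minimal NumPy sketch of the idea (ours, not the library's implementation): full attention must form the quadratic score matrix and its softmax, while an echoed request only needs the cheap pattern-times-values product.

import numpy as np

def full_attention(q, k, v):
    # Expensive step: the (seq x seq) score matrix and its softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pattern /= pattern.sum(axis=-1, keepdims=True)
    return pattern @ v, pattern

rng = np.random.default_rng(0)
q1, k1, v1 = (rng.normal(size=(15, 64)) for _ in range(3))
out1, cached = full_attention(q1, k1, v1)  # request 1: compute and cache pattern

v2 = rng.normal(size=(15, 64))
out2 = cached @ v2                         # request 2: echo the cached pattern

In the actual pipeline the cached pattern is first adjusted for the new query (see Echo Transform below) rather than reused verbatim.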
Features
- Semantic Matching: Uses embedding similarity (not exact token match)
- Pattern Adjustment: First-order Taylor expansion for query differences
- Cross-Request Sharing: One user's cached pattern helps another
- Framework Agnostic: Works with PyTorch, NumPy, or any tensor library
- Production Ready: Thread-safe, LRU eviction, comprehensive stats
Installation
pip install attention-echo
# With PyTorch support
pip install attention-echo[torch]
# For development
pip install attention-echo[dev]
Quick Start
Basic Usage (NumPy)
from attention_echo import AttentionEchoCache, EchoConfig
# Create cache
config = EchoConfig(
capacity=1000,
similarity_threshold=0.85
)
cache = AttentionEchoCache(config)
# First request - computes and caches
output1, meta1 = cache.attention_with_echo(
query=q1, key=k1, value=v1,
prefix_length=10,
prefix_embeddings=embeddings1
)
print(meta1) # {'echo_hit': False, 'tokens_computed': 15}
# Second request with similar prefix - reuses pattern!
output2, meta2 = cache.attention_with_echo(
query=q2, key=k2, value=v2,
prefix_length=10,
prefix_embeddings=embeddings2 # Similar to embeddings1
)
print(meta2) # {'echo_hit': True, 'similarity': 0.95, 'tokens_echoed': 10}
PyTorch Integration
import torch
from attention_echo.torch import EchoAttention
# Wrap your attention layer
attention = EchoAttention(
hidden_dim=768,
num_heads=12,
cache_capacity=1000
)
# Use like normal attention
output = attention(
query=q,
key=k,
value=v,
prefix_length=prefix_len
)
# Check stats
print(attention.cache.stats)
# {'hits': 150, 'misses': 20, 'hit_rate': 0.88}
How It Works
1. Semantic Hashing
When a request arrives, we compute a semantic hash of the prefix:
semantic_key = normalize(mean_pool(prefix_embeddings))
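A plausible reading of that one-liner in NumPy (the function name and internals here are ours, not necessarily the library's): mean-pool the prefix token embeddings into a single vector and L2-normalize it, so that cosine similarity between keys reduces to a dot product.

import numpy as np

def semantic_key(prefix_embeddings):
    # prefix_embeddings: (num_prefix_tokens, embed_dim)
    pooled = prefix_embeddings.mean(axis=0)   # mean-pool over tokens
    return pooled / np.linalg.norm(pooled)    # L2-normalize to unit length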
2. Cache Lookup
Search for similar cached patterns using cosine similarity:
for cached_key, entry in cache:  # iterate over (semantic_key, entry) pairs
    similarity = cosine_sim(query_key, cached_key)  # dot product of unit vectors
    if similarity > threshold:
        return entry  # cache hit!
3. Echo Transform
Adjust the cached pattern for the new query:
# First-order Taylor adjustment
delta_q = new_query - cached_query
pattern_adjusted = cached_pattern + alpha * delta_q @ jacobian
pattern_final = softmax(pattern_adjusted)
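Put together, a self-contained sketch of the transform (we assume the cached quantity is the pre-softmax score matrix and that jacobian has shape (d, seq); neither is confirmed by the snippet above):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def echo_transform(cached_scores, cached_query, new_query, jacobian, alpha=0.1):
    # First-order Taylor correction: pattern(q + dq) ~ softmax(scores(q) + alpha * dq @ J)
    delta_q = new_query - cached_query                      # (seq, d)
    adjusted = cached_scores + alpha * delta_q @ jacobian   # (seq, seq)
    return softmax(adjusted)                                # re-normalize each row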
Performance
| Scenario | Cache Hit Rate | Speedup |
|---|---|---|
| Chatbots (same system prompt) | 90-95% | 8-10x |
| RAG (same context) | 70-85% | 3-5x |
| Code assistants | 60-80% | 2-3x |
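The numbers above will vary with workload. A rough way to measure the hit rate on your own traffic, using the NumPy API from Quick Start (shapes are illustrative, and we assume AttentionEchoCache exposes the same stats dict shown for the PyTorch wrapper above):

import numpy as np
from attention_echo import AttentionEchoCache, EchoConfig

cache = AttentionEchoCache(EchoConfig(capacity=1000, similarity_threshold=0.85))
rng = np.random.default_rng(0)

shared = rng.normal(size=(10, 128))  # stand-in for shared system-prompt embeddings
for _ in range(100):
    q, k, v = (rng.normal(size=(15, 64)) for _ in range(3))
    noisy = shared + 0.01 * rng.normal(size=shared.shape)  # near-duplicate prefixes
    cache.attention_with_echo(query=q, key=k, value=v,
                              prefix_length=10, prefix_embeddings=noisy)

print(cache.stats)  # assumption: same dict as attention.cache.stats above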
Configuration
from attention_echo import EchoConfig
config = EchoConfig(
# Cache settings
capacity=1000, # Max cached patterns
similarity_threshold=0.85, # Min similarity for hit
# Pattern adjustment
adjustment_strength=0.1, # How much to adjust patterns
enable_jacobian=True, # Use first-order adjustment
# Semantic hashing
hash_dim=128, # Dimension of semantic keys
)
Architecture
AttentionEcho Pipeline:

Input → Embeddings → Semantic Hash → Cache Lookup
                         │
         ┌───────────────┴───────────────┐
         │                               │
       [HIT]                          [MISS]
         │                               │
   Echo Transform                Full Attention
(adjust cached pattern)            (Q @ K.T)
         │                               │
         │                        Store in cache
         │                               │
         └───────────────┬───────────────┘
                         │
                    pattern @ V
                         │
                      Output
Running Tests
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# With coverage
pytest tests/ --cov=attention_echo --cov-report=html
Examples
See the examples/ directory:
- basic_usage.py - Simple NumPy example
- torch_integration.py - PyTorch model integration
- benchmark.py - Performance benchmarking
- multi_user_serving.py - Simulated serving scenario
Contributing
Contributions are welcome! Please read our contributing guidelines first.
License
MIT License - see LICENSE for details.
Related Work
- Prefix Caching - Caches KV tensors (we cache patterns)
- EchoAtt - Shares attention across layers (we share across requests)
- AttMEMO - Memoization within sequences (we do cross-request)
Created by Dev-Forge