Skip to main content

AttentionEcho: Cross-request attention pattern reuse for LLM inference optimization

Project description

๐Ÿ”Š AttentionEcho

Cross-request attention pattern reuse for LLM inference optimization

Python 3.9+ License: MIT


๐ŸŽฏ What is AttentionEcho?

AttentionEcho is a novel inference optimization technique that reuses attention patterns across semantically similar requests. Unlike traditional prefix caching (which caches KV tensors), AttentionEcho caches the actual attention weights and adjusts them for new queries.

Key Innovation

Standard Inference:
  Request 1: "You are helpful. What is Python?"  โ†’ Compute Q @ K.T (expensive)
  Request 2: "You are helpful. What is Java?"    โ†’ Compute Q @ K.T (expensive)
  
With AttentionEcho:
  Request 1: "You are helpful. What is Python?"  โ†’ Compute Q @ K.T โ†’ Cache pattern
  Request 2: "You are helpful. What is Java?"    โ†’ Reuse pattern (fast!) โœ“

โœจ Features

  • Semantic Matching: Uses embedding similarity (not exact token match)
  • Pattern Adjustment: First-order Taylor expansion for query differences
  • Cross-Request Sharing: One user's cached pattern helps another
  • Framework Agnostic: Works with PyTorch, NumPy, or any tensor library
  • Production Ready: Thread-safe, LRU eviction, comprehensive stats

๐Ÿ“ฆ Installation

pip install attention-echo

# With PyTorch support
pip install attention-echo[torch]

# For development
pip install attention-echo[dev]

๐Ÿš€ Quick Start

Basic Usage (NumPy)

from attention_echo import AttentionEchoCache, EchoConfig

# Create cache
config = EchoConfig(
    capacity=1000,
    similarity_threshold=0.85
)
cache = AttentionEchoCache(config)

# First request - computes and caches
output1, meta1 = cache.attention_with_echo(
    query=q1, key=k1, value=v1,
    prefix_length=10,
    prefix_embeddings=embeddings1
)
print(meta1)  # {'echo_hit': False, 'tokens_computed': 15}

# Second request with similar prefix - reuses pattern!
output2, meta2 = cache.attention_with_echo(
    query=q2, key=k2, value=v2,
    prefix_length=10,
    prefix_embeddings=embeddings2  # Similar to embeddings1
)
print(meta2)  # {'echo_hit': True, 'similarity': 0.95, 'tokens_echoed': 10}

PyTorch Integration

import torch
from attention_echo.torch import EchoAttention

# Wrap your attention layer
attention = EchoAttention(
    hidden_dim=768,
    num_heads=12,
    cache_capacity=1000
)

# Use like normal attention
output = attention(
    query=q,
    key=k,
    value=v,
    prefix_length=prefix_len
)

# Check stats
print(attention.cache.stats)
# {'hits': 150, 'misses': 20, 'hit_rate': 0.88}

๐Ÿ“Š How It Works

1. Semantic Hashing

When a request arrives, we compute a semantic hash of the prefix:

semantic_key = normalize(mean_pool(prefix_embeddings))

2. Cache Lookup

Search for similar cached patterns using cosine similarity:

for cached_key, entry in cache:
    similarity = cosine_sim(query_key, cached_key)
    if similarity > threshold:
        return entry  # Cache hit!

3. Echo Transform

Adjust the cached pattern for the new query:

# First-order Taylor adjustment
delta_q = new_query - cached_query
pattern_adjusted = cached_pattern + alpha * delta_q @ jacobian
pattern_final = softmax(pattern_adjusted)

๐Ÿ“ˆ Performance

Scenario Cache Hit Rate Speedup
Chatbots (same system prompt) 90-95% 8-10x
RAG (same context) 70-85% 3-5x
Code assistants 60-80% 2-3x

๐Ÿ”ง Configuration

from attention_echo import EchoConfig

config = EchoConfig(
    # Cache settings
    capacity=1000,              # Max cached patterns
    similarity_threshold=0.85,  # Min similarity for hit
    
    # Pattern adjustment
    adjustment_strength=0.1,    # How much to adjust patterns
    enable_jacobian=True,       # Use first-order adjustment
    
    # Semantic hashing
    hash_dim=128,               # Dimension of semantic keys
)

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    AttentionEcho Pipeline                        โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                  โ”‚
โ”‚  Input โ†’ Embeddings โ†’ Semantic Hash โ†’ Cache Lookup               โ”‚
โ”‚                                           โ”‚                      โ”‚
โ”‚              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”        โ”‚
โ”‚              โ”‚                                          โ”‚        โ”‚
โ”‚           [HIT]                                     [MISS]       โ”‚
โ”‚              โ”‚                                          โ”‚        โ”‚
โ”‚    Echo Transform                              Full Attention    โ”‚
โ”‚    (adjust cached pattern)                     (Q @ K.T)         โ”‚
โ”‚              โ”‚                                          โ”‚        โ”‚
โ”‚              โ”‚                                    Store in cache โ”‚
โ”‚              โ”‚                                          โ”‚        โ”‚
โ”‚              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜        โ”‚
โ”‚                                           โ”‚                      โ”‚
โ”‚                                    pattern @ V                   โ”‚
โ”‚                                           โ”‚                      โ”‚
โ”‚                                       Output                     โ”‚
โ”‚                                                                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿงช Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=attention_echo --cov-report=html

๐Ÿ“š Examples

See the examples/ directory:

  • basic_usage.py - Simple NumPy example
  • torch_integration.py - PyTorch model integration
  • benchmark.py - Performance benchmarking
  • multi_user_serving.py - Simulated serving scenario

๐Ÿค Contributing

Contributions are welcome! Please read our contributing guidelines first.

๐Ÿ“„ License

MIT License - see LICENSE for details.

๐Ÿ”— Related Work

  • Prefix Caching - Caches KV tensors (we cache patterns)
  • EchoAtt - Shares attention across layers (we share across requests)
  • AttMEMO - Memoization within sequences (we do cross-request)

Created by Dev-Forge ๐Ÿ”ฌ

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

attention_echo-0.1.0.tar.gz (15.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

attention_echo-0.1.0-py3-none-any.whl (13.1 kB view details)

Uploaded Python 3

File details

Details for the file attention_echo-0.1.0.tar.gz.

File metadata

  • Download URL: attention_echo-0.1.0.tar.gz
  • Upload date:
  • Size: 15.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for attention_echo-0.1.0.tar.gz
Algorithm Hash digest
SHA256 ceb590f977e1a1c59e3d3a179bc054ad4cc7935ba13e59c5f62f8c95dc4b6ce0
MD5 1ca711fe729ca9af2ced08be0cf48a46
BLAKE2b-256 a4cd5b752cd096f5b89a61652dd619713f08806e563e8884af42329b8b17da3e

See more details on using hashes here.

File details

Details for the file attention_echo-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: attention_echo-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 13.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for attention_echo-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fed9007c3206eaeaa4e9094af96c379f567190e0ad7f1d8ad0526af03e38489c
MD5 667e8eb4071b63896ab2a540ea9d3f1e
BLAKE2b-256 706dd32f584184550492edf316737f057b43a2a9ba7b7067107d830464a51bd5

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page