Reminiscence

Semantic cache for LLMs and multi-agent systems

Reminiscence eliminates redundant computation by matching queries semantically rather than by exact string. Ideal for LLM applications, RAG pipelines, and agent workflows.

# These queries hit the same cache entry:
"Analyze Q3 sales data"
"Show me third quarter sales analysis"
"What were Q3 revenues?"

Why semantic caching?

Traditional caches fail for AI systems because users express the same intent differently. Reminiscence uses FastEmbed with multilingual sentence transformers to recognize equivalent queries, reducing API costs and latency.
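
The matching idea itself is simple: embed both queries and compare the vectors. A minimal sketch of that comparison using FastEmbed directly (the default model and the interpretation of scores here are illustrative; Reminiscence manages its own model and threshold internally):

import numpy as np
from fastembed import TextEmbedding

# Illustrative only; downloads a small default model on first use
model = TextEmbedding()

queries = ["Analyze Q3 sales data", "Show me third quarter sales analysis"]
vec_a, vec_b = list(model.embed(queries))

# Cosine similarity: close to 1.0 means "same intent", near 0.0 means unrelated
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"similarity: {similarity:.2f}")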

Quick Start

pip install reminiscence

from reminiscence import Reminiscence

cache = Reminiscence()

# Check cache before expensive operation
query = "Analyze Q3 2024 sales"
context = {"agent": "analyst", "db": "prod"}

result = cache.lookup(query=query, context=context)

if result.is_hit:
    print(f"Cache hit! Similarity: {result.similarity:.2f}")
    data = result.result
else:
    # Execute and cache
    data = expensive_operation()
    cache.store(query, context, data)

Decorator API

Automatic caching with hybrid matching (semantic + exact params):

@cache.cached(query_param="prompt", strict_params=["model"])
def call_llm(prompt: str, model: str):
    return expensive_llm_call(prompt, model)

# Similar prompts with same model hit cache
call_llm("Explain quantum physics", "gpt-4")      # Executes
call_llm("Can you explain quantum mechanics?", "gpt-4")  # Cache hit ✓

# Different model = cache miss
call_llm("Explain quantum physics", "claude-3")   # Executes (different context)

Auto-strict mode

Non-string parameters are automatically treated as strict:

@cache.cached(query_param="prompt", auto_strict=True)
def ask_llm(prompt: str, temperature: float, max_tokens: int):
    # temperature and max_tokens auto-detected as strict
    return llm_call(prompt, temperature, max_tokens)

Key Features

  • 🎯 Semantic matching - FastEmbed + cosine similarity (multilingual support)
  • 🔀 Hybrid caching - Semantic similarity + exact context matching
  • 🏗️ Production ready - LRU/LFU/FIFO eviction, TTL, health checks
  • 📊 OpenTelemetry native - Metrics, tracing, and spans out of the box
  • 🔒 Type safe - Handles DataFrames, numpy arrays, nested dicts (10MB+); see the sketch after this list
  • ⚡ Zero config - Works instantly, scales to 100K+ entries with auto-indexing
  • 🔄 Background tasks - Automatic cleanup scheduler and metrics export
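
As a concrete example of the type-safe storage above, a pandas DataFrame can go straight through the lookup/store API from the Quick Start. A minimal sketch; the query and context values are made up for illustration:

import pandas as pd
from reminiscence import Reminiscence

cache = Reminiscence()

query = "Monthly revenue by region"   # illustrative
context = {"agent": "analyst"}        # illustrative

result = cache.lookup(query=query, context=context)
if result.is_hit:
    df = result.result  # comes back as a DataFrame, not a serialized blob
else:
    # Stand-in for an expensive SQL query or transformation
    df = pd.DataFrame({"region": ["EU", "US"], "revenue": [1200, 3400]})
    cache.store(query, context, df)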

Configuration

from reminiscence import Reminiscence, ReminiscenceConfig

# Development (in-memory, defaults)
cache = Reminiscence()

# Production (persistent, optimized)
config = ReminiscenceConfig(
    db_uri="./cache.lance",
    ttl_seconds=3600,
    eviction_policy="lru",
    max_entries=50_000,
    auto_create_index=True
)
cache = Reminiscence(config)

# With OpenTelemetry
config = ReminiscenceConfig(
    otel_enabled=True,
    otel_service_name="my-service",
    otel_endpoint="http://localhost:4317"
)
cache = Reminiscence(config)

# Docker/Kubernetes (environment variables)
cache = Reminiscence(ReminiscenceConfig.load())

See Configuration Guide for all environment variables.

Background Tasks

Automatic cleanup and metrics export:

cache = Reminiscence(ReminiscenceConfig(
    ttl_seconds=3600,
    otel_enabled=True
))

# Start background tasks
cache.start_scheduler(
    interval_seconds=1800,              # Cleanup every 30 min
    metrics_export_interval_seconds=60  # Export metrics every minute
)

# ... use cache ...

# Stop when done (or use context manager)
cache.stop_scheduler()

Context Manager

with Reminiscence() as cache:
    cache.start_scheduler()
    # ... use cache ...
    # Automatically stops scheduler on exit

Use Cases

  • LLM applications - Cache similar prompts to reduce API costs (OpenAI, Anthropic, etc.)
  • Multi-agent systems - Share cache across agents with context isolation (sketched after this list)
  • RAG pipelines - Cache retrieved documents, embeddings, and search results
  • Data analysis - Cache expensive SQL queries, pandas transformations
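
Context isolation in the multi-agent case falls out of the context argument: the same query stored under different contexts yields separate entries. A minimal sketch with illustrative agent names:

from reminiscence import Reminiscence

cache = Reminiscence()

query = "Summarize the incident report"

# Each agent keys its entries with its own context dict
cache.store(query, {"agent": "triage"}, "triage-focused summary")
cache.store(query, {"agent": "legal"}, "legal-focused summary")

# A lookup only matches entries stored under the same context
hit = cache.lookup(query=query, context={"agent": "triage"})
if hit.is_hit:
    print(hit.result)  # the triage entry; the legal agent's entry stays isolated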

Observability

Built-in OpenTelemetry support for production monitoring:

# Automatic metrics collection
config = ReminiscenceConfig(
    enable_metrics=True,
    otel_enabled=True
)
cache = Reminiscence(config)

# Get current stats
stats = cache.get_stats()
print(f"Cache entries: {stats['cache_entries']}")
print(f"Hit rate: {stats['hit_rate']}")
print(f"Schedulers: {stats.get('schedulers', {})}")

Available metrics:

  • Cache hits/misses and hit rate
  • Lookup and store latency
  • Total entries and evictions
  • Error counts by operation
  • Scheduler execution stats

Compatible with Prometheus, Grafana, Datadog, New Relic, and any OTLP-compatible backend.

Health Checks

Production-ready health monitoring:

health = cache.health_check()

# Returns a comprehensive status dict
{
    "status": "healthy",  # or "unhealthy"
    "checks": {
        "embedding": {"ok": True, "error": None},
        "database": {"ok": True, "error": None},
        "error_rate": {"ok": True, "details": "..."},
        "schedulers": {"ok": True, "details": "2/2 schedulers running"},
        "opentelemetry": {"ok": True, "details": "Enabled (...)"}
    },
    "metrics": {...},
    "timestamp": 1696512000000
}

Requirements

  • Python 3.9+
  • Core: lancedb, fastembed, orjson, pyarrow, structlog
  • Optional: pandas, polars, numpy (for DataFrame/array caching)

Performance

Typical latencies on consumer hardware (Apple M1/M2, AMD Ryzen):

  • Lookup: 5-15ms (with index), 10-50ms (without)
  • Store: 5-10ms
  • Embedding: 20-50ms (cached in-memory after first use)

Scales to 100K+ entries with automatic vector indexing (IVF-PQ).
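
These figures are easy to sanity-check locally. A rough timing loop, not a rigorous benchmark (the query and context here are placeholders):

import time
from reminiscence import Reminiscence

cache = Reminiscence()
cache.store("warm-up query", {"agent": "bench"}, "warm-up result")

# Average repeated lookups against a warm cache
n = 100
start = time.perf_counter()
for _ in range(n):
    cache.lookup(query="warm-up query", context={"agent": "bench"})
avg_ms = (time.perf_counter() - start) * 1000 / n
print(f"avg lookup: {avg_ms:.1f} ms")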

License

AGPL v3 - See LICENSE

