3-level LLM cache (exact, semantic, prefix) with real-time token cost tracking

MemoLLM

Cut your LLM API bill by 40–70% with one line of code.

MemoLLM wraps your existing OpenAI or Anthropic client with a 3-level cache. Adding it takes one line, it needs no extra infrastructure, and it works with any model those clients can call.

# Before
from openai import OpenAI
client = OpenAI()

# After: wrap the client (one changed line, plus the import)
from memollm import wrap
client = wrap(OpenAI())

# Your code stays exactly the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document..."}]
)
print(response.choices[0].message.content)  # works identically

Why MemoLLM?

Most LLM apps repeat the same or very similar prompts constantly — same questions, same system prompts, same RAG context. You pay for every token every time.

MemoLLM catches three types of repetition:

Level        What it catches                      Saving                                         API call?
L1 Exact     Identical prompts (same characters)  100%                                           No
L2 Semantic  Same meaning, different wording      100%                                           No
L3 Prefix    Same system prompt / context prefix  Partial (cached prefix billed at a discount)   Yes, but cheaper

Compared with GPTCache: GPTCache covers only L1 and L2. MemoLLM adds L3 (automatic cache_control injection for Anthropic and prefix-cache detection for OpenAI) and expands acronyms before embedding, so "RAG" and "retrieval augmented generation" hit the same cache entry.
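To make the acronym-expansion step concrete, here is a minimal sketch. The `ACRONYMS` table and the `expand_acronyms` function are hypothetical names chosen for illustration; MemoLLM's own normalizer may work differently.

```python
import re

# Hypothetical lookup table; the list MemoLLM actually ships is not shown here.
ACRONYMS = {
    "RAG": "retrieval augmented generation",
    "LLM": "large language model",
}

# Match only whole-word, upper-case acronyms so "DRAG" or "rag" are left alone.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ACRONYMS)) + r")\b")


def expand_acronyms(text: str) -> str:
    """Replace each known acronym with its long form before embedding."""
    return _PATTERN.sub(lambda match: ACRONYMS[match.group(1)], text)


print(expand_acronyms("Explain how RAG works in AI"))
# Explain how retrieval augmented generation works in AI
```

Expanding before embedding means both phrasings produce nearly the same vector, which is what lets them land on the same L2 entry.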


Installation

pip install memollm

No database server, no Docker, no configuration. MemoLLM uses SQLite by default, and your cache lives at ~/.memollm/cache.db.

pip install "memollm[redis]"     # Redis backend (multi-process)
pip install "memollm[postgres]"  # PostgreSQL backend (enterprise)
pip install "memollm[all]"       # Everything

Usage

OpenAI / GPT-4

from openai import OpenAI
from memollm import wrap

client = wrap(OpenAI())  # OPENAI_API_KEY from environment

# First call — hits the API
response = client.chat.completions.create(
    model="gpt-4o",          # works with gpt-4o, gpt-4, gpt-4-turbo, o1, o3, etc.
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}]
)
print(response.choices[0].message.content)

# Second call — same question, instant cache hit, $0 cost
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}]
)

# Third call — different wording, same meaning → L2 semantic hit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain how RAG works in AI"}]
    # "RAG" is expanded to "retrieval augmented generation" before embedding
)

Anthropic / Claude

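MemoLLM automatically injects Anthropic's cache_control on long system prompts. Repeat calls made while the provider-side cache entry is still live then pay the discounted cache-read rate for the system prompt tokens (about 10% of the base input price at the time of writing) instead of the full price.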

import anthropic
from memollm import wrap

client = wrap(anthropic.Anthropic())  # ANTHROPIC_API_KEY from environment

SYSTEM_PROMPT = """
You are an expert data analyst with deep knowledge of SQL, Python, and statistics.
You help users understand complex datasets and write production-quality analysis code.
Always explain your reasoning step by step before providing code.
[... your long system prompt ...]
"""

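# MemoLLM automatically adds cache_control to the system prompt.
# First call: Anthropic writes the system prompt to its cache, billed at a
# premium over the base input price (about 1.25x at the time of writing).
# Later calls within the cache lifetime: system prompt tokens are billed at
# the cache-read rate (about 0.1x the base input price), not for free.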
response = client.messages.create(
    model="claude-opus-4-7",        # works with all Claude models
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Analyse this CSV and find anomalies"}
    ],
    system=SYSTEM_PROMPT,
)
print(response.content[0].text)

# L2 semantic cache: different wording, same meaning
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Find outliers in my dataset"}
    ],
    system=SYSTEM_PROMPT,
)
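As an illustration of what "injecting cache_control" means on the wire, the sketch below rewrites a plain-string system prompt into Anthropic's block form with a `cache_control` marker. The function name `inject_cache_control` and the 4-characters-per-token estimate are assumptions made for this example, not MemoLLM internals.

```python
from typing import Any


def inject_cache_control(kwargs: dict[str, Any], min_prefix_tokens: int = 500) -> dict[str, Any]:
    """Return messages.create() kwargs with a long system prompt marked cacheable."""
    system = kwargs.get("system")
    if not isinstance(system, str):
        return kwargs  # absent or already in block form: leave untouched
    # Rough estimate (about 4 characters per token); an assumption, not a tokenizer.
    if len(system) // 4 < min_prefix_tokens:
        return kwargs
    block = {"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}
    return {**kwargs, "system": [block]}


request = inject_cache_control({
    "model": "claude-opus-4-7",
    "max_tokens": 1024,
    "system": "You are an expert data analyst. " * 100,
    "messages": [{"role": "user", "content": "Find outliers in my dataset"}],
})
print(request["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Note that the provider enforces its own minimum cacheable prompt length, so a prompt that passes a local threshold may still not be cached; check the current Anthropic documentation for the limit that applies to your model.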

Local Ollama (no API key needed)

from openai import OpenAI
from memollm import wrap

# Ollama uses OpenAI-compatible API
client = wrap(OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
))

response = client.chat.completions.create(
    model="llama3.1",    # or llama3.2, mistral, codellama, qwen2.5-coder, etc.
    messages=[{"role": "user", "content": "What is RAG?"}]
)
print(response.choices[0].message.content)

Configuration

from memollm import wrap, CacheConfig

config = CacheConfig(
    # L2 semantic cache settings
    l2_threshold=0.65,            # 0.0–1.0: lower = more hits, less accuracy
                                  # 0.65 recommended for RAG/QA workloads
                                  # 0.90+ for strict accuracy requirements

    # Storage backend
    backend="sqlite",             # "memory" | "sqlite" | "redis" | "postgres"
    backend_url=None,             # required for redis/postgres
    db_path="~/.memollm/cache.db",

    # Cache expiry
    default_ttl_seconds=None,     # None = cache forever, 3600 = expire after 1 hour

    # Embedding model for L2
    l2_embedder="local",          # "local": free, no API key, ~15ms latency
                                  # "openai": better quality, costs $0.02/1M tokens

    # L3 prefix caching
    l3_enabled=True,              # auto-inject cache_control for Anthropic
    l3_min_prefix_tokens=500,     # minimum system prompt length to activate L3
)

client = wrap(OpenAI(), config=config)
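To see what `l2_threshold` trades off, here is a small sketch of a cosine-similarity check. The helper names are hypothetical and the two-dimensional vectors stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_semantic_hit(query: Sequence[float], cached: Sequence[float], threshold: float = 0.65) -> bool:
    """A cached entry counts as a hit only when similarity exceeds the threshold."""
    return cosine_similarity(query, cached) > threshold


query, cached = [1.0, 0.0], [0.8, 0.6]  # similarity is about 0.8
print(is_semantic_hit(query, cached, threshold=0.65))  # True
print(is_semantic_hit(query, cached, threshold=0.90))  # False
```

The same pair of prompts is a hit at 0.65 and a miss at 0.90, which is why a lower threshold raises the hit rate at the cost of occasionally returning an answer to a slightly different question.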

How it works

Your LLM call
      │
      ▼
┌──────────────────────────────────────────┐
│  L1: Exact Cache                         │
│  Normalize prompt → SHA-256 hash         │  → instant hit (<1ms), no API call
│  "What is RAG?" == "What is  RAG ?"      │    whitespace/order differences ignored
└──────────────────────────────────────────┘
      │ miss
      ▼
┌──────────────────────────────────────────┐
│  L2: Semantic Cache                      │
│  Expand acronyms → embed → cosine sim    │  → ~15ms hit, no API call
│  "RAG" → "retrieval augmented generation"│    <0.3% quality loss at threshold 0.65
│  cosine similarity > threshold → hit     │
└──────────────────────────────────────────┘
      │ miss
      ▼
┌──────────────────────────────────────────┐
│  L3: Provider Prefix Cache               │
│  Anthropic: inject cache_control blocks  │  → API call made, but system prompt
│  OpenAI: detect prefix cache eligibility │    tokens are billed at a discount on cache read
└──────────────────────────────────────────┘
      │
      ▼
  LLM API call → store result → log savings
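For the OpenAI side of L3, "prefix cache eligibility" can be pictured as measuring how many leading tokens two consecutive prompts share. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the two constants reflect OpenAI's documented behaviour at the time of writing (automatic caching for prompts of 1,024 tokens or more, matched in 128-token increments), which may change.

```python
from typing import Sequence

MIN_CACHEABLE_TOKENS = 1024  # assumption: provider minimum, check current docs
CACHE_INCREMENT = 128        # assumption: cache granularity, check current docs


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids that two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def cacheable_prefix_tokens(prev: Sequence[int], curr: Sequence[int]) -> int:
    """Estimate how many tokens of `curr` could be served from the provider's prefix cache."""
    shared = shared_prefix_len(prev, curr)
    if shared < MIN_CACHEABLE_TOKENS:
        return 0
    extra_steps = (shared - MIN_CACHEABLE_TOKENS) // CACHE_INCREMENT
    return MIN_CACHEABLE_TOKENS + extra_steps * CACHE_INCREMENT


# Two prompts sharing a 1,500-token system prompt, then diverging.
prev_prompt = list(range(1500)) + [1]
curr_prompt = list(range(1500)) + [2]
print(cacheable_prefix_tokens(prev_prompt, curr_prompt))  # 1408
```

This is also why stable content (system prompt, RAG context) belongs at the start of the prompt: a single differing token early on ends the shared prefix.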

Cost report

$ llmcache report --last 7d

╔══════════════════════════════════════════════════╗
║           MemoLLM Report (last 7 days)           ║
╠══════════════════════════════════════════════════╣
║  Total requests         1,247                    ║
║  ├─ L1 exact hits        312  (25.0%)  <1ms      ║
║  ├─ L2 semantic hits     489  (39.2%)  ~15ms     ║
║  ├─ L3 prefix hints      201  (16.1%)  cheaper $ ║
║  └─ Full misses          245  (19.6%)            ║
╠══════════════════════════════════════════════════╣
║  Tokens saved          842,301                   ║
║  Cost saved              $16.84                  ║
╚══════════════════════════════════════════════════╝

Check savings in code:

summary = client.tracker.summary()
print(f"Hit rate:     {summary.hit_rate * 100:.1f}%")
print(f"Tokens saved: {summary.tokens_saved:,}")
print(f"Cost saved:   ${summary.cost_saved_usd:.4f}")
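For L3 prefix hits the saving depends on provider pricing, so it helps to work through the arithmetic. The sketch below is not MemoLLM's tracker: `prefix_cost_usd` is a hypothetical helper, the $3.00 per million tokens is an example price, and the default multipliers assume Anthropic's published 5-minute cache pricing at the time of writing (1.25x for a cache write, 0.1x for a cache read). It also assumes every call lands while the cache entry is still live.

```python
def prefix_cost_usd(prefix_tokens: int, calls: int, price_per_mtok: float,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in USD for the shared prefix only."""
    if calls <= 0:
        return 0.0, 0.0
    base = prefix_tokens * price_per_mtok / 1_000_000
    uncached = base * calls
    # The first call writes the cache at a premium; the rest read it at a discount.
    cached = base * write_mult + base * read_mult * (calls - 1)
    return uncached, cached


uncached, cached = prefix_cost_usd(prefix_tokens=2_000, calls=10, price_per_mtok=3.00)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, saved {1 - cached / uncached:.1%}")
# uncached $0.0600, cached $0.0129, saved 78.5%
```

With a single call the cached path is more expensive than the uncached one (1.25x versus 1x), so prefix caching only pays off for prompts that are actually reused.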

CLI reference

llmcache report              # full report (all time)
llmcache report --last 24h  # last 24 hours
llmcache report --last 7d   # last 7 days
llmcache report --format json > report.json

llmcache stats               # one-line summary
llmcache patterns            # top wasteful patterns + fix suggestions
llmcache clear               # wipe cache
llmcache config              # show current settings

Backends

Backend   Best for                          Setup required
memory    Testing, Jupyter notebooks        None (lost on restart)
sqlite    Single-process apps (default)     None (file at ~/.memollm/cache.db)
redis     Multiple workers, shared cache    pip install "memollm[redis]"
postgres  Production, analytics, auditing   pip install "memollm[postgres]"
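To give a feel for what the simplest backend involves, here is a sketch of an in-memory store with optional expiry, in the spirit of the memory backend and `default_ttl_seconds`. The `MemoryBackend` class and its `get`/`set` interface are hypothetical; the real backend interface is not documented here.

```python
import time
from typing import Any, Callable, Optional


class MemoryBackend:
    """Dict-backed store with optional per-entry expiry (hypothetical interface)."""

    def __init__(self, default_ttl_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = default_ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data: dict[str, tuple[Any, Optional[float]]] = {}

    def set(self, key: str, value: Any) -> None:
        expires_at = None if self._ttl is None else self._clock() + self._ttl
        self._data[key] = (value, expires_at)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return None
        return value


backend = MemoryBackend(default_ttl_seconds=3600)
backend.set("some-sha256-key", "cached response")
print(backend.get("some-sha256-key"))  # cached response
```

Everything lives in a plain dict, which is why this kind of backend is lost on restart and is not shared between processes.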

FAQ

Does it work with streaming (stream=True)? Yes, in the sense that nothing breaks: streaming calls pass straight through to the API and are not cached. No configuration is needed.

Does it change the response object? No. Cache hits return a proper ChatCompletion (OpenAI) or Message (Anthropic) object — identical to what the API returns. Your existing code needs zero changes.

Does caching affect response quality? L1 is lossless — same prompt, same response. L2 returns a cached response to a semantically equivalent question. At the default threshold of 0.65, BERTScore degradation is <0.3%. Raise the threshold toward 0.90+ if you need stricter accuracy.

Is it thread-safe? The SQLite backend uses WAL mode, which lets concurrent readers proceed while a single writer is active. For high-concurrency multi-process deployments, use Redis or Postgres.
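For readers unfamiliar with WAL mode, this is roughly what enabling it looks like with Python's standard `sqlite3` module. The table layout and the function names are assumptions for illustration, not MemoLLM's actual schema.

```python
import sqlite3
from typing import Optional


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) a cache database with write-ahead logging enabled."""
    conn = sqlite3.connect(path, timeout=5.0)  # wait up to 5 s if the database is locked
    conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def put(conn: sqlite3.Connection, key: str, response: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO cache (key, response) VALUES (?, ?)", (key, response)
        )


def get(conn: sqlite3.Connection, key: str) -> Optional[str]:
    row = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None
```

WAL still allows only one writer at a time, which is the reason a networked backend is the better fit once many processes write to the cache concurrently.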

Does it work with LangChain or LlamaIndex? Yes — wrap the underlying client before passing it in:

from langchain_openai import ChatOpenAI
from memollm import wrap
from openai import OpenAI

cached_openai = wrap(OpenAI())
llm = ChatOpenAI(client=cached_openai.chat.completions)

How is the cache keyed? L1 uses SHA-256 of the normalized prompt (whitespace stripped, keys sorted). L2 uses a 768-dim embedding vector stored alongside the response. Each entry is keyed by its SHA-256 hash in the backend.
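A minimal sketch of how such an L1 key could be derived is shown below. The function names are hypothetical, including the model name in the key is an assumption, and message content is assumed to be a plain string.

```python
import hashlib
import json


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def l1_key(model: str, messages: list[dict]) -> str:
    """SHA-256 over a canonical JSON form of the request."""
    canonical = {
        "model": model,  # assumption: different models never share an entry
        "messages": [
            {"role": m["role"], "content": normalize_text(m["content"])}
            for m in messages
        ],
    }
    # sort_keys makes the hash independent of dict key order.
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = l1_key("gpt-4o", [{"role": "user", "content": "What is RAG?"}])
b = l1_key("gpt-4o", [{"content": "  What is   RAG? ", "role": "user"}])
print(a == b)  # True
```

Message order is deliberately preserved, since reordering a conversation changes its meaning; only key order and whitespace are normalized away.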

What happens if the cache backend is down? Cache failures are silent — the request falls through to the real API. Your app never crashes due to a cache error.
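The fail-open behaviour described above can be sketched as a small wrapper. `cached_call` and its callback parameters are hypothetical names for illustration; unlike a fully silent failure, this version logs a warning so outages remain visible.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("cache_sketch")


def cached_call(key: str,
                cache_get: Callable[[str], Optional[Any]],
                cache_set: Callable[[str, Any], None],
                call_api: Callable[[], Any]) -> Any:
    """Serve from cache when possible; never let a cache error reach the caller."""
    try:
        hit = cache_get(key)
    except Exception:
        logger.warning("cache read failed; calling the API instead", exc_info=True)
        hit = None
    if hit is not None:
        return hit
    response = call_api()  # errors from the real API still propagate as usual
    try:
        cache_set(key, response)
    except Exception:
        logger.warning("cache write failed; response not stored", exc_info=True)
    return response


def broken(*_args: Any) -> None:
    """Stand-in for a backend that is down."""
    raise ConnectionError("backend unreachable")


print(cached_call("k", broken, broken, lambda: "fresh response"))  # fresh response
```

Only cache operations are wrapped in `try`/`except`; exceptions raised by the API call itself are left alone so that real failures are not masked.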


Benchmarks

Full benchmark results are being prepared for the accompanying paper. Preliminary results on HotpotQA and ShareGPT datasets:

Method                            Token reduction  Cost reduction  Quality (BERTScore)
No cache (baseline)               0%               0%              1.000
LangChain CacheBackedEmbeddings   18%              18%             1.000
GPTCache                          31%              31%             0.994
MemoLLM (L1 only)                 25%              25%             1.000
MemoLLM (L1+L2)                   58%              58%             0.997

Architecture

memollm/
├── __init__.py          # public API: wrap(), CacheConfig
├── interceptor.py       # transparent proxy — wraps OpenAI/Anthropic clients
├── cache.py             # L1 → L2 cache hierarchy
├── normalizer.py        # prompt normalization, SHA-256 hashing, acronym expansion
├── embedder.py          # sentence-transformers (local) or OpenAI embeddings
├── differ.py            # L3 prefix detection, Anthropic cache_control injection
├── config.py            # CacheConfig dataclass with all settings
├── backends/
│   ├── memory.py        # in-memory dict (testing/notebooks)
│   ├── sqlite.py        # SQLite WAL (default)
│   ├── redis.py         # Redis (multi-process)
│   └── postgres.py      # PostgreSQL (enterprise)
├── cost/
│   ├── tracker.py       # per-call token and cost accounting
│   ├── reporter.py      # rich terminal report
│   └── providers.json   # pricing table: OpenAI, Anthropic, Ollama models
└── cli/
    └── main.py          # llmcache CLI

Citation

If you use MemoLLM in your research, please cite:

@article{kumar2026memollm,
  title={MemoLLM: A Three-Level Hierarchical Cache for Cost-Efficient LLM Applications},
  author={Kumar, Abhishek},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

Paper in preparation. arXiv link coming soon.


Contributing

PRs welcome. Please open an issue first for significant changes.

License

MIT
