3-level LLM cache (exact, semantic, prefix) with real-time token cost tracking

Maintainers

kr0abhishek

MemoLLM

Cut your LLM API bill by 40–70% with one line of code.

MemoLLM wraps your existing OpenAI or Anthropic client with a 3-level cache. One line to add, zero infrastructure required, works with every model.

# Before
from openai import OpenAI
client = OpenAI()

# After — literally one line change
from memollm import wrap
client = wrap(OpenAI())

# Your code stays exactly the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document..."}]
)
print(response.choices[0].message.content)  # works identically

Why MemoLLM?

Most LLM apps repeat the same or very similar prompts constantly — same questions, same system prompts, same RAG context. You pay for every token every time.

MemoLLM catches three types of repetition:

| Level | What it catches | Saving | API call? |
|---|---|---|---|
| L1 Exact | Identical prompts (same characters) | 100% | No |
| L2 Semantic | Same meaning, different wording | 100% | No |
| L3 Prefix | Same system prompt / context prefix | 25–100% | Yes, but cheaper |
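
To make the "Yes, but cheaper" entry for L3 concrete, here is a small arithmetic sketch. It is not MemoLLM code. The multipliers are assumptions based on Anthropic's published prompt-caching pricing at the time of writing (a cache write billed at about 1.25x the base input price, a cache read at about 0.1x); other providers and later price lists will differ. It also assumes every call lands while the provider's cache entry is still live.

```python
def prefix_saving(calls: int,
                  write_multiplier: float = 1.25,
                  read_multiplier: float = 0.10) -> float:
    """Fraction of prefix-token cost saved over `calls` requests sharing one prefix.

    Both multipliers are assumed values relative to the normal input price.
    """
    uncached = calls * 1.0                                   # every call pays full price
    cached = write_multiplier + (calls - 1) * read_multiplier  # one write, then reads
    return 1 - cached / uncached


for n in (1, 2, 10):
    print(f"{n:>2} calls: {prefix_saving(n):.1%} saved on prefix tokens")
```

Under these assumptions a single call costs slightly more than an uncached one (the write premium), and the saving grows with the number of calls that reuse the prefix. Only the shared prefix is discounted; the rest of the prompt and all output tokens are billed as usual.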

vs GPTCache: GPTCache only does L1+L2. MemoLLM adds L3 — automatic cache_control injection for Anthropic and prefix cache detection for OpenAI — and adds acronym expansion before embedding so "RAG" and "retrieval augmented generation" hit the same cache entry.

Installation

pip install memollm

No database, no Docker, no configuration. Uses SQLite by default — your cache lives at ~/.memollm/cache.db.

pip install memollm[redis]     # Redis backend (multi-process)
pip install memollm[postgres]  # PostgreSQL backend (enterprise)
pip install memollm[all]       # Everything

Usage

OpenAI / GPT-4

from openai import OpenAI
from memollm import wrap

client = wrap(OpenAI())  # OPENAI_API_KEY from environment

# First call — hits the API
response = client.chat.completions.create(
    model="gpt-4o",          # works with gpt-4o, gpt-4, gpt-4-turbo, o1, o3, etc.
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}]
)
print(response.choices[0].message.content)

# Second call — same question, instant cache hit, $0 cost
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}]
)

# Third call — different wording, same meaning → L2 semantic hit
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain how RAG works in AI"}]
    # "RAG" is expanded to "retrieval augmented generation" before embedding
)

Anthropic / Claude

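MemoLLM automatically injects Anthropic's cache_control on long system prompts, so later calls that reuse the prompt while Anthropic's cache entry is still live pay a reduced cache-read rate for the system prompt tokens (about 10% of the base input price at the time of writing) instead of the full price.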

import anthropic
from memollm import wrap

client = wrap(anthropic.Anthropic())  # ANTHROPIC_API_KEY from environment

SYSTEM_PROMPT = """
You are an expert data analyst with deep knowledge of SQL, Python, and statistics.
You help users understand complex datasets and write production-quality analysis code.
Always explain your reasoning step by step before providing code.
[... your long system prompt ...]
"""

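# MemoLLM automatically adds cache_control to the system prompt.
# First call: Anthropic writes the system prompt to its prompt cache
# (billed at a premium, about 1.25x the base input price at the time of writing).
# Later calls within the cache lifetime: system prompt tokens are billed at the
# reduced cache-read rate (about 0.1x the base input price), not for free.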
response = client.messages.create(
    model="claude-opus-4-7",        # works with all Claude models
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Analyse this CSV and find anomalies"}
    ],
    system=SYSTEM_PROMPT,
)
print(response.content[0].text)

# L2 semantic cache: different wording, similar meaning
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Find outliers in my dataset"}
    ],
    system=SYSTEM_PROMPT,
)
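
As a rough illustration of what "injecting cache_control" means, the sketch below rewrites a plain string system prompt into the block form that Anthropic's Messages API accepts, with a `cache_control` marker of type `ephemeral`. The function names `estimate_tokens` and `inject_cache_control` and the 4-characters-per-token estimate are assumptions made for this example; MemoLLM's own logic may differ.

```python
from typing import Any


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (about 4 characters per token); an assumption."""
    return len(text) // 4


def inject_cache_control(kwargs: dict[str, Any],
                         min_prefix_tokens: int = 500) -> dict[str, Any]:
    """Return request kwargs with a long string system prompt marked cacheable.

    Short prompts and already-structured system blocks are returned unchanged.
    """
    system = kwargs.get("system")
    if not isinstance(system, str) or estimate_tokens(system) < min_prefix_tokens:
        return kwargs
    patched = dict(kwargs)  # shallow copy so the caller's dict is not mutated
    patched["system"] = [
        {"type": "text", "text": system, "cache_control": {"type": "ephemeral"}}
    ]
    return patched


request = {
    "model": "claude-opus-4-7",
    "max_tokens": 1024,
    "system": "You are an expert data analyst. " * 100,  # about 800 estimated tokens
    "messages": [{"role": "user", "content": "Analyse this CSV and find anomalies"}],
}
patched = inject_cache_control(request)
print(patched["system"][0]["cache_control"])
```

The provider enforces its own minimum cacheable prompt length, so a marker on a prompt that is too short is accepted by the API but has no caching effect.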

Local Ollama (no API key needed)

from openai import OpenAI
from memollm import wrap

# Ollama uses OpenAI-compatible API
client = wrap(OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # any string works
))

response = client.chat.completions.create(
    model="llama3.1",    # or llama3.2, mistral, codellama, qwen2.5-coder, etc.
    messages=[{"role": "user", "content": "What is RAG?"}]
)
print(response.choices[0].message.content)

Configuration

from openai import OpenAI
from memollm import wrap, CacheConfig

config = CacheConfig(
    # L2 semantic cache settings
    l2_threshold=0.65,            # 0.0–1.0: lower = more hits, less accuracy
                                  # 0.65 recommended for RAG/QA workloads
                                  # 0.90+ for strict accuracy requirements

    # Storage backend
    backend="sqlite",             # "memory" | "sqlite" | "redis" | "postgres"
    backend_url=None,             # required for redis/postgres
    db_path="~/.memollm/cache.db",

    # Cache expiry
    default_ttl_seconds=None,     # None = cache forever, 3600 = expire after 1 hour

    # Embedding model for L2
    l2_embedder="local",          # "local": free, no API key, ~15ms latency
                                  # "openai": better quality, costs $0.02/1M tokens

    # L3 prefix caching
    l3_enabled=True,              # auto-inject cache_control for Anthropic
    l3_min_prefix_tokens=500,     # minimum system prompt length to activate L3
)

client = wrap(OpenAI(), config=config)

How it works

Your LLM call
      │
      ▼
┌──────────────────────────────────────────┐
│  L1: Exact Cache                         │
│  Normalize prompt → SHA-256 hash         │  → instant hit (<1ms), no API call
│  "What is RAG?" == "What is  RAG ?"      │    whitespace/order differences ignored
└──────────────────────────────────────────┘
      │ miss
      ▼
┌──────────────────────────────────────────┐
│  L2: Semantic Cache                      │
│  Expand acronyms → embed → cosine sim    │  → ~15ms hit, no API call
│  "RAG" → "retrieval augmented generation"│    <0.3% quality loss at threshold 0.65
│  cosine similarity > threshold → hit     │
└──────────────────────────────────────────┘
      │ miss
      ▼
┌──────────────────────────────────────────┐
│  L3: Provider Prefix Cache               │
│  Anthropic: inject cache_control blocks  │  → API call made, but system prompt
│  OpenAI: detect prefix cache eligibility │    tokens are billed at a discount on cache read
└──────────────────────────────────────────┘
      │
      ▼
  LLM API call → store result → log savings
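
The L2 step in the diagram (expand acronyms, embed, compare by cosine similarity) can be sketched in a few lines. This is a toy, not MemoLLM's implementation: the bag-of-words `embed` function stands in for a real sentence-embedding model, and `ACRONYMS`, `expand_acronyms`, and `semantic_lookup` are hypothetical names chosen for the example.

```python
import math
import re
from collections import Counter

# Hypothetical acronym table; a real one would be much larger.
ACRONYMS = {"rag": "retrieval augmented generation"}


def expand_acronyms(text: str) -> str:
    """Lowercase, tokenize, and replace known acronyms with their expansions."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ACRONYMS.get(word, word) for word in words)


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(expand_acronyms(text).split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_lookup(query, entries, threshold=0.65):
    """Return the cached response of the most similar prompt, or None on a miss."""
    query_vec = embed(query)
    best_score, best_response = 0.0, None
    for prompt, response in entries:
        score = cosine(query_vec, embed(prompt))
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score > threshold else None


entries = [("What is retrieval augmented generation?",
            "RAG combines retrieval with generation.")]
print(semantic_lookup("What is RAG?", entries))          # hit via acronym expansion
print(semantic_lookup("How do I bake bread?", entries))  # miss
```

With word counts as the vector, only prompts that share most of their words clear the threshold. A real embedding model is what lets paraphrases with little word overlap match, and it is also why the threshold needs tuning per workload.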

Cost report

$ llmcache report --last 7d

╔══════════════════════════════════════════════════╗
║           MemoLLM Report (last 7 days)           ║
╠══════════════════════════════════════════════════╣
║  Total requests         1,247                    ║
║  ├─ L1 exact hits        312  (25.0%)  <1ms      ║
║  ├─ L2 semantic hits     489  (39.2%)  ~15ms     ║
║  ├─ L3 prefix hints      201  (16.1%)  cheaper $ ║
║  └─ Full misses          245  (19.6%)            ║
╠══════════════════════════════════════════════════╣
║  Tokens saved          842,301                   ║
║  Cost saved              $16.84                  ║
╚══════════════════════════════════════════════════╝

Check savings in code:

summary = client.tracker.summary()
print(f"Hit rate:     {summary.hit_rate * 100:.1f}%")
print(f"Tokens saved: {summary.tokens_saved:,}")
print(f"Cost saved:   ${summary.cost_saved_usd:.4f}")
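
The percentages in the sample report follow directly from the per-level counts. The sketch below recomputes them; it assumes (this is not confirmed by the source) that the hit rate counts only L1 and L2 hits, since L3 requests still reach the API.

```python
# Counts taken from the sample report above.
counts = {"l1_exact": 312, "l2_semantic": 489, "l3_prefix": 201, "miss": 245}

total = sum(counts.values())
shares = {level: round(100 * n / total, 1) for level, n in counts.items()}

# Assumption: only L1 and L2 avoid the API call entirely.
hit_rate = (counts["l1_exact"] + counts["l2_semantic"]) / total

print(total)                    # 1247
print(shares)
print(f"{hit_rate * 100:.1f}%")  # 64.2%
```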

CLI reference

llmcache report              # full report (all time)
llmcache report --last 24h  # last 24 hours
llmcache report --last 7d   # last 7 days
llmcache report --format json > report.json

llmcache stats               # one-line summary
llmcache patterns            # top wasteful patterns + fix suggestions
llmcache clear               # wipe cache
llmcache config              # show current settings

Backends

| Backend | Best for | Setup required |
|---|---|---|
| `memory` | Testing, Jupyter notebooks | None (lost on restart) |
| `sqlite` | Single-process apps (default) | None (file at `~/.memollm/cache.db`) |
| `redis` | Multiple workers, shared cache | `pip install memollm[redis]` |
| `postgres` | Production, analytics, auditing | `pip install memollm[postgres]` |
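
All four backends fill the same role: store a response under a key and return it until it expires. The sketch below shows one way such an interface and an in-memory backend with TTL support could look. `CacheBackend`, `MemoryBackend`, and their method signatures are hypothetical and are not taken from MemoLLM's source.

```python
import time
from typing import Callable, Optional, Protocol


class CacheBackend(Protocol):
    """Hypothetical minimal interface shared by all storage backends."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str,
            ttl_seconds: Optional[float] = None) -> None: ...


class MemoryBackend:
    """In-memory dict backend; contents are lost when the process exits."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data: dict[str, tuple[str, Optional[float]]] = {}

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # drop the expired entry lazily on read
            return None
        return value

    def set(self, key: str, value: str,
            ttl_seconds: Optional[float] = None) -> None:
        # ttl_seconds=None means the entry never expires.
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)


now = [0.0]  # fake clock for the demonstration
backend = MemoryBackend(clock=lambda: now[0])
backend.set("k", "cached response", ttl_seconds=3600)
print(backend.get("k"))  # still fresh
now[0] = 3601.0
print(backend.get("k"))  # expired after one hour
```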

FAQ

Does it work with streaming (stream=True)? Yes, but without caching: streaming calls pass straight through to the provider. No crash, no configuration needed.
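
The pass-through behaviour can be pictured as a thin wrapper around a `create`-style callable. The class below is an illustrative sketch only: `CachingCreate` and `fake_create` are made-up names, and the cache key here is a simple `repr` of the arguments rather than MemoLLM's normalized hash.

```python
from typing import Any, Callable


class CachingCreate:
    """Wraps a `create`-style callable; streaming requests bypass the cache."""

    def __init__(self, create: Callable[..., Any]) -> None:
        self._create = create
        self._cache: dict[str, Any] = {}

    def __call__(self, **kwargs: Any) -> Any:
        if kwargs.get("stream"):
            return self._create(**kwargs)  # pass through, never cached
        key = repr(sorted(kwargs.items()))  # argument names are unique, so this is stable
        if key not in self._cache:
            self._cache[key] = self._create(**kwargs)
        return self._cache[key]


calls = []


def fake_create(**kwargs: Any) -> str:
    """Stand-in for a provider call; records every invocation."""
    calls.append(kwargs)
    return f"response {len(calls)}"


create = CachingCreate(fake_create)
msg = [{"role": "user", "content": "What is RAG?"}]
first = create(model="gpt-4o", messages=msg)
second = create(model="gpt-4o", messages=msg)                  # served from cache
streamed_a = create(model="gpt-4o", messages=msg, stream=True)  # hits the API
streamed_b = create(model="gpt-4o", messages=msg, stream=True)  # hits the API again
print(first == second, len(calls))
```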

Does it change the response object? No. Cache hits return a proper ChatCompletion (OpenAI) or Message (Anthropic) object — identical to what the API returns. Your existing code needs zero changes.

Does caching affect response quality? L1 is lossless — same prompt, same response. L2 returns a cached response to a semantically equivalent question. At the default threshold of 0.65, BERTScore degradation is <0.3%. Raise the threshold toward 0.90+ if you need stricter accuracy.

Is it thread-safe? SQLite uses WAL mode and handles concurrent reads safely. For high-concurrency multi-process deployments, use Redis or Postgres.

Does it work with LangChain or LlamaIndex? Yes — wrap the underlying client before passing it in:

from langchain_openai import ChatOpenAI
from memollm import wrap
from openai import OpenAI

cached_openai = wrap(OpenAI())
llm = ChatOpenAI(client=cached_openai.chat.completions)

How is the cache keyed? L1 uses SHA-256 of the normalized prompt (whitespace stripped, keys sorted). L2 uses a 768-dim embedding vector stored alongside the response. Each entry is keyed by its SHA-256 hash in the backend.
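
A minimal sketch of an L1-style key, assuming string message contents and assuming the model name is part of the key (the source does not say which request fields are hashed). `normalize_text` and `l1_key` are hypothetical names; here "whitespace stripped" is interpreted as collapsing runs of whitespace and trimming the ends.

```python
import hashlib
import json
import re


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def l1_key(model: str, messages: list[dict[str, str]]) -> str:
    """SHA-256 hex digest of a canonical JSON form of the request."""
    normalized = [
        {"role": m["role"], "content": normalize_text(m["content"])}
        for m in messages
    ]
    payload = json.dumps(
        {"model": model, "messages": normalized},
        sort_keys=True,          # key order in the input dicts does not matter
        separators=(",", ":"),   # no incidental whitespace in the serialized form
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = l1_key("gpt-4o", [{"role": "user", "content": "What is RAG?"}])
b = l1_key("gpt-4o", [{"content": "  What  is RAG?\n", "role": "user"}])
print(a == b)  # same key despite whitespace and key-order differences
```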

What happens if the cache backend is down? Cache failures are silent — the request falls through to the real API. Your app never crashes due to a cache error.
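
This "fail open" behaviour is a common pattern: treat any cache error as a miss and continue to the real call. The sketch below shows the idea with hypothetical names (`cached_call`, `broken_get`, `broken_set`); it is not MemoLLM's code.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("cache_sketch")


def cached_call(key: str,
                cache_get: Callable[[str], Optional[str]],
                cache_set: Callable[[str, str], None],
                call_api: Callable[[], str]) -> str:
    """Look up `key`; on a miss or any cache error, fall through to the API."""
    try:
        hit = cache_get(key)
    except Exception:
        logger.debug("cache read failed; treating as a miss", exc_info=True)
        hit = None
    if hit is not None:
        return hit

    response = call_api()
    try:
        cache_set(key, response)
    except Exception:
        logger.debug("cache write failed; response returned uncached", exc_info=True)
    return response


def broken_get(key: str) -> Optional[str]:
    raise ConnectionError("backend unreachable")


def broken_set(key: str, value: str) -> None:
    raise ConnectionError("backend unreachable")


result = cached_call("k", broken_get, broken_set, lambda: "fresh response")
print(result)  # the API result is returned even though the cache is down
```

Logging at debug level keeps the failure quiet for the caller while still leaving a trace for anyone diagnosing a cache that never hits.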

Benchmarks

Full benchmark results are being prepared for the accompanying paper. Preliminary results on HotpotQA and ShareGPT datasets:

| Method | Token reduction | Cost reduction | Quality (BERTScore) |
|---|---|---|---|
| No cache (baseline) | 0% | 0% | 1.000 |
| LangChain CacheBackedEmbeddings | 18% | 18% | 1.000 |
| GPTCache | 31% | 31% | 0.994 |
| MemoLLM (L1 only) | 25% | 25% | 1.000 |
| MemoLLM (L1+L2) | 58% | 58% | 0.997 |

Architecture

memollm/
├── __init__.py          # public API: wrap(), CacheConfig
├── interceptor.py       # transparent proxy — wraps OpenAI/Anthropic clients
├── cache.py             # L1 → L2 cache hierarchy
├── normalizer.py        # prompt normalization, SHA-256 hashing, acronym expansion
├── embedder.py          # sentence-transformers (local) or OpenAI embeddings
├── differ.py            # L3 prefix detection, Anthropic cache_control injection
├── config.py            # CacheConfig dataclass with all settings
├── backends/
│   ├── memory.py        # in-memory dict (testing/notebooks)
│   ├── sqlite.py        # SQLite WAL (default)
│   ├── redis.py         # Redis (multi-process)
│   └── postgres.py      # PostgreSQL (enterprise)
├── cost/
│   ├── tracker.py       # per-call token and cost accounting
│   ├── reporter.py      # rich terminal report
│   └── providers.json   # pricing table: OpenAI, Anthropic, Ollama models
└── cli/
    └── main.py          # llmcache CLI

Citation

If you use MemoLLM in your research, please cite:

@article{kumar2026memollm,
  title={MemoLLM: A Three-Level Hierarchical Cache for Cost-Efficient LLM Applications},
  author={Kumar, Abhishek},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

Paper in preparation. arXiv link coming soon.

Contributing

PRs welcome. Please open an issue first for significant changes.

License

MIT

Release history

This version: 0.1.0 (May 27, 2026)

Download files

Source Distribution

memollm-0.1.0.tar.gz (32.4 kB)

Uploaded May 27, 2026 Source

Built Distribution

memollm-0.1.0-py3-none-any.whl (32.7 kB)

Uploaded May 27, 2026 Python 3

File details

Details for the file memollm-0.1.0.tar.gz.

File metadata

Download URL: memollm-0.1.0.tar.gz
Upload date: May 27, 2026
Size: 32.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for memollm-0.1.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | `d648a9b5e412042c2019e215f3364b7f500757d41370ff9b78c58d542e270086` |
| MD5 | `b4ae13cb80c6ceb66d683ea49e88b99f` |
| BLAKE2b-256 | `cdceb4dfda857b08af041e28b49cb49209e1a340a08d0b4088f01be59cf8bd4a` |

Provenance

The following attestation bundles were made for memollm-0.1.0.tar.gz:

Publisher: publish.yml on abhi-singh-123/memollm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: memollm-0.1.0.tar.gz
- Subject digest: d648a9b5e412042c2019e215f3364b7f500757d41370ff9b78c58d542e270086
- Sigstore transparency entry: 1637009550
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: abhi-singh-123/memollm@70211b4861000ef6345e1edc41d4fb22b3151714
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/abhi-singh-123
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@70211b4861000ef6345e1edc41d4fb22b3151714
- Trigger Event: push

File details

Details for the file memollm-0.1.0-py3-none-any.whl.

File metadata

Download URL: memollm-0.1.0-py3-none-any.whl
Upload date: May 27, 2026
Size: 32.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for memollm-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | `011b4dd352b8c1edb9e03f47b9bca0fa0eb42ca865836826655978fece7c8265` |
| MD5 | `a6f3db4cf341d64d3267b88b02025729` |
| BLAKE2b-256 | `6ffa4fc02222c6ec7ad133607e584395f3028b5490fb3b771d796c88b0ce66e2` |

Provenance

The following attestation bundles were made for memollm-0.1.0-py3-none-any.whl:

Publisher: publish.yml on abhi-singh-123/memollm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: memollm-0.1.0-py3-none-any.whl
- Subject digest: 011b4dd352b8c1edb9e03f47b9bca0fa0eb42ca865836826655978fece7c8265
- Sigstore transparency entry: 1637009636
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: abhi-singh-123/memollm@70211b4861000ef6345e1edc41d4fb22b3151714
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/abhi-singh-123
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@70211b4861000ef6345e1edc41d4fb22b3151714
- Trigger Event: push
