3-level LLM cache (exact, semantic, partial) with real-time token cost tracking
RagCache
Cut your LLM API bill by 40–70% with one line of code.
RagCache wraps your existing OpenAI or Anthropic client with a 3-level cache. One line to add, zero infrastructure required, works with every model.
# Before
from openai import OpenAI
client = OpenAI()
# After — literally one line change
from ragcache import wrap
client = wrap(OpenAI())
# Your code stays exactly the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document..."}],
)
print(response.choices[0].message.content) # works identically
Why RagCache?
Most LLM apps repeat the same or very similar prompts constantly — same questions, same system prompts, same RAG context. You pay for every token every time.
RagCache catches three types of repetition:
| Level | What it catches | Saving | API call? |
|---|---|---|---|
| L1 Exact | Identical prompts (same characters) | 100% | No |
| L2 Semantic | Same meaning, different wording | 100% | No |
| L3 Prefix | Same system prompt / context prefix | Roughly 50–90% of the prefix token cost, depending on provider | Yes, but cheaper |
vs GPTCache: GPTCache only does L1+L2. RagCache adds L3 — automatic cache_control injection for Anthropic and prefix cache detection for OpenAI — and adds acronym expansion before embedding so "RAG" and "retrieval augmented generation" hit the same cache entry.
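To get a feel for what provider-side prefix caching (L3) can save, here is a back-of-envelope sketch. It assumes Anthropic's published multipliers for the default 5-minute cache (cache writes at 1.25x the base input price, cache reads at 0.1x) and that every call lands inside the cache lifetime. The function name, the $3.00 per million tokens base price, and the token counts are illustrative assumptions, not values taken from RagCache.

```python
def prefix_cache_cost(prefix_tokens, calls, base_price_per_mtok,
                      write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached_cost, cached_cost) in dollars for the prefix tokens only.

    Assumes calls >= 1 and that the first call writes the cache while all
    later calls read from it before it expires.
    """
    per_token = base_price_per_mtok / 1_000_000
    uncached = prefix_tokens * calls * per_token
    cached = prefix_tokens * per_token * (
        write_multiplier + read_multiplier * (calls - 1)
    )
    return uncached, cached


# Illustrative numbers: a 2,000-token system prompt reused across 20 calls.
uncached, cached = prefix_cache_cost(
    prefix_tokens=2_000, calls=20, base_price_per_mtok=3.00
)
saving = 1 - cached / uncached
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, saving {saving:.0%}")
```

With a single call the cached path is more expensive than the uncached one, because the cache write carries a premium; the saving only appears once the prefix is reused.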
Installation
pip install ragcache
No database server, no Docker, no configuration. Uses SQLite by default, with the cache file at ~/.ragcache/cache.db.
pip install ragcache[redis] # Redis backend (multi-process)
pip install ragcache[postgres] # PostgreSQL backend (enterprise)
pip install ragcache[all] # Everything
Usage
OpenAI / GPT-4
from openai import OpenAI
from ragcache import wrap
client = wrap(OpenAI()) # OPENAI_API_KEY from environment
# First call — hits the API
response = client.chat.completions.create(
    model="gpt-4o",  # works with gpt-4o, gpt-4, gpt-4-turbo, o1, o3, etc.
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}],
)
print(response.choices[0].message.content)
# Second call — same question, instant cache hit, $0 cost
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is retrieval augmented generation?"}],
)
# Third call — different wording, same meaning → L2 semantic hit
response = client.chat.completions.create(
    model="gpt-4o",
    # "RAG" is expanded to "retrieval augmented generation" before embedding
    messages=[{"role": "user", "content": "Explain how RAG works in AI"}],
)
Anthropic / Claude
RagCache automatically injects Anthropic's cache_control on long system prompts, so later calls made within the cache lifetime pay about 10% of the normal input price for the system prompt tokens.
import anthropic
from ragcache import wrap
client = wrap(anthropic.Anthropic()) # ANTHROPIC_API_KEY from environment
SYSTEM_PROMPT = """
You are an expert data analyst with deep knowledge of SQL, Python, and statistics.
You help users understand complex datasets and write production-quality analysis code.
Always explain your reasoning step by step before providing code.
[... your long system prompt ...]
"""
# RagCache automatically adds cache_control to the system prompt.
# First call: Anthropic writes the system prompt to its prompt cache
# (billed at 1.25x the base input price).
# Later calls within the cache lifetime (5 minutes by default): system prompt
# tokens are billed at 0.1x the base input price.
response = client.messages.create(
    model="claude-opus-4-7",  # works with all Claude models
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Analyse this CSV and find anomalies"}
    ],
    system=SYSTEM_PROMPT,
)
print(response.content[0].text)
# L2 semantic cache: different wording, same meaning
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Find outliers in my dataset"}
    ],
    system=SYSTEM_PROMPT,
)
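For reference, this is roughly what a cache_control injection looks like at the Anthropic API level. It is a minimal sketch of the request shape, not RagCache's actual code: the helper name `with_cached_system_prompt` is hypothetical, while the block structure (`type`, `text`, `cache_control` with `{"type": "ephemeral"}`) follows Anthropic's Messages API.

```python
def with_cached_system_prompt(system_prompt: str) -> list[dict]:
    """Turn a plain system string into a one-block list marked as cacheable.

    Hypothetical helper for illustration; the dict layout is the one the
    Anthropic Messages API accepts for the `system` parameter.
    """
    return [
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ]


request_kwargs = {
    "model": "claude-opus-4-7",
    "max_tokens": 1024,
    "system": with_cached_system_prompt("You are an expert data analyst..."),
    "messages": [{"role": "user", "content": "Find outliers in my dataset"}],
}
# With a real client this would be sent as:
#   client.messages.create(**request_kwargs)
print(request_kwargs["system"][0]["cache_control"])
```

To confirm that provider-side caching actually happened, inspect `response.usage.cache_creation_input_tokens` and `response.usage.cache_read_input_tokens` on the returned message. Anthropic only caches prefixes above a model-specific minimum length, so short system prompts will show zero for both.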
Local Ollama (no API key needed)
from openai import OpenAI
from ragcache import wrap
# Ollama uses OpenAI-compatible API
client = wrap(OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any string works
))
response = client.chat.completions.create(
    model="llama3.1",  # or llama3.2, mistral, codellama, qwen2.5-coder, etc.
    messages=[{"role": "user", "content": "What is RAG?"}],
)
print(response.choices[0].message.content)
Configuration
from ragcache import wrap, CacheConfig
config = CacheConfig(
    # L2 semantic cache settings
    l2_threshold=0.65,            # 0.0–1.0: lower = more hits, less accuracy
                                  # 0.65 recommended for RAG/QA workloads
                                  # 0.90+ for strict accuracy requirements
    # Storage backend
    backend="sqlite",             # "memory" | "sqlite" | "redis" | "postgres"
    backend_url=None,             # required for redis/postgres
    db_path="~/.ragcache/cache.db",
    # Cache expiry
    default_ttl_seconds=None,     # None = cache forever, 3600 = expire after 1 hour
    # Embedding model for L2
    l2_embedder="local",          # "local": free, no API key, ~15ms latency
                                  # "openai": better quality, costs $0.02/1M tokens
    # L3 prefix caching
    l3_enabled=True,              # auto-inject cache_control for Anthropic
    l3_min_prefix_tokens=500,     # minimum system prompt length to activate L3
)
client = wrap(OpenAI(), config=config)
How it works
Your LLM call
│
▼
┌──────────────────────────────────────────┐
│ L1: Exact Cache │
│ Normalize prompt → SHA-256 hash │ → instant hit (<1ms), no API call
│ "What is RAG?" == "What is RAG ?" │ whitespace/order differences ignored
└──────────────────────────────────────────┘
│ miss
▼
┌──────────────────────────────────────────┐
│ L2: Semantic Cache │
│ Expand acronyms → embed → cosine sim │ → ~15ms hit, no API call
│ "RAG" → "retrieval augmented generation"│ <0.3% quality loss at threshold 0.65
│ cosine similarity > threshold → hit │
└──────────────────────────────────────────┘
│ miss
▼
┌──────────────────────────────────────────┐
│ L3: Provider Prefix Cache │
│ Anthropic: inject cache_control blocks │ → API call made, but system prompt
│ OpenAI: detect prefix cache eligibility │ tokens are billed at a discounted rate on cache read
└──────────────────────────────────────────┘
│
▼
LLM API call → store result → log savings
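The L1 then L2 lookup order in the diagram can be sketched in a few lines. This is a toy model under stated assumptions, not RagCache's implementation: the embedding is a bag-of-words count instead of a real sentence embedding, the acronym table has a single hypothetical entry, and the normalization rules (lowercasing, collapsing whitespace) are guesses at what "normalize" means here.

```python
import hashlib
import math
import re
from collections import Counter

ACRONYMS = {"rag": "retrieval augmented generation"}  # hypothetical table


def normalize(prompt):
    # Lowercase, collapse runs of whitespace, drop spaces before punctuation.
    text = re.sub(r"\s+", " ", prompt.strip().lower())
    return re.sub(r" ([?.!,])", r"\1", text)


def expand_acronyms(text):
    words = re.findall(r"[a-z0-9]+", text)
    return " ".join(ACRONYMS.get(word, word) for word in words)


def embed(text):
    # Toy stand-in for an embedding model: bag-of-words counts.
    return Counter(text.split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TwoLevelCache:
    def __init__(self, threshold=0.65):
        self.threshold = threshold
        self.exact = {}      # SHA-256 hex digest -> response
        self.semantic = []   # list of (embedding, response)

    def put(self, prompt, response):
        norm = normalize(prompt)
        self.exact[hashlib.sha256(norm.encode("utf-8")).hexdigest()] = response
        self.semantic.append((embed(expand_acronyms(norm)), response))

    def get(self, prompt):
        norm = normalize(prompt)
        key = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if key in self.exact:                        # L1: exact match
            return "L1", self.exact[key]
        query = embed(expand_acronyms(norm))         # L2: nearest neighbour
        best_score, best_response = 0.0, None
        for vector, response in self.semantic:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score > self.threshold:
            return "L2", best_response
        return "miss", None


cache = TwoLevelCache(threshold=0.65)
cache.put("What is retrieval augmented generation?", "RAG answer")
exact_hit = cache.get("What  is retrieval augmented generation ?")
semantic_hit = cache.get("What is RAG?")
miss = cache.get("How do I bake bread?")
print(exact_hit[0], semantic_hit[0], miss[0])
```

A linear scan over stored vectors is fine for a sketch; a real deployment with many entries would normally use a vector index instead.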
Cost report
$ ragcache report --last 7d
╔══════════════════════════════════════════════════╗
║ RagCache Report (last 7 days) ║
╠══════════════════════════════════════════════════╣
║ Total requests 1,247 ║
║ ├─ L1 exact hits 312 (25.0%) <1ms ║
║ ├─ L2 semantic hits 489 (39.2%) ~15ms ║
║ ├─ L3 prefix hints 201 (16.1%) cheaper $ ║
║ └─ Full misses 245 (19.6%) ║
╠══════════════════════════════════════════════════╣
║ Tokens saved 842,301 ║
║ Cost saved $16.84 ║
╚══════════════════════════════════════════════════╝
Check savings in code:
summary = client.tracker.summary()
print(f"Hit rate: {summary.hit_rate * 100:.1f}%")
print(f"Tokens saved: {summary.tokens_saved:,}")
print(f"Cost saved: ${summary.cost_saved_usd:.4f}")
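As a sanity check, the figures in the sample report can be recomputed by hand. The source does not say exactly how `hit_rate` is defined, so this sketch assumes that only L1 and L2 count as hits, since those are the only levels that skip the API call. The variable names are illustrative.

```python
# Counts copied from the sample 7-day report above.
counts = {"l1_exact": 312, "l2_semantic": 489, "l3_prefix": 201, "miss": 245}
total = sum(counts.values())

# Assumption: only L1 and L2 are hits, because only they avoid the API call.
hit_rate = (counts["l1_exact"] + counts["l2_semantic"]) / total

# Average value of a saved token implied by the report totals.
tokens_saved = 842_301
cost_saved_usd = 16.84
usd_per_million_saved = cost_saved_usd / tokens_saved * 1_000_000

print(f"Total requests: {total:,}")
print(f"Hit rate (L1 + L2): {hit_rate * 100:.1f}%")
print(f"Implied saving per 1M tokens: ${usd_per_million_saved:.2f}")
```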
CLI reference
ragcache report # full report (all time)
ragcache report --last 24h # last 24 hours
ragcache report --last 7d # last 7 days
ragcache report --format json > report.json
ragcache stats # one-line summary
ragcache patterns # top wasteful patterns + fix suggestions
ragcache clear # wipe cache
ragcache config # show current settings
Backends
| Backend | Best for | Setup required |
|---|---|---|
| memory | Testing, Jupyter notebooks | None (lost on restart) |
| sqlite | Single-process apps (default) | None (file at ~/.ragcache/cache.db) |
| redis | Multiple workers, shared cache | pip install ragcache[redis] |
| postgres | Production, analytics, auditing | pip install ragcache[postgres] |
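All four backends presumably expose the same small get/set surface. The sketch below shows one way such an interface and the memory backend could look, including TTL expiry in the spirit of `default_ttl_seconds`. The `CacheBackend` protocol, the `MemoryBackend` class, and their method signatures are assumptions for illustration and may not match the classes in ragcache/backends/.

```python
import time
from typing import Optional, Protocol


class CacheBackend(Protocol):
    """Hypothetical minimal interface a storage backend would satisfy."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str,
            ttl_seconds: Optional[float] = None) -> None: ...


class MemoryBackend:
    """In-process dict store; contents are lost when the process exits."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested
        self._items = {}      # key -> (value, expires_at or None)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]   # lazily evict expired entries on read
            return None
        return value

    def set(self, key, value, ttl_seconds=None):
        # ttl_seconds=None means the entry never expires.
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._items[key] = (value, expires_at)


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
backend = MemoryBackend(clock=lambda: now[0])
backend.set("short", "expires soon", ttl_seconds=10)
backend.set("forever", "never expires")
before = backend.get("short")
now[0] = 11.0
after = backend.get("short")
print(before, after, backend.get("forever"))
```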
FAQ
Does it work with streaming (stream=True)?
Streaming calls work but are not cached: they pass straight through to the provider. They do not raise errors and need no configuration.
Does it change the response object?
No. Cache hits return a proper ChatCompletion (OpenAI) or Message (Anthropic) object — identical to what the API returns. Your existing code needs zero changes.
Does caching affect response quality?
L1 is lossless: same prompt, same response. L2 returns a cached response to a semantically equivalent question. At the default threshold of 0.65, BERTScore degradation is <0.3%. Raise the threshold toward 0.90+ if you need stricter accuracy.
Is it thread-safe?
SQLite uses WAL mode and handles concurrent reads safely. For high-concurrency multi-process deployments, use Redis or Postgres.
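For readers unfamiliar with WAL mode, this is how it is enabled with Python's standard sqlite3 module. It is a minimal sketch: the `entries` table and its columns are made up for the example and are not RagCache's actual schema.

```python
import os
import sqlite3
import tempfile


def open_cache_db(path):
    """Open a SQLite file in WAL mode and make sure the demo table exists."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    # The pragma returns the journal mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "key TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    conn.commit()
    return conn, mode


with tempfile.TemporaryDirectory() as tmp:
    conn, mode = open_cache_db(os.path.join(tmp, "cache.db"))
    conn.execute(
        "INSERT OR REPLACE INTO entries VALUES (?, ?)", ("abc", "cached answer")
    )
    conn.commit()
    row = conn.execute(
        "SELECT response FROM entries WHERE key = ?", ("abc",)
    ).fetchone()
    conn.close()

print(mode, row)
```

WAL still allows only one writer at a time, which is why a shared Redis or Postgres backend is the better fit once several worker processes write to the cache.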
Does it work with LangChain or LlamaIndex?
Yes. Wrap the underlying client before passing it in:
from langchain_openai import ChatOpenAI
from ragcache import wrap
from openai import OpenAI
cached_openai = wrap(OpenAI())
llm = ChatOpenAI(client=cached_openai.chat.completions)
How is the cache keyed?
L1 uses SHA-256 of the normalized prompt (whitespace stripped, keys sorted). L2 uses a 768-dim embedding vector stored alongside the response. Each entry is keyed by its SHA-256 hash in the backend.
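A key derivation along those lines might look like the sketch below. The function name `l1_key`, the decision to include the model name in the key, and the exact whitespace rules are assumptions; only the general recipe (normalize, sort keys, SHA-256) comes from the description above.

```python
import hashlib
import json
import re


def l1_key(model, messages):
    """Hash a request so that key order and extra whitespace do not matter."""
    normalized = [
        {
            field: re.sub(r"\s+", " ", value).strip()
            if isinstance(value, str) else value
            for field, value in message.items()
        }
        for message in messages
    ]
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(
        {"model": model, "messages": normalized},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = l1_key("gpt-4o", [{"role": "user", "content": "What is RAG?"}])
key_b = l1_key("gpt-4o", [{"content": "  What   is RAG? ", "role": "user"}])
key_c = l1_key("gpt-4", [{"role": "user", "content": "What is RAG?"}])
print(key_a == key_b, key_a == key_c)
```

Including the model in the key keeps a response generated by one model from being served for a request to another.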
What happens if the cache backend is down?
Cache failures are silent: the request falls through to the real API. Your app never crashes due to a cache error.
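This fail-open behaviour is a common pattern and can be sketched as follows. The names `cached_call` and `BrokenBackend` are hypothetical, and the real interceptor may structure its error handling differently.

```python
import logging

logger = logging.getLogger("cache_sketch")


def cached_call(backend, key, call_api):
    """Return a cached response if possible; never let cache errors escape."""
    try:
        hit = backend.get(key)
        if hit is not None:
            return hit
    except Exception:
        # Cache read failed: log quietly and fall through to the API.
        logger.debug("cache read failed, calling the API", exc_info=True)

    response = call_api()

    try:
        backend.set(key, response)
    except Exception:
        # Cache write failed: the caller still gets the fresh response.
        logger.debug("cache write failed, response not stored", exc_info=True)
    return response


class BrokenBackend:
    """Simulates a backend whose server is unreachable."""

    def get(self, key):
        raise ConnectionError("cache down")

    def set(self, key, value):
        raise ConnectionError("cache down")


result = cached_call(BrokenBackend(), "some-key", lambda: "fresh answer")
print(result)
```

Errors raised by `call_api` itself are deliberately not caught, so genuine provider failures still surface to the application.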
Benchmarks
Full benchmark results are being prepared for the accompanying paper. Preliminary results on HotpotQA and ShareGPT datasets:
| Method | Token reduction | Cost reduction | Quality (BERTScore) |
|---|---|---|---|
| No cache (baseline) | 0% | 0% | 1.000 |
| LangChain CacheBackedEmbeddings | 18% | 18% | 1.000 |
| GPTCache | 31% | 31% | 0.994 |
| RagCache (L1 only) | 25% | 25% | 1.000 |
| RagCache (L1+L2) | 58% | 58% | 0.997 |
Architecture
ragcache/
├── __init__.py # public API: wrap(), CacheConfig
├── interceptor.py # transparent proxy — wraps OpenAI/Anthropic clients
├── cache.py # L1 → L2 cache hierarchy
├── normalizer.py # prompt normalization, SHA-256 hashing, acronym expansion
├── embedder.py # sentence-transformers (local) or OpenAI embeddings
├── differ.py # L3 prefix detection, Anthropic cache_control injection
├── config.py # CacheConfig dataclass with all settings
├── backends/
│ ├── memory.py # in-memory dict (testing/notebooks)
│ ├── sqlite.py # SQLite WAL (default)
│ ├── redis.py # Redis (multi-process)
│ └── postgres.py # PostgreSQL (enterprise)
├── cost/
│ ├── tracker.py # per-call token and cost accounting
│ ├── reporter.py # rich terminal report
│ └── providers.json # pricing table: OpenAI, Anthropic, Ollama models
└── cli/
└── main.py # ragcache CLI
Citation
If you use RagCache in your research, please cite:
@article{kumar2026ragcache,
title={RagCache: A Three-Level Hierarchical Cache for Cost-Efficient LLM Applications},
author={Kumar, Abhishek},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2026}
}
Paper in preparation. arXiv link coming soon.
Contributing
PRs welcome. Please open an issue first for significant changes.
License
MIT