
Semantic caching layer for LLM applications — stop paying for the same call twice.

Project description

ThriftLM

Semantic cache layer for LLM applications. Redis-fast exact hits. Numpy-powered near-miss matching. PII-scrubbed by default.

PyPI version Python 3.10+ License: MIT

pip install thriftlm

Why ThriftLM

Every repeated or semantically similar LLM query burns tokens and adds latency. ThriftLM intercepts these calls with a three-tier cache — exact hash match in Redis, cosine similarity search in a local numpy index, and HNSW vector search in Supabase — before any request reaches your LLM provider.

73.5% hit rate at threshold=0.82 on the Quora Question Pairs benchmark. The median semantic cache hit returns in ~1ms vs. 2–12 seconds for a live LLM call.
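A back-of-the-envelope sketch of what that hit rate means for latency. The 5 s LLM figure is an assumed midpoint of the 2–12 s range, not a measured number:

```python
# Expected per-query latency at the benchmarked 73.5% hit rate, assuming
# ~1 ms for a semantic cache hit and an (assumed) 5 s average live LLM call.
hit_rate = 0.735
cache_ms = 1.0
llm_ms = 5000.0

expected_ms = hit_rate * cache_ms + (1 - hit_rate) * llm_ms
speedup = llm_ms / expected_ms
print(f"expected latency: {expected_ms:.0f} ms ({speedup:.1f}x faster)")
```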


How It Works

query
  │
  ▼
┌─────────────────┐     HIT → return instantly (~0.5ms)
│   Redis         │
│  (exact hash)   │
└────────┬────────┘
         │ MISS
         ▼
┌─────────────────┐     HIT → Supabase PK fetch → return (~50ms)
│  Local Numpy    │
│  Index (cosine) │
└────────┬────────┘
         │ MISS
         ▼
┌─────────────────┐
│   LLM Call      │     Your llm_fn() called here
│  (your function)│
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  PII Scrubbing  │     Presidio strips names, emails, phone numbers
│  (response only)│
└────────┬────────┘
         │
         ▼
   Store in Supabase + LocalIndex + Redis

Cache hit order:

  1. Redis — exact embedding hash, microseconds, no DB call
  2. Local numpy index — cosine similarity matmul, ~1ms, Supabase PK fetch for response
  3. LLM — cache miss only, full latency, stored after Presidio scrub
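The three tiers above can be sketched as a single lookup function. This is a minimal illustration: the backend objects, their method names, and `three_tier_lookup` itself are hypothetical, not ThriftLM's actual internals.

```python
import hashlib

import numpy as np


def three_tier_lookup(query, embed_fn, llm_fn, scrub_fn, redis, index, db,
                      threshold=0.85):
    """Illustrative three-tier cache lookup (hypothetical internals)."""
    emb = embed_fn(query)                        # 384-dim float32 vector
    key = hashlib.sha256(emb.tobytes()).hexdigest()

    # Tier 1: exact embedding hash in Redis, no DB round trip.
    hit = redis.get(key)
    if hit is not None:
        return hit

    # Tier 2: cosine search over the in-process numpy index.
    match = index.search(emb, threshold)
    if match is not None:
        entry_id, _sim = match
        response = db.fetch_by_pk(entry_id)      # PK fetch for the response
        redis.set(key, response)                 # promote to tier 1
        return response

    # Tier 3: cache miss, full-latency LLM call, scrubbed before storage.
    response = scrub_fn(llm_fn(query))
    db.store(query, response, emb)
    index.add(key, emb)
    redis.set(key, response)
    return response
```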

Quickstart

Prerequisites

  • Python 3.10+
  • Supabase project with pgvector enabled
  • Redis (local via Docker or Upstash)

1. Install

pip install thriftlm

2. Set up Supabase

Run supabase/setup.sql in your Supabase SQL editor. It creates:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE cache_entries (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    api_key     TEXT NOT NULL,
    query       TEXT NOT NULL,
    response    TEXT NOT NULL,
    embedding   VECTOR(384) NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT now(),
    last_hit_at TIMESTAMPTZ,
    hit_count   INTEGER DEFAULT 0
);

CREATE INDEX cache_entries_embedding_idx
    ON cache_entries
    USING hnsw (embedding vector_cosine_ops);

Plus two RPC functions (match_cache_entries, increment_api_key_counters) — see the full file for those.

3. Configure environment

SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-anon-key
REDIS_URL=redis://localhost:6379
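A hypothetical loader for these variables. ThriftLM reads them in thriftlm/config.py; this standalone sketch only shows the fail-fast pattern:

```python
import os

# Illustrative config loader for the env vars above (not ThriftLM's
# actual config module): fail fast if any required value is missing.
def load_config() -> dict:
    required = ["SUPABASE_URL", "SUPABASE_KEY", "REDIS_URL"]
    missing = [k for k in required if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing env vars: {', '.join(missing)}")
    return {k: os.environ[k] for k in required}
```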

4. Run Redis

docker compose up -d

5. Integrate

from thriftlm import SemanticCache
import openai

# Initialize once per process
cache = SemanticCache(threshold=0.85, api_key="your-key")

def call_llm(query: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}]
    )
    return response.choices[0].message.content

# Drop-in wrapper
response = cache.get_or_call("Explain semantic caching", call_llm)

# Near-duplicate → instant cache hit, no LLM called
response2 = cache.get_or_call("What is semantic caching?", call_llm)

Configuration

Parameter | Default  | Description
----------|----------|--------------------------------------------------------------
threshold | 0.85     | Cosine similarity cutoff. Lower = more aggressive matching.
api_key   | required | Namespaces cache per tenant. Each key has its own LocalIndex.

Threshold guide:

Threshold | Hit Rate (QQP) | Use case
----------|----------------|------------------------------------------------
0.70      | 92.5%          | Aggressive — high savings, some false positives
0.82      | 73.5%          | Balanced — recommended for most apps
0.85      | 62.5%          | Default — conservative
0.90      | 40.0%          | Near-exact only
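To turn those hit rates into dollar figures, a rough sketch. The token count and price per token here are illustrative assumptions, not benchmark outputs:

```python
# Rough savings per 200 benchmark queries, assuming ~500 output tokens per
# avoided call and an illustrative $0.60 per 1M output tokens.
PRICE_PER_TOKEN = 0.60 / 1_000_000
TOKENS_PER_CALL = 500

def savings(hits: int) -> float:
    return hits * TOKENS_PER_CALL * PRICE_PER_TOKEN

for threshold, hits in [(0.70, 185), (0.82, 147), (0.85, 125), (0.90, 80)]:
    print(f"threshold {threshold}: ${savings(hits):.4f} saved per 200 queries")
```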

Architecture

Embedding: all-MiniLM-L6-v2 (384-dim). Runs locally, no API cost.

Local numpy index: On SemanticCache() init, all stored embeddings are bulk-fetched into a (N, 384) float32 matrix. Cosine similarity is a single matrix @ query_vec matmul, which stays around 1ms at realistic cache sizes (the scan is O(N), but it is one vectorized operation). New entries append via np.vstack.
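A minimal sketch of such an index. Class and method names are illustrative, not the contents of thriftlm/backends/local_index.py:

```python
import numpy as np

# Illustrative in-process cosine index: embeddings are L2-normalised at
# insert time, so one matmul yields all cosine similarities.
class LocalIndex:
    def __init__(self, dim: int = 384):
        self.matrix = np.empty((0, dim), dtype=np.float32)  # (N, dim)
        self.ids: list[str] = []

    def add(self, entry_id: str, embedding: np.ndarray) -> None:
        vec = embedding.astype(np.float32).reshape(1, -1)
        vec /= np.linalg.norm(vec)              # normalise once at insert
        self.matrix = np.vstack([self.matrix, vec])
        self.ids.append(entry_id)

    def search(self, query_vec: np.ndarray, threshold: float):
        if not self.ids:
            return None
        q = query_vec.astype(np.float32)
        q = q / np.linalg.norm(q)
        sims = self.matrix @ q                  # one matmul, all cosines
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return self.ids[best], float(sims[best])
        return None
```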

Supabase HNSW: pgvector with HNSW index for accurate ANN at scale. Used for cold-start loading and as fallback.

PII scrubbing: Presidio + spaCy en_core_web_lg. Applied to LLM responses only before storage. Queries are not scrubbed — scrubbing before embedding causes embedding drift and kills recall.
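ThriftLM uses Presidio with spaCy for this step. As a toy illustration of response-side scrubbing only (plain regexes, nowhere near Presidio's recognizer coverage):

```python
import re

# Toy sketch of response-side PII scrubbing: replace emails and US-style
# phone numbers with placeholders. The real pipeline is Presidio + spaCy.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def scrub_response(text: str) -> str:
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text
```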


Benchmark

200 duplicate question pairs from Quora Question Pairs.

Threshold | Hit Rate | Hits / 200
----------|----------|------------
0.70      |  92.5%   |   185
0.75      |  86.0%   |   172
0.80      |  78.0%   |   156
0.82 ←    |  73.5%   |   147   (recommended)
0.85      |  62.5%   |   125   (default)
0.90      |  40.0%   |    80

Model: all-MiniLM-L6-v2 · Index: HNSW (Supabase pgvector)
Dataset: mean sim=0.859, min=0.550, max=0.999

Project Structure

ThriftLM/
├── thriftlm/
│   ├── __init__.py              # Public API: SemanticCache
│   ├── cache.py                 # Core lookup/store logic
│   ├── config.py                # Env config
│   ├── embedder.py              # SBERT wrapper
│   ├── privacy.py               # Presidio PII scrubbing
│   └── backends/
│       ├── local_index.py       # Numpy cosine index
│       ├── redis_backend.py     # Exact hash cache
│       └── supabase_backend.py  # Vector storage + PK fetch
├── api/
│   ├── main.py                  # FastAPI app
│   ├── auth.py                  # API key auth
│   └── routes/
│       ├── cache.py             # /lookup, /store
│       ├── metrics.py           # /metrics
│       └── keys.py              # /keys
├── tests/                       # 66 passing tests
├── scratch/
│   ├── smoke_test.py
│   ├── openai_test.py
│   ├── populate_test.py
│   └── qqp_benchmark.py
├── supabase/setup.sql
├── docker-compose.yml
└── pyproject.toml

REST API

uvicorn api.main:app --reload
POST /lookup    { "embedding": [...], "api_key": "..." }           → { "response": "..." | null }
POST /store     { "embedding": [...], "query": "...", "response": "...", "api_key": "..." }  → 200
GET  /metrics   header: X-API-Key                                  → { hit_rate, tokens_saved, cost_saved, total_queries }
POST /keys      { "email": "..." }                                 → { "api_key": "sc_..." }
GET  /health                                                       → { "status": "ok" }
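As an illustration of the /lookup contract above (stdlib only; the endpoint shape is taken from the listing, the helper name is ours):

```python
import json
import urllib.request

def build_lookup_request(embedding: list[float], api_key: str,
                         base: str = "http://localhost:8000"):
    """Build a POST /lookup request matching the contract above."""
    payload = json.dumps({"embedding": embedding, "api_key": api_key}).encode()
    return urllib.request.Request(
        f"{base}/lookup",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with: urllib.request.urlopen(build_lookup_request([...], "sc_..."))
# The response body's "response" field is the cached answer, or null on a miss.
```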

Development

git clone https://github.com/samujure/ThriftLM
cd ThriftLM
pip install -e ".[dev]"
cp .env.example .env
docker compose up -d
pytest tests/ -v
python scratch/smoke_test.py
python scratch/qqp_benchmark.py

Roadmap

V1 — Shipped ✓

  • Three-tier cache: Redis → LocalIndex → HNSW
  • Presidio PII scrubbing on responses
  • Multi-tenant FastAPI + API key auth
  • pip install thriftlm

V2 — Agentic Plan Caching (next)

V1 caches individual responses. V2 caches entire agent plans — multi-step action sequences generated for agentic loops. When intent repeats, skip re-planning and replay the cached plan. Built for Claude Code SDK and long-running agent workflows.

Key papers: APC (arXiv:2506.14852) · GenCache

Later

  • ClawHub / OpenClaw distribution
  • Per-model cost analytics dashboard
  • Precision benchmark (false positive rate on non-duplicate pairs)

License

MIT


Built by Srivamsi Amujure & Ivan Thomas Shen


Download files

Download the file for your platform.

Source Distribution

thriftlm-0.1.3.tar.gz (47.4 kB)


Built Distribution


thriftlm-0.1.3-py3-none-any.whl (19.2 kB)


File details

Details for the file thriftlm-0.1.3.tar.gz.

File metadata

  • Download URL: thriftlm-0.1.3.tar.gz
  • Upload date:
  • Size: 47.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for thriftlm-0.1.3.tar.gz
Algorithm Hash digest
SHA256 aa2f0bd88067009e52b57be11e0525a98f7b5f76d5e4c0028090a4162f41ec30
MD5 f0274d4b366560c5e8bd2093719ec960
BLAKE2b-256 6134b8192586b1993dff055a3af6a73722972c594d4e4fb09d9c7af7ba142726


File details

Details for the file thriftlm-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: thriftlm-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 19.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for thriftlm-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 9168ac4a4e8f8fea97798a56684bb82d22f99a8d205d662388b48e2962c16eeb
MD5 64eafeab866982aaf21b219478275123
BLAKE2b-256 bfc5686b1694b4ecb328784a01e9a77e3b799fcb9ef8c2b4343d0609c4351d69

