Plan cache for agentic pipelines — reuse workflow skeletons, skip repeated planning.
ThriftLM
Stop paying for the same LLM call twice just because users phrased it differently.
0.2.0 adds plan caching on top of response caching: on a cache hit, repeated agent workflows now skip the planner entirely.
pip install thriftlm
What ThriftLM is
ThriftLM is a two-layer caching system for LLM applications.
V1 — response cache (thriftlm==0.1.x, stable)
Same query → same answer. Intercepts repeated or semantically similar LLM calls before they reach your provider. Three-tier stack: Redis exact hash → local numpy cosine index → Supabase pgvector HNSW.
V2 — plan cache (thriftlm==0.2.0, new)
Same job → same execution plan, filled with fresh context. Intercepts agent tasks before planning. If a semantically similar task was planned before, V2 returns a validated, slot-filled FilledPlan — no planner call, no LLM call. If it misses, your planner runs and the result can be stored for next time.
Both layers can run together. V1 sits underneath V2 to cache repeated leaf LLM calls inside agent steps.
What's new in 0.2.0
- Plan-level cache (`thriftlm.v2`): reuse reasoning skeletons across task families, not just identical queries
- Intent canonicalization: tasks are routed to deterministic buckets via a structured `IntentKey` (gpt-4o-mini, cached 1h in Redis)
- Composite candidate reranking: `0.7 × semantic_similarity + 0.3 × structural_score` over fetched candidates
- Slot filling + 7-stage validation: plans are filled with caller context and validated before being returned; bad fills are silently discarded
- Automatic plan extraction: after a planner runs on a miss, `extract_plan_template()` generalizes the trace into a reusable template (deterministic, no LLM)
- `scripts/`: seed, smoke-test, and extract-and-store helpers for developer workflow
- 364 tests passing
Architecture
V1 — response cache
query
│
▼
┌─────────────────┐ HIT → return (~0.5ms)
│ Redis │ exact embedding hash
└────────┬────────┘
│ MISS
▼
┌─────────────────┐ HIT → Supabase PK fetch → return (~50ms)
│ Local Numpy │ cosine similarity matmul
│ Index │
└────────┬────────┘
│ MISS
▼
┌─────────────────┐
│ Your LLM fn │ llm_fn() called here
└────────┬────────┘
│
▼
PII scrub (Presidio, responses only) → store → return
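The V1 lookup order can be illustrated with a toy, in-memory stand-in. This is a sketch under stated assumptions, not ThriftLM's code: `ToyResponseCache` and its `embed_fn` parameter are hypothetical, plain Python containers stand in for Redis and Supabase, and PII scrubbing is omitted.

```python
import hashlib
from typing import Callable

import numpy as np


class ToyResponseCache:
    """Hypothetical sketch of the V1 tiers: exact embedding hash, cosine index, then the LLM."""

    def __init__(self, embed_fn: Callable[[str], np.ndarray], threshold: float = 0.85):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.exact: dict[str, str] = {}                      # stands in for Redis
        self.vectors = np.empty((0, 0), dtype=np.float32)    # local index, unit-length rows
        self.responses: list[str] = []                       # stands in for Supabase rows

    @staticmethod
    def _hash(vec: np.ndarray) -> str:
        return hashlib.sha256(vec.tobytes()).hexdigest()

    def get_or_call(self, query: str, llm_fn: Callable[[str], str]) -> str:
        vec = np.asarray(self.embed_fn(query), dtype=np.float32)
        vec = vec / np.linalg.norm(vec)                      # normalize once
        key = self._hash(vec)
        # Tier 1: exact embedding hash.
        if key in self.exact:
            return self.exact[key]
        # Tier 2: cosine similarity as one matrix-vector product.
        if self.responses:
            sims = self.vectors @ vec
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.responses[best]
        # Miss: call the LLM, then store in both tiers.
        response = llm_fn(query)
        self.exact[key] = response
        self.vectors = vec[None, :] if not self.responses else np.vstack([self.vectors, vec])
        self.responses.append(response)
        return response
```

Because every stored vector is unit length, the dot product equals cosine similarity, so the whole second tier is a single `matmul` plus an `argmax`.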
V2 — plan cache
Lookup path:
task + context + runtime_caps
│
▼
canonicalize(task) → IntentKey + intent_bucket_hash
└── Redis 1h TTL (no second OpenAI call on repeat tasks)
│
▼
bucket fetch (Supabase) → candidates matching intent_bucket_hash
│
▼
composite rerank → 0.7 × sem_sim + 0.3 × structural_score
│
▼
adapt_plan() → fill SlotSpecs from context + transforms
│
▼
validate_plan() → 7-stage pipeline; discard + try next on fail
│
├── HIT → return FilledPlan (planner never ran)
└── MISS → return miss signal → caller runs planner
Miss → extract → store path:
caller planner runs → execution trace
│
▼
extract_plan_template() → generalize trace to PlanTemplate (deterministic)
│
▼
POST /v2/plan/store → server verifies bucket hash → stores in Supabase
│
▼
next similar task hits the plan cache
Installation
```bash
pip install thriftlm
```
Prerequisites:
- Python 3.10+
- Supabase project with pgvector (`supabase/setup.sql` provisions the tables)
- Redis (local or Upstash)
- `OPENAI_API_KEY` for V2 canonicalization (gpt-4o-mini)

Environment variables (for example in `.env`):

```
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-anon-key
REDIS_URL=redis://localhost:6379
OPENAI_API_KEY=sk-...
```
V1 Quickstart — response cache
```python
import openai
from thriftlm import SemanticCache

cache = SemanticCache(threshold=0.85, api_key="your-key")

def call_llm(query: str) -> str:
    resp = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

# Cache check + LLM fallback in one call
response = cache.get_or_call("Explain semantic caching", call_llm)

# Near-duplicate: served from cache (no LLM call) if similarity clears the threshold
response2 = cache.get_or_call("What is semantic caching?", call_llm)
```

That's the entire integration: no architecture changes, just wrap the existing LLM call.
V2 Quickstart — plan cache
Start the V2 server:
```bash
python -m uvicorn thriftlm.v2._server:app --port 8000
```
Use ThriftLMPlanCache in your agent:
```python
from thriftlm.v2.adapters.generic import ThriftLMPlanCache

cache = ThriftLMPlanCache(
    api_key="tlm_xxx",
    base_url="http://localhost:8000",
    timeout=30.0,  # first call does OpenAI canonicalization, allow extra time
)

task = "summarize open PRs for org/myrepo"
context = {"repo": "org/myrepo"}
runtime_caps = {"tool_families": ["github"], "allow_side_effects": False}

result = cache.lookup(task=task, context=context, runtime_caps=runtime_caps)

# `executor` and `my_planner` stand for your own agent components.
if result["status"] == "hit":
    # Planner skipped entirely: use the validated, slot-filled plan
    filled_plan = result["filled_plan"]
    executor.run(filled_plan, context)
else:
    # Cache miss: run your planner, then store the trace for next time
    planner_output = my_planner(task, context)
    executor.run(planner_output, context)

    # Optional: extract and store for future reuse.
    # scripts/ lives at the repo root, so this import assumes you run from a clone.
    from scripts.extract_and_store import extract_and_store

    extract_and_store(
        task=task,
        context=context,
        execution_trace=planner_output["trace"],
        canonicalization_result=result.get("canonicalization_result"),
        api_key="tlm_xxx",
        base_url="http://localhost:8000",
    )
```
On the second call with a semantically similar task (same intent, different repo), V2 returns a hit — slots are filled with the new context, validation passes, planner never runs.
Core V2 concepts
IntentKey — structured decomposition of a task into action, target, goal, time_scope, and optional metadata fields (domain, format, audience, tool_family). Produced by the canonicalizer (gpt-4o-mini at temperature=0).
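Canonicalization plus its one-hour Redis cache (described under Safety and invariants) can be sketched as a thin wrapper. Assumptions: `canonicalize_cached`, the `canon:` key prefix and the `canonicalize` callable are hypothetical; the client only needs redis-py style `get(name)` and `set(name, value, ex=...)`, so a `redis.Redis` instance or a test double both fit.

```python
from __future__ import annotations

import hashlib
import json
from typing import Callable, Optional, Protocol, Union


class KeyValueClient(Protocol):
    """Anything with redis-py style get/set, e.g. redis.Redis."""

    def get(self, name: str) -> Optional[Union[bytes, str]]: ...
    def set(self, name: str, value: str, ex: Optional[int] = None) -> object: ...


CANON_TTL_SECONDS = 3600  # 1 hour, as stated in the README


def canonicalize_cached(task: str, client: KeyValueClient,
                        canonicalize: Callable[[str], dict]) -> dict:
    # Hypothetical key scheme: hash of the exact task string.
    key = "canon:" + hashlib.sha256(task.encode("utf-8")).hexdigest()
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)  # json.loads accepts both str and bytes
    # Only reached on a cache miss: this is where the gpt-4o-mini call would happen.
    result = canonicalize(task)
    client.set(key, json.dumps(result), ex=CANON_TTL_SECONDS)
    return result
```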
intent_bucket_hash — 16-char SHA-256 of the 4 core fields only (action, target, goal, time_scope). Optional fields are excluded from the hash intentionally: LLMs vary them across invocations even for the same task. The hash is the routing key for plan lookup.
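A minimal sketch of how such a routing key could be computed. The JSON serialization with sorted keys and the hex-prefix truncation are assumptions; the README only states that the hash is a 16-character SHA-256 over the four core fields. The field values in the usage lines are made up.

```python
import hashlib
import json

CORE_FIELDS = ("action", "target", "goal", "time_scope")


def intent_bucket_hash(intent_key: dict) -> str:
    # Only the four core fields participate; optional metadata
    # (domain, format, audience, tool_family) is deliberately ignored.
    core = {field: intent_key.get(field) for field in CORE_FIELDS}
    # Canonical serialization (assumed): sorted keys, no extra whitespace.
    payload = json.dumps(core, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


a = {"action": "summarize", "target": "pull_requests", "goal": "status_overview",
     "time_scope": "current", "format": "bullets"}
b = dict(a, format="table", audience="engineering")  # only optional fields differ
print(intent_bucket_hash(a) == intent_bucket_hash(b))  # True
```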
PlanTemplate — a stored execution skeleton: ordered steps with typed I/O, SlotSpec declarations for caller-supplied values, output schema, and version metadata. Retrieved from Supabase by bucket hash.
FilledPlan — a PlanTemplate with all SlotSpec values resolved from the caller's context. Step inputs referencing {slot_name} are substituted. Prior-step output references ({prs}, {grouped}) are left for the executor.
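Slot substitution that leaves prior-step references alone can be sketched with a single regex pass. The `fill_step_input` name and the `{word}` placeholder pattern are assumptions, not ThriftLM's `adapt_plan()`; transforms from the TransformRegistry are not modeled.

```python
import re

# A placeholder is a word wrapped in braces, e.g. {repo} or {prs} (assumed syntax).
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_step_input(template: str, slots: dict[str, str]) -> str:
    """Substitute caller-supplied slot values; keep other references intact."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Known slot: substitute. Unknown name (a prior-step output such as
        # {prs} or {grouped}): return the original text for the executor.
        return str(slots[name]) if name in slots else match.group(0)

    return PLACEHOLDER.sub(replace, template)


print(fill_step_input("list open PRs in {repo}", {"repo": "org/myrepo"}))  # list open PRs in org/myrepo
print(fill_step_input("group {prs} by author", {"repo": "org/myrepo"}))    # group {prs} by author
```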
Structural scoring — composite score used to rank candidates within a bucket:
final_score = 0.7 × semantic_similarity + 0.3 × structural_score
structural_score =
0.35 × slot_overlap (required context keys present)
+ 0.25 × tool_family_match (plan needs tools the runtime has)
+ 0.20 × format_audience (format/audience fields match)
+ 0.20 × side_effect_compat (side-effecting steps allowed?)
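The weights above translate directly into code. This sketch assumes each component is already a number in [0, 1]; how ThriftLM derives those components is not shown, and the candidate dict shape (`id`, `sem_sim`, `components`) is hypothetical.

```python
def structural_score(slot_overlap: float, tool_family_match: float,
                     format_audience: float, side_effect_compat: float) -> float:
    # Component weights from the formula above (they sum to 1.0).
    return (0.35 * slot_overlap
            + 0.25 * tool_family_match
            + 0.20 * format_audience
            + 0.20 * side_effect_compat)


def final_score(semantic_similarity: float, structural: float) -> float:
    return 0.7 * semantic_similarity + 0.3 * structural


def rerank(candidates: list[dict]) -> list[dict]:
    # Highest composite score first.
    return sorted(
        candidates,
        key=lambda c: final_score(c["sem_sim"], structural_score(**c["components"])),
        reverse=True,
    )


candidates = [
    # Closer wording, but the runtime lacks a required tool family.
    {"id": "a", "sem_sim": 0.70,
     "components": {"slot_overlap": 0.5, "tool_family_match": 0.0,
                    "format_audience": 1.0, "side_effect_compat": 1.0}},
    # Looser wording, full structural fit.
    {"id": "b", "sem_sim": 0.62,
     "components": {"slot_overlap": 1.0, "tool_family_match": 1.0,
                    "format_audience": 1.0, "side_effect_compat": 1.0}},
]
print([c["id"] for c in rerank(candidates)])  # ['b', 'a']
```

Candidate `b` scores 0.7 × 0.62 + 0.3 × 1.0 = 0.734, while `a` scores 0.7 × 0.70 + 0.3 × 0.575 = 0.6625: the structural term lets a plan the runtime can actually execute outrank a slightly closer textual match.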
Validation — 7 ordered stages:
| Stage | What it checks |
|---|---|
| 1 | All required slots resolved from context |
| 2 | Slot values match declared type hints |
| 3 | All step inputs satisfied by prior outputs or slots |
| 4 | Required tool_family values present in runtime_caps |
| 5 | No unsubstituted {placeholder} strings remain |
| 6 | Every non-optional output schema field has a producing step |
| 7 | Side-effecting steps permitted by runtime_caps.allow_side_effects |
A candidate that fails any stage is discarded silently. The next ranked candidate is tried. If all top_k candidates fail, V2 returns a miss.
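The discard-and-retry loop can be sketched with three of the seven stages (1, 4 and 7). The plan shape (`required_slots`, per-step `tool_family` and `side_effect`), the stage function names and the `top_k=3` default are assumptions for illustration, not ThriftLM's `validate_plan()`.

```python
from typing import Callable, Optional

# A stage takes (plan, context, runtime_caps) and returns True if the plan passes.
Stage = Callable[[dict, dict, dict], bool]


def required_slots_resolved(plan: dict, context: dict, caps: dict) -> bool:  # stage 1
    return all(slot in context for slot in plan["required_slots"])


def tool_families_available(plan: dict, context: dict, caps: dict) -> bool:  # stage 4
    available = set(caps.get("tool_families", []))
    return all(step["tool_family"] in available for step in plan["steps"])


def side_effects_permitted(plan: dict, context: dict, caps: dict) -> bool:  # stage 7
    if caps.get("allow_side_effects", False):
        return True
    return not any(step["side_effect"] for step in plan["steps"])


STAGES: list[Stage] = [required_slots_resolved, tool_families_available, side_effects_permitted]


def first_valid(ranked: list[dict], context: dict, caps: dict, top_k: int = 3) -> Optional[dict]:
    """Return the best-ranked plan that passes every stage, or None for a miss."""
    for plan in ranked[:top_k]:
        # all() short-circuits: stages run in order and stop at the first failure.
        if all(stage(plan, context, caps) for stage in STAGES):
            return plan
        # Failed a stage: discard silently and try the next candidate.
    return None
```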
Safety and invariants
- V2 never executes plans. It returns a validated `FilledPlan`; the caller owns execution.
- Bucket hash is recomputed server-side on store. A caller-supplied `intent_bucket_hash` is not trusted; a mismatch returns `400 hash_mismatch`.
- Plans are tenant-isolated. Every plan is scoped to its `api_key`. No cross-tenant reads or writes.
- Extractor is deterministic. `extract_plan_template()` calls no LLM and makes no network requests. It generalizes a trace using reverse context mapping. It refuses extraction (returns `ok=False`) if the trace has fewer than 2 steps, if all steps are side-effecting and no slots were extracted, or if extraction confidence is below 0.5.
- Canonicalization is cached. Once a task string is canonicalized, the result is stored in Redis for 1 hour, so the same task never triggers two OpenAI calls within that window.
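Reverse context mapping and two of the refusal rules can be sketched as below. Assumptions: the trace shape (an `inputs` dict and a `side_effect` flag per step) and the `extract_template` name are hypothetical. Only input values that exactly equal a top-level string context value are generalized, matching the shallow behavior described under Current scope and limitations. The 0.5 confidence check is omitted because the README does not say how confidence is computed.

```python
def extract_template(steps: list[dict], context: dict) -> dict:
    """Generalize a trace by replacing exact context values with {key} placeholders."""
    if len(steps) < 2:
        return {"ok": False, "reason": "trace has fewer than 2 steps"}

    # Reverse map: context value -> context key (top-level string values only).
    reverse = {value: key for key, value in context.items() if isinstance(value, str)}

    slots: set[str] = set()
    generalized = []
    for step in steps:
        inputs = {}
        for name, value in step["inputs"].items():
            key = reverse.get(value) if isinstance(value, str) else None
            if key is not None:
                inputs[name] = "{" + key + "}"  # exact match becomes a slot
                slots.add(key)
            else:
                inputs[name] = value  # left as-is (no nested or fuzzy matching)
        generalized.append({**step, "inputs": inputs})  # copy; the trace is not mutated

    if all(step.get("side_effect", False) for step in steps) and not slots:
        return {"ok": False, "reason": "all steps side-effecting and no slots extracted"}
    return {"ok": True, "steps": generalized, "slots": sorted(slots)}
```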
Current scope and limitations
- Text-first. V2 is designed for text-input agent tasks. Multimodal support (`EvidenceProfile`) is designed but not yet built.
- Shallow slot extraction. The extractor handles exact top-level context value → placeholder substitution. Nested placeholder extraction and fuzzy abstraction are not supported in v0.1.
- No benchmark yet. V2 hit rate and latency benchmarks across diverse task families are planned for Phase 3.
- `plan_threshold = 0.60`. SBERT cosine similarity between short task strings and plan descriptions typically lands in 0.50–0.70. The threshold may need tuning as your plan bank grows.
- seed_task vs description split is future polish. Currently `seed_v2_plans.py` canonicalizes the plan description string, which works but conflates routing vocabulary with reranking text. Not a blocker.
Developer scripts
| Script | What it does |
|---|---|
| `scripts/seed_v2_plans.py --api-key tlm_xxx` | Seeds Supabase with canonical plan templates. Calls `canonicalize()` on each description to get live bucket hashes (no hardcoded intent keys). |
| `scripts/smoke_v2_lookup.py --api-key tlm_xxx --base-url http://localhost:8000 --task "..." --context '{}' --timeout 30` | Fires a single lookup and prints the full JSON response. Use `--timeout 30` on cold starts. |
| `scripts/extract_and_store.py --api-key tlm_xxx --base-url http://localhost:8000 --task "..." --context '{}' --trace trace.json --canon canon.json` | Extracts a `PlanTemplate` from an execution trace and stores it via `/v2/plan/store`. |
| `scripts/debug_v2_lookup.py --api-key tlm_xxx --bucket <hash> --task "..." --context '{}'` | Fetches raw DB rows for a bucket, scores them, and runs `adapt_plan` + `validate_plan`; useful for diagnosing misses. |
V2 API endpoints
| Method + Path | Description |
|---|---|
| `POST /v2/plan/lookup` | Main entry: task + context → `FilledPlan` or miss |
| `POST /v2/plan/store` | Store a template (server recomputes and verifies bucket hash) |
| `GET /v2/plan/bucket/:hash` | List templates for a bucket |
| `DELETE /v2/plan/:id` | Evict a single plan |
| `DELETE /v2/plan/bucket/:hash` | Evict an entire bucket |
| `POST /v2/plan/invalidate-by-version` | Bulk soft-invalidate by version string |
| `GET /v2/metrics` | Server health + version |
V1 metrics dashboard
```bash
thriftlm serve --api-key your-key
# → http://localhost:8000 (opens automatically)
```
Shows hit rate, tokens saved, estimated cost saved, and top cached queries. Reads directly from your Supabase.
V1 benchmark
| Threshold | Hit rate | Hits / 200 | Note |
|---|---|---|---|
| 0.70 | 92.5% | 185 | |
| 0.75 | 86.0% | 172 | |
| 0.80 | 78.0% | 156 | |
| 0.82 | 73.5% | 147 | recommended |
| 0.85 | 62.5% | 125 | default |
| 0.90 | 40.0% | 80 | |
Model: all-MiniLM-L6-v2 · Dataset: Quora Question Pairs (200 pairs)
Project structure
ThriftLM/
├── thriftlm/
│ ├── cache.py # V1 SemanticCache
│ ├── embedder.py # SBERT all-MiniLM-L6-v2
│ ├── privacy.py # Presidio PII scrubbing
│ ├── _server.py # V1 FastAPI (thriftlm serve)
│ ├── cli.py # CLI entry point
│ ├── backends/
│ │ ├── local_index.py # Numpy cosine index
│ │ ├── redis_backend.py # Exact hash cache
│ │ └── supabase_backend.py # pgvector HNSW store
│ └── v2/
│ ├── schemas.py # TypedDicts: IntentKey, PlanTemplate, FilledPlan, …
│ ├── intent.py # canonicalize() → IntentKey + bucket hash
│ ├── canonicalization_cache.py # Redis cache for canonicalization results
│ ├── plan_cache.py # bucket fetch + composite rerank
│ ├── adapter.py # slot filling + TransformRegistry
│ ├── validator.py # 7-stage validation pipeline
│ ├── extractor.py # trace → PlanTemplate (deterministic)
│ ├── _server.py # V2 FastAPI endpoints
│ └── adapters/
│ ├── base.py # BasePlanCache ABC
│ └── generic.py # ThriftLMPlanCache HTTP client
├── scripts/
│ ├── seed_v2_plans.py
│ ├── smoke_v2_lookup.py
│ ├── extract_and_store.py
│ └── debug_v2_lookup.py
├── tests/ # 364 passing
├── supabase/setup.sql
├── api/ # Multi-tenant self-hosted backend
└── pyproject.toml
Roadmap
| Item | Status |
|---|---|
| V1 response cache | Shipped (0.1.x) |
| V2 plan cache | Shipped (0.2.0) |
| V2 benchmark (200 tasks, 5 intent buckets) | Phase 3 |
| Fly.io deploy + hosted endpoint | Phase 3 |
| Claude Code MCP adapter / Codex CLI hook | Roadmap |
| seed_task vs description split in seed script | Post-0.2.0 polish |
| V2.5 multimodal EvidenceProfile | Future |
Development
```bash
git clone https://github.com/samujure/ThriftLM
cd ThriftLM
pip install -e ".[dev]"
cp .env.example .env        # fill in SUPABASE_URL, SUPABASE_KEY, REDIS_URL, OPENAI_API_KEY
docker compose up -d        # local Redis
pytest tests/ -q            # 364 tests

python scripts/seed_v2_plans.py --api-key tlm_test
# the smoke test needs the V2 server running (see V2 Quickstart)
python scripts/smoke_v2_lookup.py --api-key tlm_test --base-url http://localhost:8000 \
  --task "summarize open PRs for org/myrepo" --context '{"repo":"org/myrepo"}' \
  --runtime-caps '{"tool_families":["github"],"allow_side_effects":false}' --timeout 30
```
This README reflects `thriftlm==0.2.0`.