Plan cache for agentic pipelines — reuse workflow skeletons, skip repeated planning.

ThriftLM

Stop paying for the same LLM call twice just because users phrased it differently.

0.2.0 adds plan caching on top of response caching — repeated agent workflows now skip the planner entirely.

Python 3.10+ · MIT license

pip install thriftlm

What ThriftLM is

ThriftLM is a two-layer caching system for LLM applications.

V1 — response cache (thriftlm==0.1.x, stable): same query → same answer. Intercepts repeated or semantically similar LLM calls before they reach your provider. Three-tier stack: Redis exact hash → local numpy cosine index → Supabase pgvector HNSW.

V2 — plan cache (thriftlm==0.2.0, new): same job → same execution plan, filled with fresh context. Intercepts agent tasks before planning. If a semantically similar task was planned before, V2 returns a validated, slot-filled FilledPlan — no planner call, no LLM call. If it misses, your planner runs and the result can be stored for next time.

Both layers can run together. V1 sits underneath V2 to cache repeated leaf LLM calls inside agent steps.


What's new in 0.2.0

  • Plan-level cache (thriftlm.v2) — reuse reasoning skeletons across task families, not just identical queries
  • Intent canonicalization — tasks are routed to deterministic buckets via a structured IntentKey (gpt-4o-mini, cached 1h in Redis)
  • Composite candidate reranking — 0.7 × semantic_similarity + 0.3 × structural_score over fetched candidates
  • Slot filling + 7-stage validation — plans are filled with caller context and validated before being returned; bad fills are silently discarded
  • Automatic plan extraction — after a planner runs on a miss, extract_plan_template() generalizes the trace into a reusable template (deterministic, no LLM)
  • scripts/ — seed, smoke-test, and extract-and-store helpers for developer workflow
  • 364 tests passing

Architecture

V1 — response cache

query
  │
  ▼
┌─────────────────┐   HIT → return (~0.5ms)
│  Redis          │   exact embedding hash
└────────┬────────┘
         │ MISS
         ▼
┌─────────────────┐   HIT → Supabase PK fetch → return (~50ms)
│  Local Numpy    │   cosine similarity matmul
│  Index          │
└────────┬────────┘
         │ MISS
         ▼
┌─────────────────┐
│  Your LLM fn    │   llm_fn() called here
└────────┬────────┘
         │
         ▼
   PII scrub (Presidio, responses only) → store → return
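
The tier order above can be sketched as follows. This is a simplified stand-in, not the library's internals: `exact_cache` plays the role of Redis, `nearest_fn` the local numpy index (returning a `(response, cosine)` pair or `None`), and the Supabase tier and PII scrub are omitted.

```python
import hashlib

def tiered_get_or_call(query, llm_fn, exact_cache, nearest_fn, threshold=0.85):
    # Tier 1: exact-match hash lookup (Redis in the real stack, ~0.5 ms on hit).
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in exact_cache:
        return exact_cache[key]
    # Tier 2: semantic lookup; accept only above the similarity threshold.
    near = nearest_fn(query)
    if near is not None and near[1] >= threshold:
        return near[0]
    # Miss: call the provider, store for next time (scrub/store elided here).
    response = llm_fn(query)
    exact_cache[key] = response
    return response
```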

V2 — plan cache

Lookup path:

task + context + runtime_caps
  │
  ▼
canonicalize(task)          → IntentKey + intent_bucket_hash
  └── Redis 1h TTL          (no second OpenAI call on repeat tasks)
  │
  ▼
bucket fetch (Supabase)     → candidates matching intent_bucket_hash
  │
  ▼
composite rerank            → 0.7 × sem_sim + 0.3 × structural_score
  │
  ▼
adapt_plan()                → fill SlotSpecs from context + transforms
  │
  ▼
validate_plan()             → 7-stage pipeline; discard + try next on fail
  │
  ├── HIT  → return FilledPlan (planner never ran)
  └── MISS → return miss signal → caller runs planner
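
The hit/miss decision compresses to a rank-then-try loop. A sketch with hypothetical stand-ins — `score`, `adapt`, and `validate` here are illustrative callables, not thriftlm's actual signatures:

```python
def plan_lookup(candidates, context, runtime_caps, score, adapt, validate, top_k=5):
    # Rank bucket candidates by composite score, then try each in order.
    ranked = sorted(candidates, key=score, reverse=True)[:top_k]
    for cand in ranked:
        filled = adapt(cand, context)          # fill SlotSpecs from context
        if validate(filled, runtime_caps):     # 7-stage pipeline; fail → next
            return {"status": "hit", "filled_plan": filled}
    # Every candidate failed (or the bucket was empty): caller runs its planner.
    return {"status": "miss"}
```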

Miss → extract → store path:

caller planner runs → execution trace
  │
  ▼
extract_plan_template()     → generalize trace to PlanTemplate (deterministic)
  │
  ▼
POST /v2/plan/store         → server verifies bucket hash → stores in Supabase
  │
  ▼
next similar task hits the plan cache
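
The reverse context mapping inside extract_plan_template() can be pictured roughly like this — shallow, exact, top-level values only, matching the stated scope. The trace and template field names are illustrative assumptions, not the library's schema:

```python
def extract_plan_template(trace_steps, context):
    # Reverse context mapping: any step input whose value exactly equals a
    # top-level context value is generalized to a {key} placeholder.
    value_to_slot = {str(v): k for k, v in context.items()}
    used_slots = set()
    steps = []
    for step in trace_steps:
        inputs = {}
        for name, value in step["inputs"].items():
            slot = value_to_slot.get(str(value))
            if slot is not None:
                inputs[name] = "{" + slot + "}"
                used_slots.add(slot)
            else:
                inputs[name] = value  # literal: stays baked into the template
        steps.append({"tool": step["tool"], "inputs": inputs})
    return {"steps": steps, "slots": sorted(used_slots)}
```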

Installation

pip install thriftlm

Prerequisites:

  • Python 3.10+
  • Supabase project with pgvector (supabase/setup.sql to provision tables)
  • Redis (local or Upstash)
  • OPENAI_API_KEY for V2 canonicalization (gpt-4o-mini)
Environment variables (in your shell or a .env file):

SUPABASE_URL=https://your-project.supabase.co
SUPABASE_KEY=your-anon-key
REDIS_URL=redis://localhost:6379
OPENAI_API_KEY=sk-...

V1 Quickstart — response cache

from thriftlm import SemanticCache
import openai

cache = SemanticCache(threshold=0.85, api_key="your-key")

def call_llm(query: str) -> str:
    resp = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

# Cache check + LLM fallback in one call
response = cache.get_or_call("Explain semantic caching", call_llm)

# Near-duplicate → instant cache hit, no LLM called
response2 = cache.get_or_call("What is semantic caching?", call_llm)

That's the entire integration. No architecture changes — wrap the existing LLM call.


V2 Quickstart — plan cache

Start the V2 server:

python -m uvicorn thriftlm.v2._server:app --port 8000

Use ThriftLMPlanCache in your agent:

from thriftlm.v2.adapters.generic import ThriftLMPlanCache

cache = ThriftLMPlanCache(
    api_key="tlm_xxx",
    base_url="http://localhost:8000",
    timeout=30.0,   # first call does OpenAI canonicalization, allow extra time
)

task = "summarize open PRs for org/myrepo"
context = {"repo": "org/myrepo"}
runtime_caps = {"tool_families": ["github"], "allow_side_effects": False}

result = cache.lookup(task=task, context=context, runtime_caps=runtime_caps)

if result["status"] == "hit":
    # Planner skipped entirely — use the validated, slot-filled plan
    filled_plan = result["filled_plan"]
    executor.run(filled_plan, context)

else:
    # Cache miss — run your planner, then store the trace for next time
    planner_output = my_planner(task, context)
    executor.run(planner_output, context)

    # Optional: extract and store for future reuse
    from scripts.extract_and_store import extract_and_store
    extract_and_store(
        task=task,
        context=context,
        execution_trace=planner_output["trace"],
        canonicalization_result=result.get("canonicalization_result"),
        api_key="tlm_xxx",
        base_url="http://localhost:8000",
    )

On the second call with a semantically similar task (same intent, different repo), V2 returns a hit — slots are filled with the new context, validation passes, planner never runs.


Core V2 concepts

IntentKey — structured decomposition of a task into action, target, goal, time_scope, and optional metadata fields (domain, format, audience, tool_family). Produced by the canonicalizer (gpt-4o-mini at temperature=0).

intent_bucket_hash — the first 16 hex characters of a SHA-256 over the 4 core fields only (action, target, goal, time_scope). Optional fields are excluded from the hash intentionally: LLMs vary them across invocations even for the same task. The hash is the routing key for plan lookup.
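
A plausible reading of that hashing rule as code. The exact serialization thriftlm uses is an assumption here; the point is that optional fields cannot change the bucket:

```python
import hashlib
import json

CORE_FIELDS = ("action", "target", "goal", "time_scope")

def intent_bucket_hash(intent_key: dict) -> str:
    # Hash only the 4 core fields so optional fields (domain, format,
    # audience, tool_family) never split a bucket across invocations.
    core = {f: intent_key.get(f) for f in CORE_FIELDS}
    payload = json.dumps(core, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```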

PlanTemplate — a stored execution skeleton: ordered steps with typed I/O, SlotSpec declarations for caller-supplied values, output schema, and version metadata. Retrieved from Supabase by bucket hash.

FilledPlan — a PlanTemplate with all SlotSpec values resolved from the caller's context. Step inputs referencing {slot_name} are substituted. Prior-step output references ({prs}, {grouped}) are left for the executor.
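
The substitution rule can be sketched like this: resolved slot names are filled in, and any placeholder that is not a known slot (a prior-step output such as {prs}) passes through unchanged for the executor. A minimal illustration, not the library's adapter:

```python
import re

def fill_inputs(step_inputs: dict, slots: dict) -> dict:
    # Replace {name} with its resolved slot value; unknown placeholders
    # are left intact so the executor can wire prior-step outputs.
    def fill(value):
        if not isinstance(value, str):
            return value
        return re.sub(r"\{(\w+)\}",
                      lambda m: str(slots.get(m.group(1), m.group(0))),
                      value)
    return {name: fill(value) for name, value in step_inputs.items()}
```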

Structural scoring — composite score used to rank candidates within a bucket:

final_score = 0.7 × semantic_similarity + 0.3 × structural_score

structural_score =
    0.35 × slot_overlap      (required context keys present)
  + 0.25 × tool_family_match (plan needs tools the runtime has)
  + 0.20 × format_audience   (format/audience fields match)
  + 0.20 × side_effect_compat (side-effecting steps allowed?)

Validation — 7 ordered stages:

Stage What it checks
1 All required slots resolved from context
2 Slot values match declared type hints
3 All step inputs satisfied by prior outputs or slots
4 Required tool_family values present in runtime_caps
5 No unsubstituted {placeholder} strings remain
6 Every non-optional output schema field has a producing step
7 Side-effecting steps permitted by runtime_caps.allow_side_effects

A candidate that fails any stage is discarded silently. The next ranked candidate is tried. If all top_k candidates fail, V2 returns a miss.
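
The fail-fast shape of that pipeline is simple to sketch. Individual stage implementations are elided; the `stages` list stands in for the 7 checks above, and the returned stage number is an illustrative debugging aid, not part of thriftlm's API:

```python
def validate_plan(filled_plan, runtime_caps, stages):
    # Run stages in order; the first failure discards this candidate,
    # and the caller moves on to the next ranked candidate.
    for n, stage in enumerate(stages, start=1):
        if not stage(filled_plan, runtime_caps):
            return False, n  # which stage rejected it
    return True, None
```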


Safety and invariants

  • V2 never executes plans. It returns a validated FilledPlan. The caller owns execution.
  • Bucket hash is recomputed server-side on store. Caller-supplied intent_bucket_hash is not trusted — a mismatch returns 400 hash_mismatch.
  • Plans are tenant-isolated. Every plan is scoped to api_key. No cross-tenant reads or writes.
  • Extractor is deterministic. extract_plan_template() calls no LLM and makes no network requests. It generalizes a trace using reverse context mapping. It will refuse extraction (return ok=False) if the trace has fewer than 2 steps, all steps are side-effecting with no slots extracted, or extraction confidence is below 0.5.
  • Canonicalization is cached. Once a task string is canonicalized, the result is stored in Redis for 1 hour. The same task never triggers two OpenAI calls within that window.
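
The canonicalization cache behaves roughly like this. The key format is an illustrative assumption; any Redis client exposing get/setex (redis-py included) fits the shape:

```python
import hashlib
import json

CANON_TTL_SECONDS = 3600  # 1 hour, per the invariant above

def cached_canonicalize(redis_client, task: str, canonicalize_fn):
    key = "canon:" + hashlib.sha256(task.encode()).hexdigest()
    hit = redis_client.get(key)
    if hit is not None:
        return json.loads(hit)                 # repeat task: no OpenAI call
    result = canonicalize_fn(task)             # first sight: one gpt-4o-mini call
    redis_client.setex(key, CANON_TTL_SECONDS, json.dumps(result))
    return result
```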

Current scope and limitations

  • Text-first. V2 is designed for text-input agent tasks. Multimodal support (EvidenceProfile) is designed but not yet built.
  • Shallow slot extraction. The extractor handles exact top-level context value → placeholder substitution. Nested placeholder extraction and fuzzy abstraction are not yet supported.
  • No benchmark yet. V2 hit rate and latency benchmarks across diverse task families are planned for Phase 3.
  • plan_threshold = 0.60. SBERT cosine similarity between short task strings and plan descriptions typically lands in 0.50–0.70. The threshold may need tuning as your plan bank grows.
  • seed_task vs description split is future polish. Currently seed_v2_plans.py canonicalizes the plan description string, which works but conflates routing vocabulary with reranking text. Not a blocker.

Developer scripts

Script What it does
scripts/seed_v2_plans.py --api-key tlm_xxx Seeds Supabase with canonical plan templates. Calls canonicalize() on each description to get live bucket hashes — no hardcoded intent keys.
scripts/smoke_v2_lookup.py --api-key tlm_xxx --base-url http://localhost:8000 --task "..." --context '{}' --timeout 30 Fires a single lookup and prints the full JSON response. Use --timeout 30 on cold starts.
scripts/extract_and_store.py --api-key tlm_xxx --base-url http://localhost:8000 --task "..." --context '{}' --trace trace.json --canon canon.json Extracts a PlanTemplate from an execution trace and stores it via /v2/plan/store.
scripts/debug_v2_lookup.py --api-key tlm_xxx --bucket <hash> --task "..." --context '{}' Fetches raw DB rows for a bucket, scores them, and runs adapt_plan + validate_plan — useful for diagnosing misses.

V2 API endpoints

Method + Path Description
POST /v2/plan/lookup Main entry: task + context → FilledPlan or miss
POST /v2/plan/store Store a template (server recomputes and verifies bucket hash)
GET /v2/plan/bucket/:hash List templates for a bucket
DELETE /v2/plan/:id Evict a single plan
DELETE /v2/plan/bucket/:hash Evict an entire bucket
POST /v2/plan/invalidate-by-version Bulk soft-invalidate by version string
GET /v2/metrics Server health + version

V1 metrics dashboard

thriftlm serve --api-key your-key
# → http://localhost:8000  (opens automatically)

Shows hit rate, tokens saved, estimated cost saved, and top cached queries. Reads directly from your Supabase.


V1 benchmark

Threshold | Hit Rate | Hits / 200
----------|----------|------------
0.70      |  92.5%   |   185
0.75      |  86.0%   |   172
0.80      |  78.0%   |   156
0.82      |  73.5%   |   147   ← recommended
0.85      |  62.5%   |   125   (default)
0.90      |  40.0%   |    80

Model: all-MiniLM-L6-v2  ·  Dataset: Quora Question Pairs (200 pairs)

Project structure

ThriftLM/
├── thriftlm/
│   ├── cache.py                 # V1 SemanticCache
│   ├── embedder.py              # SBERT all-MiniLM-L6-v2
│   ├── privacy.py               # Presidio PII scrubbing
│   ├── _server.py               # V1 FastAPI (thriftlm serve)
│   ├── cli.py                   # CLI entry point
│   ├── backends/
│   │   ├── local_index.py       # Numpy cosine index
│   │   ├── redis_backend.py     # Exact hash cache
│   │   └── supabase_backend.py  # pgvector HNSW store
│   └── v2/
│       ├── schemas.py           # TypedDicts: IntentKey, PlanTemplate, FilledPlan, …
│       ├── intent.py            # canonicalize() → IntentKey + bucket hash
│       ├── canonicalization_cache.py  # Redis cache for canonicalization results
│       ├── plan_cache.py        # bucket fetch + composite rerank
│       ├── adapter.py           # slot filling + TransformRegistry
│       ├── validator.py         # 7-stage validation pipeline
│       ├── extractor.py         # trace → PlanTemplate (deterministic)
│       ├── _server.py           # V2 FastAPI endpoints
│       └── adapters/
│           ├── base.py          # BasePlanCache ABC
│           └── generic.py       # ThriftLMPlanCache HTTP client
├── scripts/
│   ├── seed_v2_plans.py
│   ├── smoke_v2_lookup.py
│   ├── extract_and_store.py
│   └── debug_v2_lookup.py
├── tests/                       # 364 passing
├── supabase/setup.sql
├── api/                         # Multi-tenant self-hosted backend
└── pyproject.toml

Roadmap

Item Status
V1 response cache Shipped (0.1.x)
V2 plan cache Shipped (0.2.0)
V2 benchmark (200 tasks, 5 intent buckets) Phase 3
Fly.io deploy + hosted endpoint Phase 3
Claude Code MCP adapter / Codex CLI hook Roadmap
seed_task vs description split in seed script Post-0.2.0 polish
V2.5 multimodal EvidenceProfile Future

Development

git clone https://github.com/samujure/ThriftLM
cd ThriftLM
pip install -e ".[dev]"
cp .env.example .env      # fill in SUPABASE_URL, SUPABASE_KEY, REDIS_URL, OPENAI_API_KEY
docker compose up -d      # local Redis
pytest tests/ -q          # 364 tests
python scripts/seed_v2_plans.py --api-key tlm_test
python scripts/smoke_v2_lookup.py --api-key tlm_test --base-url http://localhost:8000 \
  --task "summarize open PRs for org/myrepo" --context '{"repo":"org/myrepo"}' \
  --runtime-caps '{"tool_families":["github"],"allow_side_effects":false}' --timeout 30

This README reflects thriftlm==0.2.0.


Built by Srivamsi Amujure & Ivan Thomas Shen
