
Production-grade semantic caching for LLM agents — a learned equivalence classifier replaces the naive cosine threshold that causes false-positive failures.


SemanticMemo

Semantic caching framework using FAISS retrieval, learned equivalence classification, Cross-Encoder verification, and entity-drift detection.



The Problem: Cosine Caches Fail in Production

Every semantic cache today makes the same mistake: it decides cache hits by cosine similarity.

"Should I approve the refund?"   ── cosine: 0.97 ──▶  CACHE HIT  ← wrong
"Should I deny the refund?"      ─────────────────────────────────────────

These two prompts are 97% similar in embedding space. They require opposite responses. A cosine threshold cannot tell them apart — and there is no threshold that fixes this. In the medical domain, this means a cached "increase dosage" response gets served for "decrease dosage". In finance, "buy 500 shares" gets served for "sell 500 shares".
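To make the failure mode concrete, here is a minimal sketch of a threshold-only cache check. The vectors are hand-picked toy values, not output from any embedding model, and `naive_cache_hit` is a hypothetical helper rather than part of SemanticMemo. The point is only the ordering: if an opposite-action prompt scores higher than a genuine paraphrase, any threshold that accepts the paraphrase also accepts the wrong answer.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 4-d vectors chosen for illustration only (assumption, not model output).
approve = [0.70, 0.50, 0.45, 0.20]      # "Should I approve the refund?"
deny = [0.70, 0.50, 0.45, -0.05]        # "Should I deny the refund?"
paraphrase = [0.55, 0.65, 0.25, 0.40]   # a reworded "approve" request


def naive_cache_hit(query: list[float], cached: list[float], threshold: float = 0.90) -> bool:
    """Threshold-only decision: the check that cosine caches rely on."""
    return cosine(query, cached) >= threshold


opposite_similarity = cosine(approve, deny)
paraphrase_similarity = cosine(approve, paraphrase)

print(f"opposite action: {opposite_similarity:.3f}")
print(f"paraphrase:      {paraphrase_similarity:.3f}")
print("opposite action served from cache:", naive_cache_hit(deny, approve))
```

In this toy setup the opposite-action pair is the closer of the two, so raising the threshold would reject the paraphrase before it rejects the wrong answer.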

The result: Teams adopt semantic caching in development, hit a false-positive incident, rip it out, and go back to paying full LLM cost on every call. The cycle repeats.

SemanticMemo replaces the cosine threshold with a learned, four-stage verification pipeline that cuts the hard-negative false positive rate from 33.3% to 0% on a 12-pair hard-negative benchmark.


Benchmark Results

Hard-Negative False Positive Rate

The critical number. Hard negatives are semantically near-identical prompt pairs that require opposite actions — the case that breaks cosine caches.

| Method | Hard-Negative FPR | False Positives / 12 pairs |
|---|---|---|
| Cosine Baseline (threshold=0.90) | 33.3% | 4/12 |
| MLP Classifier | 0.0% | 0/12 |
| Double Verification | 0.0% | 0/12 |
| SemanticMemo | 0.0% | 0/12 |

Domain Comparison Matrix

Evaluated over 80 prompt pairs (4 domains × 20 pairs, 10 positive + 10 hard-negative each):

| Method | Domain | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| Cosine Baseline | Finance | 0.000 | 0.000 | 0.000 | 0.200 |
| MLP Classifier | Finance | 1.000 | 0.700 | 0.824 | 0.000 |
| SemanticMemo | Finance | 1.000 | 0.500 | 0.667 | 0.000 |
| Cosine Baseline | Medical | 0.400 | 0.200 | 0.267 | 0.300 |
| MLP Classifier | Medical | 0.667 | 0.600 | 0.632 | 0.300 |
| SemanticMemo | Medical | 0.714 | 0.500 | 0.588 | 0.200 |
| Cosine Baseline | Security | 0.000 | 0.000 | 0.000 | 0.200 |
| MLP Classifier | Security | 1.000 | 0.500 | 0.667 | 0.000 |
| SemanticMemo | Security | 1.000 | 0.500 | 0.667 | 0.000 |

Classifier vs Cosine on Gold Set (84 held-out pairs)

| Method | Precision | Recall | F1 | False Positives |
|---|---|---|---|---|
| Cosine (at equal recall) | 0.527 | 0.935 | 0.674 | 26 |
| equivalence-net-v1 | 0.829 | 0.935 | 0.879 | 6 |

+30.2 precision points at equal recall.
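The gold-set figures can be sanity-checked from confusion counts. The sketch below assumes 29 true positives and 2 false negatives for both methods. Those counts are not stated in the results; they are inferred here as the only integers consistent with the reported recall (0.935) and false-positive counts. Under that assumption the rounded precision, recall, and F1 values all reproduce. The +30.2 figure is the difference of the rounded precisions; with the inferred counts the unrounded gap is about 30.1 points.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumption: tp=29 and fn=2 are inferred from the reported recall, not published.
# The fp values (26 and 6) come straight from the results table.
cosine_metrics = precision_recall_f1(tp=29, fp=26, fn=2)
classifier_metrics = precision_recall_f1(tp=29, fp=6, fn=2)

for name, (p, r, f1) in [("cosine", cosine_metrics), ("equivalence-net-v1", classifier_metrics)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Precision gain in percentage points, computed from the unrounded values.
precision_gain_points = 100 * (classifier_metrics[0] - cosine_metrics[0])
print(f"precision gain: {precision_gain_points:.1f} points")
```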

Full results, latency breakdown, cost savings model, and threshold sweep report: docs/results.md


Architecture

SemanticMemo chains four stages. The first stage is permissive (high recall), and each later stage narrows the candidates. Only a confirmed hit is served from the cache.

flowchart LR
    P["Incoming Prompt"] --> E["① Embed\nall-MiniLM-L6-v2\n~20ms"]
    E --> R["② Retrieve\nFAISS top-K\n<1ms"]
    R --> M["③ MLP Classifier\nequivalence-net-v1\n~1ms"]
    M -->|"score ≥ 0.995\n94.7% of hits"| HIT["✅ Cache Hit\n~27ms total"]
    M -->|"uncertain\n5.3% of hits"| CE["④ Cross-Encoder\nms-marco-MiniLM\n~3–8ms"]
    CE -->|"score ≥ threshold"| HIT
    CE -->|"score < threshold"| MISS["❌ Cache Miss\nCall LLM → Store"]
    M -->|"opposite-action\nveto"| MISS

| Stage | Component | Latency | Purpose |
|---|---|---|---|
| ① Embed | all-MiniLM-L6-v2 | ~20ms | Dense prompt vector |
| ② Retrieve | FAISS IndexFlatIP | <1ms | Top-K candidates |
| ③ MLP | equivalence-net-v1.pt | ~1ms | Fast pair equivalence |
| Veto | Rule-based patterns | <0.1ms | Block opposite-action pairs |
| Bypass | MLP ≥ 0.995 | 0ms | Skip CE for certain hits |
| ④ Cross-Encoder | ms-marco-MiniLM-L-6-v2 | ~3–8ms | Deep re-ranking |
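The control flow after retrieval can be sketched in plain Python. This is an illustration of the flowchart, not the orchestrator's actual code: `decide`, `is_opposite_action`, the word list, and the ordering of the veto before the bypass are all assumptions. The reason strings "mlp_bypass" and "passed_all_thresholds" appear in the decision trace described in this README; the other reason strings are made up for the sketch. Scores are passed in directly so the example runs without any models.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical opposite-action word pairs; the real veto patterns are not shown here.
OPPOSITE_ACTIONS = [("approve", "deny"), ("increase", "decrease"), ("buy", "sell")]


def is_opposite_action(query: str, cached: str) -> bool:
    """True when one prompt contains an action word and the other its opposite."""
    q, c = set(query.lower().split()), set(cached.lower().split())
    return any((a in q and b in c) or (b in q and a in c) for a, b in OPPOSITE_ACTIONS)


@dataclass
class Decision:
    hit: bool
    reason: str


def decide(
    query: str,
    cached: str,
    mlp_score: float,
    cross_encoder_score: Callable[[], float],  # called lazily, only when needed
    mlp_threshold: float = 0.90,
    ce_threshold: float = 0.85,
    bypass_threshold: float = 0.995,
) -> Decision:
    if is_opposite_action(query, cached):
        return Decision(False, "opposite_action_veto")
    if mlp_score >= bypass_threshold:
        return Decision(True, "mlp_bypass")  # near-certain: skip the cross-encoder
    if mlp_score < mlp_threshold:
        return Decision(False, "below_classifier_threshold")
    if cross_encoder_score() >= ce_threshold:
        return Decision(True, "passed_all_thresholds")
    return Decision(False, "below_cross_encoder_threshold")


vetoed = decide("Should I approve the refund?", "Should I deny the refund?", 0.999, lambda: 0.99)
bypassed = decide("Where is my order?", "Where is my package?", 0.997, lambda: 0.0)
verified = decide("Where is my order?", "Track my order", 0.93, lambda: 0.90)
rejected = decide("Where is my order?", "Cancel my order", 0.93, lambda: 0.60)
print(vetoed, bypassed, verified, rejected, sep="\n")
```

Passing the cross-encoder as a callable mirrors the latency argument in the table: the expensive stage only runs for candidates the MLP is unsure about.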

Risk-Aware Policies

High-stakes domains (medical, finance, security, legal) use stricter thresholds automatically:

| Domain | MLP Threshold | CE Threshold |
|---|---|---|
| Customer Support / General | 0.90 | 0.85 |
| Medical / Finance / Security / Legal | 0.99 | 0.95 |
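As a rough illustration of what the table implies, the same classifier score can be a hit in one domain and a miss in another. The sketch below is standalone: `Tier`, `thresholds_for`, and the choice to treat unlisted domains as low risk are assumptions for the example, not the library's RiskPolicy implementation.

```python
from enum import Enum


class Tier(Enum):
    LOW = "low"
    HIGH = "high"


# Domains and thresholds copied from the table above.
HIGH_RISK_DOMAINS = {"medical", "finance", "security", "legal"}
THRESHOLDS = {
    Tier.LOW: (0.90, 0.85),   # (MLP threshold, cross-encoder threshold)
    Tier.HIGH: (0.99, 0.95),
}


def thresholds_for(domain: str) -> tuple[float, float]:
    """Look up thresholds by domain. Assumption: unlisted domains fall back to LOW."""
    tier = Tier.HIGH if domain.lower() in HIGH_RISK_DOMAINS else Tier.LOW
    return THRESHOLDS[tier]


mlp_score = 0.95
passes_support = mlp_score >= thresholds_for("customer-support")[0]
passes_medical = mlp_score >= thresholds_for("medical")[0]
print(passes_support, passes_medical)
```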

Installation

pip install "semanticmemo[ml]"

The [ml] extra includes PyTorch, FAISS, and SentenceTransformers — required for the embedding model and bundled classifier.

For local development:

git clone https://github.com/rajveer100704/semanticmemo
cd semanticmemo
uv sync --all-extras
uv run pytest          # 91 tests
uv run ruff check
uv run pyright

Quickstart

Drop-in wrapper — zero config

import asyncio

from semanticmemo import SemanticMemo, ClassifierConfig

cache = SemanticMemo(
    domain="customer-support",
    classifier=ClassifierConfig.bundled(),   # ships with the package
)

async def call_llm(prompt: str) -> str:
    # your existing LLM call here
    return "fresh response"

async def main() -> None:
    # get_or_call is awaited, so it has to run inside a coroutine
    result = await cache.get_or_call(
        prompt="Where is my order?",
        llm_function=call_llm,
    )

    print(result.response)          # cached or fresh
    print(result.was_cache_hit)     # True / False
    print(result.cost_saved_usd)    # Decimal, $0 on miss
    print(result.latency_ms)        # full round-trip latency

asyncio.run(main())

Production config — risk-aware, domain-aware

from semanticmemo import (
    SemanticMemo, CacheConfig, ClassifierConfig,
    CrossEncoderConfig, RiskPolicy, RiskTier,
)

cache = SemanticMemo(
    domain="medical",
    config=CacheConfig(
        cross_encoder=CrossEncoderConfig(
            model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
        ),
        risk_policy=RiskPolicy(
            domain_tiers={
                "medical":  RiskTier.HIGH,
                "finance":  RiskTier.HIGH,
                "security": RiskTier.HIGH,
                "legal":    RiskTier.HIGH,
            },
            # LOW tier: customer-support, general
            low_risk_classifier_threshold=0.90,
            low_risk_cross_encoder_threshold=0.85,
            # HIGH tier: medical, finance, security, legal
            high_risk_classifier_threshold=0.99,
            high_risk_cross_encoder_threshold=0.95,
        ),
        high_precision_skip_threshold=0.995,  # bypass CE when MLP is near-certain
    ),
    classifier=ClassifierConfig.bundled(),
)

result = await cache.get_or_call(prompt="...", llm_function=call_llm)

# Explainable decision trace on every result
if result.decision:
    print(result.decision.reason)              # "mlp_bypass" / "passed_all_thresholds" / ...
    print(result.decision.risk_tier)           # "high" / "low"
    print(result.decision.classifier_score)    # float
    print(result.decision.cross_encoder_score) # float | None

# Latency profiling
print(result.embedding_latency_ms)       # ~20ms
print(result.retrieval_latency_ms)       # ~0.05ms
print(result.mlp_latency_ms)             # ~1ms
print(result.cross_encoder_latency_ms)   # ~3-8ms or 0 (bypassed)

Feedback & Retraining

Every cache hit is recorded. Report bad hits to generate labeled training data:

result = await cache.get_or_call(
    prompt="Approve the customer's refund request",
    llm_function=call_llm,
)

if result.was_cache_hit and user_rejected_answer:
    await cache.report_bad_hit(result.query_id, reason="wrong decision")

# Export feedback as training pairs
written = cache.export_feedback_pairs("data/feedback_pairs.jsonl")

Retrain a candidate classifier when feedback accumulates:

uv run semanticmemo retrain \
  --out models/classifier-candidate.pt \
  --validation-data data/validation_pairs.jsonl \
  --domain medical \
  --min-precision 0.95 \
  --promote-to models/classifier-active.pt
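One plausible reading of the --min-precision and --promote-to flags is a promotion gate: the candidate only replaces the active model if it clears a precision bar on the validation set. The sketch below illustrates that idea with stand-in files and hand-written labels. `promote_if_precise` is a hypothetical helper, and the CLI's real behavior may differ.

```python
import shutil
import tempfile
from pathlib import Path


def promote_if_precise(
    candidate: Path,
    active: Path,
    labels: list[int],
    predictions: list[int],
    min_precision: float = 0.95,
) -> bool:
    """Copy candidate over active only when validation precision meets the bar."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision < min_precision:
        return False
    active.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(candidate, active)
    return True


with tempfile.TemporaryDirectory() as tmp:
    candidate = Path(tmp) / "classifier-candidate.pt"
    active = Path(tmp) / "models" / "classifier-active.pt"
    candidate.write_bytes(b"stand-in weights")  # placeholder, not a real checkpoint

    labels = [1, 1, 1, 0, 0, 0]
    # One false positive out of three predicted hits: precision 2/3, below the bar.
    promoted_bad = promote_if_precise(candidate, active, labels, [1, 1, 0, 1, 0, 0])
    active_exists_after_bad = active.exists()
    # No false positives: precision 1.0, so the candidate is promoted.
    promoted_good = promote_if_precise(candidate, active, labels, [1, 1, 0, 0, 0, 0])
    active_exists_after_good = active.exists()

print(promoted_bad, promoted_good)
```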

Implicit Feedback (opt-in)

Re-issuing the same prompt shortly after a cache hit is auto-flagged as a bad hit:

from semanticmemo import CacheConfig, ImplicitFeedbackConfig, SemanticMemo

cache = SemanticMemo(
    domain="customer-support",
    config=CacheConfig(
        implicit_feedback=ImplicitFeedbackConfig(window_seconds=30.0),
    ),
)
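The windowing idea behind implicit feedback can be sketched without the library. `RepeatDetector` below is a hypothetical class that assumes exact string matching on the prompt; how SemanticMemo actually matches re-issued prompts is not described here. Timestamps are injectable so the behavior can be checked without sleeping.

```python
import time
from typing import Optional


class RepeatDetector:
    """Flags a prompt as an implicit bad hit if it recurs soon after a cache hit."""

    def __init__(self, window_seconds: float = 30.0) -> None:
        self.window_seconds = window_seconds
        self._last_hit: dict[str, float] = {}

    def record_hit(self, prompt: str, now: Optional[float] = None) -> None:
        """Remember when this prompt was last served from the cache."""
        self._last_hit[prompt] = time.monotonic() if now is None else now

    def is_implicit_bad_hit(self, prompt: str, now: Optional[float] = None) -> bool:
        """True when the same prompt was a cache hit within the window."""
        current = time.monotonic() if now is None else now
        last = self._last_hit.get(prompt)
        return last is not None and (current - last) <= self.window_seconds


detector = RepeatDetector(window_seconds=30.0)
detector.record_hit("Where is my order?", now=100.0)

repeated_quickly = detector.is_implicit_bad_hit("Where is my order?", now=112.0)
repeated_late = detector.is_implicit_bad_hit("Where is my order?", now=145.0)
never_hit = detector.is_implicit_bad_hit("Cancel my order", now=112.0)
print(repeated_quickly, repeated_late, never_hit)
```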

Benchmarks

# Full comparison: Cosine vs MLP vs Double Verification vs SemanticMemo
uv run python benchmarks/run_benchmarks.py

# Threshold sweep (40×30 grid, per-domain FPR constraints)
uv run python benchmarks/sweep_thresholds.py

# Classifier vs cosine on gold set
uv run python benchmarks/classifier_vs_cosine.py

# High-stakes opposite-action evaluation
uv run python benchmarks/false_positive_eval.py

Results are saved to benchmarks/results/.
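A threshold sweep of this shape can be sketched as a grid search that maximizes recall subject to an FPR ceiling. Everything below is illustrative: the grid ranges, the five scored pairs, and the `sweep` function are assumptions, chosen only so that the grid has 40×30 = 1,200 configurations like the one mentioned above. The real script's grid and selection rule may differ.

```python
from itertools import product
from typing import Optional

# (mlp_score, cross_encoder_score, is_equivalent): toy data, not benchmark output.
Pair = tuple[float, float, bool]
Best = tuple[float, float, float, float]  # (recall, mlp_threshold, ce_threshold, fpr)


def sweep(pairs: list[Pair], mlp_grid: list[float], ce_grid: list[float], max_fpr: float) -> Optional[Best]:
    """Return the highest-recall config whose FPR stays at or below max_fpr."""
    positives = sum(1 for _, _, y in pairs if y)
    negatives = len(pairs) - positives
    best: Optional[Best] = None
    for mlp_t, ce_t in product(mlp_grid, ce_grid):
        tp = fp = 0
        for mlp, ce, is_equivalent in pairs:
            if mlp >= mlp_t and ce >= ce_t:  # both stages must accept the pair
                if is_equivalent:
                    tp += 1
                else:
                    fp += 1
        fpr = fp / negatives if negatives else 0.0
        recall = tp / positives if positives else 0.0
        if fpr <= max_fpr and (best is None or recall > best[0]):
            best = (recall, mlp_t, ce_t, fpr)
    return best


# Hypothetical grid: 40 MLP thresholds x 30 cross-encoder thresholds.
mlp_grid = [round(0.60 + 0.01 * i, 2) for i in range(40)]
ce_grid = [round(0.70 + 0.01 * i, 2) for i in range(30)]

pairs = [
    (0.97, 0.96, True),
    (0.93, 0.90, True),
    (0.80, 0.75, True),
    (0.95, 0.72, False),  # hard negative the MLP likes but the cross-encoder does not
    (0.70, 0.88, False),  # hard negative the cross-encoder likes but the MLP does not
]

best = sweep(pairs, mlp_grid, ce_grid, max_fpr=0.0)
print(len(mlp_grid) * len(ce_grid), best)
```

The two toy negatives are each rejected by a different stage, which is the situation where sweeping both thresholds jointly finds a configuration that neither threshold could reach alone.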


Project Structure

semanticmemo/
├── src/semanticmemo/
│   ├── cache.py                    # SemanticMemo public API
│   ├── orchestrator.py             # 4-stage decision engine
│   ├── models.py                   # CacheResult, CacheDecision, CacheConfig, RiskPolicy
│   ├── domain_detector.py          # Embedding-based domain routing
│   ├── classifier/
│   │   ├── model.py                # PairClassifier (MLP nn.Module)
│   │   ├── service.py              # ClassifierService wrapper
│   │   ├── cross_encoder_service.py # CrossEncoderService + _MODEL_CACHE
│   │   ├── train.py                # Training loop
│   │   └── evaluate.py             # Evaluation metrics
│   ├── embedding/service.py        # EmbeddingService + FAISS/in-memory index
│   ├── store/sqlite_store.py       # SQLite persistence (WAL mode)
│   ├── feedback/                   # Feedback ledger + retraining trigger
│   ├── _models/equivalence-net-v1.pt   # Bundled pretrained classifier
│   └── cli.py                      # semanticmemo retrain / stats / export-feedback
├── benchmarks/
│   ├── run_benchmarks.py           # 4-method × 4-domain comparison matrix
│   ├── sweep_thresholds.py         # 1,200-config threshold grid search
│   ├── false_positive_eval.py      # High-stakes opposite-action evaluation
│   ├── data/                       # 20-pair datasets per domain + hard_negatives.jsonl
│   └── results/                    # JSON + MD benchmark outputs
├── docs/
│   ├── results.md                  # Full benchmark results (this project's showpiece)
│   ├── architecture.md             # System design
│   ├── decision_engine.md          # 4-stage pipeline deep-dive
│   └── benchmark_methodology.md    # Evaluation framework
└── tests/                          # 91 tests (pytest)

Roadmap

v1.1.0

  • Qdrant production backend (currently available via CacheConfig.vector_store_type="qdrant", full hardening coming)
  • Automated nightly retraining pipeline triggered by feedback accumulation threshold
  • Multi-domain classifier training (single model, domain-conditioned)

v1.2.0

  • OpenTelemetry trace export for CacheDecision spans
  • Redis cache store backend (distributed caching for multi-instance deployments)
  • semanticmemo compare CLI command: before/after precision/recall diff on any JSONL dataset

Reliability

  • Resource cleanup: use async with SemanticMemo(...) as cache: or call cache.close()
  • Retries: CacheConfig(retry=RetryConfig(...)) for transient LLM failures (off by default)
  • WAL mode SQLite: safe for multi-threaded single-process use
  • Logging: silent by default under the semanticmemo logger namespace; configure to opt in
  • Type-safe: full Pydantic v2 models throughout; pyright basic mode passes clean
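For readers unfamiliar with WAL mode, the sketch below shows how a SQLite connection is typically switched to write-ahead logging with Python's standard sqlite3 module. It is not taken from sqlite_store.py; the table layout and `open_wal_connection` are assumptions for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_wal_connection(path: Path) -> sqlite3.Connection:
    """Open a file-backed database and switch it to write-ahead logging."""
    # check_same_thread=False lets other threads use this connection; callers
    # are still responsible for serializing access to it.
    conn = sqlite3.connect(path, check_same_thread=False)
    mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL mode, got {mode!r}")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_wal_connection(Path(tmp) / "cache.db")
    # Hypothetical schema for the example only.
    conn.execute("CREATE TABLE entries (prompt TEXT PRIMARY KEY, response TEXT)")
    conn.execute("INSERT INTO entries VALUES (?, ?)", ("Where is my order?", "cached"))
    conn.commit()
    journal_mode = conn.execute("PRAGMA journal_mode;").fetchone()[0]
    row_count = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
    conn.close()

print(journal_mode, row_count)
```

WAL mode lets readers proceed while a write is in progress, which fits the single-process, multi-threaded use case listed above.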

Release

Published to PyPI as semanticmemo. Internal import name is unchanged: import semanticmemo.

pip install "semanticmemo[ml]"

Tagged releases publish automatically via GitHub Actions trusted publishing.

git tag v1.0.0
git push origin v1.0.0

License

MIT — see LICENSE.
