Production-grade semantic caching for LLM agents — a learned equivalence classifier replaces the naive cosine threshold that causes false-positive failures.

SemanticMemo

Semantic caching framework using FAISS retrieval, learned equivalence classification, Cross-Encoder verification, and entity-drift detection.

The Problem: Cosine Caches Fail in Production

Every semantic cache today makes the same mistake: it decides cache hits by cosine similarity.

"Should I approve the refund?"   ── cosine: 0.97 ──▶  CACHE HIT  ← wrong
"Should I deny the refund?"      ─────────────────────────────────────────

These two prompts score 0.97 cosine similarity in embedding space, yet they require opposite responses. A cosine threshold cannot tell them apart, and no threshold value fixes this. In the medical domain, this means a cached "increase dosage" response gets served for "decrease dosage". In finance, "buy 500 shares" gets served for "sell 500 shares".
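To make the failure mode concrete, here is a minimal sketch using made-up 4-dimensional vectors rather than real model embeddings. The vector values, the `cosine` helper, and the 0.90 threshold are illustrative assumptions; the only point is that two vectors differing in a single "polarity" dimension can still clear a typical similarity threshold.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings: identical except for the sign of the last
# dimension, which stands in for the approve/deny distinction.
approve = [0.9, 0.3, 0.2, 0.1]
deny = [0.9, 0.3, 0.2, -0.1]

THRESHOLD = 0.90  # assumed cosine cache threshold
similarity = cosine(approve, deny)
is_cache_hit = similarity >= THRESHOLD

print(f"cosine={similarity:.3f} hit={is_cache_hit}")
```

In this toy setup the pair clears the threshold even though the two prompts call for opposite actions.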

The result: teams adopt semantic caching in development, hit a false-positive incident, rip it out, and go back to paying full LLM cost on every call. The cycle repeats.

SemanticMemo replaces the cosine threshold with a learned, four-stage verification pipeline. On the 12-pair hard-negative benchmark below, it cuts the false positive rate from 33.3% (cosine baseline at threshold 0.90) to 0%.

Benchmark Results

Hard-Negative False Positive Rate

The critical number. Hard negatives are semantically near-identical prompt pairs that require opposite actions — the case that breaks cosine caches.

Method	Hard-Negative FPR	False Positives / 12 pairs
Cosine Baseline (threshold=0.90)	33.3% ❌	4/12
MLP Classifier	0.0% ✅	0/12
Double Verification	0.0% ✅	0/12
SemanticMemo	0.0% ✅	0/12

Domain Comparison Matrix

Evaluated over 80 prompt pairs (4 domains × 20 pairs; each domain has 10 positive and 10 hard-negative pairs). Rows for three of the four domains are listed here:

Method	Domain	Precision	Recall	F1	FPR
Cosine Baseline	Finance	0.000	0.000	0.000	0.200
MLP Classifier	Finance	1.000	0.700	0.824	0.000
SemanticMemo	Finance	1.000	0.500	0.667	0.000
Cosine Baseline	Medical	0.400	0.200	0.267	0.300
MLP Classifier	Medical	0.667	0.600	0.632	0.300
SemanticMemo	Medical	0.714	0.500	0.588	0.200
Cosine Baseline	Security	0.000	0.000	0.000	0.200
MLP Classifier	Security	1.000	0.500	0.667	0.000
SemanticMemo	Security	1.000	0.500	0.667	0.000

Classifier vs Cosine on Gold Set (84 held-out pairs)

Method	Precision	Recall	F1	False Positives
Cosine (at equal recall)	0.527	0.935	0.674	26
`equivalence-net-v1`	0.829	0.935	0.879	6

+30.2 precision points at equal recall.
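The table values can be reproduced from confusion counts. The README reports only false positives, so the sketch below assumes 29 true positives and 2 false negatives for both methods; these counts are inferred from the reported precision and recall, not stated in the source.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts: tp=29 and fn=2 are inferred; fp comes from the table.
cosine = precision_recall_f1(tp=29, fp=26, fn=2)
classifier = precision_recall_f1(tp=29, fp=6, fn=2)

for name, (p, r, f1) in {"cosine": cosine, "equivalence-net-v1": classifier}.items():
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Precision gain in percentage points at equal recall.
gain_points = (classifier[0] - cosine[0]) * 100
print(f"gain={gain_points:.1f} points")
```

Under these assumed counts the rounded metrics match the table. The unrounded precision gain comes out at roughly 30.1 points, while subtracting the rounded table values (0.829 − 0.527) gives the 30.2 quoted above.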

Full results, latency breakdown, cost savings model, and threshold sweep report: docs/results.md

Architecture

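SemanticMemo chains four stages. The first stage is permissive (high recall), and each later stage narrows the candidates. Only a candidate confirmed by the pipeline is served from the cache; anything else is a cache miss, which calls the LLM and stores the fresh response.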

flowchart LR
    P["Incoming Prompt"] --> E["① Embed\nall-MiniLM-L6-v2\n~20ms"]
    E --> R["② Retrieve\nFAISS top-K\n<1ms"]
    R --> M["③ MLP Classifier\nequivalence-net-v1\n~1ms"]
    M -->|"score ≥ 0.995\n94.7% of hits"| HIT["✅ Cache Hit\n~27ms total"]
    M -->|"uncertain\n5.3% of hits"| CE["④ Cross-Encoder\nms-marco-MiniLM\n~3–8ms"]
    CE -->|"score ≥ threshold"| HIT
    CE -->|"score < threshold"| MISS["❌ Cache Miss\nCall LLM → Store"]
    M -->|"opposite-action\nveto"| MISS

Stage	Component	Latency	Purpose
① Embed	`all-MiniLM-L6-v2`	~20ms	Dense prompt vector
② Retrieve	FAISS IndexFlatIP	<1ms	Top-K candidates
③ MLP	`equivalence-net-v1.pt`	~1ms	Fast pair equivalence
Veto	Rule-based patterns	<0.1ms	Block opposite-action pairs
Bypass	MLP ≥ 0.995	0ms	Skip CE for certain hits
④ Cross-Encoder	ms-marco-MiniLM-L-6-v2	~3–8ms	Deep re-ranking
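The gating logic in the flowchart and table can be sketched in plain Python. This is not the package's orchestrator: the `decide` function, the `Decision` dataclass, the ordering of the veto check, and every reason string other than `mlp_bypass` and `passed_all_thresholds` are assumptions made for illustration. The scorers are passed in as callables so the sketch runs without any models.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Scorer = Callable[[str, str], float]


@dataclass
class Decision:
    hit: bool
    reason: str
    classifier_score: float
    cross_encoder_score: Optional[float] = None


def decide(
    prompt: str,
    candidate: str,
    mlp_score: Scorer,
    ce_score: Scorer,
    is_opposite_action: Callable[[str, str], bool],
    mlp_threshold: float = 0.90,
    ce_threshold: float = 0.85,
    bypass_threshold: float = 0.995,
) -> Decision:
    """Hypothetical gate for one retrieved candidate."""
    score = mlp_score(prompt, candidate)
    # Rule-based veto: opposite-action pairs are always a miss.
    if is_opposite_action(prompt, candidate):
        return Decision(False, "opposite_action_veto", score)
    # Near-certain classifier scores skip the cross-encoder.
    if score >= bypass_threshold:
        return Decision(True, "mlp_bypass", score)
    # Assumed: scores below the classifier threshold never reach the cross-encoder.
    if score < mlp_threshold:
        return Decision(False, "below_classifier_threshold", score)
    # Uncertain band: ask the slower cross-encoder to confirm.
    ce = ce_score(prompt, candidate)
    if ce >= ce_threshold:
        return Decision(True, "passed_all_thresholds", score, ce)
    return Decision(False, "below_cross_encoder_threshold", score, ce)


def constant(value):
    """Build a stub that ignores its inputs (for demonstration only)."""
    return lambda _prompt, _candidate: value


bypass = decide("Where is my order?", "Where's my order?",
                constant(0.999), constant(0.0), constant(False))
vetoed = decide("Approve the refund", "Deny the refund",
                constant(0.999), constant(0.99), constant(True))
confirmed = decide("Track my package", "Where is my parcel?",
                   constant(0.93), constant(0.91), constant(False))
rejected = decide("Track my package", "Cancel my package",
                  constant(0.93), constant(0.40), constant(False))

for d in (bypass, vetoed, confirmed, rejected):
    print(d.hit, d.reason)
```

Running the veto before the bypass check is a deliberate choice in this sketch: a very confident classifier score should not be able to override a rule that blocks opposite-action pairs.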

Risk-Aware Policies

High-stakes domains (medical, finance, security, legal) use stricter thresholds automatically:

Domain	MLP Threshold	CE Threshold
Customer Support / General	0.90	0.85
Medical / Finance / Security / Legal	0.99	0.95
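As a rough illustration of how the table could translate into a lookup, the sketch below maps a domain name to its threshold pair. The `Tier`, `Thresholds`, and `thresholds_for` names are hypothetical and only mirror the table; falling back to the low-risk tier for unlisted domains is also an assumption.

```python
from enum import Enum
from typing import NamedTuple


class Tier(Enum):
    LOW = "low"
    HIGH = "high"


class Thresholds(NamedTuple):
    classifier: float
    cross_encoder: float


# Values copied from the table above.
TIER_THRESHOLDS = {
    Tier.LOW: Thresholds(classifier=0.90, cross_encoder=0.85),
    Tier.HIGH: Thresholds(classifier=0.99, cross_encoder=0.95),
}
HIGH_RISK_DOMAINS = {"medical", "finance", "security", "legal"}


def thresholds_for(domain: str) -> Thresholds:
    """Return the threshold pair for a domain (unlisted domains fall back to LOW)."""
    tier = Tier.HIGH if domain.lower() in HIGH_RISK_DOMAINS else Tier.LOW
    return TIER_THRESHOLDS[tier]


print(thresholds_for("medical"))
print(thresholds_for("customer-support"))
```

A more conservative design could default unknown domains to the high-risk tier instead, trading some hit rate for safety.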

Installation

pip install "semanticmemo[ml]"

The [ml] extra includes PyTorch, FAISS, and SentenceTransformers — required for the embedding model and bundled classifier.

For local development:

git clone https://github.com/rajveer100704/semanticmemo
cd semanticmemo
uv sync --all-extras
uv run pytest          # 91 tests
uv run ruff check
uv run pyright

Quickstart

Drop-in wrapper — zero config

import asyncio

from semanticmemo import SemanticMemo, ClassifierConfig

cache = SemanticMemo(
    domain="customer-support",
    classifier=ClassifierConfig.bundled(),   # ships with the package
)

async def call_llm(prompt: str) -> str:
    # your existing LLM call here
    return "fresh response"

async def main() -> None:
    # get_or_call is awaited, so it has to run inside a coroutine
    result = await cache.get_or_call(
        prompt="Where is my order?",
        llm_function=call_llm,
    )

    print(result.response)          # cached or fresh
    print(result.was_cache_hit)     # True / False
    print(result.cost_saved_usd)    # Decimal, $0 on miss
    print(result.latency_ms)        # full round-trip latency

asyncio.run(main())

Production config — risk-aware, domain-aware

from semanticmemo import (
    SemanticMemo, CacheConfig, ClassifierConfig,
    CrossEncoderConfig, RiskPolicy, RiskTier,
)

cache = SemanticMemo(
    domain="medical",
    config=CacheConfig(
        cross_encoder=CrossEncoderConfig(
            model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
        ),
        risk_policy=RiskPolicy(
            domain_tiers={
                "medical":  RiskTier.HIGH,
                "finance":  RiskTier.HIGH,
                "security": RiskTier.HIGH,
                "legal":    RiskTier.HIGH,
            },
            # LOW tier: customer-support, general
            low_risk_classifier_threshold=0.90,
            low_risk_cross_encoder_threshold=0.85,
            # HIGH tier: medical, finance, security, legal
            high_risk_classifier_threshold=0.99,
            high_risk_cross_encoder_threshold=0.95,
        ),
        high_precision_skip_threshold=0.995,  # bypass CE when MLP is near-certain
    ),
    classifier=ClassifierConfig.bundled(),
)

# Run inside an async function; call_llm is the coroutine from the drop-in example
result = await cache.get_or_call(prompt="...", llm_function=call_llm)

# Explainable decision trace on every result
if result.decision:
    print(result.decision.reason)              # "mlp_bypass" / "passed_all_thresholds" / ...
    print(result.decision.risk_tier)           # "high" / "low"
    print(result.decision.classifier_score)    # float
    print(result.decision.cross_encoder_score) # float | None

# Latency profiling
print(result.embedding_latency_ms)       # ~20ms
print(result.retrieval_latency_ms)       # ~0.05ms
print(result.mlp_latency_ms)             # ~1ms
print(result.cross_encoder_latency_ms)   # ~3-8ms or 0 (bypassed)

Feedback & Retraining

Every cache hit is recorded. Report bad hits to generate labeled training data:

result = await cache.get_or_call(
    prompt="Approve the customer's refund request",
    llm_function=call_llm,
)

# user_rejected_answer is your application's own signal (for example, a thumbs-down)
if result.was_cache_hit and user_rejected_answer:
    await cache.report_bad_hit(result.query_id, reason="wrong decision")

# Export feedback as training pairs
written = cache.export_feedback_pairs("data/feedback_pairs.jsonl")

Retrain a candidate classifier when feedback accumulates:

uv run semanticmemo retrain \
  --out models/classifier-candidate.pt \
  --validation-data data/validation_pairs.jsonl \
  --domain medical \
  --min-precision 0.95 \
  --promote-to models/classifier-active.pt
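The flag names suggest a promotion gate: a candidate replaces the active classifier only if it clears a minimum precision on validation data. The CLI's actual behavior is not shown in this README, so the sketch below is an assumption about what such a gate might look like; `validation_precision` and `should_promote` are hypothetical helpers, not part of the package.

```python
def validation_precision(labels: list[bool], predictions: list[bool]) -> float:
    """Precision of predicted cache hits against ground-truth equivalence labels."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    # With no predicted hits there is nothing to be precise about; treat as 0.
    return tp / (tp + fp) if (tp + fp) else 0.0


def should_promote(
    labels: list[bool], predictions: list[bool], min_precision: float = 0.95
) -> bool:
    """Promote the candidate only when validation precision clears the bar."""
    return validation_precision(labels, predictions) >= min_precision


# Made-up validation outcomes: 20 correct hits and 1 false hit.
good_labels = [True] * 20 + [False]
good_preds = [True] * 21

# Made-up validation outcomes: 18 correct hits and 2 false hits.
weak_labels = [True] * 18 + [False] * 2
weak_preds = [True] * 20

print(should_promote(good_labels, good_preds))
print(should_promote(weak_labels, weak_preds))
```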

Implicit Feedback (opt-in)

Re-issuing the same prompt shortly after a cache hit is auto-flagged as a bad hit:

from semanticmemo import CacheConfig, ImplicitFeedbackConfig, SemanticMemo

cache = SemanticMemo(
    domain="customer-support",
    config=CacheConfig(
        implicit_feedback=ImplicitFeedbackConfig(window_seconds=30.0),
    ),
)
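One way such a window could work is sketched below. This is not the package's implementation: `RepeatDetector` is a hypothetical class, and matching on the exact prompt string is an assumption. Timestamps can be injected so the behavior is easy to check without sleeping.

```python
import time
from typing import Optional


class RepeatDetector:
    """Flags a prompt re-issued within `window_seconds` of its last cache hit."""

    def __init__(self, window_seconds: float = 30.0) -> None:
        self.window_seconds = window_seconds
        self._last_hit_at: dict[str, float] = {}

    def record_hit(self, prompt: str, now: Optional[float] = None) -> None:
        # Remember when this prompt was last answered from the cache.
        self._last_hit_at[prompt] = time.monotonic() if now is None else now

    def is_implicit_bad_hit(self, prompt: str, now: Optional[float] = None) -> bool:
        # A repeat inside the window hints that the cached answer did not help.
        current = time.monotonic() if now is None else now
        last = self._last_hit_at.get(prompt)
        return last is not None and (current - last) <= self.window_seconds


detector = RepeatDetector(window_seconds=30.0)
detector.record_hit("Where is my order?", now=100.0)

repeat_soon = detector.is_implicit_bad_hit("Where is my order?", now=110.0)
repeat_late = detector.is_implicit_bad_hit("Where is my order?", now=200.0)
never_seen = detector.is_implicit_bad_hit("Cancel my order", now=110.0)
print(repeat_soon, repeat_late, never_seen)
```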

Benchmarks

# Full comparison: Cosine vs MLP vs Double Verification vs SemanticMemo
uv run python benchmarks/run_benchmarks.py

# Threshold sweep (40×30 grid, per-domain FPR constraints)
uv run python benchmarks/sweep_thresholds.py

# Classifier vs cosine on gold set
uv run python benchmarks/classifier_vs_cosine.py

# High-stakes opposite-action evaluation
uv run python benchmarks/false_positive_eval.py

Results are saved to benchmarks/results/.
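For readers unfamiliar with threshold sweeps, the sketch below shows the general idea on made-up data: walk a 40×30 grid of classifier and cross-encoder thresholds and keep the highest-recall configuration whose false positive rate stays within a limit. The scored pairs, grid ranges, and function names are all assumptions; the real script in benchmarks/ may work differently.

```python
from itertools import product
from typing import Optional

# Made-up scored pairs: (mlp_score, ce_score, is_equivalent).
PAIRS = [
    (0.99, 0.97, True),
    (0.96, 0.91, True),
    (0.93, 0.88, True),
    (0.97, 0.60, False),
    (0.91, 0.86, False),
    (0.50, 0.30, False),
]


def evaluate(mlp_t: float, ce_t: float) -> tuple[float, float]:
    """Return (recall, false positive rate) for one threshold pair."""
    tp = fp = 0
    for mlp, ce, label in PAIRS:
        predicted_hit = mlp >= mlp_t and ce >= ce_t
        if predicted_hit and label:
            tp += 1
        elif predicted_hit and not label:
            fp += 1
    positives = sum(1 for _, _, label in PAIRS if label)
    negatives = len(PAIRS) - positives
    return tp / positives, fp / negatives


def sweep(
    max_fpr: float, mlp_steps: int = 40, ce_steps: int = 30
) -> Optional[tuple[float, float, float]]:
    """Grid search; returns (recall, mlp_threshold, ce_threshold) or None."""
    mlp_grid = [0.80 + 0.20 * i / (mlp_steps - 1) for i in range(mlp_steps)]
    ce_grid = [0.70 + 0.30 * j / (ce_steps - 1) for j in range(ce_steps)]
    best = None
    for mlp_t, ce_t in product(mlp_grid, ce_grid):
        recall, fpr = evaluate(mlp_t, ce_t)
        # Keep the first configuration that reaches the best recall under the limit.
        if fpr <= max_fpr and (best is None or recall > best[0]):
            best = (recall, mlp_t, ce_t)
    return best


best = sweep(max_fpr=0.0)
print(best)
```

A per-domain sweep would repeat this with a stricter `max_fpr` for high-stakes domains.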

Project Structure

semanticmemo/
├── src/semanticmemo/
│   ├── cache.py                    # SemanticMemo public API
│   ├── orchestrator.py             # 4-stage decision engine
│   ├── models.py                   # CacheResult, CacheDecision, CacheConfig, RiskPolicy
│   ├── domain_detector.py          # Embedding-based domain routing
│   ├── classifier/
│   │   ├── model.py                # PairClassifier (MLP nn.Module)
│   │   ├── service.py              # ClassifierService wrapper
│   │   ├── cross_encoder_service.py # CrossEncoderService + _MODEL_CACHE
│   │   ├── train.py                # Training loop
│   │   └── evaluate.py             # Evaluation metrics
│   ├── embedding/service.py        # EmbeddingService + FAISS/in-memory index
│   ├── store/sqlite_store.py       # SQLite persistence (WAL mode)
│   ├── feedback/                   # Feedback ledger + retraining trigger
│   ├── _models/equivalence-net-v1.pt   # Bundled pretrained classifier
│   └── cli.py                      # semanticmemo retrain / stats / export-feedback
├── benchmarks/
│   ├── run_benchmarks.py           # 4-method × 4-domain comparison matrix
│   ├── sweep_thresholds.py         # 1,200-config threshold grid search
│   ├── false_positive_eval.py      # High-stakes opposite-action evaluation
│   ├── data/                       # 20-pair datasets per domain + hard_negatives.jsonl
│   └── results/                    # JSON + MD benchmark outputs
├── docs/
│   ├── results.md                  # Full benchmark results (this project's showpiece)
│   ├── architecture.md             # System design
│   ├── decision_engine.md          # 4-stage pipeline deep-dive
│   └── benchmark_methodology.md    # Evaluation framework
└── tests/                          # 91 tests (pytest)

Roadmap

v1.1.0

- Qdrant production backend (currently available via CacheConfig.vector_store_type="qdrant", full hardening coming)
- Automated nightly retraining pipeline triggered by feedback accumulation threshold
- Multi-domain classifier training (single model, domain-conditioned)

v1.2.0

- OpenTelemetry trace export for CacheDecision spans
- Redis cache store backend (distributed caching for multi-instance deployments)
- semanticmemo compare CLI command: before/after precision/recall diff on any JSONL dataset

Reliability

- Resource cleanup: `async with SemanticMemo(...) as cache:` or `cache.close()`
- Retries: `CacheConfig(retry=RetryConfig(...))` for transient LLM failures (off by default)
- WAL mode SQLite: safe for multi-threaded single-process use
- Logging: silent by default under the `semanticmemo` logger namespace; configure to opt in
- Type-safe: full Pydantic v2 models throughout; pyright basic mode passes clean
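For context on the WAL item, this is how write-ahead logging is typically enabled with Python's standard `sqlite3` module. The table name and schema here are hypothetical; the store's real schema is not shown in this README.

```python
import os
import sqlite3
import tempfile


def open_wal_connection(path: str) -> sqlite3.Connection:
    """Open a file-backed SQLite database and switch it to WAL journaling."""
    conn = sqlite3.connect(path)
    # The PRAGMA returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL, journal mode is {mode!r}")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "cache.db")
    conn = open_wal_connection(db_path)
    # Hypothetical cache table, for demonstration only.
    conn.execute("CREATE TABLE entries (prompt TEXT PRIMARY KEY, response TEXT)")
    conn.execute(
        "INSERT INTO entries VALUES (?, ?)",
        ("Where is my order?", "fresh response"),
    )
    conn.commit()
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    row_count = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
    conn.close()

print(journal_mode, row_count)
```

WAL applies to file-backed databases; an in-memory database reports its journal mode as `memory`, which is why the sketch uses a temporary file.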

Release

Published to PyPI as semanticmemo. Internal import name is unchanged: import semanticmemo.

pip install "semanticmemo[ml]"

Tagged releases publish automatically via GitHub Actions trusted publishing.

git tag v1.0.0
git push origin v1.0.0

License

MIT — see LICENSE.

This version

1.2.0

Jun 10, 2026
