
Sulci

AI-native, context-aware semantic caching for LLM apps — stop paying for the same answer twice


Sulci is a drop-in Python library that caches LLM responses by semantic meaning, not exact string match. When a user asks "How do I deploy to AWS?" and someone else later asks "What's the process for deploying on AWS?", Sulci returns the cached answer instead of calling the LLM again — saving cost and latency.


Why Sulci

| Without Sulci | With Sulci |
|---|---|
| Every query hits the LLM API | Semantically similar queries return instantly from cache |
| $0.005 per call, every time | Cache hits cost ~$0.0001 (embedding only) |
| 1–3 second response time | Cache hits return in <10ms |
| No memory across sessions | Context-aware: understands conversation history |

Benchmark results (v0.2.1, 5,000 queries):

  • Overall hit rate: 85.9%
  • Hit latency p50: 0.74ms (vs ~1,840ms for a live LLM call)
  • Cost saved per 10k queries: $21.47
  • Context-aware mode: +20.8pp resolution accuracy over stateless

Install

pip install "sulci[sqlite]"    # SQLite — zero infra, local dev
pip install "sulci[chroma]"    # ChromaDB
pip install "sulci[faiss]"     # FAISS
pip install "sulci[qdrant]"    # Qdrant
pip install "sulci[redis]"     # Redis + RedisVL
pip install "sulci[milvus]"    # Milvus Lite

zsh users: always wrap extras in quotes, e.g. "sulci[sqlite]" rather than sulci[sqlite]; otherwise zsh treats the square brackets as a glob pattern.


Quickstart

Stateless (v0.1 style)

from sulci import Cache

cache = Cache(backend="sqlite", threshold=0.85)

# store a response
cache.set("How do I deploy to AWS?", "Use the AWS CLI with 'aws deploy'...")

# exact or semantic hit — returns 3-tuple
response, similarity, context_depth = cache.get("What's the process for deploying on AWS?")

if response:
    print(f"Cache hit (sim={similarity:.2f}): {response}")
else:
    # call your LLM here
    pass

Context-aware (v0.2 style)

from sulci import Cache

cache = Cache(
    backend        = "sqlite",
    threshold      = 0.85,
    context_window = 4,     # remember last 4 turns
    query_weight   = 0.70,  # α — weight of current query vs context
    context_decay  = 0.50,  # halve weight per older turn
)

# turn 1
cache.set("What is Python?", "Python is a high-level programming language.", session_id="s1")

# turn 2 — context from turn 1 blended into the lookup vector
response, sim, depth = cache.get("Tell me more about it", session_id="s1")

Drop-in with cached_call

import anthropic
from sulci import Cache

cache = Cache(backend="sqlite", threshold=0.85, context_window=4)
client = anthropic.Anthropic()

def call_llm(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return msg.content[0].text

result = cache.cached_call(
    query         = "How do I deploy to AWS?",
    llm_fn        = call_llm,
    session_id    = "user-123",
    cost_per_call = 0.005,
)

print(result["response"])
print(f"Source:  {result['source']}")        # "cache" or "llm"
print(f"Latency: {result['latency_ms']}ms")
print(f"Saved:   ${result['saved_cost']:.4f}")

API Reference

Constructor

cache = Cache(
    backend         = "sqlite",   # sqlite | chroma | faiss | qdrant | redis | milvus
    threshold       = 0.85,       # cosine similarity cutoff (0–1)
    embedding_model = "minilm",   # minilm | openai
    ttl_seconds     = None,       # None = no expiry
    personalized    = False,      # partition cache per user_id
    db_path         = "./sulci",  # on-disk path for sqlite / faiss
    context_window  = 0,          # turns to remember; 0 = stateless
    query_weight    = 0.70,       # α in blending formula
    context_decay   = 0.50,       # per-turn decay weight
    session_ttl     = 3600,       # session expiry in seconds
)
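
A minimal sketch of per-user partitioning, assuming personalized=True scopes entries to the user_id passed to get() and set() (parameter names as listed above):

from sulci import Cache

# partition the cache per user; entries also expire after 24 hours
cache = Cache(backend="sqlite", threshold=0.85, personalized=True, ttl_seconds=86400)

cache.set("What plan am I on?", "You are on the Pro plan.", user_id="alice")

# semantic hit inside alice's partition
response, similarity, depth = cache.get("Which plan do I have?", user_id="alice")

# miss for bob: his partition has no matching entry
other, _, _ = cache.get("Which plan do I have?", user_id="bob")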

Methods

| Method | Returns | Description |
|---|---|---|
| cache.get(query, user_id=None, session_id=None) | (str or None, float, int) | response, similarity, context_depth |
| cache.set(query, response, user_id=None, session_id=None) | None | Store entry, advance context window |
| cache.cached_call(query, llm_fn, session_id=None, user_id=None, cost_per_call=0.005) | dict | response, source, similarity, latency_ms, saved_cost, cache_hit, context_depth |
| cache.get_context(session_id) | ContextWindow | Return the session's context window |
| cache.clear_context(session_id) | None | Reset session history |
| cache.context_summary(session_id=None) | dict | Snapshot of one or all sessions |
| cache.stats() | dict | hits, misses, hit_rate, saved_cost, total_queries, active_sessions |
| cache.clear() | None | Evict all entries, reset stats and sessions |

Important: cache.get() returns a 3-tuple (response, similarity, context_depth) — not a 2-tuple like v0.1. Always unpack all three values.
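
A short sketch of the session and stats helpers, assuming the stats() keys shown in the table above:

from sulci import Cache

cache = Cache(backend="sqlite", threshold=0.85, context_window=4)

cache.set("What is Python?", "Python is a high-level programming language.", session_id="s1")
cache.get("Tell me more about it", session_id="s1")

print(cache.context_summary(session_id="s1"))   # snapshot of one session
print(cache.get_context("s1"))                  # the session's ContextWindow

stats = cache.stats()
print(stats["hit_rate"], stats["saved_cost"], stats["active_sessions"])

cache.clear_context("s1")   # reset this session's history
cache.clear()               # evict all entries, reset stats and sessions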


Context-Aware Blending

When context_window > 0, Sulci blends the current query vector with recent conversation history before performing the similarity lookup:

lookup_vec = α · embed(query) + (1−α) · Σ(decay^i · turn_i)

  • α = query_weight (default 0.70): how much the current query dominates
  • decay = context_decay (default 0.50): halves the weight per older turn
  • Only user query vectors are stored in context (not LLM responses)
  • Raw, un-blended vectors are stored in the cache; blending happens at lookup time only (see the sketch below)
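
A rough NumPy sketch of that blending step, for illustration only; it assumes turn_i are the stored user-query embeddings (most recent first) and normalizes the result before the cosine lookup. The library's actual internals may differ.

import numpy as np

def blend_lookup_vector(query_vec, turn_vecs, query_weight=0.70, context_decay=0.50):
    """Blend the current query embedding with decayed context-turn embeddings."""
    if not turn_vecs:                      # stateless: no context to blend
        return query_vec
    # weighted sum of prior turns: most recent first, older turns decay geometrically
    context = sum(context_decay ** i * v for i, v in enumerate(turn_vecs))
    blended = query_weight * query_vec + (1 - query_weight) * context
    return blended / np.linalg.norm(blended)   # unit-normalize for cosine similarity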

Context-aware benchmark results (800 conversation pairs, context_window=4):

| Domain | Stateless | Context-aware | Δ |
|---|---|---|---|
| customer_support | 32% | 88% | +56pp |
| developer_qa | 80% | 96% | +16pp |
| medical_information | 40% | 60% | +20pp |
| overall | 64.0% | 81.6% | +17.6pp |

Backends

| Backend | ID | Hit latency | Best for |
|---|---|---|---|
| SQLite | sqlite | <8ms | Local dev, edge, serverless, zero infra |
| ChromaDB | chroma | <10ms | Fastest path to working, Python-native |
| FAISS | faiss | <3ms | GPU acceleration, massive scale |
| Qdrant | qdrant | <5ms | Production, metadata filtering |
| Redis + RedisVL | redis | <1ms | Existing Redis infra, lowest latency |
| Milvus Lite | milvus | <7ms | Dev-to-prod without code changes |

All backends are free tier or self-hostable at zero cost.
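
Switching backends only changes the constructor call. For example, a minimal sketch using the FAISS backend (assuming the extra from the Install section is installed):

from sulci import Cache

# same API as sqlite; db_path applies to the on-disk sqlite and faiss backends
cache = Cache(backend="faiss", db_path="./sulci", threshold=0.85)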


Embedding Models

| ID | Model | Dims | Latency | Notes |
|---|---|---|---|---|
| minilm | all-MiniLM-L6-v2 | 384 | 14ms | Default; free, local, excellent quality |
| openai | text-embedding-3-small | 1536 | ~100ms | Requires OPENAI_API_KEY |

The default minilm model runs entirely locally via sentence-transformers. No network calls are made unless you explicitly configure embedding_model="openai".
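
To switch to the OpenAI model, pass embedding_model="openai"; a minimal sketch, assuming the key is read from the OPENAI_API_KEY environment variable:

from sulci import Cache

# requires OPENAI_API_KEY to be set in the environment
cache = Cache(backend="sqlite", threshold=0.85, embedding_model="openai")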


Project Structure

.
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE
├── LOCAL_SETUP.md
├── README.md
├── benchmark
│   ├── README.md               ← benchmark methodology and results
│   └── run.py                  ← benchmark CLI
├── examples
│   ├── anthropic_example.py    ← requires ANTHROPIC_API_KEY
│   ├── basic_usage.py          ← stateless cache demo, no API key needed
│   ├── context_aware.py        ← 4-demo walkthrough, fully offline
│   ├── context_aware_example.py ← additional context-aware patterns
├── pyproject.toml              ← name="sulci", version="0.2.3"
├── setup.py
├── sulci
│   ├── __init__.py             ← exports Cache, ContextWindow, SessionStore
│   ├── backends
│   │   ├── __init__.py
│   │   ├── chroma.py
│   │   ├── faiss.py
│   │   ├── milvus.py
│   │   ├── qdrant.py
│   │   ├── redis.py
│   │   └── sqlite.py
│   ├── context.py              ← ContextWindow + SessionStore
│   ├── core.py                 ← Cache engine (context-aware)
│   └── embeddings
│       ├── __init__.py
│       ├── minilm.py           ← default: all-MiniLM-L6-v2 (free, local)
│       └── openai.py           ← requires OPENAI_API_KEY
└── tests
    ├── test_backends.py        —  9 tests: per-backend contract + persistence
    ├── test_context.py         — 35 tests: ContextWindow, SessionStore, integration
    └── test_core.py            — 27 tests: cache.get/set, TTL, stats, personalization

7 directories, 29 files

Running Tests

# full suite — 71 tests total
python -m pytest tests/ -v

# by file
python -m pytest tests/test_core.py -v       # 27 tests
python -m pytest tests/test_context.py -v    # 35 tests
python -m pytest tests/test_backends.py -v   #  9 tests (skipped if dep missing)

# single backend only
python -m pytest tests/test_backends.py -v -k sqlite
python -m pytest tests/test_backends.py -v -k chroma

# with coverage
python -m pytest tests/ -v --cov=sulci --cov-report=term-missing

Backend tests are skipped, not failed, when their dependency isn't installed. Install the backend extra to run its tests: pip install -e ".[chroma]".

See LOCAL_SETUP.md for the full local development guide including venv setup, backend installation, smoke testing, and troubleshooting.


Benchmark

# fast run (~30 seconds)
python benchmark/run.py --no-sweep --queries 1000

# with context-aware pass
python benchmark/run.py --no-sweep --queries 1000 --context

# full benchmark
python benchmark/run.py --context

See benchmark/README.md for full methodology and results.


Contributing

See CONTRIBUTING.md for branching model, PR process, and coding standards.


License

MIT — see LICENSE.

