Bring-your-own-LLM context curation: memory with provenance, implied-intent inference, multilingual deep research — any model, even small local ones, acts smarter.
Project description
Sherlock
Sherlock is a context layer for any LLM. Wrap one chat function — keep your model, keep your stack — and get persistent memory, compacted context, implied-intent inference, multilingual deep research, and a live inspector that shows exactly what your model saw and why.
agent = Sherlock.with_callable(main_chat=my_llm, system_prompt="...")
agent.chat("hi") # that's the whole integration
Sherlock never picks models for you and never trims results to save money. You own the model choice; Sherlock owns the context. Token savings come exclusively from eliminating waste (re-sent material, duplicated context, lost calls) — never from capping what you get back.
When to use Sherlock
Use Sherlock if you are building an assistant that needs to:
- remember users across long conversations — and keep the per-turn prompt bounded while it does (compaction frontier: old turns leave the prompt, never the database);
- make small / local models behave smarter (Ollama, LM Studio, vLLM, llama.cpp) without switching providers — plain-text tag protocol, honest 8K/16K/32K context budgets, JSON-repair retries;
- debug exactly what context the model saw on any turn — live inspector + one-click session export, plus a built-in A/B mode that runs the same prompt against the same model without Sherlock;
- do real research: planned multilingual search, source triangulation, citation verification, approval-gated deep research.
Where Sherlock fits
| You need | Best fit |
|---|---|
| A full stateful agent runtime / platform | Letta |
| A drop-in managed memory API | Mem0 / Supermemory |
| Enterprise temporal knowledge-graph memory | Zep / Graphiti |
| LangGraph-native memory | LangMem |
| BYO-LLM context assembly + a live context inspector, no framework lock-in | Sherlock |
Sherlock deliberately does not compete on hosted-memory-API convenience or agent-runtime breadth — it wins when you want to keep your own callable and see (and debug) every token of context your model receives.
How it works
Three LLM roles, all wired by you (they can be one function or three):
┌────────────────────────────────────────────────┐
user ──────► LLM-1 · main chat │
│ answers using the assembled context slot: │
│ system msg: your prompt + protocol + pinned │
│ facts · persona · highlights ── cached ──┐ │
│ + prior conversation (verbatim turns) ──cached┘ │
│ + final user msg: THIS-turn hypotheses · │
│ fresh search · the user's question (volatile)│
└───────┬────────────────────────────┬───────────┘
│ background │ background
┌───────▼──────────┐ ┌───────▼──────────┐
│ LLM-2 · compactor │ │ LLM-3 · inferrer │
│ summaries, facts, │ │ ≥3 hypotheses on │
│ persona profile │ │ the real ask; │
│ (pin/active/drop) │ │ search planning │
└───────┬──────────┘ └───────┬──────────┘
└────────► memory ◄──────────┘
SQLite + vector store · decay
(fresh → warm → cold → forgotten)
- Memory with provenance — user-stated facts are never confused with
system inferences; each prompt block carries the source and the turn it
was learned (
(user t12)), so newer facts win on conflict. - Slot budgets — context is assembled against real token ceilings per tier; the raw-turn tail takes whole turns only (no mid-thought truncation).
- Tag protocol — your LLM drives everything with plain-text tags
(
<<sherlock-companions: …>>,<<sherlock-tool: …>>); no native function-calling required, which is exactly what small models handle best. Native tool-calling adapters exist too. - Deep research — an approval-gated, multilingual, multi-round web research loop with a token-frugal shared-state protocol (details below).
Install
pip install sherlock-context # base — incl. free DuckDuckGo search + page fetch
pip install "sherlock-context[embeddings]" # + real local semantic memory (recommended)
pip install "sherlock-context[embeddings,search]" # + Tavily provider (Brave/Valyu need only a key)
pip install "sherlock-context[playground]" # + the Live Inspector web app
Distribution name is
sherlock-context(PyPIsherlockis an unrelated locks library); the import staysimport sherlock.
Latest from source (no PyPI needed):
pip install "git+https://github.com/MinwooKim1990/sherlock.git"
pip install "sherlock-context[embeddings,playground] @ git+https://github.com/MinwooKim1990/sherlock.git"
Or develop from a checkout:
git clone https://github.com/MinwooKim1990/sherlock.git && cd sherlock
python3.12 -m venv .venv && source .venv/bin/activate
pip install -e ".[embeddings,playground]"
Targets Python 3.12 (3.11 / 3.13 also work). litellm is imported
lazily so import sherlock stays fast. The embedding default is
"auto": real local embeddings (fastembed, multilingual, no API key)
when the [embeddings] extra is installed, graceful fallback to a
deterministic hash embedder (with a warning) otherwise. DuckDuckGo
search + page fetch work from the base install.
Quick start (30 seconds)
from sherlock import Sherlock
def my_llm(messages):
"""Receive list of {"role": ..., "content": ...}; return text."""
import anthropic
client = anthropic.Anthropic()
sys = "\n".join(m["content"] for m in messages if m["role"] == "system")
chat = [m for m in messages if m["role"] != "system"]
r = client.messages.create(
model="claude-haiku-4-5", max_tokens=2048, system=sys, messages=chat,
)
return r.content[0].text
agent = Sherlock.with_callable(
main_chat=my_llm,
system_prompt="You are a candid, casual assistant.",
)
print(agent.chat("hi"))
print(agent.chat("what did i just say?")) # Sherlock will have the history
That's it. Sherlock handles the per-turn message store (SQLite), background compaction (LLM-2), Sherlock-style inference (LLM-3 — ≥3 hypotheses about the user's underlying ask whenever surface meaning ≠ actual ask), provenance tracking, and memory decay.
Use different models per role
def chat_via_main(messages): ... # e.g. a strong model for user-facing replies
def chat_via_companion(messages): ... # e.g. a small/cheap model for compaction + inference
agent = Sherlock.with_callable(
main_chat=chat_via_main,
summary_chat=chat_via_companion,
inference_chat=chat_via_companion,
system_prompt="You are a helpful assistant.",
)
OpenAI
from openai import OpenAI
client = OpenAI()
def my_llm(messages):
r = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
return r.choices[0].message.content
agent = Sherlock.with_callable(main_chat=my_llm, system_prompt="You are concise.")
Ollama / any local model
import requests
def my_llm(messages):
r = requests.post(
"http://localhost:11434/api/chat",
json={"model": "llama3", "messages": messages, "stream": False},
timeout=120,
).json()
return r["message"]["content"]
agent = Sherlock.with_callable(main_chat=my_llm, system_prompt="…")
Async
import anthropic
aio = anthropic.AsyncAnthropic()
async def my_llm(messages):
sys = "\n".join(m["content"] for m in messages if m["role"] == "system")
chat = [m for m in messages if m["role"] != "system"]
r = await aio.messages.create(
model="claude-haiku-4-5", max_tokens=2048, system=sys, messages=chat,
)
return r.content[0].text
agent = Sherlock.with_callable(main_chat=my_llm, system_prompt="…")
agent.chat("hi") # sync entry point runs the async fn under the hood
# await agent.achat("…") # native async entry point
LLM-1 is awaited synchronously (it gates the reply); LLM-2/LLM-3 + decay run after the reply in the background.
🔍 The playground — Sherlock Live Inspector
A single-page web app for watching Sherlock think in real time — the fastest way to verify the system end to end with real models.
.venv/bin/python -m uvicorn playground.server:app --reload
# → open http://localhost:8000
Bring any provider — and mix them per role. Connect one or more:
| Provider | Credential | Notes |
|---|---|---|
| Gemini | AI Studio key (AIza…) |
live model list from your key |
| OpenAI | API key (sk-…) |
chat-capable models, newest first |
| Anthropic | API key (sk-ant-…) |
official models list |
| DeepInfra / Together / OpenRouter | API key | open-source-model hosts (Llama, Qwen, DeepSeek, Mixtral…); paste a key and the live model list loads |
| Local | base URL (e.g. http://localhost:11434) |
any OpenAI-compatible server: Ollama, LM Studio, vLLM, llama.cpp |
Then pick a model for each role — e.g. a Together-hosted Llama-3.3-70B for LLM-1 with a small Qwen for the companions, a local Qwen everywhere, or Gemini Flash with GPT-4o-mini. Selections can be changed mid-session from the top bar (takes effect next turn). API keys stay in the server-side session and are never echoed back to the browser.
The three aggregators are OpenAI-compatible and route through litellm's native prefixes (
deepinfra/,together_ai/,openrouter/), so they work as a package too —ModelConfig(provider="together", model="…")or the YAMLmodels:block. Any other OpenAI-compatible host works via the Local tile (just give its base URL).
Chat experience. Replies stream token-by-token; reasoning models surface
their thinking in a collapsible 💭 panel; the Send button becomes a Stop
mid-generation. The companion mode (off / cold_start / turbo) is
switchable live from the top bar, there's a dark-mode toggle, and the UI is
available in 7 languages (English · 한국어 · 中文 · 日本語 · Français · Deutsch ·
Español) — the language setting only affects the chrome, never the model's
replies.
What the inspector shows (one tab per concern):
- ⚡ Flow — every event in order: turn start, retrieval, slot assembly, LLM calls with latency/tokens, tool runs, background work.
- 🧱 Slot — the exact assembled context LLM-1 received this turn: TIER-highlighted system prompt, per-block token budget, K-turn tail.
- 💬 LLM I/O — the verbatim prompts + responses for all three roles, including every internal call of a multi-call turn.
- 🧠 Inference / 🗜 Compaction — LLM-3 hypotheses with confidence bars; LLM-2 summary, facts table (with pin recommendations), persona.
- 🗃 Memory — the live memory table with decay-state chips, provenance, confidence, and use counts.
- 🔬 Research — deep-research progress: the multilingual search plan, per-round cards (queries, new sources/fragments, LLM-3-generated questions, facts so far), live 🪙 token usage by stage, and the final cited synthesis + session documents.
- The top bar shows cumulative session tokens per role (
L1 · L2 · L3 · Σ).
The "Always run reasoning" toggle force-fires LLM-2 + LLM-3 every turn so the panels always fill, even with models that under-emit the companion tag. Web search defaults to free DuckDuckGo; Brave/Tavily/ Valyu keys can be entered for better results.
Security & privacy
Sherlock handles transcripts and long-term memory, so the guardrails are explicit:
- Tags execute only from LLM output. Tool/companion tags pasted by a user are never parsed — a user cannot trigger searches, fetches, or memory reads by typing a tag.
- Deep research never auto-runs. It requires an explicit user "yes",
a UI approve click, or your own
deep_research_approvercallback. - Secret/PII redaction (
redact_secrets=True) scrubs content before it enters long-term memory and RAG; the raw transcript is never altered. - Local-first by default: SQLite + on-device embeddings (fastembed) — nothing leaves your machine except the LLM/search calls you wired. In the playground, provider API keys live in the server-side session and are never echoed to the browser or the event stream.
- Real deletion:
delete_session()cascade-deletes messages, memory entries, and vectors. Memory corrections are non-destructive and auditable (superseded facts keep their history, marked with the turn they were invalidated). - Untrusted web content is fenced in the prompt ("data, not instructions") and every cited URL is verified against the actually-gathered sources.
- Not yet built (on the roadmap): encryption at rest, audit logs.
What your LLM can do via tags
At the end of any reply, your LLM may emit (each on its own line):
<<sherlock-companions: compact, infer>>
<<sherlock-tool: search "Seoul weather today">>
<<sherlock-tool: search "nvidia earnings" k=8>> # set result count (1–10)
<<sherlock-tool: fetch https://example.com/article>>
<<sherlock-tool: fetch raw https://example.com/article>> # raw HTML
<<sherlock-tool: memory lookup "Yujin 알레르기">> # semantic + entity recall
<<sherlock-tool: memory entity "Yujin">> # deterministic entity match
<<sherlock-tool: memory timeline last 10>> # raw recent turns
<<sherlock-tool: memory pinned>> # all pinned facts
<<sherlock-tool: deep_research "compare EU vs US AI regulation">> # approval-gated
Sherlock parses tags only from LLM output, never from user input (a
user pasting a tag cannot trigger anything), runs the tool, feeds
results back as a synthetic message, and re-calls your LLM — up to
execution.max_tool_rounds (default 3) per turn. Tags are always
stripped from the user-visible reply.
Companion gating — when LLM-2/LLM-3 actually run (v1.6)
By default Sherlock runs in cold_start mode: a signal-driven gate
keeps each turn single-model (LLM-1 only) until the conversation genuinely
needs a companion — then it escalates LLM-2/LLM-3 and de-escalates on its
own as things settle, with no fixed turn counter. On strong models this
spends far fewer tokens on calm turns while still firing the instant a
real signal (topic shift, contradiction, implied intent, fill pressure)
appears. Pick the mode at construction:
Sherlock.with_callable(..., companions_mode="cold_start") # default
# "off" — legacy v1.4 behavior, byte-identical (uses the safety net below)
# "turbo" — every companion, every turn (maximum signal, maximum cost)
Migration from ≤ v1.4: the default changed from always-on companions to
cold_start. Passcompanions_mode="off"(or setSHERLOCK_COMPANIONS=off) to restore the exact v1.4 behavior.
In off mode a safety net keeps the companions alive when your LLM
under-emits the tag: compact auto-fires every N turns
(memory.summarize_every_n_turns) and infer auto-fires on topic
shifts (memory.auto_infer: "smart" default | "off" | "always").
Under cold_start/turbo the gate owns that decision, so auto_infer
is inert.
The playground exposes the same three modes as a dropdown (default
turbo, so the Inference / Compaction panels visibly fill every turn) —
switch to cold_start to watch the gate stay single-model until a signal
escalates it, or off for the legacy behavior.
Web search engines
agent = Sherlock.with_callable(
main_chat=my_llm,
system_prompt="...",
main_search_engine="duckduckgo", # default; free, no key
inference_search_engine="brave", # LLM-3 freshness searches
inference_search_api_key_env="BRAVE_API_KEY",
)
Engines: duckduckgo (no key; non-commercial terms, weak for news),
tavily (needs pip install "sherlock-context[search]"), brave, valyu,
stub (tests). Pass None to disable search for a role. Native
tool-calling adapters (make_openai_tools(), make_anthropic_tools(),
make_openai_memory_tool(), dispatch_tool_call, dispatch_memory)
exist for integrations that prefer real function calls.
LLM-3's prompt enforces cross-verification discipline: ≥2 sources per claim, lowered confidence on disagreement, single-source web facts are never pinned.
🔬 Deep research
When a question needs real depth, LLM-1 proposes
<<sherlock-tool: deep_research "topic">>. It is never auto-run:
- Sherlock asks the user (playground button, or just reply "yes"/"해줘"
in a library/CLI session) — or a programmatic
deep_research_approver(topic, plan)callback decides (True/False/None=ask). Explicit refusals ("no, don't…", "하지마") always cancel, even if they contain trigger-ish words. v1.0: the approval ask carries a drafted research strategy (objective + sub-topics) and up to 2 clarifying questions for genuinely ambiguous points — answer them alongside your "yes" (or reply with just the answer; Sherlock folds it in and re-asks once). The strategy then guides the run as a guideline, never a cage. - Once approved, the loop runs (in the background when an event sink or
background=Trueis active):- Multilingual keyword plan — LLM-3 picks the languages whose web most likely holds the answer (a Japan-travel question sweeps Japanese + Korean + English) and emits short, particle-stripped keyword queries. The query language is the i18n lever — global search, no locale parameters. Your answer still comes back in your language.
- Wide round 1, narrow after — round 1 is a broad snippet-only sweep; later rounds deepen promising threads. Pages are fetched sparingly, only when the round is thin, and never twice. Fragments that don't fit a round wait in a backlog — no result is ever dropped.
- Compact shared state (the token saver) — each round LLM-1 reads only the NEW fragments + a compact digest of confirmed facts and open gaps, and answers in terse JSON. LLM-3 (from round 3) reads ONLY the digest — never raw pages — to generate the next round's meta-questions. The running loop never re-pays for old fragments.
- Collect raw → reconstruct (v1.4, the recovery layer) — every
round's raw fragments (snippets + the relevant fetched excerpts) are
KEPT per sub-topic, not discarded. The final synthesis re-reads each
section's raw bucket (deduped by URL, char-capped) alongside the
extracted facts — so a concrete detail a round's terse extraction
missed (an event name, a date, a venue) is recovered at the end
instead of lost forever. Facts stay the verified spine; raw is the
recovery layer; a requested sub-topic is never silently dropped (it
gets an honest "not confirmed" note rather than vanishing). Measured:
on a 5-city Japan-events query with a real engine (Brave) + a small
worker (gemini-flash-lite), this turned a city that previously came
back "no events" into three real, cited events — while still flagging
dates not yet officially announced. Behind
search.deep_research_reconstruct_from_raw(default on); a per-sub-topic "what's worth knowing" checklist + coverage-gated stopping push the small model to keep drilling until every requested part is covered. - Triangulation — the same fact found via different
domains/source types (community / news / official / blog)
accumulates corroboration;
[corroborated ×N]facts rank first and are stated with higher confidence in the synthesis. - Honest stopping — stops on
model_sufficient,converged_no_new_sources(2 rounds with nothing new),no_next_queries,search_engine_error(2 consecutive all-failed rounds — reported as a failure, never disguised as "convergence"), or the round cap (≤20).
- Every round is saved as a
DEEP_RESEARCHsession document (read on demand — research never floods the context window or the pinned-fact cap), and a single comprehensive cited synthesis closes the run. Messages sent mid-research are queued, acknowledged, and folded in at the next checkpoint. - A
deep_research.tokensevent reports per-stage input/output tokens (plan / round answers / meta-questions / synthesis) — visible live in the playground — so the cost is measured, not guessed. Background failures surface as a real reply (deep_research.failed), never silence.
agent = Sherlock.with_callable(
main_chat=my_llm,
system_prompt="...",
main_search_engine="brave",
main_search_api_key_env="BRAVE_API_KEY",
deep_research_approver=lambda topic, plan: None, # None = ask the user
)
Memory
Slot budget & dynamic K-turn
The context slot is laid out across four TIERs with explicit token budgets; whatever remains goes to the raw-turn tail, which accumulates whole turns walking backward (a turn either fits whole or stays out):
SYSTEM MESSAGE (fully stable → cached)
[TIER 1 — GROUND TRUTH] sherlock_system + tool_prompt + user_system
[TIER 2 — SYSTEM-TRACKED] pinned + persona_summary + compacted highlights
[TIER 4 — trailer] marks where the conversation begins
HISTORY MESSAGES (append-only, stable → cached) ← last N raw turns
FINAL USER MESSAGE (volatile → uncached)
═ SYSTEM ANALYSIS FOR THIS TURN ═ this-turn inference + fresh search + fill%
═ THE USER'S ACTUAL MESSAGE ═ the user's question (always last)
v1.4 — cache-optimal ordering. The volatile this-turn block (inference + search) used to sit inside the system message, which broke prompt caching for all the conversation that followed. It now rides the final user message, so the system message + the whole conversation history form one cacheable prefix and only the last message (analysis + the new question) pays full price. Region headers keep a small model from confusing protocol, prior conversation, this-turn system analysis, and the user's actual words.
Profiles auto-select by model context window
(MemoryConfig.slot_budget_profile: auto/default/small/off,
plus slot_budget_overrides). Inspect what was used:
state = agent.inspect_last_turn()
print(state.slot_budget)
print(state.k_turn_turns_used, state.k_turn_tokens_used)
print("hypotheses:", state.hypotheses)
How history is stored (saving the context window)
Raw turns are never the only copy of what was said, and they don't grow the prompt forever:
- Compaction (LLM-2). In the background — when the assembled prompt reaches
memory.compact_at_fill_ratioof the model window (default 0.80), on a topic change, or when LLM-1 asks — LLM-2 distills recent turns into durable memory: pinned facts with provenance ((user t12), newer wins on conflict), a rolling persona summary, and append-only highlights. Facts must be grounded in a transcript quote; ungrounded ones are confidence-capped and can never be pinned. Acorrectionsoperator lets a later turn supersede an earlier pinned fact non-destructively. - Frontier eviction (the "infinite memory" mechanism). Once turns are
summarized, their raw copies are evicted from the TIER-4 tail (the last
few turns always stay raw), but the rows remain in SQLite and are reachable
on demand via the memory tool (
memory timeline/lookup). So the per-turn prompt size plateaus as a conversation grows instead of climbing linearly with turn count — curated TIER-2 memory carries forward what matters, not the whole transcript. - Absolute budgets. Each tier has a hard token ceiling; the raw tail takes
the leftover but is itself capped (
k_turn_max_fraction, default 0.5 of the window) so a large context window can't let raw history crowd out compaction.
That is the mechanism behind "save the context window." Honest caveat: the crossover — where Sherlock's curated prompt becomes cheaper per turn than re-sending the full transcript — shows up on long, multi-topic sessions; on short exchanges the curation overhead means Sherlock spends more, not less (see Cost vs. benefit below).
Prompt layering
Your system_prompt stays the primary system message; Sherlock's
protocol (tag conventions, cross-verify rules) rides alongside it.
Adjust with extension_position="before", replace it via
sherlock_extension="…", or opt out with sherlock_extension="".
Sessions
for s in agent.list_sessions():
print(s.id, s.created_at, s.turn_count, "—", s.persona_summary)
agent.new_session() # start fresh, keep history
agent.switch_session("abc-123") # resume an earlier session
agent.delete_session("xyz-789") # cascade-delete raw + memory + vectors
Inspecting state & storage
for m in agent.messages():
print(m.role, m.content[:80])
for m in agent.memory.list():
print(m.type, m.source, m.state, "—", m.content[:80])
with_callable() defaults to an ephemeral temp directory; pass
storage_dir="~/.local/share/my_app/sherlock" to keep state across
runs. Secrets/PII can be redacted before anything enters long-term
memory (redact_secrets=True); the raw transcript is never altered.
YAML + CLI
Configure everything (providers, embeddings, decay, search, budgets) in one file:
agent = Sherlock.from_yaml("sherlock.yaml")
See sherlock.example.yaml for the full schema. The package installs a
sherlock command:
sherlock chat --config sherlock.yaml
sherlock config validate | show
sherlock models
sherlock evaluate --config sherlock.yaml --conversation evaluation/dummy_conversation.md
Validation & benchmarks
-
End-to-end against an 80-turn synthetic benchmark (
evaluation/dummy_conversation.md+ gold standard): 82/100 with Claude Opus workers, passing the 80% gate. -
Behavior probes — 31 single-capability probes (
evaluation/probes/):python -m evaluation.probe_eval --probes evaluation/probes/ \ --config sherlock.live.yaml --report probe.json
-
The full pytest suite (500+ tests) runs hermetically — scripted callables + fake engines, no network or keys needed:
pytest -q. -
Measure it yourself: the playground's A/B mode runs every prompt against the same model with and without Sherlock (the baseline gets the same search engine and today's date — a fair control), side by side with latency and token counts. We'd rather you compare than take our word.
-
Public memory benchmarks (LongMemEval/LoCoMo-style) are on the roadmap (
docs/ROADMAP.md, R28) — we don't publish numbers we haven't run.
Cost vs. benefit — what we actually measured
Sherlock is not free, and we won't pretend otherwise. From our own A/B runs (same model on both sides; worker = gemini-2.5-flash-lite, a deliberately small model):
- Short chats (2–7 turns), full history still fits — Sherlock and a bare model both pass the rubric (a tie on quality) and Sherlock spends more tokens (curation overhead). Here a bare model is simply cheaper; use Sherlock for the behaviors (provenance, honesty, inference), not for savings.
- Where Sherlock earns its tokens:
- Long, multi-topic sessions — compaction + frontier eviction hold the per-turn prompt roughly flat while a "re-send everything" prompt grows without bound, and curated recall survives past the raw-tail window. The exact token-crossover point is traffic-dependent — measure it with the A/B mode; we don't quote a number we haven't run on a public benchmark.
- Multi-part deep research — with a real engine (Brave), the collect-raw→reconstruct loop beat a strong one-shot RAG baseline on the same search: it surfaced real, cited events for cities the baseline returned "no info" on, while both stayed honest about dates not yet announced. It costs roughly an order of magnitude more tokens/latency than one-shot — a user-invoked depth feature, priced accordingly.
- Honesty under junk search — with a broken free engine (DuckDuckGo
returning irrelevant pages), Sherlock degrades to an honest, source-labeled
answer (
verifiedvsgeneral knowledge — not verified) instead of fabricating; a bare small model tends to assert stale or invented specifics.
- A failure we fixed: a small worker used to defer on rich context (ask for details it didn't need) because the inference layer over-read plain requests as hidden asks. Fixed with a null-hypothesis brake + an answer-first consumption rule — the model now answers from the context it already has (3/3 vs 1/3 on the smoke that exposed it).
Bottom line: on short, well-fitting conversations a bare model is cheaper and just as good; Sherlock pays off on length, multi-part research, and honesty. The framing is "spend tokens to be right and complete", not "spend fewer tokens" — except on long sessions, where curation also wins on raw cost.
Limits
- Works best on conversations with a coherent persona and multiple topic threads; random short exchanges give the companions little to do.
- The provenance ledger distinguishes user-stated vs system-inferred facts but does not verify external claims.
- Memory decay is time/turn-based; semantic-cluster decay is specced but not wired.
- The Evolution Engine versions companion prompts but doesn't yet learn from user feedback automatically.
The prioritized upgrade plan — small-model intelligence, provider prompt caching, deep-research trust, memory reconciliation — lives in docs/ROADMAP.md (R1–R35, evidence-linked).
Changelog highlights
v1.4 — cache-optimal slot, fill-based compaction, companion cascade
- Cache-optimal reordering: the volatile this-turn block (inference + search) moved out of the system message to the final user message, so the system message + the whole conversation history are one cacheable prefix — on a caching provider, a long conversation re-pays only for the newest message. Explicit region headers keep a small model from confusing protocol / prior conversation / this-turn analysis / the user's actual words.
- Fill-based compaction: LLM-2 auto-compacts when the prompt reaches
memory.compact_at_fill_ratioof the window (default 0.80) instead of a fixed turn cadence — below it the conversation grows append-only and caching keeps the cost down; the live context-fill % is surfaced to LLM-1. - Companion cascade & ordering: LLM-2 runs before LLM-3 (so inference reasons
over freshly-compacted memory), and when LLM-2 surfaces
worth_diggingthreads it triggers LLM-3 itself — frequent light inference (LLM-1-driven) + occasional deeper inference (LLM-2-driven). Deep-research LLM-3 persona is now cached.
v1.4 — deep research that doesn't forget; small models answer-first
- Collect raw → reconstruct: each round's raw fragments are kept per sub-topic and re-read at synthesis (facts = verified spine, raw = recovery layer), so a concrete detail a round under-extracted is recovered instead of lost. Requested sub-topics are never silently dropped; a per-sub-topic "what's worth knowing" checklist and coverage-gated stopping push a small model to keep drilling until every part is covered. All behind config kill-switches (off = exact prior behavior). Live-verified on gemini-flash-lite + Brave: a city that returned "no events" now surfaces real cited events, beating a one-shot RAG baseline.
- Answer-first inference: the inference layer no longer makes a small model defer on rich context — a null-hypothesis brake stops it reading plain requests as hidden asks, and the consumption rule leads with the answer (then addresses the implied chain), never replacing the answer with a clarifying question.
- Method, not coercion: research/strategy prompts that said "you MUST …" are rewritten as guidance — hard mandates stall small models.
- 367 hermetic tests; new deterministic proofs for raw-recovery, coverage gating, and the deferral fix.
v1.2–1.3 — live-feedback hardening
- TODAY date injected into every research prompt (fixed "this December"
resolving to last year); implied-chain inference (
really_asking+ prepared next answers) carried into LLM-1's next-turn slot; citation pairing verification; A/B mode + per-turn markdown export in the playground; a fair baseline (same search engine + today's date); semantic-novelty convergence so reworded conclusions don't burn research rounds.
v1.1 — the whole roadmap, shipped
- Small-model reliability: one-shot examples anchor LLM-2/LLM-3 JSON
output; constrained JSON decoding is requested automatically where the
provider supports it (memoized fallback elsewhere); near-miss tool tags
(
sherlock_tool, single brackets) are repaired instead of leaking. - Deep research trust: every cited URL is checked against the gathered
sources — invented citations get an inline
(unverified)flag; search plans and meta-questions actively seek counter-evidence; big runs (>18 facts + a strategy outline) synthesize section by section, each call reading only its own facts; evidence is trimmed at sentence boundaries. - Memory quality: retrieval adds recency/importance boosts and 1-hop
expansion over a new memory-links table (A-Mem style); per-type result
caps; superseded facts carry
invalid_at_turn("superseded at t7") for temporal questions; LLM-2 facts can carry a supporting quote that is verified against the transcript — ungrounded facts get confidence-capped and can never be pinned. - Token efficiency: the provenance ledger is skipped entirely when no
persona facts exist; RAG never re-surfaces pinned facts (TIER-2 already
carries them); carried-forward search results are relevance-gated; the
system prompt now marks TWO cache zones (protocol / TIER-2) so pinned-fact
churn no longer invalidates the protocol cache; optional LLMLingua-2
compression packs ~2.5× more relevant page text into the same research
budget (
pip install "sherlock-context[compress]").
v1.0 — research strategy, fragment reassembly, infinite memory, cache-native prompts
- Research strategy step: before a deep research run, LLM-1 drafts a short strategy (objective, sub-topics, scope) and asks up to 2 clarifying questions alongside the approval — answers fold into the run; the strategy is a guideline, not a cage. Sub-topics seed the open-gap tracking, so coverage is measured by the existing convergence machinery.
- Fragment reassembly: fetched pages are excerpted by query relevance
(comment-buried fragments surface instead of page heads); rephrased facts
merge their sources (corroboration accumulates across phrasings and
languages); near-miss contradictions are tagged
[disputed]and reported two-sided; rounds that add no NEW facts converge (converged_no_new_facts); shown fragments lead with source-type diversity (RRF + round-robin). - LLM-2 memory reconsolidation: the compactor can now emit
correctionsthat supersede stale pinned facts non-destructively (old rows stay queryable, marked(superseded), excluded from prompts and dedup); itsretrieval_keywordsnow expand the RAG query; Hangul bigram BM25 makes Korean agglutinated forms retrievable. - Infinite memory (compaction frontier): raw turns already covered by an LLM-2 summary leave the K-turn tail (the last 4 turns always stay verbatim; everything remains in SQLite + memory tools). Measured: −684 tokens/turn at turn 14 of a session compacted at turn 8 — savings grow with session length.
- Honest small windows:
with_callable(context_window=8192, max_output_tokens=…, slot_budget_profile=…)+ new 8K/16K/32K budget profiles; the most recent turns bypass the budget so history is never zero; one-time warning when no window is declared. - Cache-native prompts: the system message marks its byte-stable TIER 1+2
prefix; LiteLLM converts it to
cache_controlblocks (Anthropic) and reportscache_read/creation_tokens; LLM-2/LLM-3 prompts are whole-message hinted; BYO callables can opt in via acache_hintskwarg — a plainf(messages)payload stays byte-identical. The playground token bar shows⚡cached. - Waste removal: dead LLM-3 output fields removed; provenance ledger / message wrappers / tool-result banner compressed (109→52 tokens, guardrails intact); protocol docs are now conditional (no search engine → 1,308→745 tokens/turn); a failed JSON parse retries ONCE with the error fed back instead of wasting the whole companion call.
v0.9 — hardening + universal playground
- Playground is multi-provider: Gemini, OpenAI, Anthropic, the open-source-model hosts (DeepInfra · Together · OpenRouter), and any local OpenAI-compatible server, mixable per role; cumulative token bar; WS auto-reconnect; per-turn LLM call history; IME-safe input.
- Deep research correctness from an adversarial multi-agent audit (30 confirmed findings fixed): malformed small-model JSON can no longer abort a run; refusals never approve; round-1 overflow goes to a backlog instead of being dropped; engine outages stop the loop honestly; background failures surface as replies; mid-research messages are always acknowledged, persisted, and folded or accounted for; research docs no longer evict pinned facts.
- Memory integrity: restated facts resurrect from FORGOTTEN; corrections
past the dedup prefix now update the stored fact; LLM-2 output can
no longer launder into "user-verified" pins; prompt blocks carry the
turn each fact was learned (
t12) so newer wins on conflict. - Korean search queries: only multi-char particles are stripped —
하와이/제주도/고양이 survive; quoted phrases and version numbers
(
"exact phrase",3.12,C++) pass through cleaning intact.
v0.8 — multilingual search + token hygiene
- LLM-3 plans clean keyword queries across the languages most relevant to the topic; global search, query language as the lever.
- Compact shared research state: per-round deltas, LLM-3 never sees raw
pages, terse JSON rounds, synthesis from de-duplicated facts;
deep_research.tokensper-stage measurement. - Fragment triangulation with
[corroborated ×N]ranking.
v0.7 — three search modes
- LLM-1 sets search result count (
k=N); LLM-3 runs a self-evaluating background search loop;deep_researchships as an approval-gated ≤20-round loop with session documents and meta-cognition Q&A.
v0.5 — core loop
- True background companions, local embeddings by default, redaction, session management, slot budgets, behavior probes.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sherlock_context-1.7.0.tar.gz.
File metadata
- Download URL: sherlock_context-1.7.0.tar.gz
- Upload date:
- Size: 290.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2a7ed9293106893e833430b13481f57b1116429fdb702a0a35e7ee6249053419
|
|
| MD5 |
9ff5ecd4c634d48d6d51dfe2793205ba
|
|
| BLAKE2b-256 |
50c1c50a61cfadc1a27d5dc15b226326ec3011bb4f4c5cc916f07ebcd5e21fc9
|
Provenance
The following attestation bundles were made for sherlock_context-1.7.0.tar.gz:
Publisher:
publish.yml on MinwooKim1990/sherlock
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
sherlock_context-1.7.0.tar.gz -
Subject digest:
2a7ed9293106893e833430b13481f57b1116429fdb702a0a35e7ee6249053419 - Sigstore transparency entry: 1909454091
- Sigstore integration time:
-
Permalink:
MinwooKim1990/sherlock@ad5a600e0429cca90c8a61cb97ded25b7cc7b38f -
Branch / Tag:
refs/tags/v1.7.0 - Owner: https://github.com/MinwooKim1990
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@ad5a600e0429cca90c8a61cb97ded25b7cc7b38f -
Trigger Event:
push
-
Statement type:
File details
Details for the file sherlock_context-1.7.0-py3-none-any.whl.
File metadata
- Download URL: sherlock_context-1.7.0-py3-none-any.whl
- Upload date:
- Size: 280.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ed6374a118ac1c95d979e12d42607bad914f709167c0bd5263d0f2d7033ad135
|
|
| MD5 |
29738875bb338daadc17448279a3aa3f
|
|
| BLAKE2b-256 |
00c0dd39a43dc3e808cdd3598fb0dbc3c0970c7989974798f1509822dbde17db
|
Provenance
The following attestation bundles were made for sherlock_context-1.7.0-py3-none-any.whl:
Publisher:
publish.yml on MinwooKim1990/sherlock
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
sherlock_context-1.7.0-py3-none-any.whl -
Subject digest:
ed6374a118ac1c95d979e12d42607bad914f709167c0bd5263d0f2d7033ad135 - Sigstore transparency entry: 1909454236
- Sigstore integration time:
-
Permalink:
MinwooKim1990/sherlock@ad5a600e0429cca90c8a61cb97ded25b7cc7b38f -
Branch / Tag:
refs/tags/v1.7.0 - Owner: https://github.com/MinwooKim1990
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@ad5a600e0429cca90c8a61cb97ded25b7cc7b38f -
Trigger Event:
push
-
Statement type: