Memory retrieval for AI agents — integer-pointer fidelity guarantee, 93.4% R@1 on LongMemEval_S (hybrid tier). By Hermes Labs.
cogito-ergo
Memory retrieval for AI agents with structural fidelity. Two purpose-built paths:
the existing /recall scores 85% R@1 on cogito's internal 31-case atomic-fact
eval; the new /recall_hybrid tier scores 93.4% R@1 on LongMemEval_S
(session-retrieval benchmark, see Hybrid recall).
Different workloads, different tools. The filter LLM outputs only integers — it
structurally cannot corrupt, rephrase, or hallucinate into the content returned
to your agent.
The Problem
Every retrieval system that uses an LLM to select or rank memories has the same failure mode: the LLM rephrases on the way out. You store "auth tokens expire after 3600 seconds" and get back "authentication has a configurable timeout." The specific fact is gone.
- Raw vector search returns candidates by similarity, but precision plateaus at 50–60% R@1 on real workloads
- LLM-based re-rankers improve relevance but generate text — they summarize, merge, or hallucinate into the content your agent receives
- Full RAG pipelines add latency and cost without solving the fidelity problem
cogito-ergo fixes this structurally. The filter LLM outputs only integer pointers ([3, 7, 12]). The server dereferences them to verbatim stored text. The LLM never sees, generates, or touches memory content. Fidelity is architectural, not a prompting convention.
| Mode | R@1 | hit@any | Latency |
|---|---|---|---|
| Combined (snapshot + recall) | 85% | 96% | 1303ms |
| recall only | 63% | 81% | 1197ms |
| recall_b (zero-LLM) | 56% | 96% | 127ms |
31 test cases, qwen3.5:2b filter model, fully local. $0/month.
Architecture
┌─────────────────────────────────────────────┐
│ cogito-ergo server │
│ :19420 │
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ SNAPSHOT LAYER │
│ Compressed markdown index (~741 tokens) │
│ Built once from corpus via `cogito snapshot`│
│ Returned with /recall — no vector search │
│ Solves cross-reference queries (0%→50% R@1)│
└──────────────────┬──────────────────────────┘
│
┌──────────────────▼──────────────────────────┐
│ STAGE 1 — recall_b (zero-LLM) │
│ Query decomposition → sub-queries │
│ Stop-word stripping + bigrams + trigrams │
│ Vocab expansion (cogito calibrate) │
│ Up to 8 sub-queries, merged with RRF │
│ Latency: ~127ms │
└──────────────────┬──────────────────────────┘
│ up to 100 candidates
┌──────────────────▼──────────────────────────┐
│ STAGE 2 — integer-pointer filter │
│ Filter LLM sees: [1] text [2] text ... │
│ Filter LLM outputs: [3, 7, 12] ← ONLY │
│ Server fetches candidates[3], [7], [12] │
│ Returns: verbatim stored text │
│ Added latency: ~1176ms │
└─────────────────────────────────────────────┘
The filter LLM never generates memory text. Out-of-range integers are silently ignored. Fidelity is a structural property of the pipeline, not a prompting convention.
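Stage 1 merges its sub-query result lists with Reciprocal Rank Fusion. A minimal sketch of the idea, using the conventional k=60 damping constant (cogito's exact constants and tie-breaking are internal and may differ):

```python
from collections import defaultdict

def rrf_merge(runs: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) across runs."""
    scores: dict[str, float] = defaultdict(float)
    for run in runs:
        for rank, doc in enumerate(run, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "m7" places well in all three sub-query runs, so it wins the fused ranking.
print(rrf_merge([["m7", "m2", "m9"], ["m3", "m7"], ["m7", "m1"]])[0])  # m7
```

RRF rewards items that appear near the top of several sub-query runs without needing comparable scores across runs, which is why it suits merging up to 8 heterogeneous sub-queries.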
Benchmarks
Measured 2026-03-28. 31 test cases, qwen3.5:2b as filter model (local Ollama).
| Mode | R@1 | hit@any | MRR | Latency |
|---|---|---|---|---|
| Combined (snapshot + recall) | 85% | 96% | 0.878 | 1303ms |
| recall only | 63% | 81% | — | 1197ms |
| recall_b (zero-LLM) | 56% | 96% | — | 127ms |
| snapshot only | 41% | — | — | — |
Key results:
- Snapshot layer contributes +15% hit@any vs recall-only
- Cross-reference queries: recall alone gets 0% R@1; combined gets 50%
- recall_b matches combined hit@any (96%) at 10x lower latency — use it when cost matters
Quick Start
1. Install
pip install cogito-ergo
2. Pull Ollama models
ollama pull mistral:7b
ollama pull nomic-embed-text
Requires a running Ollama instance.
3. Configure filter LLM
# Option A: direct Anthropic key
export ANTHROPIC_API_KEY=sk-ant-...
# Option B: any OpenAI-compatible gateway (local LM Studio, OpenClaw, etc.)
export COGITO_FILTER_ENDPOINT=http://your-gateway/
export COGITO_FILTER_TOKEN=your-token
export COGITO_FILTER_MODEL=anthropic/claude-haiku-4-5
Or via .cogito.json in your working directory:
{
"filter_endpoint": "http://your-gateway/",
"filter_token": "your-token",
"filter_model": "anthropic/claude-haiku-4-5"
}
4. Start the server
cogito-server
# or: cogito server
5. First recall
cogito recall "what did we decide about the auth architecture"
curl -X POST http://127.0.0.1:19420/recall \
-H "Content-Type: application/json" \
-d '{"text": "auth architecture decisions"}'
Hybrid recall (93.4% R@1 on LongMemEval_S)
recall_hybrid is an opt-in retrieval path that ports the architecture
benchmarked at 93.4% R@1 on LongMemEval_S (up from a 56% mem0 baseline). It
honors the same integer-pointer fidelity contract as /recall — only
indices cross the LLM boundary.
Query
│
▼
┌───────────────────────────────────────────────────────────────┐
│ Stage 1 — hybrid retrieval (zero-LLM) │
│ • sub-query decomposition + vocab expansion (recall_b logic) │
│ • dense retrieval with nomic search_query: / search_document:│
│ prefixes │
│ • BM25 over the candidate pool (bm25s, optional extra) │
│ • Reciprocal Rank Fusion across runs │
│ • cosine-blended rerank against the original query │
└───────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ Router (regex classifier) │
│ • "you told me" / "you said" → skip (keep Stage 1) │
│ • "how many" / "what date" → call cheap filter │
│ • everything else → keep Stage 1 at filter tier │
└───────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ Stage 2 — cheap filter (tier="filter", default) │
│ 500-char snippets, integer-pointer output │
└───────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────┐
│ Stage 3 — flagship rerank (tier="flagship", opt-in) │
│ 2000-char snippets, stronger model, integer output │
│ Called when Stage 1 confidence is low or filter failed │
└───────────────────────────────────────────────────────────────┘
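The router is a handful of regexes, not a model. An illustrative sketch of the pattern (the patterns shown are the ones named in the diagram; the shipped classifier's exact rule set is internal):

```python
import re

SKIP = re.compile(r"\byou (told|said)\b", re.I)           # quote-back: Stage 1 order already wins
ESCALATE = re.compile(r"\b(how many|what date)\b", re.I)  # counting/temporal: filter LLM helps

def should_call_filter(query: str) -> bool:
    if SKIP.search(query):
        return False   # keep Stage 1 ranking
    if ESCALATE.search(query):
        return True    # send snippets to the cheap filter
    return False       # everything else keeps Stage 1, even at filter tier

print(should_call_filter("how many auth migrations have we done"))  # True
print(should_call_filter("you told me about the auth migration"))   # False
```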
Tier tradeoffs
| Tier | Latency | R@1 (LongMemEval_S) | External calls | When to use |
|---|---|---|---|---|
| zero_llm | ~500ms | — | none | latency-sensitive paths, cost-sensitive ops |
| filter | ~1300ms | 90%+ | cheap filter LLM | default; temporal/counting queries benefit |
| flagship | ~3500ms | 93.4% | filter + flagship model | hard queries, long sessions, benchmark setup |
Quick start
# Optional dependency for best BM25 fusion (zero deps fallback if absent)
pip install cogito-ergo[hybrid]
# Opt-in: set a filter endpoint (any OpenAI-compatible API)
export COGITO_FILTER_ENDPOINT=https://dashscope-intl.aliyuncs.com/compatible-mode/v1
export COGITO_FILTER_TOKEN=sk-your-key
export COGITO_FILTER_MODEL=qwen-turbo
# Optional: flagship tier (stronger model, 4x larger context window)
export COGITO_FLAGSHIP_ENDPOINT=https://dashscope-intl.aliyuncs.com/compatible-mode/v1
export COGITO_FLAGSHIP_TOKEN=sk-your-key
export COGITO_FLAGSHIP_MODEL=qwen-max
# Or simpler — if DASHSCOPE_API_KEY is set, the flagship tier auto-configures
export DASHSCOPE_API_KEY=sk-your-key
cogito recall-hybrid "auth architecture decisions" --tier filter
HTTP:
curl -X POST http://127.0.0.1:19420/recall_hybrid \
-H "Content-Type: application/json" \
-d '{"text": "how many auth migrations have we done", "tier": "flagship", "limit": 10}'
Python:
from cogito.recall_hybrid import recall_hybrid
from cogito.config import load, mem0_config
from mem0 import Memory
cfg = load()
mem = Memory.from_config(mem0_config(cfg))
hits, method = recall_hybrid(mem, "auth tokens", user_id="agent", cfg=cfg, tier="filter")
Graceful degradation: if the filter endpoint isn't configured, tier="filter"
falls back to zero_llm. If the flagship endpoint isn't configured,
tier="flagship" falls back to filter (which itself may degrade to
zero_llm). Nothing raises on missing credentials.
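The fallback chain amounts to one level dropped per missing endpoint. A sketch of the documented behavior, not the actual implementation (cfg keys follow the Configuration table below):

```python
def effective_tier(requested: str, cfg: dict) -> str:
    """flagship -> filter -> zero_llm, dropping one level per missing endpoint."""
    if requested == "flagship" and not cfg.get("flagship_endpoint"):
        requested = "filter"
    if requested == "filter" and not cfg.get("filter_endpoint"):
        requested = "zero_llm"
    return requested

assert effective_tier("flagship", {}) == "zero_llm"  # no credentials, no exception
```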
Regression notice
The hybrid path was validated at 93.4% R@1 on LongMemEval_S (multi-turn
dialog retrieval with turn-level chunking and session-date scaffolds).
On the internal 31-case eval (which measures keyword-recall over a mem0
store of short atomic memories), /recall_hybrid scores lower on R@1 than
the existing /recall path — they solve slightly different problems. The
hybrid path wins on hit@any, semantic-gap queries, and multi-memory
aggregation; the existing path wins on prefix-style direct lookup. Default
behavior is unchanged — /recall still drives cogito recall. Use
/recall_hybrid or cogito recall-hybrid when you need the hybrid trait.
Claude Code session memory
cogito-ergo v0.3.0 can ingest your Claude Code sessions and query them with the same turn-pair chunking used in the LongMemEval benchmark.
Why it matters: The 93.4% R@1 benchmark result requires role-structured session data (user+assistant turn pairs). Flat atomic memories don't have this structure. Claude Code stores every session as role-structured JSONL — wiring it to cogito uses the same retrieval architecture as the benchmark, though Claude Code sessions have not been independently evaluated at the same scale.
Install:
# No new dependencies — uses existing chromadb + Ollama (nomic-embed-text)
Ingest:
# Preview (dry run)
python3 -m cogito.ingest_claude_sessions --since 2026-04-11 --dry-run
# Ingest last 7 days
python3 -m cogito.ingest_claude_sessions --since 2026-04-11
# Ingest all sessions (may take a few minutes)
python3 -m cogito.ingest_claude_sessions
Query:
from cogito.recall_sessions import query_sessions, query_both
# Session history only
results = query_sessions("what did we discuss about LPCI last week?", top_k=3)
# Atomic facts + session history side-by-side (no auto-merge)
both = query_both("cogito architecture decisions")
MCP tools (new in v0.3.0):
- cogito_recall_sessions — query Claude Code sessions
- cogito_recall_both — 3 atomic + 3 session results side-by-side
Expected accuracy: ~80-93% depending on query specificity (same pipeline as the benchmark; accuracy drops when querying across very long sessions because the session embedding averages across all turn content).
Privacy: All data stays local. Embedding uses Ollama (localhost:11434, nomic-embed-text).
ChromaDB at ~/.cogito/store. Nothing sent to any cloud API. Ingestion is explicit —
nothing runs automatically. Re-running ingest is idempotent (dedup via content hash).
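Content-hash dedup can be as simple as deriving the vector-store ID from the chunk text itself, so a second ingest upserts instead of duplicating. An illustrative sketch only; the ingester's actual hashing scheme is not specified here:

```python
import hashlib

def chunk_id(text: str) -> str:
    # Identical turn-pair text always maps to the same ID, so a re-run
    # overwrites the existing record instead of adding a duplicate.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```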
Full demo: docs/claude-code-memory-demo.md
HTTP API
All endpoints return JSON. Server runs on port 19420 by default.
GET /health
{"status": "ok", "count": 1484, "version": "0.2.0", "calibrated": true, "snapshot": true}
Fields: count = total memories in store; calibrated = vocab_map present; snapshot = snapshot.md exists.
GET /snapshot
Returns the compressed index markdown built by cogito snapshot.
{"snapshot": "## Projects\n- **cogito-ergo** — ...", "path": "/home/user/.cogito/snapshot.md"}
Returns 404 if no snapshot has been built yet.
POST /query
Narrow vector search. L2 threshold filter only. No LLM call. Fast.
Request:
{"text": "query string", "limit": 5}
Response:
{"memories": [{"text": "...", "score": 93.4}]}
POST /recall
Broad search + integer-pointer filter. Two stages: zero-LLM RRF candidate pool, then cheap LLM selects by index.
Request:
{"text": "query string", "limit": 50, "threshold": 400}
Response:
{"memories": [{"text": "...", "score": 93.4}], "method": "filter"}
method field: "filter" = filter ran successfully; "fallback_*" = graceful degradation, all candidates returned instead. Possible fallbacks: fallback_no_endpoint, fallback_unreachable, fallback_parse_error, fallback_error.
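Callers can detect degradation by inspecting method before trusting the ranking. A minimal sketch using only the documented response shape (requests assumed available):

```python
import requests

resp = requests.post("http://127.0.0.1:19420/recall",
                     json={"text": "auth architecture decisions"}).json()
if resp["method"].startswith("fallback_"):
    # The filter never ran: memories is the full candidate pool,
    # not an LLM-ranked selection.
    print(f"degraded path: {resp['method']} ({len(resp['memories'])} candidates)")
```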
POST /recall_b
Zero-LLM recall only. Sub-query decomposition + RRF. 127ms latency. Same hit@any as combined (96%) at lower cost.
Request:
{"text": "query string", "limit": 50}
Response:
{"memories": [{"text": "...", "score": 0.016}], "method": "decompose_4_v"}
method field: "decompose_N" = N sub-queries ran; "decompose_N_v" = vocab expansion applied.
POST /recall_hybrid
Hybrid BM25 + dense + RRF retrieval with tiered LLM escalation. Port of the architecture that reached 93.4% R@1 on LongMemEval_S. See Hybrid recall for the full diagram and tradeoffs.
Request:
{"text": "query string", "limit": 50, "tier": "filter", "top_k": 5}
tier is one of "zero_llm", "filter" (default), "flagship".
top_k is how many candidates the reranker sees (default 5).
Response:
{"memories": [{"text": "...", "score": 0.72}], "method": "hybrid_12_bm25|filter"}
method field encodes the path taken (e.g. "hybrid_24_bm25|default_s1" =
Stage 1 order kept; "hybrid_12_bm25|filter" = cheap filter reranked;
"hybrid_8_bm25|filter|flagship" = both reranks ran; "hybrid_6_nobm25_v|…"
= bm25s not installed, vocab expansion active).
POST /store
Write one memory verbatim. No extraction LLM. Agent decides the content.
This is the preferred write path for agent-curated content.
Request:
{"text": "Switched from JWT to session tokens on 2026-03-27 due to compliance requirement", "id": "<optional uuid>"}
Response:
{"id": "abc123...", "text": "Switched from JWT to session tokens..."}
POST /add
Feed text through mem0's extraction LLM before storing. Extracts multiple atomic facts from unstructured input.
Use when you have raw/unstructured text and want automatic fact extraction.
Request:
{"text": "free-form text to remember"}
Response:
{"count": 3, "memories": ["extracted fact 1", "extracted fact 2", "extracted fact 3"]}
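The two write paths side by side, a sketch built from the request/response examples above (requests assumed; endpoint shapes as documented):

```python
import requests

BASE = "http://127.0.0.1:19420"

# /store: the agent has already decided the exact sentence worth keeping.
requests.post(f"{BASE}/store", json={
    "text": "Switched from JWT to session tokens on 2026-03-27 due to compliance requirement"
})

# /add: raw text goes through mem0's extraction LLM and may yield several atomic facts.
out = requests.post(f"{BASE}/add", json={
    "text": "Long meeting notes: we discussed tokens, rate limits, and the Q3 deadline..."
}).json()
print(out["count"], "facts extracted")
```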
CLI Reference
All CLI commands talk to the running HTTP server.
| Command | Description |
|---|---|
| cogito recall "query" | Two-stage recall via running server |
| cogito recall "query" --limit 50 --raw | Raw JSON output |
| cogito recall-hybrid "query" --tier filter | Hybrid BM25+dense+RRF recall (93.4% R@1 arch) |
| cogito recall-hybrid "query" --tier flagship --top-k 5 | + flagship escalation on hard queries |
| cogito query "query" | Simple vector query, no filter |
| cogito add "text" | Add a memory via /add (mem0 extraction) |
| cogito seed ~/notes/ | Bulk-seed from markdown files via /store |
| cogito seed ~/notes/ --add | Bulk-seed using /add (extraction mode) |
| cogito seed ~/notes/ --dry-run | Preview without writing |
| cogito seed ~/notes/ --glob "*.txt" | Custom file pattern |
| cogito snapshot | Build compressed index layer |
| cogito snapshot --rebuild | Force rebuild of snapshot |
| cogito snapshot --dry-run | Preview snapshot without writing |
| cogito calibrate | Build vocab bridge from corpus (one-time) |
| cogito calibrate --dry-run | Preview vocab mappings |
| cogito health | Check server status |
| cogito server | Start the server (alias for cogito-server) |
| cogito-server --port 19420 | Start server directly |
| cogito-server --config /path/to.json | Start with explicit config file |
Configuration
Priority: env vars > .cogito.json > defaults.
Config file is searched at ./.cogito.json (cwd) then ~/.cogito/config.json.
| Env var | Config key | Default | Description |
|---|---|---|---|
| COGITO_PORT | port | 19420 | Server port |
| COGITO_USER_ID | user_id | "agent" | Memory namespace (isolates stores) |
| COGITO_FILTER_ENDPOINT | filter_endpoint | — | OpenAI-compatible base URL for filter LLM |
| COGITO_FILTER_TOKEN | filter_token | — | Bearer token for filter endpoint |
| COGITO_FILTER_MODEL | filter_model | anthropic/claude-haiku-4-5 | Filter LLM model name |
| COGITO_FILTER_TIMEOUT_MS | filter_timeout_ms | 12000 | Filter LLM timeout in ms |
| ANTHROPIC_API_KEY | anthropic_api_key | — | Direct Anthropic key (alternative to endpoint+token) |
| COGITO_STORE_PATH | store_path | ~/.cogito/store | ChromaDB persistence path |
| COGITO_COLLECTION | collection | cogito_memory | ChromaDB collection name |
| COGITO_OLLAMA_URL | ollama_url | http://localhost:11434 | Ollama base URL |
| COGITO_LLM_MODEL | llm_model | mistral:7b | LLM for fact extraction (/add) |
| COGITO_EMBED_MODEL | embed_model | nomic-embed-text | Embedding model |
| COGITO_RECALL_LIMIT | recall_limit | 50 | Candidate pool size for /recall and /recall_b |
| COGITO_RECALL_THRESHOLD | recall_threshold | 400.0 | L2 cutoff for /recall candidates |
| COGITO_QUERY_THRESHOLD | query_threshold | 250.0 | L2 cutoff for /query results |
| COGITO_FLAGSHIP_ENDPOINT | flagship_endpoint | — | OpenAI-compatible base URL for flagship rerank (recall_hybrid tier="flagship") |
| COGITO_FLAGSHIP_TOKEN | flagship_token | — | Bearer token for flagship endpoint |
| COGITO_FLAGSHIP_MODEL | flagship_model | — | Flagship model name (e.g. qwen-max) |
| COGITO_FLAGSHIP_TIMEOUT_MS | flagship_timeout_ms | 30000 | Flagship LLM timeout in ms |
| DASHSCOPE_API_KEY | — | — | If set, recall_hybrid auto-configures flagship to DashScope qwen-max |
| COGITO_HYBRID_COSINE_WEIGHT | hybrid_cosine_weight | 0.7 | Cosine vs RRF blend weight for hybrid retrieval (0..1) |
filter_endpoint accepts any OpenAI-compatible API: Anthropic gateway, LM Studio, Ollama's /v1 compat layer, OpenClaw, etc.
For Ollama qwen3/qwen3.5 models used as filter, cogito automatically switches to the native Ollama /api/chat endpoint with think: false to suppress thinking mode.
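Putting the table together, a fully local setup might use a .cogito.json like this (illustrative values only; keys come from the table above, and whether filter_token may be omitted for a token-less endpoint like Ollama's /v1 layer is an assumption):

```json
{
  "port": 19420,
  "user_id": "agent",
  "store_path": "~/.cogito/store",
  "collection": "cogito_memory",
  "ollama_url": "http://localhost:11434",
  "llm_model": "mistral:7b",
  "embed_model": "nomic-embed-text",
  "filter_endpoint": "http://localhost:11434/v1/",
  "filter_model": "qwen3.5:2b",
  "recall_limit": 50,
  "recall_threshold": 400.0
}
```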
Python API
from cogito.recall import recall
from cogito.recall_b import recall_b
from cogito.recall_hybrid import recall_hybrid
from cogito.config import load, mem0_config
from mem0 import Memory
cfg = load() # reads .cogito.json + env vars
memory = Memory.from_config(mem0_config(cfg))
# Two-stage recall (recommended default)
memories, method = recall(memory, "auth architecture", user_id=cfg["user_id"], cfg=cfg)
for m in memories:
print(m["text"]) # verbatim stored text, never rephrased
# Zero-LLM recall (fast path)
memories, method = recall_b(memory, "auth architecture", user_id=cfg["user_id"], cfg=cfg)
# Hybrid recall (BM25 + dense + RRF + tiered LLM; 93.4% R@1 on LongMemEval_S)
memories, method = recall_hybrid(
memory, "auth architecture", user_id=cfg["user_id"], cfg=cfg,
tier="filter", # "zero_llm" | "filter" | "flagship"
)
print(method) # e.g. "filter", "decompose_4_v", "hybrid_12_bm25|filter"
Why integers?
When the filter LLM outputs only [3, 7, 12]:
- It cannot rephrase memory text — it never generates it
- It cannot hallucinate new facts into the output
- It cannot summarize two memories into one
- An out-of-range integer is silently ignored; it cannot inject noise
Compare to asking the LLM to "return the relevant passages" — even with careful prompting, LLMs will reword, compress, or merge content. The integer-pointer pattern makes fidelity a structural property of the pipeline, not a prompt engineering goal.
The filter prompt:
Output ONLY a JSON array of integers, ordered from most to least relevant.
Examples: [1, 4, 7] or []
The server then picks candidates[i] — verbatim stored text — for each valid integer i. The path from storage to retrieval contains no text generation.
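The whole contract fits in a few lines. A sketch of the behavior described above (the prompt labels candidates from [1], so this maps 1-based picks; the parse-error fallback mirrors the documented fallback_parse_error behavior, and the real server's index base is an assumption):

```python
import json

def dereference(llm_output: str, candidates: list[str]) -> list[str]:
    """Map the filter LLM's integer array back to verbatim stored text.

    The integer array is the ONLY thing that crosses the model boundary.
    Out-of-range or non-integer entries are silently dropped.
    """
    try:
        picks = json.loads(llm_output)
    except json.JSONDecodeError:
        return candidates              # fallback: return the whole pool untouched
    return [candidates[i - 1]          # prompt shows 1-based labels: [1] text [2] text ...
            for i in picks
            if isinstance(i, int) and 1 <= i <= len(candidates)]

print(dereference("[2, 99]", ["fact A", "fact B", "fact C"]))  # ['fact B'] — 99 ignored
```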
Setup Recommendations
Before benchmarking or deploying — run zer0lint first.
Ingestion quality directly limits retrieval quality. A poorly formatted memory store will underperform regardless of retrieval method. The technical extraction prompt baked into cogito's default config was validated to produce 0%→100% ingestion improvement via zer0lint diagnostics.
Session start pattern (agent integration):
# 1. Load snapshot into context once at session start
import urllib.request, json
resp = urllib.request.urlopen("http://127.0.0.1:19420/snapshot")
snapshot = json.loads(resp.read())["snapshot"]
# inject snapshot into system prompt or first user message
# 2. Query per-message via /recall
# 3. Write new facts via /store (agent-curated) or /add (extraction)
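Steps 2 and 3 in the same stdlib-only style, a sketch built from the request shapes in the HTTP API section (not the packaged client):

```python
import urllib.request, json

def post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        f"http://127.0.0.1:19420{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return json.loads(urllib.request.urlopen(req).read())

# 2. Per-message recall — verbatim memories, never rephrased
hits = post("/recall", {"text": "auth architecture decisions"})["memories"]

# 3. Agent-curated write-back at the end of the turn
post("/store", {"text": "Decided to keep session tokens; revisit after Q3 audit"})
```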
Calibrate for domain-specific vocabulary:
cogito calibrate # reads your corpus, writes vocab_map to .cogito.json
# then restart server to pick up new vocab_map
Calibration builds a plain-English → technical term bridge. Example: "how fast" → ["latency", "throughput", "ms"]. Improves recall_b on domain-specific queries without adding LLM calls.
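For illustration, the calibrated .cogito.json might carry an entry like this, using the example mapping above (the exact shape of the vocab_map key is an assumption):

```json
{
  "vocab_map": {
    "how fast": ["latency", "throughput", "ms"]
  }
}
```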
Built by Hermes Labs
cogito-ergo is part of the Hermes Labs AI agent tooling suite:
- zer0lint — Memory extraction diagnostics. Run before benchmarking to verify store quality. The technical extraction prompt in cogito's default config was validated against zer0lint.
- zer0dex — Dual-layer memory architecture pattern that cogito-ergo implements.
- lintlang — Static linter for AI agent tool descriptions and prompts
- Little Canary — Prompt injection detection
- Suy Sideguy — Runtime policy enforcement for agents
- cogito-ergo — Two-stage memory retrieval ← you are here
Roadmap
- Pluggable vector backends (pgvector, Qdrant, LlamaIndex)
- Pluggable extraction backends (non-Ollama)
- Session flush utility (end-of-session seeding)
- Benchmark harness as public CLI (cogito bench)
- Streaming /recall response
License
MIT