
cogito-ergo

Memory retrieval for AI agents with structural fidelity. Two purpose-built paths: the existing /recall scores 85% R@1 on cogito's internal 31-case atomic-fact eval; the new /recall_hybrid tier scores 93.4% R@1 on LongMemEval_S (session-retrieval benchmark, see Hybrid recall). Different workloads, different tools. The filter LLM outputs only integers — it structurally cannot corrupt, rephrase, or hallucinate into the content returned to your agent.

Python 3.10+ · MIT License · Made by Hermes Labs


The Problem

Every retrieval system that uses an LLM to select or rank memories has the same failure mode: the LLM rephrases on the way out. You store "auth tokens expire after 3600 seconds" and get back "authentication has a configurable timeout." The specific fact is gone.

  • Raw vector search returns candidates by similarity, but precision plateaus at 50–60% R@1 on real workloads
  • LLM-based re-rankers improve relevance but generate text — they summarize, merge, or hallucinate into the content your agent receives
  • Full RAG pipelines add latency and cost without solving the fidelity problem

cogito-ergo fixes this structurally. The filter LLM outputs only integer pointers ([3, 7, 12]). The server dereferences them to verbatim stored text. The LLM never sees, generates, or touches memory content. Fidelity is architectural, not a prompting convention.

| Mode | R@1 | hit@any | Latency |
|---|---|---|---|
| Combined (snapshot + recall) | 85% | 96% | 1303ms |
| recall only | 63% | 81% | 1197ms |
| recall_b (zero-LLM) | 56% | 96% | 127ms |

31 test cases, qwen3.5:2b filter model, fully local. $0/month.


Architecture

                    ┌─────────────────────────────────────────────┐
                    │              cogito-ergo server              │
                    │                 :19420                       │
                    └──────────────────┬──────────────────────────┘
                                       │
                    ┌──────────────────▼──────────────────────────┐
                    │             SNAPSHOT LAYER                   │
                    │   Compressed markdown index (~741 tokens)    │
                    │   Built once from corpus via `cogito snapshot`│
                    │   Returned with /recall — no vector search   │
                    │   Solves cross-reference queries (0%→50% R@1)│
                    └──────────────────┬──────────────────────────┘
                                       │
                    ┌──────────────────▼──────────────────────────┐
                    │         STAGE 1 — recall_b (zero-LLM)       │
                    │   Query decomposition → sub-queries          │
                    │   Stop-word stripping + bigrams + trigrams   │
                    │   Vocab expansion (cogito calibrate)         │
                    │   Up to 8 sub-queries, merged with RRF       │
                    │   Latency: ~127ms                            │
                    └──────────────────┬──────────────────────────┘
                                       │ up to 100 candidates
                    ┌──────────────────▼──────────────────────────┐
                    │       STAGE 2 — integer-pointer filter       │
                    │   Filter LLM sees: [1] text  [2] text ...   │
                    │   Filter LLM outputs: [3, 7, 12]  ← ONLY    │
                    │   Server fetches candidates[3], [7], [12]   │
                    │   Returns: verbatim stored text              │
                    │   Added latency: ~1176ms                     │
                    └─────────────────────────────────────────────┘

The filter LLM never generates memory text. Out-of-range integers are silently ignored. Fidelity is a structural property of the pipeline, not a prompting convention.
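
For readers unfamiliar with RRF, here is a minimal sketch of Reciprocal Rank Fusion as used to merge the Stage 1 sub-query runs. The k constant and data shapes are illustrative, not cogito's internals:

# Minimal Reciprocal Rank Fusion sketch. The k=60 constant and the
# list-of-runs input shape are illustrative assumptions.
def rrf_merge(runs, k=60):
    """runs: list of ranked lists of memory ids (best first)."""
    scores = {}
    for run in runs:
        for rank, mem_id in enumerate(run, start=1):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Merging three hypothetical sub-query runs into one candidate ordering:
merged = rrf_merge([["a", "b", "c"], ["b", "d"], ["c", "b"]])
print(merged)  # ids that rank well across several runs float to the top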


Benchmarks

Measured 2026-03-28. 31 test cases, qwen3.5:2b as filter model (local Ollama).

| Mode | R@1 | hit@any | MRR | Latency |
|---|---|---|---|---|
| Combined (snapshot + recall) | 85% | 96% | 0.878 | 1303ms |
| recall only | 63% | 81% | | 1197ms |
| recall_b (zero-LLM) | 56% | 96% | | 127ms |
| snapshot only | 41% | | | |

Key results:

  • Snapshot layer contributes +15% hit@any vs recall-only
  • Cross-reference queries: recall alone gets 0% R@1; combined gets 50%
  • recall_b matches combined hit@any (96%) at 10x lower latency — use it when cost matters

Quick Start

1. Install

pip install cogito-ergo

2. Pull Ollama models

ollama pull mistral:7b
ollama pull nomic-embed-text

Requires a running Ollama instance.

3. Configure filter LLM

# Option A: direct Anthropic key
export ANTHROPIC_API_KEY=sk-ant-...

# Option B: any OpenAI-compatible gateway (local LM Studio, OpenClaw, etc.)
export COGITO_FILTER_ENDPOINT=http://your-gateway/
export COGITO_FILTER_TOKEN=your-token
export COGITO_FILTER_MODEL=anthropic/claude-haiku-4-5

Or via .cogito.json in your working directory:

{
  "filter_endpoint": "http://your-gateway/",
  "filter_token": "your-token",
  "filter_model": "anthropic/claude-haiku-4-5"
}

4. Start the server

cogito-server
# or: cogito server

5. First recall

cogito recall "what did we decide about the auth architecture"
curl -X POST http://127.0.0.1:19420/recall \
  -H "Content-Type: application/json" \
  -d '{"text": "auth architecture decisions"}'

Hybrid recall (93.4% R@1 on LongMemEval_S)

recall_hybrid is an opt-in retrieval path that ports the architecture benchmarked at 93.4% R@1 on LongMemEval_S (up from a 56% mem0 baseline). It honors the same integer-pointer fidelity contract as /recall: only indices cross the LLM boundary.

Query
  │
  ▼
┌───────────────────────────────────────────────────────────────┐
│ Stage 1 — hybrid retrieval (zero-LLM)                          │
│   • sub-query decomposition + vocab expansion (recall_b logic) │
│   • dense retrieval with nomic search_query: / search_document:│
│     prefixes                                                    │
│   • BM25 over the candidate pool (bm25s, optional extra)       │
│   • Reciprocal Rank Fusion across runs                         │
│   • cosine-blended rerank against the original query           │
└───────────────────────────────────────────────────────────────┘
  │
  ▼
┌───────────────────────────────────────────────────────────────┐
│ Router (regex classifier)                                      │
│   • "you told me" / "you said"  → skip (keep Stage 1)          │
│   • "how many" / "what date"    → call cheap filter            │
│   • everything else             → keep Stage 1 at filter tier  │
└───────────────────────────────────────────────────────────────┘
  │
  ▼
┌───────────────────────────────────────────────────────────────┐
│ Stage 2 — cheap filter  (tier="filter", default)               │
│   500-char snippets, integer-pointer output                    │
└───────────────────────────────────────────────────────────────┘
  │
  ▼
┌───────────────────────────────────────────────────────────────┐
│ Stage 3 — flagship rerank  (tier="flagship", opt-in)           │
│   2000-char snippets, stronger model, integer output           │
│   Called when Stage 1 confidence is low or filter failed       │
└───────────────────────────────────────────────────────────────┘
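
A minimal sketch of the router idea in Python. The exact patterns are assumptions; only the routing behavior follows the diagram above:

# Illustrative regex router. The patterns shown are assumptions, not cogito's exact ones.
import re

SKIP_FILTER = re.compile(r"\byou (told|said)\b", re.IGNORECASE)
NEEDS_FILTER = re.compile(r"\b(how many|what date)\b", re.IGNORECASE)

def route(query, default_tier="filter"):
    if SKIP_FILTER.search(query):
        return "zero_llm"   # keep the Stage 1 ordering, skip the filter call
    if NEEDS_FILTER.search(query):
        return "filter"     # counting/temporal queries benefit from the cheap filter
    return default_tier     # everything else stays at the requested tier

print(route("how many auth migrations have we done"))  # filter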

Tier tradeoffs

| Tier | Latency | R@1 (LongMemEval_S) | External calls | When to use |
|---|---|---|---|---|
| zero_llm | ~500ms | | none | latency-sensitive paths, cost-sensitive ops |
| filter | ~1300ms | 90%+ | cheap filter LLM | default; temporal/counting queries benefit |
| flagship | ~3500ms | 93.4% | filter + flagship model | hard queries, long sessions, benchmark setup |

Quick start

# Optional dependency for best BM25 fusion (zero deps fallback if absent)
pip install cogito-ergo[hybrid]

# Opt-in: set a filter endpoint (any OpenAI-compatible API)
export COGITO_FILTER_ENDPOINT=https://dashscope-intl.aliyuncs.com/compatible-mode/v1
export COGITO_FILTER_TOKEN=sk-your-key
export COGITO_FILTER_MODEL=qwen-turbo

# Optional: flagship tier (stronger model, 4x larger context window)
export COGITO_FLAGSHIP_ENDPOINT=https://dashscope-intl.aliyuncs.com/compatible-mode/v1
export COGITO_FLAGSHIP_TOKEN=sk-your-key
export COGITO_FLAGSHIP_MODEL=qwen-max
# Or simpler — if DASHSCOPE_API_KEY is set, the flagship tier auto-configures
export DASHSCOPE_API_KEY=sk-your-key

cogito recall-hybrid "auth architecture decisions" --tier filter

HTTP:

curl -X POST http://127.0.0.1:19420/recall_hybrid \
  -H "Content-Type: application/json" \
  -d '{"text": "how many auth migrations have we done", "tier": "flagship", "limit": 10}'

Python:

from cogito.recall_hybrid import recall_hybrid
from cogito.config import load, mem0_config
from mem0 import Memory

cfg = load()
mem = Memory.from_config(mem0_config(cfg))
hits, method = recall_hybrid(mem, "auth tokens", user_id="agent", cfg=cfg, tier="filter")

Graceful degradation: if the filter endpoint isn't configured, tier="filter" falls back to zero_llm. If the flagship endpoint isn't configured, tier="flagship" falls back to filter (which itself may degrade to zero_llm). Nothing raises on missing credentials.
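
Because the fallbacks are silent, the returned method string is the signal for which tier actually ran. Continuing the Python example above, a minimal check (it assumes the documented method naming, e.g. "hybrid_8_bm25|filter|flagship"):

# Inspect which path actually executed when credentials may be missing.
hits, method = recall_hybrid(mem, "auth tokens", user_id="agent", cfg=cfg, tier="flagship")
if "flagship" not in method:
    # flagship endpoint not configured or not reached; a lower tier handled the query
    print("degraded to:", method)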

Regression notice

The hybrid path was validated at 93.4% R@1 on LongMemEval_S (multi-turn dialog retrieval with turn-level chunking and session-date scaffolds). On the internal 31-case eval (which measures keyword recall over a mem0 store of short atomic memories), /recall_hybrid scores lower on R@1 than the existing /recall path — they solve slightly different problems. The hybrid path wins on hit@any, semantic-gap queries, and multi-memory aggregation; the existing path wins on prefix-style direct lookup. Default behavior is unchanged — /recall still drives cogito recall. Use /recall_hybrid or cogito recall-hybrid when you need the hybrid path's strengths.


Claude Code session memory

cogito-ergo v0.3.0 can ingest your Claude Code sessions and query them with the same turn-pair chunking used in the LongMemEval benchmark.

Why it matters: The 93.4% R@1 benchmark result requires role-structured session data (user+assistant turn pairs). Flat atomic memories don't have this structure. Claude Code stores every session as role-structured JSONL — wiring it to cogito uses the same retrieval architecture as the benchmark, though Claude Code sessions have not been independently evaluated at the same scale.

Install:

# No new dependencies — uses existing chromadb + Ollama (nomic-embed-text)

Ingest:

# Preview (dry run)
python3 -m cogito.ingest_claude_sessions --since 2026-04-11 --dry-run

# Ingest last 7 days
python3 -m cogito.ingest_claude_sessions --since 2026-04-11

# Ingest all sessions (may take a few minutes)
python3 -m cogito.ingest_claude_sessions

Query:

from cogito.recall_sessions import query_sessions, query_both

# Session history only
results = query_sessions("what did we discuss about LPCI last week?", top_k=3)

# Atomic facts + session history side-by-side (no auto-merge)
both = query_both("cogito architecture decisions")

MCP tools (new in v0.3.0):

  • cogito_recall_sessions — query Claude Code sessions
  • cogito_recall_both — 3 atomic + 3 session results side-by-side

Expected accuracy: ~80-93% depending on query specificity (same pipeline as the benchmark; accuracy drops when querying across very long sessions because the session embedding averages across all turn content).

Privacy: All data stays local. Embedding uses Ollama (localhost:11434, nomic-embed-text). ChromaDB at ~/.cogito/store. Nothing sent to any cloud API. Ingestion is explicit — nothing runs automatically. Re-running ingest is idempotent (dedup via content hash).
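
That dedup is why re-running ingest is safe. A minimal sketch of content-hash dedup (the hashing scheme shown is illustrative, not necessarily cogito's exact id format):

# Illustrative content-hash dedup: identical text always maps to the same id,
# so re-running ingestion cannot create duplicates.
import hashlib

def memory_id(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

store = {}
for chunk in ["user: fix auth\nassistant: done", "user: fix auth\nassistant: done"]:
    store[memory_id(chunk)] = chunk  # second pass overwrites, store size unchanged

print(len(store))  # 1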

Full demo: docs/claude-code-memory-demo.md

HTTP API

All endpoints return JSON. Server runs on port 19420 by default.

GET /health

{"status": "ok", "count": 1484, "version": "0.2.0", "calibrated": true, "snapshot": true}

Fields: count = total memories in store; calibrated = vocab_map present; snapshot = snapshot.md exists.


GET /snapshot

Returns the compressed index markdown built by cogito snapshot.

{"snapshot": "## Projects\n- **cogito-ergo** — ...", "path": "/home/user/.cogito/snapshot.md"}

Returns 404 if no snapshot has been built yet.


POST /query

Narrow vector search. L2 threshold filter only. No LLM call. Fast.

Request:

{"text": "query string", "limit": 5}

Response:

{"memories": [{"text": "...", "score": 93.4}]}

POST /recall

Broad search + integer-pointer filter. Two stages: zero-LLM RRF candidate pool, then cheap LLM selects by index.

Request:

{"text": "query string", "limit": 50, "threshold": 400}

Response:

{"memories": [{"text": "...", "score": 93.4}], "method": "filter"}

method field: "filter" = filter ran successfully; "fallback_*" = graceful degradation, all candidates returned instead. Possible fallbacks: fallback_no_endpoint, fallback_unreachable, fallback_parse_error, fallback_error.


POST /recall_b

Zero-LLM recall only. Sub-query decomposition + RRF. 127ms latency. Same hit@any as combined (96%) at lower cost.

Request:

{"text": "query string", "limit": 50}

Response:

{"memories": [{"text": "...", "score": 0.016}], "method": "decompose_4_v"}

method field: "decompose_N" = N sub-queries ran; "decompose_N_v" = vocab expansion applied.


POST /recall_hybrid

Hybrid BM25 + dense + RRF retrieval with tiered LLM escalation. Port of the architecture that reached 93.4% R@1 on LongMemEval_S. See Hybrid recall for the full diagram and tradeoffs.

Request:

{"text": "query string", "limit": 50, "tier": "filter", "top_k": 5}

tier is one of "zero_llm", "filter" (default), "flagship". top_k is how many candidates the reranker sees (default 5).

Response:

{"memories": [{"text": "...", "score": 0.72}], "method": "hybrid_12_bm25|filter"}

method field encodes the path taken (e.g. "hybrid_24_bm25|default_s1" = Stage 1 order kept; "hybrid_12_bm25|filter" = cheap filter reranked; "hybrid_8_bm25|filter|flagship" = both reranks ran; "hybrid_6_nobm25_v|…" = bm25s not installed, vocab expansion active).
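
Splitting the method string shows which stages ran; a small example built on the documented format:

# Decompose a recall_hybrid method string into its stages (format as documented above).
method = "hybrid_8_bm25|filter|flagship"
stage1, *reranks = method.split("|")
print(stage1)   # "hybrid_8_bm25": Stage 1 pool of 8 candidates, bm25s active
print(reranks)  # ["filter", "flagship"]: which LLM reranks ran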


POST /store

Write one memory verbatim. No extraction LLM. Agent decides the content.

This is the preferred write path for agent-curated content.

Request:

{"text": "Switched from JWT to session tokens on 2026-03-27 due to compliance requirement", "id": "<optional uuid>"}

Response:

{"id": "abc123...", "text": "Switched from JWT to session tokens..."}

POST /add

Feed text through mem0's extraction LLM before storing. Extracts multiple atomic facts from unstructured input.

Use when you have raw/unstructured text and want automatic fact extraction.

Request:

{"text": "free-form text to remember"}

Response:

{"count": 3, "memories": ["extracted fact 1", "extracted fact 2", "extracted fact 3"]}

CLI Reference

All CLI commands talk to the running HTTP server.

| Command | Description |
|---|---|
| `cogito recall "query"` | Two-stage recall via running server |
| `cogito recall "query" --limit 50 --raw` | Raw JSON output |
| `cogito recall-hybrid "query" --tier filter` | Hybrid BM25+dense+RRF recall (93.4% R@1 arch) |
| `cogito recall-hybrid "query" --tier flagship --top-k 5` | Same, with flagship escalation on hard queries |
| `cogito query "query"` | Simple vector query, no filter |
| `cogito add "text"` | Add a memory via /add (mem0 extraction) |
| `cogito seed ~/notes/` | Bulk-seed from markdown files via /store |
| `cogito seed ~/notes/ --add` | Bulk-seed using /add (extraction mode) |
| `cogito seed ~/notes/ --dry-run` | Preview without writing |
| `cogito seed ~/notes/ --glob "*.txt"` | Custom file pattern |
| `cogito snapshot` | Build compressed index layer |
| `cogito snapshot --rebuild` | Force rebuild of snapshot |
| `cogito snapshot --dry-run` | Preview snapshot without writing |
| `cogito calibrate` | Build vocab bridge from corpus (one-time) |
| `cogito calibrate --dry-run` | Preview vocab mappings |
| `cogito health` | Check server status |
| `cogito server` | Start the server (alias for cogito-server) |
| `cogito-server --port 19420` | Start server directly |
| `cogito-server --config /path/to.json` | Start with explicit config file |

Configuration

Priority: env vars > .cogito.json > defaults.

Config file is searched at ./.cogito.json (cwd) then ~/.cogito/config.json.

| Env var | Config key | Default | Description |
|---|---|---|---|
| COGITO_PORT | port | 19420 | Server port |
| COGITO_USER_ID | user_id | "agent" | Memory namespace (isolates stores) |
| COGITO_FILTER_ENDPOINT | filter_endpoint | | OpenAI-compatible base URL for filter LLM |
| COGITO_FILTER_TOKEN | filter_token | | Bearer token for filter endpoint |
| COGITO_FILTER_MODEL | filter_model | anthropic/claude-haiku-4-5 | Filter LLM model name |
| COGITO_FILTER_TIMEOUT_MS | filter_timeout_ms | 12000 | Filter LLM timeout in ms |
| ANTHROPIC_API_KEY | anthropic_api_key | | Direct Anthropic key (alternative to endpoint+token) |
| COGITO_STORE_PATH | store_path | ~/.cogito/store | ChromaDB persistence path |
| COGITO_COLLECTION | collection | cogito_memory | ChromaDB collection name |
| COGITO_OLLAMA_URL | ollama_url | http://localhost:11434 | Ollama base URL |
| COGITO_LLM_MODEL | llm_model | mistral:7b | LLM for fact extraction (/add) |
| COGITO_EMBED_MODEL | embed_model | nomic-embed-text | Embedding model |
| COGITO_RECALL_LIMIT | recall_limit | 50 | Candidate pool size for /recall and /recall_b |
| COGITO_RECALL_THRESHOLD | recall_threshold | 400.0 | L2 cutoff for /recall candidates |
| COGITO_QUERY_THRESHOLD | query_threshold | 250.0 | L2 cutoff for /query results |
| COGITO_FLAGSHIP_ENDPOINT | flagship_endpoint | | OpenAI-compatible base URL for flagship rerank (recall_hybrid tier="flagship") |
| COGITO_FLAGSHIP_TOKEN | flagship_token | | Bearer token for flagship endpoint |
| COGITO_FLAGSHIP_MODEL | flagship_model | | Flagship model name (e.g. qwen-max) |
| COGITO_FLAGSHIP_TIMEOUT_MS | flagship_timeout_ms | 30000 | Flagship LLM timeout in ms |
| DASHSCOPE_API_KEY | | | If set, recall_hybrid auto-configures flagship to DashScope qwen-max |
| COGITO_HYBRID_COSINE_WEIGHT | hybrid_cosine_weight | 0.7 | Cosine vs RRF blend weight for hybrid retrieval (0..1) |

filter_endpoint accepts any OpenAI-compatible API: Anthropic gateway, LM Studio, Ollama's /v1 compat layer, OpenClaw, etc.

For Ollama qwen3/qwen3.5 models used as filter, cogito automatically switches to the native Ollama /api/chat endpoint with think: false to suppress thinking mode.


Python API

from cogito.recall import recall
from cogito.recall_b import recall_b
from cogito.recall_hybrid import recall_hybrid
from cogito.config import load, mem0_config
from mem0 import Memory

cfg = load()  # reads .cogito.json + env vars
memory = Memory.from_config(mem0_config(cfg))

# Two-stage recall (recommended default)
memories, method = recall(memory, "auth architecture", user_id=cfg["user_id"], cfg=cfg)
for m in memories:
    print(m["text"])  # verbatim stored text, never rephrased

# Zero-LLM recall (fast path)
memories, method = recall_b(memory, "auth architecture", user_id=cfg["user_id"], cfg=cfg)

# Hybrid recall (BM25 + dense + RRF + tiered LLM; 93.4% R@1 on LongMemEval_S)
memories, method = recall_hybrid(
    memory, "auth architecture", user_id=cfg["user_id"], cfg=cfg,
    tier="filter",  # "zero_llm" | "filter" | "flagship"
)

print(method)  # e.g. "filter", "decompose_4_v", "hybrid_12_bm25|filter"

Why integers?

When the filter LLM outputs only [3, 7, 12]:

  • It cannot rephrase memory text — it never generates it
  • It cannot hallucinate new facts into the output
  • It cannot summarize two memories into one
  • An out-of-range integer is silently ignored; it cannot inject noise

Compare to asking the LLM to "return the relevant passages" — even with careful prompting, LLMs will reword, compress, or merge content. The integer-pointer pattern makes fidelity a structural property of the pipeline, not a prompt engineering goal.

The filter prompt:

Output ONLY a JSON array of integers, ordered from most to least relevant.
Examples: [1, 4, 7]   or   []

The server then picks candidates[i] — verbatim stored text — for each valid integer i. The path from storage to retrieval contains no text generation.
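
A minimal sketch of that dereference step (the parsing details are illustrative):

# Illustrative dereference of the filter LLM's integer output.
import json

candidates = ["auth tokens expire after 3600 seconds",
              "snapshot rebuilt nightly",
              "switched from JWT to session tokens"]

llm_output = "[3, 1]"                       # 1-based indices, most relevant first
indices = json.loads(llm_output)
selected = [candidates[i - 1] for i in indices
            if isinstance(i, int) and 1 <= i <= len(candidates)]  # out-of-range ignored
print(selected)  # verbatim stored text, in the LLM's relevance order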


Setup Recommendations

Before benchmarking or deploying — run zer0lint first.

Ingestion quality directly limits retrieval quality. A poorly formatted memory store will underperform regardless of retrieval method. The technical extraction prompt baked into cogito's default config was validated with zer0lint diagnostics, which took its ingestion-quality score from 0% to 100%.

Session start pattern (agent integration):

# 1. Load snapshot into context once at session start
import urllib.request, json
resp = urllib.request.urlopen("http://127.0.0.1:19420/snapshot")
snapshot = json.loads(resp.read())["snapshot"]
# inject snapshot into system prompt or first user message

# 2. Query per-message via /recall
# 3. Write new facts via /store (agent-curated) or /add (extraction)

Calibrate for domain-specific vocabulary:

cogito calibrate  # reads your corpus, writes vocab_map to .cogito.json
# then restart server to pick up new vocab_map

Calibration builds a plain-English → technical term bridge. Example: "how fast" → ["latency", "throughput", "ms"]. Improves recall_b on domain-specific queries without adding LLM calls.
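
To make the bridge concrete, here is an illustrative sketch of how a vocab_map entry expands a plain-English query into extra sub-queries. The mapping and the expansion logic are assumptions, not cogito's internals:

# Illustrative vocab expansion: the mapping shown is an example, not generated output.
vocab_map = {"how fast": ["latency", "throughput", "ms"]}

def expand(query: str) -> list[str]:
    subqueries = [query]
    for phrase, terms in vocab_map.items():
        if phrase in query.lower():
            subqueries += [query + " " + t for t in terms]
    return subqueries

print(expand("how fast is recall"))
# ['how fast is recall', 'how fast is recall latency',
#  'how fast is recall throughput', 'how fast is recall ms']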


Built by Hermes Labs

cogito-ergo is part of the Hermes Labs AI agent tooling suite:

  • zer0lint — Memory extraction diagnostics. Run before benchmarking to verify store quality. The technical extraction prompt in cogito's default config was validated against zer0lint.
  • zer0dex — Dual-layer memory architecture pattern that cogito-ergo implements.
  • lintlang — Static linter for AI agent tool descriptions and prompts
  • Little Canary — Prompt injection detection
  • Suy Sideguy — Runtime policy enforcement for agents
  • cogito-ergo — Two-stage memory retrieval ← you are here

Roadmap

  • Pluggable vector backends (pgvector, Qdrant, LlamaIndex)
  • Pluggable extraction backends (non-Ollama)
  • Session flush utility (end-of-session seeding)
  • Benchmark harness as public CLI (cogito bench)
  • Streaming /recall response

License

MIT
