Active memory and executive control for AI coding agents. Classifies user prompts against a curated tripwire store, injects the matched lessons into the agent's working context via a Claude Code UserPromptSubmit hook, before the agent reasons about the task.
Project description
Cortex
Active memory and executive control for AI coding agents.
Stop your agent from repeating the mistakes you already paid for.
The problem
AI coding agents have no institutional memory. Your vector store knows about the fee formula, the lookahead bug, the dead-zone hypothesis — but the agent has to think to ask, and usually doesn't. By the time the agent realizes "I should have checked that first", the mistake is already in the commit, or worse, in the live trade.
Cortex makes memory active. A Claude Code hook intercepts every prompt, classifies it against a curated tripwire store, and injects the matched lessons into the agent's working context before it reasons about the task. The lessons get in front of the agent at the one moment that matters — before work starts.
What Cortex actually does
Cortex hooks into Claude Code's UserPromptSubmit event. When you submit a
prompt, it runs through a four-stage pipeline in under 20ms and injects a
structured brief of relevant past lessons into the agent's working context
— before the agent reasons about your task.
User prompt ─────────────────────────────────────────┐
"replay basis-arb on 5m poly slots" │
▼
┌────────────────────┐
│ cortex-hook │
│ │
┌───────────┤ 1. Classify │
│ │ YAML rules │
│ <1ms │ │
│ │ 2. Synthesize │
│ │ sum cost │
│ │ components │
│ │ │
│ │ 3. Fallback │
│ │ TF-IDF scoring │
│ │ │
│ │ 4. Audit │
│ │ jsonl log │
│ └────────────────────┘
▼
<cortex_brief> injected into agent context
(as hookSpecificOutput.additionalContext)
No vector search to trigger. No RAG query to write. No "remember to check X" ritual. The agent gets the warning automatically at the one moment it matters — before it starts.
Here's what the brief actually looks like for a real prompt:
<cortex_brief n="5" critical="4">
Cortex matched rule(s): poly_backtest_task, poly_lag_arb
SYNTHESIS (cumulative cost from matched tripwires):
pm_5m_directional_block: Sum = 19.65pp (threshold 5.0pp, op gte)
+2.4pp spread_slip [directional_5m_dead]
+7.25pp info_decay_5min [information_decay_5m]
+10.0pp adverse_selection [adverse_selection_maker]
>> Sum structural drag = 19.65pp (3 components) >= 5.0pp floor.
Any directional 5m strategy needs pre-fee edge > 19.65pp to
even be testable. See directional_5m_dead for the full autopsy.
The following lessons apply to this task. Each cost real money or
research time in the past. Read them before committing to an approach:
[1] poly_fee_empirical -- CRITICAL
Polymarket net fee = 0.072 x min(p, 1-p) x size - NOT 10% flat
...
[2] real_entry_price -- CRITICAL
Use real up_ask/dn_ask entry -- never $0.50 midpoint (33x inflation)
...
</cortex_brief>
This is not a mock. It's the literal stdout of cortex-hook fed a real
research prompt. The 19.65pp number is computed at runtime by summing
three cost components tied to three separate tripwires. The synthesizer
is what turns "here are some lessons" into "here is the blocking number."
Install
git clone https://github.com/jimbokl/LLMCORTEX.git
cd LLMCORTEX
pip install -e ".[dev]"
One runtime dependency (pyyaml). Everything else is stdlib. 107 tests.
~1800 lines of Python. Zero telemetry, zero network calls outside your
machine.
Wire it into Claude Code
# Inside your project directory
cortex init # creates .cortex/store.db
cortex migrate # seeds 13 example tripwires
cortex stats # sanity check
# Hook it up
mkdir -p .claude
cat > .claude/settings.json <<'EOF'
{
"hooks": {
"UserPromptSubmit": [
{"hooks": [{"type": "command", "command": "cortex-hook"}]}
],
"PostToolUse": [
{"hooks": [{"type": "command", "command": "cortex-watch"}]}
]
}
}
EOF
Your next prompt with any relevant keyword (backtest, live, deploy,
5m, directional, fee, poly, ...) fires the hook and injects the
brief. If nothing matches, the hook is silent and Claude Code proceeds
normally.
Verify without launching Claude Code:
echo '{"session_id":"test","prompt":"run a 5m poly directional backtest"}' \
| cortex-hook \
| python -m json.tool
The six layers of defense
Cortex is not one more rule engine. It's six independent subsystems over a single SQLite store, each targeting a different failure mode:
| # | Layer | Answers the question |
|---|---|---|
| 1 | Classify (YAML rules) | What is this task about, and which lessons apply? |
| 2 | Synthesize (cost sums) | Do the drags of the matched lessons exceed any plausible edge? |
| 3 | Fallback (TF-IDF over bodies) | Did the rule engine miss due to a vocabulary gap? |
| 4 | Verify (code grep, auto-run in hook) | Is the bug from incident X still present in the current codebase right now? |
| 5 | Audit (jsonl session log) | What lessons got injected in this session? What tool calls happened? |
| 6 | Detect (runtime regex) | Did the agent act on the warning, or silently ignore it? |
One rule engine fixes one failure mode. Cortex fixes three (blindness, laziness, compositional ignore) and measures effectiveness on a fourth (silent violation).
See docs/architecture.md for the three failure modes that drove the design and which layer targets each.
Why the synthesizer matters
Most memory systems list matched lessons. Cortex composes them.
Real example from the session logs:
Task: test a directional signal on 5m Polymarket slots
Matched tripwires (individual):
- Spread + slippage: 2.4pp per trade
- Information decay: 1.45pp per minute × 5 min = 7.25pp
- Adverse selection on maker fills: 10pp WR drop
Synthesizer:
2.4 + 7.25 + 10.0 = 19.65pp ≥ 5.0pp threshold — FIREOutput: "Any directional 5m strategy needs pre-fee edge > 19.65pp to even be testable."
The three individual lessons were already in the agent's memory. None of them individually said "stop". The synthesizer is the only thing that sums them and produces an actionable blocking signal. It's the bit that closes the gap between "I knew each of those things" and "I didn't put them together in time."
This is the novel contribution. The rest of Cortex is boring plumbing that exists to make the synthesizer useful.
Measured effectiveness (Day 6)
Cortex doesn't just inject. It measures whether the injections are applied. Here's real output from a development session:
$ cortex stats --sessions
Cortex session audit (all-time)
============================================================
Sessions: 14
Total events: 344
Sessions with inject: 5 (36%)
Sessions with fallback: 9 (64%)
Avg tool_calls / session: 23.4
Top matched tripwires:
14 x poly_fee_empirical
10 x real_entry_price
7 x backtest_must_match_prod
4 x never_single_strategy
4 x lookahead_parquet
3 x directional_5m_dead
Synthesis rules fired:
3 x pm_5m_directional_block
Silent violations detected: 1 across 1 session(s)
1 x lookahead_parquet
Tripwire effectiveness (violation rate = viol / hits):
[WARN] lookahead_parquet hits=6 viol=1 rate=0.17
[OK ] poly_fee_empirical hits=16 viol=0 rate=0.00
[OK ] real_entry_price hits=12 viol=0 rate=0.00
[OK ] backtest_must_match_prod hits=9 viol=0 rate=0.00
Cold tripwires (2 never matched in window):
- book_holography_failed
- late_lock_replay_traps
Tripwires with violation_patterns get a runtime effectiveness rate via
cortex-watch. Every tool call after an injection is regex-matched
against the tripwire's patterns. If the agent runs
df['slot_ts'] = (df['ts'] // 300) * 300 after being warned about
lookahead, that's a potential_violation event. The regex is smart
enough to not flag (df['ts'] // 300) * 300 + 300 — the honest
forward-shift fix.
Rate = 0.00 means the lesson was applied every time it was shown. Rate > 0.5 means it was mostly ignored — rephrase the brief, or add blocking enforcement, or kill the tripwire if it's crying wolf. Either way, you have a number instead of a guess.
This closes the feedback loop. You're not arguing about whether Cortex is worth running. You measure it.
Fail-open is a promise
Every code path reachable from cortex-hook or cortex-watch exits 0
on any error. The hierarchy:
- Empty stdin → exit 0, no output
- JSON parse fails → exit 0, no output
- Store missing or locked → exit 0, no output
- Classifier crash → exit 0, no output
- Verifier crash → exit 0, no output
- Any other exception anywhere → exit 0, no output
A broken Cortex must never block the user's interaction. If the hook
throws, Claude Code sees empty additionalContext and proceeds normally.
You don't notice. There are tests for every failure path — see
test_hook.py and test_watch.py.
Rejected paths (honest CHANGELOG)
Cortex wasn't designed in one shot. Some things we built and deleted:
Palace semantic daemon (Day 4 morning → Day 4 afternoon)
Built a mempalace.searcher HTTP daemon for semantic fallback. Killed
it the same day. Three reasons:
- Narrow coverage. The ONNX embedding model is English-only. Short
queries (
show top features) scored zero hits. Russian queries scored zero hits. Only long, domain-specific English queries worked. - Infrastructure cost. Required a ~200MB warm daemon holding ONNX
- chromadb, a ~30s warmup phase, and an HTTP layer between hook and daemon.
- Replacement was trivial. Weighted token overlap over the 13 tripwire bodies gave strictly better coverage at 1% of the complexity. 130 lines of pure Python vs 350 lines + a daemon + an ONNX model.
Lesson: measure before you build. A 20-minute TF-IDF prototype would
have answered the question "is rule-engine + keyword scoring enough?"
before half a day went into daemon plumbing. The delete commit is in
CHANGELOG.md.
This kind of thing stays documented on purpose. If you're evaluating Cortex for your own project, you should see what we tried that didn't work, not just the highlight reel.
FAQ
Why not a vector store?
Vector stores answer "what's semantically similar to this query?" — useful when you already know to ask. Cortex answers "what should I be warned about right now?" — the question the agent faces at task start, automatically, with no query.
They're different questions. Cortex sits alongside your vector store, not instead of it. See docs/architecture.md.
Why not an LLM for classification?
Predictability. Rules are diffable in git. When a false positive shows up in the logs, you tighten the rule and re-test. An LLM classifier is a black box that hallucinates on some non-zero fraction of prompts — and you silently ship bad injections on exactly the cases you're least equipped to notice.
The TF-IDF fallback exists for when the rule engine misses. It's still deterministic. Still grep-able. Still fixable in a PR.
What's the cost at hook time?
Measured on the live BOTWA store (see BENCHMARKS.md):
- Tokenize: 1 µs (p50)
- TF-IDF fallback: 0.5 ms (p50)
- Synthesize: 17 µs (p50)
- Render brief: 12 µs (p50)
- Full
classify_promptin-process: 6.3 ms (p50, 8.8 ms p99) — dominated by YAML rule parse and SQLite store open - End-to-end
cortex-hooksubprocess: ~58 ms per fresh invocation (Python startup + imports + Cortex work)
The 58 ms per-prompt wall-clock cost is well below human perception and
well below any network hop your agent makes. On a warm long-running
process (Day-9 cortex serve mode, not shipped yet) the full pipeline
is <1 ms.
Run cortex bench yourself to reproduce these numbers on your machine.
Does it work on Linux / Mac / Windows?
Built and primarily tested on Windows. Cross-platform paths via pathlib. UTF-8 everywhere. Tests use tmp_path fixtures. CI matrix covers Ubuntu + Python 3.10 / 3.11 / 3.12. No Windows-only syscalls.
Is it on PyPI?
Not yet. Install from this repo with pip install -e .. PyPI release when Day 7-8 features ship.
Is this production-ready?
It's running live in one production project (the Polymarket research repo where it was born) on a shared Claude Code session with PostToolUse + UserPromptSubmit hooks active. 107 tests, ruff clean, fail-open on every error path. Use it with that level of caveat: it's alpha, it's maintained, and the author eats the dog food daily.
Docs
| Document | For |
|---|---|
| docs/architecture.md | Why Cortex exists. Three failure modes. Design decisions. Rejected paths. Cortex vs Palace. |
| docs/authoring.md | How to write tripwires, rules, cost components, synthesis rules, verifiers, violation patterns. Includes a "should this be a tripwire?" decision tree. |
| docs/hooks.md | Claude Code hook contract. Environment variables. Manual testing. Five troubleshooting recipes. |
| BENCHMARKS.md | Real latency / storage / brief-size / token-impact numbers from the live store. |
| CONTRIBUTING.md | Five ground rules. Dev setup. PR checklist. |
| CHANGELOG.md | Day 1-8 feature history with rejected-path post-mortems. |
CLI at a glance
| Command | Purpose |
|---|---|
cortex init |
Create an empty store |
cortex migrate |
Seed tripwires from importers (idempotent) |
cortex list [--domain X] [--severity Y] |
Browse tripwires |
cortex show <id> |
Full detail of one tripwire |
cortex find w1,w2,w3 |
Simulate a trigger match for given words |
cortex stats |
Store summary by severity and domain |
cortex stats --sessions [--days N] |
Session audit: rule hits, cold tripwires, silent violations, effectiveness |
cortex add ... |
Add a tripwire manually |
cortex import-palace "query" [--to-inbox] |
Smart search over Palace, emit draft templates or stage to inbox |
cortex inbox list |
List pending tripwire drafts with validation status |
cortex inbox show <draft_id> |
Show one draft with full JSON and TODO/MISSING markers |
cortex inbox approve <draft_id> [--force] |
Promote a draft to the tripwire store |
cortex inbox reject <draft_id> |
Delete a draft without promoting |
cortex bench [--iterations N] [--no-subprocess] [--json] |
Benchmark subsystem latency, storage footprint, brief sizes |
cortex suggest-patterns <tripwire_id> [--fix-example "..."] |
Auto-generate regex candidates for violation_patterns from past session log data |
cortex reflect [--days N] [--dry-run] |
Haiku DMN reflection: propose new tripwires from session logs into the inbox |
cortex-hook |
UserPromptSubmit hook entry point |
cortex-watch |
PostToolUse audit hook entry point |
cortex-check-lookahead --features-dir DIR |
Standalone lookahead-bug verifier |
About the seed tripwires
The 13 tripwires shipped in cortex/importers/memory_md.py
are real lessons from a Polymarket trading research project where
Cortex was born. They reference specific past research failures: a misread
fee formula, a dead-zone trading hypothesis, a lookahead bug in a feature
pipeline, a single-strategy bleed, a survivorship-biased paper test.
They're included as concrete working examples of the tripwire format, not as rules for your project. Write your own. The examples are there so you can see what "good" looks like. See docs/authoring.md for the full guide and a decision tree.
The rules under cortex/rules/polymarket.yml are likewise domain-specific.
cortex/rules/generic.yml has a few reusable ones (paper trading config,
backtest-vs-prod comparison, feature pipelines).
Roadmap
Day 7 — shipped:
- Pre-flight verifier auto-run from hook — critical tripwires with
verify_cmdrun during injection, opt-in viaCORTEX_VERIFY_ENABLE=1, allow-list guarded (cortex-*/python -m cortexprefixes by default), 3-second hard timeout,shell=False, captured output truncated, fail-safe on any error. Results appear at the top of the brief with[OK]/[FAIL]/[SKIP]status. Static warnings become "the bug is present in your current code RIGHT NOW."
Day 8 — shipped:
- Inbox workflow —
cortex inbox list / show / approve / rejectmanages draft tripwires as JSON files under.cortex/inbox/.cortex import-palace --to-inboxstages Palace hits as editable drafts instead of printing copy-paste templates.validate_draft()reportsTODO/MISSINGfield status;approverefuses to promote drafts with TODO placeholders unless--forceis passed. Closes the Palace → Cortex knowledge transfer loop with a human in the loop, and provides the substrate for Day 9 DMN proposals.
Day 9 — shipped:
- Auto-regex pattern-suggest helper —
cortex suggest-patterns <id>reads session logs for past injections, extracts tool_calls that followed, runs LCS + digit/whitespace generalization, and emits ready-to-paste regex candidates with HIGH/MEDIUM/LOW confidence. The optional--fix-example "snippet"flag verifies the candidate does NOT match a known fix pattern — candidates that do are automatically downgraded to[LOW CONFIDENCE]. No more hand-writing regexes while staring at snippets.
Days 10-12 — shipped:
- Day 10: Verifier blocking mode —
CORTEX_VERIFY_BLOCK=1makes a failed critical verifier exit code 2 fromcortex-hook, rejecting the prompt in Claude Code. The brief still emits so the agent sees the FAIL marker. - Day 11: Haiku DMN reflection loop —
cortex reflectreads recent session logs, summarizes patterns, asks Claude Haiku 4.5 for proposed new tripwires, writes drafts to the inbox for human approval. Optional[dmn]extra,--dry-runbudget preview, ~$0.011 per reflection call. Mock-tested end-to-end; live dry-run verified on 17 sessions / 1067 events. - Day 12: PyPI release prep — wheel builds cleanly
(
cortex_agent-0.1.0-py3-none-any.whl, 71 KB, 21 Python modules, 2 YAML rule files, 4 console scripts). Metadata, urls, classifiers, keywords all filled in. Publishing process documented in CONTRIBUTING.md. Awaiting maintainer credentials for the actual upload.
Rejected: cortex serve daemon. Day 8.5 measured subprocess cost
at ~60ms, below human perception. A warm daemon would save ~10
seconds/day against 8-12 hours of work — not worth the operational
complexity. See CHANGELOG.md and
docs/architecture.md for the full
post-mortem.
Day 13+ (speculative, no plans):
cortex bench --compare <file>for CI regression tracking- Multi-project store federation (one
.cortex/per repo, optional cross-project queries) - Web dashboard for session stats visualization
License
MIT. See LICENSE.
Built because the agents we trust to reason about our code have no institutional memory of what blew up last week. Cortex gives them one.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file llmcortex_agent-0.1.0.tar.gz.
File metadata
- Download URL: llmcortex_agent-0.1.0.tar.gz
- Upload date:
- Size: 94.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c5ac95ec4745d202bf0cdfbabb52e6e7299dec7f393620594adc0701ee123cbc
|
|
| MD5 |
7bb2d13354eb75bed912e63813bbddd5
|
|
| BLAKE2b-256 |
330dc4dd13df36f243870070ce9f62aa84819827b4ac055e32e67cfa86fa0b31
|
File details
Details for the file llmcortex_agent-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llmcortex_agent-0.1.0-py3-none-any.whl
- Upload date:
- Size: 72.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3758bac6cdee0505405ed85a426c5255c387412761286b78726ca1e16b44cf4c
|
|
| MD5 |
486afa5ce5ce932323c6e92c5059c0c5
|
|
| BLAKE2b-256 |
b0e61ae12c2582fbdb3ee89cb2e5d82b6c3ae390b48d93a6b9646d4d82f0e64a
|