Project description
state-trace
Graph-native working memory for coding agents: typed memories, causal retrieval, bounded capacity, and compact briefs for small models.
state-trace is a bounded working-memory layer for coding and debugging agents that need the right file, failure, and next action under tight token budgets. It is not a replacement for a general-purpose temporal knowledge graph like Graphiti — see ARCHITECTURE.md for the honest comparison.
What it is optimized for:
- artifact-first retrieval for coding agents
- current-vs-stale task state (`engine.current_state()`, `engine.failed_hypotheses()`)
- compact harness-facing briefs for smaller models
- online agent loops and post-hoc trajectory ingestion
- bounded memory with decay, compression, and lifecycle retention
- MCP-mountable, local-first deployment
Headline: SWE-bench-Verified localization — n=500
The credibility benchmark. Cold-start artifact localization on the full SWE-bench-Verified test split: given only the GitHub issue text and hints (no trajectory), the task is to rank the correct patch file first (Artifact@1) and within the top five (Artifact@5).
pip install -e ".[bench]"
python3 examples/swebench_verified_eval.py --limit 500 --backends no_memory bm25 state_trace graphiti
| backend | n | Artifact@1 | Artifact@1 CI | Artifact@5 | Artifact@5 CI | AvgLatencyMs |
|---|---|---|---|---|---|---|
| no_memory | 500 | 0.000 | [0.000, 0.000] | 0.000 | [0.000, 0.000] | 0.00 |
| bm25 | 500 | 0.176 | [0.144, 0.208] | 0.300 | [0.262, 0.338] | 0.14 |
| state_trace | 500 | 0.150 | [0.122, 0.182] | 0.150 | [0.122, 0.182] | 16.92 |
| graphiti | 500 | 0.098 | [0.072, 0.126] | 0.216 | [0.182, 0.254] | 4901.38 |
What this says, plainly:
- state_trace beats Graphiti on Artifact@1 (0.150 vs 0.098, non-overlapping 95% CIs). On "put the right file first, from just an issue," the typed coding-agent ontology helps.
- state_trace does not beat BM25 on Artifact@1 (0.150 vs 0.176, overlapping CIs). On cold-start without a trajectory, lexical search over file-path tokens is still a strong baseline.
- state_trace loses on Artifact@5 because `retrieve_brief` currently surfaces a single primary patch candidate, so A@5 ≡ A@1. This is a known brief-shape issue, not a retrieval-quality issue; see the "Known limitations" note below.
Where state_trace actually differentiates is not one-shot cold-start localization — it's the trajectory-informed and online-loop lane where current_state and failed_hypotheses come into play. See BENCHMARKS.md for the trajectory benchmarks (small-N, caveated).
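The 95% intervals above can be reproduced to a close approximation with a normal-approximation proportion CI. This is an illustrative sketch, not the benchmark script's code; the script may use a different interval (e.g. Wilson or bootstrap), so the third decimal can differ:

```python
import math

def ci95(successes: int, n: int) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return (max(0.0, p - half), min(1.0, p + half))

# bm25 Artifact@1: 0.176 on n=500 means 88 hits.
# This yields roughly [0.143, 0.209], close to the reported [0.144, 0.208].
lo, hi = ci95(88, 500)
```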
Known limitations of this benchmark
- `state_trace` returns a single-file brief, so A@5 = A@1. Fixing the brief to expose top-5 candidates is a straightforward follow-up.
- Graphiti is run with stubbed LLM/embedder (deterministic hash embeddings, BM25 + cosine + BFS → RRF). Its full LLM-entity-extraction pipeline is not exercised — that's the same simplification used in `graphiti_head_to_head_eval.py` and is documented there.
- Cold-start localization from issue text is only one problem a memory layer solves. It is deliberately the hardest one we ship a number for; the trajectory benchmarks in BENCHMARKS.md test the other axes.
What makes the architecture different
Typed coding-agent ontology, not generic Entity/Edge:
- Nodes: `task`, `observation`, `decision`, `file`, `goal`, `session`, `command`, `test`, `symbol`, `patch_hunk`, `error_signature`, `episode`
- Edges: `patches_file`, `fails_in`, `verified_by`, `rejected_by`, `supersedes`, `contradicts`, `solves`, `derived_from`, `precedes`, `motivates`, and more
- Intent routing: the retrieval scorer re-prioritizes edge types per query intent (`locate_file`, `failure_analysis`, `history`, `general`).
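The idea behind intent routing can be sketched as a per-intent weight table over edge types. The weights and the lookup below are illustrative only; the real scorer in state_trace differs:

```python
# Hypothetical per-intent edge-type weights; values are made up for illustration.
EDGE_WEIGHTS = {
    "locate_file":      {"patches_file": 3.0, "fails_in": 1.5, "precedes": 0.5},
    "failure_analysis": {"fails_in": 3.0, "contradicts": 2.0, "rejected_by": 2.0},
    "history":          {"precedes": 2.5, "supersedes": 2.0, "derived_from": 1.5},
    "general":          {},  # falls back to a uniform weight of 1.0
}

def score_edge(intent: str, edge_type: str, base: float = 1.0) -> float:
    """Re-weight an edge's contribution to a node's score by query intent."""
    return base * EDGE_WEIGHTS.get(intent, {}).get(edge_type, 1.0)
```

With a table like this, a `locate_file` query boosts `patches_file` evidence, while the same edge contributes only its base weight under `general`.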
Bounded working memory as a first-class constraint:
- `enforce_capacity()` runs decay, compression, and summarization on every step.
- `current_state(session)` answers "what's live right now" directly — cheap for state-trace, expensive for a general-purpose knowledge graph.
- `failed_hypotheses(session)` returns invalidated, superseded, or unrecovered-error nodes — the "don't propose this again" signal.
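A bounded working memory of this shape can be sketched as a decay-and-evict loop. This is not the library's implementation; field names and the decay factor are assumptions for illustration:

```python
def enforce_capacity(memories: list[dict], budget: float, decay: float = 0.9) -> list[dict]:
    """Decay every memory's importance, then keep the most important
    items whose summed weight still fits the budget (illustrative only)."""
    for m in memories:
        m["importance"] *= decay  # every step ages every memory
    kept, total = [], 0.0
    for m in sorted(memories, key=lambda m: m["importance"], reverse=True):
        if total + m["weight"] <= budget:
            kept.append(m)
            total += m["weight"]
    return kept

mems = [
    {"id": "task", "importance": 0.9, "weight": 2.0},
    {"id": "obs",  "importance": 0.5, "weight": 2.0},
    {"id": "old",  "importance": 0.1, "weight": 2.0},
]
kept = enforce_capacity(mems, budget=4.0)  # "old" no longer fits and is evicted
```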
Local-first, MCP-mountable:
- Hot graph is an in-process `networkx.MultiDiGraph`. Cold storage is WAL SQLite + FTS5.
- `state-trace-mcp` is a stdio MCP server you can mount in Claude Code / Cursor / Codex CLI.
See ARCHITECTURE.md for why these choices matter vs. Graphiti, and BENCHMARKS.md for the smaller repo-local benchmarks.
vs. Graphiti
Graphiti is the stronger general-purpose temporal knowledge graph for AI agents. state-trace is narrower: working memory for one coding/debugging session at a time. We're not claiming to replace Graphiti — we're claiming a specific lane where the tradeoffs land differently.
Each row below is a concrete, measured axis, not a vibe.
| Axis | state-trace | Graphiti | Winner for coding agents |
|---|---|---|---|
| Artifact@1 on SWE-bench-Verified, n=500 | 0.150 [0.122, 0.182] | 0.098 [0.072, 0.126] | state-trace — non-overlapping 95% CIs |
| Per-retrieval latency (same benchmark) | 16.9 ms | 4,901 ms | state-trace — ~290× faster |
| Write path per agent step | Typed insert, zero LLM calls | `add_episode` → LLM entity extraction each step | state-trace — cheaper, deterministic, no API key |
| Default deploy | Pure Python + local SQLite/JSON; `state-trace-mcp` stdio binary | Neo4j / Kuzu / FalkorDB graph DB + embedder + LLM | state-trace — local-first, no external services |
| Coding-agent ontology | Typed: `file`, `patch_hunk`, `error_signature`, `test`, `command`, `symbol`, `observation`, `decision`, `task`, `goal`, `session`, `episode` | Generic `EntityNode` / `EntityEdge` / `EpisodicNode` | state-trace — retrieval scorer routes on these types |
| "What's true right now in this session?" | `engine.current_state(session)` — direct O(graph) query | Inferred from temporal facts via Cypher or LLM | state-trace — first-class API |
| "What have I already tried and rejected?" | `engine.failed_hypotheses(session)` — direct query returning `invalid_at` + superseded + unrecovered-error nodes | Has to be inferred from `invalid_at` + contradictions | state-trace — first-class API |
| Working-memory capacity bound | `enforce_capacity` with decay + compression + lifecycle retention. Long-horizon pressure benchmark: Artifact@1 0.771 while staying within a 96-unit budget 100% of the time | Unbounded by design; relies on the graph DB to scale | state-trace for long debugging sessions that need a memory ceiling |
| Small-model brief | `retrieve_brief` produces a ~220-token structured brief (`patch_file`, `rerun_command`, `tests_to_rerun`, `failed_attempts`, `recommended_actions`, …) that fits a tight budget | Returns raw nodes/facts; caller compresses | state-trace — built for small-model harnesses |
| MCP-mountable | `state-trace-mcp` stdio server in the `[mcp]` extra — 11 tools exposed, drop into `~/.claude/settings.json` | No official MCP server; library-first | state-trace — plug straight into Claude Code / Cursor / Codex / opencode |
| Long-lived temporal knowledge across weeks | Scoped to a session or repo namespace; no cross-namespace fact merging | First-class; bi-temporal validity, contradiction resolution, fact supersession across episodes | Graphiti |
| Multi-tenant SaaS scale | Single-writer process model; authoritative graph is in-process networkx | Built for it on Neo4j/Kuzu substrate | Graphiti |
| Cross-session learning about users / orgs / policies | Out of scope | First-class | Graphiti |
When to pick which
Use state-trace when:
- Your agent is editing code in a single debugging or refactoring session.
- You talk to an MCP client (Claude Code, Cursor, Codex CLI, opencode) and want working memory without standing up a graph DB.
- Per-action latency matters — you're calling memory on every tool invocation in an agent loop.
- You run on small models where a 220-token structured brief beats a 1,000-token raw dump.
- You need "what file should I patch / what did I already try" to be a direct query, not inferred.
Use Graphiti when:
- You need a knowledge graph of facts about the world, users, or an organization that evolves across weeks.
- Multi-tenant, multi-agent shared memory is part of the design.
- You're willing to run Neo4j/Kuzu and pay the LLM-extraction cost per ingest for the ontological payoff.
- Your retrieval patterns are richer than "which file, which test, which failed hypothesis."
They solve adjacent problems. The only reason a comparison is even interesting is that both ship as "memory for AI agents" — the honest answer is they're different products that happen to live on the same shelf.
Installation
uv sync # or: pip install -e .
pip install -e ".[mcp]" # stdio MCP server for Claude Code / Cursor / Codex CLI
pip install -e ".[bench]" # graphiti-core[kuzu] + datasets (for the headline benchmark)
pip install -e ".[llm]" # OpenAI-backed live benchmarks + LLM ingestion
pip install -e ".[adapters]" # LangGraph / LlamaIndex adapter shims
pip install -e ".[api]" # FastAPI app
Distribution name: state-trace. Python import path: state_trace.
Quickstart
from state_trace import MemoryEngine
engine = MemoryEngine(capacity_limit=24.0, storage_path="memory.json")
task = engine.store(
"Fix login by tracing the refresh token path",
{"type": "task", "session": "auth-debug", "goal": "restore login", "file": "auth.ts", "importance": 0.92},
)
engine.store(
"Login still returns 401 after refresh token exchange",
{"type": "observation", "session": "auth-debug", "goal": "restore login", "file": "auth.ts",
"blocks": [task.id], "importance": 0.88},
)
engine.store(
"Authorization header is dropped before the retry request reaches auth.ts",
{"type": "decision", "session": "auth-debug", "goal": "restore login",
"related_to": [task.id], "file": "auth.ts", "importance": 0.91},
)
result = engine.retrieve("Why is login still broken?", {"session": "auth-debug", "goal": "restore login"})
Current state, live hypotheses, failed attempts
The architectural wedge. These APIs return a live view of the session without re-ranking:
state = engine.current_state(session="auth-debug", goal="restore login")
# → {"active_task": ..., "latest_observation": ..., "active_files": [...], ...}
failures = engine.failed_hypotheses(session="auth-debug")
# → [{"id": ..., "reason": ["superseded"], "content": "Login still returns 401 ..."}, ...]
current_state filters out invalidated and superseded nodes; failed_hypotheses surfaces them as "do not propose again" context. A general-purpose temporal graph has to infer this from fact updates; here it's a direct query.
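The split between "live" and "failed" can be sketched as a pair of filters over node records. Field names here follow the example output above (`invalid_at`, plus an assumed `superseded_by` field); the library's actual logic is richer:

```python
# Illustrative semantics only: a node is live unless it has been invalidated
# or superseded; failed_hypotheses is the complement, with the reason attached.
def live_nodes(nodes: list[dict]) -> list[dict]:
    return [n for n in nodes
            if not n.get("invalid_at") and not n.get("superseded_by")]

def failed_nodes(nodes: list[dict]) -> list[dict]:
    return [
        {"id": n["id"],
         "reason": [k for k in ("invalid_at", "superseded_by") if n.get(k)]}
        for n in nodes
        if n.get("invalid_at") or n.get("superseded_by")
    ]

nodes = [
    {"id": "h1", "invalid_at": "2024-05-01"},  # invalidated hypothesis
    {"id": "h2"},                              # still live
    {"id": "h3", "superseded_by": "h2"},       # replaced by h2
]
```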
MCP Server
pip install -e ".[mcp]"
state-trace-mcp
Environment config:
- `STATE_TRACE_STORAGE_PATH` — durable path; a `.db`/`.sqlite` suffix selects the SQLite backend. Default: `~/.state-trace/memory.db`.
- `STATE_TRACE_NAMESPACE` — default namespace (e.g. the repo slug).
- `STATE_TRACE_CAPACITY_LIMIT` — working-memory budget (default `256`).
Tools exposed: store, retrieve, retrieve_brief, record_action, record_observation, record_test_result, ingest_agent_log_file, current_state, failed_hypotheses, list_namespaces, graph_snapshot.
Example Claude Code config (~/.claude/settings.json):
{
"mcpServers": {
"state-trace": {
"command": "state-trace-mcp",
"env": {
"STATE_TRACE_STORAGE_PATH": "/Users/me/.state-trace/memory.db",
"STATE_TRACE_NAMESPACE": "repo-x"
}
}
}
}
Online agent loop
engine = MemoryEngine(capacity_limit=256.0)
ctx = {"session": "auth-debug", "goal": "restore login", "repo": "example/auth-service"}
engine.record_action('open "src/auth.ts"', {**ctx, "files": ["src/auth.ts"]})
engine.record_observation(
"AttributeError: login still fails with a 401 in src/auth.ts",
{**ctx, "files": ["src/auth.ts"], "status": "error"},
)
engine.record_action('edit "src/auth.ts"', {**ctx, "files": ["src/auth.ts"], "action_kind": "edit"})
engine.record_test_result(
"pytest tests/test_auth.py::test_refresh_retry",
"tests/test_auth.py::test_refresh_retry PASSED",
{**ctx, "files": ["src/auth.ts", "tests/test_auth.py::test_refresh_retry"]},
)
brief = engine.retrieve_brief(
"Which file should I patch and what test should I rerun?",
{"session": "auth-debug", "goal": "restore login"},
mode="small_model",
)
The brief fields: patch_file, rerun_command, target_files, tests_to_rerun, current_state, failed_attempts, recommended_actions, evidence, symbols, patch_hints, confidence, token_estimate.
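A harness might render that brief into a compact prompt for a small model. The field names follow the list above; the layout, defaults, and the `render_brief` helper itself are ours, not part of the library:

```python
def render_brief(brief: dict) -> str:
    """Hypothetical harness-side rendering of a retrieve_brief result."""
    lines = [
        f"PATCH FILE: {brief.get('patch_file', '?')}",
        f"RERUN: {brief.get('rerun_command', '?')}",
        "ALREADY FAILED: " + "; ".join(brief.get("failed_attempts") or ["none"]),
        "NEXT: " + "; ".join(brief.get("recommended_actions") or ["investigate"]),
    ]
    return "\n".join(lines)

prompt = render_brief({
    "patch_file": "src/auth.ts",
    "rerun_command": "pytest tests/test_auth.py::test_refresh_retry",
    "failed_attempts": ["rotate refresh token"],
    "recommended_actions": ["re-add Authorization header on retry"],
})
```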
Trajectory ingestion
engine = MemoryEngine(capacity_limit=256.0)
engine.store_agent_log_file("examples/data/agent_logs/marshmallow__marshmallow-1867.json")
Supported inputs: normalized agent_log JSON, raw SWE-agent .traj files, raw OpenHands event JSON logs.
Live solve-rate (next credibility step)
examples/swebench_verified_solve_rate.py scaffolds end-to-end solve-rate measurement: state-trace brief → LLM patch proposal → SWE-bench-Verified prediction JSONL. It does not run the swebench docker harness; that step is documented in the script's header.
python3 examples/swebench_verified_solve_rate.py --limit 5 --model gpt-5.1-mini --dry-run
Storage backends
MemoryEngine(storage_path=...) picks the backend from the file extension:
- `.db` / `.sqlite` / `.sqlite3` — durable SQLite with WAL journal + FTS5 seed index. Recommended for long-running agent harnesses.
- Any other path — JSON blob (simple, single-writer, fine for benchmarks).
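The suffix-based dispatch can be sketched as follows; the real selection happens inside `MemoryEngine(storage_path=...)`, and this standalone helper is illustrative only:

```python
from pathlib import Path

def pick_backend(storage_path: str) -> str:
    """Choose a storage backend from the file extension (sketch of the
    dispatch described above, not the library's internal code)."""
    suffix = Path(storage_path).suffix.lower()
    return "sqlite" if suffix in {".db", ".sqlite", ".sqlite3"} else "json"
```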
See ARCHITECTURE.md for the "why networkx + SQLite, not Neo4j" explainer.
Namespaces
engine = MemoryEngine(storage_path="memory.db", namespace="payments-api")
engine.retrieve("why is login broken?") # scoped to payments-api by default
engine.retrieve("...", include_all_namespaces=True) # opt out
Nodes without a namespace remain visible in every view so pre-namespace data is not lost.
Framework adapters
from state_trace.adapters import StateTraceLangGraphMemory, StateTraceLlamaIndexMemory
lg_memory = StateTraceLangGraphMemory(default_session="coding-session")
li_memory = StateTraceLlamaIndexMemory(session_id="agent-session")
Neither adapter imports the host framework; they satisfy the duck-typed memory contract used by each.
FastAPI
from state_trace.api import app # POST /store, /retrieve, /retrieve_brief, GET /graph
Pass "explain": true on retrieve to include per-node score breakdowns.
Tests
python3 -m pytest -q
Benchmarks
Full set of repo-local benchmarks and their honest caveats lives in BENCHMARKS.md. The SWE-bench-Verified row above is the only one that's at a scale worth citing externally.
Positioning
See vs. Graphiti above for the head-to-head comparison and ARCHITECTURE.md for the architecture tradeoffs in detail. tl;dr: different products, adjacent problems — state-trace owns the narrow coding-agent working-memory lane; Graphiti owns weeks-of-history temporal knowledge graphs.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file state_trace-0.2.0.tar.gz.
File metadata
- Download URL: state_trace-0.2.0.tar.gz
- Upload date:
- Size: 73.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `a30e13a62f66612357a5ae2177753caa50e589996b7738ae17f4f8f7afe66d82` |
| MD5 | `72070cb3aee3ffd63f5f5eedc7b4e5d4` |
| BLAKE2b-256 | `17ef94189c2d10c97e89df52d04dbee78681fc9204953fd81189e69b5f18a401` |
Provenance
The following attestation bundles were made for state_trace-0.2.0.tar.gz:
Publisher: `release.yml` on razroo/state-trace
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: state_trace-0.2.0.tar.gz
- Subject digest: a30e13a62f66612357a5ae2177753caa50e589996b7738ae17f4f8f7afe66d82
- Sigstore transparency entry: 1367218125
- Permalink: razroo/state-trace@e3ae59eb60efee957130d16e9eae66e062ab9d3f
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/razroo
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e3ae59eb60efee957130d16e9eae66e062ab9d3f
- Trigger Event: release
File details
Details for the file state_trace-0.2.0-py3-none-any.whl.
File metadata
- Download URL: state_trace-0.2.0-py3-none-any.whl
- Upload date:
- Size: 62.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `83ec0de0c4e14d16dd79aa6ce45f979c30c4f3e74594052d35219e3e9cd53ab1` |
| MD5 | `9660207cc4c0f73b983d6edccdc87b71` |
| BLAKE2b-256 | `364fbec39a8e05b75c7af11d4c24485c71979d73fc6acd8b7673eee8d9be1798` |
Provenance
The following attestation bundles were made for state_trace-0.2.0-py3-none-any.whl:
Publisher: `release.yml` on razroo/state-trace
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: state_trace-0.2.0-py3-none-any.whl
- Subject digest: 83ec0de0c4e14d16dd79aa6ce45f979c30c4f3e74594052d35219e3e9cd53ab1
- Sigstore transparency entry: 1367218217
- Permalink: razroo/state-trace@e3ae59eb60efee957130d16e9eae66e062ab9d3f
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/razroo
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e3ae59eb60efee957130d16e9eae66e062ab9d3f
- Trigger Event: release