Validated execution traces as memory for MCP-based agent tool orchestration
behavioral-memory
Give your agent institutional memory. Drop-in retrieval of validated execution traces for any LLM agent framework.
Your agent makes the same mistakes repeatedly because it has no memory of what worked before. behavioral-memory fixes this — it stores validated execution traces (task → tool chain mappings) and retrieves semantically similar ones at query time, so your agent learns from past successes instead of starting from scratch every time.
Based on: "Behavioral Memory for Tool Orchestration: Semantic Retrieval of Validated Execution Traces in MCP-Based Agent Systems" (IEEE, 2025)
Install
pip install behavioral-memory
Plug Into Your Agent (3 lines)
The library is framework-agnostic. You bring your own LLM, your own agent — behavioral-memory handles the memory layer.
Core API
from behavioral_memory import PlanEngine, ToolRegistry, InMemoryTraceStore
# 1. Choose your LLM (any LangChain-compatible model)
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro", temperature=0)
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
# 2. Create a memory store (no database needed)
store = InMemoryTraceStore(embeddings=embeddings)
# 3. Generate plans with behavioral memory
engine = PlanEngine(llm=llm, store=store)
plan = engine.generate(query="Get revenue data and email a report")
That's it. Your agent now has memory.
With OpenAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
llm = ChatOpenAI(model="gpt-4o", temperature=0)
store = InMemoryTraceStore(embeddings=OpenAIEmbeddings())
engine = PlanEngine(llm=llm, store=store)
With Ollama (fully local)
from langchain_ollama import ChatOllama, OllamaEmbeddings
llm = ChatOllama(model="llama3")
store = InMemoryTraceStore(embeddings=OllamaEmbeddings(model="nomic-embed-text"))
engine = PlanEngine(llm=llm, store=store)
Production: PostgreSQL + pgvector
from behavioral_memory import TraceStore # pip install behavioral-memory[postgres]
store = TraceStore(
embeddings=embeddings,
connection_url="postgresql+psycopg://user:pass@localhost/behavioral_memory",
)
How It Helps Your Agent
Before behavioral memory, your agent sees only the task and tool schemas — it has to figure out orchestration from scratch every time. With behavioral memory, it retrieves validated examples of similar tasks that worked before.
Your Agent's Query: "Build a revenue analysis pipeline"
│
┌────────────┴────────────┐
│ BEHAVIORAL MEMORY │
│ │
│ 1. Retrieve top-k │ ← finds 3 similar validated traces
│ similar traces │ from past successful executions
│ │
│ 2. Merge with tool │ ← current MCP tool schemas
│ schemas │
│ │
│ 3. Generate plan │ ← LLM sees examples + schemas + query
└────────────┬────────────┘
│
▼
Better execution plan
(right tools, right params, right order)
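The three stages in the diagram can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the toy 3-dimensional embeddings, the dict-based trace records, the `send_email` tool name, and the helper names `cosine`, `top_k`, and `build_context` are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, traces, k=3):
    """Stage 1: return the k stored traces closest to the query embedding."""
    ranked = sorted(traces, key=lambda t: cosine(query_vec, t["embedding"]), reverse=True)
    return ranked[:k]

def build_context(query, examples, tool_schemas):
    """Stages 2 and 3 (input side): merge examples, tool names, and the query."""
    lines = ["Validated examples:"]
    for trace in examples:
        lines.append(f"- {trace['task']} -> {' > '.join(trace['tools'])}")
    lines.append("Available tools: " + ", ".join(sorted(tool_schemas)))
    lines.append("Task: " + query)
    return "\n".join(lines)

# Toy embeddings stand in for a real embedding model (hypothetical values).
traces = [
    {"task": "Calculate quarterly revenue",
     "tools": ["query_database", "generate_report"], "embedding": [0.9, 0.1, 0.0]},
    {"task": "Email a weekly summary",
     "tools": ["generate_report", "send_email"], "embedding": [0.1, 0.9, 0.0]},
    {"task": "Search internal docs",
     "tools": ["search_docs"], "embedding": [0.0, 0.1, 0.9]},
]
query_vec = [0.8, 0.3, 0.0]  # pretend embedding of the query below

examples = top_k(query_vec, traces, k=2)
prompt = build_context(
    "Build a revenue analysis pipeline",
    examples,
    {"query_database": {}, "generate_report": {}, "send_email": {}},
)
print(prompt)
```

The resulting string is what an LLM would receive alongside the system prompt; in the library, PlanEngine handles this step for you.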
Seed your memory with domain knowledge
from behavioral_memory import ExecutionTrace, ToolCall
trace = ExecutionTrace(
task_description="Calculate quarterly revenue",
tool_chain=[
ToolCall(step_id="s1", tool_name="query_database",
parameters={"query": "SELECT SUM(quantity * unit_price) FROM order_items"}),
ToolCall(step_id="s2", tool_name="generate_report",
parameters={"source_step": "s1", "format": "markdown_table"}),
],
source="seed",
)
store.add(trace)
Register your own tool schemas
The PlanEngine needs to know what tools your agent has:
from behavioral_memory import ToolSchema, ToolRegistry
schema = ToolSchema(
name="search_docs",
description="Search internal documentation",
parameters_schema={
"type": "object",
"properties": {"query": {"type": "string"}},
"required": ["query"],
},
)
registry = ToolRegistry()
registry.register(schema)
engine = PlanEngine(llm=llm, store=store, registry=registry)
Or load schemas dynamically from an MCP server:
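from behavioral_memory.tools.mcp_client import fetch_mcp_schemas
# fetch_mcp_schemas is awaited, so call it from async code (or wrap it in asyncio.run(...) in a script)
schemas = await fetch_mcp_schemas("http://localhost:3000/sse")
registry.register_many(schemas)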
Validate before storing (Gatekeeper Pipeline)
Don't let bad traces into memory. The gatekeeper runs three checks before accepting a trace:
from behavioral_memory import GatekeeperPipeline
gatekeeper = GatekeeperPipeline(store=store, registry=registry)
result = gatekeeper.submit(trace) # schema check → sandbox → dedup → store
print(result.accepted) # True if all gates passed
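To make the first gate concrete, here is a rough sketch of what a schema check on a trace could look like. It is not the GatekeeperPipeline code: `check_trace`, the dict-based tool calls, the `generate_report` schema, and the `send_fax` tool are hypothetical, and only a few JSON Schema types are handled.

```python
def check_trace(tool_chain, schemas):
    """Return a list of problems; an empty list means this gate passes."""
    json_types = {"string": str, "object": dict, "array": list}
    problems, seen_steps = [], set()
    for call in tool_chain:
        step, params = call["step_id"], call["parameters"]
        schema = schemas.get(call["tool_name"])
        if schema is None:
            # The tool is not registered at all.
            problems.append(f"{step}: unknown tool {call['tool_name']}")
        else:
            # Every required parameter must be present.
            for name in schema.get("required", []):
                if name not in params:
                    problems.append(f"{step}: missing required parameter {name}")
            # Present parameters must match the declared type.
            for name, spec in schema.get("properties", {}).items():
                expected = json_types.get(spec.get("type"))
                if name in params and expected and not isinstance(params[name], expected):
                    problems.append(f"{step}: parameter {name} should be {spec['type']}")
        # A step may only depend on a step that appears earlier in the chain.
        dep = params.get("source_step")
        if dep is not None and dep not in seen_steps:
            problems.append(f"{step}: depends on {dep}, which has not run yet")
        seen_steps.add(step)
    return problems

schemas = {
    "query_database": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    "generate_report": {"type": "object",
                        "properties": {"source_step": {"type": "string"},
                                       "format": {"type": "string"}},
                        "required": ["source_step"]},
}
good = [
    {"step_id": "s1", "tool_name": "query_database",
     "parameters": {"query": "SELECT SUM(quantity * unit_price) FROM order_items"}},
    {"step_id": "s2", "tool_name": "generate_report",
     "parameters": {"source_step": "s1", "format": "markdown_table"}},
]
bad = [
    {"step_id": "s1", "tool_name": "generate_report",
     "parameters": {"format": "markdown_table"}},
    {"step_id": "s2", "tool_name": "send_fax", "parameters": {"source_step": "s3"}},
]
good_problems = check_trace(good, schemas)
bad_problems = check_trace(bad, schemas)
print(good_problems, bad_problems)
```

Returning a list of problems instead of a boolean makes a rejected trace easy to debug, which matters when traces arrive from reviewers rather than from your own code.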
Learn from production (Langfuse Feedback Loop)
Traces logged to Langfuse can be reviewed by domain experts. Positively scored traces automatically flow back into memory through the gatekeeper:
from behavioral_memory import FeedbackPoller, AnnotationHandler
# settings is assumed to be your project's configuration object; it is not defined in this snippet
poller = FeedbackPoller(settings=settings)
handler = AnnotationHandler(poller=poller, gatekeeper=gatekeeper)
handler.run_loop() # continuously polls → validates → stores
Without LangChain (plain Python)
If you don't use LangChain, you can use the lower-level primitives directly:
from behavioral_memory.planner.prompt import SYSTEM_PROMPT, build_prompt
from behavioral_memory.planner.postprocess import postprocess_plan
# Build the prompt yourself
prompt = build_prompt(query="Get revenue data", traces=my_traces, tool_schemas=my_schemas)
# Call your own LLM
raw_output = your_llm.chat(system=SYSTEM_PROMPT, user=prompt)
# Parse the JSON plan
steps = postprocess_plan(raw_output) # returns list[ToolCall]
Persistence and Limitations
| Store | Persistence | Multi-user | Best for |
|---|---|---|---|
| InMemoryTraceStore | Process memory only | No | Dev, CI, demos |
| TraceStore (pgvector) | PostgreSQL, survives restarts | Shared DB, single collection | Production |
Current limitations:
- All traces share one collection (default: validated_traces). No per-user or per-session isolation.
- Langfuse is optional: the core framework (planning, retrieval, gatekeeper) works without it.
- The reference agent at agent/ is a planning demo with stub tool execution; bring your own tool runtime.
Key Results
On a 30-task benchmark with 7 MCP tools (Gemini 2.5 Pro, temperature 0):
| Metric | Zero-Shot | Static Few-Shot | With Behavioral Memory |
|---|---|---|---|
| Tool Selection (TSA) | 63.3% | 70.0% | 83.3% |
| Parameter Validity (PV) | 72.2% | 79.6% | 84.0% |
| Plan Correctness (PCR) | 33.3% | 50.0% | 63.3% |
| Sequence Accuracy (ESA) | 63.3% | 70.0% | 83.3% |
McNemar's test: p = 0.004 vs zero-shot. Plan correctness nearly doubled.
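For readers unfamiliar with McNemar's test, the sketch below computes the exact two-sided p-value from the discordant pairs of a paired comparison. The README does not say which variant was used, which metric it was applied to, or what the discordant counts were, so the counts here are hypothetical. On 30 tasks, 33.3% and 63.3% correspond to 10 and 19 correct plans; a split of 9 tasks fixed and 0 broken is one split consistent with that gain.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: tasks the baseline got wrong and the new system got right
    c: tasks the baseline got right and the new system got wrong
    Under the null hypothesis, each discordant task is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one, one side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, chosen to match a 10 -> 19 improvement.
p = mcnemar_exact(9, 0)
print(round(p, 3))
```

Only the tasks where the two systems disagree enter the test, which is why a 30-task benchmark can still produce a small p-value.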
Reproduced live run (May 2026)
| Metric | Paper | Live Run (pgvector) |
|---|---|---|
| TSA | 83.3% | 86.7% |
| PV | 84.0% | 82.2% |
| PCR | 63.3% | 80.0% |
| ESA | 83.3% | 86.7% |
| McNemar p | 0.004 | 0.039 |
All live-run results fall within the 95% bootstrap confidence intervals.
Architecture
Three layers (from the paper):
| Layer | What it does | Key class |
|---|---|---|
| Behavioral | Store and retrieve validated execution traces via cosine similarity | InMemoryTraceStore / TraceStore |
| Tool | Load tool schemas dynamically via MCP | ToolRegistry / MCPClient |
| Executive | Assemble prompt (traces + schemas + query), call LLM, parse plan | PlanEngine |
Gatekeeper Pipeline guards memory quality with three gates:
- Schema validation — tools exist, params valid, deps logical
- Sandboxed execution — runtime check with timeout
- Semantic deduplication — cosine > 0.95 rejected
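The third gate is simple enough to sketch directly. The snippet below is an illustration under assumptions, not the library's code: `is_duplicate` and the 2-dimensional vectors are hypothetical, and the comparison uses the strict "> 0.95" stated in the list above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_duplicate(candidate_vec, stored_vecs, threshold=0.95):
    """True when the candidate is more similar than the threshold to any stored trace."""
    return any(cosine(candidate_vec, v) > threshold for v in stored_vecs)

# Toy embeddings of traces already in memory (hypothetical values).
stored = [[1.0, 0.0], [0.0, 1.0]]

near_copy = [0.99, 0.05]  # almost identical to the first stored trace
novel = [0.7, 0.7]        # halfway between both stored traces

print(is_duplicate(near_copy, stored))
print(is_duplicate(novel, stored))
```

A high threshold such as 0.95 only rejects near-identical traces, so paraphrased tasks with different tool chains can still enter memory.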
Reproduce the Paper
git clone https://github.com/harsh-kr11/behavioral-memory.git
cd behavioral-memory
pip install -e ".[agent,eval]"
export GOOGLE_API_KEY=your-key
# Run the 30-task benchmark
python examples/run_live_benchmark.py
# Quick test (5 tasks)
python examples/run_live_benchmark.py --limit 5
# Exact paper reproduction (with pgvector)
pip install -e ".[postgres]"
docker compose up -d # or: podman-compose up -d
python examples/run_live_benchmark.py --postgres
# Gatekeeper ablation study (Section IV.D.5)
python examples/gatekeeper_ablation.py --verbose
# Validate pipeline offline (no API keys)
python examples/validate_pipeline.py
A reference LangGraph agent is included at agent/ for demo purposes.
Cursor Agent Skill
This repo ships a Cursor Agent Skill for guided integration. Open this repo in Cursor and type /behavioral-memory in the Agent chat to invoke the skill — it walks through store setup, seed traces, feedback loops, Langfuse v4 wiring, and pgvector persistence.
# Verify your setup after following the skill
python .cursor/skills/behavioral-memory/scripts/verify_setup.py
See integration-examples.md for LangGraph, FastAPI, and production patterns.
Development
pip install -e ".[dev,eval]"
make test # 104 tests
make lint # ruff check
make typecheck # mypy (strict)
make ci # all checks
Configuration
All settings are read from environment variables or a .env file:
| Variable | Default | Description |
|---|---|---|
| FEW_SHOT_K | 3 | Traces to retrieve per query |
| MAX_PROMPT_TOKENS | 3500 | Token budget for prompt |
| SIMILARITY_DEDUP_THRESHOLD | 0.95 | Dedup cosine threshold |
| SANDBOX_TIMEOUT_SECONDS | 30 | Gatekeeper sandbox timeout |
| VECTOR_STORE_URL | (none) | PostgreSQL connection (only for TraceStore) |
| LANGFUSE_SECRET_KEY | (none) | Langfuse secret (optional) |
| LANGFUSE_PUBLIC_KEY | (none) | Langfuse public key (optional) |
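If you want to read the same variables in your own code, a minimal sketch follows. `SketchSettings` and `load_settings` are hypothetical names and are not the library's settings object; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SketchSettings:
    """Hypothetical container mirroring the variables in the table above."""
    few_shot_k: int
    max_prompt_tokens: int
    similarity_dedup_threshold: float
    sandbox_timeout_seconds: int
    vector_store_url: Optional[str]

def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    return SketchSettings(
        few_shot_k=int(env.get("FEW_SHOT_K", "3")),
        max_prompt_tokens=int(env.get("MAX_PROMPT_TOKENS", "3500")),
        similarity_dedup_threshold=float(env.get("SIMILARITY_DEDUP_THRESHOLD", "0.95")),
        sandbox_timeout_seconds=int(env.get("SANDBOX_TIMEOUT_SECONDS", "30")),
        vector_store_url=env.get("VECTOR_STORE_URL"),  # no default: only needed for TraceStore
    )

defaults = load_settings({})
custom = load_settings({"FEW_SHOT_K": "5", "SIMILARITY_DEDUP_THRESHOLD": "0.9"})
print(defaults)
print(custom)
```

Passing a plain dict instead of os.environ keeps the loader easy to test without touching the real environment.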
Citation
@inproceedings{khan2025behavioral,
title={Behavioral Memory for Tool Orchestration: Semantic Retrieval of
Validated Execution Traces in MCP-Based Agent Systems},
author={Khan, Mehvash and Kumar, Harsh and Jangir, Rahul},
booktitle={IEEE Conference Proceedings},
year={2025}
}
License
Apache 2.0 — See LICENSE.