
Validated execution traces as memory for MCP-based agent tool orchestration


behavioral-memory

Give your agent institutional memory. Drop-in retrieval of validated execution traces for any LLM agent framework.


Your agent makes the same mistakes repeatedly because it has no memory of what worked before. behavioral-memory fixes this — it stores validated execution traces (task → tool chain mappings) and retrieves semantically similar ones at query time, so your agent learns from past successes instead of starting from scratch every time.

Based on: "Behavioral Memory for Tool Orchestration: Semantic Retrieval of Validated Execution Traces in MCP-Based Agent Systems" (IEEE, 2025)


Install

pip install behavioral-memory

Plug Into Your Agent (3 steps)

The library is framework-agnostic. You bring your own LLM, your own agent — behavioral-memory handles the memory layer.

Core API

from behavioral_memory import PlanEngine, ToolRegistry, InMemoryTraceStore

# 1. Choose your LLM (any LangChain-compatible model)
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro", temperature=0)
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

# 2. Create a memory store (no database needed)
store = InMemoryTraceStore(embeddings=embeddings)

# 3. Generate plans with behavioral memory
engine = PlanEngine(llm=llm, store=store)
plan = engine.generate(query="Get revenue data and email a report")

That's it. Your agent now has memory.

With OpenAI

from behavioral_memory import PlanEngine, InMemoryTraceStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o", temperature=0)
store = InMemoryTraceStore(embeddings=OpenAIEmbeddings())
engine = PlanEngine(llm=llm, store=store)

With Ollama (fully local)

from behavioral_memory import PlanEngine, InMemoryTraceStore
from langchain_ollama import ChatOllama, OllamaEmbeddings

llm = ChatOllama(model="llama3")
store = InMemoryTraceStore(embeddings=OllamaEmbeddings(model="nomic-embed-text"))
engine = PlanEngine(llm=llm, store=store)

Production: PostgreSQL + pgvector

from behavioral_memory import TraceStore  # pip install behavioral-memory[postgres]

store = TraceStore(
    embeddings=embeddings,
    connection_url="postgresql+psycopg://user:pass@localhost/behavioral_memory",
)

How It Helps Your Agent

Before behavioral memory, your agent sees only the task and tool schemas — it has to figure out orchestration from scratch every time. With behavioral memory, it retrieves validated examples of similar tasks that worked before.

Your Agent's Query: "Build a revenue analysis pipeline"
                │
   ┌────────────┴────────────┐
   │   BEHAVIORAL MEMORY     │
   │                         │
   │  1. Retrieve top-k      │  ← finds 3 similar validated traces
   │     similar traces      │     from past successful executions
   │                         │
   │  2. Merge with tool     │  ← current MCP tool schemas
   │     schemas             │
   │                         │
   │  3. Generate plan       │  ← LLM sees examples + schemas + query
   └────────────┬────────────┘
                │
                ▼
         Better execution plan
         (right tools, right params, right order)
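
The three steps in the diagram can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the hand-written 3-dimensional vectors stand in for real embeddings, and `cosine`, `top_k`, and `assemble_prompt` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, traces, k=3):
    """Step 1: rank stored traces by similarity to the query, keep the best k."""
    ranked = sorted(traces, key=lambda t: cosine(query_vec, t["vector"]), reverse=True)
    return ranked[:k]

def assemble_prompt(query, examples, tool_schemas):
    """Steps 2-3: merge retrieved examples with tool schemas into one prompt."""
    lines = ["Available tools:"]
    lines += [f"- {s['name']}: {s['description']}" for s in tool_schemas]
    lines.append("Validated examples:")
    lines += [f"- {e['task']} -> {' -> '.join(e['tools'])}" for e in examples]
    lines.append(f"Task: {query}")
    return "\n".join(lines)

# Toy data: hand-written vectors in place of a real embedding model.
traces = [
    {"task": "Calculate quarterly revenue",
     "tools": ["query_database", "generate_report"], "vector": [0.9, 0.1, 0.0]},
    {"task": "Email the weekly summary",
     "tools": ["generate_report", "send_email"], "vector": [0.1, 0.9, 0.0]},
    {"task": "Search docs for onboarding",
     "tools": ["search_docs"], "vector": [0.0, 0.1, 0.9]},
]
schemas = [{"name": "query_database", "description": "Run a SQL query"}]

examples = top_k([0.8, 0.2, 0.0], traces, k=2)
prompt = assemble_prompt("Build a revenue analysis pipeline", examples, schemas)
print(prompt)
```

The prompt text is what an LLM would then receive in step 3; the real prompt format used by PlanEngine may differ.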

Seed your memory with domain knowledge

from behavioral_memory import ExecutionTrace, ToolCall

trace = ExecutionTrace(
    task_description="Calculate quarterly revenue",
    tool_chain=[
        ToolCall(step_id="s1", tool_name="query_database",
                 parameters={"query": "SELECT SUM(quantity * unit_price) FROM order_items"}),
        ToolCall(step_id="s2", tool_name="generate_report",
                 parameters={"source_step": "s1", "format": "markdown_table"}),
    ],
    source="seed",
)
store.add(trace)

Register your own tool schemas

The PlanEngine needs to know what tools your agent has:

from behavioral_memory import ToolSchema, ToolRegistry

schema = ToolSchema(
    name="search_docs",
    description="Search internal documentation",
    parameters_schema={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
)

registry = ToolRegistry()
registry.register(schema)
engine = PlanEngine(llm=llm, store=store, registry=registry)

Or load schemas dynamically from an MCP server:

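import asyncio
from behavioral_memory.tools.mcp_client import fetch_mcp_schemas

# fetch_mcp_schemas is awaitable; in a plain script, drive it with asyncio.run.
# Inside an already-running event loop (e.g. a notebook), use `await` directly.
schemas = asyncio.run(fetch_mcp_schemas("http://localhost:3000/sse"))
registry.register_many(schemas)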

Validate before storing (Gatekeeper Pipeline)

Don't let bad traces into memory. The gatekeeper runs three checks before accepting a trace:

from behavioral_memory import GatekeeperPipeline

gatekeeper = GatekeeperPipeline(store=store, registry=registry)
result = gatekeeper.submit(trace)  # schema check → sandbox → dedup → store
print(result.accepted)  # True if all gates passed
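
To make the first check concrete, here is a rough sketch of what a schema gate could verify. It is an assumption-laden illustration, not GatekeeperPipeline's code: `Step` is a stand-in for ToolCall, `schema_errors` is a hypothetical helper, and the `source_step` dependency convention is borrowed from the seed example above.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """Stand-in for a tool call: id, tool name, and parameters."""
    step_id: str
    tool_name: str
    parameters: dict = field(default_factory=dict)

def schema_errors(steps, tool_schemas):
    """Return a list of problems; an empty list means the trace passes.

    tool_schemas maps tool name -> JSON-Schema-style dict with a "required" list.
    """
    errors = []
    seen = set()  # step ids that appear earlier in the chain
    for step in steps:
        schema = tool_schemas.get(step.tool_name)
        if schema is None:
            errors.append(f"{step.step_id}: unknown tool {step.tool_name!r}")
        else:
            for name in schema.get("required", []):
                if name not in step.parameters:
                    errors.append(f"{step.step_id}: missing parameter {name!r}")
        # A step may only depend on a step that ran before it.
        dep = step.parameters.get("source_step")
        if dep is not None and dep not in seen:
            errors.append(f"{step.step_id}: depends on unknown or later step {dep!r}")
        seen.add(step.step_id)
    return errors

tool_schemas = {
    "query_database": {"required": ["query"]},
    "generate_report": {"required": ["source_step", "format"]},
}
good = [
    Step("s1", "query_database", {"query": "SELECT 1"}),
    Step("s2", "generate_report", {"source_step": "s1", "format": "markdown_table"}),
]
bad = [
    Step("s1", "generate_report", {"source_step": "s2", "format": "markdown_table"}),
    Step("s2", "send_fax", {}),
]
print(schema_errors(good, tool_schemas))  # []
print(schema_errors(bad, tool_schemas))   # two problems: bad dependency, unknown tool
```

Returning a list of errors rather than a boolean makes rejected traces easier to debug.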

Learn from production (Langfuse Feedback Loop)

Traces logged to Langfuse can be reviewed by domain experts. Positively scored traces automatically flow back into memory through the gatekeeper:

from behavioral_memory import FeedbackPoller, AnnotationHandler

poller = FeedbackPoller(settings=settings)
handler = AnnotationHandler(poller=poller, gatekeeper=gatekeeper)
handler.run_loop()  # continuously polls → validates → stores
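
Conceptually, one pass of that loop filters reviewed traces by score and hands the survivors to the gatekeeper. The sketch below shows only that filtering idea; the annotation dict shape, the score scale, and the names `accept_positive` and `fake_submit` are all assumptions and do not reflect Langfuse's data model or the library's handler.

```python
def accept_positive(annotations, submit, min_score=1.0):
    """Submit traces whose expert score meets the threshold; return accepted ids."""
    accepted = []
    for ann in annotations:
        if ann["score"] < min_score:
            continue  # negative or neutral review: never reaches memory
        if submit(ann["trace"]):  # the gatekeeper still has the final say
            accepted.append(ann["trace_id"])
    return accepted

def fake_submit(trace):
    """Stand-in gatekeeper: rejects traces with an empty tool chain."""
    return bool(trace["tool_chain"])

annotations = [
    {"trace_id": "t1", "score": 1.0, "trace": {"tool_chain": ["query_database"]}},
    {"trace_id": "t2", "score": 0.0, "trace": {"tool_chain": ["send_email"]}},
    {"trace_id": "t3", "score": 1.0, "trace": {"tool_chain": []}},
]
print(accept_positive(annotations, fake_submit))  # ['t1']
```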

Without LangChain (plain Python)

If you don't use LangChain, you can use the lower-level primitives directly:

from behavioral_memory.planner.prompt import SYSTEM_PROMPT, build_prompt
from behavioral_memory.planner.postprocess import postprocess_plan

# Build the prompt yourself
prompt = build_prompt(query="Get revenue data", traces=my_traces, tool_schemas=my_schemas)

# Call your own LLM
raw_output = your_llm.chat(system=SYSTEM_PROMPT, user=prompt)

# Parse the JSON plan
steps = postprocess_plan(raw_output)  # returns list[ToolCall]
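
If you want to see what the parsing step involves, the following sketch pulls a JSON array out of raw model text and converts it to typed steps. It is not the source of postprocess_plan; `PlanStep` and `parse_plan` are hypothetical names, and the JSON field names are assumed to mirror the ToolCall fields shown earlier.

```python
import json
import re
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    """Stand-in for a parsed tool call."""
    step_id: str
    tool_name: str
    parameters: dict = field(default_factory=dict)

def parse_plan(raw_output):
    """Parse the JSON array spanning the first '[' to the last ']' in the text."""
    match = re.search(r"\[.*\]", raw_output, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found in model output")
    items = json.loads(match.group(0))
    return [
        PlanStep(item["step_id"], item["tool_name"], item.get("parameters", {}))
        for item in items
    ]

# Models often wrap the JSON in extra prose, so the parser must tolerate it.
raw = (
    "Here is the plan:\n"
    '[{"step_id": "s1", "tool_name": "query_database", '
    '"parameters": {"query": "SELECT 1"}}]\n'
    "Done."
)
steps = parse_plan(raw)
print(steps[0].tool_name)  # query_database
```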

Persistence and Limitations

Store                  Persistence                    Multi-user                    Best for
InMemoryTraceStore     Process memory only            No                            Dev, CI, demos
TraceStore (pgvector)  PostgreSQL, survives restarts  Shared DB, single collection  Production

Current limitations:

  • All traces share one collection (default: validated_traces). No per-user or per-session isolation.
  • Langfuse is optional — the core framework (planning, retrieval, gatekeeper) works without it.
  • The reference agent at agent/ is a planning demo with stub tool execution — bring your own tool runtime.

Key Results

On a 30-task benchmark with 7 MCP tools (Gemini 2.5 Pro, temperature 0):

Metric                   Zero-Shot  Static Few-Shot  With Behavioral Memory
Tool Selection (TSA)     63.3%      70.0%            83.3%
Parameter Validity (PV)  72.2%      79.6%            84.0%
Plan Correctness (PCR)   33.3%      50.0%            63.3%
Sequence Accuracy (ESA)  63.3%      70.0%            83.3%

McNemar's test: p = 0.004 vs zero-shot. Plan correctness nearly doubled, from 33.3% to 63.3%.
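
For readers unfamiliar with McNemar's test: it compares two methods on the same tasks using only the discordant pairs (tasks where exactly one method succeeded). The sketch below computes the exact two-sided p-value. Which variant the paper used is not stated here, so the exact binomial form is an assumption, and the counts in the example are hypothetical, not taken from the paper.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: tasks the baseline got right and the new method got wrong
    c: tasks the baseline got wrong and the new method got right
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts, chosen for illustration only.
print(round(mcnemar_exact(b=0, c=9), 4))  # 0.0039
print(round(mcnemar_exact(b=2, c=3), 4))  # 1.0
```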

Reproduced live run (May 2026)
Metric     Paper  Live Run (pgvector)
TSA        83.3%  86.7%
PV         84.0%  82.2%
PCR        63.3%  80.0%
ESA        83.3%  86.7%
McNemar p  0.004  0.039

All results within 95% bootstrap confidence intervals.
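
As a reminder of how such an interval is typically obtained, here is a minimal percentile-bootstrap sketch over per-task pass/fail outcomes. The resampling scheme, resample count, and the name `bootstrap_ci` are assumptions; the paper's exact procedure is not described in this README. The outcomes are illustrative: 19 of 30 correct, which corresponds to a 63.3% score.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

outcomes = [1] * 19 + [0] * 11
lo, hi = bootstrap_ci(outcomes)
print(f"accuracy={sum(outcomes) / len(outcomes):.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

With only 30 tasks the interval is wide, which is why a live-run score several points away from the paper's can still be consistent with it.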


Architecture

Three layers (from the paper):

Layer       What it does                                                         Key class
Behavioral  Store and retrieve validated execution traces via cosine similarity  InMemoryTraceStore / TraceStore
Tool        Load tool schemas dynamically via MCP                                ToolRegistry / MCPClient
Executive   Assemble prompt (traces + schemas + query), call LLM, parse plan     PlanEngine

Gatekeeper Pipeline guards memory quality with three gates:

  1. Schema validation — tools exist, params valid, deps logical
  2. Sandboxed execution — runtime check with timeout
  3. Semantic deduplication — cosine > 0.95 rejected
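
The third gate can be illustrated with a few lines of Python. This is a sketch of the thresholding idea only, assuming embeddings are plain lists of floats; `is_duplicate` is a hypothetical helper, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_duplicate(candidate_vec, stored_vecs, threshold=0.95):
    """True if any stored embedding is more similar than the threshold."""
    return any(cosine(candidate_vec, v) > threshold for v in stored_vecs)

stored = [[1.0, 0.0], [0.0, 1.0]]
print(is_duplicate([0.99, 0.05], stored))  # True: nearly identical to the first
print(is_duplicate([0.7, 0.7], stored))    # False: about 0.71 to both
```

A linear scan like this is fine for small stores; a vector index such as pgvector would serve the same check at scale.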

Reproduce the Paper

git clone https://github.com/harsh-kr11/behavioral-memory.git
cd behavioral-memory
pip install -e ".[agent,eval]"
export GOOGLE_API_KEY=your-key

# Run the 30-task benchmark
python examples/run_live_benchmark.py

# Quick test (5 tasks)
python examples/run_live_benchmark.py --limit 5

# Exact paper reproduction (with pgvector)
pip install -e ".[postgres]"
docker compose up -d  # or: podman-compose up -d
python examples/run_live_benchmark.py --postgres

# Gatekeeper ablation study (Section IV.D.5)
python examples/gatekeeper_ablation.py --verbose

# Validate pipeline offline (no API keys)
python examples/validate_pipeline.py

A reference LangGraph agent is included at agent/ for demo purposes.


Development

pip install -e ".[dev,eval]"
make test         # 104 tests
make lint         # ruff check
make typecheck    # mypy (strict)
make ci           # all checks

Configuration

All via environment variables or .env:

Variable                    Default  Description
FEW_SHOT_K                  3        Traces to retrieve per query
MAX_PROMPT_TOKENS           3500     Token budget for prompt
SIMILARITY_DEDUP_THRESHOLD  0.95     Dedup cosine threshold
SANDBOX_TIMEOUT_SECONDS     30       Gatekeeper sandbox timeout
VECTOR_STORE_URL            (none)   PostgreSQL connection (only for TraceStore)
LANGFUSE_SECRET_KEY         (none)   Langfuse secret (optional)
LANGFUSE_PUBLIC_KEY         (none)   Langfuse public key (optional)
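
If you need the same values in your own code, a small loader along these lines would read them with the documented defaults. The `Settings` dataclass and `load_settings` function are hypothetical and are not the library's settings object; treating the timeout as an integer is also an assumption, and `.env` file parsing is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class Settings:
    few_shot_k: int
    max_prompt_tokens: int
    similarity_dedup_threshold: float
    sandbox_timeout_seconds: int
    vector_store_url: Optional[str]

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        few_shot_k=int(env.get("FEW_SHOT_K", "3")),
        max_prompt_tokens=int(env.get("MAX_PROMPT_TOKENS", "3500")),
        similarity_dedup_threshold=float(env.get("SIMILARITY_DEDUP_THRESHOLD", "0.95")),
        sandbox_timeout_seconds=int(env.get("SANDBOX_TIMEOUT_SECONDS", "30")),
        vector_store_url=env.get("VECTOR_STORE_URL"),  # None when unset
    )

# Pass a dict instead of os.environ to see overrides and defaults side by side.
settings = load_settings({"FEW_SHOT_K": "5"})
print(settings.few_shot_k, settings.max_prompt_tokens)  # 5 3500
```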

Citation

@inproceedings{khan2025behavioral,
  title={Behavioral Memory for Tool Orchestration: Semantic Retrieval of
         Validated Execution Traces in MCP-Based Agent Systems},
  author={Khan, Mehvash and Kumar, Harsh and Jangir, Rahul},
  booktitle={IEEE Conference Proceedings},
  year={2025}
}

License

Apache 2.0 — See LICENSE.
