
Capacity-bounded warm memory for LLM agents, with a LangGraph BaseStore implementation, embeddings-based importance scoring, and a comparative benchmark.


WarmMemory


WarmMemory is a Python package for short-term memory management in LLM agents. It adds a small in-process working-memory layer that keeps the most recent or most relevant interactions close to the agent, reducing repeated retrieval work and helping control prompt growth.

The repository provides:

  • a reusable Python package for warm-memory buffering,
  • a decorator for automatic interaction capture,
  • a pluggable importance scoring interface,
  • a deterministic benchmark for recency vs relevance vs fallback memory policies,
  • a LangGraph BaseStore integration with per-namespace eviction, embeddings-based ranking, and a pre-built agent,
  • HTML documentation for architecture and usage.

Why This Exists

Many agent systems use one of two expensive patterns:

  • they keep appending conversation history to the prompt,
  • or they query long-term memory on nearly every turn.

Both increase latency and cost. WarmMemory introduces a hot path:

  • keep a small working set in RAM,
  • retrieve from that working set first,
  • fall back to longer-term retrieval only when needed,
  • and send only a compact context window to the model.

Core Ideas

1. Sliding-Window Memory

The system can keep the last N interactions using recent(k).
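As a rough illustration of the sliding-window idea, the sketch below uses a `collections.deque` with a fixed `maxlen`. The `SlidingWindow` class and its row layout are hypothetical stand-ins and are not the package's `WarmMemoryBuffer` (which is Pandas-backed); only the `recent(k)` name is borrowed from the text above.

```python
from collections import deque


class SlidingWindow:
    """Hypothetical capacity-bounded buffer; not the package's own class."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen drops the oldest row automatically once full.
        self._rows = deque(maxlen=capacity)

    def add(self, user_input: str, agent_output: str) -> None:
        self._rows.append({"input": user_input, "output": agent_output})

    def recent(self, k: int) -> list:
        # Return up to k of the newest rows, oldest first.
        if k <= 0:
            return []
        return list(self._rows)[-k:]


window = SlidingWindow(capacity=3)
for i in range(5):
    window.add(f"question {i}", f"answer {i}")

latest = window.recent(2)
print([row["input"] for row in latest])
```

With a capacity of 3, the first two interactions have already been evicted by the time the loop finishes, so the window only ever holds the newest three rows.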

2. Relevance-Aware Memory

Instead of only keeping the latest messages, the system can rank rows against the current query using relevant(query, k) and compact the active working set with retain_relevant(query, k).
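To make the ranking and compaction steps concrete, here is a minimal sketch that assumes a simple token-overlap score. The free functions `relevant` and `retain_relevant` below only mirror the method names mentioned above; the scoring rule, the signatures, and the list-of-strings row format are assumptions for illustration, not the package's implementation.

```python
import re


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, row: str) -> float:
    """Fraction of query tokens that also appear in the row (assumed scorer)."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokenize(row)) / len(query_tokens)


def relevant(rows: list, query: str, k: int) -> list:
    # Highest score first; sorted() is stable, so ties keep insertion order.
    return sorted(rows, key=lambda row: overlap_score(query, row), reverse=True)[:k]


def retain_relevant(rows: list, query: str, k: int) -> None:
    # Compact the working set in place, keeping the survivors in original order.
    ranked = sorted(
        range(len(rows)),
        key=lambda i: overlap_score(query, rows[i]),
        reverse=True,
    )
    keep = set(ranked[:k])
    rows[:] = [row for i, row in enumerate(rows) if i in keep]


rows = [
    "How do I reset my password?",
    "Where is my billing invoice?",
    "Is the invoice overdue?",
    "What is the weather today?",
]
top_two = relevant(rows, "invoice overdue", 2)
retain_relevant(rows, "invoice overdue", 2)
```

Ranking leaves the buffer untouched, while compaction shrinks it to the rows most related to the current query.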

3. Automatic Agent Capture

The @remember_interaction decorator records agent inputs and outputs without forcing changes into the core agent logic.
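The capture pattern can be sketched with a plain decorator factory. `capture_interaction` and the list-based log below are hypothetical; the sketch only shows how a wrapper can record inputs and outputs while leaving the agent function unchanged, and it makes no claim about how `@remember_interaction` is implemented.

```python
import functools
from typing import Callable


def capture_interaction(log: list) -> Callable:
    """Hypothetical decorator factory that appends each call to `log`."""

    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(prompt: str) -> str:
            response = func(prompt)
            # Record only after the call returns, so failed calls leave no row.
            log.append({"input": prompt, "output": response})
            return response

        return wrapper

    return decorator


interactions = []


@capture_interaction(interactions)
def agent(prompt: str) -> str:
    return f"Echo: {prompt}"


agent("hello")
```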

4. Two-Tier Memory Architecture

The benchmark models a practical split:

  • warm memory for fast in-process access,
  • long-term memory for slower fallback retrieval.

Repository Layout

  • warm_memory/: package source code
  • warm_memory/buffer.py: Pandas-backed warm-memory store
  • warm_memory/scoring.py: scoring interface and default heuristic scorer
  • warm_memory/decorators.py: function decorator for interaction capture
  • warm_memory/benchmark.py: deterministic benchmark harness
  • warm_memory/workload.py: synthetic workload for evaluation
  • warm_memory/langgraph/: LangGraph integration (optional extra)
    • store.py: WarmStore(BaseStore) with per-namespace eviction
    • embeddings.py: bring-your-own embeddings scorer
    • agent.py: pre-built build_warm_memory_agent graph
    • benchmark.py: full-history vs vector-only vs warm-fallback benchmark
  • examples/langgraph_warm_agent.py: runnable LangGraph agent example
  • scripts/run_benchmark.py: legacy benchmark entrypoint
  • scripts/run_langgraph_benchmark.py: LangGraph-based benchmark entrypoint
  • reports/warm_memory_benchmark.md: legacy benchmark output
  • reports/warm_memory_langgraph_benchmark.md: LangGraph benchmark output
  • docs/warm_memory_guide.html: public-facing HTML documentation
  • tests/: unit tests

Installation

pip install warm-memory
# or with the LangGraph integration (quoted so shells such as zsh do not expand the brackets):
pip install "warm-memory[langgraph]"

Or install from source for development:

python3 -m pip install -e ".[langgraph]"

Quick Start

from warm_memory import WarmMemoryBuffer, remember_interaction

memory = WarmMemoryBuffer(capacity=8)

@remember_interaction(memory)
def agent(prompt: str) -> str:
    if "billing" in prompt.lower():
        return "Your invoice is available in the billing portal."
    return f"Echo: {prompt}"

agent("How do I reset my password?")
agent("Where is my billing invoice?")

recent_rows = memory.recent(4)
relevant_rows = memory.relevant("invoice", limit=2)
memory.retain_relevant("invoice", limit=4)

Example Usage Pattern

Use WarmMemory in front of a larger memory system:

  1. Receive a new user query.
  2. Search the warm buffer first.
  3. If warm memory is sufficient, build a compact prompt from those rows.
  4. If warm memory is weak, fall back to long-term retrieval.
  5. Write the new interaction back into warm memory.
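
The five steps above could be wired together roughly as follows. This is a sketch under stated assumptions: `keyword_score`, `handle_turn`, the 0.5 threshold, and the in-memory `archive` list are all hypothetical, and a real deployment would call the package's buffer and an actual long-term store instead.

```python
import re
from collections import deque


def keyword_score(query: str, text: str) -> float:
    """Fraction of query tokens found in text (a toy relevance score)."""
    query_tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    text_tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


def handle_turn(query, warm, long_term_search, respond, threshold=0.5, k=2):
    # Step 2: rank the warm rows against the query.
    ranked = sorted(warm, key=lambda row: keyword_score(query, row), reverse=True)
    if ranked and keyword_score(query, ranked[0]) >= threshold:
        # Step 3: warm memory is sufficient, so use only the top-k rows.
        rows, source = ranked[:k], "warm"
    else:
        # Step 4: warm memory is weak, so fall back to long-term retrieval.
        rows, source = long_term_search(query)[:k], "long_term"
    response = respond(query, rows)
    # Step 5: write the new interaction back into warm memory.
    warm.append(f"{query} -> {response}")
    return response, source


archive = ["Password resets are done from the account settings page."]
warm = deque(["Where is my billing invoice? -> It is in the billing portal."], maxlen=4)


def search_archive(query):
    return [doc for doc in archive if keyword_score(query, doc) > 0]


def respond(query, rows):
    return f"Answered with {len(rows)} row(s)"


_, first_source = handle_turn("billing invoice", warm, search_archive, respond)
_, second_source = handle_turn("password reset", warm, search_archive, respond)
print(first_source, second_source)
```

The first query is served from the warm buffer; the second finds nothing relevant there and falls through to the archive.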

This pattern is useful for:

  • coding agents,
  • research assistants,
  • task-oriented copilots,
  • customer support agents,
  • and any multi-turn system with repeated local context.

Benchmark

The repository includes a deterministic benchmark that compares:

  • recency: always use the latest warm-memory rows,
  • relevance: rank and retain the top relevant warm-memory rows,
  • fallback: use warm relevance first, then long-term retrieval on misses.

Run it with:

python3 scripts/run_benchmark.py

This writes a report to reports/warm_memory_benchmark.md.

On the current synthetic workload, the tradeoff looks like this:

  • recency is the fastest policy,
  • fallback is the most accurate policy,
  • relevance sits between the two and provides a cleaner hot working set.

The benchmark is designed to surface that tradeoff rather than name a single winner: each policy occupies a different point on the latency-accuracy curve.

Documentation

  • HTML guide: docs/warm_memory_guide.html
  • Benchmark report: reports/warm_memory_benchmark.md
  • README visual: docs/warm_memory_architecture.svg

The HTML guide explains:

  • how the architecture works,
  • where latency is saved,
  • how to use the package,
  • and how the components fit together.

Architecture

[Figure: WarmMemory architecture]

The pipeline:

  1. Agent Runtime receives the user query in a per-user namespace and triggers two reads: a fast lookup against WarmMemory (the in-process working set) and a Retrieval Ranker scoring pass over those rows (KeywordImportanceScorer by default; swap in EmbeddingsImportanceScorer for semantic ranking).
  2. Warm Hit? checks the best score against the configured threshold.
  3. Green path (warm hit): results flow to Prompt Builder, which injects only the top-K rows into the system prompt before invoking the LLM. The vector tier is never touched.
  4. Orange path (warm miss): the query falls through to Long-Term Memory (LangGraph's InMemoryStore with an embedding index, PostgresStore, or any BaseStore) and the LLM consumes those results as fallback.
  5. Dashed write-back loop: the LLM response is captured by the decorator and written back to WarmMemory (and mirrored to Long-Term Memory by the memory_write node), so future turns can recall it.

On the synthetic benchmark, ~50% of turns take the green path, eliminating that many vector-store calls.
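
The Prompt Builder step in the pipeline can be pictured with a small sketch. The function name `build_system_prompt`, the base instruction, and the "Relevant memory:" layout are assumptions made for illustration; the package's own prompt format is not described here.

```python
def build_system_prompt(scored_rows, top_k, base="You are a helpful assistant."):
    """Inject only the top_k highest-scoring rows into the system prompt.

    scored_rows is a list of (score, text) pairs, as a ranker might return.
    """
    top = sorted(scored_rows, key=lambda pair: pair[0], reverse=True)[:top_k]
    if not top:
        # Nothing to inject: return the base instruction unchanged.
        return base
    memory_lines = "\n".join(f"- {text}" for _, text in top)
    return f"{base}\n\nRelevant memory:\n{memory_lines}"


scored = [
    (0.2, "user asked about the weather"),
    (0.9, "invoice is overdue"),
    (0.5, "prefers concise answers"),
]
prompt = build_system_prompt(scored, top_k=2)
print(prompt)
```

Capping the injected rows at `top_k` is what keeps the prompt size bounded regardless of how many rows the warm buffer holds.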

The diagram ships in two paired formats:

  • docs/warm_memory_architecture.drawio.svg: the rendered SVG that GitHub displays inline. The decision arrows are animated (a "marching ants" SMIL animation) so the hot and cold paths read at a glance. GitHub plays the animation when it displays the SVG; you can also open the file directly in a browser.
  • docs/warm_memory_architecture.drawio — the editable mxgraph source. Open at diagrams.net (File → Open from device) to edit; re-export the SVG when done.

The .drawio.svg also embeds the mxgraph XML in its content attribute, so either file round-trips through the editor — they're kept in sync from the same generator script.

For a richer narrated walkthrough, open docs/warm_memory_guide.html locally or publish it with GitHub Pages.

Development

Run tests:

python3 -m unittest discover -s tests -v

LangGraph Integration

WarmMemory ships an optional warm_memory.langgraph module that plugs directly into the LangGraph ecosystem. Install with the extra:

pip install "warm-memory[langgraph]"

Drop-in BaseStore

WarmStore implements LangGraph's BaseStore interface with per-namespace warm buffers — each namespace gets its own bounded buffer, so multi-tenant agents don't evict each other's memory.

from warm_memory.langgraph import WarmStore

store = WarmStore(capacity=16)
store.put(("alice",), "preferences", {"text": "wants concise answers"})
store.put(("alice",), "billing", {"text": "invoice overdue", "topic": "billing"})

# query-based recall (keyword scorer by default)
hits = store.search(("alice",), query="how do I pay my invoice?")

# filter operators: $eq, $ne, $gt, $gte, $lt, $lte
billing = store.search(("alice",), filter={"topic": "billing"})
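
The per-namespace isolation described above can be illustrated without LangGraph. The sketch below is a hypothetical `NamespacedBuffers` class that evicts the oldest-written key within a namespace; the eviction policy WarmStore actually applies is not stated here, so treat oldest-first as an assumption.

```python
from collections import OrderedDict, defaultdict


class NamespacedBuffers:
    """Hypothetical illustration: one bounded buffer per namespace tuple."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._buffers = defaultdict(OrderedDict)

    def put(self, namespace: tuple, key: str, value: dict) -> None:
        buffer = self._buffers[namespace]
        buffer[key] = value
        buffer.move_to_end(key)  # a rewritten key counts as the newest entry
        while len(buffer) > self._capacity:
            # Evict the oldest entry in this namespace only (assumed policy).
            buffer.popitem(last=False)

    def keys(self, namespace: tuple) -> list:
        return list(self._buffers.get(namespace, {}))


buffers = NamespacedBuffers(capacity=2)
for key in ("a", "b", "c"):
    buffers.put(("alice",), key, {"text": key})
buffers.put(("bob",), "x", {"text": "x"})

print(buffers.keys(("alice",)), buffers.keys(("bob",)))
```

Alice overflowing her own buffer evicts only her oldest key; Bob's single entry is unaffected.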

Bring-your-own embeddings

Swap the default keyword scorer for any LangChain Embeddings:

from langchain_openai import OpenAIEmbeddings
from warm_memory.langgraph import EmbeddingsImportanceScorer, WarmStore

scorer = EmbeddingsImportanceScorer(OpenAIEmbeddings())
store = WarmStore(scorer=scorer)

Works with any LangChain embeddings provider (OpenAI, HuggingFace, Voyage, and others), or with DeterministicFakeEmbedding for tests.
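
Embeddings-based ranking usually comes down to comparing vectors. The sketch below assumes cosine similarity and uses hand-written toy vectors in place of a real model; in practice the vectors would come from a LangChain `Embeddings` object (`embed_query` and `embed_documents`). Whether `EmbeddingsImportanceScorer` uses cosine similarity specifically is an assumption.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def rank_by_embedding(query_vector, row_vectors, k):
    """Return the k row texts whose vectors are closest to the query vector."""
    ranked = sorted(
        row_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional vectors standing in for real embeddings.
row_vectors = {
    "wants concise answers": [0.0, 0.2, 0.9],
    "invoice overdue": [0.9, 0.1, 0.0],
    "billing portal link": [0.7, 0.3, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

top_rows = rank_by_embedding(query_vector, row_vectors, k=2)
print(top_rows)
```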

Pre-built agent

build_warm_memory_agent returns a compiled LangGraph that reads warm memory before responding and writes the new exchange back on the way out:

from warm_memory.langgraph import WarmStore, build_warm_memory_agent

store = WarmStore(capacity=8)
agent = build_warm_memory_agent(model=my_chat_model, store=store)
agent.invoke({"query": "Where's my invoice?", "namespace": ("alice",)})

A runnable example using FakeListChatModel (no API keys) lives at examples/langgraph_warm_agent.py.

Comparative benchmark

scripts/run_langgraph_benchmark.py compares three retrieval strategies through the LangGraph store API:

  • full-history: every prior turn in the prompt (naive baseline)
  • vector-only: LangGraph's InMemoryStore with an embedding index
  • warm-fallback: WarmStore in front of the vector store

Run it with:

python3 scripts/run_langgraph_benchmark.py

This writes reports/warm_memory_langgraph_benchmark.md. The benchmark uses synthetic embeddings by default; set WARM_BENCH_EMBEDDINGS=openai (and OPENAI_API_KEY) to compare against real semantic search.

Roadmap

  • add an embedding-based or reranker-based importance scorer (done via EmbeddingsImportanceScorer)
  • compare against vector-store-first baselines (done via warm-fallback strategy in the LangGraph benchmark)
  • benchmark against real agent traces instead of only synthetic workloads
  • record actual model latency and token usage from a live LLM pipeline
  • add charts and experiment summaries for publication-style reporting
  • TTL support for the LangGraph BaseStore
  • publish warm-memory to PyPI (live at v0.2.1)
  • propose inclusion in LangGraph's third-party store list (LangChain Forum proposal in flight)

License

This project is released under the MIT License. See LICENSE.
