


agent_episodic_memory


Agentic Context Engineering (ACE) as a LangChain middleware. Your agent learns from every run and stores strategies as an evolving playbook. Based on the ICLR 2026 paper by Zhang, Hu et al. (Stanford University, SambaNova Systems, and UC Berkeley).

One import. Self-improving agents. Delta updates instead of full context rewrites. Drop-in AgentMiddleware subclass for LangChain v1 create_agent.

What it does

agent_episodic_memory treats agent context as an evolving playbook that accumulates strategies across runs. After each run, the middleware reflects on what worked and appends delta entries to the playbook. On the next run, the curated playbook is injected into the system prompt. Over time, the agent's context becomes more useful without full rewrites — the paper's key insight is that structured, incremental updates preserve detail that single-pass summarization erodes.

Three paper components map 1:1 to LangChain v1 middleware hooks:

  • Generator (wrap_model_call): injects the current playbook into the system prompt
  • Reflector (after_model): produces delta entries describing what the model did
  • Curator (after_agent): deduplicates entries by content fingerprint

v0.1 scope. This is an architecture port of the paper's three-component structure, not a full reproduction. The default Reflector is a zero-LLM heuristic that labels every entry neutral — it never infers success/failure, since a heuristic that guessed would poison the playbook with confident-wrong answers. The Curator deduplicates by content fingerprint only; the paper's grow-and-refine semantic merge is not implemented in v0.1. Plug in a real LLM Reflector by subclassing and overriding reflect() — see How the three hooks work.

The middleware is stateless across instances — the playbook lives on agent state under state["ace_playbook"], so it persists across tool calls within a run and can be checkpointed by LangGraph across runs.

Quick install

pip install agent_episodic_memory

from langchain.agents import create_agent
from agent_episodic_memory import ACEMiddleware

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    middleware=[ACEMiddleware()],
)

That's the whole integration. The playbook is empty on the first run and grows from there.

vs. existing LangChain context middleware

Unlike SummarizationMiddleware, ContextEditingMiddleware, and the third-party compact-middleware, agent_episodic_memory:

  • learns across runs
  • applies delta updates instead of full context rewrites
  • keeps a structured playbook on agent state
  • implements the architecture from the ICLR 2026 paper
  • preserves detail across iterations (the alternatives do so at best partially)
  • makes zero LLM calls in the generator hook
  • composes with SummarizationMiddleware rather than replacing it

Paper results (reference only — not reproduced by this package)

agent_episodic_memory is an architecture port, not a paper reproduction. The numbers below are from Zhang et al.'s official implementation against the AppWorld / FiNER / Formula harnesses using DeepSeek-V3.1 as the backbone and an LLM-based Reflector + grow-and-refine Curator. This package ships the Generator/Reflector/Curator hook structure and a zero-LLM default Reflector — it does not ship the adaptation harness, the benchmark suites, or a trained Reflector, and therefore does not produce these metrics out of the box. For the canonical implementation, see ace-agent/ace.

From Zhang et al., ICLR 2026:

  • +10.6% on the AppWorld agent benchmark
  • +8.6% on financial reasoning (FiNER + Formula)
  • −86.9% average adaptation latency
  • −82.3% latency and −75.1% rollouts vs. GEPA (offline AppWorld)
  • −91.5% latency and −83.6% token cost vs. Dynamic Cheatsheet (online FiNER)
  • 91.8% KV cache reuse during evaluation
  • Matches the top-1 ranked IBM CUGA (60.3%) on AppWorld overall average, despite using the much smaller open-source DeepSeek-V3.1 instead of CUGA's GPT-4.1. With online adaptation, ACE also surpasses IBM CUGA by 8.4% in TGC and 0.7% in SGC on the test-challenge split.

The IBM CUGA reference is used by the paper as a rough contextual benchmark — not a direct methodological comparison — to show ACE operates in a similar performance range using a much smaller open backbone.

How the three hooks work

Generator (wrap_model_call)

Before each model call, the middleware reads the current playbook from state["ace_playbook"] and injects it into the system message. If an existing system message is present, the playbook is appended; otherwise a new system message is constructed.

The rendered playbook groups entries by category (tool_use, final_answer, observation, …) and marks each with its outcome:

<playbook>
  <tool_use>
    [+] read_file on config.json succeeded
    [-] grep without file-type flag returned too many hits
  </tool_use>
  <final_answer>
    [+] summarize after three or fewer file reads
  </final_answer>
</playbook>

Outcomes are [+] success, [-] failure, [o] neutral. The system prompt instructs the model to prefer [+] patterns and avoid [-] patterns when they apply.
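The grouping and outcome markers described above can be sketched as a small rendering helper. The entry shape (category, content, outcome) mirrors the DeltaEntry fields documented below, but render_playbook itself is an illustrative function, not the package's internal API:

```python
# Markers match the documented outcome labels.
MARKERS = {"success": "[+]", "failure": "[-]", "neutral": "[o]"}

def render_playbook(entries):
    """Group entries by category and render the <playbook> block."""
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry["category"], []).append(entry)
    lines = ["<playbook>"]
    for category, items in grouped.items():
        lines.append(f"  <{category}>")
        for item in items:
            lines.append(f"    {MARKERS[item['outcome']]} {item['content']}")
        lines.append(f"  </{category}>")
    lines.append("</playbook>")
    return "\n".join(lines)
```
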

Reflector (after_model)

After each model call, the middleware inspects the latest AIMessage. The default reflector is deterministic and zero-LLM: it categorizes the output as tool_use (has tool calls) or final_answer (pure text) and appends a DeltaEntry with outcome neutral. It deliberately never claims success or failure: a heuristic cannot know whether a final answer was correct, and labeling every final answer a success would poison the playbook with confident-wrong entries that then feed forward into every future run. The paper explicitly warns against this failure mode.

Override the public reflect() method in a subclass to plug in an LLM-based Reflector that produces real success/failure labels. For example, a cheaper model that rates whether the step made progress toward the task goal, or a judge model that compares the final answer against a ground-truth signal. See the paper §4 for the full Reflector design the authors ran against AppWorld and FiNER.

Curator (after_agent)

At the end of the run, the Curator merges the accumulated ace_pending_deltas into the playbook. Entries are deduplicated by a content-hash fingerprint over (category, content, outcome). This is exact-match dedup only — the paper's grow-and-refine semantic merge is not implemented in v0.1. Two near-identical strategies that differ by one token will coexist in the playbook as separate entries. The updated playbook is written back to state and is ready for the next run.
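The exact-match dedup above can be sketched with a SHA-256 fingerprint over (category, content, outcome). merge_deltas is a hypothetical helper illustrating the Curator's behavior, not the package's internal API:

```python
import hashlib

def fingerprint(entry):
    """Content-hash fingerprint over (category, content, outcome)."""
    key = "\x1f".join((entry["category"], entry["content"], entry["outcome"]))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def merge_deltas(playbook, pending):
    """Append pending entries whose fingerprint is not already in the playbook."""
    seen = {fingerprint(e) for e in playbook}
    for entry in pending:
        fp = fingerprint(entry)
        if fp not in seen:
            playbook.append(entry)
            seen.add(fp)
    return playbook
```

Note how a one-token drift in content produces a different fingerprint, so both entries survive — exactly the v0.1 limitation described above.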

Composition with other middleware

ACEMiddleware composes with every LangChain built-in middleware:

from langchain.agents import create_agent
from langchain.agents.middleware import (
    HumanInTheLoopMiddleware,
    ModelFallbackMiddleware,
    SummarizationMiddleware,
    ToolRetryMiddleware,
)
from agent_episodic_memory import ACEMiddleware

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    middleware=[
        ACEMiddleware(),              # evolves the playbook
        SummarizationMiddleware(...), # compacts history when long
        ToolRetryMiddleware(),        # retries flaky tools
        ModelFallbackMiddleware(...), # falls back on model errors
        HumanInTheLoopMiddleware(...), # gates sensitive tool calls
    ],
)

ACEMiddleware is designed to run first in the chain so the playbook is injected into the system message before any compaction or retry logic modifies the request.

Multi-tenant deployments

⚠️ The playbook is scoped to the LangGraph thread_id, not to any tenant or user concept.

ACE stores the evolving playbook under state["ace_playbook"], which is checkpointed per-thread by LangGraph. If two different users share a thread_id (even by accident — e.g. a deployment that reuses thread ids across anonymous sessions, or a supervisor that forks state across subagents), they will share the same playbook, and strategies learned from one user's runs will be injected into the other user's system prompts.

For any multi-tenant deployment:

  • Scope thread_id per tenant. Use thread_id = f"{tenant_id}:{conversation_id}" or similar so the checkpointer can never accidentally cross-pollinate.
  • Do not share the same checkpointer thread across users. Even for anonymous traffic, mint a fresh thread_id per session.
  • Consider wiping the playbook between logical contexts if you have any scenario where cross-run learning is undesirable (e.g. one-shot question answering with no continuity).
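The first rule above can be captured in a small config helper, assuming the standard LangGraph convention of passing thread_id under "configurable"; tenant_id, conversation_id, and thread_config are illustrative names:

```python
def thread_config(tenant_id: str, conversation_id: str) -> dict:
    """Build a runnable config whose thread_id cannot collide across tenants."""
    return {"configurable": {"thread_id": f"{tenant_id}:{conversation_id}"}}

# Usage sketch:
# agent.invoke({"messages": [...]}, config=thread_config("acme", "chat-42"))
```
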

v0.1 does not ship a built-in namespace primitive for the playbook. If you need strict per-user memory isolation with shared infrastructure, use langmem alongside ACEMiddleware — langmem provides namespace-scoped memory out of the box, and ACE composes with it cleanly.

Limitations

Direct from the paper's Limitations and Challenges section:

ACE's reliance on a reasonably strong Reflector: if the Reflector fails to extract meaningful insights from generated traces or outcomes, the constructed context may become noisy or even harmful.

In practice this means the default deterministic reflector (which categorizes based on structural signals, not semantic understanding) works well for tasks where success correlates with structural patterns — tool-use behavior, error detection, retry patterns — and degrades on tasks that require nuanced interpretation of intent. For those, subclass ACEMiddleware and override reflect() (see above).

The paper also notes that ACE is most beneficial in settings that demand detailed domain knowledge, complex tool use, or environment-specific strategies — not tasks already covered by base model weights or simple system prompts.

Relationship to the official ACE implementation

The official paper repository is at ace-agent/ace and contains the full offline/online adaptation framework the authors used for evaluation, including the AppWorld, FiNER, Formula, medical, and Text-to-SQL experiments.

agent_episodic_memory is a LangChain v1 middleware port of the paper's Generator/Reflector/Curator architecture. It does not reimplement the adaptation harness or the benchmark suites — the goal is to make the paper's context-engineering pattern available as a drop-in middleware in existing LangChain agents.

There is also kayba-ai/agentic-context-engine, which provides a separate ACERunner framework that wraps LangChain Runnables from the outside. That implementation is complementary: it owns the run loop; agent_episodic_memory plugs into LangChain's existing run loop as a middleware subclass.

Citation

If you use agent_episodic_memory in research, please cite the original paper:

@inproceedings{zhang2026ace,
  title={Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models},
  author={Zhang, Qizheng and Hu, Changran and Upasani, Shubhangi and Ma, Boyuan and Hong, Fenglu and Kamanuru, Vamsidhar and Rainton, Jay and Wu, Chen and Ji, Mengmeng and Li, Hanchen and Thakker, Urmish and Zou, James and Olukotun, Kunle},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.04618}
}


License

MIT. See LICENSE.
