agent_episodic_memory

Agentic Context Engineering (ACE) middleware for LangChain — self-improving playbooks from the ICLR 2026 paper by Zhang et al.
Agentic Context Engineering (ACE) as a LangChain middleware. Your agent learns from every run and stores strategies as an evolving playbook. Based on the ICLR 2026 paper by Zhang, Hu et al. (Stanford University, SambaNova Systems, and UC Berkeley).
One import. Self-improving agents. Delta updates instead of full context rewrites. Drop-in AgentMiddleware subclass for LangChain v1 create_agent.
What it does
agent_episodic_memory treats agent context as an evolving playbook that accumulates strategies across runs. After each run, the middleware reflects on what worked and appends delta entries to the playbook. On the next run, the curated playbook is injected into the system prompt. Over time, the agent's context becomes more useful without full rewrites — the paper's key insight is that structured, incremental updates preserve detail that single-pass summarization erodes.
Three paper components map 1:1 to LangChain v1 middleware hooks:
- Generator → wrap_model_call — injects the current playbook into the system prompt
- Reflector → after_model — produces delta entries describing what the model did
- Curator → after_agent — deduplicates entries by content fingerprint
v0.1 scope. This is an architecture port of the paper's three-component structure, not a full reproduction. The default Reflector is a zero-LLM heuristic that labels every entry neutral — it never infers success/failure, since a heuristic that guessed would poison the playbook with confident-wrong answers. The Curator deduplicates by content fingerprint only; the paper's grow-and-refine semantic merge is not implemented in v0.1. Plug in a real LLM Reflector by subclassing and overriding reflect() — see How the three hooks work.
The middleware is stateless across instances — the playbook lives on agent state under state["ace_playbook"], so it persists across tool calls within a run and can be checkpointed by LangGraph across runs.
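Illustratively, the checkpointed state looks like this. The entry fields (category, content, outcome) and the two state keys are taken from this README's description; the package's actual classes may differ:

```python
# Illustrative sketch of the per-thread state, assuming the DeltaEntry
# fields (category, content, outcome) described in this README.
state = {
    "ace_playbook": [
        {"category": "tool_use",
         "content": "read_file on config.json succeeded",
         "outcome": "success"},
    ],
    "ace_pending_deltas": [],  # deltas accumulated during the current run
}
```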
Quick install
```
pip install agent_episodic_memory
```

```python
from langchain.agents import create_agent
from agent_episodic_memory import ACEMiddleware

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    middleware=[ACEMiddleware()],
)
```
That's the whole integration. The playbook is empty on the first run and grows from there.
vs. existing LangChain context middleware
| | agent_episodic_memory | SummarizationMiddleware | ContextEditingMiddleware | compact-middleware |
|---|---|---|---|---|
| Learns across runs | ✅ | ❌ | ❌ | ❌ |
| Delta updates (not full rewrites) | ✅ | ❌ | ❌ | ❌ |
| Structured playbook state | ✅ | ❌ | ❌ | ❌ |
| Architecture from ICLR 2026 paper | ✅ | ❌ | ❌ | ❌ |
| Preserves detail across iterations | ✅ | partial | ❌ | partial |
| Zero LLM calls in the generator hook | ✅ | ❌ | ✅ | ❌ |
| Composes with SummarizationMiddleware | ✅ | — | ✅ | ✅ |
Paper results (reference only — not reproduced by this package)
agent_episodic_memory is an architecture port, not a paper reproduction. The numbers below are from Zhang et al.'s official implementation against the AppWorld / FiNER / Formula harnesses using DeepSeek-V3.1 as the backbone and an LLM-based Reflector + grow-and-refine Curator. This package ships the Generator/Reflector/Curator hook structure and a zero-LLM default Reflector — it does not ship the adaptation harness, the benchmark suites, or a trained Reflector, and therefore does not produce these metrics out of the box. For the canonical implementation, see ace-agent/ace.
From Zhang et al., ICLR 2026:
- +10.6% on the AppWorld agent benchmark
- +8.6% on financial reasoning (FiNER + Formula)
- −86.9% average adaptation latency
- −82.3% latency and −75.1% rollouts vs. GEPA (offline AppWorld)
- −91.5% latency and −83.6% token dollar cost vs. Dynamic Cheatsheet (online FiNER)
- 91.8% KV cache reuse during evaluation
- Matches the top-1 ranked IBM CUGA (60.3%) on AppWorld overall average, despite using the much smaller open-source DeepSeek-V3.1 instead of CUGA's GPT-4.1. With online adaptation, ACE also surpasses IBM CUGA by 8.4% in TGC and 0.7% in SGC on the test-challenge split.
The IBM CUGA reference is used by the paper as a rough contextual benchmark — not a direct methodological comparison — to show ACE operates in a similar performance range using a much smaller open backbone.
How the three hooks work
Generator (wrap_model_call)
Before each model call, the middleware reads the current playbook from state["ace_playbook"] and injects it into the system message. If an existing system message is present, the playbook is appended; otherwise a new system message is constructed.
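Conceptually, the injection step reduces to a few lines. This sketch uses plain role/content dicts instead of LangChain message objects, so the shapes are simplified assumptions rather than the middleware's real internals:

```python
def inject_playbook(messages, rendered_playbook):
    """Append the playbook to an existing system message, or create one."""
    msgs = [dict(m) for m in messages]  # shallow copy; don't mutate the input
    for m in msgs:
        if m["role"] == "system":
            m["content"] = m["content"] + "\n\n" + rendered_playbook
            return msgs
    # no system message present: prepend a fresh one
    return [{"role": "system", "content": rendered_playbook}] + msgs
```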
The rendered playbook groups entries by category (tool_use, final_answer, observation, …) and marks each with its outcome:
```
<playbook>
<tool_use>
[+] read_file on config.json succeeded
[-] grep without file-type flag returned too many hits
</tool_use>
<final_answer>
[+] summarize after three or fewer file reads
</final_answer>
</playbook>
```
Outcomes are [+] success, [-] failure, [o] neutral. The system prompt instructs the model to prefer [+] patterns and avoid [-] patterns when they apply.
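The rendering logic can be sketched with stdlib Python. Entry fields follow the DeltaEntry shape described in this README; the function name is illustrative, not the package's API:

```python
from collections import defaultdict

MARKS = {"success": "[+]", "failure": "[-]", "neutral": "[o]"}

def render_playbook(entries):
    """Group entries by category and prefix each with its outcome mark."""
    by_category = defaultdict(list)
    for e in entries:
        by_category[e["category"]].append(f'{MARKS[e["outcome"]]} {e["content"]}')
    lines = ["<playbook>"]
    for category, items in by_category.items():
        lines.append(f"<{category}>")
        lines.extend(items)
        lines.append(f"</{category}>")
    lines.append("</playbook>")
    return "\n".join(lines)
```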
Reflector (after_model)
After each model call, the middleware inspects the latest AIMessage. The default reflector is deterministic and zero-LLM: it categorizes the output as tool_use (has tool calls) or final_answer (pure text) and appends a DeltaEntry with outcome neutral — it deliberately never claims success or failure. A heuristic cannot know whether a final answer was correct, and labeling every final answer success would poison the playbook with confident-wrong entries that then feed forward into every future run. The paper explicitly warns against this failure mode.
Override the public reflect() method in a subclass to plug in an LLM-based Reflector that produces real success/failure labels. For example, a cheaper model that rates whether the step made progress toward the task goal, or a judge model that compares the final answer against a ground-truth signal. See the paper §4 for the full Reflector design the authors ran against AppWorld and FiNER.
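A minimal sketch of such a subclass, assuming the DeltaEntry fields described in this README. The base class here is a stand-in reproducing the zero-LLM heuristic so the example runs without LangChain installed; in real use you would subclass agent_episodic_memory.ACEMiddleware:

```python
class ACEMiddleware:
    """Stand-in base with the default zero-LLM reflect heuristic:
    categorize by structure, always label the entry neutral."""
    def reflect(self, message):
        category = "tool_use" if message.get("tool_calls") else "final_answer"
        return {"category": category,
                "content": str(message.get("content", "")),
                "outcome": "neutral"}

class JudgedReflectorMiddleware(ACEMiddleware):
    """Replace the neutral label with a judged success/failure label."""
    def __init__(self, judge):
        # judge: callable taking the message and returning
        # "success" | "failure" | "neutral" — e.g. a cheap LLM call
        self.judge = judge

    def reflect(self, message):
        entry = super().reflect(message)
        if entry["category"] == "final_answer":
            # only judge final answers; tool-use steps stay neutral here
            entry["outcome"] = self.judge(message)
        return entry
```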
Curator (after_agent)
At the end of the run, the Curator merges the accumulated ace_pending_deltas into the playbook. Entries are deduplicated by a content-hash fingerprint over (category, content, outcome). This is exact-match dedup only — the paper's grow-and-refine semantic merge is not implemented in v0.1. Two near-identical strategies that differ by one token will coexist in the playbook as separate entries. The updated playbook is written back to state and is ready for the next run.
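The dedup step amounts to hashing the three fields. A stdlib sketch (function names are illustrative, not the package's API):

```python
import hashlib

def fingerprint(entry):
    """Exact-match fingerprint over (category, content, outcome)."""
    key = "\x1f".join([entry["category"], entry["content"], entry["outcome"]])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def curate(playbook, pending_deltas):
    """Append pending deltas whose fingerprint is not already in the playbook."""
    seen = {fingerprint(e) for e in playbook}
    for entry in pending_deltas:
        fp = fingerprint(entry)
        if fp not in seen:
            playbook.append(entry)
            seen.add(fp)
    return playbook
```

Note the limitation described above: two entries differing by a single token hash differently and both survive, which is exactly why the paper's semantic grow-and-refine merge is on the roadmap.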
Composition with other middleware
ACEMiddleware composes with every LangChain built-in middleware:
```python
from langchain.agents import create_agent
from langchain.agents.middleware import (
    HumanInTheLoopMiddleware,
    ModelFallbackMiddleware,
    SummarizationMiddleware,
    ToolRetryMiddleware,
)
from agent_episodic_memory import ACEMiddleware

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    middleware=[
        ACEMiddleware(),               # evolves the playbook
        SummarizationMiddleware(...),  # compacts history when long
        ToolRetryMiddleware(),         # retries flaky tools
        ModelFallbackMiddleware(...),  # falls back on model errors
        HumanInTheLoopMiddleware(...), # gates sensitive tool calls
    ],
)
```
ACEMiddleware is designed to run first in the chain so the playbook is injected into the system message before any compaction or retry logic modifies the request.
Multi-tenant deployments
⚠️ The playbook is scoped to the LangGraph thread_id, not to any tenant or user concept.
ACE stores the evolving playbook under state["ace_playbook"], which is checkpointed per-thread by LangGraph. If two different users share a thread_id (even by accident — e.g. a deployment that reuses thread ids across anonymous sessions, or a supervisor that forks state across subagents), they will share the same playbook, and strategies learned from one user's runs will be injected into the other user's system prompts.
For any multi-tenant deployment:
- Scope thread_id per tenant. Use thread_id = f"{tenant_id}:{conversation_id}" or similar so the checkpointer can never accidentally cross-pollinate.
- Do not share the same checkpointer thread across users. Even for anonymous traffic, mint a fresh thread_id per session.
- Consider wiping the playbook between logical contexts if you have any scenario where cross-run learning is undesirable (e.g. one-shot question answering with no continuity).
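A minimal sketch of per-tenant thread scoping, using LangGraph's standard configurable thread_id (the tenant and conversation ids here are placeholders):

```python
def tenant_thread_id(tenant_id: str, conversation_id: str) -> str:
    """Namespace the LangGraph thread_id so playbooks never cross tenants."""
    return f"{tenant_id}:{conversation_id}"

# Passed as the run config, e.g. agent.invoke({"messages": [...]}, config=config)
config = {"configurable": {"thread_id": tenant_thread_id("acme", "chat-42")}}
```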
v0.1 does not ship a built-in namespace primitive for the playbook. If you need strict per-user memory isolation with shared infrastructure, use langmem alongside ACEMiddleware — langmem provides namespace-scoped memory out of the box, and ACE composes with it cleanly.
Limitations
Direct from the paper's Limitations and Challenges section:
ACE's reliance on a reasonably strong Reflector: if the Reflector fails to extract meaningful insights from generated traces or outcomes, the constructed context may become noisy or even harmful.
In practice this means the default deterministic reflector (which categorizes based on structural signals, not semantic understanding) works well for tasks where success correlates with structural patterns — tool-use behavior, error detection, retry patterns — and degrades on tasks that require nuanced interpretation of intent. For those, override reflect() with an LLM-backed Reflector (see above).
The paper also notes that ACE is most beneficial in settings that demand detailed domain knowledge, complex tool use, or environment-specific strategies — not tasks already covered by base model weights or simple system prompts.
Relationship to the official ACE implementation
The official paper repository is at ace-agent/ace and contains the full offline/online adaptation framework the authors used for evaluation, including the AppWorld, FiNER, Formula, medical, and Text-to-SQL experiments.
agent_episodic_memory is a LangChain v1 middleware port of the paper's Generator/Reflector/Curator architecture. It does not reimplement the adaptation harness or the benchmark suites — the goal is to make the paper's context-engineering pattern available as a drop-in middleware in existing LangChain agents.
There is also kayba-ai/agentic-context-engine, which provides a separate ACERunner framework that wraps LangChain Runnables from the outside. That implementation is complementary: it owns the run loop; agent_episodic_memory plugs into LangChain's existing run loop as a middleware subclass.
Citation
If you use agent_episodic_memory in research, please cite the original paper:
```bibtex
@inproceedings{zhang2026ace,
  title={Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models},
  author={Zhang, Qizheng and Hu, Changran and Upasani, Shubhangi and Ma, Boyuan and Hong, Fenglu and Kamanuru, Vamsidhar and Rainton, Jay and Wu, Chen and Ji, Mengmeng and Li, Hanchen and Thakker, Urmish and Zou, James and Olukotun, Kunle},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.04618}
}
```
Links
- Paper (arXiv): https://arxiv.org/abs/2510.04618
- Official ACE implementation: https://github.com/ace-agent/ace
- ICLR 2026 poster page: https://iclr.cc/virtual/2026/poster/10008343
- LangChain middleware docs: https://docs.langchain.com/oss/python/langchain/middleware
- Issues: https://github.com/johanity/agent-episodic-memory/issues
License
MIT. See LICENSE.