


agent_episodic_memory


Agentic Context Engineering (ACE) as a LangChain middleware. Your agent learns from every run and stores strategies as an evolving playbook. Based on the ICLR 2026 paper by Zhang, Hu et al. (Stanford University, SambaNova Systems, and UC Berkeley).

One import. Self-improving agents. Delta updates instead of full context rewrites. Drop-in AgentMiddleware subclass for LangChain v1 create_agent.

What it does

agent_episodic_memory treats agent context as an evolving playbook that accumulates strategies across runs. After each run, the middleware reflects on what worked and appends delta entries to the playbook. On the next run, the curated playbook is injected into the system prompt. Over time, the agent's context becomes more useful without full rewrites — the paper's key insight is that structured, incremental updates preserve detail that single-pass summarization erodes.

Three paper components map 1:1 to LangChain v1 middleware hooks:

  • Generator (wrap_model_call): injects the current playbook into the system prompt
  • Reflector (after_model): produces delta entries describing what the model did
  • Curator (after_agent): deduplicates entries by content fingerprint

v0.1 scope. This is an architecture port of the paper's three-component structure, not a full reproduction. The default Reflector is a zero-LLM heuristic that labels every entry neutral — it never infers success/failure, since a heuristic that guessed would poison the playbook with confident-wrong answers. The Curator deduplicates by content fingerprint only; the paper's grow-and-refine semantic merge is not implemented in v0.1. Plug in a real LLM Reflector by subclassing and overriding reflect() — see How the three hooks work.

The middleware is stateless across instances — the playbook lives on agent state under state["ace_playbook"], so it persists across tool calls within a run and can be checkpointed by LangGraph across runs.

Quick install

pip install agent_episodic_memory

from langchain.agents import create_agent
from agent_episodic_memory import ACEMiddleware

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    middleware=[ACEMiddleware()],
)

That's the whole integration. The playbook is empty on the first run and grows from there.

vs. existing LangChain context middleware

Unlike SummarizationMiddleware, ContextEditingMiddleware, and the third-party compact-middleware, agent_episodic_memory:

  • learns across runs
  • applies delta updates instead of full context rewrites
  • keeps a structured playbook on agent state
  • implements the architecture from the ICLR 2026 paper
  • preserves detail across iterations (the alternatives do so at best partially)
  • makes zero LLM calls in the generator hook
  • composes with SummarizationMiddleware rather than replacing it

Paper results (reference only — not reproduced by this package)

agent_episodic_memory is an architecture port, not a paper reproduction. The numbers below are from Zhang et al.'s official implementation against the AppWorld / FiNER / Formula harnesses using DeepSeek-V3.1 as the backbone and an LLM-based Reflector + grow-and-refine Curator. This package ships the Generator/Reflector/Curator hook structure and a zero-LLM default Reflector — it does not ship the adaptation harness, the benchmark suites, or a trained Reflector, and therefore does not produce these metrics out of the box. For the canonical implementation, see ace-agent/ace.

From Zhang et al., ICLR 2026:

  • +10.6% on the AppWorld agent benchmark
  • +8.6% on financial reasoning (FiNER + Formula)
  • −86.9% average adaptation latency
  • −82.3% latency and −75.1% rollouts vs. GEPA (offline AppWorld)
  • −91.5% latency and −83.6% token cost vs. Dynamic Cheatsheet (online FiNER)
  • 91.8% KV cache reuse during evaluation
  • Matches the top-1 ranked IBM CUGA (60.3%) on AppWorld overall average, despite using the much smaller open-source DeepSeek-V3.1 instead of CUGA's GPT-4.1. With online adaptation, ACE also surpasses IBM CUGA by 8.4% in TGC and 0.7% in SGC on the test-challenge split.

The IBM CUGA reference is used by the paper as a rough contextual benchmark — not a direct methodological comparison — to show ACE operates in a similar performance range using a much smaller open backbone.

How the three hooks work

Generator (wrap_model_call)

Before each model call, the middleware reads the current playbook from state["ace_playbook"] and injects it into the system message. If an existing system message is present, the playbook is appended; otherwise a new system message is constructed.

The rendered playbook groups entries by category (tool_use, final_answer, observation, …) and marks each with its outcome:

<playbook>
  <tool_use>
    [+] read_file on config.json succeeded
    [-] grep without file-type flag returned too many hits
  </tool_use>
  <final_answer>
    [+] summarize after three or fewer file reads
  </final_answer>
</playbook>

Outcomes are [+] success, [-] failure, [o] neutral. The system prompt instructs the model to prefer [+] patterns and avoid [-] patterns when they apply.
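The grouping and outcome markers described above can be sketched as a small rendering helper. The entry shape (category, content, outcome) mirrors the DeltaEntry fields documented below, but render_playbook itself is an illustrative function, not the package's internal API:

```python
# Markers match the documented outcome labels.
MARKERS = {"success": "[+]", "failure": "[-]", "neutral": "[o]"}

def render_playbook(entries):
    """Group entries by category and render the <playbook> block."""
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry["category"], []).append(entry)
    lines = ["<playbook>"]
    for category, items in grouped.items():
        lines.append(f"  <{category}>")
        for item in items:
            lines.append(f"    {MARKERS[item['outcome']]} {item['content']}")
        lines.append(f"  </{category}>")
    lines.append("</playbook>")
    return "\n".join(lines)
```
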

Reflector (after_model)

After each model call, the middleware inspects the latest AIMessage. The default reflector is deterministic and zero-LLM: it categorizes the output as tool_use (has tool calls) or final_answer (pure text) and appends a DeltaEntry with outcome neutral. It deliberately never claims success or failure: a heuristic cannot know whether a final answer was correct, and labeling every final answer a success would poison the playbook with confident-wrong entries that then feed forward into every future run. The paper explicitly warns against this failure mode.

Override the public reflect() method in a subclass to plug in an LLM-based Reflector that produces real success/failure labels. For example, a cheaper model that rates whether the step made progress toward the task goal, or a judge model that compares the final answer against a ground-truth signal. See the paper §4 for the full Reflector design the authors ran against AppWorld and FiNER.

Curator (after_agent)

At the end of the run, the Curator merges the accumulated ace_pending_deltas into the playbook. Entries are deduplicated by a content-hash fingerprint over (category, content, outcome). This is exact-match dedup only — the paper's grow-and-refine semantic merge is not implemented in v0.1. Two near-identical strategies that differ by one token will coexist in the playbook as separate entries. The updated playbook is written back to state and is ready for the next run.
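The exact-match dedup above can be sketched with a SHA-256 fingerprint over (category, content, outcome). merge_deltas is a hypothetical helper illustrating the Curator's behavior, not the package's internal API:

```python
import hashlib

def fingerprint(entry):
    """Content-hash fingerprint over (category, content, outcome)."""
    key = "\x1f".join((entry["category"], entry["content"], entry["outcome"]))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def merge_deltas(playbook, pending):
    """Append pending entries whose fingerprint is not already in the playbook."""
    seen = {fingerprint(e) for e in playbook}
    for entry in pending:
        fp = fingerprint(entry)
        if fp not in seen:
            playbook.append(entry)
            seen.add(fp)
    return playbook
```

Note how a one-token drift in content produces a different fingerprint, so both entries survive — exactly the v0.1 limitation described above.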

Composition with other middleware

ACEMiddleware composes with every LangChain built-in middleware:

from langchain.agents import create_agent
from langchain.agents.middleware import (
    HumanInTheLoopMiddleware,
    ModelFallbackMiddleware,
    SummarizationMiddleware,
    ToolRetryMiddleware,
)
from agent_episodic_memory import ACEMiddleware

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[...],
    middleware=[
        ACEMiddleware(),              # evolves the playbook
        SummarizationMiddleware(...), # compacts history when long
        ToolRetryMiddleware(),        # retries flaky tools
        ModelFallbackMiddleware(...), # falls back on model errors
        HumanInTheLoopMiddleware(...), # gates sensitive tool calls
    ],
)

ACEMiddleware is designed to run first in the chain so the playbook is injected into the system message before any compaction or retry logic modifies the request.

Multi-tenant deployments

⚠️ The playbook is scoped to the LangGraph thread_id, not to any tenant or user concept.

ACE stores the evolving playbook under state["ace_playbook"], which is checkpointed per-thread by LangGraph. If two different users share a thread_id (even by accident — e.g. a deployment that reuses thread ids across anonymous sessions, or a supervisor that forks state across subagents), they will share the same playbook, and strategies learned from one user's runs will be injected into the other user's system prompts.

For any multi-tenant deployment:

  • Scope thread_id per tenant. Use thread_id = f"{tenant_id}:{conversation_id}" or similar so the checkpointer can never accidentally cross-pollinate.
  • Do not share the same checkpointer thread across users. Even for anonymous traffic, mint a fresh thread_id per session.
  • Consider wiping the playbook between logical contexts if you have any scenario where cross-run learning is undesirable (e.g. one-shot question answering with no continuity).
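The first rule above can be captured in a small config helper, assuming the standard LangGraph convention of passing thread_id under "configurable"; tenant_id, conversation_id, and thread_config are illustrative names:

```python
def thread_config(tenant_id: str, conversation_id: str) -> dict:
    """Build a runnable config whose thread_id cannot collide across tenants."""
    return {"configurable": {"thread_id": f"{tenant_id}:{conversation_id}"}}

# Usage sketch:
# agent.invoke({"messages": [...]}, config=thread_config("acme", "chat-42"))
```
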

v0.1 does not ship a built-in namespace primitive for the playbook. If you need strict per-user memory isolation with shared infrastructure, use langmem alongside ACEMiddleware — langmem provides namespace-scoped memory out of the box, and ACE composes with it cleanly.

Limitations

Direct from the paper's Limitations and Challenges section:

ACE's reliance on a reasonably strong Reflector: if the Reflector fails to extract meaningful insights from generated traces or outcomes, the constructed context may become noisy or even harmful.

In practice this means the default deterministic reflector (which categorizes based on structural signals, not semantic understanding) works well for tasks where success correlates with structural patterns — tool-use behavior, error detection, retry patterns — and degrades on tasks that require nuanced interpretation of intent. For those, subclass ACEMiddleware and override reflect() (see above).

The paper also notes that ACE is most beneficial in settings that demand detailed domain knowledge, complex tool use, or environment-specific strategies — not tasks already covered by base model weights or simple system prompts.

Relationship to the official ACE implementation

The official paper repository is at ace-agent/ace and contains the full offline/online adaptation framework the authors used for evaluation, including the AppWorld, FiNER, Formula, medical, and Text-to-SQL experiments.

agent_episodic_memory is a LangChain v1 middleware port of the paper's Generator/Reflector/Curator architecture. It does not reimplement the adaptation harness or the benchmark suites — the goal is to make the paper's context-engineering pattern available as a drop-in middleware in existing LangChain agents.

There is also kayba-ai/agentic-context-engine, which provides a separate ACERunner framework that wraps LangChain Runnables from the outside. That implementation is complementary: it owns the run loop; agent_episodic_memory plugs into LangChain's existing run loop as a middleware subclass.

Citation

If you use agent_episodic_memory in research, please cite the original paper:

@inproceedings{zhang2026ace,
  title={Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models},
  author={Zhang, Qizheng and Hu, Changran and Upasani, Shubhangi and Ma, Boyuan and Hong, Fenglu and Kamanuru, Vamsidhar and Rainton, Jay and Wu, Chen and Ji, Mengmeng and Li, Hanchen and Thakker, Urmish and Zou, James and Olukotun, Kunle},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2510.04618}
}


License

MIT. See LICENSE.
