ACE: Agentic Context Engineering. Evolving, self-improving context playbooks for LLM agents. A faithful, framework-style implementation of the ICLR 2026 paper, with first-class OpenAI Agents SDK support.
ACE: Agentic Context Engineering
Evolving, self-improving context playbooks for LLM agents: a clean, tested, framework-style implementation of the ICLR 2026 paper, with first-class OpenAI Agents SDK support.
Stop re-prompting. Let your agent write its own playbook from experience.
Documentation site · Architecture
Quickstart · Why ACE · Cookbook · Use on your own task · OpenAI Agents SDK · How it works · Results · Architecture
What is this?
LLM agents and domain experts increasingly improve through context adaptation: editing the inputs (instructions, strategies, evidence) instead of the weights. But the two dominant approaches break down:
- Brevity bias: prompt optimizers collapse toward short, generic instructions and throw away hard-won domain detail.
- Context collapse: letting an LLM rewrite the whole context at every step compresses it into a lossy summary and craters accuracy (see below).
ACE fixes both. It treats context as an evolving playbook of small, itemized bullets that accumulate, refine, and organize strategies over time, through a modular Generator → Reflector → Curator loop with incremental delta updates and a grow-and-refine mechanism. The result: comprehensive, scalable, self-improving context, with low overhead.
This repository is a faithful, dependency-light, fully tested implementation you can use in a couple of commands and a few lines of code.
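To make the "playbook of itemized bullets" idea concrete, here is a minimal, self-contained sketch. The `Bullet` and `Playbook` classes below are illustrative stand-ins written for this example, not the package's own classes; their fields and the rendered format are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Bullet:
    """One atomic lesson with a stable id and usage counters (illustrative)."""
    id: str
    section: str
    text: str
    helpful: int = 0
    harmful: int = 0


@dataclass
class Playbook:
    """Bullets keyed by id; insertion order is preserved."""
    bullets: dict[str, Bullet] = field(default_factory=dict)

    def add(self, bullet: Bullet) -> None:
        self.bullets[bullet.id] = bullet

    def render(self) -> str:
        """Group bullets by section and render them as prompt text."""
        sections: dict[str, list[Bullet]] = {}
        for b in self.bullets.values():
            sections.setdefault(b.section, []).append(b)
        lines = []
        for name, items in sections.items():
            lines.append(f"## {name}")
            lines.extend(f"- [{b.id}] {b.text}" for b in items)
        return "\n".join(lines)


pb = Playbook()
pb.add(Bullet("b1", "verification", "Verify identity before cancelling an order."))
pb.add(Bullet("b2", "formatting", "Answer in at most two sentences."))
pb.add(Bullet("b3", "verification", "Ask for the order id if it is missing."))
print(pb.render())
```

Because each lesson is its own item with a stable id, a later edit can touch one bullet without rewriting the rest, which is the property the monolithic-rewrite approach lacks.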
Why ACE
| | Prompt optimizers (GEPA, MIPRO) | Monolithic memory (full rewrite) | ACE |
|---|---|---|---|
| Keeps domain detail | ❌ brevity bias | ⚠️ erodes over time | ✅ accumulates |
| Survives long horizons | ⚠️ | ❌ context collapse | ✅ incremental deltas |
| Update cost | 🐢 full re-optimization | 🐢 full re-ingest each step | ⚡ tiny deltas, non-LLM merge |
| Works without labels | ⚠️ | ? | ✅ execution feedback |
| Interpretable / editable | ⚠️ | ⚠️ | ✅ inspectable bullets |
Quickstart
git clone https://github.com/rrahimi-uci/agentic-context-engineering && cd agentic-context-engineering
pip install -e . # core library (numpy + rich only)
Run the headline comparison; no API key is required (it uses a deterministic, offline teaching environment):
ace demo --html report.html
| Method | Accuracy | Playbook | Note |
|---|---|---|---|
| Base LLM (no context) | 44.4% | 0 | - |
| ACE (offline → eval) | 83.3% | 5 | +38.9 pts |
| Monolithic rewrite (online) | 72.2% | 4 | 2 collapses |
| ACE (online) | 83.3% | 6 | no collapse |
Watch a run adapt live in your terminal:
ace run # animated dashboard: playbook growth, accuracy, deltas
...or in ~10 lines of Python
from ace import ACE, SimulatedLLM, TeachingEnvironment, build_teaching_task
from ace.baselines import StaticAgent
env = TeachingEnvironment()
task = build_teaching_task()
train, test = task.split()
base = StaticAgent(SimulatedLLM(env)).run(test) # no learning
ace = ACE(SimulatedLLM(env))
ace.adapt_offline(train) # build a playbook from feedback
result = ace.evaluate(test) # measure on held-out data
print(f"Base {base.accuracy:.0f}% → ACE {result.accuracy:.0f}%")
print(ace.playbook.render()) # human-readable playbook
Use it with the OpenAI Agents SDK
ACE plugs into the OpenAI Agents SDK as a self-improving memory. The playbook is injected into your agent's instructions on every run; after each task you hand back feedback (a label or just a natural execution signal) and ACE grows the playbook.
pip install "ace-playbook[all]" # adds openai + openai-agents (SDK needs Python 3.10+)
export OPENAI_API_KEY=sk-...
One call wraps your agent so that it learns: `wrap_agent` builds the ACE engine,
loads a saved playbook if present, and persists what it learns:
from agents import Agent
from ace import wrap_agent # one top-level import
agent = wrap_agent(
    Agent(name="Support", instructions="You are a concise support agent."),
    model="gpt-4o-mini",
    playbook="support_memory.json",  # load if it exists; save target for .save()
)
# Run + learn from execution feedback; no ground-truth labels needed:
out = agent.run_and_learn(
    "Cancel order #C99",
    signal="Policy: cancellation requires identity verification first.",
)
print(out.output)
print(agent.playbook.render()) # the agent just wrote itself a rule
agent.save() # learned memory survives a restart
You don't have to think about the internals, but they're all there:
- Auto-learn from tool errors: a `RunHooks` listener records each run; if a tool fails and you pass no explicit feedback, that error becomes the signal.
- Rich trajectories: tool calls/outputs/messages are captured via the SDK's typed run-items, so the Reflector learns from what actually happened.
- Tracing: the learning step is emitted as an `ace.learn` span next to the agent run in the OpenAI trace UI.
- Async: inside an event loop (FastAPI, notebooks), use the same-semantics async entry points: `await agent.arun_and_learn("Cancel #C99", signal="...")`.
- Streaming: `await agent.arun_streamed_and_learn(query, on_event=...)`, or `agent.stream(query)` for full control over `stream_events()`.
- Sessions are orthogonal: ACE memory is cross-task learned strategy; the SDK's `session=` is within-conversation history. Pass a session straight through any run: `agent.run_and_learn(q, session=my_session, signal=...)`.
Need to share one engine across agents, use a non-OpenAI backend, or pass dynamic
(callable) base instructions? Drop down to `ACEAgent(base, ace=...)` directly;
`wrap_agent` is just the batteries-included wrapper around it. A runnable
end-to-end example lives in `examples/04_openai_agents.py`.
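The auto-learn behaviour in the list above (explicit feedback first, a recorded tool error otherwise) comes down to a simple precedence rule. The sketch below illustrates that rule only; `pick_signal` is a hypothetical helper written for this example and is not part of the OpenAI Agents SDK or of this package.

```python
from typing import Optional


def pick_signal(explicit: Optional[str], tool_errors: list[str]) -> Optional[str]:
    """Choose the feedback text for one run (hypothetical helper)."""
    if explicit:
        # Caller-supplied feedback always wins.
        return explicit
    if tool_errors:
        # Otherwise fall back to whatever tool failures were recorded.
        return "Tool error: " + "; ".join(tool_errors)
    # Nothing to learn from on this run.
    return None


print(pick_signal("Policy: verify identity first.", ["timeout"]))
print(pick_signal(None, ["lookup_order failed: unknown id C99"]))
print(pick_signal(None, []))
```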
Use it on your own task
Two extension points make ACE general-purpose: bring your own `Task` and your
own feedback (no ground-truth labels required):
from ace import ACE, Feedback, Sample, Task, OpenAILLM
# my_score and run_my_checks are placeholders for your own functions
my_task = Task(name="my-domain", samples=[Sample(id="1", question="...")],
               evaluate=lambda pred, s: my_score(pred, s))
def my_feedback(sample, generation) -> Feedback:
    # plug in execution signals, a reward fn, or an LLM judge; your call
    ok = run_my_checks(generation.answer)
    return Feedback(correct=ok, signal="tests passed" if ok else "tests FAILED")
ace = ACE(OpenAILLM(model="gpt-4o-mini"))
ace.adapt_online(my_task, feedback_fn=my_feedback) # learns from YOUR signals
See `examples/05_custom_task.py` (runs offline). By default, the Curator calls the LLM to
propose `ADD`/`UPDATE`/`REMOVE` edits (a deterministic fallback never
drops a lesson); force deterministic curation with `ACEConfig(curator_use_llm=False)`.
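The text above does not say how the deterministic fallback works. One plausible reading, sketched here purely as an assumption, is that when the LLM's proposed edits cannot be parsed or validated, every reflected insight is appended as its own `ADD`, so no lesson is lost. The `curate` function and the JSON shape of the edits are hypothetical.

```python
import json

VALID_OPS = {"ADD", "UPDATE", "REMOVE"}


def curate(insights: list[str], llm_reply: str) -> list[dict]:
    """Return LLM-proposed edits if they parse; otherwise one ADD per insight."""
    try:
        ops = json.loads(llm_reply)
    except json.JSONDecodeError:
        ops = None
    well_formed = isinstance(ops, list) and all(
        isinstance(op, dict) and op.get("op") in VALID_OPS for op in ops
    )
    if well_formed:
        return ops
    # Fallback (assumed behaviour): keep every insight as a new bullet.
    return [{"op": "ADD", "text": text} for text in insights]


insights = ["Verify identity before cancelling an order."]
print(curate(insights, '[{"op": "UPDATE", "id": "b1", "text": "Verify identity first."}]'))
print(curate(insights, "Sorry, I cannot produce JSON right now."))
```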
How it works
flowchart LR
Q([Query]) --> G[Generator]
PB[(Context Playbook)] -. injected .-> G
G -->|trajectory + bullet usage| R[Reflector]
FB([Feedback: labels or execution signal]) --> R
R -->|insights, iterative refinement| C[Curator]
C -->|delta items| M{{Deterministic Merge - non-LLM}}
M --> PB
M --> GR[Grow & Refine: dedupe / prune]
GR --> PB
classDef role fill:#1e293b,color:#fff;
classDef store fill:#2563eb,color:#fff;
classDef det fill:#16a34a,color:#fff;
class G,R,C role;
class PB store;
class M,GR det;
- Generator solves the query using the current playbook, flagging which bullets helped or misled.
- Reflector critiques the trajectory against feedback and distills concrete, reusable insights (optionally over several refinement rounds).
- Curator turns insights into a few delta operations (`ADD`/`UPDATE`/`REMOVE`).
- Deterministic merge applies those edits to the playbook: no LLM, no rewrite, no collapse.
- Grow-and-refine de-duplicates (semantic or lexical) and prunes consistently harmful bullets.
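The merge step in the list above is plain data manipulation, which is why it needs no LLM. Here is a minimal sketch under stated assumptions: the `Delta` record, its field names, and the dict-based playbook are illustrative and do not reproduce the code in `delta.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Delta:
    op: str                     # "ADD", "UPDATE", or "REMOVE"
    bullet_id: str
    text: Optional[str] = None  # unused for REMOVE


def merge(playbook: dict[str, str], deltas: list[Delta]) -> dict[str, str]:
    """Apply deltas in order to a copy; untouched bullets carry over unchanged."""
    result = dict(playbook)
    for d in deltas:
        if d.op == "ADD":
            # An ADD never overwrites an existing bullet.
            result.setdefault(d.bullet_id, d.text or "")
        elif d.op == "UPDATE":
            # Updates to unknown ids are ignored rather than created.
            if d.bullet_id in result and d.text is not None:
                result[d.bullet_id] = d.text
        elif d.op == "REMOVE":
            result.pop(d.bullet_id, None)
        else:
            raise ValueError(f"unknown op: {d.op}")
    return result


playbook = {"b1": "Verify identity before cancelling.", "b2": "Be concise."}
deltas = [
    Delta("ADD", "b3", "Ask for the order id if it is missing."),
    Delta("UPDATE", "b1", "Verify identity before cancelling or refunding."),
    Delta("REMOVE", "b2"),
]
merged = merge(playbook, deltas)
print(merged)
```

Only the bullets named in a delta change; everything else is copied verbatim, so nothing can be summarized away by accident.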
ACE runs in two regimes: multi-epoch offline optimization and sequential online test-time adaptation (which can be warm-started from an offline playbook):
flowchart LR
subgraph Offline["Offline: system-prompt optimization"]
TR[(Train split)] --> EP{Multi-epoch}
EP --> ST[ACE.step] --> EP
EP --> PBO[(Playbook)]
end
subgraph Online["Online: test-time memory"]
S[Next sample] --> PR[predict] --> LE[learn] --> S
end
PBO -. optional warm start .-> Online
classDef store fill:#2563eb,color:#fff;
class PBO store;
Full diagrams (roles, bullet lifecycle, grow-and-refine, feedback regimes, data model; 14 in total) live in ARCHITECTURE.md and on the docs site.
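The control flow of the two regimes can be sketched with a toy learner. Everything here is a stand-in: `predict` and `learn` simply memorize answers in place of the Generator and the Reflector/Curator/merge steps, and the function names are chosen for this example rather than taken from `engine.py`.

```python
Sample = tuple[str, str]   # (question, expected answer)
Memory = dict[str, str]    # toy stand-in for a playbook


def predict(memory: Memory, question: str) -> str:
    return memory.get(question, "unknown")


def learn(memory: Memory, question: str, prediction: str, expected: str) -> None:
    # Stand-in for reflect + curate + merge: record one lesson on failure.
    if prediction != expected:
        memory[question] = expected


def adapt_offline(memory: Memory, train: list[Sample], epochs: int = 2) -> None:
    """Offline: several passes over a fixed train split."""
    for _ in range(epochs):
        for question, answer in train:
            learn(memory, question, predict(memory, question), answer)


def adapt_online(memory: Memory, stream: list[Sample]) -> float:
    """Online: predict first, then learn from feedback; returns accuracy."""
    correct = 0
    for question, answer in stream:
        prediction = predict(memory, question)
        correct += prediction == answer
        learn(memory, question, prediction, answer)
    return correct / len(stream)


train = [("2+2", "4"), ("capital of France", "Paris")]
stream = [("2+2", "4"), ("3+3", "6"), ("3+3", "6")]

cold = adapt_online({}, stream)      # starts with empty memory

warm_memory: Memory = {}
adapt_offline(warm_memory, train)    # optional warm start
warm = adapt_online(warm_memory, stream)

print(f"cold start: {cold:.2f}, warm start: {warm:.2f}")
```

In the online loop each sample is scored before the learner sees its feedback, so the reported accuracy reflects what was known at prediction time.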
Results
Reproducible, in this repo (offline teaching environment, no API key)
These come straight from the bundled examples (examples/*.py) and are fully deterministic:
| Demo | Base LLM | ACE | Δ |
|---|---|---|---|
| Quickstart (offline → held-out eval) | 44.4% | 83.3% | +38.9 pts |
| Context-collapse benchmark (online) | 41.7% | 88.3% | +46.6 pts |
| Offline warmup + online | 34.5% | 96.6% | +62.1 pts |
In the context-collapse demo, the monolithic-rewrite baseline collapses its context
7× and stalls at 60.0%, while ACE never collapses. ACE ingests 94.9% fewer adaptation
tokens than full re-ingestion (deltas are tiny). Generate the visual report with
`ace demo --html report.html` → sample report.
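For readers checking the arithmetic, the change column is a difference in percentage points, while the token figure is a relative change. The snippet below recomputes the former from the table and shows the formula for the latter; the token counts in it are hypothetical and were chosen only so the formula yields a familiar-looking figure.

```python
rows = {
    "Quickstart (offline -> held-out eval)": (44.4, 83.3),
    "Context-collapse benchmark (online)": (41.7, 88.3),
    "Offline warmup + online": (34.5, 96.6),
}

# Percentage-point gain: ACE accuracy minus base accuracy.
gains = {name: round(ace - base, 1) for name, (base, ace) in rows.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f} pts")


def relative_change(new: float, old: float) -> float:
    """Relative change in percent; a negative value is a reduction."""
    return (new - old) / old * 100.0


# Hypothetical token counts, for illustration only.
delta_tokens, full_tokens = 51, 1000
print(f"token ingestion: {relative_change(delta_tokens, full_tokens):+.1f}%")
```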
Reported in the paper (real benchmarks, DeepSeek-V3.1)
| Benchmark | Baseline | + ACE |
|---|---|---|
| AppWorld (agent, avg) | 42.4% (ReAct) | 59.5% (+17.1) |
| FiNER (financial NER) | 70.7% | 78.3% |
| Formula (financial reasoning) | 67.5% | 85.5% |
| Adaptation latency (offline AppWorld) | - | −86.9% |
| Token cost (online FiNER) | - | −83.6% |
On the AppWorld leaderboard, ReAct+ACE with an open-source model matches the top-ranked production GPT-4.1 agent and surpasses it on the harder test-challenge split. (Numbers above are from the paper; this repo reproduces the mechanism and its qualitative behavior offline.)
What's in the box
ace/
├── playbook.py      # Bullet + Playbook: the evolving, sectioned context
├── delta.py         # incremental ADD/UPDATE/REMOVE + deterministic merge
├── roles.py         # Generator · Reflector · Curator (+ prompts)
├── refine.py        # grow-and-refine: semantic dedupe + harmful pruning
├── engine.py        # ACE orchestrator: offline / online adaptation
├── llm.py           # LLM protocol · OpenAILLM · deterministic SimulatedLLM
├── feedback.py      # labeled or label-free execution feedback
├── tasks.py         # Sample/Task + offline TeachingEnvironment
├── baselines.py     # StaticAgent + MonolithicRewriteAgent (context collapse)
├── visualize.py     # live terminal dashboard + self-contained HTML report
├── integrations/
│   └── openai_agents.py  # wrap_agent / ACEAgent: drop-in self-improving memory
└── cli.py           # `ace demo | run | playbook | version`
cookbook/            # 10 guided recipes (7 need no API key) + tests
examples/            # 5 runnable demos (4 need no API key)
tests/               # 148 tests, run in <1s, zero network
Develop & test
pip install -e ".[dev]"
pytest # 148 tests, fully offline, ~1s
python examples/01_quickstart.py
python examples/02_context_collapse.py # writes ace_report.html
The bundled `SimulatedLLM` + `TeachingEnvironment` make every demo and test
deterministic and key-free, so the ACE control loop is exercised end-to-end
in CI. Swap in `OpenAILLM` for real models and benchmarks; the algorithm and
prompts are unchanged.
Key concepts (glossary)
- Playbook: the evolving context, a set of itemized bullets grouped into sections.
- Bullet: one atomic lesson with a stable id and `helpful`/`harmful` counters.
- Delta update: a small, localized batch of `ADD`/`UPDATE`/`REMOVE` edits (vs. a full rewrite).
- Grow-and-refine: append new bullets, update existing ones in place, semantically de-duplicate, prune harmful ones.
- Generator / Reflector / Curator: the three specialized roles of the ACE loop.
- Offline vs. online: multi-epoch optimization on a train split vs. sequential test-time adaptation.
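As an illustration of the grow-and-refine entry, the sketch below de-duplicates bullets lexically and prunes harmful ones using their counters. It is a simplified stand-in, not the logic in `refine.py`: Jaccard token overlap replaces semantic similarity, and the `sim_threshold` and `harm_margin` parameters are assumptions made for this example.

```python
from dataclasses import dataclass, replace


@dataclass
class Bullet:
    id: str
    text: str
    helpful: int = 0
    harmful: int = 0


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def refine(bullets: list[Bullet], sim_threshold: float = 0.8,
           harm_margin: int = 2) -> list[Bullet]:
    kept: list[Bullet] = []
    for b in bullets:
        # Prune bullets that misled clearly more often than they helped.
        if b.harmful - b.helpful >= harm_margin:
            continue
        dup = next((k for k in kept if jaccard(k.text, b.text) >= sim_threshold), None)
        if dup is not None:
            # Fold the duplicate's counters into the bullet already kept.
            dup.helpful += b.helpful
            dup.harmful += b.harmful
            continue
        kept.append(replace(b))  # copy, so the input list is not mutated
    return kept


bullets = [
    Bullet("b1", "Verify identity before cancelling an order", helpful=3),
    Bullet("b2", "verify identity before cancelling an order first", helpful=1),
    Bullet("b3", "Always cancel immediately", harmful=4),
    Bullet("b4", "Ask for the order id if it is missing", helpful=2),
]
refined = refine(bullets)
print([(b.id, b.helpful, b.harmful) for b in refined])
```

Here `b2` is folded into `b1` as a near-duplicate and `b3` is pruned, leaving two bullets.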
Citation
@inproceedings{zhang2026ace,
title = {Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models},
author = {Zhang, Qizheng and Hu, Changran and Upasani, Shubhangi and others},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
url = {https://arxiv.org/abs/2510.04618}
}
This implementation is an independent, open-source reproduction for research and educational use. All credit for the ACE method belongs to the original authors.
License
MIT. Contributions welcome; see CONTRIBUTING.md.
pip install → a few lines → a playbook that gets better with every task.