ICRL
In-Context Reinforcement Learning for LLM Agents
ICRL implements the In-Context Reinforcement Learning algorithm, enabling LLM agents to bootstrap their own performance by learning from successful trajectories. The agent accumulates successful experiences and retrieves relevant examples at each decision point to improve future task completion.
Installation
Install from PyPI
pip install icrl-py
# or with uv
uv add icrl-py
Install from source
git clone https://github.com/SuperAce100/icrl.git
cd icrl
# Create & activate a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows (PowerShell)
# Install in editable mode
pip install -e .
If you use uv:
git clone https://github.com/SuperAce100/icrl.git
cd icrl
uv sync
# or: uv pip install -e .
Verify the install:
python -c "import icrl; print(icrl.__version__)"
# If the CLI entrypoint is installed:
icrl --help
Dependencies: pydantic, litellm, sentence-transformers, faiss-cpu, aiofiles, rich, python-dotenv
Quick Start
import asyncio
from icrl import Agent, LiteLLMProvider
# Create the agent
agent = Agent(
llm=LiteLLMProvider(model="gpt-4o-mini"),
db_path="./trajectories",
plan_prompt="Goal: {goal}\n\nExamples:\n{examples}\n\nCreate a plan:",
reason_prompt="Goal: {goal}\nPlan: {plan}\nObservation: {observation}\nThink step by step:",
act_prompt="Goal: {goal}\nPlan: {plan}\nReasoning: {reasoning}\nNext action:",
k=3, # number of examples to retrieve
max_steps=30, # max steps per episode
)
# env is any object implementing the Environment protocol (see below)
# Training: successful trajectories are stored for future use
trajectory = asyncio.run(agent.train(env, goal="Complete the task"))
# Inference: uses stored examples but doesn't add new ones
trajectory = asyncio.run(agent.run(env, goal="Complete another task"))
Core Concepts
The ICRL Algorithm
- Bootstrap Phase: The agent attempts tasks, storing successful trajectories
- Retrieval: At each decision point, semantically similar examples are retrieved
- Generation: The LLM generates plans/reasoning/actions informed by examples
- Curation: Low-utility trajectories are automatically pruned over time
ReAct Loop
Each episode follows a Plan → Reason → Act loop:
┌─────────────────────────────────────────────────────────┐
│ 1. PLAN: Generate high-level strategy using examples │
├─────────────────────────────────────────────────────────┤
│ 2. REASON: Analyze observation with retrieved context │
├─────────────────────────────────────────────────────────┤
│ 3. ACT: Execute action based on reasoning │
├─────────────────────────────────────────────────────────┤
│ 4. OBSERVE: Get environment feedback │
│ └─→ Loop back to REASON until done │
└─────────────────────────────────────────────────────────┘
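In code, one episode looks roughly like the sketch below. This is illustrative only; the actual loop lives in `icrl/loop.py`, and `run_episode` and the `prompts` dict are names invented for this example:

```python
from icrl import Message, Step, Trajectory

async def run_episode(env, llm, db, goal: str, prompts: dict[str, str],
                      k: int = 3, max_steps: int = 30) -> Trajectory:
    """Illustrative sketch of one ICRL episode (not the real loop.py code)."""
    # Retrieve semantically similar past successes and format them for prompts
    examples = db.search(goal, k=k)
    example_str = "\n\n".join(t.to_example_string() for t in examples)
    observation = env.reset(goal)

    # 1. PLAN: generated once per episode, conditioned on retrieved examples
    plan = await llm.complete([Message(
        role="user",
        content=prompts["plan"].format(goal=goal, examples=example_str))])

    steps: list[Step] = []
    done = success = False
    while not done and len(steps) < max_steps:
        # 2. REASON: analyze the current observation
        reasoning = await llm.complete([Message(
            role="user",
            content=prompts["reason"].format(
                goal=goal, plan=plan, observation=observation))])
        # 3. ACT: choose the next action from the reasoning
        action = await llm.complete([Message(
            role="user",
            content=prompts["act"].format(
                goal=goal, plan=plan, reasoning=reasoning))])
        # 4. OBSERVE: execute the action, then loop back to REASON until done
        observation, done, success = env.step(action)
        steps.append(Step(observation=observation,
                          reasoning=reasoning, action=action))

    return Trajectory(goal=goal, plan=plan, steps=steps, success=success)
```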
API Reference
Agent
The main class for training and running the ICRL agent.
from icrl import Agent
agent = Agent(
llm: LLMProvider, # LLM for generating completions
db_path: str, # Path to trajectory database
plan_prompt: str, # Template with {goal}, {examples}
reason_prompt: str, # Template with {goal}, {plan}, {observation}, {history}, {examples}
act_prompt: str, # Template with {goal}, {plan}, {reasoning}, {observation}, {history}, {examples}
k: int = 3, # Number of examples to retrieve
max_steps: int = 30, # Maximum steps per episode
seed_trajectories: list[Trajectory] | None = None, # Initial examples
on_step: Callable[[Step, StepContext], None] | None = None, # Step callback
curation_threshold: float = 0.3, # Utility threshold for pruning
curation_min_retrievals: int = 5, # Min retrievals before pruning
verify_trajectory: Callable[[Trajectory], bool] | None = None, # Verification callback
)
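For example, `verify_trajectory` can gate what gets stored after a successful episode. A hypothetical check that only keeps trajectories ending in an explicit terminal action:

```python
from icrl import Trajectory

def ends_cleanly(trajectory: Trajectory) -> bool:
    # Hypothetical rule: only store trajectories whose final action is "done"
    return bool(trajectory.steps) and trajectory.steps[-1].action.strip() == "done"

agent = Agent(
    ...,
    verify_trajectory=ends_cleanly,
)
```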
Methods
| Method | Description |
|---|---|
| `await agent.train(env, goal)` | Run a training episode; store the trajectory if successful (with optional verification) |
| `await agent.run(env, goal)` | Run an inference episode (database frozen) |
| `agent.train_sync(env, goal)` | Synchronous wrapper for `train` |
| `agent.run_sync(env, goal)` | Synchronous wrapper for `run` |
| `await agent.train_batch(env_factory, goals)` | Train on multiple goals |
| `await agent.run_batch(env_factory, goals)` | Run inference on multiple goals |
| `agent.get_stats()` | Get database statistics |
| `agent.database` | Access the underlying `TrajectoryDatabase` |
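A typical session pairs training with a quick stats check (a sketch; `env` is any Environment implementation, and the exact shape of the stats dict is not specified here):

```python
trajectory = agent.train_sync(env, goal="Find the config file")
if trajectory.success:
    print(agent.get_stats())  # database statistics
print(len(agent.database.get_all()), "trajectories stored")
```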
LiteLLMProvider
Built-in LLM provider supporting 100+ models via LiteLLM.
from icrl import LiteLLMProvider
llm = LiteLLMProvider(
model: str = "gpt-4o-mini", # Model identifier
temperature: float = 0.7, # Sampling temperature
max_tokens: int | None = None, # Max tokens (None for model default)
**kwargs, # Additional LiteLLM arguments
)
Supported models include:
- OpenAI: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-3.5-turbo`
- Anthropic: `claude-3-5-sonnet-20241022`, `claude-3-opus-20240229`
- Google: `gemini/gemini-pro`, `gemini/gemini-1.5-pro`
- Azure, Cohere, Replicate, and many more
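For example, to target Claude through LiteLLM (assumes `ANTHROPIC_API_KEY` is set in your environment):

```python
llm = LiteLLMProvider(model="claude-3-5-sonnet-20241022", temperature=0.2)
```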
Environment Protocol
Implement this protocol for your custom environment:
from icrl import Environment
class MyEnvironment:
    def reset(self, goal: str) -> str:
        """Reset environment and return initial observation.

        Args:
            goal: The goal description for this episode.

        Returns:
            Initial observation as a string.
        """
        self._goal = goal  # Store for use in step()
        return "Initial state description"

    def step(self, action: str) -> tuple[str, bool, bool]:
        """Execute an action.

        Args:
            action: The action string to execute.

        Returns:
            Tuple of (observation, done, success):
            - observation: Result of the action
            - done: Whether episode has ended
            - success: Whether goal was achieved
        """
        # Execute action and check if goal is met (placeholders)
        observation = execute(action)
        success = check_goal(self._goal)
        done = success or max_steps_reached
        return observation, done, success
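As a minimal concrete implementation, here is a toy environment (invented for illustration) that succeeds once the agent opens the right file:

```python
class ToyFileEnv:
    """Toy environment: the goal is met once the agent runs `open config.yaml`."""

    def reset(self, goal: str) -> str:
        self._goal = goal
        self._steps = 0
        return "You are in a directory containing: config.yaml, notes.txt"

    def step(self, action: str) -> tuple[str, bool, bool]:
        self._steps += 1
        success = action.strip() == "open config.yaml"
        done = success or self._steps >= 5  # cap episodes at 5 steps
        observation = "Opened config.yaml" if success else f"Nothing happened: {action}"
        return observation, done, success
```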
LLMProvider Protocol
Implement for custom LLM integrations:
from icrl import LLMProvider, Message
class MyLLMProvider:
    async def complete(self, messages: list[Message]) -> str:
        """Generate completion from messages.

        Args:
            messages: List of Message(role, content) objects.

        Returns:
            Generated text as a string.
        """
        # Call your LLM here (placeholder)
        return await my_llm_call(messages)
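As a concrete sketch, a provider that calls the OpenAI API directly might look like this (assumes the `openai` package is installed; `OpenAIProvider` is not part of icrl):

```python
from openai import AsyncOpenAI

from icrl import Message

class OpenAIProvider:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    async def complete(self, messages: list[Message]) -> str:
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": m.role, "content": m.content} for m in messages],
        )
        return response.choices[0].message.content or ""
```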
Data Models
All models are Pydantic BaseModel classes for type safety and serialization.
Trajectory
A complete episode trajectory:
from icrl import Trajectory, Step
trajectory = Trajectory(
id: str, # Auto-generated UUID
goal: str, # Goal description
plan: str, # Generated plan
steps: list[Step], # List of steps taken
success: bool, # Whether goal was achieved
metadata: dict[str, Any], # Custom metadata
)
# Convert to example string for prompts
example_str = trajectory.to_example_string()
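Because `Trajectory` is a Pydantic model, it round-trips to JSON (assuming Pydantic v2; on v1 the equivalents are `.json()` and `.parse_raw()`):

```python
json_str = trajectory.model_dump_json()
restored = Trajectory.model_validate_json(json_str)
assert restored.goal == trajectory.goal
```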
Step
A single step in a trajectory:
from icrl import Step
step = Step(
observation: str, # What the agent observed
reasoning: str, # Agent's reasoning
action: str, # Action taken
)
StepContext
Context available during prompt formatting:
from icrl import StepContext
context = StepContext(
goal: str,
plan: str,
observation: str,
reasoning: str = "",
history: list[Step] = [],
examples: list[Trajectory] = [],
)
# Format for prompts
context.format_examples() # → "Goal: ...\nPlan: ...\nSteps: ..."
context.format_history() # → "Step 1: action -> observation\n..."
Message
A chat message:
from icrl import Message
message = Message(role="user", content="Hello")
Prompt Templates
Prompts use Python format strings with these placeholders:
| Placeholder | Available In | Description |
|---|---|---|
| `{goal}` | All prompts | The current goal |
| `{examples}` | All prompts | Formatted retrieved trajectories |
| `{plan}` | reason, act | The generated plan |
| `{observation}` | reason, act | Current observation |
| `{reasoning}` | act | Generated reasoning |
| `{history}` | reason, act | Previous steps in the episode |
Example Prompts
PLAN_PROMPT = """You are a helpful agent.
Goal: {goal}
Here are examples of similar tasks that were completed successfully:
{examples}
Create a step-by-step plan to accomplish the goal."""
REASON_PROMPT = """Goal: {goal}
Plan: {plan}
Previous steps:
{history}
Current observation:
{observation}
Examples of similar situations:
{examples}
Think step by step about what you observe and what to do next."""
ACT_PROMPT = """Goal: {goal}
Plan: {plan}
Steps so far:
{history}
Current observation: {observation}
Your reasoning: {reasoning}
What is the next action? Respond with only the action."""
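These templates plug straight into the Agent constructor:

```python
agent = Agent(
    llm=LiteLLMProvider(model="gpt-4o-mini"),
    db_path="./trajectories",
    plan_prompt=PLAN_PROMPT,
    reason_prompt=REASON_PROMPT,
    act_prompt=ACT_PROMPT,
)
```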
Step Callbacks
Monitor agent progress with step callbacks:
from icrl import Step, StepContext
def my_callback(step: Step, context: StepContext) -> None:
    print(f"Observation: {step.observation[:100]}...")
    print(f"Reasoning: {step.reasoning}")
    print(f"Action: {step.action}")
    print(f"Using {len(context.examples)} examples")
agent = Agent(
...,
on_step=my_callback,
)
Trajectory Database
The agent stores trajectories on disk with FAISS-based semantic search.
# Access the database directly
db = agent.database
# Search for similar trajectories
similar = db.search("find config files", k=3)
# Get all trajectories
all_trajs = db.get_all()
# Get a specific trajectory
traj = db.get("trajectory-id")
# Remove a trajectory
db.remove("trajectory-id")
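Search results are plain `Trajectory` objects, so they can be formatted for prompts directly:

```python
# Build a prompt-ready examples block from the top matches
matches = db.search("find config files", k=3)
examples_block = "\n\n".join(t.to_example_string() for t in matches)
```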
Database Structure
./trajectories/
├── trajectories/
│ ├── <uuid-1>.json
│ ├── <uuid-2>.json
│ └── ...
├── index.faiss # FAISS vector index
├── index_ids.json # ID mapping
└── curation.json # Utility tracking
Curation
The agent automatically prunes low-utility trajectories. A trajectory is pruned when:
- It has been retrieved at least `min_retrievals` times
- Its utility score (success rate when used) falls below `threshold`
agent = Agent(
...,
curation_threshold=0.3, # Prune if utility < 30%
curation_min_retrievals=5, # After at least 5 retrievals
)
Advanced Usage
Seed Trajectories
Initialize with pre-existing examples:
from icrl import Trajectory, Step
seed = Trajectory(
goal="Example task",
plan="1. Do A\n2. Do B",
steps=[
Step(observation="Started", reasoning="Need to do A", action="do_a"),
Step(observation="A done", reasoning="Now do B", action="do_b"),
],
success=True,
)
agent = Agent(
...,
seed_trajectories=[seed],
)
Batch Training
Train on multiple tasks efficiently:
def make_env():
    return MyEnvironment()

goals = ["Task 1", "Task 2", "Task 3"]

# Training mode (await inside an async function) - learns from each successful episode
trajectories = await agent.train_batch(make_env, goals)
# Inference mode - database frozen
trajectories = await agent.run_batch(make_env, goals)
Custom Embeddings
The database uses sentence-transformers with `all-MiniLM-L6-v2` by default (as used in the paper). For custom embeddings, pass a custom embedder when constructing the database:
from icrl.embedder import SentenceTransformerEmbedder
from icrl.database import TrajectoryDatabase
embedder = SentenceTransformerEmbedder(model_name="your-model")
db = TrajectoryDatabase(path="./trajectories", embedder=embedder)
Examples
Demo scripts are in examples/ (see examples/README.md).
Mock/offline verification scripts and test-focused walkthroughs are in tests/
(see tests/README.md).
Minimal OpenAI Demo
export OPENAI_API_KEY=your-key
uv run python examples/basic_openai_demo.py
Minimal Anthropic Demo
export ANTHROPIC_API_KEY=your-key
uv run python examples/basic_anthropic_demo.py
File System Navigation Agent
See examples/demo_with_real_llm.py for a complete example of an agent that navigates a virtual file system:
# Set your API key
export OPENAI_API_KEY=your-key
# Run the demo
uv run python examples/demo_with_real_llm.py
Mock LLM for Testing
Use the mock provider for fast iteration without API calls:
from examples.mock_llm import MockLLMProvider
llm = MockLLMProvider(success_rate=1.0)
agent = Agent(llm=llm, ...)
Run the full offline mock demo:
uv run python tests/test_with_mock.py
Agent API Walkthrough (Offline)
Deterministic walkthrough of Agent APIs:
- `train`/`run`
- `train_sync`/`run_sync`
- `train_batch`/`run_batch`
- `seed_trajectories`
- `verify_trajectory`
uv run python tests/agent_api_walkthrough.py
Database API Walkthrough (Offline)
Deterministic walkthrough of storage/retrieval/curation/validation APIs:
- `TrajectoryDatabase` CRUD/search
- `TrajectoryRetriever`
- `CurationManager`
- `HashEmbedder`
- `extract_code_artifacts` and validation helpers
uv run python tests/database_api_walkthrough.py
Harbor Coding Agent (Terminal-Bench 2.0 Compatible)
See examples/harbor_coding_agent.py for a coding agent example compatible with Harbor and Terminal-Bench 2.0. This demonstrates:
- A sandboxed coding environment with shell commands (ls, cat, grep, sed, etc.)
- Realistic software engineering tasks (debugging, refactoring, testing)
- Performance improvement tracking before/after ICRL training
export OPENAI_API_KEY=your-key
uv run python examples/harbor_coding_agent.py
The Harbor example shows how ICRL improves agent performance on coding tasks:
- Baseline Evaluation: Agent attempts tasks without learned examples
- Training Phase: Agent learns from successful coding task trajectories
- Improved Evaluation: Re-test shows performance gains from trajectory learning
This pattern integrates with Harbor's agent evaluation framework, allowing you to:
- Benchmark coding agents on Terminal-Bench 2.0 tasks
- Use ICRL's self-generated examples to improve agent performance
- Track improvements across training iterations
Architecture
icrl/
├── agent.py # Main Agent class
├── loop.py # ReAct loop implementation
├── database.py # FAISS-backed trajectory storage
├── retriever.py # Semantic example retrieval
├── curation.py # Automatic trajectory pruning
├── embedder.py # Sentence transformer embeddings
├── models.py # Pydantic data models
├── protocols.py # Environment and LLMProvider protocols
└── providers/
└── litellm.py # LiteLLM integration
Reference
This implementation is based on the algorithm described in:
Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks
The key insight is that LLM agents can bootstrap their own performance by:
- Attempting tasks and recording successful trajectories
- Using semantic retrieval to find relevant examples at each decision point
- Automatically curating the example database to retain high-utility examples
License
MIT