# ARISE — Adaptive Runtime Improvement through Self-Evolution

*Self-evolving agent framework — agents that create their own tools.*
Your agent works great on the tasks you planned for. ARISE handles the ones you didn't.
ARISE is a framework-agnostic middleware that gives LLM agents the ability to create their own tools at runtime. When your agent fails at a task, ARISE detects the capability gap, synthesizes a Python tool, validates it in a sandbox, and promotes it to the active library — no human intervention required.
```bash
pip install arise-ai
```

```python
from arise import ARISE
from arise.rewards import task_success

arise = ARISE(
    agent_fn=my_agent,      # any (task, tools) -> str function
    reward_fn=task_success,
    model="gpt-4o-mini",    # cheap model for tool synthesis
)

result = arise.run("Fetch all users from the paginated API")
# Agent fails → ARISE synthesizes fetch_all_paginated tool → agent succeeds
```
## How It Works

```mermaid
flowchart TD
    A["Agent receives task"] --> B["Execute with current tools"]
    B --> C{"Success?"}
    C -- "Yes (reward ≥ 0.5)" --> D["Log trajectory, continue"]
    C -- "No (reward < 0.5)" --> E["Log failure trajectory"]
    E --> F{"Enough failures?"}
    F -- No --> D
    F -- Yes --> G["Detect capability gaps"]
    G --> H["Synthesize new tool via LLM"]
    H --> I["Test in sandbox + adversarial validation"]
    I --> J{"Pass?"}
    J -- Yes --> K["Promote to active library"]
    J -- No --> L["Refine and retry"]
    L --> H
    K --> A
    style G fill:#f9d71c,color:#000
    style H fill:#f9d71c,color:#000
    style I fill:#f9d71c,color:#000
    style K fill:#4caf50,color:#fff
```
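The loop above can be sketched in plain Python. This is an illustrative toy, not the real ARISE internals: `detect_gap`, `synthesize_tool`, and `sandbox_test` are hypothetical stand-ins for the gap detector, synthesizer, and validation pipeline.

```python
from dataclasses import dataclass

REWARD_THRESHOLD = 0.5   # matches the flowchart's success criterion
FAILURE_THRESHOLD = 3    # failures needed before evolution triggers

@dataclass
class Tool:
    name: str
    source: str

def detect_gap(failures):
    # Toy gap detector: a real one would cluster failure trajectories.
    return "paginated_fetch"

def synthesize_tool(gap):
    # Toy synthesizer: a real one would call an LLM.
    return Tool(name=gap, source="def paginated_fetch(): ...")

def sandbox_test(tool):
    # Stand-in for sandbox execution + adversarial validation.
    return True

def evolution_loop(tasks, agent_fn, tools):
    failures = []
    for task in tasks:
        _, reward = agent_fn(task, tools)
        if reward >= REWARD_THRESHOLD:
            continue                      # success: log trajectory, move on
        failures.append(task)
        if len(failures) < FAILURE_THRESHOLD:
            continue                      # not enough evidence of a gap yet
        tool = synthesize_tool(detect_gap(failures))
        if sandbox_test(tool):
            tools[tool.name] = tool       # promote to the active library
            failures.clear()
    return tools

def toy_agent(task, tools):
    # Succeeds only once the needed tool exists.
    return "ok", 1.0 if "paginated_fetch" in tools else 0.0

tools = evolution_loop(["fetch users"] * 4, toy_agent, {})
print(sorted(tools))  # ['paginated_fetch']
```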
## What It Looks Like

```
Episode 1 | FAIL | reward=0.00 | skills=2   Task: "Fetch paginated users with auth"
Episode 2 | FAIL | reward=0.00 | skills=2
Episode 3 | FAIL | reward=0.00 | skills=2

[Evolution triggered — 3 failures on API tasks]
  → Synthesizing 'parse_json_response'... 3/3 tests passed ✓
  → Synthesizing 'fetch_all_paginated'... sandbox fail → refine → 1/1 passed ✓

Episode 4 | OK   | reward=1.00 | skills=4   ← Agent now has the tools it needs
```
## Framework Support

| Framework | Status | How |
|---|---|---|
| Any function | Supported | `ARISE(agent_fn=my_func)` — any `(task, tools) -> str` callable |
| Strands Agents | Supported | `ARISE(agent=strands_agent)` — auto-injects tools alongside your `@tool` functions |
| Raw OpenAI / Anthropic | Supported | Wrap API calls in an `agent_fn` — see `examples/` |
| LangGraph, CrewAI | Planned | v0.2 |
## Core Features

### Self-Evolution Pipeline

The core loop: fail → detect gap → synthesize → test → promote. Tools are synthesized by a cheap LLM (gpt-4o-mini), validated in an isolated sandbox with adversarial testing, and version-controlled in SQLite. Every mutation is checkpointed; roll back anytime.
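To make the checkpoint/rollback idea concrete, here is a minimal sketch of versioned skills in SQLite. The table layout and function names are assumptions for illustration; ARISE's actual schema is not documented here.

```python
import sqlite3

# Hypothetical versioning schema: one row per (skill, version), one active row per skill.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE skill_versions (
        name TEXT, version INTEGER, source TEXT, active INTEGER DEFAULT 0,
        PRIMARY KEY (name, version)
    )
""")

def checkpoint(name, source):
    """Store a new version and mark it active."""
    row = conn.execute(
        "SELECT MAX(version) FROM skill_versions WHERE name = ?", (name,)
    ).fetchone()
    version = (row[0] or 0) + 1
    conn.execute("UPDATE skill_versions SET active = 0 WHERE name = ?", (name,))
    conn.execute("INSERT INTO skill_versions VALUES (?, ?, ?, 1)", (name, version, source))
    return version

def rollback(name, version):
    """Reactivate an earlier version (what `arise rollback <version>` would do)."""
    conn.execute("UPDATE skill_versions SET active = 0 WHERE name = ?", (name,))
    conn.execute(
        "UPDATE skill_versions SET active = 1 WHERE name = ? AND version = ?",
        (name, version),
    )

checkpoint("parse_json_response", "def parse_json_response(r): ...")              # v1
checkpoint("parse_json_response", "def parse_json_response(r, strict=True): ...")  # v2
rollback("parse_json_response", 1)
active = conn.execute("SELECT version FROM skill_versions WHERE active = 1").fetchone()[0]
print(active)  # 1
```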
### Distributed Mode

Decouple agent and evolution for stateless deployments (Lambda, ECS, AgentCore):

```mermaid
flowchart LR
    subgraph Agent["Agent Process (stateless)"]
        A1["Serve requests"]
        A2["Read skills from S3"]
        A3["Report trajectories"]
    end
    subgraph Worker["ARISE Worker"]
        W1["Consume trajectories"]
        W2["Detect gaps & evolve"]
        W3["Promote skills"]
    end
    S3[(S3 Skill Store)]
    SQS[[SQS Queue]]
    A2 --> S3
    A3 --> SQS
    SQS --> W1
    W3 --> S3
```

```python
from arise import create_distributed_arise, ARISEConfig

config = ARISEConfig(
    s3_bucket="my-skills",
    sqs_queue_url="https://sqs.../arise-trajectories",
)
arise = create_distributed_arise(agent_fn=my_agent, reward_fn=task_success, config=config)
```

```bash
pip install arise-ai[aws]  # adds boto3
```
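A trajectory report in this setup is just a serializable record the agent enqueues and the worker consumes. The fields below are a hypothetical message shape for illustration; ARISE's actual wire format may differ.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical trajectory record: the fields are assumptions, not ARISE's schema.
@dataclass
class Trajectory:
    task: str
    reward: float
    tool_calls: list
    error: Optional[str] = None

msg = Trajectory(
    task="Fetch paginated users",
    reward=0.0,
    tool_calls=["http_get"],
    error="KeyError: 'next_page'",
)
payload = json.dumps(asdict(msg))             # what the agent process would enqueue
restored = Trajectory(**json.loads(payload))  # what the worker would consume
print(restored.reward)  # 0.0
```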
### Skill Registry

Share evolved tools across projects — like npm for agent skills:

```python
from arise.registry import SkillRegistry

registry = SkillRegistry(bucket="my-registry")
registry.publish(skill, tags=["json", "parsing"])

# Other projects can pull proven skills
skill = registry.pull("parse_csv")
```

Set `registry_check_before_synthesis=True` in the config and ARISE checks the registry before calling the LLM.
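The registry-first policy is a lookup-before-compute pattern. A minimal sketch, with `registry` and `synthesize` as stand-ins rather than the real API:

```python
# Check the registry for a proven skill before paying for LLM synthesis.
def get_skill(name, registry, synthesize):
    skill = registry.get(name)                # cheap: reuse a published skill
    if skill is not None:
        return skill, "registry"
    return synthesize(name), "synthesized"    # expensive: call the LLM

registry = {"parse_csv": "def parse_csv(text): ..."}
skill, source = get_skill("parse_csv", registry, lambda n: f"def {n}(): ...")
print(source)   # registry
skill2, source2 = get_skill("parse_xml", registry, lambda n: f"def {n}(): ...")
print(source2)  # synthesized
```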
### Multi-Model Routing

Route different synthesis tasks to different models:

```python
config = ARISEConfig(
    model_routes={
        "gap_detection": "gpt-4o-mini",             # cheap
        "synthesis": "claude-sonnet-4-5-20250929",  # expensive, better code
        "refinement": "gpt-4o-mini",
    },
    auto_select_model=True,  # auto-promote best model over time
)
```
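One plausible reading of `auto_select_model` is bandit-style promotion: track each model's validation pass rate and route future synthesis to the best performer. The logic below is an assumption for illustration, not ARISE's actual selection rule.

```python
from collections import defaultdict

# Per-model tallies of sandbox validation outcomes.
stats = defaultdict(lambda: {"passed": 0, "total": 0})

def record(model, passed):
    stats[model]["total"] += 1
    stats[model]["passed"] += int(passed)

def best_model(default="gpt-4o-mini"):
    scored = {m: s["passed"] / s["total"] for m, s in stats.items() if s["total"]}
    return max(scored, key=scored.get) if scored else default

record("gpt-4o-mini", True)
record("gpt-4o-mini", False)
record("claude-sonnet-4-5-20250929", True)
record("claude-sonnet-4-5-20250929", True)
print(best_model())  # claude-sonnet-4-5-20250929 (2/2 vs 1/2)
```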
### Skill A/B Testing

When ARISE evolves a refined skill, it A/B tests it against the original instead of replacing it outright:

```python
# Automatic — ARISE creates A/B tests during evolution.
# Manual — test two versions yourself:
from arise.skills.ab_test import SkillABTest

ab = SkillABTest(skill_a=v1, skill_b=v2, min_episodes=20)
# Winner auto-promoted, loser deprecated after min_episodes
```
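The decision rule can be sketched as: collect per-variant rewards, and once `min_episodes` have accumulated, promote the variant with the higher mean. The class below is a toy stand-in; names and tie-handling are assumptions, not `SkillABTest`'s implementation.

```python
import statistics

class ToyABTest:
    """Toy A/B decision rule: higher mean reward wins after min_episodes."""

    def __init__(self, min_episodes=20):
        self.min_episodes = min_episodes
        self.rewards = {"a": [], "b": []}

    def record(self, variant, reward):
        self.rewards[variant].append(reward)

    def winner(self):
        total = sum(len(v) for v in self.rewards.values())
        if total < self.min_episodes:
            return None  # keep collecting episodes
        means = {k: statistics.mean(v) for k, v in self.rewards.items() if v}
        return max(means, key=means.get)

ab = ToyABTest(min_episodes=4)
for r in (0.2, 0.4):
    ab.record("a", r)   # original skill
for r in (0.9, 1.0):
    ab.record("b", r)   # refined skill
print(ab.winner())  # b
```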
### Incremental Evolution

Patch existing skills instead of re-synthesizing them from scratch:

```python
# ARISE does this automatically during evolution:
# 1. An existing skill fails on specific inputs
# 2. forge.patch() applies a minimal fix
# 3. The patched version is A/B tested against the original
# 4. The winner is promoted
```
### Reward Learning

Learn reward functions from human feedback:

```python
from arise.rewards.learned import LearnedReward

reward = LearnedReward(min_examples=10, persist_path="./feedback")
reward.add_feedback(trajectory, score=0.9)
# Falls back to task_success until enough examples are collected

arise = ARISE(agent_fn=my_agent, reward_fn=reward)
```
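The fallback behavior is easy to picture: use the base reward until `min_examples` labeled trajectories exist, then switch to a learned score. The scorer below (averaging stored human scores) is a deliberately crude stand-in for whatever few-shot model `LearnedReward` actually fits.

```python
class ToyLearnedReward:
    """Toy sketch: base reward until enough feedback, then a learned score."""

    def __init__(self, base_reward, min_examples=10):
        self.base_reward = base_reward
        self.min_examples = min_examples
        self.examples = []  # (trajectory, human score) pairs

    def add_feedback(self, trajectory, score):
        self.examples.append((trajectory, score))

    def __call__(self, trajectory):
        if len(self.examples) < self.min_examples:
            return self.base_reward(trajectory)  # fallback path
        # Crude learned scorer: average of stored human scores.
        return sum(s for _, s in self.examples) / len(self.examples)

reward = ToyLearnedReward(base_reward=lambda t: 1.0, min_examples=2)
print(reward("trace"))  # 1.0 — fallback, no examples yet
reward.add_feedback("trace-1", 0.5)
reward.add_feedback("trace-2", 1.0)
print(reward("trace"))  # 0.75 — learned path
```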
## Safety

Generated code is untrusted. ARISE validates it through multiple layers:

| Layer | What it does |
|---|---|
| Sandbox | Subprocess or Docker isolation with timeouts |
| Test suite | The LLM writes tests alongside the tool |
| Adversarial testing | A separate LLM call tries to break it (edge cases, type boundaries, security) |
| Import restrictions | `allowed_imports` whitelist blocks `subprocess`, `socket`, etc. |
| Promotion gate | Only tools passing all tests become ACTIVE |
| Version control | SQLite checkpoints; `arise rollback <version>` anytime |
| Rate limiting | `max_evolutions_per_hour` caps LLM spend |

See SECURITY.md for the full threat model.
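The import-restriction layer can be implemented by statically walking the generated code's AST before it ever runs. A minimal sketch, assuming a whitelist like the one in the configuration example; this is a plausible implementation, not ARISE's actual validator.

```python
import ast

ALLOWED_IMPORTS = {"json", "re", "hashlib", "csv", "math"}

def imports_allowed(source: str) -> bool:
    """Reject generated code that imports anything outside the whitelist."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        if any(name not in ALLOWED_IMPORTS for name in names):
            return False
    return True

print(imports_allowed("import json\ndef f(x): return json.loads(x)"))        # True
print(imports_allowed("import subprocess\nsubprocess.run(['id'])"))          # False
```

Static checks like this complement, rather than replace, the sandbox: code can still misbehave at runtime, which is why the promotion gate sits behind both.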
## Reward Functions

| Function | Scores | Best for |
|---|---|---|
| `task_success` | 1.0 if no error in outcome | General purpose |
| `code_execution_reward` | 1.0 minus 0.25 per error | Tool-use agents |
| `answer_match_reward` | 1.0 exact, 0.7 substring match | Q&A, extraction |
| `efficiency_reward` | Penalizes extra steps | Concise agents |
| `llm_judge_reward` | LLM rates 0–1 (~$0.001/call) | Open-ended tasks |
| `LearnedReward` | Few-shot from human feedback | Custom domains |
| `CompositeReward` | Weighted blend of any of the above | Production |
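A weighted blend like `CompositeReward` reduces to a normalized weighted average of the component rewards. The function below sketches that idea; the signature is an assumption for illustration, not the documented API.

```python
def composite_reward(weighted_fns):
    """Blend (reward_fn, weight) pairs into one reward in [0, 1]."""
    total_weight = sum(w for _, w in weighted_fns)
    def reward(trajectory):
        return sum(w * fn(trajectory) for fn, w in weighted_fns) / total_weight
    return reward

task_ok = lambda t: 1.0    # stand-in for task_success
efficient = lambda t: 0.5  # stand-in for efficiency_reward
blended = composite_reward([(task_ok, 0.75), (efficient, 0.25)])
print(blended("trace"))  # 0.875 = (0.75 * 1.0 + 0.25 * 0.5) / 1.0
```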
## CLI

```bash
arise status ./skills          # Library stats
arise skills ./skills          # List active skills with metrics
arise inspect ./skills <id>    # View implementation + tests
arise rollback ./skills <ver>  # Roll back to a previous version
arise export ./skills ./out    # Export as .py files
arise evolve --dry-run         # Preview what would be synthesized
```
## Configuration

```python
from arise import ARISEConfig

config = ARISEConfig(
    model="gpt-4o-mini",           # LLM for synthesis (not your agent's model)
    sandbox_backend="subprocess",  # or "docker"
    sandbox_timeout=30,
    max_library_size=50,
    max_refinement_attempts=3,
    failure_threshold=5,           # failures before evolution
    max_evolutions_per_hour=3,     # cost control
    allowed_imports=["json", "re", "hashlib", "csv", "math"],  # restrict generated code
)
```
## Examples

| Example | Description |
|---|---|
| `quickstart.py` | Math agent evolves statistics tools |
| `api_agent.py` | HTTP agent evolves auth + pagination (mock server, no deps) |
| `devops_agent.py` | DevOps agent evolves log-analysis tools |
| `coding_agent.py` | Code agent evolves file-manipulation tools |
| `strands_agent.py` | Strands integration with Bedrock |
| `demo/agentcore/` | Full AgentCore deployment with the A2A protocol |
## Costs

Each evolution cycle makes 3–5 LLM calls with gpt-4o-mini, roughly $0.01–0.05 per cycle. With `max_evolutions_per_hour=3`, the worst case is about $0.15/hour.
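The worst-case figure follows directly from the rate limit, using the document's own numbers:

```python
# Worst-case hourly spend = rate limit × upper end of the per-cycle cost range.
cost_per_cycle_usd = 0.05    # upper end of the ~$0.01-0.05 range
max_evolutions_per_hour = 3  # the configured rate limit
worst_case = max_evolutions_per_hour * cost_per_cycle_usd
print(f"${worst_case:.2f}/hour")  # $0.15/hour
```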
## Dependencies

```bash
pip install arise-ai           # core (just pydantic)
pip install arise-ai[aws]      # + boto3 for distributed mode
pip install arise-ai[litellm]  # + litellm for multi-provider LLM support
pip install arise-ai[docker]   # + Docker sandbox backend
pip install arise-ai[all]      # everything
```
## Related Work

ARISE builds on ideas from LATM (LLMs as tool makers), VOYAGER (open-ended skill libraries), CREATOR (disentangling reasoning from tool creation), ADAS (automated agent design), and CRAFT (shared tool libraries). ARISE adds the production engineering layer: framework-agnostic integration, sandboxed validation, adversarial testing, version control, distributed deployment, and A/B testing.

## License

MIT