Self-evolving agent framework — agents that create their own tools

ARISE — Adaptive Runtime Improvement through Self-Evolution

Python 3.11+ · License: MIT

Your agent works great on the tasks you planned for. ARISE handles the ones you didn't.

ARISE is a framework-agnostic middleware that gives LLM agents the ability to create their own tools at runtime. When your agent fails at a task, ARISE detects the capability gap, synthesizes a Python tool, validates it in a sandbox, and promotes it to the active library — no human intervention required.

pip install arise-ai

from arise import ARISE
from arise.rewards import task_success

arise = ARISE(
    agent_fn=my_agent,           # any (task, tools) -> str function
    reward_fn=task_success,
    model="gpt-4o-mini",         # cheap model for tool synthesis
)

result = arise.run("Fetch all users from the paginated API")
# Agent fails → ARISE synthesizes fetch_all_paginated tool → agent succeeds
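
agent_fn can be any plain callable with the (task, tools) -> str shape used above. A minimal hand-rolled sketch is below; note that the exact shape in which ARISE passes tools (assumed here to be plain callables) is an assumption, not documented API:

def my_agent(task: str, tools: list) -> str:
    """Toy agent_fn: try each available tool on the task and return the first result.

    Illustrative only — a real agent_fn would call your LLM or agent framework,
    and the assumed shape of `tools` (plain callables) may differ from ARISE's.
    """
    for tool in tools:
        try:
            return str(tool(task))
        except Exception:
            continue
    return "error: no tool could handle this task"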

How It Works

flowchart TD
    A["Agent receives task"] --> B["Execute with current tools"]
    B --> C{"Success?"}
    C -- "Yes (reward ≥ 0.5)" --> D["Log trajectory, continue"]
    C -- "No (reward < 0.5)" --> E["Log failure trajectory"]
    E --> F{"Enough failures?"}
    F -- No --> D
    F -- Yes --> G["Detect capability gaps"]
    G --> H["Synthesize new tool via LLM"]
    H --> I["Test in sandbox + adversarial validation"]
    I --> J{"Pass?"}
    J -- Yes --> K["Promote to active library"]
    J -- No --> L["Refine and retry"]
    L --> H
    K --> A

    style G fill:#f9d71c,color:#000
    style H fill:#f9d71c,color:#000
    style I fill:#f9d71c,color:#000
    style K fill:#4caf50,color:#fff
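
Read as code, the flowchart is a simple loop. The toy sketch below mirrors its main path (the refine-and-retry branch is omitted for brevity); every name in it is illustrative and none of it is the ARISE implementation:

# Self-contained toy version of the loop above — illustrative only.
def toy_agent(task: str, tools: dict) -> str:
    """Stand-in agent: succeeds only once a matching tool exists."""
    return "ok" if any(name in task for name in tools) else "error: missing capability"

def toy_reward(result: str) -> float:
    return 0.0 if "error" in result else 1.0

def toy_synthesize(gap: str):
    """Stand-in for LLM synthesis plus sandbox/adversarial validation."""
    return gap, (lambda payload: f"{gap}({payload})")

tools: dict = {}
failures: list[str] = []
FAILURE_THRESHOLD = 3                       # mirrors failure_threshold in ARISEConfig

for episode, task in enumerate(["fetch_all_paginated users"] * 5, start=1):
    result = toy_agent(task, tools)
    reward = toy_reward(result)
    status = "OK  " if reward >= 0.5 else "FAIL"
    print(f"Episode {episode} | {status} | reward={reward:.2f} | skills={len(tools)}")
    if reward >= 0.5:
        continue                            # success: log the trajectory and move on
    failures.append(task)
    if len(failures) >= FAILURE_THRESHOLD:  # enough evidence of a capability gap
        name, fn = toy_synthesize("fetch_all_paginated")
        tools[name] = fn                    # "promote" the validated tool to the library
        failures.clear()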

What It Looks Like

Episode 1  | FAIL  | reward=0.00 | skills=2   Task: "Fetch paginated users with auth"
Episode 2  | FAIL  | reward=0.00 | skills=2
Episode 3  | FAIL  | reward=0.00 | skills=2

[Evolution triggered — 3 failures on API tasks]
  → Synthesizing 'parse_json_response'... 3/3 tests passed ✓
  → Synthesizing 'fetch_all_paginated'... sandbox fail → refine → 1/1 passed ✓

Episode 4  | OK    | reward=1.00 | skills=4   Agent now has the tools it needs

Framework Support

Framework                Status      How
Any function             Supported   ARISE(agent_fn=my_func) — any (task, tools) -> str callable
Strands Agents           Supported   ARISE(agent=strands_agent) — auto-injects tools alongside your @tool functions
Raw OpenAI / Anthropic   Supported   Wrap API calls in an agent_fn — see examples/ and the sketch below
LangGraph, CrewAI        Planned     v0.2
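
The raw OpenAI / Anthropic wrapper mentioned in the table can be as small as the following sketch. The OpenAI client calls are standard SDK usage, but how ARISE passes tools (assumed here: callables with docstrings) is an assumption — see examples/ for the canonical pattern:

from openai import OpenAI

client = OpenAI()

def my_agent(task: str, tools: list) -> str:
    # Describe the available tools in the system prompt (assumed shape: callables with docstrings).
    tool_help = "\n".join(f"- {t.__name__}: {t.__doc__ or ''}" for t in tools)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Available helper tools:\n{tool_help}"},
            {"role": "user", "content": task},
        ],
    )
    return resp.choices[0].message.content or ""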

Core Features

Self-Evolution Pipeline

The core loop: fail → detect gap → synthesize → test → promote.

Tools are synthesized by a cheap LLM (gpt-4o-mini), validated in an isolated sandbox with adversarial testing, and version-controlled in SQLite. Every mutation is checkpointed, so you can roll back at any time.

Distributed Mode

Decouple agent and evolution for stateless deployments (Lambda, ECS, AgentCore):

flowchart LR
    subgraph Agent["Agent Process (stateless)"]
        A1["Serve requests"]
        A2["Read skills from S3"]
        A3["Report trajectories"]
    end

    subgraph Worker["ARISE Worker"]
        W1["Consume trajectories"]
        W2["Detect gaps & evolve"]
        W3["Promote skills"]
    end

    S3[(S3 Skill Store)]
    SQS[[SQS Queue]]

    A2 --> S3
    A3 --> SQS
    SQS --> W1
    W3 --> S3

from arise import create_distributed_arise, ARISEConfig

config = ARISEConfig(
    s3_bucket="my-skills",
    sqs_queue_url="https://sqs.../arise-trajectories",
)

arise = create_distributed_arise(agent_fn=my_agent, reward_fn=task_success, config=config)

pip install arise-ai[aws]   # adds boto3

Skill Registry

Share evolved tools across projects — like npm for agent skills:

from arise.registry import SkillRegistry

registry = SkillRegistry(bucket="my-registry")
registry.publish(skill, tags=["json", "parsing"])

# Other projects can pull proven skills
skill = registry.pull("parse_csv")

Set registry_check_before_synthesis=True in your config and ARISE will check the registry before calling the LLM.
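
For example — a minimal sketch, assuming the flag sits on ARISEConfig alongside the other options shown in the Configuration section below:

from arise import ARISEConfig

config = ARISEConfig(
    registry_check_before_synthesis=True,   # consult the shared registry before synthesizing a new skill
)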

Multi-Model Routing

Route different synthesis tasks to different models:

config = ARISEConfig(
    model_routes={
        "gap_detection": "gpt-4o-mini",      # cheap
        "synthesis": "claude-sonnet-4-5-20250929",  # expensive, better code
        "refinement": "gpt-4o-mini",
    },
    auto_select_model=True,  # auto-promote best model over time
)

Skill A/B Testing

When ARISE evolves a refined skill, it A/B tests it against the original instead of replacing it outright:

# Automatic — ARISE creates A/B tests during evolution
# Manual — test two versions yourself
from arise.skills.ab_test import SkillABTest

ab = SkillABTest(skill_a=v1, skill_b=v2, min_episodes=20)
# Winner auto-promoted, loser deprecated after min_episodes

Incremental Evolution

Patch existing skills instead of full re-synthesis:

# ARISE does this automatically during evolution:
# 1. Existing skill fails on specific inputs
# 2. forge.patch() applies minimal fix
# 3. Patched version A/B tested against original
# 4. Winner promoted

Reward Learning

Learn reward functions from human feedback:

from arise.rewards.learned import LearnedReward

reward = LearnedReward(min_examples=10, persist_path="./feedback")
reward.add_feedback(trajectory, score=0.9)

# Falls back to task_success until enough examples collected
arise = ARISE(agent_fn=my_agent, reward_fn=reward)

Safety

Generated code is untrusted, so ARISE validates it through multiple layers:

Layer                 What it does
Sandbox               Subprocess or Docker isolation with timeouts
Test suite            LLM writes tests alongside the tool
Adversarial testing   Separate LLM call tries to break it (edge cases, type boundaries, security)
Import restrictions   allowed_imports whitelist blocks subprocess, socket, etc.
Promotion gate        Only tools passing all tests become ACTIVE
Version control       SQLite checkpoints; arise rollback <version> anytime
Rate limiting         max_evolutions_per_hour caps LLM spend

See SECURITY.md for the full threat model.


Reward Functions

Function                Scores                                              Best for
task_success            1.0 if no error in outcome                          General purpose
code_execution_reward   1.0 minus 0.25 per error                            Tool-use agents
answer_match_reward     1.0 exact, 0.7 substring match                      Q&A, extraction
efficiency_reward       Penalizes extra steps                               Concise agents
llm_judge_reward        LLM rates 0–1 (~$0.001/call)                        Open-ended tasks
LearnedReward           Few-shot from human feedback                        Custom domains
CompositeReward         Weighted blend of any of the above (sketch below)   Production
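
To see the idea behind CompositeReward without relying on its exact constructor, here is a hand-rolled weighted blend. The import path for efficiency_reward and the argument the reward functions receive are assumptions in this sketch:

from arise import ARISE
from arise.rewards import task_success, efficiency_reward  # efficiency_reward import path is assumed

def blended_reward(outcome) -> float:
    """Hand-rolled weighted blend — illustrates the CompositeReward idea, not its real API."""
    weights = {task_success: 0.6, efficiency_reward: 0.4}
    score = sum(w * fn(outcome) for fn, w in weights.items())
    return min(max(score, 0.0), 1.0)   # keep scores in the 0..1 range ARISE expects

arise = ARISE(agent_fn=my_agent, reward_fn=blended_reward)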

CLI

arise status ./skills          # Library stats
arise skills ./skills          # List active skills with metrics
arise inspect ./skills <id>    # View implementation + tests
arise rollback ./skills <ver>  # Rollback to previous version
arise export ./skills ./out    # Export as .py files
arise evolve --dry-run         # Preview what would be synthesized

Configuration

from arise import ARISEConfig

config = ARISEConfig(
    model="gpt-4o-mini",           # LLM for synthesis (not your agent's model)
    sandbox_backend="subprocess",   # or "docker"
    sandbox_timeout=30,
    max_library_size=50,
    max_refinement_attempts=3,
    failure_threshold=5,            # failures before evolution
    max_evolutions_per_hour=3,      # cost control
    allowed_imports=["json", "re", "hashlib", "csv", "math"],  # restrict generated code
)

Examples

Example            Description
quickstart.py      Math agent evolves statistics tools
api_agent.py       HTTP agent evolves auth + pagination (mock server, no deps)
devops_agent.py    DevOps agent evolves log analysis tools
coding_agent.py    Code agent evolves file manipulation tools
strands_agent.py   Strands integration with Bedrock
demo/agentcore/    Full AgentCore deployment with A2A protocol

Costs

Each evolution cycle is 3–5 LLM calls with gpt-4o-mini: ~$0.01–0.05 per cycle. With max_evolutions_per_hour=3, worst case ~$0.15/hour.


Dependencies

pip install arise-ai              # core (just pydantic)
pip install arise-ai[aws]         # + boto3 for distributed mode
pip install arise-ai[litellm]     # + litellm for multi-provider LLM
pip install arise-ai[docker]      # + docker sandbox backend
pip install arise-ai[all]         # everything

Related Work

ARISE builds on ideas from LATM (LLMs as tool makers), VOYAGER (open-ended skill libraries), CREATOR (disentangling reasoning from tool creation), ADAS (automated agent design), and CRAFT (shared tool libraries). ARISE adds the production engineering layer: framework-agnostic integration, sandboxed validation, adversarial testing, version control, distributed deployment, and A/B testing.

License

MIT
