Self-evolving agent framework — agents that create their own tools

ARISE — Adaptive Runtime Improvement through Self-Evolution

Your agent works great on the tasks you planned for. ARISE handles the ones you didn't.

ARISE is framework-agnostic middleware that lets LLM agents create their own tools at runtime. When your agent repeatedly fails at a kind of task, ARISE detects the capability gap, synthesizes a Python tool, validates it in a sandbox, and promotes it to the active library, with no human intervention required.

pip install arise-ai

from arise import ARISE
from arise.rewards import task_success

arise = ARISE(
    agent_fn=my_agent,           # any (task, tools) -> str function
    reward_fn=task_success,
    model="gpt-4o-mini",         # cheap model for tool synthesis
)

result = arise.run("Fetch all users from the paginated API")
# Agent fails → ARISE synthesizes fetch_all_paginated tool → agent succeeds
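
For reference, here is a minimal sketch of a function that fits the `(task, tools) -> str` contract. The README does not spell out what `tools` contains, so this assumes a sequence of plain Python callables; `list_users` and the name-matching logic are hypothetical stand-ins for a real LLM-driven agent.

```python
from typing import Callable, List, Sequence


def list_users() -> List[str]:
    """Hypothetical starter tool."""
    return ["ada", "grace"]


def my_agent(task: str, tools: Sequence[Callable[[], object]]) -> str:
    """Toy agent: run the first tool whose name is mentioned in the task."""
    for tool in tools:
        if tool.__name__ in task:
            return str(tool())
    # Returning an error string lets a reward function score this as a failure.
    return f"error: no tool available for {task!r}"


print(my_agent("call list_users", [list_users]))
print(my_agent("fetch paginated users", [list_users]))
```

The second call has no matching tool and returns an error string, which is the kind of outcome a reward function would score as a failure.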

How It Works

flowchart TD
    A["Agent receives task"] --> B["Execute with current tools"]
    B --> C{"Success?"}
    C -- "Yes (reward ≥ 0.5)" --> D["Log trajectory, continue"]
    C -- "No (reward < 0.5)" --> E["Log failure trajectory"]
    E --> F{"Enough failures?"}
    F -- No --> D
    F -- Yes --> G["Detect capability gaps"]
    G --> H["Synthesize new tool via LLM"]
    H --> I["Test in sandbox + adversarial validation"]
    I --> J{"Pass?"}
    J -- Yes --> K["Promote to active library"]
    J -- No --> L["Refine and retry"]
    L --> H
    K --> A

    style G fill:#f9d71c,color:#000
    style H fill:#f9d71c,color:#000
    style I fill:#f9d71c,color:#000
    style K fill:#4caf50,color:#fff
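
The flowchart can be read as a small control loop. The sketch below is not ARISE's implementation: the 0.5 reward cutoff comes from the diagram, while `FAILURE_THRESHOLD`, `MAX_REFINEMENTS`, and every function name are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

REWARD_THRESHOLD = 0.5   # from the flowchart: reward >= 0.5 counts as success
FAILURE_THRESHOLD = 3    # assumed number of failures before evolution
MAX_REFINEMENTS = 3      # assumed cap on "refine and retry"

Tool = Callable[[str], str]


def evolve(failures: List[str], synthesize, validate) -> Optional[Tool]:
    """Synthesize a candidate and retry until it validates or attempts run out."""
    for attempt in range(MAX_REFINEMENTS):
        candidate = synthesize(failures, attempt)
        if validate(candidate):
            return candidate
    return None


def run_episode(task, agent_fn, reward_fn, tools: Dict[str, Tool],
                failures: List[str], synthesize, validate) -> float:
    reward = reward_fn(agent_fn(task, tools))
    if reward < REWARD_THRESHOLD:
        failures.append(task)                      # log failure trajectory
        if len(failures) >= FAILURE_THRESHOLD:     # enough failures?
            tool = evolve(failures, synthesize, validate)
            if tool is not None:
                tools[tool.__name__] = tool        # promote to active library
                failures.clear()
    return reward


# Toy stand-ins for the agent, reward, LLM synthesis, and sandbox validation.
def agent(task, tools):
    return tools["shout"](task) if "shout" in tools else "error: missing tool"


def reward(result):
    return 0.0 if result.startswith("error") else 1.0


def synthesize(failures, attempt):
    def shout(text):
        return text.upper()
    return shout


def validate(tool):
    return tool("ok") == "OK"


tools, failures = {}, []
rewards = [run_episode("say hi", agent, reward, tools, failures, synthesize, validate)
           for _ in range(4)]
print(rewards)
```

With these stand-ins the first three episodes fail, the third failure triggers evolution, and the fourth episode succeeds with the promoted tool.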

What It Looks Like

Episode 1  | FAIL  | reward=0.00 | skills=2   Task: "Fetch paginated users with auth"
Episode 2  | FAIL  | reward=0.00 | skills=2
Episode 3  | FAIL  | reward=0.00 | skills=2

[Evolution triggered — 3 failures on API tasks]
  → Synthesizing 'parse_json_response'... 3/3 tests passed ✓
  → Synthesizing 'fetch_all_paginated'... sandbox fail → refine → 1/1 passed ✓

Episode 4  | OK    | reward=1.00 | skills=4   Agent now has the tools it needs

Framework Support

Framework	Status	How
Any function	Supported	`ARISE(agent_fn=my_func)` — any `(task, tools) -> str` callable
Strands Agents	Supported	`ARISE(agent=strands_agent)` — auto-injects tools alongside your `@tool` functions
Raw OpenAI / Anthropic	Supported	Wrap API calls in an `agent_fn` — see examples/
LangGraph, CrewAI	Planned	v0.2

Core Features

Self-Evolution Pipeline

The core loop: fail → detect gap → synthesize → test → promote.

Tools are synthesized by a low-cost LLM (gpt-4o-mini in the examples here), validated in an isolated sandbox with adversarial testing, and version-controlled in SQLite. Every mutation is checkpointed, so you can roll back to an earlier version at any time.
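
To make the checkpoint and rollback idea concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table name, columns, and helper functions are hypothetical and are not ARISE's actual schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) version table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS skill_versions ("
        " version INTEGER PRIMARY KEY AUTOINCREMENT,"
        " name TEXT NOT NULL,"
        " source TEXT NOT NULL)"
    )
    return conn


def checkpoint(conn: sqlite3.Connection, name: str, source: str) -> int:
    """Store a new version of a skill and return its version number."""
    cur = conn.execute(
        "INSERT INTO skill_versions (name, source) VALUES (?, ?)", (name, source)
    )
    conn.commit()
    return cur.lastrowid


def rollback(conn: sqlite3.Connection, version: int) -> None:
    """Discard every checkpoint newer than `version`."""
    conn.execute("DELETE FROM skill_versions WHERE version > ?", (version,))
    conn.commit()


def active_source(conn: sqlite3.Connection, name: str):
    """Return the newest stored source for a skill, or None."""
    row = conn.execute(
        "SELECT source FROM skill_versions WHERE name = ?"
        " ORDER BY version DESC LIMIT 1",
        (name,),
    ).fetchone()
    return row[0] if row else None


conn = open_store()
v1 = checkpoint(conn, "parse", "def parse(x): return x")
v2 = checkpoint(conn, "parse", "def parse(x): return x.strip()")
rollback(conn, v1)
print(active_source(conn, "parse"))
```

After the rollback, the newest surviving checkpoint is the first version again.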

Distributed Mode

Decouple agent and evolution for stateless deployments (Lambda, ECS, AgentCore):

flowchart LR
    subgraph Agent["Agent Process (stateless)"]
        A1["Serve requests"]
        A2["Read skills from S3"]
        A3["Report trajectories"]
    end

    subgraph Worker["ARISE Worker"]
        W1["Consume trajectories"]
        W2["Detect gaps & evolve"]
        W3["Promote skills"]
    end

    S3[(S3 Skill Store)]
    SQS[[SQS Queue]]

    A2 --> S3
    A3 --> SQS
    SQS --> W1
    W3 --> S3

from arise import create_distributed_arise, ARISEConfig
from arise.rewards import task_success

config = ARISEConfig(
    s3_bucket="my-skills",
    sqs_queue_url="https://sqs.../arise-trajectories",
)

arise = create_distributed_arise(agent_fn=my_agent, reward_fn=task_success, config=config)

pip install "arise-ai[aws]"   # adds boto3; quotes keep shells like zsh from expanding the brackets
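
The split between a stateless agent and a separate worker can be sketched without AWS at all. In the illustration below a `dict` stands in for the S3 skill store and a `queue.Queue` stands in for SQS; the function names, the message format, and the failure threshold are assumptions, not ARISE's wire format.

```python
import json
import queue

skill_store = {}               # stands in for the S3 skill store
trajectories = queue.Queue()   # stands in for the SQS queue


def agent_handle(task: str) -> str:
    """Stateless agent: read skills, serve the request, report the trajectory."""
    skills = dict(skill_store)          # fresh read per request, no local state
    ok = "fetch_all_paginated" in skills
    trajectories.put(json.dumps({"task": task, "success": ok}))
    return "done" if ok else "error: missing skill"


def worker_step(threshold: int = 3) -> None:
    """Worker: consume trajectories and promote a skill after enough failures."""
    failed = 0
    while not trajectories.empty():
        if not json.loads(trajectories.get())["success"]:
            failed += 1
    if failed >= threshold:
        # A real worker would synthesize and validate here; this is a placeholder.
        skill_store["fetch_all_paginated"] = "def fetch_all_paginated(): ..."


before = [agent_handle("fetch users") for _ in range(3)]
worker_step()
after = agent_handle("fetch users")
print(before, after)
```

The agent never evolves anything itself; it only picks up whatever the worker has promoted the next time it reads the store.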

Skill Registry

Share evolved tools across projects — like npm for agent skills:

from arise.registry import SkillRegistry

registry = SkillRegistry(bucket="my-registry")
registry.publish(skill, tags=["json", "parsing"])

# Other projects can pull proven skills
skill = registry.pull("parse_csv")

Set `registry_check_before_synthesis=True` in the config to have ARISE look for an existing skill in the registry before it calls the LLM to synthesize a new one.
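
The lookup-before-synthesis behaviour amounts to a cache check in front of the LLM call. A minimal sketch, assuming the registry behaves like a name-to-source mapping; `get_skill` and `fake_synthesize` are hypothetical names.

```python
from typing import Callable, Dict, Tuple


def get_skill(
    name: str,
    registry: Dict[str, str],
    synthesize: Callable[[str], str],
    check_registry_first: bool = True,
) -> Tuple[str, str]:
    """Return (source_code, origin), where origin is 'registry' or 'llm'."""
    if check_registry_first and name in registry:
        return registry[name], "registry"
    return synthesize(name), "llm"


llm_calls = []


def fake_synthesize(name: str) -> str:
    """Stand-in for an LLM call; records that it was invoked."""
    llm_calls.append(name)
    return f"def {name}(data): ..."


registry = {"parse_csv": "def parse_csv(text): return text.splitlines()"}
print(get_skill("parse_csv", registry, fake_synthesize)[1])
print(get_skill("parse_xml", registry, fake_synthesize)[1])
print(llm_calls)
```

Only the skill missing from the registry reaches the synthesis step, which is where the cost saving comes from.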

Multi-Model Routing

Route different synthesis tasks to different models:

from arise import ARISEConfig

config = ARISEConfig(
    model_routes={
        "gap_detection": "gpt-4o-mini",      # cheap
        "synthesis": "claude-sonnet-4-5-20250929",  # expensive, better code
        "refinement": "gpt-4o-mini",
    },
    auto_select_model=True,  # auto-promote best model over time
)
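
Conceptually, routing is a per-stage lookup with a default, and automatic selection is a comparison of observed results. The sketch below assumes a fallback model and a pass-rate criterion; the README does not say how `auto_select_model` actually ranks models, and `model_for` and `best_model` are hypothetical helpers.

```python
from typing import Dict, Tuple

DEFAULT_MODEL = "gpt-4o-mini"  # assumed fallback for stages without a route
MODEL_ROUTES = {
    "gap_detection": "gpt-4o-mini",
    "synthesis": "claude-sonnet-4-5-20250929",
    "refinement": "gpt-4o-mini",
}


def model_for(stage: str, routes: Dict[str, str] = MODEL_ROUTES) -> str:
    """Pick the routed model for a stage, falling back to the default."""
    return routes.get(stage, DEFAULT_MODEL)


def best_model(stats: Dict[str, Tuple[int, int]]) -> str:
    """Pick the model with the highest pass rate; stats maps model -> (passed, attempted)."""
    def pass_rate(model: str) -> float:
        passed, attempted = stats[model]
        return passed / attempted if attempted else 0.0
    return max(stats, key=pass_rate)


print(model_for("synthesis"))
print(model_for("adversarial_testing"))
print(best_model({"model-a": (3, 10), "model-b": (8, 10)}))
```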

Skill A/B Testing

When ARISE evolves a refined version of a skill, it A/B tests the refinement against the original instead of replacing the original outright:

# Automatic — ARISE creates A/B tests during evolution
# Manual — test two versions yourself
from arise.skills.ab_test import SkillABTest

ab = SkillABTest(skill_a=v1, skill_b=v2, min_episodes=20)
# Winner auto-promoted, loser deprecated after min_episodes
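
The bookkeeping behind such a test can be sketched in a few lines. This is an illustration only: it assumes `min_episodes` applies to each arm separately and that the winner is decided by mean reward, neither of which the README confirms. `ABTestSketch` is a hypothetical class, not `SkillABTest`.

```python
from statistics import mean
from typing import Optional


class ABTestSketch:
    def __init__(self, min_episodes: int = 20):
        self.min_episodes = min_episodes
        self.rewards = {"a": [], "b": []}

    def record(self, arm: str, reward: float) -> None:
        """Log the reward of one episode run with skill version 'a' or 'b'."""
        self.rewards[arm].append(reward)

    def winner(self) -> Optional[str]:
        """Return 'a' or 'b' once both arms have enough episodes, else None."""
        if any(len(r) < self.min_episodes for r in self.rewards.values()):
            return None
        # Ties keep the original ("a"), so a refinement must be strictly better.
        return "b" if mean(self.rewards["b"]) > mean(self.rewards["a"]) else "a"


ab = ABTestSketch(min_episodes=2)
ab.record("a", 0.0)
ab.record("b", 1.0)
early = ab.winner()
ab.record("a", 1.0)
ab.record("b", 1.0)
print(early, ab.winner())
```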

Incremental Evolution

Patch existing skills instead of full re-synthesis:

# ARISE does this automatically during evolution:
# 1. Existing skill fails on specific inputs
# 2. forge.patch() applies minimal fix
# 3. Patched version A/B tested against original
# 4. Winner promoted

Reward Learning

Learn reward functions from human feedback:

from arise import ARISE
from arise.rewards.learned import LearnedReward

reward = LearnedReward(min_examples=10, persist_path="./feedback")
reward.add_feedback(trajectory, score=0.9)

# Falls back to task_success until enough examples collected
arise = ARISE(agent_fn=my_agent, reward_fn=reward)
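
The fallback behaviour can be sketched independently of how the reward is learned. In the illustration below, the "learned" part is a deliberately crude stand-in (it returns the score of the labelled example sharing the most words with the input); it is an assumption for demonstration and not how `LearnedReward` works internally. `FallbackReward` is a hypothetical class.

```python
from typing import Callable, List, Tuple


class FallbackReward:
    def __init__(self, fallback: Callable[[str], float], min_examples: int = 10):
        self.fallback = fallback
        self.min_examples = min_examples
        self.examples: List[Tuple[str, float]] = []   # (trajectory text, human score)

    def add_feedback(self, trajectory: str, score: float) -> None:
        self.examples.append((trajectory, score))

    def __call__(self, trajectory: str) -> float:
        # Not enough human feedback yet: defer to the fallback reward.
        if len(self.examples) < self.min_examples:
            return self.fallback(trajectory)
        # Crude stand-in for a learned model: nearest example by shared words.
        words = set(trajectory.split())
        nearest = max(self.examples, key=lambda ex: len(words & set(ex[0].split())))
        return nearest[1]


reward = FallbackReward(lambda t: 0.0 if "error" in t else 1.0, min_examples=2)
first = reward("fetched 10 users")            # uses the fallback
reward.add_feedback("fetched 10 users", 0.9)
reward.add_feedback("error timeout", 0.1)
second = reward("fetched 3 users")            # uses the collected feedback
print(first, second)
```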

Cost Tracking

Track LLM spend automatically:

from arise import cost_tracker

# After running episodes...
print(cost_tracker.summary())
# {"total_calls": 64, "total_input_tokens": 125000, "total_output_tokens": 42000, "total_cost_usd": 0.26}
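
A summary like the one above can be produced by a small accumulator. The sketch below mirrors the four summary keys from the example output; the per-token prices and the model name are placeholders (check your provider's current pricing), and `CostTrackerSketch` is a hypothetical class rather than ARISE's `cost_tracker`.

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens: (input, output).
PRICES = {"example-model": (0.15, 0.60)}


@dataclass
class CostTrackerSketch:
    total_calls: int = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_cost_usd: float = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Add one LLM call to the running totals."""
        price_in, price_out = PRICES[model]
        self.total_calls += 1
        self.total_input_tokens += input_tokens
        self.total_output_tokens += output_tokens
        self.total_cost_usd += (
            input_tokens * price_in + output_tokens * price_out
        ) / 1_000_000

    def summary(self) -> dict:
        return {
            "total_calls": self.total_calls,
            "total_input_tokens": self.total_input_tokens,
            "total_output_tokens": self.total_output_tokens,
            "total_cost_usd": round(self.total_cost_usd, 4),
        }


tracker = CostTrackerSketch()
tracker.record("example-model", input_tokens=1_000_000, output_tokens=500_000)
print(tracker.summary())
```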

Benchmark Results

Evaluated on two domains built around proprietary formats, so models cannot fall back on memorized training data:

Model	Condition	AcmeCorp (SRE)	DataCorp (Data Eng)
Claude Sonnet	ARISE	78%	—
Claude Sonnet	No tools	63%	—
GPT-4o-mini	ARISE	57%	92%
GPT-4o-mini	No tools	48%	50%
GPT-4o-mini	Fixed tools	48%	—

Key findings:

ARISE improves both models: +15pp (Claude Sonnet) and +9pp (GPT-4o-mini) on AcmeCorp, and +42pp (GPT-4o-mini) on DataCorp.
Self-evolved tools beat hand-written tools: 57% vs 48% for GPT-4o-mini on AcmeCorp, where the fixed tool set scored no better than no tools at all. ARISE tailors tools to how the agent actually works.
Fewer tools can be better: Claude Sonnet reached 78% with just 2 tools, while GPT-4o-mini needed 21.

See benchmarks/ for the full evaluation suite and paper/ for the research paper.

Safety

Generated code is untrusted. ARISE validates through multiple layers:

Layer	What it does
Sandbox	Subprocess or Docker isolation with timeouts
Test suite	LLM writes tests alongside the tool
Adversarial testing	Separate LLM call tries to break it (edge cases, type boundaries, security)
Import restrictions	`allowed_imports` whitelist blocks `subprocess`, `socket`, etc.
Promotion gate	Only tools passing all tests become `ACTIVE`
Version control	SQLite checkpoints; `arise rollback ./skills <ver>` anytime
Rate limiting	`max_evolutions_per_hour` caps LLM spend
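
As an illustration of the import-restriction layer, the sketch below uses Python's standard `ast` module to list top-level modules that generated code imports but the whitelist does not allow. It is an assumption about how such a check could work, not ARISE's implementation, and `disallowed_imports` is a hypothetical helper.

```python
import ast
from typing import Iterable, Set

ALLOWED_IMPORTS = {"json", "re", "hashlib", "csv", "math"}


def disallowed_imports(source: str, allowed: Iterable[str] = ALLOWED_IMPORTS) -> Set[str]:
    """Return top-level modules imported by `source` that are not whitelisted."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - set(allowed)


generated = "import json\nimport subprocess\nfrom socket import socket\n"
print(disallowed_imports(generated))
```

A static check like this only sees `import` statements. Dynamic imports such as `__import__("subprocess")` bypass it, which is one reason to combine it with the sandbox rather than rely on it alone.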

See SECURITY.md for the full threat model.

Reward Functions

Function	Scores	Best for
`task_success`	1.0 if no error in outcome	General purpose
`code_execution_reward`	1.0 minus 0.25 per error	Tool-use agents
`answer_match_reward`	1.0 exact, 0.7 substring match	Q&A, extraction
`efficiency_reward`	Penalizes extra steps	Concise agents
`llm_judge_reward`	LLM rates 0–1 (~$0.001/call)	Open-ended tasks
`LearnedReward`	Few-shot from human feedback	Custom domains
`CompositeReward`	Weighted blend of any of the above	Production
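
Two rows of the table can be sketched directly: the exact/substring scoring of `answer_match_reward` and the weighted blend of `CompositeReward`. The `(expected, actual)` signature, the 0.0 score for a miss, and the weight normalization are assumptions; ARISE's real reward functions may take different arguments.

```python
from typing import Callable, List, Tuple

RewardFn = Callable[[str, str], float]  # assumed shape: (expected, actual) -> score


def answer_match(expected: str, actual: str) -> float:
    """1.0 for an exact match, 0.7 if the expected answer appears as a substring."""
    expected, actual = expected.strip(), actual.strip()
    if actual == expected:
        return 1.0
    if expected and expected in actual:
        return 0.7
    return 0.0  # assumed score when nothing matches


def non_empty(expected: str, actual: str) -> float:
    """Trivial second signal so the blend has something to combine."""
    return 1.0 if actual.strip() else 0.0


def composite(parts: List[Tuple[RewardFn, float]]) -> RewardFn:
    """Blend reward functions using weights normalized to sum to 1."""
    total = sum(weight for _, weight in parts)

    def blended(expected: str, actual: str) -> float:
        return sum(fn(expected, actual) * weight for fn, weight in parts) / total

    return blended


reward = composite([(answer_match, 3.0), (non_empty, 1.0)])
print(reward("42", "42"))
print(reward("42", "The answer is 42"))
```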

CLI

arise status ./skills          # Library stats
arise skills ./skills          # List active skills with metrics
arise inspect ./skills <id>    # View implementation + tests
arise rollback ./skills <ver>  # Rollback to previous version
arise export ./skills ./out    # Export as .py files
arise evolve --dry-run         # Preview what would be synthesized

Configuration

from arise import ARISEConfig

config = ARISEConfig(
    model="gpt-4o-mini",           # LLM for synthesis (not your agent's model)
    sandbox_backend="subprocess",   # or "docker"
    sandbox_timeout=30,
    max_library_size=50,
    max_refinement_attempts=3,
    failure_threshold=5,            # failures before evolution
    max_evolutions_per_hour=3,      # cost control
    allowed_imports=["json", "re", "hashlib", "csv", "math"],  # restrict generated code
)
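
A cap like `max_evolutions_per_hour` is commonly enforced with a sliding-window limiter. The sketch below shows one way to do that; the class name and the injectable clock are assumptions for illustration and say nothing about how ARISE enforces the limit internally.

```python
import time
from collections import deque
from typing import Callable


class EvolutionRateLimiter:
    def __init__(self, max_per_hour: int = 3, clock: Callable[[], float] = time.monotonic):
        self.max_per_hour = max_per_hour
        self.clock = clock
        self.stamps = deque()   # start times of evolutions in the last hour

    def allow(self) -> bool:
        """Return True and record the attempt if the hourly budget allows it."""
        now = self.clock()
        # Forget evolutions that started an hour or more ago.
        while self.stamps and now - self.stamps[0] >= 3600:
            self.stamps.popleft()
        if len(self.stamps) >= self.max_per_hour:
            return False
        self.stamps.append(now)
        return True


fake_now = [0.0]                      # controllable clock for the demo
limiter = EvolutionRateLimiter(3, clock=lambda: fake_now[0])
burst = [limiter.allow() for _ in range(4)]
fake_now[0] = 3600.0                  # one hour later
later = limiter.allow()
print(burst, later)
```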

Examples

Example	Description
`quickstart.py`	Math agent evolves statistics tools
`api_agent.py`	HTTP agent evolves auth + pagination (mock server, no deps)
`devops_agent.py`	DevOps agent evolves log analysis tools
`coding_agent.py`	Code agent evolves file manipulation tools
`strands_agent.py`	Strands integration with Bedrock
`demo/agentcore/`	Full AgentCore deployment with A2A protocol

Costs

Each evolution cycle makes 3–5 LLM calls. With gpt-4o-mini that comes to roughly $0.01–0.05 per cycle, so with `max_evolutions_per_hour=3` the worst case is about 3 × $0.05 = $0.15 per hour.

Dependencies

pip install arise-ai                # core (just pydantic)
pip install "arise-ai[aws]"         # + boto3 for distributed mode
pip install "arise-ai[litellm]"     # + litellm for multi-provider LLM
pip install "arise-ai[docker]"      # + docker sandbox backend
pip install "arise-ai[all]"         # everything

Related Work

ARISE builds on ideas from LATM (LLMs as tool makers), VOYAGER (open-ended skill libraries), CREATOR (disentangling reasoning from tool creation), ADAS (automated agent design), and CRAFT (shared tool libraries). ARISE adds the production engineering layer: framework-agnostic integration, sandboxed validation, adversarial testing, version control, distributed deployment, and A/B testing.

License

MIT

Release history

Version	Released
0.2.1	Mar 30, 2026
0.2.0	Mar 25, 2026
0.1.6	Mar 23, 2026
0.1.5	Mar 22, 2026
0.1.4	Mar 22, 2026
0.1.3	Mar 21, 2026
0.1.2 (this version)	Mar 18, 2026
0.1.1	Mar 17, 2026
0.1.0	Mar 16, 2026

File details

Details for the file arise_ai-0.1.2.tar.gz.

File metadata

Download URL: arise_ai-0.1.2.tar.gz
Upload date: Mar 18, 2026
Size: 223.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for arise_ai-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`54e76f274f514263e602619ddf57423330b03dedc408d0f4d1d84f9d8a9c907f`
MD5	`1ee8f658f8e2debb7e936a3f1a6e9e5d`
BLAKE2b-256	`adbd8547f49bc1830fb1b02522570b7918eff2a458fc18e4ba91e94c52ecface`

File details

Details for the file arise_ai-0.1.2-py3-none-any.whl.

File metadata

Download URL: arise_ai-0.1.2-py3-none-any.whl
Upload date: Mar 18, 2026
Size: 56.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.4

File hashes

Hashes for arise_ai-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2281baf0c99f42b56d58aebba13ca3fa1d76ef3cfb68184501f9a2848e2d0efb`
MD5	`2b87ed0af7fa8b78b4ad806d3f95a4a4`
BLAKE2b-256	`701158e1989596dc04f3ef2b420b514930fb6acd05cd56afacb0fd984cf200f5`
