Failure memory for AI agents -- self-healing retry with structured learning

ReLoop

Failure memory for AI agents.

Every agent fails. ReLoop is the first framework that gets smarter from failure.

License: Apache 2.0 | Python 3.11+


The Problem

AI agents retry blindly -- same mistake, same failure, burning tokens and money. Existing frameworks do not treat failure as data: they either retry with no memory or give up.

The Solution

ReLoop captures every failure into a structured memory graph -- error type, root cause, suggested fix, confidence score, semantic embedding -- so the next retry starts smarter. Your agents don't just recover. They get permanently smarter.
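To make the fields listed above concrete, here is a minimal sketch of what one failure record could look like. The class name `FailureRecord`, the field names, and the sample values are illustrative assumptions, not ReLoop's actual schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class FailureRecord:
    """Hypothetical shape of one stored failure; not ReLoop's actual schema."""
    error_type: str        # e.g. "ImportError"
    root_cause: str        # distilled explanation of why the attempt failed
    suggested_fix: str     # what the next retry should try
    confidence: float      # 0.0 to 1.0: how much the fix is trusted
    embedding: list[float] = field(default_factory=list)  # semantic vector

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 0.0-1.0 range.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


# Sample values are invented for illustration.
record = FailureRecord(
    error_type="ImportError",
    root_cause="native module 'sharp' missing from the build image",
    suggested_fix="install sharp before running the build",
    confidence=0.8,
    embedding=[0.12, -0.03, 0.41],
)
print(asdict(record)["error_type"])
```

Keeping the record flat like this makes it easy to serialize to JSON or to store alongside its embedding.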


Quick Start

The reloop-ai package is officially published and ready for production use.

pip install reloop-ai
reloop init
reloop demo

Minimal Setup (No Redis Required)

ReLoop works out of the box with SQLite -- no Redis or Blaxel needed:

pip install reloop-ai
export OPENAI_API_KEY=sk-...
reloop serve

Redis and Blaxel are optional -- add them when you need production performance. For local development, ReLoop uses Docker or runs commands directly as the sandbox.


Three Ways to Use ReLoop

1. As a Library (any agent, 3 lines)

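import asyncio
from reloop import FailureMemory

async def main():
    memory = FailureMemory(redis_url="redis://localhost:6379")
    return await memory.search("ImportError sharp")  # Returns past failures + fixes

similar = asyncio.run(main())  # `await` only works inside an async function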

2. As a Framework (full self-healing loop)

reloop run "Fix and deploy the Next.js project at ./my-broken-app"

3. As an MCP Server (Cursor & Claude Code)

Integrate ReLoop directly into your AI IDEs so they become self-healing. When the MCP server is connected, Claude can:

  • Search Memory: Semantically search past errors across your team's history.
  • Fetch Checkpoints: Restore exact state from past Blaxel Firecracker VM checkpoints.
  • Execute Safely: Run isolated code tests within Blaxel sandboxes directly from the editor.

Add to your claude_desktop_config.json or Cursor settings:

{
  "mcpServers": {
    "reloop": {
      "command": "python",
      "args": ["-m", "reloop.mcp_server"]
    }
  }
}
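If your config file already lists other MCP servers, the entry above has to be merged in rather than pasted over them. A small sketch of that merge follows. The `add_reloop_server` helper and the `other-tool` entry are hypothetical, and reading or writing the real file is left out.

```python
import json

# The server entry shown above, expressed as a Python dict.
RELOOP_ENTRY = {"command": "python", "args": ["-m", "reloop.mcp_server"]}


def add_reloop_server(config: dict) -> dict:
    """Return a copy of an MCP config with the reloop entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["reloop"] = RELOOP_ENTRY
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one other server already registered.
existing = {"mcpServers": {"other-tool": {"command": "node", "args": ["server.js"]}}}
print(json.dumps(add_reloop_server(existing), indent=2))
```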

Architecture

┌─────────────────────┐   MCP Protocol    ┌──────────────────────┐
│  Cursor / Claude    │ ───────────────►  │  ReLoop MCP Server   │
└─────────────────────┘                   └──────────┬───────────┘
                                                     │
                               ┌─────────────────────┴──────────────────────┐
                               │                                             │
                               ▼  Vector Search                              ▼  Isolated Execution
                    ┌──────────────────────┐                    ┌──────────────────────┐
                    │  Redis Agent Memory  │                    │  Blaxel Firecracker  │
                    └──────────┬───────────┘                    └──────────┬───────────┘
                               │ stores                                    │ 25ms resume
                               ▼                                           ▼
                    ┌──────────────────────┐                    ┌──────────────────────┐
                    │  Failure Embeddings  │                    │   State Checkpoints  │
                    └──────────────────────┘                    └──────────────────────┘

Redis: The Memory Backbone

ReLoop uses Redis as the core of its Agent Memory Server. Memory is organized into three tiers:

  1. Working Memory: Stores the current task's session state and immediate context.
  2. Long-term Memory: A persistent failure graph using Redis Vector Search to semantically match current errors with past distilled solutions.
  3. Episodic Memory: Full execution traces and timeline records for auditing and the dashboard UI.
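The semantic matching done by the long-term tier can be illustrated without Redis. The sketch below ranks stored failures by cosine similarity to a query embedding in plain Python. It stands in for the idea only: it is not Redis Vector Search or ReLoop's implementation, and the three-dimensional vectors and labels are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_failures(query, memory, top_k=1):
    """Rank stored (label, embedding) pairs by similarity to the query vector."""
    scored = [(cosine_similarity(query, emb), label) for label, emb in memory]
    return sorted(scored, reverse=True)[:top_k]


# Toy 3-dimensional embeddings; real embeddings have far more dimensions.
memory = [
    ("ImportError: missing native module", [0.9, 0.1, 0.0]),
    ("TimeoutError: build exceeded limit", [0.0, 0.2, 0.9]),
]
best = nearest_failures([0.8, 0.2, 0.1], memory)
print(best[0][1])
```

A vector index does the same ranking, but avoids comparing the query against every stored record.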

Blaxel: Perpetual Execution Sandboxes

For safe, deterministic task execution, ReLoop integrates Blaxel's Firecracker microVMs.

  • Perpetual State: The sandbox is never lost. You can pause and resume the exact environment.
  • 25ms Checkpoint/Restore: ReLoop checkpoints the sandbox after every step, and a checkpoint resumes in about 25ms.
  • Time-Travel Rewinds: Hit a roadblock? Rewind the agent to a previous checkpoint in 25ms and try a different fix strategy.
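The checkpoint-then-rewind pattern described above can be sketched with an in-memory stand-in. `CheckpointLog` is a hypothetical class for illustration only. It does not call the Blaxel API, and it snapshots a plain dict rather than a microVM.

```python
import copy


class CheckpointLog:
    """In-memory stand-in for sandbox checkpoints; not the Blaxel API."""

    def __init__(self, initial_state: dict) -> None:
        self.state = copy.deepcopy(initial_state)
        self._checkpoints: list[dict] = []

    def step(self, **changes) -> int:
        """Apply one step, then checkpoint. Returns the checkpoint index."""
        self.state.update(changes)
        self._checkpoints.append(copy.deepcopy(self.state))
        return len(self._checkpoints) - 1

    def rewind(self, index: int) -> None:
        """Restore the state at a checkpoint and drop the later ones."""
        self.state = copy.deepcopy(self._checkpoints[index])
        del self._checkpoints[index + 1:]


log = CheckpointLog({"deps_installed": False, "build": "not run"})
cp0 = log.step(deps_installed=True)
log.step(build="failed")   # this fix strategy hit a roadblock
log.rewind(cp0)            # go back and try a different strategy
print(log.state)
```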

The REJD Loop

The core algorithm: Retrieve -> Execute -> Judge -> Distill

                           ┌─────────────┐
                           │  New Task   │
                           └──────┬──────┘
                                  │
                                  ▼
                    ┌─────────────────────────────┐
                    │           Retrieve           │
                    │  Query Redis for similar     │
                    │      past failures           │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │           Execute            │
                    │    Run in Blaxel sandbox     │
                    └──────────────┬──────────────┘
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │            Judge             │
                    │     Success or failure?      │
                    └──────┬──────────────┬────────┘
                           │              │
                        Success        Failure
                           │              │
                           ▼              ▼
              ┌────────────────┐  ┌───────────────────────┐
              │ Distill        │  │ Distill Failure        │
              │ Success        │  │ root cause, fix,       │
              │ Store solution │  │ confidence score       │
              └───────┬────────┘  └──────────┬────────────┘
                      │                       │
                      ▼                       ▼
              ┌────────────────┐  ┌───────────────────────┐
              │ Task Complete  │  │  Circuit breaker or    │
              └────────────────┘  │   budget exceeded?    │
                                  └──────┬──────────┬──────┘
                                         │          │
                                        Yes         No
                                         │          │
                                         ▼          └──► Retrieve ↑
                                ┌─────────────────┐
                                │ Task Abandoned  │
                                └─────────────────┘

Powered by:

  • OpenAI Agents SDK -- orchestrates the REJD loop with handoffs between specialist agents
  • Redis -- 3-tier failure memory (working, long-term, episodic) via Agent Memory Server
  • Blaxel -- Firecracker sandbox with 25ms checkpoint/restore (optional; Docker or direct execution for local dev)
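The control flow in the diagram above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not ReLoop's orchestration code. `rejd_loop`, `LoopResult`, and the `execute` callback are hypothetical names, and the default limits are taken from the Configuration table (MAX_RETRIES, MAX_BUDGET_USD, CIRCUIT_BREAKER_THRESHOLD).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    status: str                      # "complete" or "abandoned"
    attempts: int
    cost_usd: float
    lessons: list[str] = field(default_factory=list)


def rejd_loop(
    execute: Callable[[list[str]], tuple[bool, str, float]],
    max_retries: int = 5,
    max_budget_usd: float = 1.00,
    breaker_threshold: int = 3,
) -> LoopResult:
    """Hypothetical driver. `execute` receives the lessons retrieved so far and
    returns (succeeded, distilled_lesson, cost_of_attempt_usd)."""
    lessons: list[str] = []
    cost = 0.0
    consecutive_failures = 0
    for attempt in range(1, max_retries + 1):
        retrieved = list(lessons)                 # Retrieve
        ok, lesson, spent = execute(retrieved)    # Execute + Judge
        cost += spent
        lessons.append(lesson)                    # Distill (success or failure)
        if ok:
            return LoopResult("complete", attempt, cost, lessons)
        consecutive_failures += 1
        # Circuit breaker or budget exceeded: abandon the task.
        if consecutive_failures >= breaker_threshold or cost >= max_budget_usd:
            return LoopResult("abandoned", attempt, cost, lessons)
    return LoopResult("abandoned", max_retries, cost, lessons)


def fake_execute(retrieved: list[str]) -> tuple[bool, str, float]:
    # Toy executor: fails until one lesson is available, then succeeds.
    if retrieved:
        return True, "applied stored fix", 0.05
    return False, "root cause: missing dependency", 0.05


result = rejd_loop(fake_execute)
print(result.status, result.attempts)
```

In this toy run the first attempt fails with no memory to draw on, and the second succeeds because the distilled lesson from the first is retrieved.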

Timeline UI

The non-chat interface that makes failure learning visible.

A horizontal timeline of colored nodes tells the full story at a glance:

RED (failed) -> RED (failed) -> RED (failed) -> GREEN (succeeded)

Click any node to inspect the full failure record -- root cause, suggested fix, confidence score, cost, and the exact code diff that resolved it.
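As a rough illustration of the data behind the timeline, the text form shown above can be produced from a list of per-attempt outcomes. `render_timeline` is a hypothetical helper, not part of the dashboard.

```python
def render_timeline(outcomes: list[bool]) -> str:
    """Turn per-attempt outcomes into the text form of the timeline."""
    labels = ["GREEN (succeeded)" if ok else "RED (failed)" for ok in outcomes]
    return " -> ".join(labels)


# Three failed attempts followed by one success.
print(render_timeline([False, False, False, True]))
```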


Integrations

Works with any agent framework:

  • OpenAI Agents SDK
  • LangGraph
  • CrewAI
  • Claude Agent SDK
  • Raw Python

ReLoop is the memory layer -- bring your own orchestration.


A/B: Memory vs No Memory

Metric                  Without Memory  With Memory
Attempts to fix 4 bugs  12+             4
Total cost              $0.47           $0.18
Same mistake repeated   3x              0x
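The relative savings implied by the table can be worked out directly. The snippet below uses only the numbers above and treats "12+" as a lower bound of 12 attempts.

```python
cost_without, cost_with = 0.47, 0.18
attempts_without, attempts_with = 12, 4   # "12+" is treated as a lower bound

cost_saving = (cost_without - cost_with) / cost_without
attempt_saving = (attempts_without - attempts_with) / attempts_without

print(f"cost reduction: {cost_saving:.0%}")
print(f"attempt reduction: at least {attempt_saving:.0%}")
```

That comes to roughly a 62% lower cost and at least 67% fewer attempts for this comparison.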

API Reference

Full API reference: docs/api-reference.md

Method  Path                                      Description
POST    /v1/tasks                                 Create and run a task
GET     /v1/tasks/{id}                            Get task status and result
GET     /v1/tasks/{id}/timeline                   Full execution timeline
GET     /v1/tasks/{id}/sse                        Server-Sent Events stream
POST    /v1/memories/search                       Semantic search over failure memory
GET     /v1/memories/stats                        Aggregated memory statistics
GET     /v1/tasks/{id}/checkpoints                List sandbox checkpoints
POST    /v1/tasks/{id}/checkpoints/{cid}/restore  Rewind to checkpoint
POST    /v1/tasks/ab-comparison                   Run A/B comparison (with vs without memory)
GET     /v1/tasks/{id}/circuit-breaker            Get circuit breaker state for a task
POST    /v1/memories/predict                      Predict failure likelihood for new code
GET     /v1/memories/export                       Export failure memory as JSON
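A sketch of how a client might address these endpoints, using only the standard library. It assumes the API is served at http://localhost:8000 (the default API_PORT). The request body fields ("task", "query") are guesses rather than documented names, so check docs/api-reference.md for the real schema. Requests are built but not sent.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumes the default API_PORT


def build_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Prepare (but do not send) a JSON request for one of the endpoints above."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


# The body field names ("task", "query") are assumptions, not the documented schema.
create = build_request("POST", "/v1/tasks", {"task": "Fix and deploy the Next.js project at ./my-broken-app"})
search = build_request("POST", "/v1/memories/search", {"query": "ImportError sharp"})
status = build_request("GET", "/v1/tasks/TASK_ID")  # TASK_ID is a placeholder

for req in (create, search, status):
    print(req.get_method(), req.full_url)

# To send one against a running server:
# with urllib.request.urlopen(create) as resp:
#     print(json.load(resp))
```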

Configuration

ReLoop is configured via environment variables. Only OPENAI_API_KEY is required -- everything else has sensible defaults.

Variable                   Required  Default                 Description
OPENAI_API_KEY             Yes       --                      OpenAI API key for code generation and reasoning
REDIS_URL                  No        redis://localhost:6379  Redis connection URL (falls back to SQLite)
REDIS_MEMORY_INDEX         No        reloop-failures         Vector index name for failure embeddings
BLAXEL_API_KEY             No        --                      Blaxel API key for Firecracker sandboxes
BLAXEL_WORKSPACE           No        --                      Blaxel workspace name
CODEX_MODEL                No        gpt-4o                  Chat model for planner/distiller
REASONING_MODEL            No        o1                      Deep reasoning model for root cause analysis
FAST_MODEL                 No        gpt-4o-mini             Fast/cheap model for classification
EMBEDDING_MODEL            No        text-embedding-3-small  Model for failure memory embeddings
API_PORT                   No        8000                    FastAPI server port
API_HOST                   No        0.0.0.0                 API host
NEXT_PUBLIC_API_URL        No        http://localhost:8000   Backend API URL for frontend
MAX_RETRIES                No        5                       Maximum retry attempts per task
MAX_BUDGET_USD             No        1.00                    Maximum cost budget per task
CIRCUIT_BREAKER_THRESHOLD  No        3                       Consecutive failures before circuit break

See .env.example for a copy-paste template.
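For scripts that need the same values, the variables can be read with their documented defaults. The sketch below covers a subset of the table. `Settings` and `load_settings` are hypothetical names, not part of the reloop package.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    redis_url: str
    max_retries: int
    max_budget_usd: float
    circuit_breaker_threshold: int


def load_settings(env: Optional[dict] = None) -> Settings:
    """Read a subset of the variables above, applying the documented defaults."""
    env = dict(os.environ) if env is None else env
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        redis_url=env.get("REDIS_URL", "redis://localhost:6379"),
        max_retries=int(env.get("MAX_RETRIES", "5")),
        max_budget_usd=float(env.get("MAX_BUDGET_USD", "1.00")),
        circuit_breaker_threshold=int(env.get("CIRCUIT_BREAKER_THRESHOLD", "3")),
    )


# Passing a dict instead of os.environ keeps the example self-contained.
settings = load_settings({"OPENAI_API_KEY": "sk-example", "MAX_RETRIES": "2"})
print(settings.max_retries, settings.max_budget_usd)
```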


Tech Stack

Layer              Technology                      Role
Orchestration      OpenAI Agents SDK               REJD loop with specialist agent handoffs
Failure Memory     Redis Agent Memory Server       3-tier: working memory, long-term failure graph, episodic traces
Execution Sandbox  Blaxel Firecracker microVMs     Perpetual state, 25ms resume, checkpoint/restore
API                FastAPI + SSE                   Task management, memory search, real-time streaming
Dashboard          Next.js + Tailwind + shadcn/ui  Timeline, failure sidebar, cost tracker

Contributing

We welcome contributions. See CONTRIBUTING.md for:

  • Development environment setup
  • Code style requirements (ruff, mypy)
  • PR process and review checklist
  • Architecture overview for new contributors

License

Apache 2.0 -- see LICENSE for the full text.
