Failure memory for AI agents — self-healing retry with structured learning
ReLoop
Failure memory for AI agents.
Every agent fails. ReLoop is the first framework that gets smarter from failure.
The Problem
AI agents retry blindly -- same mistake, same failure, burning tokens and money. No framework treats failure as data. They either retry with no memory, or give up.
The Solution
ReLoop captures every failure into a structured memory graph -- error type, root cause, suggested fix, confidence score, semantic embedding -- so the next retry starts smarter. Your agents don't just recover. They get permanently smarter.
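As a rough illustration of what one such record might hold, the sketch below models the fields listed above as a plain Python dataclass. The class name `FailureRecord`, its field names, and the sample values are assumptions made for this example, not ReLoop's actual schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class FailureRecord:
    """Hypothetical shape of one stored failure (illustrative only)."""

    error_type: str       # e.g. "ImportError"
    root_cause: str       # distilled explanation of why the step failed
    suggested_fix: str    # what to try on the next attempt
    confidence: float     # 0.0 to 1.0, how much the fix is trusted
    embedding: list[float] = field(default_factory=list)  # vector used for semantic search

    def __post_init__(self) -> None:
        # Reject scores outside the expected range early
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


record = FailureRecord(
    error_type="ImportError",
    root_cause="native module 'sharp' was not installed in the sandbox",
    suggested_fix="add sharp to package.json and reinstall dependencies",
    confidence=0.8,
    embedding=[0.12, -0.40, 0.33],  # made-up numbers; real embeddings are much longer
)
print(asdict(record)["error_type"])
```

Keeping the record a flat, serializable structure is one plausible choice here, since it can be stored in either Redis or SQLite without translation.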
Quick Start
The reloop-ai package is officially published and ready for production use.
pip install reloop-ai
reloop init
reloop demo
Minimal Setup (No Redis Required)
ReLoop works out of the box with SQLite -- no Redis or Blaxel needed:
pip install reloop-ai
export OPENAI_API_KEY=sk-...
reloop serve
Redis and Blaxel are optional -- add them when you need production performance. For local development, ReLoop uses Docker or runs commands directly as the sandbox.
Three Ways to Use ReLoop
1. As a Library (any agent, a few lines)
import asyncio
from reloop import FailureMemory
async def main():
    memory = FailureMemory(redis_url="redis://localhost:6379")
    return await memory.search("ImportError sharp")  # Returns past failures + fixes
similar = asyncio.run(main())  # await is only valid inside a coroutine
2. As a Framework (full self-healing loop)
reloop run "Fix and deploy the Next.js project at ./my-broken-app"
3. As an MCP Server (Cursor & Claude Code)
Connect ReLoop to your AI editor so the assistant can reuse past fixes instead of repeating failures. Once the MCP server is connected, Cursor or Claude can:
- Search Memory: Semantically search past errors across your team's history.
- Fetch Checkpoints: Restore exact state from past Blaxel Firecracker VM checkpoints.
- Execute Safely: Run isolated code tests within Blaxel sandboxes directly from the editor.
Add to your claude_desktop_config.json or Cursor settings:
{
"mcpServers": {
"reloop": {
"command": "python",
"args": ["-m", "reloop.mcp_server"]
}
}
}
Architecture
┌─────────────────────┐   MCP Protocol   ┌──────────────────────┐
│  Cursor / Claude    │ ───────────────► │  ReLoop MCP Server   │
└─────────────────────┘                  └──────────┬───────────┘
                                                    │
                              ┌─────────────────────┴──────────────────────┐
                              │                                            │
                              ▼ Vector Search                              ▼ Isolated Execution
                   ┌──────────────────────┐                     ┌──────────────────────┐
                   │  Redis Agent Memory  │                     │  Blaxel Firecracker  │
                   └──────────┬───────────┘                     └──────────┬───────────┘
                              │ stores                                     │ 25ms resume
                              ▼                                            ▼
                   ┌──────────────────────┐                     ┌──────────────────────┐
                   │  Failure Embeddings  │                     │  State Checkpoints   │
                   └──────────────────────┘                     └──────────────────────┘
Redis: The Memory Backbone
ReLoop uses Redis as the core of its Agent Memory Server, which is organized as a 3-tier memory system:
- Working Memory: Stores the current task's session state and immediate context.
- Long-term Memory: A persistent failure graph using Redis Vector Search to semantically match current errors with past distilled solutions.
- Episodic Memory: Full execution traces and timeline records for auditing and the dashboard UI.
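To make the long-term tier more concrete, here is a minimal sketch of semantic matching by cosine similarity. It uses an in-memory list as a stand-in for Redis Vector Search, and the function names, record layout, and three-dimensional embeddings are all assumptions for illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_failures(query: list[float], memories: list[dict], top_k: int = 1) -> list[dict]:
    """Return the top_k stored failures most similar to the query embedding."""
    scored = [(cosine_similarity(query, m["embedding"]), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


# Hypothetical long-term memory with toy embeddings
long_term = [
    {"fix": "install the missing dependency", "embedding": [0.9, 0.1, 0.0]},
    {"fix": "increase the request timeout", "embedding": [0.0, 0.2, 0.9]},
]
best = search_failures([1.0, 0.0, 0.0], long_term)[0]
print(best["fix"])  # prints "install the missing dependency"
```

A real vector index avoids scanning every record, but the ranking idea is the same: the current error's embedding is compared against stored failures and the closest matches are returned with their fixes.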
Blaxel: Perpetual Execution Sandboxes
For safe, deterministic task execution, ReLoop integrates Blaxel's Firecracker microVMs.
- Perpetual State: The sandbox is never lost. You can pause and resume the exact environment.
- 25ms Checkpoint/Restore: ReLoop checkpoints the sandbox after every step, and a checkpoint can be restored in 25ms.
- Time-Travel Rewinds: Hit a roadblock? Rewind the agent to a previous checkpoint in 25ms and try a different fix strategy.
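The checkpoint-and-rewind idea can be sketched with a small in-memory store. This is only an analogy: a Blaxel checkpoint captures a whole microVM, whereas the hypothetical `CheckpointStore` below snapshots a plain dictionary, and none of these names come from ReLoop or Blaxel.

```python
import copy


class CheckpointStore:
    """In-memory stand-in for sandbox checkpoints (illustrative only)."""

    def __init__(self) -> None:
        self._snapshots: list[dict] = []

    def save(self, state: dict) -> int:
        """Snapshot the state and return its checkpoint id."""
        self._snapshots.append(copy.deepcopy(state))
        return len(self._snapshots) - 1

    def restore(self, checkpoint_id: int) -> dict:
        """Rewind to a checkpoint, discarding every later snapshot."""
        snapshot = self._snapshots[checkpoint_id]  # raises IndexError for unknown ids
        del self._snapshots[checkpoint_id + 1:]
        return copy.deepcopy(snapshot)


store = CheckpointStore()
state = {"files": {"app.py": "v1"}}
first = store.save(state)                     # checkpoint after step 1
state["files"]["app.py"] = "v2 (broken fix)"  # step 2 makes things worse
store.save(state)
state = store.restore(first)                  # rewind and try another strategy
print(state["files"]["app.py"])  # prints "v1"
```

Deep copies keep each snapshot isolated, so later edits to the working state cannot corrupt an earlier checkpoint.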
The REJD Loop
The core algorithm: Retrieve -> Execute -> Judge -> Distill
             ┌─────────────┐
             │  New Task   │
             └──────┬──────┘
                    │
                    ▼
     ┌─────────────────────────────┐
     │          Retrieve           │
     │   Query Redis for similar   │
     │        past failures        │
     └──────────────┬──────────────┘
                    │
                    ▼
     ┌─────────────────────────────┐
     │           Execute           │
     │    Run in Blaxel sandbox    │
     └──────────────┬──────────────┘
                    │
                    ▼
     ┌─────────────────────────────┐
     │            Judge            │
     │     Success or failure?     │
     └──┬──────────────────────┬───┘
        │                      │
     Success                Failure
        │                      │
        ▼                      ▼
┌────────────────┐  ┌───────────────────────┐
│    Distill     │  │    Distill Failure    │
│    Success     │  │   root cause, fix,    │
│ Store solution │  │   confidence score    │
└───────┬────────┘  └──────────┬────────────┘
        │                      │
        ▼                      ▼
┌────────────────┐  ┌───────────────────────┐
│ Task Complete  │  │  Circuit breaker or   │
└────────────────┘  │   budget exceeded?    │
                    └──────┬─────────┬──────┘
                           │         │
                          Yes        No
                           │         │
                           ▼         └──► Retrieve ↑
                  ┌─────────────────┐
                  │ Task Abandoned  │
                  └─────────────────┘
Powered by:
- OpenAI Agents SDK -- orchestrates the REJD loop with handoffs between specialist agents
- Redis -- 3-tier failure memory (working, long-term, episodic) via Agent Memory Server
- Blaxel -- Firecracker sandbox with 25ms checkpoint/restore (optional; Docker or direct execution for local dev)
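The control flow in the diagram above can be approximated in a few lines of plain Python. This is a simplified sketch, not ReLoop's implementation: `rejd_loop`, the `execute` callback signature, and the list used as memory are hypothetical, while the default limits mirror `MAX_RETRIES`, `MAX_BUDGET_USD`, and `CIRCUIT_BREAKER_THRESHOLD` from the Configuration section.

```python
from typing import Callable

# execute(task, hints) -> (succeeded, detail, cost_in_usd); a hypothetical callback
Executor = Callable[[str, list[str]], tuple[bool, str, float]]


def rejd_loop(
    task: str,
    execute: Executor,
    max_retries: int = 5,
    max_budget_usd: float = 1.00,
    circuit_breaker_threshold: int = 3,
) -> dict:
    memory: list[str] = []       # distilled lessons, stand-in for failure memory
    spent = 0.0
    consecutive_failures = 0
    attempts = 0
    while attempts < max_retries:
        attempts += 1
        hints = list(memory)                       # Retrieve
        ok, detail, cost = execute(task, hints)    # Execute
        spent += cost
        if ok:                                     # Judge
            memory.append(f"solution: {detail}")   # Distill success
            return {"status": "complete", "attempts": attempts, "cost": spent}
        memory.append(f"avoid: {detail}")          # Distill failure
        consecutive_failures += 1
        if consecutive_failures >= circuit_breaker_threshold or spent >= max_budget_usd:
            break                                  # circuit breaker or budget exceeded
    return {"status": "abandoned", "attempts": attempts, "cost": spent}


def fake_execute(task: str, hints: list[str]) -> tuple[bool, str, float]:
    # Toy executor: succeeds only once two earlier failures have been distilled
    if len(hints) >= 2:
        return True, "pinned the dependency version", 0.05
    return False, f"attempt {len(hints) + 1} failed", 0.05


result = rejd_loop("fix the build", fake_execute)
print(result["status"], result["attempts"])  # prints "complete 3"
```

Note that with the default values the circuit breaker (3 consecutive failures) trips before the retry limit (5) is reached, so a task that never succeeds is abandoned after three attempts.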
Timeline UI
The non-chat interface that makes failure learning visible.
A horizontal timeline of colored nodes tells the full story at a glance:
RED (failed) -> RED (failed) -> RED (failed) -> GREEN (succeeded)
Click any node to inspect the full failure record -- root cause, suggested fix, confidence score, cost, and the exact code diff that resolved it.
Integrations
Works with any agent framework:
- OpenAI Agents SDK
- LangGraph
- CrewAI
- Claude Agent SDK
- Raw Python
ReLoop is the memory layer -- bring your own orchestration.
A/B: Memory vs No Memory
| Metric | Without Memory | With Memory |
|---|---|---|
| Attempts to fix 4 bugs | 12+ | 4 |
| Total cost | $0.47 | $0.18 |
| Same mistake repeated | 3x | 0x |
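For readers who prefer relative numbers, the reductions implied by the table can be worked out directly. The snippet only restates the figures above; treating "12+" as exactly 12 is an assumption, so the attempt reduction is a lower bound.

```python
# Figures copied from the A/B table; "12+" is taken as its lower bound of 12
without_memory = {"attempts": 12, "cost_usd": 0.47}
with_memory = {"attempts": 4, "cost_usd": 0.18}

cost_saving = 1 - with_memory["cost_usd"] / without_memory["cost_usd"]
attempt_saving = 1 - with_memory["attempts"] / without_memory["attempts"]

print(f"cost reduction: {cost_saving:.0%}")                  # prints "cost reduction: 62%"
print(f"attempt reduction: at least {attempt_saving:.0%}")   # prints "attempt reduction: at least 67%"
```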
API Reference
Full API reference: docs/api-reference.md
| Method | Path | Description |
|---|---|---|
| POST | /v1/tasks | Create and run a task |
| GET | /v1/tasks/{id} | Get task status and result |
| GET | /v1/tasks/{id}/timeline | Full execution timeline |
| GET | /v1/tasks/{id}/sse | Server-Sent Events stream |
| POST | /v1/memories/search | Semantic search over failure memory |
| GET | /v1/memories/stats | Aggregated memory statistics |
| GET | /v1/tasks/{id}/checkpoints | List sandbox checkpoints |
| POST | /v1/tasks/{id}/checkpoints/{cid}/restore | Rewind to checkpoint |
| POST | /v1/tasks/ab-comparison | Run A/B comparison (with vs without memory) |
| GET | /v1/tasks/{id}/circuit-breaker | Get circuit breaker state for a task |
| POST | /v1/memories/predict | Predict failure likelihood for new code |
| GET | /v1/memories/export | Export failure memory as JSON |
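As a hedged example of calling one of these endpoints, the sketch below builds a request for `POST /v1/memories/search` using only the Python standard library. The base URL follows the default `API_PORT` of 8000; the body fields `query` and `limit` and the helper name are assumptions, so consult docs/api-reference.md for the real request schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the default API_PORT on a local server


def build_memory_search_request(query: str, limit: int = 5) -> urllib.request.Request:
    """Build (but do not send) a semantic search request."""
    # "query" and "limit" are hypothetical field names, not confirmed by the API docs
    body = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/memories/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_memory_search_request("ImportError sharp")
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Separating request construction from sending keeps the example runnable without a server and makes the payload easy to inspect.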
Configuration
ReLoop is configured via environment variables. Only OPENAI_API_KEY is required -- everything else has sensible defaults.
| Variable | Required | Default | Description |
|---|---|---|---|
| OPENAI_API_KEY | Yes | -- | OpenAI API key for code generation and reasoning |
| REDIS_URL | No | redis://localhost:6379 | Redis connection URL (falls back to SQLite) |
| REDIS_MEMORY_INDEX | No | reloop-failures | Vector index name for failure embeddings |
| BLAXEL_API_KEY | No | -- | Blaxel API key for Firecracker sandboxes |
| BLAXEL_WORKSPACE | No | -- | Blaxel workspace name |
| CODEX_MODEL | No | gpt-4o | Chat model for planner/distiller |
| REASONING_MODEL | No | o1 | Deep reasoning model for root cause analysis |
| FAST_MODEL | No | gpt-4o-mini | Fast/cheap model for classification |
| EMBEDDING_MODEL | No | text-embedding-3-small | Model for failure memory embeddings |
| API_PORT | No | 8000 | FastAPI server port |
| API_HOST | No | 0.0.0.0 | API host |
| NEXT_PUBLIC_API_URL | No | http://localhost:8000 | Backend API URL for frontend |
| MAX_RETRIES | No | 5 | Maximum retry attempts per task |
| MAX_BUDGET_USD | No | 1.00 | Maximum cost budget per task |
| CIRCUIT_BREAKER_THRESHOLD | No | 3 | Consecutive failures before circuit break |
See .env.example for a copy-paste template.
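If you want to read the same variables from your own code, a loader along these lines would work. It is a sketch under stated assumptions: the `Settings` class and `load_settings` function are hypothetical and cover only a subset of the variables, with defaults copied from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical subset of the configuration (not ReLoop's own class)."""

    openai_api_key: str
    redis_url: str
    max_retries: int
    max_budget_usd: float
    circuit_breaker_threshold: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        redis_url=env.get("REDIS_URL", "redis://localhost:6379"),
        max_retries=int(env.get("MAX_RETRIES", "5")),
        max_budget_usd=float(env.get("MAX_BUDGET_USD", "1.00")),
        circuit_breaker_threshold=int(env.get("CIRCUIT_BREAKER_THRESHOLD", "3")),
    )


# Passing a dict instead of os.environ keeps the example self-contained
settings = load_settings({"OPENAI_API_KEY": "sk-example", "MAX_RETRIES": "2"})
print(settings.max_retries, settings.max_budget_usd)  # prints "2 1.0"
```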
| Layer | Technology | Role |
|---|---|---|
| Orchestration | OpenAI Agents SDK | REJD loop with specialist agent handoffs |
| Failure Memory | Redis Agent Memory Server | 3-tier: working memory, long-term failure graph, episodic traces |
| Execution Sandbox | Blaxel Firecracker microVMs | Perpetual state, 25ms resume, checkpoint/restore |
| API | FastAPI + SSE | Task management, memory search, real-time streaming |
| Dashboard | Next.js + Tailwind + shadcn/ui | Timeline, failure sidebar, cost tracker |
Contributing
We welcome contributions. See CONTRIBUTING.md for:
- Development environment setup
- Code style requirements (ruff, mypy)
- PR process and review checklist
- Architecture overview for new contributors
License
Apache 2.0 -- see LICENSE for the full text.