
Agentic Reliability Framework (OSS Edition): Enterprise-grade multi-agent AI for infrastructure reliability monitoring and advisory-only self-healing.

AGENTIC RELIABILITY FRAMEWORK

Production-grade multi-agent AI system for infrastructure reliability monitoring and self-healing.

ARF is the first enterprise framework to pair autonomous, context-aware AI agents with advisory healing intelligence (OSS) and executed remediation (Enterprise) for infrastructure reliability at scale.

Battle-tested architecture for autonomous incident detection and advisory remediation intelligence.


Agentic Reliability Framework (ARF) v3.3.7: Production Stability Release

⚠️ IMPORTANT OSS DISCLAIMER

This Apache 2.0 OSS edition is analysis and advisory-only. It does NOT execute actions, does NOT auto-heal, and does NOT perform remediation.

All execution, automation, persistence, and learning loops are Enterprise-only features.

Executive Summary

Modern systems do not fail because metrics are missing.

They fail because decisions arrive too late.

ARF is a graph-native, agentic reliability platform that treats incidents as memory and reasoning problems, not alerting problems. It captures operational experience, reasons over it using AI agents, and enforces stable, production-grade execution boundaries for autonomous healing.

This is not another monitoring tool.

This is operational intelligence.

A dual-architecture reliability framework where OSS analyzes and creates intent, and Enterprise safely executes intent.

This repository contains the Apache 2.0 OSS edition (v3.3.7 Stable). Enterprise components are distributed separately under a commercial license.

v3.3.7 Production Stability Release

This release finalizes import compatibility, eliminates circular dependencies, and enforces clean OSS/Enterprise boundaries.
All public imports are now guaranteed stable for production use.

🔒 Stability Guarantees (v3.3.7+)

ARF v3.3.7 introduces hard stability guarantees for OSS users:

  • ✅ No circular imports
  • ✅ Direct, absolute imports for all public APIs
  • ✅ Pydantic v2 ↔ Dataclass compatibility wrapper
  • ✅ Graceful fallback behavior (no runtime crashes)
  • ✅ Advisory-only execution enforced at runtime

If you can import it, it is safe to use in production.
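The compatibility wrapper and graceful fallback listed above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not ARF's actual code: `ReliabilityEvent` and `to_dict` are hypothetical names, and the only library call relied on is Pydantic's `BaseModel` (with `model_dump` in v2).

```python
from dataclasses import asdict, dataclass, is_dataclass

try:  # Prefer Pydantic when it is installed.
    from pydantic import BaseModel

    class ReliabilityEvent(BaseModel):  # hypothetical model name
        component: str
        latency_ms: float

except ImportError:  # Graceful fallback: a plain dataclass, no runtime crash.

    @dataclass
    class ReliabilityEvent:
        component: str
        latency_ms: float


def to_dict(event) -> dict:
    """Return a plain dict regardless of which backend defined the model."""
    if hasattr(event, "model_dump"):  # Pydantic v2
        return event.model_dump()
    if is_dataclass(event):
        return asdict(event)
    return event.dict()  # Pydantic v1


event = ReliabilityEvent(component="api", latency_ms=250.0)
print(to_dict(event))
```

Callers only touch `ReliabilityEvent` and `to_dict`, so the choice of backend stays an import-time detail.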


Why ARF Exists

The Problem

  • AI Agents Fail in Production: 73% of AI agent projects fail due to unpredictability, lack of memory, and unsafe execution
  • MTTR is Too High: Average incident resolution takes 14+ minutes in traditional systems. (Measured MTTR reductions are Enterprise-only and require execution and learning loops.)
  • Alert Fatigue: Teams ignore 40%+ of alerts due to false positives and lack of context
  • No Learning: Systems repeat the same failures because they don't remember past incidents

Traditional reliability stacks optimize for:

  • Detection latency
  • Alert volume
  • Dashboard density

But the real business loss happens between:

“Something is wrong” → “We know what to do.”

ARF collapses that gap by providing a hybrid intelligence system that advises safely in OSS and executes deterministically in Enterprise.

  • 🤖 AI Agents for complex pattern recognition
  • ⚙️ Deterministic Rules for reliable, predictable responses
  • 🧠 RAG Graph Memory for context-aware decision making
  • 🔒 MCP Safety Layer for zero-trust execution

🎯 What This Actually Does

OSS

  • Ingests telemetry and incident context
  • Recalls similar historical incidents (FAISS + graph)
  • Applies deterministic safety policies
  • Creates an immutable HealingIntent without executing remediation
  • Never executes actions (advisory-only, permanently)

Enterprise

  • Validates license and usage
  • Applies approval / autonomous policies
  • Executes actions via MCP
  • Persists learning and audit trails

Both

  • Thread-safe
  • Circuit-breaker protected
  • Deterministic, idempotent intent model

Execution, persistence, and autonomous actions are exclusive to Enterprise.
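The immutable, idempotent intent model can be illustrated with a frozen dataclass. This is a sketch only: the field names and the `intent_id` hashing scheme are assumptions made for illustration, not ARF's actual `HealingIntent` definition.

```python
import hashlib
import json
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class HealingIntent:
    """Hypothetical advisory intent; field names are illustrative only."""
    component: str
    action: str
    justification: str

    @property
    def intent_id(self) -> str:
        # The same target and action always hash to the same ID,
        # so submitting the intent twice is idempotent.
        payload = json.dumps(
            {"component": self.component, "action": self.action},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


a = HealingIntent("checkout-api", "restart_pod", "p99 latency spike")
b = HealingIntent("checkout-api", "restart_pod", "matches 3 past incidents")
print(a.intent_id == b.intent_id)  # same target and action, same ID
```

Because the dataclass is frozen, any attempt to mutate an intent after creation raises `FrozenInstanceError`, which is one simple way to keep the advisory record tamper-evident within a process.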


🆓 OSS Edition (Apache 2.0)

| Feature | Implementation | Limits |
| --- | --- | --- |
| MCP Mode | Advisory only (OSSMCPClient) | No execution |
| RAG Memory | In-memory graph + FAISS | 1000 incidents (LRU) |
| Similarity Search | FAISS cosine similarity | Top-K only |
| Learning | Pattern stats only | No persistence |
| Healing | HealingIntent creation | Advisory only |
| Policies | Deterministic guardrails | Warnings + blocks |
| Storage | RAM only | Process-lifetime |
| Support | GitHub Issues | No SLA |
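To make the "1000 incidents (LRU)" and "Top-K only" limits concrete, here is a small pure-Python sketch. It is a stand-in, not the ARF implementation: the `IncidentMemory` class is hypothetical, and a brute-force cosine loop replaces FAISS so the example has no dependencies.

```python
import math
from collections import OrderedDict


class IncidentMemory:
    """Hypothetical bounded store: LRU eviction plus cosine top-K recall."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._items = OrderedDict()  # incident_id -> embedding vector

    def add(self, incident_id: str, vector: list) -> None:
        self._items[incident_id] = vector
        self._items.move_to_end(incident_id)  # mark as most recently used
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def top_k(self, query: list, k: int = 3) -> list:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        scored = [(cosine(query, vec), iid) for iid, vec in self._items.items()]
        scored.sort(reverse=True)
        hits = scored[:k]
        for _, incident_id in hits:
            self._items.move_to_end(incident_id)  # recalled incidents stay warm
        return [(incident_id, round(score, 3)) for score, incident_id in hits]


memory = IncidentMemory(capacity=2)
memory.add("inc-1", [1.0, 0.0])
memory.add("inc-2", [0.0, 1.0])
memory.add("inc-3", [1.0, 1.0])  # capacity is 2, so inc-1 is evicted
print(memory.top_k([1.0, 0.9], k=1))
```

Everything lives in process memory, so the store disappears when the process exits, which matches the "RAM only / Process-lifetime" row.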

💰 Enterprise Edition (Commercial)

| Feature | Implementation | Value |
| --- | --- | --- |
| MCP Modes | Advisory / Approval / Autonomous | Controlled execution |
| Storage | Neo4j + FAISS (hybrid) | Persistent, unlimited |
| Dashboard | React + FastAPI | Live system view |
| Analytics | Graph Neural Networks | Predictive MTTR (Enterprise-only) |
| Compliance | SOC2 / GDPR / HIPAA | Full audit trails |
| Pricing | $0.10 / incident + $499 / month | Usage-based |
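As a rough illustration of the pricing row, the sketch below assumes the two listed figures simply add up (a flat monthly fee plus a per-incident charge). The function name is hypothetical, and actual billing terms may differ.

```python
def monthly_cost_usd(incidents: int, base_fee: float = 499.0,
                     per_incident: float = 0.10) -> float:
    """Flat monthly fee plus a per-incident charge, using the listed prices."""
    if incidents < 0:
        raise ValueError("incidents must be non-negative")
    return round(base_fee + per_incident * incidents, 2)


print(monthly_cost_usd(2_000))  # 499 + 0.10 * 2000
```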

๏ธ Why Choose ARF Over Alternatives

Comparison Matrix

Solution Learning Capability Safety Guarantees Deterministic Behavior Business ROI
ARF (Hybrid Intelligence) โœ… Continuous learning (RAG Graph memory) โœ… High safety (MCP guardrails + approval workflows) โœ… High determinism (Policy Engine + AI synthesis) โœ… Quantified ROI (Enterprise-only: execution + learning required)
Traditional Monitoring (Datadog, New Relic, Prometheus) โŒ No learning capability โœ… High safety (read-only) โœ… High determinism (rules-based) โŒ Reactive only - alerts after failures occur
LLM-Only Agents (AutoGPT, LangChain, CrewAI) โš ๏ธ Limited learning (context window only) โŒ Low safety (direct API access) โŒ Low determinism (hallucinations) โš ๏ธ Unpredictable - cannot guarantee outcomes
Rule-Based Automation (Ansible, Terraform, scripts) โŒ No learning (static rules) โœ… High safety (manual review) โœ… High determinism (exact execution) โš ๏ธ Brittle - breaks with system changes

Key Differentiators

🔄 Learning vs Static

  • Alternatives: Static rules or limited context windows

  • ARF: Continuously learns from incidents → outcomes in RAG Graph memory

🔒 Safety vs Risk

  • Alternatives: Either too restrictive (no autonomy) or too risky (direct execution)

  • ARF: Three-mode MCP system (Advisory → Approval → Autonomous) with guardrails

🎯 Predictability vs Chaos

  • Alternatives: Either brittle rules or unpredictable LLM behavior

  • ARF: Combines deterministic policies with AI-enhanced decision making

💰 ROI Measurement

  • Alternatives: Hard to quantify value beyond "fewer alerts"

  • ARF (Enterprise): Tracks revenue saved, auto-heal rates, and MTTR improvements via execution-aware business dashboards

  • OSS: Generates advisory intent only (no execution, no ROI measurement)

Migration Paths

| Current Solution | Migration Strategy | Expected Benefit |
| --- | --- | --- |
| Traditional Monitoring | Layer ARF on top for predictive insights | Shift from reactive to proactive with 6x faster detection |
| LLM-Only Agents | Replace with ARF's MCP boundary for safety | Maintain AI capabilities while adding reliability guarantees |
| Rule-Based Automation | Enhance with ARF's learning and context | Transform brittle scripts into adaptive, learning systems |
| Manual Operations | Start with ARF in Advisory mode | Reduce toil while maintaining control during transition |

Decision Framework

Choose ARF if you need:

  • ✅ Autonomous operation with safety guarantees

  • ✅ Continuous improvement through learning

  • ✅ Quantifiable business impact measurement

  • ✅ Hybrid intelligence (AI + rules)

  • ✅ Production-grade reliability (circuit breakers, thread safety, graceful degradation)

Consider alternatives if you:

  • ❌ Only need basic alerting (use traditional monitoring)

  • ❌ Require simple, static automation (use scripts)

  • ❌ Are experimenting with AI agents (use LLM frameworks)

  • ❌ Have regulatory requirements prohibiting any autonomous action

ARF provides the intelligence of AI agents with the reliability of traditional automation, creating a new category of "Reliable AI Systems."


Conceptual Architecture (Mental Model)

Signals → Incidents → Memory Graph → Decision → Policy → Execution
             ↑              ↓
         Outcomes ← Learning Loop

Key insight: Reliability improves when systems remember.

🔧 Architecture

🏗️ Core Architecture

Three-Layer Hybrid Intelligence: The ARF Paradigm

ARF introduces a hybrid intelligence architecture that combines the best of three worlds: AI reasoning, deterministic rules, and continuous learning. This three-layer approach ensures both innovation and reliability in production environments.

graph TB 
   subgraph "Layer 1: Cognitive Intelligence" 
       A1[Multi-Agent Orchestration] --> A2[Detective Agent] 
       A1 --> A3[Diagnostician Agent] 
       A1 --> A4[Predictive Agent] 
       A2 --> A5[Anomaly Detection & Pattern Recognition] 
       A3 --> A6[Root Cause Analysis & Investigation] 
       A4 --> A7[Future Risk Forecasting & Trend Analysis] 
   end 
    
   subgraph "Layer 2: Memory & Learning" 
       B1[RAG Graph Memory] --> B2[FAISS Vector Database] 
       B1 --> B3[Incident-Outcome Knowledge Graph] 
       B1 --> B4[Historical Effectiveness Database] 
       B2 --> B5[Semantic Similarity Search] 
       B3 --> B6["Connected Incident → Outcome Edges"]
       B4 --> B7[Success Rate Analytics] 
   end 
    
   subgraph "Layer 3: Execution Control (OSS Advisory / Enterprise Execution)" 
       C1[MCP Server] --> C2[Advisory Mode - OSS Default] 
       C1 --> C3[Approval Mode - Human-in-Loop] 
       C1 --> C4[Autonomous Mode - Enterprise] 
       C1 --> C5[Safety Guardrails & Circuit Breakers] 
       C2 --> C6[What-If Analysis Only] 
       C3 --> C7[Audit Trail & Approval Workflows] 
       C4 --> C8[Auto-Execution with Guardrails] 
   end 
    
   D[Reliability Event] --> A1 
   A1 --> E[Policy Engine] 
   A1 --> B1 
   E & B1 --> C1 
   C1 --> F["Healing Actions (Enterprise Only)"]
   F --> G[Business Impact Dashboard] 
   F -->|Continuous Learning Loop| B1
   G --> H[Quantified ROI: Revenue Saved, MTTR Reduction]

Healing Actions occur only in Enterprise deployments.

Architecture Philosophy: Each layer addresses a critical failure mode of current AI systems:

  1. Cognitive Layer prevents "reasoning from scratch" for each incident

  2. Memory Layer prevents "forgetting past learnings"

  3. Execution Layer prevents "unsafe, unconstrained actions"

OSS Architecture

graph TD
    A[Telemetry / Metrics] --> B[Reliability Engine]
    B --> C[OSSMCPClient]
    C --> D[RAGGraphMemory]
    D --> E[FAISS Similarity]
    D --> F[Incident / Outcome Graph]
    E --> C
    F --> C
    C --> G[HealingIntent]
    G --> H[STOP: Advisory Only]

OSS execution halts permanently at HealingIntent. No actions are performed.


ARF v3.0 Dual-Layer Architecture

          ┌───────────────────────────┐
          │        Telemetry          │
          └─────────────┬─────────────┘
                        │
                        ▼
  ┌───────────── OSS Layer (Advisory Only) ─────────────┐
  │                                                     │
  │  +--------------------+                             │
  │  | Detection Agent    |  ← Anomaly detection        │
  │  | (OSS + Enterprise) |  & forecasting              │
  │  +--------------------+                             │
  │           │                                         │
  │           ▼                                         │
  │  +--------------------+                             │
  │  | Recall Agent       |  ← Retrieve similar         │
  │  | (OSS + Enterprise) |  incidents/actions/outcomes │
  │  +--------------------+                             │
  │           │                                         │
  │           ▼                                         │
  │  +--------------------+                             │
  │  | Decision Agent     |  ← Policy reasoning         │
  │  | (OSS + Enterprise) |  over historical outcomes   │
  │  +--------------------+                             │
  └─────────────────────────┬───────────────────────────┘
                            │
                            ▼
  ┌───────── Enterprise Layer (Full Execution) ─────────┐
  │                                                     │
  │  +--------------------+        +-----------------+  │
  │  | Safety Agent       |  ───>  | Execution Agent |  │
  │  | (Enterprise only)  |        | (MCP modes)     |  │
  │  +--------------------+        +-----------------+  │
  │           │                                         │
  │           ▼                                         │
  │  +--------------------+                             │
  │  | Learning Agent     |  ← Extract outcomes,        │
  │  | (Enterprise only)  |  update RAG & predictive    │
  │  +--------------------+  models                     │
  │           │                                         │
  │           ▼                                         │
  │       HealingIntent (Executed, Audit-ready)         │
  └─────────────────────────────────────────────────────┘

Core Innovations

1. RAG Graph Memory (Not Vector Soup)

ARF models incidents, actions, and outcomes as a graph, rather than simple embeddings. This allows causal reasoning, pattern recall, and outcome-aware recommendations.

graph TD
    Incident -->|caused_by| Component
    Incident -->|resolved_by| Action
    Incident -->|led_to| Outcome

This enables:

  • Causal reasoning: Understand root causes of failures.

  • Pattern recall: Retrieve similar incidents efficiently using FAISS + graph.

  • Outcome-aware recommendations: Suggest actions based on historical success.
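The incident graph and its outcome-aware recommendations can be sketched with plain dictionaries. Only the three edge types (`caused_by`, `resolved_by`, `led_to`) come from the diagram above; the `IncidentGraph` class, its methods, and the `"success"` outcome label are assumptions for illustration.

```python
from collections import defaultdict


class IncidentGraph:
    """Hypothetical in-memory graph using the three edge types shown above."""

    def __init__(self):
        self.edges = defaultdict(list)  # (incident_id, relation) -> [targets]

    def link(self, incident_id: str, relation: str, target: str) -> None:
        self.edges[(incident_id, relation)].append(target)

    def recommend(self, component: str) -> list:
        """Rank actions by past success on incidents caused by `component`."""
        stats = defaultdict(lambda: [0, 0])  # action -> [successes, attempts]
        for (incident_id, relation), targets in list(self.edges.items()):
            if relation != "caused_by" or component not in targets:
                continue
            succeeded = "success" in self.edges.get((incident_id, "led_to"), [])
            for action in self.edges.get((incident_id, "resolved_by"), []):
                stats[action][1] += 1
                stats[action][0] += int(succeeded)
        ranked = sorted(stats.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
        return [(action, wins / tries) for action, (wins, tries) in ranked]


graph = IncidentGraph()
for inc, action, outcome in [
    ("inc-1", "restart_pod", "success"),
    ("inc-2", "restart_pod", "failure"),
    ("inc-3", "scale_out", "success"),
]:
    graph.link(inc, "caused_by", "cache")
    graph.link(inc, "resolved_by", action)
    graph.link(inc, "led_to", outcome)

print(graph.recommend("cache"))
```

A pure embedding index could find the three similar incidents, but it is the explicit `resolved_by` and `led_to` edges that let the ranking prefer the action with the better track record.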

2. Healing Intent Boundary

OSS creates intent.
Enterprise executes intent. The framework separates intent creation from execution.

+----------------+         +---------------------+
|   OSS Layer    |         |  Enterprise Layer   |
| (Analysis Only)|         |  (Execution & GNN)  |
+----------------+         +---------------------+
          |                           ^
          |       HealingIntent       |
          +-------------------------->|

3. MCP (Model Context Protocol) Execution Control

In Enterprise, every action flows through controlled execution modes with policy enforcement:

  • Advisory → Approval → Autonomous modes
  • Blast radius checks
  • Human override paths

No silent actions. Ever.

graph LR
    Action_Request --> Advisory_Mode --> Approval_Mode --> Autonomous_Mode
    Advisory_Mode -->|recommend| Human_Operator
    Approval_Mode -->|requires_approval| Human_Operator
    Autonomous_Mode -->|auto-execute| Safety_Guardrails
    Safety_Guardrails --> Execution_Log

Execution Safety Features:

  1. Blast radius checks: Limit scope of automated actions.

  2. Human override paths: Operators can halt or adjust actions.

  3. No silent execution: All actions are logged for auditability.
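The mode routing and safety checks above might look roughly like this. It is a sketch under stated assumptions: `MCPMode`, `route_action`, and the decision strings are hypothetical names, and nothing is executed; the function only returns what would happen.

```python
from enum import Enum


class MCPMode(Enum):
    ADVISORY = "advisory"
    APPROVAL = "approval"
    AUTONOMOUS = "autonomous"


def route_action(mode: MCPMode, affected_services: int,
                 max_blast_radius: int = 3, approved_by: str = "") -> str:
    """Return a decision string; this sketch never executes anything."""
    if affected_services > max_blast_radius:
        return "blocked: blast radius exceeded"
    if mode is MCPMode.ADVISORY:
        return "recommend only"
    if mode is MCPMode.APPROVAL and not approved_by:
        return "pending human approval"
    return "execute and log"


audit_log = []
for mode in MCPMode:
    decision = route_action(mode, affected_services=2)
    audit_log.append((mode.value, decision))  # every decision is recorded
print(audit_log)
```

Checking the blast radius before the mode means an oversized action is blocked even in autonomous mode.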

Outcome:

  • Hybrid intelligence combining AI-driven recommendations and deterministic policies.

  • Safe, auditable, and deterministic execution in production.

Key Orchestration Steps:

  1. Event Ingestion & Validation - Accepts telemetry, validates with Pydantic models

  2. Multi-Agent Analysis - Parallel execution of specialized agents

  3. RAG Context Retrieval - Semantic search for similar historical incidents

  4. Policy Evaluation - Deterministic rule-based action determination

  5. Action Enhancement - Historical effectiveness data informs priority

  6. MCP Execution - Safe tool execution with guardrails

  7. Outcome Recording - Results stored in RAG Graph for learning

  8. Business Impact Calculation - Revenue and user impact quantification


Multi-Agent Design (ARF v3.0) – Coverage Overview

  • Detection, Recall, Decision → present in both OSS and Enterprise
  • Safety, Execution, Learning → Enterprise only

Table View

| Agent | Responsibility | OSS | Enterprise |
| --- | --- | --- | --- |
| Detection Agent | Detect anomalies, monitor telemetry, perform time-series forecasting | ✅ | ✅ |
| Recall Agent | Retrieve similar incidents/actions/outcomes from RAG graph + FAISS | ✅ | ✅ |
| Decision Agent | Apply deterministic policies, reasoning over historical outcomes | ✅ | ✅ |
| Safety Agent | Enforce guardrails, circuit breakers, compliance constraints | ❌ | ✅ |
| Execution Agent | Execute HealingIntents according to MCP modes (advisory/approval/autonomous) | ❌ | ✅ |
| Learning Agent | Extract outcomes and update predictive models / RAG patterns | ❌ | ✅ |

OSS vs Enterprise Philosophy

OSS (Apache 2.0)

  • Full intelligence
  • Advisory-only execution
  • Hard safety limits
  • Perfect for trust-building

Enterprise

  • Autonomous healing
  • Learning loops
  • Compliance (SOC2, HIPAA, GDPR)
  • Audit trails
  • Multi-tenant control

OSS proves value.
Enterprise captures it.


💰 Business Value and ROI

🔒 Enterprise-Only Metrics

All metrics, benchmarks, MTTR reductions, auto-heal rates, revenue protection figures, and ROI calculations in this section are derived from Enterprise deployments only.

The OSS edition does not execute actions, does not auto-heal, and does not measure business impact.

Detection & Resolution Speed

Enterprise deployments of ARF dramatically reduce incident detection and resolution times compared to industry averages:

| Metric | Industry Average | ARF Performance | Improvement |
| --- | --- | --- | --- |
| High-Priority Incident Detection | 8–14 min | 2.3 min | 71–83% faster |
| Major System Failure Resolution | 45–90 min | 8.5 min | 81–91% faster |
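The "faster" percentages in the table are consistent with a simple relative-reduction formula. The helper below is an illustrative check, not part of ARF; it approximately reproduces the 71–83% and 81–91% ranges from the listed times.

```python
def percent_faster(baseline_min: float, arf_min: float) -> float:
    """Relative reduction in time, as a percentage of the baseline."""
    return round(100 * (baseline_min - arf_min) / baseline_min, 1)


# Detection: industry 8 to 14 min versus 2.3 min
print(percent_faster(8, 2.3), percent_faster(14, 2.3))
# Resolution: industry 45 to 90 min versus 8.5 min
print(percent_faster(45, 8.5), percent_faster(90, 8.5))
```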

Efficiency & Accuracy

ARF improves auto-heal rates and reduces false positives, driving operational efficiency:

| Metric | Industry Average | ARF Performance | Improvement |
| --- | --- | --- | --- |
| Auto-Heal Rate | 5–15% | 81.7% | 5.4× better |
| False Positives | 40–60% | 8.2% | 5–7× better |

Team Productivity

ARF frees up engineering capacity, increasing productivity:

| Metric | Industry Average | ARF Performance | Improvement |
| --- | --- | --- | --- |
| Engineer Hours Spent on Manual Response | 10–20 h/month | 320 h/month recovered | 16–32× improvement |

๐Ÿ† Financial Evolution: From Cost Center to Profit Engine

ARF transforms reliability operations from a high-cost, reactive burden into a high-return strategic asset:

Approach Annual Cost Operational Profile ROI Business Impact
โŒ Cost Center (Traditional Monitoring) $2.5Mโ€“$4.0M 5โ€“15% auto-heal, 40โ€“60% false positives, fully manual response Negative Reliability is a pure expense with diminishing returns
โš™๏ธ Efficiency Tools (Rule-Based Automation) $1.8Mโ€“$2.5M 30โ€“50% auto-heal, brittle scripts, limited scope 1.5โ€“2.5ร— Marginal cost savings; still reactive
๐Ÿง  AI-Assisted (Basic ML/LLM Tools) $1.2Mโ€“$1.8M 50โ€“70% auto-heal, better predictions, requires tuning 3โ€“4ร— Smarter operations, not fully autonomous
โœ… ARF: Profit Engine $0.75Mโ€“$1.2M 81.7% auto-heal, 8.2% false positives, 85% faster resolution 5.2ร—+ Converts reliability into sustainable competitive advantage

Key Insights:

  • Immediate Cost Reduction: Payback in 2–3 months with ~64% incident cost reduction.
  • Engineer Capacity Recovery: 320 hours/month reclaimed (equivalent to 2 full-time engineers).
  • Revenue Protection: $3.2M+ annual revenue protected for mid-market companies.
  • Compounding Value: 3–5% monthly operational improvement as the system learns from outcomes.

๐Ÿข Industry-Specific Impact (Enterprise Deployments)

ARF delivers measurable benefits across industries:

Industry ARF ROI Key Benefit
Finance 8.3ร— $5M/min protection during HFT latency spikes
Healthcare Priceless Zero patient harm, HIPAA-compliant failovers
SaaS 6.8ร— Maintains customer SLA during AI inference spikes
Media & Advertising 7.1ร— Protects $2.1M ad revenue during primetime outages
Logistics 6.5ร— Prevents $12M+ in demurrage and delays

📊 Performance Summary

| Industry | Avg Detection Time (Industry) | ARF Detection Time | Auto-Heal | Improvement |
| --- | --- | --- | --- | --- |
| Finance | 14 min | 0.78 min | 100% | 94% faster |
| Healthcare | 20 min | 0.8 min | 100% | 94% faster |
| SaaS | 45 min | 0.75 min | 95% | 95% faster |
| Media | 30 min | 0.8 min | 90% | 94% faster |
| Logistics | 90 min | 0.8 min | 85% | 94% faster |

Bottom Line: Enterprise ARF deployments convert reliability from a cost center (2–5% of engineering budget) into a profit engine, delivering 5.2×+ ROI and sustainable competitive advantage.

Before ARF

  • 45 min MTTR
  • Tribal knowledge
  • Repeated failures

After ARF

  • 5–10 min MTTR
  • Institutional memory
  • Institutionalized remediation patterns (Enterprise execution)

This is a revenue protection system in Enterprise deployments, and a trust-building advisory intelligence layer in OSS.


Who Uses ARF

Engineers

  • Fewer pages
  • Better decisions
  • Confidence in automation

Founders

  • Reliability without headcount
  • Faster scaling
  • Reduced churn

Executives

  • Predictable uptime
  • Quantified risk
  • Board-ready narratives

Investors

  • Defensible IP
  • Enterprise expansion path
  • OSS → Paid flywheel

graph LR 
   ARF["ARF v3.0"] --> Finance 
   ARF --> Healthcare 
   ARF --> SaaS 
   ARF --> Media 
   ARF --> Logistics 
    
   Finance --> |Real-time monitoring| F1[HFT Systems] 
   Finance --> |Compliance| F2[Risk Management] 
    
   Healthcare --> |Patient safety| H1[Medical Devices] 
   Healthcare --> |HIPAA compliance| H2[Health IT] 
    
   SaaS --> |Uptime SLA| S1[Cloud Services] 
   SaaS --> |Multi-tenant| S2[Enterprise SaaS] 
    
   Media --> |Content delivery| M1[Streaming] 
   Media --> |Ad tech| M2[Real-time bidding] 
    
   Logistics --> |Supply chain| L1[Inventory] 
   Logistics --> |Delivery| L2[Tracking] 
    
   style ARF fill:#7c3aed 
   style Finance fill:#3b82f6 
   style Healthcare fill:#10b981 
   style SaaS fill:#f59e0b 
   style Media fill:#ef4444 
   style Logistics fill:#8b5cf6

🔒 Security & Compliance

Safety Guardrails Architecture

ARF implements a multi-layered security model with five protective layers:

# Five-Layer Safety System Configuration
safety_system = { 
   "layer_1": "Action Blacklisting", 
   "layer_2": "Blast Radius Limiting",  
   "layer_3": "Human Approval Workflows", 
   "layer_4": "Business Hour Restrictions", 
   "layer_5": "Circuit Breakers & Cooldowns" 
}

# Environment Configuration
export SAFETY_ACTION_BLACKLIST="DATABASE_DROP,FULL_ROLLOUT,SYSTEM_SHUTDOWN"
export SAFETY_MAX_BLAST_RADIUS=3
export MCP_MODE=approval  # advisory, approval, or autonomous

Layer Breakdown:

  • Action Blacklisting – Prevent dangerous operations

  • Blast Radius Limiting – Limit impact scope (max: 3 services)

  • Human Approval Workflows – Manual review for sensitive changes

  • Business Hour Restrictions – Control deployment windows

  • Circuit Breakers & Cooldowns – Automatic rate limiting
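As an illustration of how the first two layers could be driven by the environment variables shown earlier, here is a minimal sketch. The variable names come from the configuration above; `load_guardrails`, `check_action`, and the way violations are reported are assumptions, not ARF's actual API.

```python
import os


def load_guardrails(env=os.environ) -> dict:
    """Parse the safety variables, falling back to conservative defaults."""
    blacklist = env.get("SAFETY_ACTION_BLACKLIST", "")
    return {
        "blacklist": {name.strip() for name in blacklist.split(",") if name.strip()},
        "max_blast_radius": int(env.get("SAFETY_MAX_BLAST_RADIUS", "3")),
        "mode": env.get("MCP_MODE", "advisory"),
    }


def check_action(action: str, services: list, guardrails: dict) -> list:
    """Return a list of violations; an empty list means no guardrail objects."""
    violations = []
    if action in guardrails["blacklist"]:
        violations.append(f"{action} is blacklisted")
    if len(services) > guardrails["max_blast_radius"]:
        violations.append("blast radius exceeded")
    return violations


# A plain dict stands in for os.environ so the example is reproducible.
rails = load_guardrails({
    "SAFETY_ACTION_BLACKLIST": "DATABASE_DROP,FULL_ROLLOUT,SYSTEM_SHUTDOWN",
    "SAFETY_MAX_BLAST_RADIUS": "3",
    "MCP_MODE": "approval",
})
print(check_action("DATABASE_DROP", ["db", "api", "web", "queue"], rails))
print(check_action("RESTART_POD", ["api"], rails))
```

Returning every violation, rather than stopping at the first, gives an operator the full picture when reviewing a blocked action.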

Compliance Features

  • Audit Trail: Every MCP request/response logged with justification

  • Approval Workflows: Human review for sensitive actions

  • Data Retention: Configurable retention policies (default: 30 days)

  • Access Control: Tool-level permission requirements

  • Change Management: Business hour restrictions for production changes

Security Best Practices

  1. Start in Advisory Mode

    • Begin with analysis-only mode to understand potential actions without execution risks.
  2. Gradual Rollout

    • Use rollout_percentage parameter to enable features incrementally across your systems.
  3. Regular Audits

    • Review learned patterns and outcomes monthly

    • Adjust safety parameters based on historical data

    • Validate compliance with organizational policies

  4. Environment Segregation

    • Configure different MCP modes per environment:

      • Development: autonomous or advisory

      • Staging: approval

      • Production: advisory or approval

Quick Configuration Example

# Set up basic security parameters
export SAFETY_ACTION_BLACKLIST="DATABASE_DROP,FULL_ROLLOUT,SYSTEM_SHUTDOWN"
export SAFETY_MAX_BLAST_RADIUS=3
export MCP_MODE=approval
export AUDIT_RETENTION_DAYS=30
export BUSINESS_HOURS_START=09:00
export BUSINESS_HOURS_END=17:00
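A business-hours restriction based on the two variables above could be evaluated as in the sketch below. The helper names are hypothetical, and treating weekends as outside business hours is an assumption the configuration itself does not state.

```python
from datetime import datetime, time


def parse_hhmm(value: str) -> time:
    """Convert an 'HH:MM' string such as BUSINESS_HOURS_START into a time."""
    return datetime.strptime(value, "%H:%M").time()


def within_business_hours(now: datetime, start: str = "09:00",
                          end: str = "17:00") -> bool:
    """True when `now` falls inside [start, end) on a weekday."""
    if now.weekday() >= 5:  # Saturday or Sunday (assumed to be closed)
        return False
    return parse_hhmm(start) <= now.time() < parse_hhmm(end)


print(within_business_hours(datetime(2026, 1, 5, 10, 30)))  # Monday morning
print(within_business_hours(datetime(2026, 1, 5, 18, 0)))   # Monday evening
```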

Recommended Implementation Order

  1. Initial Setup: Configure action blacklists and blast radius limits
  2. Testing Phase: Run in advisory mode to analyze behavior
  3. Gradual Enablement: Move to approval mode with human oversight
  4. Production: Maintain approval workflows for critical systems
  5. Optimization: Adjust parameters based on audit findings

⚡ Enterprise Performance & Scaling Benchmarks

OSS performance is limited to advisory analysis and intent generation. Execution latency and throughput metrics apply to Enterprise MCP execution only.

Benchmarks

| Operation | Latency (p99) | Throughput | Memory Usage |
| --- | --- | --- | --- |
| Event Processing | 1.8 s | 550 req/s | 45 MB |
| RAG Similarity Search | 120 ms | 8300 searches/s | 1.5 MB / 1000 incidents |
| MCP Tool Execution | 50 ms - 2 s | Varies by tool | Minimal |
| Agent Analysis | 450 ms | 2200 analyses/s | 12 MB |

Scaling Guidelines

  • Vertical Scaling: Each engine instance handles ~1000 req/min
  • Horizontal Scaling: Deploy multiple engines behind a load balancer
  • Memory: FAISS index grows ~1.5 MB per 1000 incidents
  • Storage: Incident texts ~50 KB per 1000 incidents
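  • CPU: RAG search cost grows sublinearly with the number of incidents when using FAISS IVF indexes (it depends on the nlist and nprobe settings rather than being strictly O(log n))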
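The scaling guidelines translate into simple capacity arithmetic. The helpers below are an illustrative sketch with hypothetical names; they assume the stated figures (~1.5 MB per 1000 incidents, ~1000 req/min per engine) scale linearly, which may not hold at every size.

```python
import math


def faiss_index_mb(incidents: int, mb_per_1000: float = 1.5) -> float:
    """Linear memory estimate from the ~1.5 MB per 1000 incidents guideline."""
    return incidents / 1000 * mb_per_1000


def engines_needed(requests_per_min: int, per_engine: int = 1000) -> int:
    """Engine instances to place behind the load balancer, rounded up."""
    return max(1, math.ceil(requests_per_min / per_engine))


print(faiss_index_mb(250_000), "MB")
print(engines_needed(4_500), "engines")
```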

🚀 Quick Start

OSS (≈5 minutes)

pip install agentic-reliability-framework==3.3.7

Runs:

  • OSS MCP (advisory only)

  • In-memory RAG graph

  • FAISS similarity index

Run locally or deploy as a service.

License

Apache 2.0 (OSS). A commercial license is required for Enterprise features.

Roadmap (Public)

  • Graph visualization UI
  • Enterprise policy DSL
  • Cross-service causal chains
  • Cost-aware decision optimization

Philosophy

Systems fail. Memory fixes them.

ARF encodes operational experience into software, permanently.


Citing ARF

If you use the Agentic Reliability Framework in production or research, please cite:

BibTeX:

@software{ARF2026,
  title = {Agentic Reliability Framework: Production-Grade Multi-Agent AI for autonomous system reliability intelligence},
  author = {Juan Petter and Contributors},
  year = {2026},
  version = {3.3.7},
  url = {https://github.com/petterjuan/agentic-reliability-framework}
}

Quick Links

📞 Contact & Support

Primary Contact:

Additional Resources:

  • GitHub Issues: For bug reports and technical issues

  • Documentation: Check the docs for common questions

Response Time: Typically within 24-48 hours
