Open-source AI SRE agent - foundation-first incident investigation, root cause analysis, and auto-remediation

AutoSRE

🤖 The AI SRE that investigates incidents like your best on-call engineer, but faster.

Quick Start • Features • How It Works • Integrations • Docs


45-minute investigations → 5 minutes. Autonomous triage. Evidence-based RCA. Human-in-the-loop for safety.


⚡ Quick Start

# Install
pip install autosre-ai

# Configure (interactive setup)
autosre config init

# Investigate your first incident
autosre investigate "checkout service 500 errors" --service checkout-service

Or with Docker:

docker run -it --rm -v ~/.autosre:/root/.autosre ghcr.io/autosre-ai/autosre investigate "high latency on api-gateway"

That's it. No Neo4j. No Postgres. No infrastructure. Just pip install and go.


✨ Features

🔍 Autonomous Investigation

Multi-agent investigation that works like your best SRE: triage → contain → investigate → resolve → learn.

$ autosre investigate "payment failures spiking"

[Triage] Confirmed: payment-service 5xx rate at 12% (normally <0.1%)
[Scope] Affected: checkout-service, order-service (downstream)
[Hypothesis] Testing: Recent deployment of payment-service v2.3.1
[Evidence] Deployment at 14:02, errors started 14:05 ✓
[Root Cause] payment-service v2.3.1 introduced null pointer in retry logic
[Recommendation] Rollback to v2.3.0 (requires approval)
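
The `[Evidence]` line in the sample output ties a deployment at 14:02 to errors starting at 14:05. A minimal sketch of that kind of temporal check follows. The function name, the 10-minute window, and the calendar date are assumptions made for illustration. None of them comes from AutoSRE.

```python
from datetime import datetime, timedelta


def deploy_precedes_errors(deployed_at, errors_started_at, window=timedelta(minutes=10)):
    """Return True if errors began after the deploy and within `window` of it.

    Hypothetical helper: the 10-minute default is an illustrative assumption,
    not a value taken from AutoSRE.
    """
    lag = errors_started_at - deployed_at
    return timedelta(0) <= lag <= window


# Times from the sample output; the date itself is arbitrary.
deploy = datetime(2024, 1, 1, 14, 2)
onset = datetime(2024, 1, 1, 14, 5)
print(deploy_precedes_errors(deploy, onset))  # True
```

A match like this supports the deployment hypothesis but does not prove it, which is why the sample output still asks for approval before the rollback.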

🧠 Episodic Memory

Learns from every investigation. Recalls similar incidents. Gets smarter over time.

$ autosre memory search "database timeout"

Found 3 similar incidents:
├── inv_abc123: PostgreSQL connection pool exhaustion (resolved in 8m)
├── inv_def456: Slow query blocking connections (resolved in 12m)
└── inv_ghi789: Network partition to RDS (resolved in 23m)
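
The Key Concepts table later in this README describes the memory as SQLite-based with FTS5 search. The sketch below shows what a full-text lookup of that kind could look like with Python's standard `sqlite3` module. The table name, columns, and sample rows are hypothetical, and it assumes the SQLite build bundled with your Python includes the FTS5 extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one FTS5 row per past investigation.
conn.execute("CREATE VIRTUAL TABLE incidents USING fts5(incident_id, summary)")
conn.executemany(
    "INSERT INTO incidents (incident_id, summary) VALUES (?, ?)",
    [
        # Illustrative rows, not real AutoSRE data.
        ("inv_abc123", "PostgreSQL connection pool exhaustion caused database timeout"),
        ("inv_def456", "Slow query blocking connections, database timeout on checkout"),
        ("inv_xyz000", "Certificate expiry on api-gateway"),
    ],
)


def search_incidents(db, query):
    """Return incident IDs whose text matches an FTS5 query, best match first."""
    rows = db.execute(
        "SELECT incident_id FROM incidents WHERE incidents MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


print(search_incidents(conn, "database timeout"))
```

An FTS5 query with two bare words matches rows containing both, so the certificate incident is excluded here.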

📊 SLO-Driven Operations

Error budgets, multi-window burn rates, deployment gating: all built in.

$ autosre slo status --service checkout-service

checkout-service SLO Status
├── Availability: 99.92% (target: 99.9%) ✓
├── Latency p99: 245ms (target: 300ms) ✓
├── Error Budget: 72% remaining
│   ├── 1h burn rate: 0.8x
│   ├── 6h burn rate: 1.2x
│   └── 24h burn rate: 0.9x
└── Deploys: ALLOWED
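
For readers new to burn rates: a burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so 1.0x means the budget would be used up exactly at the end of the SLO window. The sketch below is a generic illustration of that arithmetic and of a simple deploy gate. The function names, the 2.0x threshold, and the per-window error ratios are assumptions, not AutoSRE's implementation.

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def deploys_allowed(burn_rates, threshold=2.0):
    """Hypothetical gate: allow deploys only if no window exceeds `threshold`."""
    return all(rate <= threshold for rate in burn_rates.values())


# Illustrative error ratios per window for a 99.9% availability target.
windows = {"1h": 0.0008, "6h": 0.0012, "24h": 0.0009}
rates = {name: burn_rate(ratio, 0.999) for name, ratio in windows.items()}

for name, rate in rates.items():
    print(f"{name} burn rate: {rate:.1f}x")
print("Deploys:", "ALLOWED" if deploys_allowed(rates) else "BLOCKED")
```

The error ratios were picked so the computed rates line up with the sample output above. Real gating policies often combine a short and a long window before blocking.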

🛡️ AI Safety Built-In

Every decision has confidence scores. Critical actions require human approval. Full audit trails.

  • Hypothesis-driven reasoning with falsifiable criteria
  • Confidence scoring (0.0-1.0) on every decision
  • Human-in-the-loop for remediation actions
  • AI error budgets tracking accuracy over time
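
The first three points can be pictured as a small decision gate. This is a sketch under stated assumptions: `ProposedAction`, `decide`, the 0.7 confidence floor, and the three outcome labels are hypothetical names chosen for illustration and are not part of AutoSRE's API.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    description: str
    confidence: float     # 0.0-1.0, matching the scoring range above
    is_remediation: bool  # True if the action would change production state


def decide(action, min_confidence=0.7):
    """Return 'reject', 'needs_approval', or 'auto' for a proposed action.

    Hypothetical policy: low-confidence proposals are rejected, and any
    remediation waits for a human regardless of confidence.
    """
    if not 0.0 <= action.confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if action.confidence < min_confidence:
        return "reject"
    if action.is_remediation:
        return "needs_approval"
    return "auto"


print(decide(ProposedAction("Rollback to v2.3.0", 0.9, True)))  # needs_approval
```

Keeping the approval check independent of the confidence score means a very confident model still cannot remediate on its own.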

🔧 Extensible Skills System

Modular investigation skills: Kubernetes, metrics, logs, traces, infrastructure.

skills/
├── kubernetes/         # Pod states, deployments, events
├── metrics-analysis/   # Prometheus, Datadog, Grafana
├── log-analysis/       # Pattern matching, anomaly detection
├── traces/             # Distributed tracing analysis
├── infrastructure/     # AWS, GCP resource checks
└── investigation/      # Methodology and hypothesis testing

๐Ÿ“ Automated Postmortems

Blameless postmortems with auto-generated timelines, metrics snapshots, and action items.


🎯 How It Works

┌─────────────────────────────────────────────────────────────────┐
│                     autosre investigate                         │
└────────────────────────────┬────────────────────────────────────┘
                             │
                    ┌────────▼─────────┐
                    │   Orchestrator   │
                    │   (LangGraph)    │
                    └────────┬─────────┘
                             │
         ┌───────────┬───────┴───────┬───────────┐
         │           │               │           │
    ┌────▼────┐ ┌────▼────┐    ┌─────▼────┐ ┌────▼────┐
    │ Memory  │ │Topology │    │ Planner  │ │  LLM    │
    │(SQLite) │ │ (YAML)  │    │  Agent   │ │ Router  │
    └─────────┘ └─────────┘    └─────┬────┘ └─────────┘
                                     │
                 ┌─────────┬─────────┼─────────┬─────────┐
                 │         │         │         │         │
            ┌────▼───┐┌────▼───┐┌────▼───┐┌────▼───┐┌────▼───┐
            │  K8s   ││Metrics ││  Logs  ││ Traces ││ Infra  │
            │Subagent││Subagent││Subagent││Subagent││Subagent│
            └────┬───┘└────┬───┘└────┬───┘└────┬───┘└────┬───┘
                 └─────────┴─────────┼─────────┴─────────┘
                                     │
                      ┌──────────────┴──────────────┐
                      │         Synthesizer         │
                      │   (Evidence → Root Cause)   │
                      └──────────────┬──────────────┘
                                     │
                      ┌──────────────▼──────────────┐
                      │     Writeup & Actions       │
                      │  (Postmortem, Remediation)  │
                      └─────────────────────────────┘
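
The fan-out from the Planner to the five subagents, followed by the fan-in to the Synthesizer, can be sketched with plain `asyncio`. This illustrates the pattern only: the README names LangGraph as the orchestrator, whereas this sketch swaps in standard-library concurrency, and `run_subagent` and `investigate` are hypothetical stubs.

```python
import asyncio


async def run_subagent(name, incident):
    """Stand-in for a specialist subagent; a real one would query its data source."""
    await asyncio.sleep(0)  # placeholder for I/O such as an API call
    return {"source": name, "finding": f"stub result for {incident!r}"}


async def investigate(incident):
    names = ["k8s", "metrics", "logs", "traces", "infra"]
    # Fan out: run every subagent concurrently. Fan in: collect all findings.
    findings = await asyncio.gather(*(run_subagent(n, incident) for n in names))
    # A synthesizer step would weigh this evidence; here we only index it by source.
    return {f["source"]: f["finding"] for f in findings}


evidence = asyncio.run(investigate("payment failures spiking"))
print(sorted(evidence))  # ['infra', 'k8s', 'logs', 'metrics', 'traces']
```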

Key Concepts:

| Component | What It Does |
| --- | --- |
| Orchestrator | Coordinates investigation phases (Triage → Mitigate → Diagnose → Resolve) |
| Episodic Memory | SQLite-based learning from past investigations with FTS5 search |
| Service Topology | YAML-defined service dependencies for blast radius analysis |
| Subagents | Parallel specialists (Kubernetes, metrics, logs, traces, infrastructure) |
| Synthesizer | Merges evidence, tests hypotheses, identifies root cause |
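
To make the blast radius idea concrete: given a map of which services depend on which, the affected set for a failing service is everything that depends on it directly or transitively. The sketch below uses a plain dict in place of a YAML file, and the service edges are hypothetical examples, not AutoSRE's topology format.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "api-gateway": ["checkout-service"],
    "checkout-service": ["payment-service"],
    "order-service": ["payment-service"],
    "payment-service": [],
}


def blast_radius(failing, depends_on):
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected, queue = set(), deque([failing])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(blast_radius("payment-service", DEPENDS_ON)))
# ['api-gateway', 'checkout-service', 'order-service']
```

A breadth-first walk visits each service once, so cycles in the dependency map do not cause an infinite loop.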

🔌 Integrations

| Category | Supported |
| --- | --- |
| Observability | Prometheus, Grafana, Datadog |
| Incident Management | PagerDuty, Slack, OpsGenie |
| Infrastructure | Kubernetes, AWS, GCP |
| Source Control | GitHub, GitLab |
| Issue Tracking | Jira, Linear |

📈 Why AutoSRE?

| Before AutoSRE | After AutoSRE |
| --- | --- |
| 45+ min incident investigations | 5 min AI-assisted triage |
| Lost context between incidents | Episodic memory recalls similar issues |
| Tribal knowledge in runbooks | AI executes and learns from runbooks |
| Manual toil tracking | Auto-classified, automation suggested |
| Blame-filled postmortems | Auto-generated blameless documentation |

Test Results: 1,053 tests passing | 25+ investigation scenarios validated


๐Ÿ—๏ธ Production Deployment

For production deployments with persistent storage and multiple services, see the Docker Deployment Guide.

Quick Docker Compose Setup

# Clone and setup
git clone https://github.com/autosre-ai/autosre.git
cd autosre
make setup

# Configure secrets
cp .env.example .env
vim .env  # Add your API keys

# Start all services
make dev

# Verify health
make health

Services:

| Service | Port | Description |
| --- | --- | --- |
| web-ui | 3000 | Next.js web interface |
| api-gateway | 8000 | FastAPI REST API |
| sre-agent | 8080 | AI agent service |
| postgres | 5432 | PostgreSQL database |
| neo4j | 7474 | Graph database (optional) |
| redis | 6379 | Cache & pub/sub |

๐Ÿค Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Development setup
git clone https://github.com/autosre-ai/autosre.git
cd autosre
pip install -e ".[dev]"
pytest  # Run the test suite

Areas we need help:

  • 🔌 New integrations (Elastic, Splunk, New Relic)
  • 📊 Investigation scenarios for evaluation
  • 📚 Documentation and examples
  • 🐛 Bug reports and fixes

📄 License

Apache 2.0. See LICENSE for details.


Built by SREs, for SREs.
Tired of 3am pages? Let AutoSRE handle the first 5 minutes.

โญ Star us on GitHub โ€ข ๐Ÿ’ฌ Join Discord โ€ข ๐Ÿฆ Follow on Twitter
