Open-source AI SRE agent - foundation-first incident investigation, root cause analysis, and auto-remediation

AutoSRE

🤖 The AI SRE that investigates incidents like your best on-call engineer, but faster.

Quick Start • Features • How It Works • Integrations • Docs


45-minute investigations → 5 minutes. Autonomous triage. Evidence-based RCA. Human-in-the-loop for safety.


⚡ Quick Start

# Install
pip install autosre-ai

# Configure (interactive setup)
autosre config init

# Investigate your first incident
autosre investigate "checkout service 500 errors" --service checkout-service

Or with Docker:

docker run -it --rm -v ~/.autosre:/root/.autosre ghcr.io/autosre-ai/autosre investigate "high latency on api-gateway"

That's it. No Neo4j. No Postgres. No infrastructure. Just pip install and go.


✨ Features

🔍 Autonomous Investigation

Multi-agent investigation that works like your best SRE: triage → contain → investigate → resolve → learn.

$ autosre investigate "payment failures spiking"

[Triage] Confirmed: payment-service 5xx rate at 12% (normally <0.1%)
[Scope] Affected: checkout-service, order-service (downstream)
[Hypothesis] Testing: Recent deployment of payment-service v2.3.1
[Evidence] Deployment at 14:02, errors started 14:05 ✓
[Root Cause] payment-service v2.3.1 introduced null pointer in retry logic
[Recommendation] Rollback to v2.3.0 (requires approval)
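The evidence step in the transcript above (a deployment at 14:02, errors starting at 14:05) is a temporal-correlation check. Below is a minimal sketch of that idea, not AutoSRE's implementation. The `Deployment` record, the `suspect_deployment` helper, the 10-minute window, and the date are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment record; not an AutoSRE type."""
    service: str
    version: str
    deployed_at: datetime


def suspect_deployment(
    deployments: list[Deployment],
    error_onset: datetime,
    window: timedelta = timedelta(minutes=10),  # assumed lookback window
) -> Optional[Deployment]:
    """Return the latest deployment that landed within `window` before the errors began."""
    candidates = [
        d for d in deployments
        if timedelta(0) <= error_onset - d.deployed_at <= window
    ]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


day = datetime(2024, 1, 1)  # arbitrary date, for illustration only
history = [
    Deployment("payment-service", "v2.3.0", day.replace(hour=9, minute=30)),
    Deployment("payment-service", "v2.3.1", day.replace(hour=14, minute=2)),
]
suspect = suspect_deployment(history, error_onset=day.replace(hour=14, minute=5))
print(suspect.version if suspect else "no recent deployment")
```

A match like this only supports a hypothesis. It still needs confirming evidence such as logs or traces before a rollback is recommended.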

🧠 Episodic Memory

Learns from every investigation. Recalls similar incidents. Gets smarter over time.

$ autosre memory search "database timeout"

Found 3 similar incidents:
├── inv_abc123: PostgreSQL connection pool exhaustion (resolved in 8m)
├── inv_def456: Slow query blocking connections (resolved in 12m)
└── inv_ghi789: Network partition to RDS (resolved in 23m)
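The memory store is described elsewhere in this README as SQLite with FTS5 search. The sketch below shows how a full-text lookup of that kind can work with Python's built-in `sqlite3` module. The table name, columns, and sample rows are invented for illustration and are not AutoSRE's actual schema. It assumes an SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 virtual table; incident_id is stored but not indexed for search.
conn.execute("CREATE VIRTUAL TABLE incidents USING fts5(incident_id UNINDEXED, summary)")
conn.executemany(
    "INSERT INTO incidents (incident_id, summary) VALUES (?, ?)",
    [
        # Hypothetical summaries, for illustration only.
        ("inv_abc123", "PostgreSQL connection pool exhaustion caused database timeout"),
        ("inv_def456", "Slow query blocking connections, database timeout on checkout"),
        ("inv_xyz000", "TLS certificate expired on api-gateway"),
    ],
)


def search_incidents(query: str, limit: int = 5) -> list[str]:
    """Return incident ids matching every term in `query`, best BM25 match first."""
    rows = conn.execute(
        "SELECT incident_id FROM incidents WHERE incidents MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]


print(search_incidents("database timeout"))
```

FTS5 treats space-separated words as an implicit AND, so only the rows containing both terms match. Free-form user input containing punctuation would need to be quoted before it is passed to `MATCH`.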

📊 SLO-Driven Operations

Error budgets, multi-window burn rates, deployment gating: all built-in.

$ autosre slo status --service checkout-service

checkout-service SLO Status
├── Availability: 99.92% (target: 99.9%) ✓
├── Latency p99: 245ms (target: 300ms) ✓
├── Error Budget: 72% remaining
│   ├── 1h burn rate: 0.8x
│   ├── 6h burn rate: 1.2x
│   └── 24h burn rate: 0.9x
└── Deploys: ALLOWED
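For readers new to burn rates: a burn rate is the observed error rate divided by the error rate the SLO allows, so 1.0x means the budget is being spent exactly on schedule. The sketch below shows that arithmetic and a simple multi-window deploy gate. The function names and threshold values are assumptions for illustration, not AutoSRE's defaults.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the error budget; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def deploys_allowed(rates_by_window: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Block deploys when any window burns faster than its threshold."""
    return all(rates_by_window[window] <= limit for window, limit in thresholds.items())


# Illustrative thresholds only (assumed, not taken from AutoSRE).
THRESHOLDS = {"1h": 14.4, "6h": 6.0, "24h": 3.0}
observed = {"1h": 0.8, "6h": 1.2, "24h": 0.9}  # values from the status output above

print(f"{burn_rate(0.0008, 0.999):.1f}x")  # 0.08% errors against a 99.9% target
print("ALLOWED" if deploys_allowed(observed, THRESHOLDS) else "BLOCKED")
```

Checking several windows together catches both fast, severe burns (short window) and slow, sustained ones (long window).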

🛡️ AI Safety Built-In

Every decision has confidence scores. Critical actions require human approval. Full audit trails.

  • Hypothesis-driven reasoning with falsifiable criteria
  • Confidence scoring (0.0-1.0) on every decision
  • Human-in-the-loop for remediation actions
  • AI error budgets tracking accuracy over time
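One way to combine confidence scoring with human-in-the-loop approval is a small gating function, sketched below. This is an assumption about how such a policy could look, not AutoSRE's code. `ProposedAction`, `Decision`, `gate`, and the 0.5 cutoff are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"
    NEEDS_APPROVAL = "needs_approval"
    NEEDS_MORE_EVIDENCE = "needs_more_evidence"


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical action record; field names are illustrative."""
    name: str
    confidence: float          # 0.0-1.0
    mutates_production: bool   # True for remediation, False for read-only checks


def gate(action: ProposedAction, min_confidence: float = 0.5) -> Decision:
    """Decide how an action may proceed; the 0.5 cutoff is an assumed default."""
    if not 0.0 <= action.confidence <= 1.0:
        raise ValueError("confidence must be within [0.0, 1.0]")
    if action.confidence < min_confidence:
        return Decision.NEEDS_MORE_EVIDENCE
    if action.mutates_production:
        # Remediation always goes to a human, however confident the agent is.
        return Decision.NEEDS_APPROVAL
    return Decision.AUTO_EXECUTE


rollback = ProposedAction("rollback payment-service to v2.3.0", 0.92, True)
print(gate(rollback).value)
```

Keeping the approval rule independent of the confidence score means a high score can never bypass human review for a production change.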

🔧 Extensible Skills System

Modular investigation skills: Kubernetes, metrics, logs, traces, infrastructure.

skills/
├── kubernetes/         # Pod states, deployments, events
├── metrics-analysis/   # Prometheus, Datadog, Grafana
├── log-analysis/       # Pattern matching, anomaly detection
├── traces/             # Distributed tracing analysis
├── infrastructure/     # AWS, GCP resource checks
└── investigation/      # Methodology and hypothesis testing

๐Ÿ“ Automated Postmortems

Blameless postmortems with auto-generated timelines, metrics snapshots, and action items.


🎯 How It Works

┌─────────────────────────────────────────────────────────────────┐
│                     autosre investigate                         │
└────────────────────────────┬────────────────────────────────────┘
                             │
                    ┌────────▼─────────┐
                    │   Orchestrator   │
                    │   (LangGraph)    │
                    └────────┬─────────┘
                             │
         ┌───────────┬───────┴───────┬───────────┐
         │           │               │           │
    ┌────▼────┐ ┌────▼────┐    ┌─────▼────┐ ┌────▼────┐
    │ Memory  │ │Topology │    │ Planner  │ │  LLM    │
    │(SQLite) │ │ (YAML)  │    │  Agent   │ │ Router  │
    └─────────┘ └─────────┘    └───┬──────┘ └─────────┘
                                   │
               ┌─────────┬─────────┼─────────┬─────────┐
               │         │         │         │         │
          ┌────▼───┐┌────▼───┐┌────▼───┐┌────▼───┐┌────▼───┐
          │  K8s   ││Metrics ││  Logs  ││ Traces ││ Infra  │
          │Subagent││Subagent││Subagent││Subagent││Subagent│
          └────┬───┘└────┬───┘└────┬───┘└────┬───┘└────┬───┘
               └─────────┴─────────┼─────────┴─────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    │         Synthesizer         │
                    │   (Evidence → Root Cause)   │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │     Writeup & Actions       │
                    │  (Postmortem, Remediation)  │
                    └─────────────────────────────┘
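The fan-out and fan-in in the middle of the diagram (planner to parallel subagents to synthesizer) can be illustrated with plain `asyncio`. The diagram names LangGraph as the orchestrator, so this is a simplified stand-in, not the project's code. `run_subagent` and `investigate` are hypothetical names, and each subagent here only returns a placeholder finding.

```python
import asyncio


async def run_subagent(name: str, question: str) -> dict[str, str]:
    """Stand-in for a specialist; a real one would query Kubernetes, metrics, logs, etc."""
    await asyncio.sleep(0)  # placeholder for I/O-bound work
    return {"source": name, "finding": f"no anomaly found for '{question}'"}


async def investigate(question: str) -> list[dict[str, str]]:
    """Run all subagents concurrently and collect the findings that succeeded."""
    names = ["k8s", "metrics", "logs", "traces", "infra"]
    results = await asyncio.gather(
        *(run_subagent(name, question) for name in names),
        return_exceptions=True,  # one failing subagent must not sink the others
    )
    return [r for r in results if not isinstance(r, BaseException)]


evidence = asyncio.run(investigate("high latency on api-gateway"))
print([item["source"] for item in evidence])
```

`asyncio.gather` returns results in the order the coroutines were passed, so the evidence list stays deterministic for the synthesizer step.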

Key Concepts:

| Component | What It Does |
| --- | --- |
| Orchestrator | Coordinates investigation phases (Triage → Mitigate → Diagnose → Resolve) |
| Episodic Memory | SQLite-based learning from past investigations with FTS5 search |
| Service Topology | YAML-defined service dependencies for blast radius analysis |
| Subagents | Parallel specialists (Kubernetes, metrics, logs, traces) |
| Synthesizer | Merges evidence, tests hypotheses, identifies root cause |
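Blast radius analysis over a service topology amounts to a graph traversal: start at the failing service and walk to everything that depends on it, directly or transitively. Below is a minimal sketch. The topology is hard-coded as a dict to keep it dependency-free (a real file would be YAML), and the service relationships and the `blast_radius` helper are assumptions for illustration.

```python
from collections import deque

# Hypothetical topology: service -> services it depends on.
DEPENDS_ON = {
    "api-gateway": ["checkout-service"],
    "checkout-service": ["payment-service"],
    "order-service": ["payment-service"],
    "payment-service": [],
}


def blast_radius(failing: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Services that directly or transitively depend on `failing`, nearest first."""
    # Invert the edges: dependency -> services that depend on it.
    dependents: dict[str, list[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = {failing}
    affected: list[str] = []
    queue = deque([failing])
    while queue:  # breadth-first, so closer services are listed first
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                affected.append(service)
                queue.append(service)
    return affected


print(blast_radius("payment-service", DEPENDS_ON))
```

The `seen` set also guards against dependency cycles, which do occur in real service graphs.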

🔌 Integrations

| Category | Supported |
| --- | --- |
| Observability | Prometheus, Grafana, Datadog |
| Incident Management | PagerDuty, Slack, OpsGenie |
| Infrastructure | Kubernetes, AWS, GCP |
| Source Control | GitHub, GitLab |
| Issue Tracking | Jira, Linear |

📈 Why AutoSRE?

| Before AutoSRE | After AutoSRE |
| --- | --- |
| 45+ min incident investigations | 5 min AI-assisted triage |
| Lost context between incidents | Episodic memory recalls similar issues |
| Tribal knowledge in runbooks | AI executes and learns from runbooks |
| Manual toil tracking | Auto-classified, automation suggested |
| Blame-filled postmortems | Auto-generated blameless documentation |

Test Results: 1,053 tests passing | 25+ investigation scenarios validated


๐Ÿ—๏ธ Production Deployment

For production deployments with persistent storage and multiple services, see the Docker Deployment Guide.

Quick Docker Compose Setup

# Clone and setup
git clone https://github.com/autosre-ai/autosre.git
cd autosre
make setup

# Configure secrets
cp .env.example .env
vim .env  # Add your API keys

# Start all services
make dev

# Verify health
make health

Services:

| Service | Port | Description |
| --- | --- | --- |
| web-ui | 3000 | Next.js web interface |
| api-gateway | 8000 | FastAPI REST API |
| sre-agent | 8080 | AI agent service |
| postgres | 5432 | PostgreSQL database |
| neo4j | 7474 | Graph database (optional) |
| redis | 6379 | Cache & pub/sub |

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Development setup
git clone https://github.com/autosre-ai/autosre.git
cd autosre
pip install -e ".[dev]"
pytest  # Run the test suite

Areas we need help:

  • 🔌 New integrations (Elastic, Splunk, New Relic)
  • 📊 Investigation scenarios for evaluation
  • 📚 Documentation and examples
  • 🐛 Bug reports and fixes

📄 License

Apache 2.0. See LICENSE for details.


Built by SREs, for SREs.
Tired of 3am pages? Let AutoSRE handle the first 5 minutes.

โญ Star us on GitHub โ€ข ๐Ÿ’ฌ Join Discord โ€ข ๐Ÿฆ Follow on Twitter
