Open-source AI SRE agent - foundation-first incident investigation, root cause analysis, and auto-remediation

AutoSRE

🤖 The AI SRE that investigates incidents like your best on-call engineer — but faster.

45-minute investigations → 5 minutes. Autonomous triage. Evidence-based RCA. Human-in-the-loop for safety.

⚡ Quick Start

# Install
pip install autosre-ai

# Configure (interactive setup)
autosre config init

# Investigate your first incident
autosre investigate "checkout service 500 errors" --service checkout-service

Or with Docker:

docker run -it --rm -v ~/.autosre:/root/.autosre ghcr.io/autosre-ai/autosre investigate "high latency on api-gateway"

That's it. No Neo4j. No Postgres. No infrastructure. Just pip install and go.
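To trigger the same investigation from a script (for example, an alert handler), the documented command line can be assembled in Python. This is a minimal sketch: it only uses the `autosre investigate` command and `--service` flag shown above, and the helper names `build_investigate_command` and `run_investigation` are hypothetical, not part of the package.

```python
import shlex
import subprocess


def build_investigate_command(description, service=None):
    """Build the argument list for the documented `autosre investigate` command."""
    cmd = ["autosre", "investigate", description]
    if service:
        cmd += ["--service", service]
    return cmd


def run_investigation(description, service=None):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_investigate_command(description, service), check=True)


# Print the command as a shell-safe string without executing it.
print(shlex.join(build_investigate_command("checkout service 500 errors", "checkout-service")))
```

Passing the arguments as a list avoids shell quoting problems when the incident description contains spaces or quotes.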

✨ Features

🔍 Autonomous Investigation

Multi-agent investigation that works like your best SRE: triage → contain → investigate → resolve → learn.

$ autosre investigate "payment failures spiking"

[Triage] Confirmed: payment-service 5xx rate at 12% (normally <0.1%)
[Scope] Affected: checkout-service, order-service (downstream)
[Hypothesis] Testing: Recent deployment of payment-service v2.3.1
[Evidence] Deployment at 14:02, errors started 14:05 ✓
[Root Cause] payment-service v2.3.1 introduced null pointer in retry logic
[Recommendation] Rollback to v2.3.0 (requires approval)
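The [Evidence] step above rests on a falsifiable timing check: the deployment hypothesis only survives if errors began shortly after the deploy. A minimal sketch of that kind of check follows. The function name, the 15-minute window, and the calendar date are assumptions for illustration; only the 14:02 and 14:05 times come from the sample output.

```python
from datetime import datetime, timedelta


def deployment_precedes_errors(deployed_at, errors_started_at,
                               window=timedelta(minutes=15)):
    """Return True if errors began after the deploy and within `window` of it.

    The hypothesis is falsified if errors started before the deploy,
    or so long afterwards that the deploy is an unlikely trigger.
    """
    lag = errors_started_at - deployed_at
    return timedelta(0) <= lag <= window


# Times from the sample output; the date itself is a placeholder.
deploy = datetime(2026, 1, 1, 14, 2)
errors = datetime(2026, 1, 1, 14, 5)
print(deployment_precedes_errors(deploy, errors))
```

A timing match like this supports the hypothesis but does not prove causation, which is why the sample output still routes the rollback through approval.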

🧠 Episodic Memory

Records every investigation in a local SQLite store and recalls similar past incidents with full-text (FTS5) search, so earlier resolutions inform new investigations.

$ autosre memory search "database timeout"

Found 3 similar incidents:
├── inv_abc123: PostgreSQL connection pool exhaustion (resolved in 8m)
├── inv_def456: Slow query blocking connections (resolved in 12m)
└── inv_ghi789: Network partition to RDS (resolved in 23m)
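The Key Concepts table describes this memory as SQLite-based with FTS5 search. A rough sketch of how such a lookup can work with Python's built-in `sqlite3` module is shown here. The table name, column names, and sample rows are invented for illustration and do not reflect AutoSRE's actual schema; the sketch also assumes your SQLite build includes the FTS5 extension.

```python
import sqlite3


def build_memory(rows):
    """Create an in-memory FTS5 index of (incident_id, summary) rows."""
    conn = sqlite3.connect(":memory:")
    # incident_id is stored but not tokenized; summary is full-text indexed.
    conn.execute(
        "CREATE VIRTUAL TABLE incidents USING fts5(incident_id UNINDEXED, summary)"
    )
    conn.executemany("INSERT INTO incidents VALUES (?, ?)", rows)
    return conn


def search(conn, query, limit=3):
    """Return the best-matching incidents, ordered by FTS5 relevance rank."""
    cur = conn.execute(
        "SELECT incident_id, summary FROM incidents "
        "WHERE incidents MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cur.fetchall()


# Hypothetical past incidents.
memory = build_memory([
    ("inv_001", "database timeout caused by connection pool exhaustion"),
    ("inv_002", "slow query led to database timeout on checkout"),
    ("inv_003", "certificate expiry on api gateway"),
])

for incident_id, summary in search(memory, "database timeout"):
    print(incident_id, summary)
```

In FTS5, a query of several bare words matches rows containing all of them, so only the two timeout incidents are returned here.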

📊 SLO-Driven Operations

Error budgets, multi-window burn rates, deployment gating — all built-in.

$ autosre slo status --service checkout-service

checkout-service SLO Status
├── Availability: 99.92% (target: 99.9%) ✓
├── Latency p99: 245ms (target: 300ms) ✓
├── Error Budget: 72% remaining
│   ├── 1h burn rate: 0.8x
│   ├── 6h burn rate: 1.2x
│   └── 24h burn rate: 0.9x
└── Deploys: ALLOWED
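The burn-rate and error-budget figures above follow standard SLO arithmetic: the burn rate is the observed error ratio divided by the ratio the SLO allows, and the remaining budget is the share of allowed failures not yet used. The sketch below illustrates that arithmetic only. The function names and the deploy-gating rule (budget left and every window under a threshold of 2.0x) are assumptions, not AutoSRE's documented policy.

```python
def burn_rate(error_ratio, slo_target):
    """Observed error ratio relative to the ratio the SLO allows (1.0x = on budget)."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def budget_remaining(bad_events, total_events, slo_target):
    """Fraction of the error budget still unspent over the SLO period."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("no events or no error budget to measure against")
    return 1.0 - bad_events / allowed_bad


def deploys_allowed(burn_rates, remaining, max_burn=2.0):
    """Hypothetical gate: budget left and no window burning faster than max_burn."""
    return remaining > 0 and all(rate < max_burn for rate in burn_rates.values())


# A 99.9% target leaves a 0.1% budget; a 0.08% error ratio burns it at about 0.8x.
print(round(burn_rate(0.0008, 0.999), 2))
# 280 failures out of 1,000,000 requests leaves about 72% of the budget.
print(round(budget_remaining(280, 1_000_000, 0.999), 2))
```

Checking several windows together is what lets a gate ignore a brief spike in a short window while still reacting to a sustained burn.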

🛡️ AI Safety Built-In

Every decision carries a confidence score. Critical actions require human approval. Every step is recorded in a full audit trail.

Hypothesis-driven reasoning with falsifiable criteria
Confidence scoring (0.0-1.0) on every decision
Human-in-the-loop for remediation actions
AI error budgets tracking accuracy over time
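One way to picture how confidence scoring and human-in-the-loop approval combine is a small gating function. This is an illustrative sketch only: the `ProposedAction` type, the `decide` function, the outcome labels, and the 0.7 threshold are all hypothetical and not taken from AutoSRE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    description: str
    confidence: float      # 0.0-1.0, as described above
    is_remediation: bool   # True if the action changes a running system


def decide(action, min_confidence=0.7):
    """Classify an action as 'needs_approval', 'needs_review', or 'auto'."""
    if not 0.0 <= action.confidence <= 1.0:
        raise ValueError("confidence must be within 0.0-1.0")
    if action.is_remediation:
        # Remediation always goes to a human, however confident the agent is.
        return "needs_approval"
    if action.confidence < min_confidence:
        # Low-confidence findings are flagged for review rather than trusted.
        return "needs_review"
    return "auto"


rollback = ProposedAction("Rollback to v2.3.0", confidence=0.92, is_remediation=True)
print(decide(rollback))
```

Checking `is_remediation` before the confidence threshold means a high score can never bypass approval.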

🔧 Extensible Skills System

Modular investigation skills: Kubernetes, metrics, logs, traces, infrastructure.

skills/
├── kubernetes/         # Pod states, deployments, events
├── metrics-analysis/   # Prometheus, Datadog, Grafana
├── log-analysis/       # Pattern matching, anomaly detection
├── traces/             # Distributed tracing analysis
├── infrastructure/     # AWS, GCP resource checks
└── investigation/      # Methodology and hypothesis testing

📝 Automated Postmortems

Blameless postmortems with auto-generated timelines, metrics snapshots, and action items.

🎯 How It Works

┌────────────────────────────────────────────────────────────────┐
│                     autosre investigate                         │
└────────────────────────────┬───────────────────────────────────┘
                             │
                    ┌────────▼────────┐
                    │   Orchestrator   │
                    │   (LangGraph)    │
                    └────────┬────────┘
                             │
         ┌───────────┬───────┴───────┬───────────┐
         │           │               │           │
    ┌────▼────┐ ┌────▼────┐   ┌─────▼────┐ ┌────▼────┐
    │ Memory  │ │Topology │   │ Planner  │ │  LLM    │
    │(SQLite) │ │ (YAML)  │   │  Agent   │ │ Router  │
    └─────────┘ └─────────┘   └────┬─────┘ └─────────┘
                                   │
               ┌─────────┬─────────┼─────────┬─────────┐
               │         │         │         │         │
          ┌────▼───┐┌────▼───┐┌────▼───┐┌────▼───┐┌────▼───┐
          │  K8s   ││Metrics ││  Logs  ││ Traces ││ Infra  │
          │Subagent││Subagent││Subagent││Subagent││Subagent│
          └────┬───┘└────┬───┘└────┬───┘└────┬───┘└────┬───┘
               └─────────┴─────────┴─────────┴─────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    │         Synthesizer         │
                    │   (Evidence → Root Cause)   │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │     Writeup & Actions       │
                    │  (Postmortem, Remediation)  │
                    └─────────────────────────────┘
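The fan-out and merge in the diagram can be approximated with plain `asyncio`. This is not AutoSRE's orchestrator, which the diagram attributes to LangGraph; the function names, the sleep-based stand-ins for real backend queries, and the result shape are all assumptions made for illustration.

```python
import asyncio


async def run_subagent(name, delay):
    """Stand-in for a specialist subagent; a real one would query its backend."""
    await asyncio.sleep(delay)
    return {"source": name, "finding": f"{name} check complete"}


def synthesize(evidence):
    """Merge per-subagent evidence into a single summary."""
    return {
        "sources": sorted(item["source"] for item in evidence),
        "evidence_count": len(evidence),
    }


async def investigate(subagents):
    """Run all subagents concurrently, then hand their results to the synthesizer."""
    results = await asyncio.gather(
        *(run_subagent(name, delay) for name, delay in subagents.items())
    )
    return synthesize(results)


report = asyncio.run(investigate({
    "k8s": 0.01, "metrics": 0.02, "logs": 0.01, "traces": 0.02, "infra": 0.01,
}))
print(report)
```

Because the subagents run concurrently, total wall-clock time is close to the slowest subagent rather than the sum of all five.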

Key Concepts:

Component	What It Does
Orchestrator	Coordinates investigation phases (Triage → Mitigate → Diagnose → Resolve)
Episodic Memory	SQLite-based learning from past investigations with FTS5 search
Service Topology	YAML-defined service dependencies for blast radius analysis
Subagents	Parallel specialists (Kubernetes, metrics, logs, traces, infrastructure)
Synthesizer	Merges evidence, tests hypotheses, identifies root cause
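Blast radius analysis over a service topology amounts to a graph traversal: starting from the failing service, walk to everything that depends on it, directly or transitively. The sketch below uses a plain dictionary in place of a parsed YAML file. The dependency data, its direction (caller to callee), and the `blast_radius` function are assumptions for illustration; AutoSRE's topology schema is not shown in this README.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it calls.
DEPENDS_ON = {
    "checkout-service": ["payment-service"],
    "order-service": ["payment-service"],
    "payment-service": [],
}


def blast_radius(depends_on, failing):
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges so we can walk from a service to its callers.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected, queue = set(), deque([failing])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, []):
            if caller != failing and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return sorted(affected)


print(blast_radius(DEPENDS_ON, "payment-service"))
```

With this sample data the traversal reports checkout-service and order-service, matching the [Scope] line in the investigation example.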

🔌 Integrations

Category	Supported
Observability	Prometheus, Grafana, Datadog
Incident Management	PagerDuty, Slack, OpsGenie
Infrastructure	Kubernetes, AWS, GCP
Source Control	GitHub, GitLab
Issue Tracking	Jira, Linear

📈 Why AutoSRE?

Before AutoSRE	After AutoSRE
45+ min incident investigations	5 min AI-assisted triage
Lost context between incidents	Episodic memory recalls similar issues
Tribal knowledge in runbooks	AI executes and learns from runbooks
Manual toil tracking	Auto-classified, automation suggested
Blame-filled postmortems	Auto-generated blameless documentation

Test Results: 1,053 tests passing | 25+ investigation scenarios validated

🏗️ Production Deployment

For production deployments with persistent storage and multiple services, see the Docker Deployment Guide.

Quick Docker Compose Setup

# Clone and setup
git clone https://github.com/autosre-ai/autosre.git
cd autosre
make setup

# Configure secrets
cp .env.example .env
vim .env  # Add your API keys

# Start all services
make dev

# Verify health
make health

Services:

Service	Port	Description
web-ui	3000	Next.js web interface
api-gateway	8000	FastAPI REST API
sre-agent	8080	AI agent service
postgres	5432	PostgreSQL database
neo4j	7474	Graph database (optional)
redis	6379	Cache & pub/sub
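If a service does not come up, a quick TCP probe of the ports in the table can narrow down which one is missing. The sketch below is not a reimplementation of `make health`, whose behavior is not documented here. It assumes the services are published on localhost at the listed ports, and the `port_open` and `check_services` helpers are hypothetical.

```python
import socket

# Ports from the Services table above.
SERVICES = {
    "web-ui": 3000,
    "api-gateway": 8000,
    "sre-agent": 8080,
    "postgres": 5432,
    "neo4j": 7474,
    "redis": 6379,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost", services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in services.items()}
```

Call `check_services()` after `make dev`. An open port only shows that something is listening, not that the service is healthy.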

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

# Development setup
git clone https://github.com/autosre-ai/autosre.git
cd autosre
pip install -e ".[dev]"
pytest  # Run the test suite

Areas we need help:

🔌 New integrations (Elastic, Splunk, New Relic)
📊 Investigation scenarios for evaluation
📚 Documentation and examples
🐛 Bug reports and fixes

📄 License

Apache 2.0 — See LICENSE for details.

Built by SREs, for SREs.
Tired of 3am pages? Let AutoSRE handle the first 5 minutes.


Release history

Version	Released
0.2.2 (this version)	May 23, 2026
0.2.1	May 22, 2026
0.2.0	May 22, 2026

Download files

Source Distribution

autosre_ai-0.2.2.tar.gz (592.5 kB)

Uploaded May 23, 2026 Source

Built Distribution

autosre_ai-0.2.2-py3-none-any.whl (512.0 kB)

Uploaded May 23, 2026 Python 3

File details

Details for the file autosre_ai-0.2.2.tar.gz.

File metadata

Download URL: autosre_ai-0.2.2.tar.gz
Upload date: May 23, 2026
Size: 592.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for autosre_ai-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`7c94c6056c180ffdadaf0817d0e5a5afc28000cfc163ffa4450c9c608a96a744`
MD5	`6e2e46c832fb2ed12a9f71fb737ddd3e`
BLAKE2b-256	`c6e497f9597e5114e8f6462f54656cb69e53f52a2c54823a1de8bdf6d995b032`


File details

Details for the file autosre_ai-0.2.2-py3-none-any.whl.

File metadata

Download URL: autosre_ai-0.2.2-py3-none-any.whl
Upload date: May 23, 2026
Size: 512.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for autosre_ai-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8c497cbaef64c92afd235d6df60ed933663d8f450c1c1ba3d90643e4b7e2f896`
MD5	`178cf55fa2a6a533c9120ea7d5cfe530`
BLAKE2b-256	`9a27c55e99e0b3ed95b78bac830fd1655721e100be099fd1c23fc8801ea92a73`
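The published digests can be used to confirm that a downloaded file is intact. The sketch below computes a SHA256 digest with Python's standard `hashlib` and compares it with the values listed above. The helper names are hypothetical, and the file is assumed to be in the current directory under its published name.

```python
import hashlib

# SHA256 digests copied from the tables above.
EXPECTED_SHA256 = {
    "autosre_ai-0.2.2.tar.gz":
        "7c94c6056c180ffdadaf0817d0e5a5afc28000cfc163ffa4450c9c608a96a744",
    "autosre_ai-0.2.2-py3-none-any.whl":
        "8c497cbaef64c92afd235d6df60ed933663d8f450c1c1ba3d90643e4b7e2f896",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA256 digest matches the expected hex string."""
    return sha256_of(path) == expected_hex.lower()
```

For example, `verify("autosre_ai-0.2.2.tar.gz", EXPECTED_SHA256["autosre_ai-0.2.2.tar.gz"])` should return True for an unmodified download.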

