Open-source AI SRE agent - foundation-first incident investigation, root cause analysis, and auto-remediation
AutoSRE
The AI SRE that investigates incidents like your best on-call engineer, but faster.
Quick Start • Features • How It Works • Integrations • Docs
45-minute investigations → 5 minutes. Autonomous triage. Evidence-based RCA. Human-in-the-loop for safety.
Quick Start
# Install
pip install autosre-ai
# Configure (interactive setup)
autosre config init
# Investigate your first incident
autosre investigate "checkout service 500 errors" --service checkout-service
Or with Docker:
docker run -it --rm -v ~/.autosre:/root/.autosre ghcr.io/autosre-ai/autosre investigate "high latency on api-gateway"
That's it. No Neo4j. No Postgres. No infrastructure. Just pip install and go.
Features
Autonomous Investigation
Multi-agent investigation that works like your best SRE: triage → contain → investigate → resolve → learn.
$ autosre investigate "payment failures spiking"
[Triage] Confirmed: payment-service 5xx rate at 12% (normally <0.1%)
[Scope] Affected: checkout-service, order-service (downstream)
[Hypothesis] Testing: Recent deployment of payment-service v2.3.1
[Evidence] Deployment at 14:02, errors started 14:05 ✓
[Root Cause] payment-service v2.3.1 introduced null pointer in retry logic
[Recommendation] Rollback to v2.3.0 (requires approval)
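The [Evidence] step in the sample output rests on timing: the deployment precedes the first errors by a few minutes. A minimal sketch of that check follows. The function name, the 15-minute window, and the calendar date are assumptions made for illustration, not AutoSRE's actual logic. A match supports the hypothesis but does not prove it.

```python
from datetime import datetime, timedelta

def deployment_precedes_onset(deployed_at, errors_started_at,
                              max_lag=timedelta(minutes=15)):
    """Return True if errors began after the deployment and within max_lag of it."""
    lag = errors_started_at - deployed_at
    return timedelta(0) <= lag <= max_lag

# Times taken from the sample output above; the date itself is made up.
deployed = datetime(2024, 1, 1, 14, 2)
onset = datetime(2024, 1, 1, 14, 5)
supports_hypothesis = deployment_precedes_onset(deployed, onset)
```

A deployment that lands after the error onset fails the check, which is what makes the hypothesis falsifiable.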
Episodic Memory
Learns from every investigation. Recalls similar incidents. Gets smarter over time.
$ autosre memory search "database timeout"
Found 3 similar incidents:
├── inv_abc123: PostgreSQL connection pool exhaustion (resolved in 8m)
├── inv_def456: Slow query blocking connections (resolved in 12m)
└── inv_ghi789: Network partition to RDS (resolved in 23m)
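The Key Concepts table describes this memory as SQLite-based with FTS5 search. The sketch below shows how such a lookup could work with Python's built-in sqlite3 module. The table name, columns, and sample rows are hypothetical, not AutoSRE's real schema, and the sketch assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE incidents USING fts5(incident_id, summary)")
conn.executemany(
    "INSERT INTO incidents (incident_id, summary) VALUES (?, ?)",
    [
        ("demo_1", "Connection pool exhaustion caused database timeout"),
        ("demo_2", "Slow query blocked connections, database timeout on checkout"),
        ("demo_3", "Certificate expiry on api-gateway"),
    ],
)

def search_incidents(conn, query, limit=5):
    """Return (incident_id, summary) rows that contain every query term."""
    terms = query.split()
    if not terms:
        return []
    # Quote each term so punctuation in user input is not parsed as FTS5 syntax.
    match_expr = " ".join('"' + t.replace('"', '""') + '"' for t in terms)
    return conn.execute(
        "SELECT incident_id, summary FROM incidents "
        "WHERE incidents MATCH ? ORDER BY rank LIMIT ?",
        (match_expr, limit),
    ).fetchall()

results = search_incidents(conn, "database timeout")
```

Ordering by rank puts the closest matches first, which is one plausible way to surface the most similar past incidents.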
SLO-Driven Operations
Error budgets, multi-window burn rates, and deployment gating are all built in.
$ autosre slo status --service checkout-service
checkout-service SLO Status
├── Availability: 99.92% (target: 99.9%) ✓
├── Latency p99: 245ms (target: 300ms) ✓
├── Error Budget: 72% remaining
│   ├── 1h burn rate: 0.8x
│   ├── 6h burn rate: 1.2x
│   └── 24h burn rate: 0.9x
└── Deploys: ALLOWED
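A burn rate is conventionally the observed error ratio divided by the error budget (1 minus the SLO target), so 1.0x means the budget is being spent exactly as fast as the SLO allows. The sketch below computes it per window and gates deploys on a threshold. The function names, the 2.0x threshold, and the per-window error ratios are assumptions chosen to reproduce the sample figures. AutoSRE's actual gating policy is not documented here.

```python
from typing import Mapping

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget

def deploys_allowed(window_error_ratios: Mapping[str, float],
                    slo_target: float,
                    max_burn: float = 2.0) -> bool:
    """Allow deploys only if no window burns budget faster than max_burn."""
    return all(burn_rate(ratio, slo_target) <= max_burn
               for ratio in window_error_ratios.values())

# Hypothetical error ratios per window for a 99.9% availability target.
windows = {"1h": 0.0008, "6h": 0.0012, "24h": 0.0009}
rates = {name: burn_rate(ratio, 0.999) for name, ratio in windows.items()}
allowed = deploys_allowed(windows, 0.999)
```

Checking several windows at once catches both sudden spikes (short window) and slow leaks (long window).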
AI Safety Built-In
Every decision has confidence scores. Critical actions require human approval. Full audit trails.
- Hypothesis-driven reasoning with falsifiable criteria
- Confidence scoring (0.0-1.0) on every decision
- Human-in-the-loop for remediation actions
- AI error budgets tracking accuracy over time
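The points above can be sketched as a small decision gate: low-confidence proposals go back for more evidence, remediation always waits for a human, and every decision is recorded. The class, function, threshold, and outcome labels are hypothetical names for illustration and are not AutoSRE's API.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    description: str
    confidence: float      # 0.0-1.0, as in the list above
    is_remediation: bool   # True for actions that change production state

audit_log = []  # every decision is appended here for later review

def decide(action: ProposedAction, min_confidence: float = 0.8) -> str:
    """Return the next step for a proposed action and record it in audit_log."""
    if not 0.0 <= action.confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if action.confidence < min_confidence:
        outcome = "gather_more_evidence"
    elif action.is_remediation:
        outcome = "needs_human_approval"
    else:
        outcome = "proceed"
    audit_log.append((action.description, action.confidence, outcome))
    return outcome

rollback = ProposedAction("Rollback payment-service to v2.3.0", 0.92, True)
next_step = decide(rollback)
```

In this sketch no confidence level lets a remediation action skip approval, which matches the human-in-the-loop rule stated above.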
Extensible Skills System
Modular investigation skills: Kubernetes, metrics, logs, traces, infrastructure.
skills/
├── kubernetes/ # Pod states, deployments, events
├── metrics-analysis/ # Prometheus, Datadog, Grafana
├── log-analysis/ # Pattern matching, anomaly detection
├── traces/ # Distributed tracing analysis
├── infrastructure/ # AWS, GCP resource checks
└── investigation/ # Methodology and hypothesis testing
Automated Postmortems
Blameless postmortems with auto-generated timelines, metrics snapshots, and action items.
How It Works
┌───────────────────────────────────────────────────────────────┐
│                      autosre investigate                      │
└───────────────────────────────┬───────────────────────────────┘
                                │
                       ┌────────▼────────┐
                       │  Orchestrator   │
                       │   (LangGraph)   │
                       └────────┬────────┘
                                │
              ┌───────────┬─────┴─────┬───────────┐
              │           │           │           │
         ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌────▼────┐
         │ Memory  │ │Topology │ │ Planner │ │   LLM   │
         │(SQLite) │ │ (YAML)  │ │  Agent  │ │ Router  │
         └─────────┘ └─────────┘ └────┬────┘ └─────────┘
                                      │
                  ┌─────────┬─────────┼─────────┬─────────┐
                  │         │         │         │         │
             ┌────▼───┐┌────▼───┐┌────▼───┐┌────▼───┐┌────▼───┐
             │  K8s   ││Metrics ││  Logs  ││ Traces ││ Infra  │
             │Subagent││Subagent││Subagent││Subagent││Subagent│
             └────┬───┘└────┬───┘└────┬───┘└────┬───┘└────┬───┘
                  └─────────┴─────────┬─────────┴─────────┘
                                      │
                       ┌──────────────▼──────────────┐
                       │         Synthesizer         │
                       │   (Evidence → Root Cause)   │
                       └──────────────┬──────────────┘
                                      │
                       ┌──────────────▼──────────────┐
                       │      Writeup & Actions      │
                       │  (Postmortem, Remediation)  │
                       └─────────────────────────────┘
Key Concepts:
| Component | What It Does |
|---|---|
| Orchestrator | Coordinates investigation phases (Triage → Mitigate → Diagnose → Resolve) |
| Episodic Memory | SQLite-based learning from past investigations with FTS5 search |
| Service Topology | YAML-defined service dependencies for blast radius analysis |
| Subagents | Parallel specialists (Kubernetes, metrics, logs, traces) |
| Synthesizer | Merges evidence, tests hypotheses, identifies root cause |
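Blast radius analysis over a dependency map amounts to a graph traversal: starting from the failing service, collect everything that depends on it, directly or transitively. The sketch below uses a plain dict in place of the YAML file. The service map and the function name are assumptions for illustration, and AutoSRE's real topology schema may differ.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it depends on.
depends_on = {
    "api-gateway": ["checkout-service"],
    "checkout-service": ["payment-service"],
    "order-service": ["payment-service"],
    "payment-service": [],
}

def blast_radius(depends_on, failing):
    """Return every service that depends on `failing`, directly or transitively."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failing and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected

affected = blast_radius(depends_on, "payment-service")
```

The visited-set check keeps the traversal finite even if the topology contains a dependency cycle.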
Integrations
| Category | Supported |
|---|---|
| Observability | Prometheus, Grafana, Datadog |
| Incident Management | PagerDuty, Slack, OpsGenie |
| Infrastructure | Kubernetes, AWS, GCP |
| Source Control | GitHub, GitLab |
| Issue Tracking | Jira, Linear |
Why AutoSRE?
| Before AutoSRE | After AutoSRE |
|---|---|
| 45+ min incident investigations | 5 min AI-assisted triage |
| Lost context between incidents | Episodic memory recalls similar issues |
| Tribal knowledge in runbooks | AI executes and learns from runbooks |
| Manual toil tracking | Auto-classified, automation suggested |
| Blame-filled postmortems | Auto-generated blameless documentation |
Test Results: 1,053 tests passing | 25+ investigation scenarios validated
Production Deployment
For production deployments with persistent storage and multiple services, see the Docker Deployment Guide.
Quick Docker Compose Setup
# Clone and setup
git clone https://github.com/autosre-ai/autosre.git
cd autosre
make setup
# Configure secrets
cp .env.example .env
vim .env # Add your API keys
# Start all services
make dev
# Verify health
make health
Services:
| Service | Port | Description |
|---|---|---|
| web-ui | 3000 | Next.js web interface |
| api-gateway | 8000 | FastAPI REST API |
| sre-agent | 8080 | AI agent service |
| postgres | 5432 | PostgreSQL database |
| neo4j | 7474 | Graph database (optional) |
| redis | 6379 | Cache & pub/sub |
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
# Development setup
git clone https://github.com/autosre-ai/autosre.git
cd autosre
pip install -e ".[dev]"
pytest # Run the test suite
Areas we need help:
- New integrations (Elastic, Splunk, New Relic)
- Investigation scenarios for evaluation
- Documentation and examples
- Bug reports and fixes
License
Apache 2.0. See LICENSE for details.
Built by SREs, for SREs.
Tired of 3am pages? Let AutoSRE handle the first 5 minutes.
Star us on GitHub • Join Discord • Follow on Twitter