Stateful monitoring: reconcile declared infrastructure state against reality and reason about the drift.
steadystate.ai
The operational substrate for IT-Ops — whether a human or an agent drives it. steadystate watches your deployed infrastructure (live, and in CI), tells you whether it's actually working, carries your team's runbook, and closes the loop — but only ever within a bound you commit. A deterministic, stdlib-only core; an optional LLM that advises and proposes but never decides.
The job isn't "find drift." It's the two things an operator — or an agent acting as one — actually needs: grounded truth (what's declared, what's observed, is it working, what changed) and a governed way to act (a vetted catalog, an impact×reversibility bound, approval, an immutable audit). Rent your monitoring for the metrics; steadystate is the layer that knows your desired state and can safely return you to it.
Install
pipx install git+https://github.com/jedi12many/steadystate.ai # today (pre-release)
pip install steadystate # once published to PyPI
steadystate is a CLI you run from inside your IaC repo — its config, runbook, and state are
read CWD-relative, so the repo never imports it as a library. --silo <name> chdirs into a
registered deployment (like git -C); the container image lives under deploy/.
Two postures, one core
steadystate runs in two shapes that share the same deterministic engine and the same committed runbook:
| | Live watcher | Repo-native (GitOps) |
|---|---|---|
| Runs | a long-running server/CLI next to a deployment | stateless, in CI, inside the IaC repo |
| Holds | creds, a kubeconfig, a state db | nothing but the repo + a token |
| Acts by | guardrailed remediation on live infra (when you grant it) | opening a PR / an issue — a human merges |
| Driven by | you, or an agent over MCP | one CI line: steadystate ci |
Post-deploy and pre-deploy, the same tool. The PR-bot posture is the safest actuator there is — it
has zero infra access; its only power is a proposal you review. See
docs/repo-native-posture.md.
Is it working? — the verdict, function-first
Running ≠ working. steadystate leads with the question an operator actually asks — is it
working? — and answers WORKING | DEGRADED | DOWN:
- Smoke tests: the strongest signal is to exercise the service. An `httpcheck` GETs an endpoint and asserts the response. A service that won't answer is the service being down.
- Live malfunction (`Symptom`) vs drift: a `CrashLoopBackOff` is the app failing now; a diverged config is drift. steadystate keeps them distinct, and when a failing resource also drifted, folds them into one root-caused alert ("failing — likely cause: this drift").
- Function-first triage: `summary` leads with what's impaired (a live malfunction worth attention), not a wall of findings, so neither you nor an agent chases a red herring. A serious config drift (an opened firewall) is still flagged for review: never buried, never called a malfunction.
- Correlation + enrichment: scoped to a workload, it correlates the smoke result, the live symptoms, and the drift that likely caused them, and folds in live metrics from your monitoring (Prometheus) as context. It rents the metrics; it never reimplements monitoring.
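A minimal sketch of how a function-first verdict could be folded from those signals. The names `http_smoke`, `assess`, and `Health`, and the exact mapping from signals to WORKING / DEGRADED / DOWN, are assumptions made for illustration; they are not taken from steadystate's code.

```python
from dataclasses import dataclass
from enum import Enum
from urllib import error, request


class Verdict(Enum):
    WORKING = "WORKING"
    DEGRADED = "DEGRADED"
    DOWN = "DOWN"


@dataclass(frozen=True)
class Health:
    verdict: Verdict
    drift_to_review: int  # drift is surfaced for review, never counted as a malfunction


def http_smoke(url: str, expect_status: int = 200, timeout: float = 5.0) -> bool:
    """GET the endpoint and assert the status; a service that won't answer fails."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status == expect_status
    except (error.URLError, OSError, ValueError):
        # HTTP errors, refused connections, timeouts, and malformed URLs all fail the test.
        return False


def assess(smoke_ok: bool, live_symptoms: int, drift_findings: int = 0) -> Health:
    """Hypothetical rule: failed smoke test -> DOWN, live symptoms -> DEGRADED."""
    if not smoke_ok:
        verdict = Verdict.DOWN
    elif live_symptoms > 0:
        verdict = Verdict.DEGRADED
    else:
        verdict = Verdict.WORKING
    return Health(verdict, drift_findings)
```

In this sketch drift never changes the verdict by itself; it only travels alongside it, which mirrors the distinction between a live malfunction and a diverged config.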
Your runbook — author a fix once, then it's everywhere
The other half of grounded truth is what to do about it. A solution is an operator-vouched
problem → fix — "for evicted pods, run this; for a hung gateway, reboot it" — committed to
steadystate/solutions.json, the catalog you grow over time:
- Authored or learned. Write one (`add-solution`), describe it in plain English (`define-solution`), or let `learn` notice a fix you keep applying by hand and hand you the exact command to capture it. Each is signed by an author, the audit anchor.
- Matched + surfaced. When a finding matches (by category or a title regex), `show` names the documented fix and who vouched, a CI-opened issue carries it, and an agent over MCP sees the same: your runbook, right where the problem is.
- Run through the gate. A matched fix becomes a one-`approve` remediation, run as an argv (no shell) and audited with author + approver. Opt in to auto-apply (`STEADYSTATE_SOLUTION_AUTO`) and a low-impact, reversible one runs unattended; anything bigger always waits for a human.
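The matching step (by category or a title regex) can be sketched as below. The `Solution` fields and the shape of a finding are hypothetical; the real `solutions.json` schema is not shown here and may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Solution:
    author: str                       # who vouched for the fix (the audit anchor)
    fix_argv: tuple                   # the fix as an argv, never a shell string
    category: Optional[str] = None    # hypothetical field: exact category match
    title_regex: Optional[str] = None  # hypothetical field: regex over the finding title


def match_solution(finding: dict, catalog: list) -> Optional[Solution]:
    """Return the first catalog entry that matches by category or by title regex."""
    for sol in catalog:
        if sol.category is not None and sol.category == finding.get("category"):
            return sol
        if sol.title_regex and re.search(sol.title_regex, finding.get("title", "")):
            return sol
    return None
```

A finding titled "3 pods Evicted in ns/web" would match an entry whose `title_regex` is `(?i)evicted`, and the caller could then show the entry's `author` next to the proposed fix.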
Act — within a bound you commit
Acting is gated identically everywhere — terminal, chat, agent, CI:
- A vetted catalog: the only commands it can run are a fixed menu of safe shapes, re-validated at run time against an injection-proof allow-pattern; an authored solution is your extension of that catalog, vouched and audited.
- The bound: every action carries an envelope (impact × reversibility). The bound is the one decision that should never be casual, so you commit it in the `[bound]` table of `steadystate/config.toml`, reviewed in a PR, not a loose env var. A lossless, tenant-scoped fix runs in bound; scaling to zero or deleting a node escalates to a human.
- Approval + an immutable audit: nothing effectful runs unseen; every approve / decline / auto / break-glass appends to `history`.
- Autonomy is a switch, not a track record: off by default, granted by you (`STEADYSTATE_*_AUTO`), bounded by the envelope above.
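As an illustration of the impact × reversibility envelope, here is a small sketch of a deterministic gate. The `Impact` levels, the `Bound` fields, and the three outcome strings are assumptions chosen for the example; the actual keys of the `[bound]` table are documented in CONFIG.md, not here.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Envelope:
    impact: Impact
    reversible: bool


@dataclass(frozen=True)
class Bound:
    max_impact: Impact               # hypothetical: highest impact allowed in bound
    require_reversible: bool = True  # hypothetical: irreversible actions always escalate


def gate(action: Envelope, bound: Bound, auto: bool = False) -> str:
    """Decide whether an action escalates, waits for approval, or may run unattended."""
    in_bound = action.impact <= bound.max_impact and (
        action.reversible or not bound.require_reversible
    )
    if not in_bound:
        return "escalate"
    # Autonomy is off unless explicitly granted, even for in-bound actions.
    return "auto" if auto else "await-approval"
```

Under a bound of `Bound(max_impact=Impact.LOW)`, a reversible low-impact restart would return `"await-approval"` by default and `"auto"` only when autonomy is switched on, while deleting a node (high impact, not reversible) would return `"escalate"` either way.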
The LLM proposes what; a deterministic gate decides whether. The full control model — and, just as honestly, where the guarantee ends (a shell-enabled agent's real limit is its RBAC, not us) — is LLM_SAFETY.md. Ask the tool itself with
steadystate posture.
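One way to picture "a fixed menu of safe shapes" is an allow-list where every argv element must fully match a pattern. The two shapes below are hypothetical examples, not steadystate's actual catalog, and the sketch only illustrates the idea of validating a proposed command as an argv instead of a shell string.

```python
import re

# Hypothetical shapes: one regex per argv position; nothing outside the list is allowed.
ALLOWED_SHAPES = [
    ["kubectl", "rollout", "restart", r"deployment/[a-z0-9-]+", "-n", r"[a-z0-9-]+"],
    ["kubectl", "delete", "pod", r"[a-z0-9-]+", "-n", r"[a-z0-9-]+"],
]


def is_allowed(argv: list) -> bool:
    """True only if argv has the same length as a shape and every element fully matches."""
    for shape in ALLOWED_SHAPES:
        if len(shape) == len(argv) and all(
            re.fullmatch(pattern, arg) for pattern, arg in zip(shape, argv)
        ):
            return True
    return False
```

Because each element is matched with `re.fullmatch` and the command is never joined into a shell string, an argument such as `deployment/web; rm -rf /` fails validation instead of being interpreted.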
Drive it — terminal, chat, agents, CI
The same vetted command grammar, four ways in:
- Terminal: `steadystate health` for the working/degraded/down verdict; `summary` for the one-glance rollup; `findings` / `show` to inspect; `chat` for a local REPL.
- Chat (Slack / Teams / Discord): signed webhooks; `@steadystate probe <target>`, approve from a button. With an LLM, plain English works ("why is web crashlooping?"): a read-only ask runs, an effectful one is echoed back to confirm. Chat is a trigger, never a bypass.
- Agents over MCP: `steadystate mcp` runs as a Model Context Protocol server (stdio, stdlib-only) so Claude Code/Desktop or any agent drives the same verbs through the same guardrails. Three grant tiers: read-only (default) → `--author` (write checks + runbook solutions, not infra) → `--write` (remediate). Make it an agent's sole actuator (no shell, steadystate holds the creds) and the gate becomes a real fence (see the `contained-agent` example).
- CI: `steadystate ci` is stateless, deterministic, and needs no creds; it scans the IaC, gates the merge (non-zero on a problem), and opens a PR/issue that already says how to fix it.
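For the signed-webhook entry point, a generic verification step might look like the following. This assumes an HMAC-SHA256 scheme over `timestamp:body` with a replay window; each chat platform defines its own signing format, so the message layout, the `sign` and `verify` names, and the 300-second skew are all illustrative assumptions.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Hex HMAC-SHA256 over 'timestamp:body' (a hypothetical message layout)."""
    message = timestamp.encode() + b":" + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: str, body: bytes, signature: str,
           now: int, max_skew: int = 300) -> bool:
    """Reject stale or malformed timestamps, then compare signatures in constant time."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > max_skew:
        return False
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Verifying before parsing the body keeps an unsigned request from ever reaching the command grammar, which is one reading of "a trigger, never a bypass".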
Detect — the grounded truth it's built on
Everything here runs with no model: fully deterministic, fully testable. It rides each tool's own
machine-readable output (terraform show -json, kubectl, helm, …), never raw-file parsing.
- Sources (`--source`): terraform · terraform-state · ansible · kubernetes · rancher · argocd · docker-compose · helm, plus live variants. `terraform-state` diffs config-vs-state with `-refresh=false`: no per-resource cloud refresh, so a CI gate needs only state-bucket read, not broad cloud creds.
- Drift vs malfunction (`--probe`): config diverged vs failing-right-now, folded when both.
- Custom checks: declare what healthy means for your app (is postfix routing mail? is squid up?) as a vetted, read-only rule that emits a finding and never runs code; author by talking (`define-check`) or let an agent fill the schema (`add-check`).
- Domain packs + live compliance: security (AWS · GCP · Azure → ATT&CK), Docker CIS, k8s Pod Security; a CIS/STIG live-posture scan, honest about what's checkable live.
- Surfaces (`--to`): console · slack · teams · discord · github · servicenow · pagerduty · prometheus · grafana · webhook. `github` opens an issue when sure (deduped by fingerprint, auto-closed when it clears, and carrying the matched runbook fix); an unconfigured surface says so and skips.
- Silos: name your deployments (`silo add`; `--silo <name>` works like `git -C`) so a laptop drives deployment 1, 2, 3, each with its own db + targets + kubeconfig, without collision.
- Extensible: sources, packs, surfaces, probes, metric adapters are entry-point plugins; a third-party package adds one without a fork. Self-describing: `catalog` / `commands`.
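To make "rides each tool's own machine-readable output" concrete, here is a sketch that reads the JSON plan representation printed by `terraform show -json` and lists resources whose planned actions are anything other than a no-op. The `resource_changes`, `address`, and `change.actions` keys come from Terraform's JSON output format; the function name and the way steadystate actually consumes this output are assumptions.

```python
import json


def changed_resources(plan_json: str) -> list:
    """Return (address, actions) pairs for every resource that is not a no-op."""
    plan = json.loads(plan_json)
    changed = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if actions and actions != ["no-op"]:
            changed.append((rc["address"], actions))
    return changed
```

Working from the structured output means the sketch never has to parse HCL or state files directly; it only needs the JSON that Terraform itself emits.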
Config as code
The same convention all the way down: committed beside your IaC, reviewed in PRs.
your-iac-repo/
├── main.tf, ...
├── steadystate/ # COMMITTED intent
│ ├── config.toml # [defaults] source/path · [bound] the envelope · [ci] the gate
│ ├── solutions.json # your runbook (problem → fix)
│ └── checks.json # what "healthy" means for your app
└── .steadystate/ # gitignored ephemeral state (state.db, patches)
Precedence is 12-factor and non-breaking: flag > env var > config > built-in default. Every
variable is in CONFIG.md; steadystate doctor shows what's set and each dial's
live value.
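The precedence rule (flag > env var > config > built-in default) can be sketched as a single lookup. The `STEADYSTATE_<NAME>` environment naming used here is an assumption extrapolated from variables such as `STEADYSTATE_SOLUTION_AUTO`; CONFIG.md is the authority on the real names.

```python
from typing import Any, Mapping, Optional


def resolve(name: str, flag: Optional[Any], env: Mapping[str, str],
            config: Mapping[str, Any], default: Any) -> Any:
    """Resolve one setting: flag wins, then env var, then committed config, then default."""
    if flag is not None:
        return flag
    env_key = "STEADYSTATE_" + name.upper()  # hypothetical naming convention
    if env_key in env:
        return env[env_key]
    if name in config:
        return config[name]
    return default
```

Passing `os.environ` as `env` and the parsed `config.toml` table as `config` would give the layered behavior described above, and keeping both as plain mappings makes the rule easy to test.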
The optional LLM — advises, never decides
An LLM adds the plain-language "why this matters", groups events by root cause, answers questions in chat, drafts a check/solution from your words, and — where you grant it — proposes a remediation or drives the tool as an agent. But detection, scoring, correlation-fallback, and the apply decision stay deterministic, and a proposed action runs only if the gate authorizes it.
- Providers: Anthropic (`ANTHROPIC_API_KEY`) or any OpenAI-compatible endpoint.
- Kill switch + egress gate: `--no-llm` makes zero calls; `--confirm-llm` shows the exact prompt and destination and asks before anything is sent (decline → degrades to deterministic).
- Cost: every scan prints `LLM: N calls · ~$X`; `steadystate cost` rolls it up; surface `steadystate_llm_cost_usd_total` to Prometheus.
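The kill switch and egress gate amount to two checks in front of every model call. The sketch below assumes hypothetical `send` and `confirm` callables and returns `None` to signal "fall back to the deterministic path"; it illustrates the control flow, not steadystate's internals.

```python
from typing import Callable, Optional


def ask_llm(prompt: str, destination: str,
            send: Callable[[str, str], str],
            confirm: Callable[[str, str], bool],
            no_llm: bool = False, confirm_llm: bool = False) -> Optional[str]:
    """Call the model only if the kill switch is off and any required confirmation passes."""
    if no_llm:
        return None  # kill switch: zero calls are made
    if confirm_llm and not confirm(prompt, destination):
        return None  # operator declined: degrade to deterministic behavior
    return send(prompt, destination)
```

Because the caller treats `None` as "no advice available", declining a prompt degrades the output without blocking detection or the apply decision.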
Honest about what it is
The pieces aren't all novel — CI drift detection exists (Spacelift, env0, Terraform Cloud, driftctl) and IaC PR-bots exist (Atlantis). What's different is the combination: a committed, matched runbook (your problem→fix knowledge, not just a diff); a function-first verdict with the bound (is it working, and is this safe to auto-fix?); one substrate across both postures sharing that runbook; and the PR-bot as a deliberately zero-access actuator. A deployment model and a coherence, more than a single unique feature — and the lowest-friction front door to all of it.
Pointers
- CONFIG.md: every variable + the committed `config.toml`.
- LLM_SAFETY.md: the control model, and where the guarantee ends.
- docs/repo-native-posture.md: the GitOps posture, end to end.
- ARCHITECTURE.md: the state model, the seams, build-vs-rent.
- examples/: worked scenarios: repo-native CI, custom checks, the runbook, a contained agent, brokered creds, fleet health, an MCP-driven wall.
- SECURITY.md: scope + how to report a vulnerability.
Built with
Python, stdlib-only at the core (HTTP/LLM via urllib; typer + rich for the CLI). Apache-2.0 —
see LICENSE.