SLO + error-budget tracker for Python services (FastAPI middleware + Prometheus exporter)

slo-budget-tracker

SLO + error-budget tracker for Python services — drop-in FastAPI middleware, Prometheus exporter, and a small standalone library you can wire into any ASGI app or background worker.

Built around the math in the Google SRE Workbook: one rolling window, multi-window burn-rate alerts (defaults to 1h + 6h at burn rate ≥ 14.4), and an explicit error-budget remaining gauge so dashboards stop lying about reliability.


Why

Most "SLO dashboards" you find in the wild conflate availability with uptime and surface neither error budget nor burn rate. You can't tell, at a glance, whether the freshly deployed service is burning the next 30 days of error budget in the next 30 minutes. This library makes that visible by default.

Two things matter:

  1. Error budget remaining: a ratio that starts at 1.0 and falls to 0 (or below, once the budget is overspent), shown on every dashboard.
  2. Burn rate: (1 − actual_success_ratio) / (1 − target), sampled over short windows so fast-burn incidents page before the budget is spent.

Install

pip install slo-budget-tracker
# or, with the FastAPI extras:
pip install "slo-budget-tracker[fastapi]"

Python 3.11+. Single runtime dep: prometheus-client.


Quick start — standalone library

from slo_budget_tracker import SLODefinition, SLOTracker

tracker = SLOTracker(
    SLODefinition(
        name="availability",
        target=0.999,                     # three nines
        window_seconds=30 * 24 * 3600,    # 30-day rolling window
        burn_rate_windows=(3600, 21600),  # alert on 1h and 6h
        burn_rate_threshold=14.4,         # SRE workbook fast-burn page
    )
)

# Hot path — O(1)
tracker.record_success()
tracker.record_failure()

snap = tracker.snapshot()
print(f"success ratio: {snap.success_ratio:.4f}")
print(f"budget left:   {snap.error_budget_remaining:.2%}")
print(f"burn rate:     {snap.burn_rate:.2f}")

if snap.is_budget_exhausted:
    print("Freeze deploys.")

for alert in tracker.check_burn_rate():
    print(f"FAST BURN over {alert.window_seconds}s: {alert.burn_rate:.1f}x budget")
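
For background workers, one way to record outcomes without wrapping every job in its own try/except is a small context manager. The sketch below is not part of the library: `track` and the `Recorder` protocol are hypothetical names, and the only thing it assumes about the tracker is the `record_success()` / `record_failure()` pair shown above.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol


class Recorder(Protocol):
    """Hypothetical protocol: anything exposing the two recording methods."""

    def record_success(self) -> None: ...
    def record_failure(self) -> None: ...


@contextmanager
def track(tracker: Recorder) -> Iterator[None]:
    """Record one observation for the wrapped block of work."""
    try:
        yield
    except Exception:
        # The job raised: count a failure, then let the error propagate.
        tracker.record_failure()
        raise
    else:
        # The job finished without raising: count a success.
        tracker.record_success()
```

With a real tracker this would read `with track(tracker): handle(job)`, where `handle` stands in for your own worker function.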

FastAPI middleware

SLOMiddleware auto-classifies every HTTP response — by default 5xx and unhandled exceptions are failures, everything else is a success. Override with your own classifier when 4xx (or specific routes) should burn budget.

from fastapi import FastAPI
from fastapi.responses import Response
from slo_budget_tracker import (
    PrometheusExporter,
    SLODefinition,
    SLOMiddleware,
    SLORegistry,
)

registry = SLORegistry()
registry.define(SLODefinition(name="availability", target=0.999))
registry.define(SLODefinition(name="freshness",    target=0.99))

app = FastAPI()
app.add_middleware(SLOMiddleware, registry=registry, slo_name="availability")

exporter = PrometheusExporter(registry)


@app.get("/metrics")
async def metrics() -> Response:
    body, content_type = exporter.render()
    return Response(content=body, media_type=content_type)


@app.get("/slo")
async def slo_snapshot() -> dict[str, object]:
    return {"slos": [s.__dict__ for s in registry.snapshot_all()]}

Point your Prometheus scrape at /metrics and you get:

slo_target{slo="availability"} 0.999
slo_success_ratio{slo="availability"} 0.9991
slo_error_budget_remaining{slo="availability"} 0.42
slo_burn_rate{slo="availability",window_seconds="3600"} 2.1
slo_burn_rate{slo="availability",window_seconds="21600"} 0.8
slo_breached{slo="availability"} 0.0

Custom classification

Default: anything < 500 and no exception is a success. Want 4xx to burn budget? Pass classify=:

app.add_middleware(
    SLOMiddleware,
    registry=registry,
    slo_name="availability",
    classify=lambda status, exc: exc is None and status < 400,
)

The classifier receives (status_code, exception_or_None) and returns True for success.
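
When the rule outgrows a lambda, a named function is easier to test. The sketch below assumes only the `(status_code, exception_or_None) -> bool` contract described above; the function name, the `CLIENT_FAULT_STATUSES` set, and the policy itself (401/403/404 do not burn budget, every other 4xx does) are illustrative choices, not library defaults.

```python
from typing import Optional

# Hypothetical policy: these statuses are treated as the caller's mistake,
# so they do not count against the service's error budget.
CLIENT_FAULT_STATUSES = frozenset({401, 403, 404})


def classify_request(status: int, exc: Optional[BaseException]) -> bool:
    """Return True when the response should count as a success."""
    if exc is not None:
        return False  # unhandled exception: always a failure
    if status >= 500:
        return False  # server error: always a failure
    if status >= 400:
        # Remaining 4xx (for example 429) burn budget under this policy.
        return status in CLIENT_FAULT_STATUSES
    return True
```

It would then be passed as `classify=classify_request` in the `add_middleware` call.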


API surface

| Object | Purpose |
| --- | --- |
| SLODefinition | Frozen dataclass: name, target, window, burn-rate windows + threshold. Validates at construction. |
| SLOTracker | Records observations, computes snapshots and burn-rate alerts. |
| SLORegistry | Holds many named trackers; supports snapshot_all() and check_burn_rates(). |
| SLOMiddleware | ASGI middleware that auto-records HTTP outcomes against a tracker. |
| PrometheusExporter | Renders the registry as Prometheus text format on demand. |
| Observation | (timestamp, success) event. |
| SLOSnapshot | Point-in-time view: ratios, failures, budget remaining, burn rate. |
| BurnRateAlert | One short window has crossed the configured threshold. |
| BurnRateSample | One short-window measurement attached to a snapshot. |

Burn-rate math

error_budget   = (1 - target) * total_requests_in_window
budget_used    = failures_in_window
remaining_pct  = (error_budget - budget_used) / error_budget

burn_rate(short_window) = (1 - success_ratio(short_window)) / (1 - target)

A burn_rate of 1.0 means the service is failing at exactly the rate the SLO allows, so the budget lasts the full window. A burn_rate of 14.4 means a 30-day budget is spent in about 2 days (30 / 14.4 ≈ 2.1). The default threshold of 14.4 follows the SRE Workbook fast-burn page.
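
The formulas above can be checked by hand with a few lines of plain Python. This is a standalone sketch of the arithmetic, not the library's implementation: the function names are made up for illustration, and the zero-traffic behaviour (full budget, zero burn) is an assumption.

```python
def error_budget_remaining(target: float, total: int, failures: int) -> float:
    """Fraction of the error budget left; negative once it is overspent."""
    if total == 0:
        return 1.0  # assumption: no traffic means nothing has been spent
    error_budget = (1 - target) * total
    return (error_budget - failures) / error_budget


def burn_rate(target: float, total: int, failures: int) -> float:
    """Observed failure ratio divided by the failure ratio the SLO allows."""
    if total == 0:
        return 0.0  # assumption: no traffic means no burn
    return (failures / total) / (1 - target)


def days_to_exhaustion(window_days: float, rate: float) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    return float("inf") if rate <= 0 else window_days / rate


# 99.9% target, 100,000 requests in the window, 58 failures:
# the budget is about 100 failures, so roughly 42% of it remains.
remaining = error_budget_remaining(0.999, 100_000, 58)

# 144 failures out of 10,000 requests in a short window is a 1.44% failure
# ratio, which is about 14.4 times the 0.1% the SLO allows.
rate = burn_rate(0.999, 10_000, 144)

print(f"remaining={remaining:.2f} burn_rate={rate:.1f} "
      f"days_left={days_to_exhaustion(30, rate):.1f}")
```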


Storage backends

The default InMemoryStore keeps a thread-safe deque trimmed to the window. For services pushing > ~100 rps you'll want a sampling or bucketed backend — wire one in by passing store= to SLOTracker. The protocol is small:

class ObservationStore(Protocol):
    def record(self, observation: Observation) -> None: ...
    def window(self, now: float, seconds: int) -> list[Observation]: ...
    def trim(self, before: float) -> None: ...
    def __len__(self) -> int: ...

A Redis sorted-set backend is on the roadmap (ZADD/ZREMRANGEBYSCORE); contributions welcome.
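
As an illustration of the protocol, here is a minimal sketch of a sampling store that keeps only a fraction of observations. It is not shipped with the library: `SamplingStore` is a hypothetical name, and the `Observation` dataclass below is a stand-in whose field names (`timestamp`, `success`) are assumed from the API table.

```python
import random
import threading
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """Stand-in for the library's Observation; field names are assumed."""

    timestamp: float
    success: bool


class SamplingStore:
    """Keeps roughly `sample_rate` of all observations, in arrival order."""

    def __init__(self, sample_rate: float = 0.1, seed: Optional[int] = None) -> None:
        if not 0.0 < sample_rate <= 1.0:
            raise ValueError("sample_rate must be in (0, 1]")
        self._sample_rate = sample_rate
        self._rng = random.Random(seed)
        self._items: deque = deque()
        self._lock = threading.Lock()

    def record(self, observation: Observation) -> None:
        with self._lock:
            # Successes and failures are dropped at the same rate, so the
            # success ratio is preserved in expectation, not exactly.
            if self._rng.random() < self._sample_rate:
                self._items.append(observation)

    def window(self, now: float, seconds: int) -> list:
        cutoff = now - seconds
        with self._lock:
            return [o for o in self._items if o.timestamp >= cutoff]

    def trim(self, before: float) -> None:
        with self._lock:
            # Assumes observations arrive in timestamp order.
            while self._items and self._items[0].timestamp < before:
                self._items.popleft()

    def __len__(self) -> int:
        with self._lock:
            return len(self._items)
```

In real use you would import the library's own `Observation` instead of the stand-in and pass the store via `store=` as described above. Sampling trades memory for noise: at low traffic or very high targets, a small `sample_rate` can leave too few failures in a short window to estimate burn rate reliably.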


Tests

pip install -e ".[dev]"
ruff check src tests && ruff format --check src tests
mypy src
pytest -v

The CI matrix runs Python 3.11 / 3.12 / 3.13.


Related work in this ecosystem

This is part of the Platform Reliability Stack — small, focused libraries that compose into a production reliability story:

  • procurement-decision-api — drafts AI Procurement Decision Cards from vendor Suite documents.
  • reliability-toolkit-rs — async rate-limit + circuit-breaker + retry + bulkhead in Rust (coming next).
  • More at kineticgain.com.

License

MIT. See LICENSE.
