SLO + error-budget tracker for Python services (FastAPI middleware + Prometheus exporter). Optional audit-stream-py integration via AUDIT_STREAM_URL.

slo-budget-tracker

SLO + error-budget tracker for Python services — drop-in FastAPI middleware, Prometheus exporter, and a small standalone library you can wire into any ASGI app or background worker.

Built around the math in the Google SRE Workbook: one rolling window, multi-window burn-rate alerts (defaults to 1h + 6h at burn rate ≥ 14.4), and an explicit error-budget remaining gauge so dashboards stop lying about reliability.


Why

Most "SLO dashboards" you find in the wild conflate availability with uptime and surface neither error budget nor burn rate. You can't tell, at a glance, whether the freshly deployed service is burning the next 30 days of error budget in the next 30 minutes. This library makes that visible by default.

Two things matter:

  1. Error budget remaining: a ratio that starts at 1.0 and falls to 0 (or below, once the budget is overspent), shown on every dashboard.
  2. Burn rate: (1 − actual_success_ratio) / (1 − target), sampled at short windows so fast-burn incidents page before the budget is spent.
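
To make those two numbers concrete, here is a minimal sketch in plain Python that does not use the library. The function names `error_budget_remaining` and `burn_rate` are illustrative assumptions, not the package API; the formulas are the ones given in this README, and the target is assumed to be strictly below 1.0.

```python
def error_budget_remaining(total: int, failures: int, target: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    if total == 0:
        return 1.0  # assumption: no traffic means nothing has been spent
    budget = (1.0 - target) * total  # failures the SLO allows for this much traffic
    return (budget - failures) / budget


def burn_rate(total: int, failures: int, target: float) -> float:
    """Observed failure ratio divided by the failure ratio the SLO allows."""
    if total == 0:
        return 0.0  # assumption: no traffic means no burn
    success_ratio = (total - failures) / total
    return (1.0 - success_ratio) / (1.0 - target)


# Long window: 100,000 requests at a 99.9% target allow about 100 failures.
remaining = error_budget_remaining(total=100_000, failures=58, target=0.999)

# Short window: 21 failures in 10,000 requests.
rate = burn_rate(total=10_000, failures=21, target=0.999)

print(f"budget left: {remaining:.2%}")
print(f"burn rate:   {rate:.1f}")
```

With these inputs the budget is a little under half spent and the short window is burning at roughly twice the allowed rate, which is worth watching but well below the 14.4 paging threshold.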

Install

pip install slo-budget-tracker
# or, with the FastAPI extras:
pip install "slo-budget-tracker[fastapi]"

Python 3.11+. Single runtime dep: prometheus-client.


Quick start — standalone library

from slo_budget_tracker import SLODefinition, SLOTracker

tracker = SLOTracker(
    SLODefinition(
        name="availability",
        target=0.999,                # three nines
        window_seconds=30 * 24 * 3600,  # 30-day rolling window
        burn_rate_windows=(3600, 21600),  # alert on 1h and 6h
        burn_rate_threshold=14.4,         # SRE workbook fast-burn page
    )
)

# Hot path — O(1)
tracker.record_success()
tracker.record_failure()

snap = tracker.snapshot()
print(f"success ratio: {snap.success_ratio:.4f}")
print(f"budget left:   {snap.error_budget_remaining:.2%}")
print(f"burn rate:     {snap.burn_rate:.2f}")

if snap.is_budget_exhausted:
    print("Freeze deploys.")

for alert in tracker.check_burn_rate():
    print(f"FAST BURN over {alert.window_seconds}s: {alert.burn_rate:.1f}x budget")

FastAPI middleware

SLOMiddleware auto-classifies every HTTP response — by default 5xx and unhandled exceptions are failures, everything else is a success. Override with your own classifier when 4xx (or specific routes) should burn budget.

from fastapi import FastAPI
from fastapi.responses import Response
from slo_budget_tracker import (
    PrometheusExporter,
    SLODefinition,
    SLOMiddleware,
    SLORegistry,
)

registry = SLORegistry()
registry.define(SLODefinition(name="availability", target=0.999))
registry.define(SLODefinition(name="freshness",    target=0.99))

app = FastAPI()
app.add_middleware(SLOMiddleware, registry=registry, slo_name="availability")

exporter = PrometheusExporter(registry)


@app.get("/metrics")
async def metrics() -> Response:
    body, content_type = exporter.render()
    return Response(content=body, media_type=content_type)


@app.get("/slo")
async def slo_snapshot() -> dict[str, object]:
    return {"slos": [s.__dict__ for s in registry.snapshot_all()]}

Point your Prometheus scrape at /metrics and you get:

slo_target{slo="availability"} 0.999
slo_success_ratio{slo="availability"} 0.9991
slo_error_budget_remaining{slo="availability"} 0.42
slo_burn_rate{slo="availability",window_seconds="3600"} 2.1
slo_burn_rate{slo="availability",window_seconds="21600"} 0.8
slo_breached{slo="availability"} 0.0

Custom classification

Default: anything < 500 and no exception is a success. Want 4xx to burn budget? Pass classify=:

app.add_middleware(
    SLOMiddleware,
    registry=registry,
    slo_name="availability",
    classify=lambda status, exc: exc is None and status < 400,
)

The classifier receives (status_code, exception_or_None) and returns True for success.
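
A lambda gets cramped once the rule has more than one condition. The sketch below is one way to write a named classifier with the same `(status_code, exception_or_None) -> bool` contract. The function name and the choice of 408 and 429 as budget-burning codes are assumptions for a hypothetical service, not library defaults.

```python
from typing import Optional

# Assumption: for this hypothetical service, timeouts and throttling are
# reliability problems even though they are reported as 4xx.
BUDGET_BURNING_4XX = frozenset({408, 429})


def classify_response(status: int, exc: Optional[BaseException]) -> bool:
    """Return True when the response should count as a success."""
    if exc is not None:
        return False  # unhandled exception: always a failure
    if status >= 500:
        return False  # server errors: same as the default behavior
    # Other 4xx (404, 422, ...) are treated as the caller's problem.
    return status not in BUDGET_BURNING_4XX
```

Pass it as `classify=classify_response` in the `add_middleware` call above. Keeping the rule in a plain function also makes it easy to unit-test without starting the app.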


API surface

Object              Purpose
SLODefinition       Frozen dataclass: name, target, window, burn-rate windows + threshold. Validates at construction.
SLOTracker          Records observations, computes snapshots and burn-rate alerts.
SLORegistry         Holds many named trackers; supports snapshot_all() and check_burn_rates().
SLOMiddleware       ASGI middleware that auto-records HTTP outcomes against a tracker.
PrometheusExporter  Renders the registry as Prometheus text format on demand.
Observation         (timestamp, success) event.
SLOSnapshot         Point-in-time view: ratios, failures, budget remaining, burn rate.
BurnRateAlert       One short window has crossed the configured threshold.
BurnRateSample      One short-window measurement attached to a snapshot.

Burn-rate math

error_budget   = (1 - target) * total_requests_in_window
budget_used    = failures_in_window
remaining_pct  = (error_budget - budget_used) / error_budget   # a ratio: 1.0 = untouched, <= 0 = exhausted

burn_rate(short_window) = (1 - success_ratio(short_window)) / (1 - target)

burn_rate == 1.0 means the service is failing at exactly the rate the SLO allows, so the budget lasts exactly one window. burn_rate == 14.4 means a full 30-day budget would be spent in about 2 days (30 / 14.4 ≈ 2.08). The default threshold of 14.4 follows the SRE Workbook fast-burn page.
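
The arithmetic behind those figures can be sketched in a few lines. The helper names below are hypothetical and not part of the library, and the calculation assumes steady traffic and a burn rate that holds constant.

```python
def days_to_exhaustion(burn_rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate holds steady."""
    if burn_rate <= 0:
        return float("inf")  # nothing is failing, so the budget never runs out
    return window_days / burn_rate


def budget_spent_fraction(
    burn_rate: float, elapsed_hours: float, window_days: float = 30.0
) -> float:
    """Fraction of the window's budget consumed after elapsed_hours at this rate."""
    return burn_rate * elapsed_hours / (window_days * 24.0)


print(f"{days_to_exhaustion(1.0):.2f} days at burn rate 1.0")
print(f"{days_to_exhaustion(14.4):.2f} days at burn rate 14.4")
print(f"{budget_spent_fraction(14.4, elapsed_hours=1):.1%} of budget spent in 1h at 14.4")
```

Read the other way round, a burn rate of 14.4 sustained for one hour consumes 2% of a 30-day budget, which is the Workbook's criterion for paging on a fast burn.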


Storage backends

The default InMemoryStore keeps a thread-safe deque trimmed to the window. For services pushing > ~100 rps you'll want a sampling or bucketed backend — wire one in by passing store= to SLOTracker. The protocol is small:

class ObservationStore(Protocol):
    def record(self, observation: Observation) -> None: ...
    def window(self, now: float, seconds: int) -> list[Observation]: ...
    def trim(self, before: float) -> None: ...
    def __len__(self) -> int: ...

A Redis sorted-set backend is on the roadmap (ZADD/ZREMRANGEBYSCORE); contributions welcome.
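
Until then, here is a minimal sketch of a custom store that satisfies the protocol above while capping memory. It is not the library's InMemoryStore. The `Observation` class is a stand-in whose field names are assumed from the API table, `BoundedStore` and `max_items` are hypothetical names, and the sketch assumes observations arrive in timestamp order.

```python
import threading
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Stand-in for the library's Observation; field names are assumed."""

    timestamp: float
    success: bool


class BoundedStore:
    """Keeps at most max_items recent observations; the oldest are dropped first."""

    def __init__(self, max_items: int = 100_000) -> None:
        self._items: deque[Observation] = deque(maxlen=max_items)
        self._lock = threading.Lock()

    def record(self, observation: Observation) -> None:
        with self._lock:
            self._items.append(observation)  # deque evicts the oldest when full

    def window(self, now: float, seconds: int) -> list[Observation]:
        cutoff = now - seconds
        with self._lock:
            return [o for o in self._items if o.timestamp >= cutoff]

    def trim(self, before: float) -> None:
        with self._lock:
            # Relies on time-ordered inserts: stop at the first recent item.
            while self._items and self._items[0].timestamp < before:
                self._items.popleft()

    def __len__(self) -> int:
        with self._lock:
            return len(self._items)
```

The trade-off is accuracy: once traffic exceeds the cap, the oldest part of the long window is lost and the full-window ratio becomes an approximation. A bucketed backend that stores per-interval counts avoids that, at the cost of a more involved `window()` implementation.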


Tests

pip install -e ".[dev]"
ruff check src tests && ruff format --check src tests
mypy src
pytest -v

The CI matrix runs Python 3.11 / 3.12 / 3.13.


Related work in this ecosystem

This is part of the Platform Reliability Stack — small, focused libraries that compose into a production reliability story:

  • procurement-decision-api — drafts AI Procurement Decision Cards from vendor Suite documents.
  • reliability-toolkit-rs — async rate-limit + circuit-breaker + retry + bulkhead in Rust (coming next).
  • More at kineticgain.com.

License

MIT. See LICENSE.
