NthLayer - The Missing Layer of Reliability

Reliability requirements as code.

Status: Alpha | License: MIT

NthLayer lets you define what "production-ready" means for a service, then generates, validates, and enforces those requirements automatically.

Define once. Generate everything. Block bad deploys.


The Problem

For every new service, teams are expected to:

  • Manually create dashboards
  • Hand-craft alerts and recording rules
  • Define SLOs and error budgets
  • Configure incident escalation
  • Decide if a service is "ready" for production

These decisions are usually made after deployment, enforced inconsistently, or revisited only during incidents.

The Solution

NthLayer moves reliability left in the delivery lifecycle:

┌─────────────────────────────────────────────────────────────────────────────┐
│ service.yaml → generate → lint → verify → check-deploy → deploy            │
│                   ↓         ↓       ↓           ↓                          │
│               artifacts   valid?  metrics?  budget ok?                     │
│                                                                            │
│ "Is this production-ready?" - answered BEFORE deployment                   │
└─────────────────────────────────────────────────────────────────────────────┘
# In your Tekton/GitHub Actions pipeline:
nthlayer apply service.yaml --lint    # Generate + validate PromQL syntax
nthlayer verify service.yaml          # Verify declared metrics exist
nthlayer check-deploy service.yaml    # Check error budget gate
# Only if all pass: deploy to production

Works with: Tekton, GitHub Actions, GitLab CI, ArgoCD, Mimir/Cortex


🚦 Shift Left Features

| Command | What It Does | Pipeline Exit Code |
|---|---|---|
| nthlayer verify | Validates declared metrics exist in Prometheus | 1 if missing metrics |
| nthlayer check-deploy | Checks error budget - blocks if exhausted | 2 if budget exhausted |
| nthlayer apply --lint | Validates PromQL syntax with pint | 1 if invalid queries |
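
The exit codes in the table can drive a pipeline step directly. The following is a minimal sketch, not part of NthLayer itself: it runs the three commands in order and stops at the first non-zero exit code. The `run_gates` and `run_cli` helpers are hypothetical names, and the `runner` parameter exists only so the logic can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable, Sequence

# Commands and documented exit codes, taken from the table above.
GATES = [
    (["nthlayer", "apply", "service.yaml", "--lint"], {1: "invalid PromQL queries"}),
    (["nthlayer", "verify", "service.yaml"], {1: "declared metrics missing"}),
    (["nthlayer", "check-deploy", "service.yaml"], {2: "error budget exhausted"}),
]


def run_cli(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_gates(runner: Callable[[Sequence[str]], int] = run_cli) -> tuple[bool, str]:
    """Run the gates in order; stop at the first non-zero exit code."""
    for cmd, known in GATES:
        code = runner(cmd)
        if code != 0:
            # Fall back to a generic message for codes the table does not list.
            reason = known.get(code, f"unexpected exit code {code}")
            return False, f"{' '.join(cmd)}: {reason}"
    return True, "all gates passed"
```

In a real pipeline the same effect usually comes for free, because most CI systems fail a step on any non-zero exit code; the wrapper is only useful when the pipeline needs to report why a deploy was blocked.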

Deployment Gate Example

[Demo recording: nthlayer check-deploy]

⚡ Quick Start

pipx install nthlayer

nthlayer apply service.yaml

# Output: generated/payment-api/
#   ├── dashboard.json       → Grafana
#   ├── alerts.yaml          → Prometheus
#   ├── slos.yaml            → OpenSLO
#   └── recording-rules.yaml → Prometheus

What NthLayer Is

  • A reliability specification that defines production-readiness
  • A compiler from service intent to operational reality
  • A CI/CD-native way to standardize reliability across teams

NthLayer integrates with existing tools (Prometheus, Grafana, PagerDuty) but operates before them - deciding what is allowed to reach production.

What NthLayer Is Not

  • Not a service catalog
  • Not an observability platform
  • Not an incident management system
  • Not a runtime control plane

NthLayer complements these systems by ensuring services meet reliability expectations before they are deployed.

Why NthLayer?

| With NthLayer | Without NthLayer |
|---|---|
| Platform teams encode reliability standards once | Standards recreated per service |
| Service teams inherit sane defaults automatically | Each team invents their own |
| "Is this production-ready?" = deterministic check | "Is this ready?" = negotiated opinion |
| Reliability is enforced by default | Reliability is reactive and inconsistent |

📥 What You Put In

1. Service Spec (service.yaml)

# Minimal example (5 lines)
name: payment-api
tier: critical
type: api
dependencies:
  - postgresql

2. Environment Variables (optional)

# 📟 PagerDuty - auto-create team, escalation policy, service
export PAGERDUTY_API_KEY=...

# 📊 Grafana - auto-push dashboards
export NTHLAYER_GRAFANA_URL=...
export NTHLAYER_GRAFANA_API_KEY=...
export NTHLAYER_GRAFANA_ORG_ID=1              # Default: 1

# 🔍 Prometheus - metric discovery for intent resolution
export NTHLAYER_PROMETHEUS_URL=...
export NTHLAYER_METRICS_USER=...              # If auth required
export NTHLAYER_METRICS_PASSWORD=...
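
Since every variable is optional, it can help to check which integrations are actually configured before a pipeline run. The sketch below is an illustration only: the variable names come from the list above, but the grouping into "required" variables per integration and the `configured_integrations` helper are assumptions, not NthLayer behavior.

```python
import os
from typing import Mapping

# Variable names are from the list above; treating these as the required
# set for each integration is an assumption made for this sketch.
INTEGRATIONS = {
    "pagerduty": ["PAGERDUTY_API_KEY"],
    "grafana": ["NTHLAYER_GRAFANA_URL", "NTHLAYER_GRAFANA_API_KEY"],
    "prometheus": ["NTHLAYER_PROMETHEUS_URL"],
}


def configured_integrations(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Report which integrations have all of their variables set and non-empty."""
    return {
        name: all(env.get(var) for var in required)
        for name, required in INTEGRATIONS.items()
    }


if __name__ == "__main__":
    for name, ready in configured_integrations().items():
        print(f"{name}: {'configured' if ready else 'not configured'}")
```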

📤 What You Get Out

| Output | File | Deploy To |
|---|---|---|
| 📊 Dashboard | generated/<service>/dashboard.json | Grafana |
| 🚨 Alerts | generated/<service>/alerts.yaml | Prometheus |
| 🎯 SLOs | generated/<service>/slos.yaml | OpenSLO-compatible |
| ⚡ Recording Rules | generated/<service>/recording-rules.yaml | Prometheus |
| 📟 PagerDuty | Created via API | Team, escalation policy, service |

📊 SLO Portfolio

Track reliability across your entire organization:

[Demo recording: nthlayer portfolio]

nthlayer portfolio              # Org-wide reliability view
nthlayer portfolio --format json  # Machine-readable for dashboards
nthlayer slo collect service.yaml  # Query current budget from Prometheus

📝 Full Service Example

name: payment-api
tier: critical              # critical | standard | low
type: api                   # api | worker | stream
team: payments

slos:
  availability: 99.95       # Generates Prometheus alerts
  latency_p99_ms: 200       # Generates histogram queries

dependencies:
  - postgresql              # Adds PostgreSQL panels
  - redis                   # Adds Redis panels
  - kubernetes              # Adds K8s pod metrics

pagerduty:
  enabled: true
  support_model: self       # self | shared | sre | business_hours
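
The inline comments in the example list the allowed values for `tier`, `type`, and `support_model`. As a rough sketch of what checking those values could look like, the hypothetical `check_spec` function below validates an already-parsed spec (a plain dict). It is not NthLayer's own validation, and treating `name`, `tier`, and `type` as required is an assumption based on the minimal example.

```python
# Allowed values copied from the comments in the example above.
ALLOWED = {
    "tier": {"critical", "standard", "low"},
    "type": {"api", "worker", "stream"},
}
SUPPORT_MODELS = {"self", "shared", "sre", "business_hours"}


def check_spec(spec: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    if not spec.get("name"):
        errors.append("name is required")
    for field, allowed in ALLOWED.items():
        value = spec.get(field)
        if value not in allowed:
            errors.append(f"{field} must be one of {sorted(allowed)}, got {value!r}")
    availability = spec.get("slos", {}).get("availability")
    if availability is not None and not 0 < availability <= 100:
        errors.append("slos.availability must be a percentage in (0, 100]")
    model = spec.get("pagerduty", {}).get("support_model")
    if model is not None and model not in SUPPORT_MODELS:
        errors.append(f"pagerduty.support_model must be one of {sorted(SUPPORT_MODELS)}")
    return errors


if __name__ == "__main__":
    spec = {"name": "payment-api", "tier": "critical", "type": "api",
            "slos": {"availability": 99.95, "latency_p99_ms": 200},
            "pagerduty": {"enabled": True, "support_model": "self"}}
    print(check_spec(spec) or "spec looks consistent")
```

For real policy enforcement, `nthlayer validate-spec` is the documented command; this sketch only illustrates the shape of the constraints.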

💰 The Value

Generation: 20 hours → 5 minutes per service

| Task | Manual Effort | With NthLayer |
|---|---|---|
| 🎯 Define SLOs & error budgets | 6 hours | Generated from tier |
| 🚨 Research & configure alerts | 4 hours | 400+ battle-tested rules |
| 📊 Build Grafana dashboards | 5 hours | 12-28 panels auto-generated |
| 📟 PagerDuty escalation setup | 2 hours | Tier-based defaults |
| 📋 Write recording rules | 3 hours | 20+ pre-computed metrics |

Validation: Catch issues before production

| Problem | Without NthLayer | With NthLayer |
|---|---|---|
| Missing metrics | Discover after deploy | nthlayer verify blocks promotion |
| Invalid PromQL | Prometheus rejects rules | --lint catches in CI |
| Policy violations | Manual review | nthlayer validate-spec enforces |
| Exhausted budget | Deploy anyway, incident | check-deploy blocks risky deploys |

At Scale

| Scale | Generation Time Saved | Incidents Prevented* |
|---|---|---|
| 🚀 50 services | 996 hours ($100K) | ~12/year |
| 📈 200 services | 3,983 hours ($400K) | ~48/year |
| 🏢 1,000 services | 19,917 hours ($2M) | ~240/year |

*Estimated, assuming a 60% reduction in "missing monitoring" incidents. Hours are valued at $100/hr engineering cost.
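
The hours in the table follow from the per-task figures above: 20 manual hours per service minus the 5 minutes NthLayer takes, multiplied by the number of services. A short sketch that reproduces the arithmetic (the dictionary keys and the `hours_saved` name are illustrative):

```python
# Manual effort per service, in hours, from the "Generation" table.
MANUAL_HOURS = {
    "slos": 6,
    "alerts": 4,
    "dashboards": 5,
    "pagerduty": 2,
    "recording_rules": 3,
}
NTHLAYER_MINUTES = 5   # "20 hours -> 5 minutes per service"
HOURLY_RATE = 100      # USD per engineering hour, from the footnote


def hours_saved(services: int) -> float:
    """Hours saved across a fleet: (manual hours - NthLayer time) per service."""
    per_service = sum(MANUAL_HOURS.values()) - NTHLAYER_MINUTES / 60
    return services * per_service


if __name__ == "__main__":
    for n in (50, 200, 1000):
        hours = hours_saved(n)
        print(f"{n} services: {round(hours):,} hours, about ${hours * HOURLY_RATE:,.0f}")
```

Rounded to the nearest hour this gives 996, 3,983, and 19,917 hours, matching the table; the dollar figures in the table are rounded further.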


🧠 How It Works

Generation

| Step | What Happens |
|---|---|
| 🎯 Intent Resolution | Maps "availability SLO" → best matching PromQL query |
| 🔀 Type Routing | API services get HTTP metrics, workers get job metrics |
| Tier Defaults | Critical = 99.95% SLO + 5min escalation, Low = 99.5% + 60min |
| 🏗️ Technology Templates | 23 built-in: PostgreSQL, Redis, Kafka, MongoDB, etc. |
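
To make the tier defaults concrete, the sketch below turns the two tiers quoted in the table into a lookup and converts each SLO target into an error budget in minutes. The 30-day window is an assumption (the table does not state one), the `standard` tier is left out because its defaults are not given here, and the names `TierDefaults` and `error_budget_minutes` are illustrative rather than NthLayer's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierDefaults:
    slo_percent: float
    escalation_minutes: int


# Only the tiers quoted in the table; "standard" is not stated there.
TIER_DEFAULTS = {
    "critical": TierDefaults(slo_percent=99.95, escalation_minutes=5),
    "low": TierDefaults(slo_percent=99.5, escalation_minutes=60),
}


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime for an availability SLO over the window (assumed 30 days)."""
    return (1 - slo_percent / 100) * window_days * 24 * 60


if __name__ == "__main__":
    for tier, defaults in TIER_DEFAULTS.items():
        budget = error_budget_minutes(defaults.slo_percent)
        print(f"{tier}: {defaults.slo_percent}% SLO, "
              f"{budget:.1f} min budget, escalate after {defaults.escalation_minutes} min")
```

Under these assumptions a critical service has roughly 21.6 minutes of error budget per 30 days and a low-tier service roughly 216 minutes, which is why the gap between 99.95% and 99.5% matters more than it looks.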

CI/CD Pipeline

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Generate  │───▶│   Validate  │───▶│   Protect   │───▶│   Deploy    │
│ nthlayer    │    │ --lint      │    │ check-deploy│    │ kubectl     │
│ apply       │    │ verify      │    │             │    │ argocd      │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      │                  │                  │
      ▼                  ▼                  ▼
  artifacts         exit 1 if          exit 2 if
  to git            invalid            budget exhausted

Works with: GitHub Actions, GitLab CI, ArgoCD, Tekton, Jenkins


🛠️ CLI Commands

Generate

nthlayer init                   # Interactive service.yaml creation
nthlayer plan service.yaml      # Preview what will be generated
nthlayer apply service.yaml     # Generate all artifacts
nthlayer apply --push           # Also push dashboard to Grafana
nthlayer apply --push-ruler     # Push alerts to Mimir/Cortex Ruler API

Validate

nthlayer apply --lint           # Validate PromQL syntax (pint)
nthlayer validate-spec service.yaml  # Check against policies (OPA/Rego)
nthlayer verify service.yaml    # Verify metrics exist in Prometheus

Protect

nthlayer check-deploy service.yaml  # Check error budget gate (exit 2 = blocked)
nthlayer portfolio              # Org-wide SLO health
nthlayer slo collect service.yaml   # Query current budget from Prometheus
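
The idea behind an error budget gate can be sketched in a few lines. This is a generic illustration of the technique, not NthLayer's implementation: the request-count inputs, the `budget_remaining` and `gate_exit_code` names, and the "block at zero remaining" threshold are all assumptions. Only the exit code 2 for a blocked deploy comes from the command reference above.

```python
def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0 is exhausted."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1 - slo_percent / 100) * total_requests
    if allowed_failures == 0:
        # A 100% SLO has no budget: any failure exhausts it.
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - failed_requests / allowed_failures


def gate_exit_code(remaining: float) -> int:
    """Return 2 (blocked) when the budget is exhausted, otherwise 0."""
    return 2 if remaining <= 0 else 0


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
    remaining = budget_remaining(99.9, 1_000_000, 500)
    print(f"remaining budget: {remaining:.0%}, exit code {gate_exit_code(remaining)}")
```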

🔮 Coming Soon

| Feature | Description | Status |
|---|---|---|
| 💰 Error Budgets | Track budget consumption, correlate with deploys | ✅ Done |
| 📊 SLO Portfolio | Org-wide reliability view across all services | ✅ Done |
| 🚦 Deployment Gates | Block deploys when error budget exhausted | ✅ Done |
| Contract Verification | Verify declared metrics exist before promotion | ✅ Done |
| 📝 Loki Integration | Generate LogQL alert rules, technology-specific log patterns | 🔨 Next |
| 🤖 AI Generation | Conversational service.yaml creation via MCP | 📋 Planned |

📦 Installation

# Recommended
pipx install nthlayer

# Or with pip
pip install nthlayer

# Verify
nthlayer --version

🌐 Live Demo

See NthLayer in action with real Grafana dashboards and generated configs:

Live Dashboards | Interactive Demo


📚 Documentation

Full Documentation - Comprehensive guides and reference.

Quick Links

  • 🚀 Quick Start - Get running in 5 minutes
  • 🔧 Setup Wizard - Interactive configuration
  • 📊 SLO Portfolio - Org-wide reliability view
  • 🔌 18 Technologies - PostgreSQL, Redis, Kafka...
  • 📖 CLI Reference - All commands
  • 🤝 Contributing - How to contribute

Build docs locally

uv sync --extra docs
uv run mkdocs serve  # Opens at http://localhost:8000

🤝 Contributing

# Install uv (https://docs.astral.sh/uv/)
curl -LsSf https://astral.sh/uv/install.sh | sh

git clone https://github.com/rsionnach/nthlayer.git
cd nthlayer
make setup    # Install deps, start services
make test     # Run tests

See CONTRIBUTING.md for details.


📄 License

MIT - See LICENSE.txt


🙏 Acknowledgments

Architecture Inspiration

  • autograf - Dynamic Prometheus metric discovery
  • Sloth - SLO specification and burn rate calculations
  • OpenSLO - SLO specification standard

CLI & Documentation

  • Rich - Terminal formatting and styling (MIT)
  • Questionary - Interactive CLI prompts (MIT)
  • MkDocs Material - Documentation theme (MIT)
  • VHS - Terminal demo recordings (MIT)
  • Nord Theme - Color palette inspiration (MIT)

