Production observability for Agent OS - OpenTelemetry traces, Prometheus metrics, Grafana dashboards

Agent OS Observability

Production-ready observability stack for the Agent OS kernel.

Status: Alpha

This package provides metrics, tracing, and dashboards for monitoring Agent OS deployments.

Features

  • Prometheus Metrics: Kernel, agent, and CMVK metrics
  • OpenTelemetry Tracing: Distributed tracing for agent operations
  • Grafana Dashboards: Pre-built dashboards for SOC, ML Ops, and SRE teams
  • Prometheus Alerts: Safety, performance, and availability alerts

Quick Start

Install Package

pip install agent-os-observability

Basic Usage

from agent_os_observability import KernelMetrics, KernelTracer

# Initialize metrics
metrics = KernelMetrics()

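# Record policy check latency.
# `policy_engine`, `action`, and `agent_id` are supplied by your application.
with metrics.policy_check_latency():
    result = policy_engine.check(action)

# Record the violation and the fact that the action was blocked
if not result.allowed:
    metrics.record_violation(agent_id, action, policy="data-access", severity="high")
    metrics.record_blocked(agent_id, action)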

# CMVK metrics
metrics.record_cmvk_verification(
    result="verified",
    confidence=0.95,
    drift_score=0.08,
    duration_seconds=2.3,
    model_count=3
)

# Expose /metrics endpoint (FastAPI example)
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/metrics")
def get_metrics():
    return Response(
        content=metrics.export(),
        media_type=metrics.content_type()
    )
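
The `with metrics.policy_check_latency():` pattern above times a block of code. As a rough illustration of how such a timer can work, here is a minimal standard-library sketch. `LatencyRecorder` and `time_block` are hypothetical names used only for this example; they are not part of `agent_os_observability`, whose implementation may differ.

```python
import time
from contextlib import contextmanager


class LatencyRecorder:
    """Hypothetical stand-in for a histogram-backed latency metric."""

    def __init__(self):
        self.samples = []  # observed durations, in seconds

    @contextmanager
    def time_block(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.samples.append(time.perf_counter() - start)


recorder = LatencyRecorder()
with recorder.time_block():
    sum(range(1000))  # placeholder for policy_engine.check(action)
```

Recording in a `finally` clause means failed policy checks still contribute a latency sample, which is usually what you want when alerting on tail latency.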

Full Observability Stack (Docker)

cd packages/observability
docker-compose up -d

# Open dashboards (`open` is the macOS command; on Linux use xdg-open)
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:16686 # Jaeger
open http://localhost:9090  # Prometheus

Metrics Reference

Kernel Metrics

Metric | Type | Description
--- | --- | ---
agent_os_violations_total | Counter | Policy violations by agent, action, policy, severity
agent_os_violations_blocked_total | Counter | Violations blocked (SIGKILL issued)
agent_os_violation_rate | Gauge | Violations per 1000 requests
agent_os_policy_check_duration_seconds | Histogram | Policy check latency
agent_os_signals_total | Counter | Signals sent by type and reason
agent_os_sigkill_total | Counter | SIGKILL signals by agent and reason
agent_os_mttr_seconds | Histogram | Mean Time To Recovery
agent_os_kernel_uptime_seconds | Gauge | Kernel uptime
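
To make the unit of `agent_os_violation_rate` concrete, the sketch below computes "violations per 1000 requests" from two raw counts. The function name and the zero-request handling are assumptions made for this example, not the package's implementation.

```python
def violations_per_thousand(violations: int, requests: int) -> float:
    """Violations per 1000 requests; 0.0 when no requests were seen."""
    if requests <= 0:
        return 0.0
    return 1000.0 * violations / requests


# 12 violations across 4000 requests
rate = violations_per_thousand(violations=12, requests=4000)
print(rate)  # 3.0
```

On this scale, a value of 10 corresponds to 1% of requests.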

CMVK Metrics

Metric | Type | Description
--- | --- | ---
agent_os_cmvk_verifications_total | Counter | Verifications by result (verified/flagged/rejected)
agent_os_cmvk_consensus_ratio | Gauge | Current model agreement (0.0-1.0)
agent_os_cmvk_model_disagreements_total | Counter | Disagreements by model pair
agent_os_cmvk_drift_score | Histogram | Drift score distribution
agent_os_cmvk_verification_duration_seconds | Histogram | Verification latency
agent_os_cmvk_model_latency_seconds | Histogram | Per-model response latency
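
One plausible reading of "model agreement (0.0-1.0)" is the fraction of models that returned the most common answer. The sketch below implements that reading. It is an assumption for illustration only; CMVK may define consensus differently.

```python
from collections import Counter


def consensus_ratio(answers):
    """Fraction of models that gave the most common answer (0.0-1.0)."""
    if not answers:
        return 0.0
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


# Three models, two of which agree
ratio = consensus_ratio(["allow", "allow", "deny"])
```

With three models and one dissenter, the ratio is 2/3, which would fall below the 80% threshold used by the CMVKLowConsensus alert.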

Agent Metrics

Metric | Type | Description
--- | --- | ---
agent_os_agent_llm_calls_total | Counter | LLM API calls by agent and model
agent_os_agent_errors_total | Counter | Errors by agent and type
agent_os_agent_execution_duration_seconds | Histogram | Task execution time

Dashboards

agent-os-overview (10 panels)

Main dashboard for SOC teams: violation rate, SIGKILL count, latency, throughput.

agent-os-cmvk (12 panels)

ML Ops dashboard: consensus rate, drift scores, model latency, verification results.

agent-os-amb (13 panels)

AMB (Agent Message Bus): throughput, queue depth, backpressure, delivery latency.

agent-os-safety (1 panel)

CISO dashboard: 30-day violation count.

Export Dashboards

python scripts/export_dashboards.py

This creates JSON files in grafana/dashboards/ for Grafana provisioning.
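
For orientation, a provisioned dashboard is a JSON document with fields such as `uid`, `title`, and `panels`. The sketch below writes a minimal file of that shape. It is not the contents of scripts/export_dashboards.py; the `write_dashboard` helper, the "Agent OS Safety" title, and the `timeseries` panel type are assumptions for this example.

```python
import json
import tempfile
from pathlib import Path


def write_dashboard(out_dir, uid, title, panel_titles):
    """Write a minimal dashboard JSON file and return its path."""
    dashboard = {
        "uid": uid,
        "title": title,
        "panels": [
            {"id": i, "type": "timeseries", "title": t}
            for i, t in enumerate(panel_titles, start=1)
        ],
    }
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{uid}.json"
    path.write_text(json.dumps(dashboard, indent=2), encoding="utf-8")
    return path


# Write into a temporary directory so the sketch has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    path = write_dashboard(
        Path(tmp) / "grafana" / "dashboards",
        "agent-os-safety",
        "Agent OS Safety",
        ["30-day violation count"],
    )
    loaded = json.loads(path.read_text(encoding="utf-8"))
```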

Alerts

Alert rules are defined in alerts/agent-os-alerts.yaml:

Critical Alerts (Page Immediately)

  • AgentOSHighViolationRate: Violation rate >1%
  • AgentOSSIGKILLSpike: >5 SIGKILLs in 5 minutes
  • AgentOSKernelCrash: Kernel panic

Warning Alerts

  • AgentOSHighPolicyLatency: p99 latency >10ms
  • CMVKLowConsensus: Consensus <80%
  • CMVKHighDrift: p95 drift >0.25
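
The numeric thresholds listed above can be restated as a small Python check, which may help when reasoning about which alerts a given set of readings would trigger. This is only a sketch: the authoritative rules live in alerts/agent-os-alerts.yaml, the snapshot keys are hypothetical names, and AgentOSKernelCrash is omitted because it has no numeric threshold.

```python
# (alert name, snapshot key, breach test); thresholds copied from the lists above.
THRESHOLDS = [
    ("AgentOSHighViolationRate", "violation_rate", lambda v: v > 0.01),
    ("AgentOSSIGKILLSpike", "sigkills_5m", lambda v: v > 5),
    ("AgentOSHighPolicyLatency", "policy_p99_seconds", lambda v: v > 0.010),
    ("CMVKLowConsensus", "consensus_ratio", lambda v: v < 0.80),
    ("CMVKHighDrift", "drift_p95", lambda v: v > 0.25),
]


def firing_alerts(snapshot):
    """Return the names of alerts whose threshold is breached."""
    return [name for name, key, breached in THRESHOLDS if breached(snapshot[key])]


snapshot = {
    "violation_rate": 0.002,      # 0.2% of requests
    "sigkills_5m": 7,             # SIGKILLs in the last 5 minutes
    "policy_p99_seconds": 0.004,  # 4 ms
    "consensus_ratio": 0.75,
    "drift_p95": 0.10,
}
print(firing_alerts(snapshot))  # ['AgentOSSIGKILLSpike', 'CMVKLowConsensus']
```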

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Your Application                          │
│  ┌──────────────────┐  ┌──────────────────┐                 │
│  │   Agent OS       │  │   KernelMetrics  │                 │
│  │   Kernel         │──│   .export()      │───► /metrics    │
│  └──────────────────┘  └──────────────────┘                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                 Docker Compose Stack                         │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐            │
│  │ Prometheus │─►│  Grafana   │  │   Jaeger   │            │
│  │   :9090    │  │   :3000    │  │  :16686    │            │
│  └────────────┘  └────────────┘  └────────────┘            │
│         │               ▲               ▲                   │
│         ▼               │               │                   │
│  ┌────────────┐        │        ┌────────────┐             │
│  │AlertManager│        │        │   OTEL     │             │
│  │   :9093    │        │        │ Collector  │             │
│  └────────────┘        │        └────────────┘             │
│         │              │               ▲                    │
│         ▼              │               │                    │
│  [Slack/PagerDuty]     └───────────────┘                   │
└─────────────────────────────────────────────────────────────┘

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Export dashboards
python scripts/export_dashboards.py

License

MIT

Download files

Source Distribution

agent_os_observability-0.2.0.tar.gz (15.6 kB)

Uploaded Source

Built Distribution

agent_os_observability-0.2.0-py3-none-any.whl (14.9 kB)

Uploaded Python 3

File details

Details for the file agent_os_observability-0.2.0.tar.gz.

File metadata

  • Download URL: agent_os_observability-0.2.0.tar.gz
  • Size: 15.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for agent_os_observability-0.2.0.tar.gz
Algorithm | Hash digest
--- | ---
SHA256 | 36f4e127e20fd9c754018891b0c85eec667f98178952bd36be787c49953dbed5
MD5 | 1d39e1b2ca15faf32d99e428aa04d477
BLAKE2b-256 | 01a820b83fbbed247fb56a59a1c1e2a5477f44445d1870892bbcb576b0fc767e

File details

Details for the file agent_os_observability-0.2.0-py3-none-any.whl.

File hashes

Hashes for agent_os_observability-0.2.0-py3-none-any.whl
Algorithm | Hash digest
--- | ---
SHA256 | 1a0f9f91168b5d771653cfe79c5b87f41b1c43f96d82ab59ac4473348490073f
MD5 | f5ead2295b5d64b55f17386b84a59361
BLAKE2b-256 | 868327d6aca3057930642b2d773800902a46d80acf9ea4dba01715d6742ad279
