Production observability for Agent OS - OpenTelemetry traces, Prometheus metrics, Grafana dashboards

Agent OS Observability

Part of Agent OS - Kernel-level governance for AI agents

Production-ready observability stack for the Agent OS kernel.

Status: Alpha

This package provides metrics, tracing, and dashboards for monitoring Agent OS deployments.

Features

  • Prometheus Metrics: Kernel, agent, and CMVK metrics
  • OpenTelemetry Tracing: Distributed tracing for agent operations
  • Grafana Dashboards: Pre-built dashboards for SOC, ML Ops, and SRE teams
  • Prometheus Alerts: Safety, performance, and availability alerts

Quick Start

Install Package

pip install "agent-os-kernel[observability]"

Basic Usage

from agent_os_observability import KernelMetrics, KernelTracer

# `policy_engine`, `action`, and `agent_id` below are placeholders
# for objects supplied by your own application.

# Initialize metrics
metrics = KernelMetrics()

# Time the policy check and record its latency
with metrics.policy_check_latency():
    result = policy_engine.check(action)

# Record the violation and the blocked action
if not result.allowed:
    metrics.record_violation(agent_id, action, policy="data-access", severity="high")
    metrics.record_blocked(agent_id, action)

# CMVK metrics
metrics.record_cmvk_verification(
    result="verified",
    confidence=0.95,
    drift_score=0.08,
    duration_seconds=2.3,
    model_count=3
)

# Expose /metrics endpoint (FastAPI example)
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/metrics")
def get_metrics():
    return Response(
        content=metrics.export(),
        media_type=metrics.content_type()
    )
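
The `with metrics.policy_check_latency():` pattern above times a block of code and records how long it took. As a rough illustration of how such a timing context manager can work, here is a minimal standard-library sketch. `LatencyRecorder`, `time_block`, and `observations` are hypothetical names chosen for this example; this is not the package's implementation.

```python
import time
from contextlib import contextmanager


class LatencyRecorder:
    """Hypothetical stand-in for a latency histogram (illustration only)."""

    def __init__(self):
        self.observations = []  # elapsed seconds, one entry per timed block

    @contextmanager
    def time_block(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record even if the timed block raises, so failed checks are measured too.
            self.observations.append(time.perf_counter() - start)


recorder = LatencyRecorder()
with recorder.time_block():
    sum(range(1000))  # placeholder for policy_engine.check(action)
```

Recording in the `finally` clause means a policy check that raises still contributes a latency sample, which is usually what you want when watching tail latency.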

Full Observability Stack (Docker)

cd agent-governance-python/agent-os/modules/observability
docker-compose up -d

# Open dashboards (`open` is the macOS command; use `xdg-open` on Linux)
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:16686 # Jaeger
open http://localhost:9090  # Prometheus

Metrics Reference

Kernel Metrics

| Metric | Type | Description |
| --- | --- | --- |
| agent_os_violations_total | Counter | Policy violations by agent, action, policy, severity |
| agent_os_violations_blocked_total | Counter | Violations blocked (SIGKILL issued) |
| agent_os_violation_rate | Gauge | Violations per 1000 requests |
| agent_os_policy_check_duration_seconds | Histogram | Policy check latency |
| agent_os_signals_total | Counter | Signals sent by type and reason |
| agent_os_sigkill_total | Counter | SIGKILL signals by agent and reason |
| agent_os_mttr_seconds | Histogram | Mean Time To Recovery |
| agent_os_kernel_uptime_seconds | Gauge | Kernel uptime |
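
The README does not say how `agent_os_violation_rate` is computed. One plausible reading of "violations per 1000 requests" is sketched below. The function name and the `requests` input are assumptions made for illustration (no request counter appears in the table above).

```python
def violations_per_thousand(violations: int, requests: int) -> float:
    """Violations per 1000 requests; 0.0 when no requests have been seen."""
    if requests <= 0:
        return 0.0
    return 1000.0 * violations / requests


rate = violations_per_thousand(violations=12, requests=4000)
print(rate)  # 3.0
```

On this scale, a violation rate of 1% corresponds to a gauge value of 10.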

CMVK Metrics

| Metric | Type | Description |
| --- | --- | --- |
| agent_os_cmvk_verifications_total | Counter | Verifications by result (verified/flagged/rejected) |
| agent_os_cmvk_consensus_ratio | Gauge | Current model agreement (0.0-1.0) |
| agent_os_cmvk_model_disagreements_total | Counter | Disagreements by model pair |
| agent_os_cmvk_drift_score | Histogram | Drift score distribution |
| agent_os_cmvk_verification_duration_seconds | Histogram | Verification latency |
| agent_os_cmvk_model_latency_seconds | Histogram | Per-model response latency |
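
To make "model agreement" and "disagreements by model pair" concrete, the sketch below uses one possible definition: the consensus ratio is the share of models that returned the most common verdict. This definition, the function names, and the model names are assumptions for illustration; CMVK may compute these values differently.

```python
from collections import Counter
from itertools import combinations


def consensus_ratio(verdicts):
    """Fraction of models agreeing with the most common verdict (0.0-1.0)."""
    if not verdicts:
        return 0.0
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


def disagreeing_pairs(verdicts_by_model):
    """List the model pairs whose verdicts differ."""
    return [
        (a, b)
        for a, b in combinations(sorted(verdicts_by_model), 2)
        if verdicts_by_model[a] != verdicts_by_model[b]
    ]


# Hypothetical verdicts from four models
print(consensus_ratio(["verified", "verified", "verified", "flagged"]))  # 0.75
```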

Agent Metrics

| Metric | Type | Description |
| --- | --- | --- |
| agent_os_agent_llm_calls_total | Counter | LLM API calls by agent and model |
| agent_os_agent_errors_total | Counter | Errors by agent and type |
| agent_os_agent_execution_duration_seconds | Histogram | Task execution time |
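
Several of the metrics above are histograms, and percentiles such as p99 latency or p95 drift are estimated from their cumulative buckets rather than stored directly. The sketch below shows the usual approach, linear interpolation inside the bucket that contains the target rank, which is similar in spirit to Prometheus's `histogram_quantile`. The bucket boundaries and counts are made up for illustration, and the function is a simplified approximation, not Prometheus's exact algorithm.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical policy-check latency buckets, in seconds
example_buckets = [(0.005, 90), (0.010, 98), (0.025, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, example_buckets)  # roughly 0.0175 s (17.5 ms)
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets are spaced around the threshold you care about.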

Dashboards

agent-os-overview (10 panels)

Main dashboard for SOC teams: violation rate, SIGKILL count, latency, throughput.

agent-os-cmvk (12 panels)

ML Ops dashboard: consensus rate, drift scores, model latency, verification results.

agent-os-amb (13 panels)

AMB (Agent Message Bus) dashboard: throughput, queue depth, backpressure, delivery latency.

agent-os-safety (1 panel)

CISO dashboard: 30-day violation count.

Export Dashboards

python scripts/export_dashboards.py

This creates JSON files in grafana/dashboards/ for Grafana provisioning.
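
After exporting, a quick sanity check is to list each dashboard and its panel count and compare the numbers with the panel counts given above. The sketch below assumes each exported file is a plain Grafana dashboard JSON object with top-level `title` and `panels` keys; the export script's exact output layout is not documented here, so treat that as an assumption. `summarize_dashboards` is a hypothetical helper, not part of the package.

```python
import json
from pathlib import Path


def summarize_dashboards(directory):
    """Map each dashboard title to its number of top-level panels."""
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        dashboard = json.loads(path.read_text(encoding="utf-8"))
        # Fall back to the file name if the JSON has no title (assumed layout).
        title = dashboard.get("title", path.stem)
        summary[title] = len(dashboard.get("panels", []))
    return summary


if __name__ == "__main__":
    for title, count in summarize_dashboards("grafana/dashboards").items():
        print(f"{title}: {count} panels")
```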

Alerts

Alert rules are defined in alerts/agent-os-alerts.yaml:

Critical Alerts (Page Immediately)

  • AgentOSHighViolationRate: Violation rate >1%
  • AgentOSSIGKILLSpike: >5 SIGKILL in 5 minutes
  • AgentOSKernelCrash: Kernel panic

Warning Alerts

  • AgentOSHighPolicyLatency: p99 latency >10ms
  • CMVKLowConsensus: Consensus <80%
  • CMVKHighDrift: p95 drift >0.25
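
The numeric thresholds above can be expressed as simple comparisons. The following sketch evaluates them against a snapshot of readings. It is only an illustration of the thresholds: the real rules live in alerts/agent-os-alerts.yaml and are evaluated by Prometheus, and the `Snapshot` class and its field names are hypothetical. `AgentOSKernelCrash` is omitted because no numeric threshold is given for it.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time readings; field names are illustrative."""
    violation_rate_pct: float     # violations as a percentage of requests
    sigkills_last_5m: int         # SIGKILL signals in the last 5 minutes
    policy_latency_p99_ms: float  # p99 policy check latency, milliseconds
    cmvk_consensus: float         # model agreement, 0.0-1.0
    cmvk_drift_p95: float         # p95 drift score


def firing_alerts(s):
    """Return (alert name, severity) pairs whose documented threshold is crossed."""
    rules = [
        ("AgentOSHighViolationRate", "critical", s.violation_rate_pct > 1.0),
        ("AgentOSSIGKILLSpike", "critical", s.sigkills_last_5m > 5),
        ("AgentOSHighPolicyLatency", "warning", s.policy_latency_p99_ms > 10.0),
        ("CMVKLowConsensus", "warning", s.cmvk_consensus < 0.80),
        ("CMVKHighDrift", "warning", s.cmvk_drift_p95 > 0.25),
    ]
    return [(name, severity) for name, severity, firing in rules if firing]


snapshot = Snapshot(
    violation_rate_pct=1.5,
    sigkills_last_5m=2,
    policy_latency_p99_ms=17.5,
    cmvk_consensus=0.9,
    cmvk_drift_p95=0.1,
)
alerts = firing_alerts(snapshot)
```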

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Your Application                          │
│  ┌──────────────────┐  ┌──────────────────┐                 │
│  │   Agent OS       │  │   KernelMetrics  │                 │
│  │   Kernel         │──│   .export()      │───► /metrics    │
│  └──────────────────┘  └──────────────────┘                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                 Docker Compose Stack                         │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐            │
│  │ Prometheus │─►│  Grafana   │  │   Jaeger   │            │
│  │   :9090    │  │   :3000    │  │  :16686    │            │
│  └────────────┘  └────────────┘  └────────────┘            │
│         │               ▲               ▲                   │
│         ▼               │               │                   │
│  ┌────────────┐        │        ┌────────────┐             │
│  │AlertManager│        │        │   OTEL     │             │
│  │   :9093    │        │        │ Collector  │             │
│  └────────────┘        │        └────────────┘             │
│         │              │               ▲                    │
│         ▼              │               │                    │
│  [Slack/PagerDuty]     └───────────────┘                   │
└─────────────────────────────────────────────────────────────┘

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Export dashboards
python scripts/export_dashboards.py

License

MIT
