Production observability for Agent OS - OpenTelemetry traces, Prometheus metrics, Grafana dashboards

Agent OS Observability

Part of Agent OS - Kernel-level governance for AI agents

Production-ready observability stack for the Agent OS kernel.

Status: Alpha

This package provides metrics, tracing, and dashboards for monitoring Agent OS deployments.

Features

  • Prometheus Metrics: Kernel, agent, and CMVK metrics
  • OpenTelemetry Tracing: Distributed tracing for agent operations
  • Grafana Dashboards: Pre-built dashboards for SOC, ML Ops, and SRE teams
  • Prometheus Alerts: Safety, performance, and availability alerts

Quick Start

Install Package

pip install "agent-os-kernel[observability]"

Basic Usage

from agent_os_observability import KernelMetrics, KernelTracer

# Initialize metrics
metrics = KernelMetrics()

# Record policy check latency.
# policy_engine, action, and agent_id are supplied by your application.
with metrics.policy_check_latency():
    result = policy_engine.check(action)

# Record the violation and the blocked action
if not result.allowed:
    metrics.record_violation(agent_id, action, policy="data-access", severity="high")
    metrics.record_blocked(agent_id, action)

# CMVK metrics
metrics.record_cmvk_verification(
    result="verified",
    confidence=0.95,
    drift_score=0.08,
    duration_seconds=2.3,
    model_count=3
)

# Expose /metrics endpoint (FastAPI example)
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/metrics")
def get_metrics():
    return Response(
        content=metrics.export(),
        media_type=metrics.content_type()
    )
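
For orientation, a Prometheus scrape of /metrics returns plain text in the Prometheus exposition format. The sketch below renders a single counter by hand with the standard library only. `render_counter` is a hypothetical helper, not part of this package, and the label names (`agent`, `policy`, `severity`) are assumed from the metrics reference; what `metrics.export()` actually emits may differ.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Hypothetical helper for illustration; not part of agent_os_observability.
    Label values are not escaped here, so keep them free of quotes/backslashes.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "agent_os_violations_total",
    "Policy violations by agent, action, policy, severity",
    # Assumed label names and example values.
    {(("agent", "agent-1"), ("policy", "data-access"), ("severity", "high")): 2},
)
print(text)
```

The last line of the output is the sample itself: the metric name, the label set in braces, and the current value.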

Full Observability Stack (Docker)

cd agent-governance-python/agent-os/modules/observability
docker-compose up -d

# Open dashboards (`open` is the macOS launcher; on Linux use xdg-open)
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:16686 # Jaeger
open http://localhost:9090  # Prometheus
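
If the dashboards do not load, a quick reachability check can help narrow things down. The following is a minimal sketch using only the standard library. It assumes the stack is published on localhost at the ports listed above, and `is_port_open` is a hypothetical helper rather than something this package ships.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the quick-start comments; adjust if your compose file differs.
SERVICES = {"Grafana": 3000, "Jaeger": 16686, "Prometheus": 9090}

if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "up" if is_port_open("localhost", port) else "not reachable"
        print(f"{name:<10} :{port:<5} {status}")
```

An open port only shows that something is listening; it does not confirm that the service behind it is healthy.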

Metrics Reference

Kernel Metrics

| Metric | Type | Description |
|---|---|---|
| agent_os_violations_total | Counter | Policy violations by agent, action, policy, severity |
| agent_os_violations_blocked_total | Counter | Violations blocked (SIGKILL issued) |
| agent_os_violation_rate | Gauge | Violations per 1000 requests |
| agent_os_policy_check_duration_seconds | Histogram | Policy check latency |
| agent_os_signals_total | Counter | Signals sent by type and reason |
| agent_os_sigkill_total | Counter | SIGKILL signals by agent and reason |
| agent_os_mttr_seconds | Histogram | Mean Time To Recovery |
| agent_os_kernel_uptime_seconds | Gauge | Kernel uptime |
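
As a worked example of the unit used by agent_os_violation_rate, the sketch below computes violations per 1000 requests from two raw counts. `violations_per_thousand` is a hypothetical helper and the input numbers are made up; how the package derives the gauge internally is not documented here.

```python
def violations_per_thousand(violations: int, requests: int) -> float:
    """Violations per 1000 requests; 0.0 when no requests were seen."""
    if requests <= 0:
        return 0.0
    return 1000.0 * violations / requests


# Example counts (illustrative only).
rate = violations_per_thousand(violations=12, requests=4000)
print(rate)  # 3.0
```

If the 1% threshold of the AgentOSHighViolationRate alert refers to the same ratio (an assumption), it would correspond to 10 on this scale.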

CMVK Metrics

| Metric | Type | Description |
|---|---|---|
| agent_os_cmvk_verifications_total | Counter | Verifications by result (verified/flagged/rejected) |
| agent_os_cmvk_consensus_ratio | Gauge | Current model agreement (0.0-1.0) |
| agent_os_cmvk_model_disagreements_total | Counter | Disagreements by model pair |
| agent_os_cmvk_drift_score | Histogram | Drift score distribution |
| agent_os_cmvk_verification_duration_seconds | Histogram | Verification latency |
| agent_os_cmvk_model_latency_seconds | Histogram | Per-model response latency |
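
To make "model agreement" and "disagreements by model pair" concrete, here is one plausible reading, sketched with the standard library. It assumes the consensus ratio is the share of models that match the most common verdict; CMVK's actual definition may differ, and both function names and the model names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def consensus_ratio(verdicts: dict) -> float:
    """Fraction of models that agree with the most common verdict (assumed definition)."""
    if not verdicts:
        return 0.0
    _, top_count = Counter(verdicts.values()).most_common(1)[0]
    return top_count / len(verdicts)


def disagreeing_pairs(verdicts: dict) -> list:
    """Model pairs whose verdicts differ, in sorted model-name order."""
    return [
        (a, b)
        for a, b in combinations(sorted(verdicts), 2)
        if verdicts[a] != verdicts[b]
    ]


# Hypothetical per-model verdicts for one verification.
verdicts = {"model-a": "verified", "model-b": "verified", "model-c": "flagged"}
print(consensus_ratio(verdicts))    # 2 of 3 models agree
print(disagreeing_pairs(verdicts))  # [('model-a', 'model-c'), ('model-b', 'model-c')]
```

Under this reading, three models with one dissenter sit below the 80% consensus threshold used by the CMVKLowConsensus alert.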

Agent Metrics

| Metric | Type | Description |
|---|---|---|
| agent_os_agent_llm_calls_total | Counter | LLM API calls by agent and model |
| agent_os_agent_errors_total | Counter | Errors by agent and type |
| agent_os_agent_execution_duration_seconds | Histogram | Task execution time |

Dashboards

agent-os-overview (10 panels)

Main dashboard for SOC teams: violation rate, SIGKILL count, latency, throughput.

agent-os-cmvk (12 panels)

ML Ops dashboard: consensus rate, drift scores, model latency, verification results.

agent-os-amb (13 panels)

AMB (Agent Message Bus): throughput, queue depth, backpressure, delivery latency.

agent-os-safety (1 panel)

CISO dashboard: 30-day violation count.

Export Dashboards

python scripts/export_dashboards.py

This creates JSON files in grafana/dashboards/ for Grafana provisioning.
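
After exporting, it can be useful to confirm that each file contains the expected number of panels. The sketch below assumes the exported files are plain Grafana dashboard JSON with a top-level `panels` list, which is the usual shape for provisioned dashboards but is not confirmed for this script's output. `count_panels` is a hypothetical helper.

```python
import json
from pathlib import Path


def count_panels(dashboard_dir: str) -> dict:
    """Map each dashboard JSON file name to its number of top-level panels."""
    counts = {}
    for path in sorted(Path(dashboard_dir).glob("*.json")):
        dashboard = json.loads(path.read_text(encoding="utf-8"))
        # Assumes a top-level "panels" list; missing key counts as zero.
        counts[path.name] = len(dashboard.get("panels", []))
    return counts


if __name__ == "__main__":
    for name, n in count_panels("grafana/dashboards").items():
        print(f"{name}: {n} panels")
```

Panels nested inside collapsed rows are not counted by this sketch, so totals may come out lower than the figures listed above.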

Alerts

Alert rules are defined in alerts/agent-os-alerts.yaml:

Critical Alerts (Page Immediately)

  • AgentOSHighViolationRate: Violation rate >1%
  • AgentOSSIGKILLSpike: >5 SIGKILL in 5 minutes
  • AgentOSKernelCrash: Kernel panic

Warning Alerts

  • AgentOSHighPolicyLatency: p99 latency >10ms
  • CMVKLowConsensus: Consensus <80%
  • CMVKHighDrift: p95 drift >0.25
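
The thresholds above can be restated as a small Python check, which is handy for unit-testing expectations before touching the rule file. This is only a sketch: the real rules live in alerts/agent-os-alerts.yaml as Prometheus alert rules, the `Snapshot` fields are hypothetical names, and the violation rate is assumed to be a fraction of requests.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical point-in-time values; field names are illustrative."""
    violation_rate: float        # fraction of requests, 0.0-1.0 (assumed)
    sigkills_last_5m: int        # SIGKILL count over the last 5 minutes
    kernel_crashed: bool
    policy_latency_p99_s: float  # seconds
    cmvk_consensus: float        # 0.0-1.0
    cmvk_drift_p95: float


def firing_alerts(s: Snapshot) -> list:
    """Return (severity, alert name) pairs whose documented threshold is crossed."""
    rules = [
        ("critical", "AgentOSHighViolationRate", s.violation_rate > 0.01),
        ("critical", "AgentOSSIGKILLSpike", s.sigkills_last_5m > 5),
        ("critical", "AgentOSKernelCrash", s.kernel_crashed),
        ("warning", "AgentOSHighPolicyLatency", s.policy_latency_p99_s > 0.010),
        ("warning", "CMVKLowConsensus", s.cmvk_consensus < 0.80),
        ("warning", "CMVKHighDrift", s.cmvk_drift_p95 > 0.25),
    ]
    return [(severity, name) for severity, name, fired in rules if fired]


snapshot = Snapshot(
    violation_rate=0.004,
    sigkills_last_5m=7,
    kernel_crashed=False,
    policy_latency_p99_s=0.012,
    cmvk_consensus=0.91,
    cmvk_drift_p95=0.10,
)
print(firing_alerts(snapshot))
# [('critical', 'AgentOSSIGKILLSpike'), ('warning', 'AgentOSHighPolicyLatency')]
```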

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Your Application                          │
│  ┌──────────────────┐  ┌──────────────────┐                 │
│  │   Agent OS       │  │   KernelMetrics  │                 │
│  │   Kernel         │──│   .export()      │───► /metrics    │
│  └──────────────────┘  └──────────────────┘                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                 Docker Compose Stack                         │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐            │
│  │ Prometheus │─►│  Grafana   │  │   Jaeger   │            │
│  │   :9090    │  │   :3000    │  │  :16686    │            │
│  └────────────┘  └────────────┘  └────────────┘            │
│         │               ▲               ▲                   │
│         ▼               │               │                   │
│  ┌────────────┐        │        ┌────────────┐             │
│  │AlertManager│        │        │   OTEL     │             │
│  │   :9093    │        │        │ Collector  │             │
│  └────────────┘        │        └────────────┘             │
│         │              │               ▲                    │
│         ▼              │               │                    │
│  [Slack/PagerDuty]     └───────────────┘                   │
└─────────────────────────────────────────────────────────────┘

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Export dashboards
python scripts/export_dashboards.py

License

MIT
