Skip to main content

GPU Datacenter Monitoring Suite - Prometheus, Grafana & Exporters

Project description

DC Overview

PyPI License: MIT Docker

Complete GPU datacenter monitoring suite. Deploy Prometheus, Grafana, and GPU monitoring with a single command. Features unified authentication through Fleet Management and seamless integration with IPMI Monitor.

Dashboard

What's Included

Component Description Port
cryptolabs-proxy Unified HTTPS reverse proxy with Fleet authentication 443
DC Overview Server Manager web UI for managing workers 5001
Prometheus Time-series metrics database 9090
Grafana Dashboards and alerting 3000
IPMI Monitor BMC/IPMI server health monitoring (optional) 5000
node_exporter CPU, RAM, disk metrics (on workers) 9100
dc-exporter GPU VRAM temps, hotspot, power (on workers) 9835
vastai-exporter Vast.ai earnings & rentals (optional) 8622
runpod-exporter RunPod earnings & GPU utilization (optional) 8623

What's New in v1.2.0

Feature Description
CryptoLabs Vast.ai Exporter Native Vast.ai exporter built by CryptoLabs with multi-account support
Multi-Account Support Both Vast.ai and RunPod exporters support multiple API keys with account labels
Fleet Management Updates Health API shows update status for all services including new exporters
Favicon Support Service icons use actual favicons for Grafana, Prometheus, RunPod, Vast.ai
Improved Update Logic Container restart after updates now preserves original configuration

v1.1.0

Feature Description
RunPod Integration Track RunPod earnings, GPU utilization, and reliability with multi-account support
Site Name Branding Customize your landing page and IPMI Monitor with your datacenter name
Auto-Scaling Dashboards Grafana table panels automatically resize based on your server count
Vast.ai/RunPod Logs IPMI Monitor auto-collects daemon logs when exporters are enabled
Improved Dashboard Layout Machine column displays on the left for better readability

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    cryptolabs-proxy (Port 443)                   │
│              Unified Authentication & Reverse Proxy              │
│         Roles: admin | readwrite | readonly                      │
└─────────────────────────────────────────────────────────────────┘
                              │
          ┌───────────────────┼───────────────────┬───────────────┐
          ▼                   ▼                   ▼               ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ ┌─────────────┐
│  dc-overview    │ │    Grafana      │ │ Prometheus  │ │ipmi-monitor │
│  (Server Mgr)   │ │   Dashboards    │ │   Metrics   │ │ BMC Health  │
│  /dc/           │ │  /grafana/      │ │/prometheus/ │ │   /ipmi/    │
│  :5001          │ │    :3000        │ │   :9090     │ │   :5000     │
└─────────────────┘ └─────────────────┘ └─────────────┘ └─────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│   GPU Worker  │     │   GPU Worker  │     │   GPU Worker  │
│  ┌──────────┐ │     │  ┌──────────┐ │     │  ┌──────────┐ │
│  │node_exp. │ │     │  │node_exp. │ │     │  │node_exp. │ │
│  │  :9100   │ │     │  │  :9100   │ │     │  │  :9100   │ │
│  ├──────────┤ │     │  ├──────────┤ │     │  ├──────────┤ │
│  │dc-export.│ │     │  │dc-export.│ │     │  │dc-export.│ │
│  │  :9835   │ │     │  │  :9835   │ │     │  │  :9835   │ │
│  └──────────┘ │     │  └──────────┘ │     │  └──────────┘ │
└───────────────┘     └───────────────┘     └───────────────┘
   (systemd)             (systemd)             (systemd)

Master Server: Docker containers on cryptolabs network for easy management
Workers: Native systemd services for GPU compatibility (Vast.ai, RunPod, etc.)


Quick Start

Automated Deployment (Recommended)

Deploy everything with a single command using a config file:

# Install from dev branch (latest features)
pip install git+https://github.com/cryptolabsza/cryptolabs-proxy.git@dev --break-system-packages
pip install git+https://github.com/cryptolabsza/dc-overview.git@dev --break-system-packages

# Deploy with config file (no prompts)
sudo dc-overview quickstart -c /path/to/config.yaml -y

Or install from PyPI (stable):

pipx install dc-overview
sudo dc-overview quickstart -c config.yaml -y

Interactive Setup

For first-time users or when you don't have a config file:

sudo dc-overview quickstart

The Fleet Wizard guides you through:

  1. Site Name - Customize your datacenter branding (e.g., "CryptoLabs", "AmericanColo")
  2. Components - DC Overview, IPMI Monitor, Vast.ai exporter, RunPod exporter
  3. Credentials - Fleet admin, Grafana, SSH, BMC/IPMI
  4. Servers - Import from IPMI Monitor or enter manually
  5. SSL - Let's Encrypt or self-signed

Then deploys everything automatically without further prompts.

Note: This automatically deploys cryptolabs-proxy as the unified entry point and authentication layer for all services.


Configuration File

Create a YAML config for automated deployments. See test-config.yaml for a complete example.

# fleet-config.yaml
site_name: My Datacenter

# Fleet Management Login (unified auth via cryptolabs-proxy)
fleet_admin_user: admin
fleet_admin_pass: YOUR_ADMIN_PASSWORD

# SSH Access (for all servers)
ssh:
  username: root
  key_path: ~/.ssh/id_rsa
  port: 22

# BMC/IPMI Access (default for all servers)
bmc:
  username: admin
  password: YOUR_BMC_PASSWORD

# SSL Configuration
ssl:
  mode: letsencrypt  # Options: letsencrypt, selfsigned
  domain: dc.example.com
  email: admin@example.com

# Components to install
components:
  dc_overview: true
  ipmi_monitor: true
  vast_exporter: false    # Set to true if using Vast.ai
  runpod_exporter: false  # Set to true if using RunPod

# Vast.ai API Keys (only needed if vast_exporter is true)
# Supports multiple accounts with labels
vast:
  api_keys:
    - name: VastMain
      key: YOUR_VAST_API_KEY
    - name: VastSecondary
      key: YOUR_SECOND_VAST_API_KEY
  # Legacy single key format also supported:
  # api_key: YOUR_VAST_API_KEY

# RunPod API Keys (only needed if runpod_exporter is true)
# Supports multiple accounts with labels
runpod:
  api_keys:
    - name: RunpodCCC
      key: YOUR_RUNPOD_API_KEY
    - name: Brickbox
      key: YOUR_SECOND_RUNPOD_API_KEY

# Servers to monitor
servers:
  - name: gpu-01
    server_ip: 192.168.1.101
    bmc_ip: 192.168.1.201

  - name: gpu-02
    server_ip: 192.168.1.102
    bmc_ip: 192.168.1.202

# Grafana settings
grafana:
  admin_password: YOUR_GRAFANA_PASSWORD
  # Home dashboard: dc-overview-main, vast-dashboard, node-exporter-full, or null
  home_dashboard: dc-overview-main

# IPMI Monitor settings (if ipmi_monitor is enabled)
ipmi_monitor:
  admin_password: YOUR_IPMI_MONITOR_PASSWORD

Deploy with:

sudo dc-overview quickstart -c fleet-config.yaml -y

Security Note: Never commit config files with real credentials. Use placeholder values in examples and store actual credentials securely.


User Permissions & Authentication

Fleet Authentication

All services authenticate through cryptolabs-proxy with unified credentials:

Role DC Overview Grafana IPMI Monitor
admin Full access Admin Full access
readwrite Edit servers Editor Edit servers
readonly View only Viewer View only

How It Works

  1. User logs into Fleet Management landing page (https://domain/)
  2. Proxy sets authentication headers on all requests:
    • X-Fleet-Authenticated: true
    • X-Fleet-Auth-User: <username>
    • X-Fleet-Auth-Role: <admin|readwrite|readonly>
  3. Sub-services read headers and auto-authenticate users

Grafana Role Sync

Grafana roles are synced via API endpoint:

# Sync current user's Fleet role to Grafana
curl -X POST https://domain/dc/api/grafana/sync-role \
  -H "X-Fleet-Auth-User: admin" \
  -H "X-Fleet-Auth-Role: admin"

CLI Commands

# Setup & Deployment
dc-overview quickstart              # Interactive setup wizard
dc-overview quickstart -c FILE -y   # Deploy from config file

# Container Management
dc-overview status                  # Show container status
dc-overview logs [-f] [SERVICE]     # View logs
dc-overview stop                    # Stop all containers
dc-overview start                   # Start all containers
dc-overview restart                 # Restart containers
dc-overview upgrade                 # Pull latest images and restart

# Worker Management
dc-overview install-exporters       # Install exporters locally
dc-overview add-machine IP          # Add a worker to monitor

# SSL/Proxy
dc-overview setup-ssl               # Configure HTTPS

Development Workflow

GitHub Actions

The repository uses GitHub Actions for CI/CD:

Workflow Trigger Output
docker-build.yml Push to main, dev, tags ghcr.io/cryptolabsza/dc-overview:dev
publish.yml GitHub Release PyPI package

Dev Branch Images

Push to dev branch automatically builds Docker images:

  • ghcr.io/cryptolabsza/dc-overview:dev
  • ghcr.io/cryptolabsza/dc-overview:develop
  • ghcr.io/cryptolabsza/dc-overview:sha-<commit>

Testing Dev Builds

# Install from dev branch (cryptolabs-proxy first for SSL management)
pip install git+https://github.com/cryptolabsza/cryptolabs-proxy.git@dev --break-system-packages
pip install git+https://github.com/cryptolabsza/dc-overview.git@dev --break-system-packages

# Run quickstart
dc-overview quickstart -c /path/to/test-config.yaml -y

Deployment Flow

The quickstart command executes these steps:

  1. Prerequisites - Install Docker, nginx, ipmitool, certbot
  2. SSH Keys - Generate fleet key and deploy to workers
  3. Core Services - Start Prometheus & Grafana containers
  4. Exporters - Install node_exporter and dc-exporter on workers via SSH
  5. Prometheus Config - Configure scrape targets
  6. Dashboards - Import Grafana dashboards
  7. IPMI Monitor - Deploy if enabled (with automatic Vast.ai/RunPod log collection)
  8. Vast.ai Exporter - Deploy if enabled
  9. RunPod Exporter - Deploy if enabled (supports multiple API keys)
  10. Reverse Proxy - Configure cryptolabs-proxy with SSL and site branding

Access URLs

After setup, access your monitoring at:

Service URL Description
Fleet Landing https://domain/ Unified login & service links
DC Overview https://domain/dc/ Server management
Grafana https://domain/grafana/ Dashboards
Prometheus https://domain/prometheus/ Metrics queries
IPMI Monitor https://domain/ipmi/ BMC health (if enabled)

Port Reference

Service Port Description
cryptolabs-proxy 443 HTTPS reverse proxy
dc-overview 5001 Server Manager
ipmi-monitor 5000 BMC/IPMI monitoring
grafana 3000 Dashboards
prometheus 9090 Metrics database
node_exporter 9100 System metrics
dc-exporter 9835 GPU metrics
dcgm-exporter 9400 NVIDIA DCGM
vastai-exporter 8622 Vast.ai earnings
runpod-exporter 8623 RunPod earnings

Integration with IPMI Monitor

DC Overview and IPMI Monitor share infrastructure:

# If IPMI Monitor is already installed, quickstart detects it: IPMI Monitor Detected!
✓ CryptoLabs Proxy Already Running!
✓ Imported 12 servers from IPMI Monitor
✓ Imported SSH keys from IPMI Monitor

Shared components:

  • cryptolabs-proxy - Unified authentication
  • cryptolabs Docker network - Service communication
  • Server list - Imported automatically
  • SSH keys - Shared between services
  • Watchtower - Auto-updates for all containers

Grafana Dashboards

Pre-installed dashboards (auto-scaled to fit your fleet size):

Dashboard Description
DC Overview Fleet overview with all GPU metrics
DC Exporter Details Detailed GPU metrics (VRAM temp, hotspot, power, PCIe errors)
Node Exporter Full CPU, RAM, disk, network
NVIDIA DCGM Exporter GPU performance metrics
Vast Dashboard Vast.ai provider earnings & machine status
RunPod Dashboard RunPod earnings, GPU utilization & reliability
IPMI Monitor BMC/IPMI sensor data

Note: Dashboard table panels automatically scale based on your server count to ensure all machines are visible without scrolling.


Manual Worker Setup

If automatic SSH deployment fails:

# On each GPU worker
pip install dc-overview
sudo dc-overview install-exporters

Or install exporters individually:

# Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
sudo systemctl enable --now node_exporter

# DC Exporter (GPU metrics)
curl -L https://github.com/cryptolabsza/dc-exporter-releases/releases/latest/download/dc-exporter-rs -o /usr/local/bin/dc-exporter-rs
chmod +x /usr/local/bin/dc-exporter-rs
sudo systemctl enable --now dc-exporter

Troubleshooting

Check Status

dc-overview status
docker ps
docker logs cryptolabs-proxy

View Service Logs

dc-overview logs -f              # All services
docker logs -f dc-overview       # Server Manager
docker logs -f grafana           # Grafana
docker logs -f prometheus        # Prometheus

Restart Services

dc-overview restart
# Or individually:
docker restart dc-overview grafana prometheus

Update to Latest

dc-overview upgrade              # Pull and restart containers
pipx upgrade dc-overview         # Update CLI tool

Test Exporter Connectivity

curl http://worker-ip:9100/metrics  # node_exporter
curl http://worker-ip:9835/metrics  # dc-exporter

Nginx Config Issues

docker exec cryptolabs-proxy nginx -t  # Test config
docker exec cryptolabs-proxy nginx -s reload  # Reload

Marketplace Exporters

Vast.ai Exporter

CryptoLabs-built Prometheus exporter for Vast.ai host metrics.

# Single account
docker run -d --name vastai-exporter \
  -p 8622:8622 \
  ghcr.io/cryptolabsza/vastai-exporter:latest \
  -api-key YOUR_API_KEY

# Multiple accounts
docker run -d --name vastai-exporter \
  -p 8622:8622 \
  ghcr.io/cryptolabsza/vastai-exporter:latest \
  -api-key VastMain:KEY1 \
  -api-key VastSecondary:KEY2

# Using environment variable
docker run -d --name vastai-exporter \
  -p 8622:8622 \
  -e VASTAI_API_KEYS="VastMain:KEY1,VastSecondary:KEY2" \
  ghcr.io/cryptolabsza/vastai-exporter:latest

Metrics exposed:

  • vastai_account_balance - Account balance in USD
  • vast_machine_* - Machine status (Listed, Verified, Reliability, timeout)
  • vastai_machine_* - Detailed metrics (disk, inet, rentals, earnings)
  • vastai_machine_gpu_occupancy - Per-GPU occupancy state

RunPod Exporter

CryptoLabs-built Prometheus exporter for RunPod host metrics.

# Single account
docker run -d --name runpod-exporter \
  -p 8623:8623 \
  ghcr.io/cryptolabsza/runpod-exporter:latest \
  -api-key YOUR_API_KEY

# Multiple accounts
docker run -d --name runpod-exporter \
  -p 8623:8623 \
  ghcr.io/cryptolabsza/runpod-exporter:latest \
  -api-key RunpodCCC:KEY1 \
  -api-key Brickbox:KEY2

Metrics exposed:

  • runpod_host_balance - Host balance per account
  • runpod_machine_* - GPU counts, earnings, uptime, listings
  • runpod_account_* - Account-level totals

Related Projects

Project Description
ipmi-monitor BMC/IPMI health monitoring
dc-exporter GPU VRAM temperature exporter
cryptolabs-proxy Unified reverse proxy with auth
vastai-exporter Vast.ai host metrics exporter
runpod-exporter RunPod host metrics exporter

Support


License

MIT License - see LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dc_overview-1.1.1.tar.gz (228.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dc_overview-1.1.1-py3-none-any.whl (240.8 kB view details)

Uploaded Python 3

File details

Details for the file dc_overview-1.1.1.tar.gz.

File metadata

  • Download URL: dc_overview-1.1.1.tar.gz
  • Upload date:
  • Size: 228.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for dc_overview-1.1.1.tar.gz
Algorithm Hash digest
SHA256 225d4f7903005fe5b97948ccfdd602977f39a488809bfd838f018478efdddc3f
MD5 5914451a589057ffe14f814919694d72
BLAKE2b-256 f6d68d78236b11a99daca391e8fa634d9442e43d97ac3d2a7c13f9f92fea8e9c

See more details on using hashes here.

File details

Details for the file dc_overview-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: dc_overview-1.1.1-py3-none-any.whl
  • Upload date:
  • Size: 240.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for dc_overview-1.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d79d72b14ab7a2673759b9cc78be450f75dcb4ce9ec8c782f314fbbf6dab79a7
MD5 0c29eff544bdc0bd9d1dfc7f51b9c3b0
BLAKE2b-256 c6f6998e5b4196d6d5f8fc7db486d0c806876d8f8f7ac70ad68e8f62d5da9c1a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page