GPU Datacenter Monitoring Suite - Prometheus, Grafana & Exporters
Project description
DC Overview
Complete GPU datacenter monitoring suite. Deploy Prometheus, Grafana, and GPU monitoring with a single command. Features unified authentication through Fleet Management and seamless integration with IPMI Monitor.
What's Included
| Component | Description | Port |
|---|---|---|
| cryptolabs-proxy | Unified HTTPS reverse proxy with Fleet authentication | 443 |
| DC Overview | Server Manager web UI for managing workers | 5001 |
| Prometheus | Time-series metrics database | 9090 |
| Grafana | Dashboards and alerting | 3000 |
| IPMI Monitor | BMC/IPMI server health monitoring (optional) | 5000 |
| node_exporter | CPU, RAM, disk metrics (on workers) | 9100 |
| dc-exporter | GPU VRAM temps, hotspot, power (on workers) | 9835 |
| vastai-exporter | Vast.ai earnings & rentals (optional) | 8622 |
| runpod-exporter | RunPod earnings & GPU utilization (optional) | 8623 |
What's New in v1.1.1
| Feature | Description |
|---|---|
| CryptoLabs Vast.ai Exporter | Native Vast.ai exporter built by CryptoLabs with multi-account support |
| Multi-Account Support | Both Vast.ai and RunPod exporters support multiple API keys with account labels |
| Fleet Management Updates | Health API shows update status for all services including new exporters |
| Favicon Support | Service icons use actual favicons for Grafana, Prometheus, RunPod, Vast.ai |
| Improved Update Logic | Container restart after updates now preserves original configuration |
v1.1.1
| Feature | Description |
|---|---|
| RunPod Integration | Track RunPod earnings, GPU utilization, and reliability with multi-account support |
| Site Name Branding | Customize your landing page and IPMI Monitor with your datacenter name |
| Auto-Scaling Dashboards | Grafana table panels automatically resize based on your server count |
| Vast.ai/RunPod Logs | IPMI Monitor auto-collects daemon logs when exporters are enabled |
| Improved Dashboard Layout | Machine column displays on the left for better readability |
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ cryptolabs-proxy (Port 443) │
│ Unified Authentication & Reverse Proxy │
│ Roles: admin | readwrite | readonly │
└─────────────────────────────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┬───────────────┐
▼ ▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ ┌─────────────┐
│ dc-overview │ │ Grafana │ │ Prometheus │ │ipmi-monitor │
│ (Server Mgr) │ │ Dashboards │ │ Metrics │ │ BMC Health │
│ /dc/ │ │ /grafana/ │ │/prometheus/ │ │ /ipmi/ │
│ :5001 │ │ :3000 │ │ :9090 │ │ :5000 │
└─────────────────┘ └─────────────────┘ └─────────────┘ └─────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ GPU Worker │ │ GPU Worker │ │ GPU Worker │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │node_exp. │ │ │ │node_exp. │ │ │ │node_exp. │ │
│ │ :9100 │ │ │ │ :9100 │ │ │ │ :9100 │ │
│ ├──────────┤ │ │ ├──────────┤ │ │ ├──────────┤ │
│ │dc-export.│ │ │ │dc-export.│ │ │ │dc-export.│ │
│ │ :9835 │ │ │ │ :9835 │ │ │ │ :9835 │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
└───────────────┘ └───────────────┘ └───────────────┘
(systemd) (systemd) (systemd)
Master Server: Docker containers on cryptolabs network for easy management
Workers: Native systemd services for GPU compatibility (Vast.ai, RunPod, etc.)
Quick Start
Automated Deployment (Recommended)
Deploy everything with a single command using a config file:
# Install from dev branch (latest features)
pip install git+https://github.com/cryptolabsza/cryptolabs-proxy.git@dev --break-system-packages
pip install git+https://github.com/cryptolabsza/dc-overview.git@dev --break-system-packages
# Deploy with config file (no prompts)
sudo dc-overview setup -c /path/to/config.yaml -y
Or install from PyPI (stable):
apt install pipx -y && pipx ensurepath
source ~/.bashrc
pipx install dc-overview
sudo dc-overview setup -c config.yaml -y
Interactive Setup
For first-time users or when you don't have a config file:
sudo dc-overview setup
The Fleet Wizard guides you through:
- Site Name - Customize your datacenter branding (e.g., "CryptoLabs", "AmericanColo")
- Components - DC Overview, IPMI Monitor, Vast.ai exporter, RunPod exporter
- Credentials - Fleet admin, Grafana, SSH, BMC/IPMI
- Servers - Import from IPMI Monitor or enter manually
- SSL - Let's Encrypt or self-signed
Then deploys everything automatically without further prompts.
Note: This automatically deploys cryptolabs-proxy as the unified entry point and authentication layer for all services.
Configuration File
Create a YAML config for automated deployments. See test-config.yaml for a complete example.
# fleet-config.yaml
site_name: My Datacenter
# Fleet Management Login (unified auth via cryptolabs-proxy)
fleet_admin_user: admin
fleet_admin_pass: YOUR_ADMIN_PASSWORD
# SSH Access (for all servers)
ssh:
username: root
key_path: ~/.ssh/id_rsa
port: 22
# BMC/IPMI Access (default for all servers)
bmc:
username: admin
password: YOUR_BMC_PASSWORD
# SSL Configuration
ssl:
mode: letsencrypt # Options: letsencrypt, selfsigned
domain: dc.example.com
email: admin@example.com
# Components to install
components:
dc_overview: true
ipmi_monitor: true
vast_exporter: false # Set to true if using Vast.ai
runpod_exporter: false # Set to true if using RunPod
# Vast.ai API Keys (only needed if vast_exporter is true)
# Supports multiple accounts with labels
vast:
api_keys:
- name: VastMain
key: YOUR_VAST_API_KEY
- name: VastSecondary
key: YOUR_SECOND_VAST_API_KEY
# Legacy single key format also supported:
# api_key: YOUR_VAST_API_KEY
# RunPod API Keys (only needed if runpod_exporter is true)
# Supports multiple accounts with labels
runpod:
api_keys:
- name: RunpodCCC
key: YOUR_RUNPOD_API_KEY
- name: Brickbox
key: YOUR_SECOND_RUNPOD_API_KEY
# Servers to monitor
servers:
- name: gpu-01
server_ip: 192.168.1.101
bmc_ip: 192.168.1.201
- name: gpu-02
server_ip: 192.168.1.102
bmc_ip: 192.168.1.202
# Grafana settings
grafana:
admin_password: YOUR_GRAFANA_PASSWORD
# Home dashboard: dc-overview-main, vast-dashboard, node-exporter-full, or null
home_dashboard: dc-overview-main
# IPMI Monitor settings (if ipmi_monitor is enabled)
ipmi_monitor:
admin_password: YOUR_IPMI_MONITOR_PASSWORD
Deploy with:
sudo dc-overview setup -c fleet-config.yaml -y
Security Note: Never commit config files with real credentials. Use placeholder values in examples and store actual credentials securely.
User Permissions & Authentication
Fleet Authentication
All services authenticate through cryptolabs-proxy with unified credentials:
| Role | DC Overview | Grafana | IPMI Monitor |
|---|---|---|---|
admin |
Full access | Admin | Full access |
readwrite |
Edit servers | Editor | Edit servers |
readonly |
View only | Viewer | View only |
How It Works
- User logs into Fleet Management landing page (
https://domain/) - Proxy sets authentication headers on all requests:
X-Fleet-Authenticated: trueX-Fleet-Auth-User: <username>X-Fleet-Auth-Role: <admin|readwrite|readonly>
- Sub-services read headers and auto-authenticate users
Grafana Role Sync
Grafana roles are synced via API endpoint:
# Sync current user's Fleet role to Grafana
curl -X POST https://domain/dc/api/grafana/sync-role \
-H "X-Fleet-Auth-User: admin" \
-H "X-Fleet-Auth-Role: admin"
CLI Commands
# Setup & Deployment
dc-overview setup # Interactive setup wizard
dc-overview setup -c FILE -y # Deploy from config file
# Container Management
dc-overview status # Show container status
dc-overview logs [-f] [SERVICE] # View logs
dc-overview stop # Stop all containers
dc-overview start # Start all containers
dc-overview restart # Restart containers
dc-overview upgrade # Pull latest images and restart
# Worker Management
dc-overview install-exporters # Install exporters locally
dc-overview add-machine IP # Add a worker to monitor
# SSL/Proxy
dc-overview setup-ssl # Configure HTTPS
Development Workflow
GitHub Actions
The repository uses GitHub Actions for CI/CD:
| Workflow | Trigger | Output |
|---|---|---|
docker-build.yml |
Push to main, dev, tags |
ghcr.io/cryptolabsza/dc-overview:dev |
publish.yml |
GitHub Release | PyPI package |
Dev Branch Images
Push to dev branch automatically builds Docker images:
ghcr.io/cryptolabsza/dc-overview:devghcr.io/cryptolabsza/dc-overview:developghcr.io/cryptolabsza/dc-overview:sha-<commit>
Testing Dev Builds
# Install from dev branch (cryptolabs-proxy first for SSL management)
pip install git+https://github.com/cryptolabsza/cryptolabs-proxy.git@dev --break-system-packages
pip install git+https://github.com/cryptolabsza/dc-overview.git@dev --break-system-packages
# Run setup
dc-overview setup -c /path/to/test-config.yaml -y
Deployment Flow
The setup command executes these steps:
- Prerequisites - Install Docker, nginx, ipmitool, certbot
- SSH Keys - Generate fleet key and deploy to workers
- Core Services - Start Prometheus & Grafana containers
- Exporters - Install node_exporter and dc-exporter on workers via SSH
- Prometheus Config - Configure scrape targets
- Dashboards - Import Grafana dashboards
- IPMI Monitor - Deploy if enabled (with automatic Vast.ai/RunPod log collection)
- Vast.ai Exporter - Deploy if enabled
- RunPod Exporter - Deploy if enabled (supports multiple API keys)
- Reverse Proxy - Configure cryptolabs-proxy with SSL and site branding
Access URLs
After setup, access your monitoring at:
| Service | URL | Description |
|---|---|---|
| Fleet Landing | https://domain/ |
Unified login & service links |
| DC Overview | https://domain/dc/ |
Server management |
| Grafana | https://domain/grafana/ |
Dashboards |
| Prometheus | https://domain/prometheus/ |
Metrics queries |
| IPMI Monitor | https://domain/ipmi/ |
BMC health (if enabled) |
Port Reference
| Service | Port | Description |
|---|---|---|
| cryptolabs-proxy | 443 | HTTPS reverse proxy |
| dc-overview | 5001 | Server Manager |
| ipmi-monitor | 5000 | BMC/IPMI monitoring |
| grafana | 3000 | Dashboards |
| prometheus | 9090 | Metrics database |
| node_exporter | 9100 | System metrics |
| dc-exporter | 9835 | GPU metrics |
| dcgm-exporter | 9400 | NVIDIA DCGM |
| vastai-exporter | 8622 | Vast.ai earnings |
| runpod-exporter | 8623 | RunPod earnings |
Integration with IPMI Monitor
DC Overview and IPMI Monitor share infrastructure and auto-detect each other's configuration:
# If IPMI Monitor is already installed, setup detects it:
✓ IPMI Monitor Detected!
✓ CryptoLabs Proxy Already Running!
✓ Fleet admin: admin (from existing proxy)
✓ Site name: My GPU Farm (from existing proxy)
✓ AI license key detected from existing deployment
✓ Imported 12 servers from IPMI Monitor
✓ Imported SSH keys from IPMI Monitor
Shared components:
- cryptolabs-proxy - Unified authentication, reverse proxy, and cryptolabs-watchtower for auto-updates
- cryptolabs Docker network - Service communication
- Server list - Imported automatically
- SSH keys - Shared between services
cryptolabs-watchtower is deployed by cryptolabs-proxy when the proxy is configured. It primarily auto-updates cryptolabs-proxy (the main entry point); other labeled containers (dc-overview, prometheus, grafana, ipmi-monitor, etc.) can also be updated. Fleet Manager UI has manual update for services.
Cross-Tool Config Auto-Detection
Both setup commands can be run in either order. The second tool automatically reuses configuration from the first:
# Scenario A: dc-overview first, ipmi-monitor second
sudo dc-overview setup -c dc-config.yaml -y
sudo ipmi-monitor setup -c ipmi-config.yaml -y # Skips credential prompts
# Scenario B: ipmi-monitor first, dc-overview second
sudo ipmi-monitor setup -c ipmi-config.yaml -y
sudo dc-overview setup -c dc-config.yaml -y # Skips credential prompts
Auto-detected from an existing proxy:
| Value | Source |
|---|---|
| Fleet admin credentials | Proxy env vars (FLEET_ADMIN_USER / FLEET_ADMIN_PASS) |
| Site name | Proxy env var (SITE_NAME) |
| AI / Watchdog license key | Proxy env var (WATCHDOG_API_KEY) or /etc/ipmi-monitor/.env |
| Domain and SSL mode | Proxy nginx config |
| SSH keys | /etc/ipmi-monitor/ssh_keys/ or /etc/dc-overview/ssh_keys/ |
| Server list | IPMI Monitor database or servers.yaml |
Priority order (highest to lowest):
| Priority | Source | When used |
|---|---|---|
| 1 | Config file (-c) |
Always takes precedence if value is present |
| 2 | Running proxy env vars | Fills in missing values from config file |
| 3 | Interactive prompt | Only if neither of the above provides a value |
Note: Config file values always win. If your two config files specify different credentials, each deployment uses its own file's values. The auto-detection only fills in values missing from the config file — it never silently overrides what you explicitly set.
Grafana Dashboards
Pre-installed dashboards (auto-scaled to fit your fleet size):
| Dashboard | Description |
|---|---|
| DC Overview | Fleet overview with all GPU metrics |
| DC Exporter Details | Detailed GPU metrics (VRAM temp, hotspot, power, PCIe errors) |
| Node Exporter Full | CPU, RAM, disk, network |
| NVIDIA DCGM Exporter | GPU performance metrics |
| Vast Dashboard | Vast.ai provider earnings & machine status |
| RunPod Dashboard | RunPod earnings, GPU utilization & reliability |
| IPMI Monitor | BMC/IPMI sensor data |
Note: Dashboard table panels automatically scale based on your server count to ensure all machines are visible without scrolling.
Manual Worker Setup
If automatic SSH deployment fails:
# On each GPU worker
pipx install dc-overview
sudo dc-overview install-exporters
Or install exporters individually:
# Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-*.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
sudo systemctl enable --now node_exporter
# DC Exporter (GPU metrics)
curl -L https://github.com/cryptolabsza/dc-exporter-releases/releases/latest/download/dc-exporter-rs -o /usr/local/bin/dc-exporter-rs
chmod +x /usr/local/bin/dc-exporter-rs
sudo systemctl enable --now dc-exporter
Troubleshooting
Check Status
dc-overview status
docker ps
docker logs cryptolabs-proxy
View Service Logs
dc-overview logs -f # All services
docker logs -f dc-overview # Server Manager
docker logs -f grafana # Grafana
docker logs -f prometheus # Prometheus
Restart Services
dc-overview restart
# Or individually:
docker restart dc-overview grafana prometheus
Update to Latest
dc-overview upgrade # Pull and restart containers
pipx upgrade dc-overview # Update CLI tool
Test Exporter Connectivity
curl http://worker-ip:9100/metrics # node_exporter
curl http://worker-ip:9835/metrics # dc-exporter
Nginx Config Issues
docker exec cryptolabs-proxy nginx -t # Test config
docker exec cryptolabs-proxy nginx -s reload # Reload
Marketplace Exporters
Vast.ai Exporter
CryptoLabs-built Prometheus exporter for Vast.ai host metrics.
# Single account
docker run -d --name vastai-exporter \
-p 8622:8622 \
ghcr.io/cryptolabsza/vastai-exporter:latest \
-api-key YOUR_API_KEY
# Multiple accounts
docker run -d --name vastai-exporter \
-p 8622:8622 \
ghcr.io/cryptolabsza/vastai-exporter:latest \
-api-key VastMain:KEY1 \
-api-key VastSecondary:KEY2
# Using environment variable
docker run -d --name vastai-exporter \
-p 8622:8622 \
-e VASTAI_API_KEYS="VastMain:KEY1,VastSecondary:KEY2" \
ghcr.io/cryptolabsza/vastai-exporter:latest
Metrics exposed:
vastai_account_balance- Account balance in USDvast_machine_*- Machine status (Listed, Verified, Reliability, timeout)vastai_machine_*- Detailed metrics (disk, inet, rentals, earnings)vastai_machine_gpu_occupancy- Per-GPU occupancy state
RunPod Exporter
CryptoLabs-built Prometheus exporter for RunPod host metrics.
# Single account
docker run -d --name runpod-exporter \
-p 8623:8623 \
ghcr.io/cryptolabsza/runpod-exporter:latest \
-api-key YOUR_API_KEY
# Multiple accounts
docker run -d --name runpod-exporter \
-p 8623:8623 \
ghcr.io/cryptolabsza/runpod-exporter:latest \
-api-key RunpodCCC:KEY1 \
-api-key Brickbox:KEY2
Metrics exposed:
runpod_host_balance- Host balance per accountrunpod_machine_*- GPU counts, earnings, uptime, listingsrunpod_account_*- Account-level totals
Related Projects
| Project | Description |
|---|---|
| ipmi-monitor | BMC/IPMI health monitoring |
| dc-exporter | GPU VRAM temperature exporter |
| cryptolabs-proxy | Unified reverse proxy with auth |
| vastai-exporter | Vast.ai host metrics exporter |
| runpod-exporter | RunPod host metrics exporter |
Support
License
MIT License - see LICENSE for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file dc_overview-1.1.3.tar.gz.
File metadata
- Download URL: dc_overview-1.1.3.tar.gz
- Upload date:
- Size: 259.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
44b8e620355951295aa1254aa4986c65a985ef0957e2d9bc9741d1b157394e6e
|
|
| MD5 |
8bb66c59a487893eeb10c66e6c3fd589
|
|
| BLAKE2b-256 |
f5ba4d5b6bb70430cfd66b63d70fde55592b9458847aa89af2e66e3d731312ed
|
File details
Details for the file dc_overview-1.1.3-py3-none-any.whl.
File metadata
- Download URL: dc_overview-1.1.3-py3-none-any.whl
- Upload date:
- Size: 276.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
70edab383c13a9389359681d644fd5bb5e8710b268fe728c6500e307cf8cb3ac
|
|
| MD5 |
803646ce94e0037b6e9fde8bad2bdef9
|
|
| BLAKE2b-256 |
122aef231baeaa229956b6f986a7e09d2c05aafeeb07d546d2e260e36c435cd2
|