GPU Datacenter Monitoring Suite - Prometheus, Grafana & Exporters

DC Overview

Complete GPU datacenter monitoring suite. Monitor your GPU servers with Prometheus, Grafana, and optional AI-powered insights.

✨ What's Included

Component        Description                                    Port
Prometheus       Time-series database for metrics               9090
Grafana          Beautiful dashboards and alerting              3000
node_exporter    CPU, RAM, disk, network metrics                9100
dcgm-exporter    NVIDIA GPU metrics (utilization, temp, power)  9400
dc-exporter      VRAM temperature, hotspot, fan speed           9500
vastai-exporter  Vast.ai earnings and reliability (optional)    8622
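
To confirm that each component is listening on the port listed above, a plain TCP connection check is usually enough. The following is a minimal sketch, not part of dc-overview: the function names `port_open` and `check_components` are hypothetical, and the port numbers are simply copied from the table.

```python
import socket

# Default ports copied from the component table above.
DEFAULT_PORTS = {
    "prometheus": 9090,
    "grafana": 3000,
    "node_exporter": 9100,
    "dcgm-exporter": 9400,
    "dc-exporter": 9500,
    "vastai-exporter": 8622,
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_components(host, ports=None):
    """Map each component name to whether its port accepts TCP connections."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: port_open(host, port) for name, port in ports.items()}
```

For example, `check_components("192.168.1.101")` returns a dictionary such as `{"node_exporter": True, ...}`. An open port only shows that something is listening there; it does not prove the exporter is serving valid metrics.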

🚀 Quick Start

Prerequisites

  • Linux (Ubuntu 20.04+, Debian, CentOS)
  • Python 3.9+ with pip
  • Root/sudo access for installing services

One Command Setup

Ubuntu 24.04+ / Python 3.12+ (uses pipx):

sudo apt install pipx
pipx install dc-overview
sudo ~/.local/bin/dc-overview quickstart

Ubuntu 22.04 / Python 3.10 (direct pip):

pip install dc-overview
sudo dc-overview quickstart

Alternative (if you get "externally-managed-environment" error):

pip install dc-overview --break-system-packages
sudo dc-overview quickstart

For remote worker deployment, set up passwordless sudo on workers:

sudo bash -c 'echo "YOUR_USER ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/nopasswd && chmod 440 /etc/sudoers.d/nopasswd'

The wizard guides you through everything:

╭──────────────────────────────────────────────────╮
│           DC Overview - Quick Setup              │
╰──────────────────────────────────────────────────╯

Step 1: What is this machine?
  ○ GPU Worker (has GPUs to monitor)
  ● Master Server (monitors other machines)
  ○ Both (has GPUs + monitors others)

Step 2: Setting up Monitoring Dashboard
  Set Grafana admin password: ******
  ✓ Prometheus running on port 9090
  ✓ Grafana running on port 3000

Step 3: Add Machines to Monitor
  How do you want to add servers?
    ● Import from file/paste (recommended)
    ○ Enter manually

  Paste your server list:
  global:root,mypassword
  192.168.1.101
  192.168.1.102
  192.168.1.103
  [Enter]

  Installing on 192.168.1.101... ✓
  Installing on 192.168.1.102... ✓
  Installing on 192.168.1.103... ✓
  ✓ Added 3 workers to Prometheus

Step 4: Vast.ai Integration (Optional)
  Are you a Vast.ai provider? [y/N]: y
  Vast.ai API Key: ******
  ✓ vastai-exporter running (port 8622)

✓ Setup Complete!
  Grafana: http://192.168.1.100:3000

📋 Import File Format

Create a simple text file to add many servers at once:

Option 1: Global credentials (same for all)

global:root,mypassword
192.168.1.101
192.168.1.102
192.168.1.103
192.168.1.104

Option 2: Per-server credentials

192.168.1.101,root,password1
192.168.1.102,ubuntu,password2
192.168.1.103,admin,password3

Option 3: Mixed (global default + overrides)

global:root,defaultpass
192.168.1.101
192.168.1.102,ubuntu,custompass
192.168.1.103
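
The three options above follow one rule: a line with three comma-separated fields carries its own credentials, and a bare IP falls back to the `global:` line. The sketch below shows one way such a file could be parsed, for example to validate it before pasting it into the wizard. It is an illustration, not dc-overview's actual parser. `Server` and `parse_server_list` are hypothetical names, and the handling of edge cases (blank lines skipped, commas allowed inside passwords, `global:` accepted anywhere in the file) is an assumption.

```python
from typing import NamedTuple


class Server(NamedTuple):
    host: str
    user: str
    password: str


def parse_server_list(text):
    """Parse 'global:user,pass' plus 'ip' or 'ip,user,pass' lines into Server tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # First pass: find the optional global default credentials.
    default = None
    for ln in lines:
        if ln.startswith("global:"):
            user, sep, password = ln[len("global:"):].partition(",")
            if not sep:
                raise ValueError(f"expected 'global:user,password', got {ln!r}")
            default = (user.strip(), password.strip())

    # Second pass: resolve credentials for every server line.
    servers = []
    for ln in lines:
        if ln.startswith("global:"):
            continue
        # maxsplit=2 keeps any further commas inside the password field.
        parts = [p.strip() for p in ln.split(",", 2)]
        if len(parts) == 3:
            servers.append(Server(*parts))
        elif len(parts) == 1 and default is not None:
            servers.append(Server(parts[0], *default))
        else:
            raise ValueError(f"cannot determine credentials for line {ln!r}")
    return servers
```

With the Option 3 example as input, `192.168.1.101` and `192.168.1.103` resolve to `root`/`defaultpass`, while `192.168.1.102` keeps `ubuntu`/`custompass`. A bare IP in a file with no `global:` line raises `ValueError`.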

🔧 Manual Installation

On Master Server (monitoring hub)

pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server"

On GPU Workers

pip install dc-overview
sudo dc-overview quickstart
# Select "GPU Worker"

Or from the master, provide SSH credentials and the wizard installs remotely.


📊 Available Commands

dc-overview quickstart          # ⚡ One-command setup (recommended)
dc-overview status              # Check what's running
dc-overview add-machine IP      # Add another machine to monitor
dc-overview install-exporters   # Install exporters on current machine
dc-overview setup-ssl           # Set up reverse proxy with SSL

🔒 Reverse Proxy & SSL Setup

Set up a secure HTTPS frontend with a branded landing page:

Self-Signed Certificate (Default)

# Basic setup (IP access only)
sudo dc-overview setup-ssl

# With custom site name
sudo dc-overview setup-ssl --site-name "My GPU Farm"

# Include IPMI Monitor and Vast.ai
sudo dc-overview setup-ssl --ipmi --vastai

Let's Encrypt (Free SSL)

For a valid SSL certificate (no browser warnings):

sudo dc-overview setup-ssl \
  --domain monitor.example.com \
  --letsencrypt \
  --email admin@example.com \
  --ipmi --vastai

DNS Setup (Required for Domain)

Add these DNS records pointing to your server IP:

Type  Name                         Value        Purpose
A     monitor.example.com          <server-ip>  Main dashboard
A     grafana.monitor.example.com  <server-ip>  Grafana subdomain (optional)
A     ipmi.monitor.example.com     <server-ip>  IPMI subdomain (optional)
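
Let's Encrypt validation generally fails if a hostname does not yet resolve to the server, so it can help to check the records before running `setup-ssl --letsencrypt`. Below is a minimal sketch using only the standard library. `check_dns` is a hypothetical helper, not a dc-overview command, and the `resolve` parameter exists only so the lookup can be swapped out for testing.

```python
import socket


def check_dns(names, expected_ip, resolve=socket.gethostbyname):
    """Return {name: resolved_ip_or_None} for every name that does NOT
    resolve to expected_ip. An empty dict means all records look correct."""
    mismatches = {}
    for name in names:
        try:
            ip = resolve(name)
        except OSError:
            # socket.gaierror (name not found) is a subclass of OSError.
            ip = None
        if ip != expected_ip:
            mismatches[name] = ip
    return mismatches
```

For example, `check_dns(["monitor.example.com", "grafana.monitor.example.com"], "<server-ip>")`, with the placeholder replaced by the real address, should return `{}` once the A records have propagated. `socket.gethostbyname` returns IPv4 addresses only, which matches the A records above.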

After Setup

Access your monitoring at:

https://<server-ip>/             # Landing page
https://<server-ip>/grafana/     # Grafana dashboards
https://<server-ip>/prometheus/  # Prometheus UI
https://<server-ip>/ipmi/        # IPMI Monitor (if enabled)

Or with domain:

https://monitor.example.com/
https://grafana.monitor.example.com/  (if subdomain configured)

๐Ÿณ Docker Alternative

If you prefer Docker Compose:

# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      # Default admin password; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
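
The Compose file mounts a `./prometheus.yml` that is not shown here. As a rough sketch of what that file could contain, the script below renders a scrape configuration for a list of workers using the exporter ports from the component table. The job names, the 15s scrape interval, and the `render_prometheus_config` helper are assumptions for illustration. The configuration that `dc-overview quickstart` generates may differ.

```python
# Worker-side exporter ports, copied from the component table.
EXPORTER_PORTS = {
    "node_exporter": 9100,
    "dcgm-exporter": 9400,
    "dc-exporter": 9500,
}


def render_prometheus_config(workers, scrape_interval="15s"):
    """Render a prometheus.yml with one scrape job per exporter type."""
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "",
        "scrape_configs:",
    ]
    for job, port in EXPORTER_PORTS.items():
        lines.append(f"  - job_name: {job}")
        lines.append("    static_configs:")
        lines.append("      - targets:")
        for host in workers:
            # Each target is host:port; Prometheus scrapes /metrics by default.
            lines.append(f'          - "{host}:{port}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_prometheus_config(["192.168.1.101", "192.168.1.102"]), end="")
```

Redirecting the output to `prometheus.yml` next to `docker-compose.yml` gives Prometheus three jobs, each listing every worker as a static target.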

🔗 Related Tools

Tool          Purpose                              Install
IPMI Monitor  Server health, SEL logs, ECC errors  pip install ipmi-monitor
dc-exporter   GPU VRAM temperatures                Included in quickstart

📖 Full Suite Setup (Master + Workers)

For a complete datacenter setup with IPMI monitoring:

1. On Master Server

# Install dc-overview (Grafana + Prometheus)
pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server", add your workers

# Install ipmi-monitor (optional - for BMC/IPMI)
pip install ipmi-monitor
sudo ipmi-monitor quickstart

2. Workers are configured automatically

The quickstart installs exporters on workers via SSH.

3. Import your servers

Create servers.txt:

global:root,sshpassword
192.168.1.101
192.168.1.102
192.168.1.103

Then paste when prompted, or run:

dc-overview add-machine 192.168.1.101 --ssh-pass mypassword

💬 Support


📄 License

MIT License - see LICENSE for details.
