GPU Datacenter Monitoring Suite - Prometheus, Grafana & Exporters

DC Overview

Complete GPU datacenter monitoring suite. Monitor your GPU servers with Prometheus, Grafana, and optional AI-powered insights.

✨ What's Included

| Component | Description | Port |
|---|---|---|
| Prometheus | Time-series database for metrics | 9090 |
| Grafana | Dashboards and alerting | 3000 |
| node_exporter | CPU, RAM, disk, network metrics | 9100 |
| dcgm-exporter | NVIDIA GPU metrics (utilization, temperature, power) | 9400 |
| dc-exporter | VRAM temperature, hotspot, fan speed | 9500 |
| vastai-exporter | Vast.ai earnings and reliability (optional) | 8622 |
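
To confirm which of the components listed above are actually listening on a host, a plain TCP probe is often enough. The following is a minimal sketch, not part of dc-overview: the port numbers come from the table, while the names `COMPONENT_PORTS`, `port_open`, and `check_components` are hypothetical helpers introduced here for illustration.

```python
import socket

# Default ports copied from the component table; adjust if you changed them.
COMPONENT_PORTS = {
    "prometheus": 9090,
    "grafana": 3000,
    "node_exporter": 9100,
    "dcgm-exporter": 9400,
    "dc-exporter": 9500,
    "vastai-exporter": 8622,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_components(host: str, ports: dict = COMPONENT_PORTS) -> dict:
    """Map each component name to whether its port accepts TCP connections."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

For example, `check_components("192.168.1.101")` returns a dictionary such as `{"node_exporter": True, ...}`. An open port only shows that something is listening there; it does not prove the exporter is healthy.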

🚀 Quick Start

Prerequisites

  • Linux (Ubuntu 20.04+, Debian, CentOS)
  • Python 3.9+ with pip
  • Root/sudo access for installing services

One Command Setup

Ubuntu 24.04+ / Python 3.12+ (uses pipx):

sudo apt install pipx -y
pipx install dc-overview
pipx ensurepath && source ~/.bashrc
sudo dc-overview quickstart

Ubuntu 22.04 / Python 3.10 (direct pip):

pip install dc-overview
sudo dc-overview quickstart

Alternative (if you get "externally-managed-environment" error):

pip install dc-overview --break-system-packages
sudo dc-overview quickstart

For remote worker deployment, set up passwordless sudo on workers:

sudo bash -c 'echo "YOUR_USER ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/nopasswd && chmod 440 /etc/sudoers.d/nopasswd'

The wizard guides you through everything:

╭──────────────────────────────────────────────────╮
│           DC Overview - Quick Setup              │
╰──────────────────────────────────────────────────╯

Step 1: What is this machine?
  ○ GPU Worker (has GPUs to monitor)
  ● Master Server (monitors other machines)
  ○ Both (has GPUs + monitors others)

Step 2: Setting up Monitoring Dashboard
  Set Grafana admin password: ******
  ✓ Prometheus running on port 9090
  ✓ Grafana running on port 3000

Step 3: Add Machines to Monitor
  How do you want to add servers?
    ● Import from file/paste (recommended)
    ○ Enter manually

  Paste your server list:
  global:root,mypassword
  192.168.1.101
  192.168.1.102
  192.168.1.103
  [Enter]

  Installing on 192.168.1.101... ✓
  Installing on 192.168.1.102... ✓
  Installing on 192.168.1.103... ✓
  ✓ Added 3 workers to Prometheus

Step 4: Vast.ai Integration (Optional)
  Are you a Vast.ai provider? [y/N]: y
  Vast.ai API Key: ******
  ✓ vastai-exporter running (port 8622)

✓ Setup Complete!
  Grafana: http://192.168.1.100:3000
  Grafana: http://192.168.1.100:3000

📋 Import File Format

Create a simple text file to add many servers at once:

Option 1: Global credentials (same for all)

global:root,mypassword
192.168.1.101
192.168.1.102
192.168.1.103
192.168.1.104

Option 2: Per-server credentials

192.168.1.101,root,password1
192.168.1.102,ubuntu,password2
192.168.1.103,admin,password3

Option 3: Mixed (global default + overrides)

global:root,defaultpass
192.168.1.101
192.168.1.102,ubuntu,custompass
192.168.1.103
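
The three options above follow one rule: a `global:` line sets default credentials, and any server line that carries its own `user,password` overrides them. The sketch below shows one way such a file could be parsed. It is an illustration only and not dc-overview's actual parser; `Server` and `parse_server_list` are hypothetical names, and the sketch assumes the `global:` line appears before the servers that rely on it.

```python
from typing import List, NamedTuple, Optional


class Server(NamedTuple):
    host: str
    user: Optional[str]
    password: Optional[str]


def parse_server_list(text: str) -> List[Server]:
    """Parse 'global:user,pass' defaults and 'ip[,user,pass]' server lines."""
    default_user: Optional[str] = None
    default_password: Optional[str] = None
    servers: List[Server] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("global:"):
            # Everything after the first comma is the password.
            default_user, _, default_password = line[len("global:"):].partition(",")
            continue
        host, _, rest = line.partition(",")
        if rest:
            # Per-server credentials override the global default.
            user, _, password = rest.partition(",")
        else:
            user, password = default_user, default_password
        servers.append(Server(host.strip(), user, password))
    return servers


mixed = """global:root,defaultpass
192.168.1.101
192.168.1.102,ubuntu,custompass
192.168.1.103
"""
for server in parse_server_list(mixed):
    print(server.host, server.user)
```

Run against the Option 3 example, the second server resolves to user `ubuntu` while the other two fall back to `root`. Because plaintext passwords live in this file, it is worth restricting its permissions (for example `chmod 600 servers.txt`).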

🔧 Manual Installation

On Master Server (monitoring hub)

pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server"

On GPU Workers

pip install dc-overview
sudo dc-overview quickstart
# Select "GPU Worker"

Alternatively, run the wizard on the master and provide SSH credentials; it then installs the exporters on the workers remotely.

📊 Available Commands

dc-overview quickstart          # ⚡ One-command setup (recommended)
dc-overview status              # Check what's running
dc-overview add-machine IP      # Add another machine to monitor
dc-overview install-exporters   # Install exporters on current machine
dc-overview setup-ssl           # Set up reverse proxy with SSL

🔒 Reverse Proxy & SSL Setup

Set up a secure HTTPS frontend with a branded landing page:

Self-Signed Certificate (Default)

# Basic setup (IP access only)
sudo dc-overview setup-ssl

# With custom site name
sudo dc-overview setup-ssl --site-name "My GPU Farm"

# Include IPMI Monitor and Vast.ai
sudo dc-overview setup-ssl --ipmi --vastai

Let's Encrypt (Free SSL)

For a valid SSL certificate (no browser warnings):

sudo dc-overview setup-ssl \
  --domain monitor.example.com \
  --letsencrypt \
  --email admin@example.com \
  --ipmi --vastai

DNS Setup (Required for Domain)

Add these DNS records pointing to your server IP:

| Type | Name | Value | Purpose |
|---|---|---|---|
| A | monitor.example.com | <server-ip> | Main dashboard |
| A | grafana.monitor.example.com | <server-ip> | Grafana subdomain (optional) |
| A | ipmi.monitor.example.com | <server-ip> | IPMI subdomain (optional) |
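
Before requesting a Let's Encrypt certificate, it can help to confirm that the records above already resolve to the server, since certificate issuance generally fails while DNS still points elsewhere. The following is a small sketch using only the standard library; `check_dns` is a hypothetical helper, and the `resolve` parameter exists only so the lookup can be swapped out in tests.

```python
import socket
from typing import Callable, Dict, Iterable


def check_dns(
    names: Iterable[str],
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, bool]:
    """Return, per hostname, whether it resolves to expected_ip (IPv4 only)."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name) == expected_ip
        except OSError:
            # socket.gaierror (name not found) is a subclass of OSError.
            results[name] = False
    return results
```

A call such as `check_dns(["monitor.example.com", "grafana.monitor.example.com"], "<server-ip>")` with your real address returns `True` for each record that is live. Newly created records can take a while to propagate, so a `False` result shortly after editing DNS is not necessarily an error.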

After Setup

Access your monitoring at:

https://<server-ip>/           # Landing page
https://<server-ip>/grafana/   # Grafana dashboards
https://<server-ip>/prometheus/  # Prometheus UI
https://<server-ip>/ipmi/      # IPMI Monitor (if enabled)

Or with domain:

https://monitor.example.com/
https://grafana.monitor.example.com/  (if subdomain configured)
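
The paths listed above can be checked from a script after setup. This is a hedged sketch rather than a dc-overview feature: `endpoint_urls`, `insecure_context`, and `probe` are hypothetical helpers. With the default self-signed certificate, normal verification fails, so the sketch offers a context that skips verification; use it only for a quick reachability check on a trusted network.

```python
import ssl
import urllib.error
import urllib.request
from typing import List, Optional


def endpoint_urls(host: str, ipmi: bool = False) -> List[str]:
    """Build the URLs listed in the 'After Setup' section for a host."""
    base = f"https://{host}"
    urls = [f"{base}/", f"{base}/grafana/", f"{base}/prometheus/"]
    if ipmi:
        urls.append(f"{base}/ipmi/")
    return urls


def insecure_context() -> ssl.SSLContext:
    """TLS context that skips verification (for self-signed certificates)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def probe(url: str, context: Optional[ssl.SSLContext] = None,
          timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with a 4xx/5xx status
    except OSError:
        return None  # connection, TLS, or timeout failure
```

For instance, `[probe(u, insecure_context()) for u in endpoint_urls("192.168.1.100", ipmi=True)]` yields one status per path. With a Let's Encrypt certificate and a domain name, pass no context so that the certificate is verified normally.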

🐳 Docker Alternative

If you prefer Docker Compose:

# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      # Prometheus config; ./prometheus.yml must exist next to this file
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      # Placeholder admin password; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
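
The Compose file mounts a `prometheus.yml` that this README does not show. As a rough sketch, the script below renders a minimal one with a scrape job per exporter, using Prometheus's standard `scrape_configs` and `static_configs` layout. The job names, the 15-second interval, and the helper `render_prometheus_config` are assumptions made for illustration; the file that `dc-overview quickstart` generates may look different.

```python
from typing import Dict, Iterable


def render_prometheus_config(
    workers: Iterable[str],
    exporter_ports: Dict[str, int],
    scrape_interval: str = "15s",
) -> str:
    """Render a minimal prometheus.yml with one scrape job per exporter."""
    hosts = list(workers)
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "",
        "scrape_configs:",
    ]
    for job, port in exporter_ports.items():
        lines.append(f'  - job_name: "{job}"')
        lines.append("    static_configs:")
        lines.append("      - targets:")
        for host in hosts:
            lines.append(f'          - "{host}:{port}"')
    return "\n".join(lines) + "\n"


config = render_prometheus_config(
    workers=["192.168.1.101", "192.168.1.102"],
    # Exporter ports from the "What's Included" table.
    exporter_ports={"node_exporter": 9100, "dcgm-exporter": 9400, "dc-exporter": 9500},
)
print(config)
```

Redirect the output to `prometheus.yml` beside `docker-compose.yml` before running `docker compose up`. The exporters themselves still have to be running on each worker; this file only tells Prometheus where to scrape.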

🔗 Related Tools

| Tool | Purpose | Install |
|---|---|---|
| IPMI Monitor | Server health, SEL logs, ECC errors | pip install ipmi-monitor |
| dc-exporter | GPU VRAM temperatures | Included in quickstart |

📖 Full Suite Setup (Master + Workers)

For a complete datacenter setup with IPMI monitoring:

1. On Master Server

# Install dc-overview (Grafana + Prometheus)
pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server", add your workers

# Install ipmi-monitor (optional - for BMC/IPMI)
pip install ipmi-monitor
sudo ipmi-monitor quickstart

2. Workers are configured automatically

The quickstart installs exporters on workers via SSH.
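
Once the exporters are installed, Prometheus's built-in `up` metric reports whether each scrape target is reachable (`1`) or not (`0`). The sketch below queries it through the standard Prometheus HTTP API (`/api/v1/query`) and lists the targets that are down. The function names are hypothetical, and the master address in the usage note is taken from the wizard example above.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_up(prometheus_url: str, timeout: float = 5.0) -> dict:
    """Run the instant query 'up' against a Prometheus server."""
    query = urlencode({"query": "up"})
    url = f"{prometheus_url.rstrip('/')}/api/v1/query?{query}"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def down_targets(payload: dict) -> List[Tuple[str, str]]:
    """Return sorted (job, instance) pairs whose 'up' value is not 1."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    down = []
    for sample in payload["data"]["result"]:
        # Each sample looks like {"metric": {...}, "value": [timestamp, "1"]}.
        if sample["value"][1] != "1":
            labels = sample["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return sorted(down)
```

For example, `down_targets(fetch_up("http://192.168.1.100:9090"))` returns an empty list when every worker is being scraped successfully.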

3. Import your servers

Create servers.txt:

global:root,sshpassword
192.168.1.101
192.168.1.102
192.168.1.103

Paste the file contents when the wizard prompts for a server list, or add machines one at a time:

dc-overview add-machine 192.168.1.101 --ssh-pass sshpassword

📄 License

MIT License - see LICENSE for details.
