GPU Datacenter Monitoring Suite - Prometheus, Grafana & Exporters

DC Overview

Complete GPU datacenter monitoring suite. Monitor your GPU servers with Prometheus, Grafana, and optional AI-powered insights.

✨ What's Included

Component        Description                                    Port
Prometheus       Time-series database for metrics               9090
Grafana          Beautiful dashboards and alerting              3000
node_exporter    CPU, RAM, disk, network metrics                9100
dcgm-exporter    NVIDIA GPU metrics (utilization, temp, power)  9400
dc-exporter      VRAM temperature, hotspot, fan speed           9500
vastai-exporter  Vast.ai earnings and reliability (optional)    8622
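
To confirm which of these components are reachable after setup, a plain TCP probe against each port is usually enough. The sketch below is not part of dc-overview: the helper names `port_open` and `check_components` are hypothetical, and it assumes the default ports from the table above. An open port only shows that something is listening, not that the service behind it is healthy.

```python
import socket

# Default ports taken from the component table above.
DEFAULT_PORTS = {
    "prometheus": 9090,
    "grafana": 3000,
    "node_exporter": 9100,
    "dcgm-exporter": 9400,
    "dc-exporter": 9500,
    "vastai-exporter": 8622,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_components(host: str, ports: dict = DEFAULT_PORTS) -> dict:
    """Map each component name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_open in check_components("127.0.0.1").items():
        print(f"{name:16} {'open' if is_open else 'closed'}")
```

Replace `127.0.0.1` with a worker's address to probe its exporters from the master.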

🚀 Quick Start

Prerequisites

  • Linux (Ubuntu 20.04+, Debian, CentOS)
  • Python 3.9+ with pip
  • Root/sudo access for installing services

One Command Setup

Ubuntu 24.04+ / Python 3.12+ (uses pipx):

sudo apt install pipx -y
pipx install dc-overview
pipx ensurepath && source ~/.bashrc
sudo dc-overview quickstart

Ubuntu 22.04 / Python 3.10 (direct pip):

pip install dc-overview
sudo dc-overview quickstart

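Alternative if pip reports an "externally-managed-environment" error (this flag overrides the system package protection, so prefer the pipx route above where possible):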

pip install dc-overview --break-system-packages
sudo dc-overview quickstart

For remote worker deployment, set up passwordless sudo on each worker (replace YOUR_USER with the SSH login user):

sudo bash -c 'echo "YOUR_USER ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/nopasswd && chmod 440 /etc/sudoers.d/nopasswd'

The wizard guides you through everything:

╭──────────────────────────────────────────────────╮
│           DC Overview - Quick Setup              │
╰──────────────────────────────────────────────────╯

Step 1: What is this machine?
  ○ GPU Worker (has GPUs to monitor)
  ● Master Server (monitors other machines)
  ○ Both (has GPUs + monitors others)

Step 2: Setting up Monitoring Dashboard
  Set Grafana admin password: ******
  ✓ Prometheus running on port 9090
  ✓ Grafana running on port 3000

Step 3: Add Machines to Monitor
  How do you want to add servers?
    ● Import from file/paste (recommended)
    ○ Enter manually

  Paste your server list:
  global:root,mypassword
  192.168.1.101
  192.168.1.102
  192.168.1.103
  [Enter]

  Installing on 192.168.1.101... ✓
  Installing on 192.168.1.102... ✓
  Installing on 192.168.1.103... ✓
  ✓ Added 3 workers to Prometheus

Step 4: Vast.ai Integration (Optional)
  Are you a Vast.ai provider? [y/N]: y
  Vast.ai API Key: ******
  ✓ vastai-exporter running (port 8622)

✓ Setup Complete!
  Grafana: http://192.168.1.100:3000

📋 Import File Format

Create a simple text file to add many servers at once:

Option 1: Global credentials (same for all)

global:root,mypassword
192.168.1.101
192.168.1.102
192.168.1.103
192.168.1.104

Option 2: Per-server credentials

192.168.1.101,root,password1
192.168.1.102,ubuntu,password2
192.168.1.103,admin,password3

Option 3: Mixed (global default + overrides)

global:root,defaultpass
192.168.1.101
192.168.1.102,ubuntu,custompass
192.168.1.103
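
The three options share one grammar: an optional `global:user,password` line that sets defaults, followed by one server per line as either `host` or `host,user,password`. The sketch below shows one way such a file could be parsed. It illustrates the format only and is not the parser dc-overview uses; `Server` and `parse_server_list` are hypothetical names, and accepting `host,user` with a default password is an assumption the README does not confirm.

```python
from typing import NamedTuple, Optional


class Server(NamedTuple):
    host: str
    user: Optional[str]
    password: Optional[str]


def parse_server_list(text: str) -> list[Server]:
    """Parse the import format: optional 'global:' defaults plus one host per line."""
    default_user: Optional[str] = None
    default_password: Optional[str] = None
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("global:"):
            user, _, password = line[len("global:"):].partition(",")
            default_user = user.strip() or None
            default_password = password.strip() or None
            continue
        # Split at most twice so a password may itself contain commas.
        parts = [p.strip() for p in line.split(",", 2)]
        parts += [""] * (3 - len(parts))
        entries.append(tuple(parts))
    # Apply defaults last, so the position of the global line does not matter.
    return [
        Server(host, user or default_user, password or default_password)
        for host, user, password in entries
    ]


if __name__ == "__main__":
    sample = """global:root,defaultpass
192.168.1.101
192.168.1.102,ubuntu,custompass
192.168.1.103
"""
    for server in parse_server_list(sample):
        print(server)
```

Servers without credentials of their own inherit the global user and password; a server with no credentials and no global line ends up with `None` for both fields.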

🔧 Manual Installation

On Master Server (monitoring hub)

pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server"

On GPU Workers

pip install dc-overview
sudo dc-overview quickstart
# Select "GPU Worker"

Alternatively, run the wizard on the master only and provide SSH credentials for the workers; it then installs the exporters on each worker remotely.


📊 Available Commands

dc-overview quickstart          # ⚡ One-command setup (recommended)
dc-overview status              # Check what's running
dc-overview add-machine IP      # Add another machine to monitor
dc-overview install-exporters   # Install exporters on current machine
dc-overview setup-ssl           # Set up reverse proxy with SSL

🔒 Reverse Proxy & SSL Setup

Set up a secure HTTPS frontend with a branded landing page:

Self-Signed Certificate (Default)

# Basic setup (IP access only)
sudo dc-overview setup-ssl

# With custom site name
sudo dc-overview setup-ssl --site-name "My GPU Farm"

# Include IPMI Monitor and Vast.ai
sudo dc-overview setup-ssl --ipmi --vastai

Let's Encrypt (Free SSL)

For a valid SSL certificate (no browser warnings):

sudo dc-overview setup-ssl \
  --domain monitor.example.com \
  --letsencrypt \
  --email admin@example.com \
  --ipmi --vastai

DNS Setup (Required for Domain)

Add these DNS records pointing to your server IP:

Type  Name                         Value        Purpose
A     monitor.example.com          <server-ip>  Main dashboard
A     grafana.monitor.example.com  <server-ip>  Grafana subdomain (optional)
A     ipmi.monitor.example.com     <server-ip>  IPMI subdomain (optional)
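
Let's Encrypt validation fails if the domain does not yet resolve to the server, so it can help to check the records before running setup-ssl. The sketch below uses the standard library's `socket.gethostbyname_ex` (IPv4 only) and is not part of dc-overview; `resolves_to` and `check_records` are hypothetical helper names, and `203.0.113.10` stands in for your server IP.

```python
import socket


def resolves_to(name, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True if `name` has an A record equal to `expected_ip`."""
    try:
        _, _, addresses = resolver(name)
    except OSError:
        # socket.gaierror (name not found) is a subclass of OSError.
        return False
    return expected_ip in addresses


def check_records(names, expected_ip, resolver=socket.gethostbyname_ex):
    """Map each DNS name to whether it resolves to `expected_ip`."""
    return {name: resolves_to(name, expected_ip, resolver) for name in names}


# Example usage (performs real DNS lookups when uncommented):
# names = [
#     "monitor.example.com",
#     "grafana.monitor.example.com",
#     "ipmi.monitor.example.com",
# ]
# for name, ok in check_records(names, "203.0.113.10").items():
#     print(f"{name}: {'ok' if ok else 'does not resolve to server'}")
```

The `resolver` parameter exists only so the check can be exercised without network access. DNS changes can take time to propagate, so a failed check shortly after editing records is not necessarily an error.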

After Setup

Access your monitoring at:

https://<server-ip>/           # Landing page
https://<server-ip>/grafana/   # Grafana dashboards
https://<server-ip>/prometheus/ # Prometheus UI
https://<server-ip>/ipmi/      # IPMI Monitor (if enabled)

Or with domain:

https://monitor.example.com/
https://grafana.monitor.example.com/  (if subdomain configured)

๐Ÿณ Docker Alternative

If you prefer Docker Compose, the following file runs Prometheus and Grafana only (the exporters are not included):

# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      # Placeholder admin password; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
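
The Compose file mounts `./prometheus.yml`, which the quickstart wizard would otherwise generate for you. The sketch below renders a minimal configuration with one scrape job per exporter port from the component table. The job names, the 15s scrape interval, and the helper `render_prometheus_config` are assumptions made for illustration; the file dc-overview itself writes may look different.

```python
import json

# Exporter ports from the component table; job names are illustrative.
EXPORTER_PORTS = {"node": 9100, "dcgm": 9400, "dc-exporter": 9500}


def render_prometheus_config(workers, scrape_interval="15s"):
    """Return prometheus.yml text with one static scrape job per exporter."""
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "",
        "scrape_configs:",
    ]
    for job, port in EXPORTER_PORTS.items():
        targets = [f"{host}:{port}" for host in workers]
        lines += [
            f"  - job_name: {job}",
            "    static_configs:",
            # A JSON array is also a valid YAML flow sequence.
            f"      - targets: {json.dumps(targets)}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_prometheus_config(["192.168.1.101", "192.168.1.102"]))
```

Redirect the output to `prometheus.yml` next to `docker-compose.yml` before starting the stack. The worker addresses must be reachable from inside the Prometheus container.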

🔗 Related Tools

Tool          Purpose                              Install
IPMI Monitor  Server health, SEL logs, ECC errors  pip install ipmi-monitor
dc-exporter   GPU VRAM temperatures                Included in quickstart

📖 Full Suite Setup (Master + Workers)

For a complete datacenter setup with IPMI monitoring:

1. On Master Server

# Install dc-overview (Grafana + Prometheus)
pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server", add your workers

# Install ipmi-monitor (optional - for BMC/IPMI)
pip install ipmi-monitor
sudo ipmi-monitor quickstart

2. Workers are configured automatically

The quickstart installs exporters on workers via SSH.

3. Import your servers

Create servers.txt:

global:root,sshpassword
192.168.1.101
192.168.1.102
192.168.1.103

Paste the list when the wizard prompts for it, or add machines one at a time:

dc-overview add-machine 192.168.1.101 --ssh-pass sshpassword

💬 Support


📄 License

MIT License - see LICENSE for details.
