GPU Datacenter Monitoring Suite - Prometheus, Grafana & Exporters

DC Overview

A complete GPU datacenter monitoring suite: monitor your GPU servers with Prometheus and Grafana, plus optional AI-powered insights.

✨ What's Included

Component        Description                                    Port
Prometheus       Time-series database for metrics               9090
Grafana          Beautiful dashboards and alerting              3000
node_exporter    CPU, RAM, disk, network metrics                9100
dcgm-exporter    NVIDIA GPU metrics (utilization, temp, power)  9400
dc-exporter      VRAM temperature, hotspot, fan speed           9500
vastai-exporter  Vast.ai earnings and reliability (optional)    8622
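
One quick way to confirm that the services in the table are listening is to attempt a TCP connection to each port. The sketch below is a minimal example using only the Python standard library. The default host `127.0.0.1` and the function names are illustrative assumptions and are not part of dc-overview.

```python
import socket

# Ports taken from the component table above.
COMPONENT_PORTS = {
    "Prometheus": 9090,
    "Grafana": 3000,
    "node_exporter": 9100,
    "dcgm-exporter": 9400,
    "dc-exporter": 9500,
    "vastai-exporter": 8622,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_components(host: str = "127.0.0.1") -> dict:
    """Map each component name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in COMPONENT_PORTS.items()}


if __name__ == "__main__":
    for name, is_up in check_components().items():
        state = "open" if is_up else "closed"
        print(f"{name:16} {COMPONENT_PORTS[name]:>5}  {state}")
```

An open port only shows that something is listening. `dc-overview status` remains the tool's own check of what is running.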

🚀 Quick Start

Prerequisites

  • Linux (Ubuntu 20.04+, Debian, CentOS)
  • Python 3.9+ with pip
  • Root/sudo access for installing services

One Command Setup

Ubuntu 24.04+ / Python 3.12+ (uses pipx):

sudo apt install pipx -y
pipx install dc-overview
pipx ensurepath && source ~/.bashrc
sudo dc-overview quickstart

Ubuntu 22.04 / Python 3.10 (direct pip):

pip install dc-overview
sudo dc-overview quickstart

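Alternative, if pip fails with an "externally-managed-environment" error (this flag overrides the distribution's protection of system Python packages, so prefer pipx where it is available):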

pip install dc-overview --break-system-packages
sudo dc-overview quickstart

For remote worker deployment, set up passwordless sudo on each worker (replace YOUR_USER with the account used for SSH):

sudo bash -c 'echo "YOUR_USER ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/nopasswd && chmod 440 /etc/sudoers.d/nopasswd'

The wizard guides you through everything:

╭──────────────────────────────────────────────────╮
│           DC Overview - Quick Setup              │
╰──────────────────────────────────────────────────╯

Step 1: What is this machine?
  ○ GPU Worker (has GPUs to monitor)
  ● Master Server (monitors other machines)
  ○ Both (has GPUs + monitors others)

Step 2: Setting up Monitoring Dashboard
  Set Grafana admin password: ******
  ✓ Prometheus running on port 9090
  ✓ Grafana running on port 3000

Step 3: Add Machines to Monitor
  How do you want to add servers?
    ● Import from file/paste (recommended)
    ○ Enter manually

  Paste your server list:
  global:root,mypassword
  192.168.1.101
  192.168.1.102
  192.168.1.103
  [Enter]

  Installing on 192.168.1.101... ✓
  Installing on 192.168.1.102... ✓
  Installing on 192.168.1.103... ✓
  ✓ Added 3 workers to Prometheus

Step 4: Vast.ai Integration (Optional)
  Are you a Vast.ai provider? [y/N]: y
  Vast.ai API Key: ******
  ✓ vastai-exporter running (port 8622)

✓ Setup Complete!
  Grafana: http://192.168.1.100:3000

📋 Import File Format

Create a simple text file to add many servers at once:

Option 1: Global credentials (same for all)

global:root,mypassword
192.168.1.101
192.168.1.102
192.168.1.103
192.168.1.104

Option 2: Per-server credentials

192.168.1.101,root,password1
192.168.1.102,ubuntu,password2
192.168.1.103,admin,password3

Option 3: Mixed (global default + overrides)

global:root,defaultpass
192.168.1.101
192.168.1.102,ubuntu,custompass
192.168.1.103
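
The three layouts above share one rule: an optional `global:user,password` line sets defaults, and each remaining line is either a bare host or `host,user,password`. The sketch below shows one way to read that format. It is an illustration only, not the parser dc-overview uses, and `Server` and `parse_server_list` are hypothetical names. It also assumes the `global:` line may appear anywhere in the file and that passwords are taken verbatim.

```python
from typing import List, NamedTuple, Optional


class Server(NamedTuple):
    host: str
    user: Optional[str]
    password: Optional[str]


def parse_server_list(text: str) -> List[Server]:
    """Parse 'global:user,pass' defaults plus one server per line."""
    default_user: Optional[str] = None
    default_pass: Optional[str] = None
    entries = []

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("global:"):
            user, sep, password = line[len("global:"):].partition(",")
            if not sep:
                raise ValueError("global line must look like 'global:user,password'")
            default_user, default_pass = user.strip(), password
        else:
            entries.append(line)

    servers = []
    for line in entries:
        # Split at most twice so a password may itself contain commas.
        parts = line.split(",", 2)
        if len(parts) == 1:
            servers.append(Server(parts[0].strip(), default_user, default_pass))
        elif len(parts) == 3:
            servers.append(Server(parts[0].strip(), parts[1].strip(), parts[2]))
        else:
            raise ValueError(f"expected 'host' or 'host,user,password': {line!r}")
    return servers


if __name__ == "__main__":
    sample = (
        "global:root,defaultpass\n"
        "192.168.1.101\n"
        "192.168.1.102,ubuntu,custompass\n"
        "192.168.1.103\n"
    )
    for server in parse_server_list(sample):
        print(server.host, server.user)
```

With the Option 3 sample, the first and third hosts inherit `root` from the global line and the second keeps its own `ubuntu` credentials.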

🔧 Manual Installation

On Master Server (monitoring hub)

pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server"

On GPU Workers

pip install dc-overview
sudo dc-overview quickstart
# Select "GPU Worker"

Alternatively, run the wizard on the master and provide SSH credentials; it installs the exporters on the workers remotely.


📊 Available Commands

dc-overview quickstart          # ⚡ One-command setup (recommended)
dc-overview status              # Check what's running
dc-overview add-machine IP      # Add another machine to monitor
dc-overview install-exporters   # Install exporters on current machine
dc-overview setup-ssl           # Set up reverse proxy with SSL

🔒 Reverse Proxy & SSL Setup

Set up a secure HTTPS frontend with a branded landing page:

Self-Signed Certificate (Default)

# Basic setup (IP access only)
sudo dc-overview setup-ssl

# With custom site name
sudo dc-overview setup-ssl --site-name "My GPU Farm"

# Include IPMI Monitor and Vast.ai
sudo dc-overview setup-ssl --ipmi --vastai

Let's Encrypt (Free SSL)

For a valid SSL certificate (no browser warnings):

sudo dc-overview setup-ssl \
  --domain monitor.example.com \
  --letsencrypt \
  --email admin@example.com \
  --ipmi --vastai

DNS Setup (Required for Domain)

Add these DNS records pointing to your server IP:

Type  Name                         Value        Purpose
A     monitor.example.com          <server-ip>  Main dashboard
A     grafana.monitor.example.com  <server-ip>  Grafana subdomain (optional)
A     ipmi.monitor.example.com     <server-ip>  IPMI subdomain (optional)
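
Before requesting a Let's Encrypt certificate, it can help to confirm that the A records already resolve to the server. The sketch below is a minimal check using the standard library. `check_dns` is a hypothetical helper, and the domain names and the IP `203.0.113.10` are placeholders. The `resolve` parameter exists only so the lookup can be swapped out.

```python
import socket
from typing import Callable, Dict, Iterable


def check_dns(
    names: Iterable[str],
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, bool]:
    """Return, per hostname, whether it resolves to expected_ip (IPv4 only)."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name) == expected_ip
        except OSError:
            # socket.gaierror (name not found) is a subclass of OSError.
            results[name] = False
    return results


if __name__ == "__main__":
    records = [
        "monitor.example.com",
        "grafana.monitor.example.com",
        "ipmi.monitor.example.com",
    ]
    # Placeholder address: substitute your server's public IP.
    for name, ok in check_dns(records, "203.0.113.10").items():
        print(f"{name:32} {'ok' if ok else 'MISSING or wrong IP'}")
```

A freshly created record may take time to propagate, so a failed lookup right after editing DNS is not conclusive.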

After Setup

Access your monitoring at:

https://<server-ip>/              # Landing page
https://<server-ip>/grafana/      # Grafana dashboards
https://<server-ip>/prometheus/   # Prometheus UI
https://<server-ip>/ipmi/         # IPMI Monitor (if enabled)

Or with domain:

https://monitor.example.com/
https://grafana.monitor.example.com/  (if subdomain configured)
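
To confirm the reverse proxy answers on each path, the URLs above can be probed from a script. The sketch below is a rough example using only the standard library. `endpoint_urls` and `probe` are hypothetical names, and certificate verification is off by default only because the default setup uses a self-signed certificate. Pass `verify_tls=True` once a Let's Encrypt certificate is in place.

```python
import ssl
import urllib.error
import urllib.request
from typing import List, Optional

# Paths listed in the "After Setup" section.
BASE_PATHS = ["/", "/grafana/", "/prometheus/"]
IPMI_PATH = "/ipmi/"


def endpoint_urls(host: str, include_ipmi: bool = False) -> List[str]:
    """Build the HTTPS URLs served by the reverse proxy for a host or domain."""
    paths = BASE_PATHS + ([IPMI_PATH] if include_ipmi else [])
    return [f"https://{host}{path}" for path in paths]


def probe(url: str, verify_tls: bool = False, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if the request could not complete."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Accept self-signed certificates; hostname checking must be disabled first.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 401 or 404.
        return exc.code
    except OSError:
        # URLError, TLS failures and timeouts all derive from OSError.
        return None


if __name__ == "__main__":
    for url in endpoint_urls("192.168.1.100", include_ipmi=True):
        print(url, probe(url))
```

A returned status such as 200 or 302 shows that the proxy is routing the path. `None` usually points to a connection or TLS problem rather than an application error.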

๐Ÿณ Docker Alternative

If you prefer Docker Compose:

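# docker-compose.yml
# Minimal stack: Prometheus and Grafana only; the exporters are not included here.
version: '3.8'  # optional: Docker Compose v2 ignores this field
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      # Scrape configuration supplied from the current directory
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      # Change this default before exposing Grafana on a network
      - GF_SECURITY_ADMIN_PASSWORD=admin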
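
The Compose file mounts `./prometheus.yml` but does not show its contents. The sketch below generates a plausible scrape configuration for the worker exporters listed earlier (ports 9100, 9400 and 9500). The job names, the 15-second interval and the helper name `render_prometheus_config` are assumptions for illustration. The file the quickstart wizard writes may differ.

```python
from typing import Iterable

# Exporter ports from the "What's Included" table; job names are illustrative.
EXPORTER_PORTS = {
    "node_exporter": 9100,
    "dcgm-exporter": 9400,
    "dc-exporter": 9500,
}


def render_prometheus_config(workers: Iterable[str], scrape_interval: str = "15s") -> str:
    """Render a prometheus.yml with one static scrape job per exporter."""
    hosts = list(workers)
    if not hosts:
        raise ValueError("at least one worker host is required")

    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "",
        "scrape_configs:",
    ]
    for job, port in EXPORTER_PORTS.items():
        lines.append(f"  - job_name: {job}")
        lines.append("    static_configs:")
        lines.append("      - targets:")
        for host in hosts:
            lines.append(f'          - "{host}:{port}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    config = render_prometheus_config(["192.168.1.101", "192.168.1.102"])
    with open("prometheus.yml", "w", encoding="utf-8") as fh:
        fh.write(config)
    print(config)
```

Running the script next to `docker-compose.yml` writes the file that the `prometheus` service mounts.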

🔗 Related Tools

Tool          Purpose                              Install
IPMI Monitor  Server health, SEL logs, ECC errors  pip install ipmi-monitor
dc-exporter   GPU VRAM temperatures                Included in quickstart

📖 Full Suite Setup (Master + Workers)

For a complete datacenter setup with IPMI monitoring:

1. On Master Server

# Install dc-overview (Grafana + Prometheus)
pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server", add your workers

# Install ipmi-monitor (optional - for BMC/IPMI)
pip install ipmi-monitor
sudo ipmi-monitor quickstart

2. Workers are configured automatically

The quickstart installs exporters on workers via SSH.

3. Import your servers

Create servers.txt:

global:root,sshpassword
192.168.1.101
192.168.1.102
192.168.1.103

Then paste the contents when the wizard prompts for a server list, or add machines one at a time:

dc-overview add-machine 192.168.1.101 --ssh-pass sshpassword
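
For more than a few hosts, the `add-machine` call can be scripted. The sketch below only builds and prints the commands (a dry run). It uses the `add-machine` subcommand and `--ssh-pass` flag shown above and assumes every host shares one SSH password. `add_machine_commands` is a hypothetical helper, not part of dc-overview.

```python
import shlex
from typing import Iterable, List


def add_machine_commands(hosts: Iterable[str], ssh_password: str) -> List[List[str]]:
    """Build one 'dc-overview add-machine' argument list per host."""
    return [
        ["dc-overview", "add-machine", host, "--ssh-pass", ssh_password]
        for host in hosts
    ]


if __name__ == "__main__":
    hosts = ["192.168.1.101", "192.168.1.102", "192.168.1.103"]
    for cmd in add_machine_commands(hosts, "sshpassword"):
        # Dry run: print the command. To execute it instead, use
        # subprocess.run(cmd, check=True) after importing subprocess.
        print(shlex.join(cmd))
```

Passing argument lists rather than a shell string avoids quoting problems with unusual passwords. Note that a password given on the command line is visible in the process list while the command runs.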

💬 Support


📄 License

MIT License - see LICENSE for details.


File details

Details for the file dc_overview-1.0.9.tar.gz.

File metadata

  • Download URL: dc_overview-1.0.9.tar.gz
  • Upload date:
  • Size: 46.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for dc_overview-1.0.9.tar.gz
Algorithm    Hash digest
SHA256       f1fc4303d6574e00170cd241dbd836128685b0f9e516a9cb43d44d2c31514da3
MD5          498205059295cb985331065d20ae341f
BLAKE2b-256  93de9ddeb42cccf376ee0a65c8e7d2b4aad384d00a9720cc3318d075a2e8814c

File details

Details for the file dc_overview-1.0.9-py3-none-any.whl.

File metadata

  • Download URL: dc_overview-1.0.9-py3-none-any.whl
  • Upload date:
  • Size: 49.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for dc_overview-1.0.9-py3-none-any.whl
Algorithm    Hash digest
SHA256       4f9c0245ad441e5ce45b61bf97863cc23489591ad09745b80a3c59c7691a20bc
MD5          2241b411e35e97dc39515ab36383a510
BLAKE2b-256  9f30ce694fce5237faa3e67fb984e39b09853ad501f0b1cb7d026a930fd12529
