GPU Datacenter Monitoring Suite - Prometheus, Grafana & Exporters
DC Overview
Complete GPU datacenter monitoring suite. Monitor your GPU servers with Prometheus, Grafana, and optional AI-powered insights.
What's Included
| Component | Description | Port |
|---|---|---|
| Prometheus | Time-series database for metrics | 9090 |
| Grafana | Beautiful dashboards and alerting | 3000 |
| node_exporter | CPU, RAM, disk, network metrics | 9100 |
| dcgm-exporter | NVIDIA GPU metrics (utilization, temp, power) | 9400 |
| dc-exporter | VRAM temperature, hotspot, fan speed | 9500 |
| vastai-exporter | Vast.ai earnings and reliability (optional) | 8622 |
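The default ports in the table can be probed with a plain TCP connect to see which components are reachable. The following is a minimal sketch, not part of dc-overview: `is_port_open` and `check_components` are hypothetical helper names, and the port map assumes the defaults listed above were not changed.

```python
import socket
from typing import Callable, Dict

# Default ports copied from the component table (assumed unchanged).
COMPONENT_PORTS: Dict[str, int] = {
    "Prometheus": 9090,
    "Grafana": 3000,
    "node_exporter": 9100,
    "dcgm-exporter": 9400,
    "dc-exporter": 9500,
    "vastai-exporter": 8622,
}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_components(
    host: str, probe: Callable[[str, int], bool] = is_port_open
) -> Dict[str, bool]:
    """Probe every known component port on host and report reachability."""
    return {name: probe(host, port) for name, port in COMPONENT_PORTS.items()}


if __name__ == "__main__":
    for name, ok in check_components("127.0.0.1").items():
        print(f"{name:16} {'open' if ok else 'closed'}")
```

An open port only shows that something is listening; it does not confirm that the exporter is healthy. `dc-overview status` remains the tool's own check.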
Quick Start
Prerequisites
- Linux (Ubuntu 20.04+, Debian, CentOS)
- Python 3.9+ with pip
- Root/sudo access for installing services
One Command Setup
Ubuntu 24.04+ / Python 3.12+ (uses pipx):
sudo apt install pipx
pipx install dc-overview
sudo ~/.local/bin/dc-overview quickstart
Ubuntu 22.04 / Python 3.10 (direct pip):
pip install dc-overview
sudo dc-overview quickstart
Alternative (if you get "externally-managed-environment" error):
pip install dc-overview --break-system-packages
sudo dc-overview quickstart
For remote worker deployment, set up passwordless sudo on workers:
sudo bash -c 'echo "YOUR_USER ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/nopasswd && chmod 440 /etc/sudoers.d/nopasswd'
The wizard guides you through everything:
╭──────────────────────────────────────────────────╮
│            DC Overview - Quick Setup             │
╰──────────────────────────────────────────────────╯
Step 1: What is this machine?
○ GPU Worker (has GPUs to monitor)
○ Master Server (monitors other machines)
○ Both (has GPUs + monitors others)
Step 2: Setting up Monitoring Dashboard
Set Grafana admin password: ******
✓ Prometheus running on port 9090
✓ Grafana running on port 3000
Step 3: Add Machines to Monitor
How do you want to add servers?
○ Import from file/paste (recommended)
○ Enter manually
Paste your server list:
global:root,mypassword
192.168.1.101
192.168.1.102
192.168.1.103
[Enter]
Installing on 192.168.1.101... ✓
Installing on 192.168.1.102... ✓
Installing on 192.168.1.103... ✓
✓ Added 3 workers to Prometheus
Step 4: Vast.ai Integration (Optional)
Are you a Vast.ai provider? [y/N]: y
Vast.ai API Key: ******
✓ vastai-exporter running (port 8622)
✓ Setup Complete!
Grafana: http://192.168.1.100:3000
Import File Format
Create a simple text file to add many servers at once:
Option 1: Global credentials (same for all)
global:root,mypassword
192.168.1.101
192.168.1.102
192.168.1.103
192.168.1.104
Option 2: Per-server credentials
192.168.1.101,root,password1
192.168.1.102,ubuntu,password2
192.168.1.103,admin,password3
Option 3: Mixed (global default + overrides)
global:root,defaultpass
192.168.1.101
192.168.1.102,ubuntu,custompass
192.168.1.103
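The three options share one line grammar: an optional `global:user,password` line sets the default credentials, a bare IP uses that default, and `ip,user,password` overrides it. The sketch below shows one way such a file could be parsed. It is an illustration, not dc-overview's actual parser: `Server` and `parse_server_list` are hypothetical names, and it assumes the `global:` line appears before any bare IP that relies on it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Server:
    host: str
    user: str
    password: str


def parse_server_list(text: str) -> List[Server]:
    """Parse the import format: 'global:user,pass', 'ip', or 'ip,user,pass'."""
    default: Optional[Tuple[str, str]] = None
    servers: List[Server] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("global:"):
            # Split once so the password may itself contain commas.
            user, password = line[len("global:"):].split(",", 1)
            default = (user.strip(), password.strip())
            continue
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) == 3:
            servers.append(Server(parts[0], parts[1], parts[2]))
        elif len(parts) == 1:
            if default is None:
                raise ValueError(f"no global credentials set for {parts[0]}")
            servers.append(Server(parts[0], default[0], default[1]))
        else:
            raise ValueError(f"malformed line: {line!r}")
    return servers


if __name__ == "__main__":
    example = """global:root,defaultpass
192.168.1.101
192.168.1.102,ubuntu,custompass
192.168.1.103
"""
    for server in parse_server_list(example):
        print(server.host, server.user)
```

Because the file holds plaintext passwords, restricting its permissions (for example `chmod 600 servers.txt`) and deleting it after the import is a sensible precaution.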
Manual Installation
On Master Server (monitoring hub)
pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server"
On GPU Workers
pip install dc-overview
sudo dc-overview quickstart
# Select "GPU Worker"
Alternatively, run the wizard on the master only and provide SSH credentials; it then installs the exporters on the workers remotely.
Available Commands
dc-overview quickstart # One-command setup (recommended)
dc-overview status # Check what's running
dc-overview add-machine IP # Add another machine to monitor
dc-overview install-exporters # Install exporters on current machine
dc-overview setup-ssl # Set up reverse proxy with SSL
Reverse Proxy & SSL Setup
Set up a secure HTTPS frontend with a branded landing page:
Self-Signed Certificate (Default)
# Basic setup (IP access only)
sudo dc-overview setup-ssl
# With custom site name
sudo dc-overview setup-ssl --site-name "My GPU Farm"
# Include IPMI Monitor and Vast.ai
sudo dc-overview setup-ssl --ipmi --vastai
Let's Encrypt (Free SSL)
For a valid SSL certificate (no browser warnings):
sudo dc-overview setup-ssl \
--domain monitor.example.com \
--letsencrypt \
--email admin@example.com \
--ipmi --vastai
DNS Setup (Required for Domain)
Add these DNS records pointing to your server IP:
| Type | Name | Value | Purpose |
|---|---|---|---|
| A | monitor.example.com | <server-ip> | Main dashboard |
| A | grafana.monitor.example.com | <server-ip> | Grafana subdomain (optional) |
| A | ipmi.monitor.example.com | <server-ip> | IPMI subdomain (optional) |
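Let's Encrypt validation fails if the records do not yet resolve to the server, so it can help to check DNS before running `setup-ssl --letsencrypt`. A minimal sketch, assuming IPv4 A records as in the table; `check_dns` is a hypothetical helper, and the hostnames and IP are placeholders to replace with your own.

```python
import socket
from typing import Callable, Dict, List


def check_dns(
    names: List[str],
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, bool]:
    """Map each hostname to True if it resolves to expected_ip (IPv4)."""
    results: Dict[str, bool] = {}
    for name in names:
        try:
            results[name] = resolve(name) == expected_ip
        except OSError:  # socket.gaierror (name not found) is an OSError
            results[name] = False
    return results


if __name__ == "__main__":
    records = [
        "monitor.example.com",
        "grafana.monitor.example.com",
        "ipmi.monitor.example.com",
    ]
    # 203.0.113.10 is a documentation-range placeholder for <server-ip>.
    for name, ok in check_dns(records, "203.0.113.10").items():
        print(f"{name}: {'ok' if ok else 'not pointing at server'}")
```

Newly created records can take a while to propagate, so a failed check shortly after editing DNS is not necessarily a misconfiguration.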
After Setup
Access your monitoring at:
https://<server-ip>/ # Landing page
https://<server-ip>/grafana/ # Grafana dashboards
https://<server-ip>/prometheus/ # Prometheus UI
https://<server-ip>/ipmi/ # IPMI Monitor (if enabled)
Or with domain:
https://monitor.example.com/
https://grafana.monitor.example.com/ (if subdomain configured)
Docker Alternative
If you prefer Docker Compose:
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
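The Compose file mounts `./prometheus.yml`, which the Docker path expects you to supply yourself. The sketch below renders a basic scrape configuration for a list of workers using the default exporter ports from this README. It is a hand-written illustration, not the file that `dc-overview quickstart` generates: `render_prometheus_config`, the job names, and the 15s scrape interval are all assumptions.

```python
from typing import Dict, List

# Default exporter ports from this README; job names are illustrative.
EXPORTER_PORTS: Dict[str, int] = {
    "node_exporter": 9100,
    "dcgm-exporter": 9400,
    "dc-exporter": 9500,
}


def render_prometheus_config(workers: List[str], scrape_interval: str = "15s") -> str:
    """Render a prometheus.yml with one scrape job per exporter type."""
    if not workers:
        raise ValueError("at least one worker host is required")
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "",
        "scrape_configs:",
    ]
    for job, port in EXPORTER_PORTS.items():
        lines.append(f"  - job_name: '{job}'")
        lines.append("    static_configs:")
        lines.append("      - targets:")
        for host in workers:
            lines.append(f"          - '{host}:{port}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the config; redirect to ./prometheus.yml next to docker-compose.yml.
    print(render_prometheus_config(["192.168.1.101", "192.168.1.102"]), end="")
```

Grouping targets by exporter type rather than by host keeps each job's metrics homogeneous, which tends to make Grafana queries simpler. Note that the Compose file above also sets the Grafana admin password to `admin`; change it before exposing port 3000.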
Related Tools
| Tool | Purpose | Install |
|---|---|---|
| IPMI Monitor | Server health, SEL logs, ECC errors | pip install ipmi-monitor |
| dc-exporter | GPU VRAM temperatures | Included in quickstart |
Full Suite Setup (Master + Workers)
For a complete datacenter setup with IPMI monitoring:
1. On Master Server
# Install dc-overview (Grafana + Prometheus)
pip install dc-overview
sudo dc-overview quickstart
# Select "Master Server", add your workers
# Install ipmi-monitor (optional - for BMC/IPMI)
pip install ipmi-monitor
sudo ipmi-monitor quickstart
2. Workers are configured automatically
The quickstart installs exporters on workers via SSH.
3. Import your servers
Create servers.txt:
global:root,sshpassword
192.168.1.101
192.168.1.102
192.168.1.103
Then paste its contents when the wizard prompts for a server list, or add machines one at a time:
dc-overview add-machine 192.168.1.101 --ssh-pass mypassword
Support
License
MIT License - see LICENSE for details.