IPMI/BMC Server Monitoring with AI-Powered Insights

IPMI Monitor

Free, self-hosted IPMI/BMC monitoring for your server fleet. Collect System Event Logs (SEL), monitor sensors, track ECC errors, and get alerts - all from a beautiful web dashboard.

Dashboard

📸 Screenshots

  • Event Log - Track SEL events
  • Live Sensors - Temperature, fans, voltage
  • Hardware Inventory - CPU, Memory, Storage
  • System Logs - SSH-based dmesg, syslog, journalctl

✨ Features

  • 🔍 Event Collection - Automatically collect IPMI SEL logs (parallel, 32 workers)
  • 📊 Real-time Dashboard - Auto-refreshing every second with server status cards
  • 🌡️ Sensor Monitoring - Temperature, fan, voltage, power readings
  • 💾 ECC Memory Tracking - Identify which DIMM has errors
  • 🎮 GPU Health Monitoring - Detect NVIDIA GPU errors via SSH (Xid errors)
  • 📜 SSH System Logs - Collect dmesg, journalctl, syslog, mcelog, Docker daemon logs via SSH
  • 🐳 Docker Log Collection - Monitor Docker daemon errors (storage-opt, overlay, pquota issues)
  • 🔧 Hardware Error Detection - AER, PCIe, ECC errors parsed automatically
  • 🔄 Uptime & Reboot Detection - Track unexpected server reboots
  • 🚨 Alert Rules - Configurable alerts with email, Telegram, webhooks
  • ✅ Alert Resolution - Notifications when issues are resolved
  • ⏱️ Alert Confirmation - Threshold checks to avoid false positives
  • 📈 Prometheus Metrics - Native /metrics endpoint for Grafana
  • 🔐 User Management - Admin and read-only access levels
  • 📥 Full Backup/Restore - Export everything: servers, credentials, SSH keys, alerts
  • 🐳 Docker Ready - Multi-arch images (amd64/arm64)
  • 🔄 Version Display - Shows version, git commit, and build time in header
  • ⬆️ Update Notifications - Checks GitHub for newer releases
  • 🔧 Bulk Credentials - Apply SSH/IPMI credentials to multiple servers at once
  • 🔃 BMC Reset - Cold/warm reset BMC without affecting host OS
  • 🤖 Optional AI Features - Enable AI-powered insights via Settings → AI Features

🚀 Quick Start

One Command Setup ⚡

Ubuntu 24.04+ / Python 3.12+ (uses pipx):

sudo apt install pipx -y
pipx install ipmi-monitor
pipx ensurepath && source ~/.bashrc
sudo ipmi-monitor quickstart

Ubuntu 22.04 / Python 3.10 (direct pip):

pip install ipmi-monitor
sudo ipmi-monitor quickstart

Alternative (if you get "externally-managed-environment" error):

pip install ipmi-monitor --break-system-packages
sudo ipmi-monitor quickstart

That's it! Answer a few questions:

╭──────────────────────────────────────────────────╮
│           IPMI Monitor - Quick Setup             │
╰──────────────────────────────────────────────────╯

Detected: my-server (192.168.1.100)

Step 1: Add Server to Monitor
  Server name: gpu-server-01
  BMC IP address: 192.168.1.80
  BMC username: ADMIN
  BMC password: ******
  ✓ IPMI connection successful

  Add SSH access for detailed monitoring? [Y/n]: y
  Server IP (for SSH): 192.168.1.81
  SSH username: root
  SSH password: ******

Step 2: Web Interface Settings
  Web interface port: [5000]

Step 3: AI Features (Optional)
  Enable AI Insights? [y/N]: n

Step 4: Starting IPMI Monitor
  ✓ Configuration saved
  ✓ Service installed and started

╭──────────────────────────────────────────────────╮
│              ✓ Setup Complete!                   │
╰──────────────────────────────────────────────────╯
Web Interface: http://192.168.1.100:5000

After Setup

# Add more servers
ipmi-monitor add-server --bmc-ip 192.168.1.82 --username admin

# Check status
ipmi-monitor status

# View logs
ipmi-monitor logs

Bulk Import (Many Servers)

Create a simple text file in one of the formats below and paste its contents when prompted:

Option 1: SSH only (no IPMI)

global:root,sshpassword
192.168.1.101
192.168.1.102
192.168.1.103

Option 2: SSH + IPMI (full monitoring)

globalSSH:root,sshpassword
globalIPMI:ADMIN,ipmipassword
192.168.1.101,192.168.1.80
192.168.1.102,192.168.1.82
192.168.1.103,192.168.1.84

Option 3: Per-server credentials

# serverIP,sshUser,sshPass,ipmiUser,ipmiPass,bmcIP
192.168.1.101,root,pass1,ADMIN,ipmi1,192.168.1.80
192.168.1.102,root,pass2,ADMIN,ipmi2,192.168.1.82
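The three layouts above can be read with one small parser. The following is a minimal sketch, not IPMI Monitor's own import code: `ServerEntry` and `parse_bulk_import` are hypothetical names, and the sketch assumes passwords may contain commas only in the `global...` header lines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerEntry:
    # Hypothetical container for one imported server.
    server_ip: str
    ssh_user: Optional[str] = None
    ssh_pass: Optional[str] = None
    ipmi_user: Optional[str] = None
    ipmi_pass: Optional[str] = None
    bmc_ip: Optional[str] = None


def parse_bulk_import(text: str) -> list:
    """Parse the three bulk-import layouts into ServerEntry objects."""
    ssh_default = (None, None)
    ipmi_default = (None, None)
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        # Header lines set credentials shared by the entries that follow.
        key, sep, value = line.partition(":")
        if sep and key in ("global", "globalSSH", "globalIPMI"):
            user, _, password = value.partition(",")
            if key == "globalIPMI":
                ipmi_default = (user, password)
            else:
                ssh_default = (user, password)
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) == 6:
            # serverIP,sshUser,sshPass,ipmiUser,ipmiPass,bmcIP
            entries.append(ServerEntry(*fields))
        elif len(fields) == 2:
            # serverIP,bmcIP with shared credentials
            entries.append(
                ServerEntry(fields[0], *ssh_default, *ipmi_default, fields[1])
            )
        elif len(fields) == 1:
            # SSH only
            entries.append(ServerEntry(fields[0], *ssh_default))
        else:
            raise ValueError(f"Unrecognized line: {raw!r}")
    return entries
```

Running a file through a parser like this before pasting it is a cheap way to catch a line with the wrong number of fields.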

🔗 Full Datacenter Suite

For complete GPU datacenter monitoring, combine with DC Overview:

# On master server - install both tools
pip install dc-overview ipmi-monitor

# dc-overview: Grafana + Prometheus + GPU metrics
sudo dc-overview quickstart

# ipmi-monitor: BMC/IPMI health + SEL logs + AI insights
sudo ipmi-monitor quickstart

| Tool | What it monitors |
|------|------------------|
| dc-overview | GPU utilization, temperature, power, CPU, RAM, disk |
| ipmi-monitor | BMC health, SEL events, ECC errors, sensors, system logs |

CLI Commands

ipmi-monitor setup              # Interactive setup wizard
ipmi-monitor run                # Start web interface
ipmi-monitor run --port 8080    # Custom port
ipmi-monitor daemon             # Run as daemon (for systemd)
ipmi-monitor status             # Show status and config
ipmi-monitor add-server         # Add a server interactively
ipmi-monitor list-servers       # List configured servers

Option 2: Docker Compose

For containerized deployments or if you prefer Docker:

Step 1: Create project directory

mkdir ipmi-monitor && cd ipmi-monitor

Step 2: Create docker-compose.yml:

version: '3.8'

services:
  ipmi-monitor:
    image: ghcr.io/cryptolabsza/ipmi-monitor:latest
    container_name: ipmi-monitor
    restart: unless-stopped
    ports:
      - "5000:5000"
    environment:
      - APP_NAME=My Server Fleet        # Customize this
      - IPMI_USER=admin
      - IPMI_PASS=YourIPMIPassword      # Your BMC password
      - ADMIN_PASS=changeme             # CHANGE THIS!
      - SECRET_KEY=change-this-to-random-string
    volumes:
      - ipmi_data:/app/data             # ⚠️ IMPORTANT: Persists your data!
    labels:
      - "com.centurylinklabs.watchtower.enable=true"  # Enable auto-updates

volumes:
  ipmi_data:

Step 3: Start the service

docker-compose up -d

Step 4: Open http://localhost:5000 and add your servers!


Option 3: Docker Run

# Create a named volume for data persistence
docker volume create ipmi_data

# Run the container
docker run -d \
  --name ipmi-monitor \
  --label com.centurylinklabs.watchtower.enable=true \
  -p 5000:5000 \
  -e IPMI_USER=admin \
  -e IPMI_PASS=YourIPMIPassword \
  -e ADMIN_PASS=YourAdminPassword \
  -e SECRET_KEY=your-random-secret-key \
  -v ipmi_data:/app/data \
  --restart unless-stopped \
  ghcr.io/cryptolabsza/ipmi-monitor:latest

⚠️ Important: Data Persistence

Always use a named volume to preserve your data across container updates:

# ✅ CORRECT - Named volume (survives updates)
volumes:
  - ipmi_data:/app/data

# ❌ WRONG - No volume (data lost on rebuild)
# (no volume specified)

📝 Configuration File Reference

servers.yaml

servers:
  - name: GPU-Server-01           # Display name
    bmc_ip: 192.168.1.80          # BMC/IPMI IP (required)
    username: admin               # BMC username
    password: ipmi-password       # BMC password
    protocol: auto                # auto, ipmi, or redfish
    
    # Optional: SSH for system logs
    server_ip: 192.168.1.81       # Server OS IP
    ssh_user: root
    ssh_port: 22
    ssh_password: ssh-password    # Or use ssh_key
    ssh_key: ~/.ssh/id_rsa        # Path to SSH private key
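A quick sanity check on each entry can catch mistakes before the service starts. The sketch below works on an entry that has already been loaded into a Python dict. `validate_server` is a hypothetical helper, not part of IPMI Monitor, and it assumes `bmc_ip` is a literal IP address rather than a hostname.

```python
import ipaddress

# Values listed in the servers.yaml comment for the protocol field.
ALLOWED_PROTOCOLS = {"auto", "ipmi", "redfish"}


def validate_server(entry: dict) -> list:
    """Return a list of problems found in one servers.yaml entry (empty if none)."""
    problems = []
    bmc_ip = entry.get("bmc_ip")
    if not bmc_ip:
        problems.append("bmc_ip is required")
    else:
        try:
            ipaddress.ip_address(bmc_ip)  # assumption: a literal IP address
        except ValueError:
            problems.append(f"bmc_ip is not a valid IP address: {bmc_ip!r}")
    protocol = entry.get("protocol", "auto")
    if protocol not in ALLOWED_PROTOCOLS:
        problems.append(f"protocol must be one of {sorted(ALLOWED_PROTOCOLS)}")
    # SSH fields are optional, but credentials are useless without a target.
    has_ssh_creds = entry.get("ssh_password") or entry.get("ssh_key")
    if has_ssh_creds and not entry.get("server_ip"):
        problems.append("server_ip is needed when SSH credentials are set")
    return problems
```

An empty list means the entry passed these checks. It does not confirm that the BMC is reachable or that the credentials work.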

config.yaml

settings:
  web_port: 5000
  refresh_interval: 60           # Seconds between collections
  enable_prometheus: true        # /metrics endpoint

ai:
  enabled: false                 # Enable AI features
  license_key: sk-ipmi-xxxx      # CryptoLabs license key
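With enable_prometheus set to true, the /metrics endpoint can be scraped by Prometheus or read directly by a script. The sketch below parses the Prometheus text exposition format using only the standard library. The metric and label names in the usage comment are invented for illustration and are not taken from IPMI Monitor's actual output.

```python
import re

# metric_name{label="value",...} value [timestamp]
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples from Prometheus exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if not match:
            continue  # ignore lines this simple parser does not understand
        name, label_str, value = match.groups()
        labels = dict(_LABEL_RE.findall(label_str or ""))
        samples.append((name, labels, float(value)))
    return samples


# Example with made-up metric names:
# parse_metrics('ipmi_temperature_celsius{server="gpu-server-01"} 54')
```

For anything beyond a quick check, a maintained client library or Prometheus itself is a better choice than a hand-written parser.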

🔄 Keeping Up to Date

pip install

pip install --upgrade ipmi-monitor
sudo systemctl restart ipmi-monitor

Docker Manual Update

# Pull the latest image
docker pull ghcr.io/cryptolabsza/ipmi-monitor:latest

# Recreate the container (preserves data volume)
docker-compose up -d

Automatic Updates with Watchtower (Docker)

Add Watchtower to your docker-compose.yml:

services:
  ipmi-monitor:
    # ... your existing config ...
    labels:
      - "com.centurylinklabs.watchtower.enable=true"

  watchtower:
    image: containrrr/watchtower
    container_name: watchtower
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - WATCHTOWER_CLEANUP=true
      - WATCHTOWER_POLL_INTERVAL=300  # Check every 5 minutes
    command: --label-enable  # Only update labeled containers

| Tag | Description |
|-----|-------------|
| :latest | Latest stable release (recommended) |
| :develop | Development builds (testing new features) |
| :v1.0.3 | Specific version (pin for stability) |

🔍 Troubleshooting

Container won't start

# Check logs
docker logs ipmi-monitor

# Common issues:
# - Port 5000 already in use: Change port mapping to "5001:5000"
# - Permission denied: Ensure docker socket access

Can't connect to BMC

# Test from the container
docker exec ipmi-monitor ipmitool -I lanplus -H 192.168.1.80 -U admin -P password power status

# Common issues:
# - Wrong IP address (use BMC IP, not server OS IP)
# - Firewall blocking port 623 (IPMI)
# - Wrong credentials

SSH inventory collection fails

# Test SSH from container
docker exec ipmi-monitor ssh -o StrictHostKeyChecking=no root@192.168.1.81 hostname

# Common issues:
# - SSH key not added to container (add via Settings → SSH Keys)
# - Server IP not set (only BMC IP configured)
# - Firewall blocking port 22

Data disappeared after update

The volume name must be the same before and after the update. List your volumes with:

docker volume ls | grep ipmi

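If you see multiple volumes (e.g., ipmi_data and ipmi-monitor_ipmi_data), the container was probably started with different volume names at different times. To restore, stop the container and copy the data from the old volume into the new one: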

docker stop ipmi-monitor
docker run --rm -v OLD_VOLUME:/from -v NEW_VOLUME:/to alpine cp -av /from/. /to/

⚙️ Environment Variables (Docker)

| Variable | Default | Description |
|----------|---------|-------------|
| APP_NAME | IPMI Monitor | Displayed in header |
| IPMI_USER | admin | Default BMC username |
| IPMI_PASS | (required) | Default BMC password |
| IPMI_PASS_NVIDIA | - | Separate password for NVIDIA DGX BMCs (16-char requirement) |
| ADMIN_USER | admin | Dashboard admin username |
| ADMIN_PASS | changeme | Dashboard admin password (change this!) |
| SECRET_KEY | (auto) | Flask session secret (set this for persistent sessions!) |
| POLL_INTERVAL | 300 | Seconds between collections |
| DATA_RETENTION_DAYS | 30 | How long to keep events |
| SSH_USER | root | Default SSH username for system log collection |
| SSH_PASS | - | Default SSH password (or use SSH keys) |

🔧 Setting Up SSH for Enhanced Monitoring

SSH access enables powerful features:

  • System Logs - dmesg, journalctl, syslog, Docker daemon logs
  • Hardware Inventory - Detailed CPU, DIMM, GPU, NIC, storage info
  • GPU Monitoring - NVIDIA Xid errors, driver version, CUDA version
  • Uptime Tracking - Detect unexpected reboots

Option 1: SSH Keys (Recommended)

  1. Go to Settings → SSH Keys
  2. Click Add SSH Key
  3. Paste your private key content (from ~/.ssh/id_rsa or similar)
  4. Give it a name (e.g., "datacenter-key")
  5. In Settings → Servers, assign the key to each server

Option 2: SSH Password

  1. Go to Settings → Defaults
  2. Enter your SSH username and password
  3. Click Apply to All Servers

Important: Server IP vs BMC IP

  • BMC IP (e.g., 192.168.1.80) - IPMI/Redfish management interface
  • Server IP (e.g., 192.168.1.81) - The actual OS/SSH interface

When adding a server, set both IPs:

  • BMC IP: For IPMI/Redfish event collection
  • Server IP: For SSH-based inventory and logs

🎮 GPU Monitoring (NVIDIA)

IPMI Monitor can detect and monitor NVIDIA GPUs via SSH:

  • GPU Count & Models - Detected via nvidia-smi
  • Driver & CUDA Version - For compatibility tracking
  • Xid Errors - Parsed from dmesg/syslog (GPU failures, ECC errors)
  • PCIe Health - AER/correctable/uncorrectable errors
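To illustrate the Xid item above, here is a minimal sketch of pulling Xid events out of kernel log text. It assumes the usual NVIDIA driver line format (`NVRM: Xid (PCI:...): <code>, ...`). `parse_xid_events` is a hypothetical helper and not necessarily how IPMI Monitor parses these lines.

```python
import re

# Matches e.g. "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)(?:, (?P<detail>.*))?"
)


def parse_xid_events(log_text: str) -> list:
    """Extract Xid events (PCI address, numeric code, detail) from dmesg/syslog text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                {
                    "pci": match["pci"],
                    "code": int(match["code"]),
                    "detail": (match["detail"] or "").strip(),
                }
            )
    return events
```

The numeric code is the useful part for alerting, since the free-text detail varies between driver versions.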

Collecting GPU Inventory

  1. Ensure SSH is configured for the server
  2. Go to server detail page
  3. Click Collect Inventory
  4. GPU info appears under 🎮 GPU section

📋 Detailed DIMM Inventory

For servers with Redfish or SSH access, IPMI Monitor collects per-DIMM details:

  • Slot/Locator (e.g., A1, B2)
  • Manufacturer (Samsung, SK Hynix, Micron, etc.)
  • Part Number
  • Size (32 GB, 64 GB)
  • Speed (Configured vs Rated - highlights if running slower)

This helps identify:

  • Mixed memory configurations
  • Under-clocked DIMMs
  • Which slot has ECC errors
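The first two checks are simple comparisons over the per-DIMM records. The sketch below shows the idea. The field names (`slot`, `part_number`, `configured_mts`, `rated_mts`) are assumptions made for illustration and do not reflect IPMI Monitor's internal schema.

```python
from collections import Counter


def find_dimm_issues(dimms: list) -> list:
    """Flag under-clocked DIMMs and mixed part numbers in one server's inventory."""
    issues = []
    for dimm in dimms:
        # A DIMM configured below its rated speed is running under-clocked.
        if dimm["configured_mts"] < dimm["rated_mts"]:
            issues.append(
                f"{dimm['slot']}: configured {dimm['configured_mts']} MT/s, "
                f"rated {dimm['rated_mts']} MT/s"
            )
    # More than one distinct part number means a mixed memory configuration.
    part_numbers = Counter(dimm["part_number"] for dimm in dimms)
    if len(part_numbers) > 1:
        issues.append("mixed part numbers: " + ", ".join(sorted(part_numbers)))
    return issues
```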

🤖 AI Features (Optional)

IPMI Monitor can integrate with the CryptoLabs AI service for:

  • Fleet Summary - AI-generated daily analysis
  • Predictive Maintenance - Identify failing components
  • Root Cause Analysis - Correlate events across servers
  • Task Generation - Prioritized maintenance tasks

Enabling AI Features

  1. Go to Settings → AI Features
  2. Get an API key from cryptolabs.co.za/my-account
  3. Enter the key and click Enable

📌 AI features are optional - IPMI Monitor works fully offline without them.


🔌 IPMI vs Redfish

IPMI Monitor supports both protocols and auto-detects which to use:

| Feature | IPMI/ipmitool | Redfish |
|---------|---------------|---------|
| Event Collection | ✅ SEL logs | ✅ Log Service |
| Sensor Readings | ✅ SDR | ✅ Chassis/Thermal |
| Power Control | ✅ | ✅ |
| Inventory | Basic FRU | ✅ Rich metadata |
| Memory Details | - | ✅ Per-DIMM info |
| Supported BMCs | All | Dell iDRAC, HPE iLO, Supermicro, Lenovo |

Forcing a Protocol

By default, IPMI Monitor auto-detects. To force a specific protocol:

  1. Go to Settings → Servers
  2. Click Edit on a server
  3. Set Protocol to ipmi or redfish

🚨 Alert Configuration

IPMI Monitor can send alerts via multiple channels:

Notification Methods

| Method | Setup |
|--------|-------|
| Email | Settings → Alerts → SMTP configuration |
| Telegram | Settings → Alerts → Bot token + Chat ID |
| Webhook | Settings → Alerts → Custom URL for Slack, Discord, etc. |

Alert Rules

Create rules to trigger on specific conditions:

  • Event Type - SEL event categories (Temperature, Memory, Fan, etc.)
  • Severity - Critical, Warning, or both
  • Server Filter - All servers or specific ones
  • Keyword Match - Filter by event description

Alert Features

  • Confirmation Period - Wait N minutes before alerting (avoid false positives)
  • Resolution Alerts - Get notified when issues are resolved
  • Rate Limiting - Prevent alert floods
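The confirmation period and resolution alerts can be pictured as a small state machine: a condition must stay active for N seconds before an alert fires, and a firing alert produces a resolution notice once the condition clears. The sketch below illustrates that idea only. `ConfirmingAlert` is a hypothetical class and does not describe IPMI Monitor's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConfirmingAlert:
    confirm_seconds: float
    first_seen: dict = field(default_factory=dict)  # key -> time condition began
    firing: set = field(default_factory=set)        # keys that have alerted

    def observe(self, key: str, active: bool, now: float) -> Optional[str]:
        """Record one check result; return 'fire', 'resolve', or None."""
        if active:
            start = self.first_seen.setdefault(key, now)
            if key not in self.firing and now - start >= self.confirm_seconds:
                self.firing.add(key)
                return "fire"
            return None
        # Condition cleared: forget the start time so short blips never alert.
        self.first_seen.pop(key, None)
        if key in self.firing:
            self.firing.discard(key)
            return "resolve"
        return None
```

With `confirm_seconds=300`, a temperature spike that clears after one minute returns None on every check, while one that persists for five minutes returns 'fire' once and 'resolve' when it ends.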

📋 API Reference

Public Endpoints

| Endpoint | Description |
|----------|-------------|
| GET / | Dashboard |
| GET /api/servers | List servers |
| GET /api/events | Get events (filterable) |
| GET /api/stats | Dashboard stats |
| GET /api/sensors/{bmc_ip} | Sensor readings |
| GET /metrics | Prometheus metrics |
| GET /health | Health check |
| GET /api/version | Current version info |
| GET /api/version/check | Check for updates |
| POST /api/server/{bmc_ip}/bmc/{action} | BMC reset (cold/warm/info) |
| GET /api/server/{bmc_ip}/ssh-logs | Get SSH system logs |
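The read-only endpoints can be called from a script with the Python standard library alone. This is a minimal sketch: `sensors_url` and `get_json` are hypothetical helpers, the base URL assumes the default port 5000, and the sketch assumes the /api/ endpoints return JSON, whose shape is not documented here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def sensors_url(base_url: str, bmc_ip: str) -> str:
    """Build the URL for GET /api/sensors/{bmc_ip}."""
    return f"{base_url.rstrip('/')}/api/sensors/{quote(bmc_ip, safe='')}"


def get_json(url: str, timeout: float = 10):
    """Fetch a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage against a running instance (not executed here):
#   base = "http://localhost:5000"
#   servers = get_json(f"{base}/api/servers")
#   sensors = get_json(sensors_url(base, "192.168.1.80"))
```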

Admin Endpoints (login required)

| Endpoint | Description |
|----------|-------------|
| POST /api/collect | Trigger collection |
| POST /api/servers/add | Add server |
| DELETE /api/servers/{bmc_ip} | Delete server |
| GET /api/backup | Full configuration backup |
| POST /api/restore | Restore from backup |

🔒 Security

IPMI Monitor is designed with security in mind for production datacenter environments:

Credential Protection

  • No Command-Line Exposure - IPMI passwords use environment variables (IPMI_PASSWORD), not -P flags
  • SSH Key Isolation - SSH private keys stored in temporary files with 0600 permissions
  • Password Masking - Passwords passed via SSHPASS environment variable, not command line
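The first point relies on ipmitool's -E option, which reads the password from the IPMI_PASSWORD environment variable, so it never appears in the process list. Below is a minimal sketch of invoking ipmitool that way from Python. `build_ipmitool_call` and `run_ipmitool` are hypothetical helpers and are not taken from IPMI Monitor's source.

```python
import os
import subprocess


def build_ipmitool_call(host: str, user: str, password: str, *args: str):
    """Return (argv, env) for an ipmitool call that keeps the password off argv."""
    # -E tells ipmitool to take the password from IPMI_PASSWORD instead of -P.
    argv = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *args]
    env = dict(os.environ, IPMI_PASSWORD=password)
    return argv, env


def run_ipmitool(host: str, user: str, password: str, *args: str, timeout: float = 30):
    """Run ipmitool and return the CompletedProcess (requires ipmitool installed)."""
    argv, env = build_ipmitool_call(host, user, password, *args)
    return subprocess.run(
        argv, env=env, capture_output=True, text=True, timeout=timeout
    )


# Usage (not executed here):
#   result = run_ipmitool("192.168.1.80", "admin", "secret", "sel", "list")
#   print(result.stdout)
```

The same pattern applies to SSH password authentication: `sshpass -e` reads the password from the SSHPASS environment variable.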

Data Handling

  • Local-First - All data stored locally in SQLite
  • No Credential Sync - Credentials are never sent externally

Access Control

  • Role-Based Access - Admin vs read-only user levels
  • Session Management - Secure Flask sessions with configurable secret key
  • API Authentication - Protected endpoints require authentication

Best Practices

environment:
  - SECRET_KEY=your-random-32-char-key  # Always set this!
  - ADMIN_PASS=strong-unique-password   # Change from default
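A suitable SECRET_KEY value can be generated with Python's standard `secrets` module. This is a small sketch, and `generate_secret_key` is a hypothetical helper name. Sixteen random bytes give a 32-character hex string, matching the length suggested above.

```python
import secrets


def generate_secret_key(num_bytes: int = 16) -> str:
    """Return a random hex string; 16 bytes yields 32 hex characters."""
    return secrets.token_hex(num_bytes)


# Usage: print(generate_secret_key()) and paste the result into SECRET_KEY
```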

🔑 Password Recovery

IPMI Monitor is self-hosted - there's no central server to reset your password. Since you have root access, you can reset it directly:

# Quick password reset (run on your server)
docker exec -i ipmi-monitor python3 << 'EOF'
from werkzeug.security import generate_password_hash
import sqlite3
new_password = "your_new_password"  # CHANGE THIS
conn = sqlite3.connect('/app/data/ipmi_monitor.db')
cur = conn.execute("UPDATE user SET password_hash = ? WHERE username = 'admin'",
                   (generate_password_hash(new_password),))
conn.commit()
updated = cur.rowcount  # 0 means no user named 'admin' exists
conn.close()
print("✅ Admin password updated!" if updated else "No 'admin' user found")
EOF

📖 See User Guide - Password Recovery for detailed instructions and a reusable script.


🛠️ Developer Guide

See DEVELOPER_GUIDE.md for:

  • Git workflow (develop/main branches)
  • Release process
  • Docker tag conventions
  • CI/CD pipeline details

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing)
  5. Open a Pull Request

📜 License

MIT License - see LICENSE for details.


Made with ❤️ by CryptoLabs
