A lightweight HPC monitoring and predictive analytics tool
This project has been archived by its maintainers; no new releases are expected.
Project description
NØMADE
NØde MAnagement DEvice — A lightweight HPC monitoring and predictive analytics tool.
"Travels light, adapts to its environment, and doesn't need permanent infrastructure."
Overview
NØMADE is a lightweight, self-contained monitoring and prediction system for HPC clusters. Unlike heavyweight monitoring solutions that require complex infrastructure, NØMADE is designed to be deployed quickly, run with minimal resources, and provide actionable insights through both real-time alerts and predictive analytics.
Key Features
- Real-time Monitoring: Track disk usage, SLURM queues, node health, license servers, and job metrics
- Derivative Analysis: Detect accelerating trends before they become critical (not just threshold alerts)
- Predictive Analytics: ML-based job health prediction using similarity networks
- Actionable Recommendations: Data-driven defaults and user-specific suggestions
- 3D Visualization: Interactive network visualization with safe/danger zones
- Lightweight: SQLite database, minimal dependencies, no external services required
Philosophy
NØMADE is inspired by nomadic principles:
- Travels light: Minimal dependencies, single SQLite database, no complex infrastructure
- Adapts to its environment: Configurable collectors, flexible alert rules, cluster-agnostic
- Leaves no trace: Clean uninstall, no system modifications required (except optional SLURM hooks)
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ NØMADE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ALERT DISPATCHER │ │
│ │ Email · Slack · Webhook · Dashboard │ │
│ └─────────────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────┴───────────────────────────────────┐ │
│ │ ALERT ENGINE │ │
│ │ Rules · Derivatives · Deduplication · Cooldowns │ │
│ └─────────────────────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌──────────────────────┴──────────────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────────┐ │
│ │ MONITORING ENGINE │ │ PREDICTION ENGINE │ │
│ │ Threshold-based │ │ Similarity networks │ │
│ │ Immediate alerts │ │ 17-dim feature space │ │
│ └─────────┬───────────┘ └─────────────┬───────────┘ │
│ │ │ │
│ └──────────────────┬───────────────────────┘ │
│ │ │
│ ┌────────────────────────────┴────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ SQLite · Time-series · Job History · I/O Samples │ │
│ └────────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┴─────────────────────────────────────┐ │
│ │ COLLECTORS │ │
│ │ disk│slurm│job_metrics│iostat│mpstat│vmstat│node_state│gpu│nfs │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Data Collection Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ NØMADE Data Collection │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ SYSTEM COLLECTORS (every 60s): │
│ ┌──────────────┬─────────────────────────────────────────────────────────┐ │
│ │ disk │ Filesystem usage (total, used, free, projections) │ │
│ │ iostat │ Device I/O: %iowait, utilization, latency │ │
│ │ mpstat │ Per-core CPU: utilization, imbalance detection │ │
│ │ vmstat │ Memory pressure, swap activity, blocked processes │ │
│ │ nfs │ NFS I/O: ops/sec, throughput, RTT, retransmissions │ │
│ │ gpu │ NVIDIA GPU: utilization, memory, temperature, power │ │
│ └──────────────┴─────────────────────────────────────────────────────────┘ │
│ │
│ SLURM COLLECTORS (every 60s): │
│ ┌──────────────┬─────────────────────────────────────────────────────────┐ │
│ │ slurm │ Queue state: pending, running, partition stats │ │
│ │ job_metrics │ sacct data: CPU/mem efficiency, health scores │ │
│ │ node_state │ Node allocation, drain reasons, CPU load, memory │ │
│ └──────────────┴─────────────────────────────────────────────────────────┘ │
│ │
│ JOB MONITOR (every 30s): │
│ ┌──────────────┬─────────────────────────────────────────────────────────┐ │
│ │ job_monitor │ Per-job I/O: NFS vs local writes from /proc/[pid]/io │ │
│ └──────────────┴─────────────────────────────────────────────────────────┘ │
│ │
│ FEATURE VECTOR (17 dimensions for similarity analysis): │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ From sacct: From iostat: From vmstat: │ │
│ │ 1. health_score 11. avg_iowait 17. memory_pressure │ │
│ │ 2. cpu_efficiency 12. peak_iowait 18. swap_activity │ │
│ │ 3. memory_efficiency 13. device_util 19. procs_blocked │ │
│ │ 4. used_gpu │ │
│ │ 5. had_swap From mpstat: │ │
│ │ 14. avg_core_busy │ │
│ │ From job_monitor: 15. imbalance_ratio │ │
│ │ 6. total_write_gb 16. max_core_busy │ │
│ │ 7. write_rate_mbps │ │
│ │ 8. nfs_ratio │ │
│ │ 9. runtime_minutes │ │
│ │ 10. write_intensity │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Collector Details
| Collector | Source | Data Collected | Graceful Skip |
|---|---|---|---|
| disk | shutil.disk_usage | Filesystem total/used/free, projections | No |
| slurm | squeue, sinfo | Queue depth, partition stats, wait times | No |
| job_metrics | sacct | Job history, CPU/mem efficiency, health scores | No |
| iostat | iostat -x | %iowait, device utilization, latency | No |
| mpstat | mpstat -P ALL | Per-core CPU, imbalance ratio, saturation | No |
| vmstat | vmstat | Memory pressure, swap, blocked processes | No |
| node_state | scontrol show node | Node allocation, drain reasons, CPU load | No |
| gpu | nvidia-smi | GPU util, memory, temp, power | Yes (if no GPU) |
| nfs | nfsiostat | NFS ops/sec, throughput, RTT | Yes (if no NFS) |
| job_monitor | /proc/[pid]/io | Per-job NFS vs local I/O attribution | No |
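For illustration, here is a minimal sketch of the sampling idea behind the job_monitor collector: read cumulative write_bytes from /proc/[pid]/io and difference successive samples. The helper names are hypothetical (not NØMADE's internal API), and the NFS-vs-local attribution step, which maps the job's paths against the configured nfs_paths and local_paths, is omitted.
```python
# Minimal sketch of per-process write sampling from /proc/[pid]/io.
# Illustration only: the helper names are hypothetical, not NØMADE's
# job_monitor API, and the NFS-vs-local attribution step (mapping the job's
# paths against the configured nfs_paths / local_paths) is omitted.
import os
import time
from pathlib import Path

def read_write_bytes(pid: int) -> int:
    """Cumulative bytes written by a process, parsed from /proc/[pid]/io."""
    for line in Path(f"/proc/{pid}/io").read_text().splitlines():
        if line.startswith("write_bytes:"):
            return int(line.split(":", 1)[1])
    return 0

def sample_write_rate(pid: int, interval: int = 30) -> float:
    """Approximate write throughput (MB/s) over one sampling interval."""
    before = read_write_bytes(pid)
    time.sleep(interval)
    after = read_write_bytes(pid)
    return (after - before) / interval / 1e6

if __name__ == "__main__":
    # Sample this process itself so the example runs anywhere with procfs.
    print(f"{sample_write_rate(os.getpid(), interval=1):.3f} MB/s")
```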
Two Engines, One System
- Monitoring Engine: Real-time threshold and derivative-based alerts
  - Catches immediate issues (disk full, node down, stuck jobs)
  - Uses first and second derivatives for early warning
  - "Your disk fill rate is accelerating — full in 3 days, not 10"
- Prediction Engine: Pattern-based ML analytics
  - Catches patterns before they become issues
  - Uses job similarity networks and health prediction
  - "Jobs with your I/O pattern have 72% failure rate"
Monitoring Capabilities
Disk Storage
- Filesystem usage monitoring (/, /home, /scratch, /project)
- Per-user and per-group quota tracking
- Fill rate calculation and projection
- Derivative analysis: Detect accelerating growth before thresholds trigger
- Orphan file and stale data detection
- Localscratch cleanup verification
SLURM Queue
- Queue depth and wait time tracking
- Stuck and zombie job detection
- Node drain status monitoring
- Fairshare imbalance alerts
- Pending job analysis (why is my job waiting?)
- Job array health monitoring
Node Health
- Node up/down/drain status
- Hardware error detection (ECC, GPU, disk)
- Temperature monitoring (CPU, GPU)
- NFS mount health
- Service status (slurmctld, slurmd, munge)
- Network connectivity checks
License Servers
- FlexLM and RLM license tracking
- Real-time availability monitoring
- Usage pattern analysis
- Server connectivity alerts
- Expiration warnings
Job Metrics
- Per-job resource usage (CPU, memory, GPU)
- I/O patterns (NFS vs local storage)
- Runtime and efficiency metrics
- Collected via SLURM prolog/epilog hooks
Prediction Capabilities
17-Dimension Feature Vector
NØMADE builds job similarity networks using a comprehensive feature vector that captures multiple aspects of job behavior:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Feature Vector Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ JOB OUTCOME (from sacct): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ health_score │ 0.0 (catastrophic) → 1.0 (perfect) │ │
│ │ cpu_efficiency │ actual/requested CPU utilization │ │
│ │ memory_efficiency │ actual/requested memory utilization │ │
│ │ used_gpu │ job utilized GPU resources │ │
│ │ had_swap │ job triggered swap usage │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ I/O BEHAVIOR (from job_monitor): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ total_write_gb │ total data written during job │ │
│ │ write_rate_mbps │ peak write throughput │ │
│ │ nfs_ratio │ NFS writes / total writes (0-1) │ │
│ │ runtime_minutes │ job duration │ │
│ │ write_intensity │ GB written per minute │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ SYSTEM I/O STATE (from iostat, correlated to job runtime): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ avg_iowait │ average %iowait during job │ │
│ │ peak_iowait │ maximum %iowait spike │ │
│ │ device_util │ average device utilization │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ CPU DISTRIBUTION (from mpstat, correlated to job runtime): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ avg_core_busy │ average CPU utilization across cores │ │
│ │ imbalance_ratio │ std/avg busy (higher = more imbalance) │ │
│ │ max_core_busy │ hottest core utilization │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ MEMORY PRESSURE (from vmstat, correlated to job runtime): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ memory_pressure │ composite pressure indicator (0-1) │ │
│ │ swap_activity │ peak swap in+out (KB/s) │ │
│ │ procs_blocked │ avg processes blocked on I/O │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Quantitative Similarity Network
- Raw quantitative metrics: No arbitrary thresholds or binary labels
- Non-redundant features: vram_gb > 0 implies GPU used (no separate flag)
- Cosine similarity: Z-score normalized feature vectors, threshold ≥ 0.7 (see the sketch below)
- Continuous health score: 0 (catastrophic) → 1 (perfect), not binary
- Time-correlated system state: iostat/mpstat/vmstat data aligned to job runtime
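As a rough illustration of the pipeline described above (z-score normalization followed by cosine similarity with a ≥ 0.7 edge threshold), the following NumPy sketch builds edges between similar jobs. The function name and the example features are illustrative only, not NØMADE's internal API.
```python
# Minimal sketch of the similarity computation: z-score normalize job feature
# vectors, then connect jobs whose cosine similarity is >= 0.7.
import numpy as np

def similarity_edges(features: np.ndarray, threshold: float = 0.7):
    """features: (n_jobs, n_features) raw metrics; returns a list of (i, j, sim)."""
    # Z-score normalize each feature column (guard against zero variance).
    std = features.std(axis=0)
    std[std == 0] = 1.0
    z = (features - features.mean(axis=0)) / std
    # Cosine similarity between every pair of normalized job vectors.
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = z / norms
    sim = unit @ unit.T
    return [(i, j, float(sim[i, j]))
            for i in range(len(sim)) for j in range(i + 1, len(sim))
            if sim[i, j] >= threshold]

# Example: three jobs described by (write_gb, nfs_ratio, avg_iowait, health_score).
jobs = np.array([[120.0, 0.9, 35.0, 0.20],
                 [110.0, 0.8, 30.0, 0.30],
                 [  2.0, 0.1,  1.0, 0.95]])
print(similarity_edges(jobs))   # the two heavy-NFS jobs end up connected
```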
Simulation & Validation
- Generative model: Learn distributions from empirical data
- Simulation cloud: Thousands of synthetic jobs for coverage validation
- Anomaly detection: Real jobs outside simulation bounds
- Temporal drift: Monitor for model staleness
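The sketch below illustrates the simulation idea under a deliberately simple assumption (each feature modeled as an independent Gaussian); NØMADE's actual generative model may be richer, but the flow is the same: learn distributions from observed jobs, sample a synthetic cloud, and flag real jobs that fall outside its bounds.
```python
# Illustrative sketch of simulation-based validation, assuming independent
# Gaussian features. Not NØMADE's actual generative model.
import numpy as np

rng = np.random.default_rng(0)

def simulate_cloud(observed: np.ndarray, n_synthetic: int = 5000) -> np.ndarray:
    """Sample synthetic jobs from per-feature distributions learned from data."""
    mu, sigma = observed.mean(axis=0), observed.std(axis=0) + 1e-9
    return rng.normal(mu, sigma, size=(n_synthetic, observed.shape[1]))

def outside_simulation(real: np.ndarray, cloud: np.ndarray) -> np.ndarray:
    """Flag real jobs that fall outside the min/max envelope of the cloud."""
    lo, hi = cloud.min(axis=0), cloud.max(axis=0)
    return np.any((real < lo) | (real > hi), axis=1)

observed = rng.normal(0.5, 0.1, size=(200, 4))      # stand-in historical jobs
cloud = simulate_cloud(observed)
new_jobs = np.array([[0.5, 0.5, 0.5, 0.5], [5.0, 0.5, 0.5, 0.5]])
print(outside_simulation(new_jobs, cloud))           # second job is anomalous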
Error Analysis & Defaults
- Type 1 errors (false alarms): Predicted failure, actually succeeded
- Type 2 errors (missed failures): Predicted success, actually failed
- Threshold optimization: Balance alert fatigue vs missed problems
- Data-driven defaults: "Use localscratch → +23% success rate"
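A toy sketch of the threshold trade-off follows: sweep a cutoff on predicted health and count type 1 (false alarm) and type 2 (missed failure) rates. The data and function here are synthetic and illustrative only, not NØMADE's own analysis code.
```python
# Sweep a health-score cutoff and report false-alarm vs missed-failure rates.
import numpy as np

def error_rates(pred_health: np.ndarray, actually_failed: np.ndarray, threshold: float):
    """Predict failure when predicted health < threshold; return (type1, type2) rates."""
    predicted_fail = pred_health < threshold
    type1 = np.mean(predicted_fail & ~actually_failed)    # false alarms
    type2 = np.mean(~predicted_fail & actually_failed)    # missed failures
    return float(type1), float(type2)

rng = np.random.default_rng(1)
pred = rng.uniform(0, 1, 1000)
failed = pred + rng.normal(0, 0.2, 1000) < 0.4           # failures loosely track low health
for t in (0.3, 0.5, 0.7):
    t1, t2 = error_rates(pred, failed, t)
    print(f"threshold={t}: false alarms {t1:.2%}, missed failures {t2:.2%}")
```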
Visualization
- 3D network visualization: Three.js interactive display
- Axes: NFS Write / Local Write / I/O Wait
- Safe zone: Low NFS, high local, low I/O wait (green region)
- Danger zone: High NFS, low local, high I/O wait (red region)
- Real-time tracking: Watch jobs move through feature space
Derivative Analysis
A key innovation in NØMADE is the use of first and second derivatives for early warning:
VALUE (0th derivative): "Disk is at 850 GB"
FIRST DERIVATIVE: "Disk is filling at 15 GB/day"
SECOND DERIVATIVE: "Fill rate is ACCELERATING at 3 GB/day²"
Why Second Derivatives Matter
Traditional threshold alerts only trigger when a value crosses a limit. By monitoring the second derivative (acceleration), NØMADE can detect:
- Exponential growth: Before linear projections underestimate
- Sudden changes: Spikes in usage patterns
- Developing problems: I/O storms, memory leaks, cascading failures
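To make this concrete, here is a minimal sketch (not NØMADE's own estimator) that computes fill rate and acceleration from recent disk samples using finite differences, then projects time-to-full under constant acceleration.
```python
# Estimate fill rate (first derivative) and acceleration (second derivative)
# from a short disk-usage time series, then project time-to-full.
# Illustrative only; a production estimator would smooth over longer windows.
import numpy as np

def disk_derivatives(days: np.ndarray, used_gb: np.ndarray):
    """Return (rate GB/day, acceleration GB/day²) at the latest sample."""
    rate = np.gradient(used_gb, days)     # first derivative
    accel = np.gradient(rate, days)       # second derivative
    return float(rate[-1]), float(accel[-1])

def days_until_full(used: float, capacity: float, rate: float, accel: float) -> float:
    """Smallest positive t solving used + rate*t + 0.5*accel*t^2 = capacity."""
    roots = np.roots([0.5 * accel, rate, used - capacity]) if accel else [(capacity - used) / rate]
    positive = [r.real for r in np.atleast_1d(roots) if abs(r.imag) < 1e-9 and r.real > 0]
    return min(positive) if positive else float("inf")

days = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
used = np.array([790.0, 805.0, 822.0, 841.0, 862.0])    # accelerating growth
rate, accel = disk_derivatives(days, used)
print(f"rate={rate:.1f} GB/day, accel={accel:.1f} GB/day², "
      f"full in {days_until_full(used[-1], 1000, rate, accel):.1f} days")
```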
Applications
| Metric | Accelerating (d²>0) | Decelerating (d²<0) |
|---|---|---|
| Disk usage | ! Exponential fill | OK Cleanup in progress |
| Queue depth | ! System issue | OK Draining normally |
| Failure rate | ! Cascading problem | OK Issue resolving |
| NFS latency | ! I/O storm developing | OK Load decreasing |
| Job memory | ! Memory leak / OOM | OK Normal variation |
| GPU temp | ! Cooling issue | OK Throttling working |
Installation
Requirements
- Python 3.9+
- SQLite 3.35+
- SLURM (for queue and job monitoring)
- sysstat package (iostat, mpstat)
- procps package (vmstat) - usually pre-installed
Optional:
- nvidia-smi (for GPU monitoring)
- nfs-common with nfsiostat (for NFS monitoring)
- Root access (for cgroup metrics)
System Check
After installation, verify all requirements:
nomade syscheck
Expected output:
NØMADE System Check
════════════════════════════════════════
Python:
OK Version 3.10.12 (requires >=3.9)
OK Required packages installed
SLURM:
OK sinfo available
OK squeue available
OK sacct available
OK sstat available
OK slurmdbd enabled
OK JobAcctGather configured
System Tools:
OK iostat available
OK mpstat available
OK vmstat available
○ nvidia-smi not found (no GPU monitoring)
○ nfsiostat not found (no NFS monitoring)
OK /proc/[pid]/io accessible
Database:
OK SQLite available
OK Database: /var/lib/nomade/nomade.db
OK Schema version: 2
Config:
OK Config: /etc/nomade/nomade.toml
────────────────────────────────────────
OK All checks passed!
Quick Start
Try it now (no HPC required):
pip install nomade-hpc
nomade demo
This generates synthetic data and launches the dashboard at http://localhost:5000
For production HPC deployment:
pip install nomade-hpc
nomade init
nomade collect # Start data collection
nomade dashboard # Launch web interface
Or install from source:
git clone https://github.com/jtonini/nomade.git
cd nomade
pip install -e .
nomade demo # Test with synthetic data
SLURM Integration (Optional)
For per-job metrics collection, install prolog/epilog hooks:
```bash
# Copy hooks to SLURM configuration
sudo cp scripts/prolog.sh /etc/slurm/prolog.d/nomade.sh
sudo cp scripts/epilog.sh /etc/slurm/epilog.d/nomade.sh
# Update slurm.conf
# Prolog=/etc/slurm/prolog.d/*
# Epilog=/etc/slurm/epilog.d/*
# Restart SLURM
sudo systemctl restart slurmctld
```
Configuration
NØMADE uses a TOML configuration file:
# nomade.toml
[general]
cluster_name = "mycluster"
data_dir = "/var/lib/nomade"
log_level = "INFO"
[collectors]
# All collectors enabled by default
# Set enabled = false to disable specific collectors
[collectors.disk]
enabled = true
filesystems = ["/", "/home", "/scratch", "/localscratch"]
[collectors.slurm]
enabled = true
partitions = ["standard", "debug", "gpu", "highmem"]
[collectors.job_metrics]
enabled = true
lookback_hours = 24
min_runtime_seconds = 10
[collectors.iostat]
enabled = true
# devices = ["sda", "nvme0n1"] # Optional: specific devices only
[collectors.mpstat]
enabled = true
store_per_core = true
store_summary = true
[collectors.vmstat]
enabled = true
[collectors.node_state]
enabled = true
# nodes = ["node001", "node002"] # Optional: specific nodes only
[collectors.gpu]
enabled = true # Gracefully skipped if no nvidia-smi
[collectors.nfs]
enabled = true # Gracefully skipped if no nfsiostat
[monitor]
# Job I/O monitor settings
sample_interval = 30
nfs_paths = ["/home", "/scratch", "/project"]
local_paths = ["/localscratch", "/tmp", "/dev/shm"]
port = 27001
[alerts]
# Alert dispatch configuration
email_enabled = true
email_to = ["admin@example.edu"]
email_from = "nomade@cluster.example.edu"
smtp_host = "smtp.example.edu"
slack_enabled = false
slack_webhook = ""
# Alert thresholds
disk_warning_percent = 85
disk_critical_percent = 95
queue_stuck_days = 7
gpu_temp_warning = 83
[alerts.derivatives]
# Second derivative thresholds
disk_acceleration_warning = 1.0 # GB/day²
queue_acceleration_warning = 5 # jobs/hour²
[prediction]
# Prediction engine settings
enabled = true
similarity_threshold = 0.7
health_threshold = 0.5
retrain_interval_days = 7
[dashboard]
host = "0.0.0.0"
port = 8080
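For a quick sanity check of a config file like the one above, a small reader sketch is shown below. It assumes Python 3.11's standard-library tomllib (or the tomli backport on 3.9/3.10) and is not NØMADE's own config module.
```python
# List which collectors are enabled in nomade.toml. Reader sketch only.
try:
    import tomllib             # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib    # pip install tomli on older interpreters

with open("nomade.toml", "rb") as f:
    cfg = tomllib.load(f)

for name, section in cfg.get("collectors", {}).items():
    if isinstance(section, dict):
        state = "enabled" if section.get("enabled", True) else "disabled"
        print(f"{name:12s} {state}")
```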
Usage
Command Line Interface
# System status overview
nomade status # Full system status with all metrics
nomade syscheck # Verify system requirements
# Data collection
nomade collect --once # Single collection cycle
nomade collect --interval 60 # Continuous collection
nomade collect -C disk,slurm # Specific collectors only
# Job I/O monitoring
nomade monitor # Monitor running jobs for I/O
nomade monitor --once # Single snapshot
nomade monitor -i 30 # 30-second interval
# Analysis
nomade disk /home --hours 24 # Filesystem trend analysis
nomade jobs --user jsmith # Recent job history
nomade similarity # Job similarity analysis
nomade similarity --find-similar 12345 # Find similar jobs
nomade similarity --export viz.json # Export for visualization
# Alerts
nomade alerts # View recent alerts
nomade alerts --unresolved # Only unresolved alerts
Bash Helper Functions
Source the helper script for convenient shortcuts:
source ~/nomade/scripts/nomade.sh
nhelp # Show all commands
| Command | Description |
|---|---|
| nstatus | Quick status overview |
| nwatch [s] | Live status updates (every s seconds) |
| ndisk PATH | Filesystem trend analysis |
| njobs | Recent job history |
| nsimilarity | Job similarity analysis |
| nalerts | View alerts |
| ncollect | Run data collection |
| nmonitor | Job I/O monitoring |
| nsyscheck | System requirements check |
| nlog | Tail collection log |
Status Output
═══ NØMADE Status ═══
Filesystems:
/ [██████████░░░░░░░░░░] 51.4% (34.02/66.26 GB)
/home [██████████░░░░░░░░░░] 51.4% (34.02/66.26 GB)
Queue:
standard Running: 4 Pending: 12
gpu Running: 2 Pending: 3
I/O:
CPU iowait: 2.3%
CPU user/sys: 45.2% / 3.1%
vda util: 15.2% write: 1240 KB/s latency: 4.2ms
CPU Cores:
Cores: 32
Avg busy: 48.2%
Range: 12.0% - 98.5% (spread: 86.5%)
Imbalance: 0.42 (std/avg)
Saturated: 4 (>95% busy)
Memory:
Free: 12.45 GB
Cache: 48.23 GB
Swap used: 128 MB
Pressure: 0.15
Nodes:
node001 MIXED CPU: 28/32 (88%) Mem: 92% Load: 27.4
node002 ALLOCATED CPU: 32/32 (100%) Mem: 98% Load: 31.2
node003 DRAIN CPU: 0/32 (0%) Mem: 0% Load: 0.01
└─ Reason: GPU memory errors - investigating
Collection:
disk 1440 runs 100% success
iostat 1440 runs 100% success
mpstat 1440 runs 100% success
vmstat 1440 runs 100% success
slurm 1440 runs 100% success
job_metrics 1440 runs 100% success
node_state 1440 runs 100% success
Python API
from nomade import Nomade
# Initialize
nm = Nomade(config_path='nomade.toml')
# Get current disk status
disk_status = nm.collectors.disk.get_status()
for fs in disk_status:
print(f"{fs['path']}: {fs['used_pct']:.1f}%")
# Analyze trends
analysis = nm.analysis.analyze_disk('/scratch')
print(f"Fill rate: {analysis['first_derivative']:.1f} GB/day")
print(f"Acceleration: {analysis['second_derivative']:.2f} GB/day²")
print(f"Trend: {analysis['trend']}")
# Predict job health
prediction = nm.prediction.predict_job(job_metrics)
print(f"Predicted health: {prediction['health']:.2f}")
print(f"Risk level: {prediction['risk_level']}")
print(f"Recommendations: {prediction['recommendations']}")
# Get recommendations for a user
recs = nm.prediction.recommend_for_user('alice')
for rec in recs:
print(f"- {rec['message']}")
Repository Structure
nomade/
├── README.md # This file
├── LICENSE # AGPL v3
├── pyproject.toml # Package configuration
├── requirements.txt # Dependencies
├── nomade.toml.example # Example configuration
│
├── nomade/ # Main package
│ ├── __init__.py
│ ├── cli.py # Command-line interface
│ ├── daemon.py # Main monitoring daemon
│ ├── config.py # Configuration handling
│ │
│ ├── collectors/ # Data collectors
│ │ ├── __init__.py
│ │ ├── base.py # Base collector class
│ │ ├── disk.py # Disk & quota monitoring
│ │ ├── slurm.py # SLURM queue & jobs
│ │ ├── nodes.py # Node health
│ │ ├── licenses.py # License servers
│ │ ├── jobs.py # Per-job metrics
│ │ └── network.py # Network monitoring
│ │
│ ├── db/ # Database layer
│ │ ├── __init__.py
│ │ ├── schema.sql # SQLite schema
│ │ ├── models.py # Data models
│ │ └── queries.py # Common queries
│ │
│ ├── analysis/ # Analysis utilities
│ │ ├── __init__.py
│ │ ├── derivatives.py # Derivative calculations
│ │ ├── projections.py # Trend projections
│ │ └── timeseries.py # Time-series utilities
│ │
│ ├── alerts/ # Alert system
│ │ ├── __init__.py
│ │ ├── engine.py # Alert evaluation
│ │ ├── rules.py # Alert rule definitions
│ │ └── dispatch.py # Email/Slack/webhook
│ │
│ ├── prediction/ # ML prediction
│ │ ├── __init__.py
│ │ ├── similarity.py # Cosine similarity
│ │ ├── network.py # Similarity network
│ │ ├── health.py # Health score prediction
│ │ ├── simulation.py # Simulation model
│ │ ├── errors.py # Type 1/2 error analysis
│ │ └── recommendations.py # Defaults generation
│ │
│ └── viz/ # Visualization
│ ├── __init__.py
│ ├── dashboard.py # Web dashboard
│ └── static/ # React frontend
│ ├── index.html
│ └── components/
│ ├── Network3D.jsx
│ ├── DiskStatus.jsx
│ ├── QueueStatus.jsx
│ └── Alerts.jsx
│
├── scripts/ # Utility scripts
│ ├── prolog.sh # SLURM prolog hook
│ ├── epilog.sh # SLURM epilog hook
│ └── install_hooks.sh # Hook installer
│
├── tests/ # Test suite
│ ├── __init__.py
│ ├── test_collectors.py
│ ├── test_analysis.py
│ ├── test_alerts.py
│ └── test_prediction.py
│
└── docs/ # Documentation
├── installation.md
├── configuration.md
├── collectors.md
├── alerts.md
├── prediction.md
└── api.md
Theoretical Background
NØMADE's prediction engine is inspired by biogeographical network analysis, particularly the work of Vilhena & Antonelli (2015) on mapping biomes using species occurrence data.
Biogeography → HPC Analogy
| Biogeography | HPC Infrastructure |
|---|---|
| Species | Jobs |
| Geographic regions | Resources (nodes, storage) |
| Biomes | Emergent behavior clusters |
| Species ranges | Job resource usage patterns |
| Transition zones | Domain boundaries (CPU↔GPU, NFS↔local) |
Key Insight
Just as biogeographical regions emerge from species distribution data rather than being predefined, NØMADE allows behavior patterns to emerge from job metrics rather than imposing arbitrary categories.
Dual-View Analysis
- Data space: Jobs as points in feature space, clustered by similarity
- Real space: Jobs mapped to physical resources, showing actual infrastructure usage
Roadmap
Phase 1: Monitoring Foundation ✓
- Design architecture
- Define data model
- Implement collectors (disk, SLURM, GPU, NFS, iostat, vmstat, mpstat)
- Implement alert engine
- Basic dashboard
Phase 2: Prediction Engine ✓
- Cosine similarity network (default), Simpson available for biogeographical analysis
- Failure classification (8 classes: SUCCESS, TIMEOUT, FAILED, OOM, etc.)
- Simulation framework (VM-based SLURM simulation)
- Clustering analysis (assortativity, SES.MNTD, neighborhood purity)
- Hotspot detection (failure-correlated feature bins)
Phase 3: Visualization ✓
- 3D network visualization (Three.js force-directed layout)
- Interactive dashboard with cluster/network views
- PCA view for emergent patterns
- Clustering quality panel
- ML Risk panel with high-risk job display
Phase 4: Advanced ML ✓
- GNN for network-aware prediction (PyTorch Geometric)
- LSTM for temporal pattern detection
- Autoencoder for anomaly detection (100% precision)
- Ensemble methods (weighted voting)
- Model persistence (save/load from database)
- CLI commands (train, predict, report)
- Real-time scoring hook (SLURM prolog)
- Continuous learning pipeline
Phase 5: Community
- Multi-cluster federation
- Anonymized data sharing
- Community benchmarks
- JOSS/SoftwareX paper submission
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Development Setup
# Clone and install in development mode
git clone https://github.com/jtonini/nomade.git
cd nomade
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
ruff check .
# Build documentation
cd docs && make html
License
NOMADE is dual-licensed:
- AGPL v3: Free for academic, educational, and open-source use
- Commercial License: Available for proprietary/commercial deployments
See LICENSE for details.
Citation
If you use NOMADE in your research, please cite:
@software{nomade2026,
author = {Tonini, Joao},
title = {NOMADE: A Lightweight HPC Monitoring and Prediction Tool},
year = {2026},
url = {https://github.com/jtonini/nomade}
}
Acknowledgments
- Biogeographical network analysis inspired by Vilhena & Antonelli (2015)
Contact
- Author: João Tonini
- Email: jtonini@richmond.edu
- Issues: GitHub Issues
Download files
File details
Details for the file nomade_hpc-1.1.0.tar.gz.
File metadata
- Download URL: nomade_hpc-1.1.0.tar.gz
- Size: 221.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8e1b9a8903ee34f3361dfa05c2710672f26f5a37e82e90ee64f07d848b6012b2 |
| MD5 | 6a44ba446ec1ba533c61a1ebf73a0af7 |
| BLAKE2b-256 | c6085b596daeea34774693ac22f3004c10c0388c544becb60c8c79bc3e935851 |
Provenance
The following attestation bundles were made for nomade_hpc-1.1.0.tar.gz:
Publisher: publish.yml on jtonini/nomade
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nomade_hpc-1.1.0.tar.gz
- Subject digest: 8e1b9a8903ee34f3361dfa05c2710672f26f5a37e82e90ee64f07d848b6012b2
- Sigstore transparency entry: 870219561
- Permalink: jtonini/nomade@a8facb9bfc1a9d4ecd5ede52d5bd321f6889f7a0
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/jtonini
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a8facb9bfc1a9d4ecd5ede52d5bd321f6889f7a0
- Trigger Event: release
File details
Details for the file nomade_hpc-1.1.0-py3-none-any.whl.
File metadata
- Download URL: nomade_hpc-1.1.0-py3-none-any.whl
- Size: 226.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0b2c2e30b986190faae140044a6515a5c8e72b074ba72632f471b45c166e3612 |
| MD5 | 695fbfa39c2aad72de78753bf35fa448 |
| BLAKE2b-256 | 0c5491f7e45e23a5677ba8a0bda6a88908e099a637113f8a987cb1c0e43b4c05 |
Provenance
The following attestation bundles were made for nomade_hpc-1.1.0-py3-none-any.whl:
Publisher: publish.yml on jtonini/nomade
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: nomade_hpc-1.1.0-py3-none-any.whl
- Subject digest: 0b2c2e30b986190faae140044a6515a5c8e72b074ba72632f471b45c166e3612
- Sigstore transparency entry: 870219617
- Permalink: jtonini/nomade@a8facb9bfc1a9d4ecd5ede52d5bd321f6889f7a0
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/jtonini
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a8facb9bfc1a9d4ecd5ede52d5bd321f6889f7a0
- Trigger Event: release