A lightweight HPC monitoring and predictive analytics tool

NØMAD-HPC

NØde Monitoring And Diagnostics — Lightweight HPC monitoring, visualization, and predictive analytics.

"Travels light, adapts to its environment, and doesn't need permanent infrastructure."


📖 Full Documentation — Installation guides, configuration, CLI reference, network methodology, ML framework, and more.


Quick Start

pip install nomad-hpc
nomad demo                    # Try with synthetic data

For production:

nomad init                    # Configure for your cluster
nomad collect                 # Start data collection
nomad dashboard               # Launch web interface

Features

Feature | Description | Command
Dashboard | Real-time multi-cluster monitoring with partition views | nomad dashboard
Workstation Monitoring | Track departmental workstations (CPU, memory, disk, users) | Dashboard → Workstations
Storage Monitoring | Monitor NFS servers, ZFS pools, IOPS, and client connections | Dashboard → Storage
Interactive Sessions | Monitor RStudio/Jupyter sessions with memory and age | Dashboard → Interactive
Data Readiness | Assess ML model readiness with sample size and variance analysis | nomad readiness
Diagnostics | Analyze network, storage, and node-level bottlenecks | nomad diag
Educational Analytics | Track computational proficiency development | nomad edu explain <job>
Alerts | Threshold + predictive alerts (email, Slack, webhook) | nomad alerts
ML Prediction | Job failure prediction using similarity networks | nomad predict
Insight Engine | Operational narratives from multi-signal analysis | nomad insights brief
Cloud Monitoring | AWS/Azure/GCP metrics with cost and utilization analysis | nomad cloud status
Community Export | Anonymized datasets for cross-institutional research | nomad community export
System Dynamics | Ecological and economic metrics for resource analysis | nomad dyn
Reference | Built-in documentation, code navigation, and search | nomad ref

Architecture

┌────────────────────────────────────────────────────────────────────────────┐
│                                   NØMAD                                    │
├───────────────┬───────────────┬───────────────┬───────────┬────────────────┤
│  Collectors   │   Analysis    │     Viz       │  Alerts   │  Intelligence  │
├───────────────┼───────────────┼───────────────┼───────────┼────────────────┤
│ disk          │ derivatives   │ dashboard     │ thresholds│ insights       │
│ iostat        │ similarity    │ network 3D    │ predictive│ dynamics       │
│ nfs           │ community     │ partitions    │ flapping  │ reference      │
│ slurm         │ ML ensemble   │ workstations  │ email     │ edu scoring    │
│ gpu           │ readiness     │ storage       │ slack     │                │
│ workstation   │ diagnostics   │ interactive   │ webhooks  │                │
│ storage       │               │               │           │                │
│ cloud         │               │               │           │                │
└───────────────┴───────────────┴───────────────┴───────────┴────────────────┘
                                │
                      ┌─────────┴─────────┐
                      │  SQLite Database  │
                      └───────────────────┘
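
The collector-to-SQLite flow in the diagram can be illustrated with a small sketch. This is not NØMAD's collector code: the `disk_samples` table, its columns, and the function names are hypothetical, chosen only to show how a lightweight collector might append samples to a single SQLite file and how an analysis step might read them back.

```python
import shutil
import sqlite3
import time

# Hypothetical schema for illustration; NØMAD's real tables may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS disk_samples (
    ts REAL NOT NULL,
    path TEXT NOT NULL,
    total_bytes INTEGER NOT NULL,
    used_bytes INTEGER NOT NULL
)
"""


def collect_disk(conn, path="/"):
    """Sample disk usage for `path` and append one row to the database."""
    usage = shutil.disk_usage(path)
    conn.execute(
        "INSERT INTO disk_samples (ts, path, total_bytes, used_bytes) "
        "VALUES (?, ?, ?, ?)",
        (time.time(), path, usage.total, usage.used),
    )
    conn.commit()


def latest_used_fraction(conn, path="/"):
    """Return used/total for the most recent sample of `path`, or None."""
    row = conn.execute(
        "SELECT used_bytes, total_bytes FROM disk_samples "
        "WHERE path = ? ORDER BY ts DESC LIMIT 1",
        (path,),
    ).fetchone()
    if row is None or row[1] == 0:
        return None
    return row[0] / row[1]


# An in-memory database keeps the sketch free of side effects.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
collect_disk(conn)
print(latest_used_fraction(conn))
```

Keeping every collector behind the same "append a row" interface is one way a tool can stay free of permanent infrastructure: the only shared state is the database file.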

CLI Reference

Core Commands

nomad init                    # Setup wizard
nomad collect                 # Start collectors
nomad dashboard               # Web interface
nomad dashboard --db file.db  # Use specific database
nomad demo                    # Demo mode with synthetic data
nomad status                  # System status

Data Readiness & Diagnostics

nomad readiness               # Check ML training readiness
nomad readiness -v            # Verbose with feature details
nomad diag network            # Network performance analysis
nomad diag storage            # Storage health and I/O patterns
nomad diag node               # Node-level resource bottlenecks
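
To make "sample size and variance analysis" concrete, here is a minimal sketch of a readiness check. It is an assumption about what such a check could look like, not the logic behind `nomad readiness`; the function name, the feature names, and both thresholds are hypothetical.

```python
from statistics import pvariance


def readiness_report(features, min_samples=30, min_variance=1e-9):
    """Mark a feature ready only if it has enough samples and is not constant.

    `features` maps a feature name to a list of numeric observations.
    Both thresholds are illustrative defaults, not NØMAD's.
    """
    report = {}
    for name, values in features.items():
        n = len(values)
        variance = pvariance(values) if n > 0 else 0.0
        report[name] = {
            "n": n,
            "variance": variance,
            "ready": n >= min_samples and variance > min_variance,
        }
    return report


sample = {
    "runtime_s": [float(i % 7) for i in range(40)],  # enough samples, varied
    "num_gpus": [1.0] * 40,                          # constant, nothing to learn
    "mem_gb": [4.0, 8.0, 16.0],                      # too few samples
}
for name, info in readiness_report(sample).items():
    print(name, info["n"], info["ready"])
```

A constant feature carries no information for a model, and a feature with only a handful of observations gives unstable estimates, which is why both conditions are checked.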

Educational Analytics

nomad edu explain <job_id>    # Job analysis with recommendations
nomad edu trajectory <user>   # User proficiency over time
nomad edu report <group>      # Course/group report

Analysis & Prediction

nomad disk /path              # Filesystem trends
nomad jobs --user <user>      # Job history
nomad similarity              # Network analysis
nomad train                   # Train ML models
nomad predict                 # Run predictions
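
The idea of a job similarity network can be sketched in a few lines. The example below assumes cosine similarity over a small feature vector and a fixed edge threshold; the project's actual metric, features, and thresholds are not stated here, so treat every name and number as hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_edges(jobs, threshold=0.95):
    """Return (job_a, job_b, similarity) for every pair at or above threshold."""
    ids = sorted(jobs)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            sim = cosine_similarity(jobs[a], jobs[b])
            if sim >= threshold:
                edges.append((a, b, sim))
    return edges


def neighbor_failure_rate(job_id, edges, failed):
    """Fraction of a job's network neighbors that failed (None if isolated)."""
    neighbors = [b if a == job_id else a for a, b, _ in edges if job_id in (a, b)]
    if not neighbors:
        return None
    return sum(1 for n in neighbors if n in failed) / len(neighbors)


# Hypothetical features per job: (CPU cores, memory GB, runtime hours).
jobs = {
    "job1": [4.0, 16.0, 2.0],
    "job2": [4.0, 16.5, 2.1],
    "job3": [64.0, 8.0, 0.1],
    "job4": [4.0, 15.0, 1.9],
}
failed = {"job2"}
edges = build_similarity_edges(jobs)
print(neighbor_failure_rate("job1", edges, failed))
```

A job whose neighbors failed often is a candidate for a higher predicted failure risk; a real model would combine this with other signals rather than use it alone.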

Community & Alerts

nomad community export        # Export anonymized data
nomad community preview       # Preview export
nomad alerts                  # View alerts
nomad alerts --unresolved     # Unresolved only
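
The difference between a threshold alert and a predictive alert can be shown with a short sketch. It assumes a least-squares slope over recent samples as the rate estimate; the function names, the 90% threshold, and the 24-hour horizon are hypothetical and are not taken from NØMAD's configuration.

```python
def fill_rate(samples):
    """Least-squares slope of used fraction over time (fraction per second).

    `samples` is a list of (timestamp_seconds, used_fraction) pairs.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def check_disk(samples, threshold=0.90, horizon_s=24 * 3600):
    """Return alert kinds for the latest sample: 'threshold' and/or 'predictive'."""
    alerts = []
    _, current = samples[-1]
    if current >= threshold:
        alerts.append("threshold")
    rate = fill_rate(samples)
    if rate > 0 and current < threshold:
        seconds_to_threshold = (threshold - current) / rate
        if seconds_to_threshold <= horizon_s:
            alerts.append("predictive")
    return alerts


hour = 3600
# Usage grows by 2% per hour and currently sits at 80%.
growing = [(i * hour, 0.70 + 0.02 * i) for i in range(6)]
print(check_disk(growing))
```

The threshold alert fires only once the limit is crossed, while the predictive alert fires earlier when the current trend would cross it within the horizon.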

System Dynamics

nomad dyn summary             # Full dynamics narrative
nomad dyn diversity           # Workload diversity indices
nomad dyn diversity --by partition  # By partition
nomad dyn niche               # Resource overlap between groups
nomad dyn capacity            # Carrying capacity, binding constraint
nomad dyn resilience          # Recovery time after disturbances
nomad dyn externality         # Inter-group impact scoring
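
For readers unfamiliar with the ecological metrics, the sketch below computes two standard diversity indices (Shannon and Gini-Simpson) and Pianka's niche overlap. These are textbook formulas; whether `nomad dyn` uses exactly these indices is an assumption, and the category and group names are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over nonzero categories."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts.values():
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


def simpson_index(counts):
    """Gini-Simpson diversity 1 - sum(p_i^2)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def pianka_overlap(usage_a, usage_b):
    """Pianka's niche overlap between two groups' resource-usage proportions."""
    keys = set(usage_a) | set(usage_b)
    total_a = sum(usage_a.values())
    total_b = sum(usage_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    pa = {k: usage_a.get(k, 0) / total_a for k in keys}
    pb = {k: usage_b.get(k, 0) / total_b for k in keys}
    num = sum(pa[k] * pb[k] for k in keys)
    den = math.sqrt(sum(v * v for v in pa.values()) * sum(v * v for v in pb.values()))
    return num / den if den else 0.0


# Hypothetical job counts per partition and CPU-hours per group.
jobs_by_partition = {"cpu": 50, "gpu": 50}
print(shannon_index(jobs_by_partition), simpson_index(jobs_by_partition))
print(pianka_overlap({"cpu": 3, "gpu": 1}, {"cpu": 6, "gpu": 2}))
```

An overlap near 1 means two groups draw on the same resources in the same proportions and therefore compete directly; an overlap of 0 means they use disjoint resources.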

Insight Engine

nomad insights brief          # Executive summary
nomad insights full           # Comprehensive report
nomad insights signals        # Raw signal detection
nomad insights correlations   # Cross-signal analysis
nomad insights enrich         # Alert enrichment with context
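
Cross-signal analysis can be illustrated with a plain Pearson correlation between metric series. This is a sketch under the assumption that signals are equal-length numeric series; the signal names, the 0.8 cutoff, and the functions are hypothetical and do not describe how the Insight Engine works internally.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (None if undefined)."""
    if len(xs) != len(ys):
        raise ValueError("signals must have the same length")
    n = len(xs)
    if n < 2:
        return None
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


def correlated_pairs(signals, min_abs_r=0.8):
    """Return (name_a, name_b, r) for signal pairs with |r| >= min_abs_r."""
    names = sorted(signals)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(signals[a], signals[b])
            if r is not None and abs(r) >= min_abs_r:
                pairs.append((a, b, r))
    return pairs


# Hypothetical hourly series for three signals.
signals = {
    "nfs_latency_ms": [2.0, 2.5, 3.0, 8.0, 9.0, 3.0],
    "job_failures": [0.0, 0.0, 1.0, 5.0, 6.0, 1.0],
    "cpu_temp_c": [60.0, 61.0, 60.0, 61.0, 60.0, 61.0],
}
for a, b, r in correlated_pairs(signals):
    print(a, b, round(r, 3))
```

Correlation alone does not establish cause, so a pair flagged this way is a prompt for investigation rather than a diagnosis.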

Reference

nomad ref                     # Browse all 60 topics
nomad ref dyn diversity       # Look up any topic
nomad ref search "regime"     # Search across documentation
nomad ref alerts thresholds   # Alert threshold reference
nomad ref config              # Configuration reference

Dashboard Views

The web dashboard includes multiple views accessible via tabs:

  • Cluster Overview: Real-time node status with health rings showing CPU utilization
  • Network View: 3D job similarity network with failure clustering analysis
  • Resources: CPU-hours, GPU-hours, and usage breakdown by group/user
  • Activity: Job submission heatmap showing patterns by day and hour
  • Interactive: Active RStudio and Jupyter sessions with memory usage
  • Workstations: Departmental machines with CPU, memory, disk, and logged-in users
  • Storage: NFS servers with ZFS pool health, capacity, and client connections

Toggle between light and dark themes with the Theme button.
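
As an illustration of the data behind the Activity view, the sketch below bins job submission times into a day-by-hour grid. It assumes submissions are available as Unix timestamps and uses UTC for binning; both are assumptions, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def submission_heatmap(timestamps):
    """Count submissions in a 7x24 grid: rows are weekdays (Mon=0), columns hours."""
    grid = [[0] * 24 for _ in range(7)]
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        grid[dt.weekday()][dt.hour] += 1
    return grid


# Two submissions on Monday 2024-01-01 at 09:xx UTC, one on Saturday at 23:00 UTC.
submits = [
    datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc).timestamp(),
    datetime(2024, 1, 1, 9, 45, tzinfo=timezone.utc).timestamp(),
    datetime(2024, 1, 6, 23, 0, tzinfo=timezone.utc).timestamp(),
]
grid = submission_heatmap(submits)
print(grid[0][9], grid[5][23])
```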


Installation

From PyPI

pip install nomad-hpc

From Source

git clone https://github.com/jtonini/nomad-hpc
cd nomad-hpc && pip install -e .

Requirements

  • Python 3.9+
  • SQLite 3.35+
  • sysstat package (iostat, mpstat)
  • Optional: SLURM, nvidia-smi, nfsiostat
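
The requirements above can be checked by hand with a short script. This is a sketch, not what `nomad syscheck` does: the function name is hypothetical, and using `sinfo` as a stand-in for "SLURM is installed" is an assumption. Note that `sqlite3.sqlite_version_info` reports the SQLite library linked into Python, which may differ from a `sqlite3` command-line binary on the same machine.

```python
import shutil
import sqlite3
import sys

MIN_PYTHON = (3, 9)
MIN_SQLITE = (3, 35, 0)
REQUIRED_TOOLS = ["iostat", "mpstat"]                # from the sysstat package
OPTIONAL_TOOLS = ["sinfo", "nvidia-smi", "nfsiostat"]  # sinfo used as a SLURM proxy


def check_requirements():
    """Return a dict mapping each requirement to True (met) or False (not met)."""
    results = {
        "python>=3.9": sys.version_info[:2] >= MIN_PYTHON,
        "sqlite>=3.35": sqlite3.sqlite_version_info >= MIN_SQLITE,
    }
    for tool in REQUIRED_TOOLS + OPTIONAL_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


for name, ok in check_requirements().items():
    print(f"{name:14} {'ok' if ok else 'missing'}")
```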

System Check

nomad syscheck

Documentation

📖 jtonini.github.io/nomad-hpc


License

Dual-licensed:

  • AGPL v3 — Free for academic, educational, and open-source use
  • Commercial License — Available for proprietary deployments

Citation

@software{nomad2026,
  author = {Tonini, João Filipe Riva},
  title = {NØMAD: Lightweight HPC Monitoring with Machine Learning-Based Failure Prediction},
  year = {2026},
  url = {https://github.com/jtonini/nomad-hpc},
  doi = {10.5281/zenodo.18614517}
}

Contributing

See CONTRIBUTING.md for guidelines.

