A lightweight HPC monitoring and predictive analytics tool
NØMAD-HPC
NØde Monitoring And Diagnostics — Lightweight HPC monitoring, visualization, and predictive analytics.
"Travels light, adapts to its environment, and doesn't need permanent infrastructure."
📖 Full Documentation — Installation guides, configuration, CLI reference, network methodology, ML framework, and more.
Quick Start
pip install nomad-hpc
nomad demo # Try with synthetic data
For production:
nomad init # Configure for your cluster
nomad collect # Start data collection
nomad dashboard # Launch web interface
Features
| Feature | Description | Command |
|---|---|---|
| Dashboard | Real-time multi-cluster monitoring with partition views | nomad dashboard |
| Workstation Monitoring | Track departmental workstations (CPU, memory, disk, users) | Dashboard → Workstations |
| Storage Monitoring | Monitor NFS servers, ZFS pools, IOPS, and client connections | Dashboard → Storage |
| Interactive Sessions | Monitor RStudio/Jupyter sessions, including memory use and session age | Dashboard → Interactive |
| Data Readiness | Assess whether collected data is ready for ML training, using sample-size and variance analysis | nomad readiness |
| Diagnostics | Analyze network, storage, and node-level bottlenecks | nomad diag |
| Educational Analytics | Track computational proficiency development | nomad edu explain <job_id> |
| Alerts | Threshold + predictive alerts (email, Slack, webhook) | nomad alerts |
| ML Prediction | Job failure prediction using similarity networks | nomad predict |
| Insight Engine | Operational narratives from multi-signal analysis | nomad insights brief |
| Cloud Monitoring | AWS/Azure/GCP metrics with cost and utilization analysis | nomad cloud status |
| Community Export | Anonymized datasets for cross-institutional research | nomad community export |
| System Dynamics | Ecological and economic metrics for resource analysis | nomad dyn |
| Reference | Built-in documentation, code navigation, and search | nomad ref |
| Developer Toolchain | Scaffolding, validation, and contribution pipeline | nomad dev |
| Issue Reporting | Submit bug reports, feature requests, and questions from any interface | nomad issue report |
Developer Toolchain
Scaffolding, codebase validation, and contribution pipeline for NØMAD development.
nomad dev guide # Interactive contribution wizard
nomad dev new collector zfs # Scaffold a new module
nomad dev check # Validate codebase health
nomad dev check --fix # Auto-fix registration issues
nomad dev test changed # Test only modified files
nomad dev status # Current branch and readiness
nomad dev submit # Full contribution pipeline
nomad dev setup # One-time dev environment config
nomad dev bump patch # Version management
nomad dev deps collector disk # Module dependency graph
The scaffolder supports eight module types: collector, command, analysis, metric, view, page, alert, and insight. Every scaffolded module includes a source file, test stubs, schema/config templates, and next-step instructions. Quality by construction, not by review.
Architecture
┌────────────────────────────────────────────────────────────────────────────┐
│                                   NØMAD                                    │
├───────────────┬───────────────┬───────────────┬───────────┬────────────────┤
│ Collectors    │ Analysis      │ Viz           │ Alerts    │ Intelligence   │
├───────────────┼───────────────┼───────────────┼───────────┼────────────────┤
│ disk          │ derivatives   │ dashboard     │ thresholds│ insights       │
│ iostat        │ similarity    │ network 3D    │ predictive│ dynamics       │
│ nfs           │ community     │ partitions    │ flapping  │ reference      │
│ slurm         │ ML ensemble   │ workstations  │ email     │ edu scoring    │
│ gpu           │ readiness     │ storage       │ slack     │                │
│ workstation   │ diagnostics   │ interactive   │ webhooks  │                │
│ storage       │               │               │           │                │
│ cloud         │               │               │           │                │
└───────────────┴───────────────┴─────┬─────────┴───────────┴────────────────┘
                                      │
                            ┌─────────┴─────────┐
                            │  SQLite Database  │
                            └───────────────────┘
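As a rough illustration of the data flow in the diagram, the sketch below has one collector write a sample into SQLite and an analysis-side query read it back. The table name `disk_samples`, its columns, and the helper names are hypothetical; the actual NØMAD schema and collector interface may differ.

```python
import shutil
import sqlite3
import time


def collect_disk(path="/"):
    """Hypothetical collector: take one disk-usage sample for `path`."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {"ts": time.time(), "path": path, "used_pct": used_pct}


def store(conn, sample):
    """Append a sample to the (assumed) disk_samples table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS disk_samples "
        "(ts REAL, path TEXT, used_pct REAL)"
    )
    conn.execute(
        "INSERT INTO disk_samples VALUES (:ts, :path, :used_pct)", sample
    )
    conn.commit()


def latest(conn, path):
    """Analysis side: most recent usage percentage for `path`, or None."""
    row = conn.execute(
        "SELECT used_pct FROM disk_samples "
        "WHERE path = ? ORDER BY ts DESC LIMIT 1",
        (path,),
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")  # a file path would be used in practice
store(conn, collect_disk("/"))
print(latest(conn, "/"))
```

A single shared database file keeps collectors, analysis, and the dashboard decoupled: each layer only needs to agree on the table layout.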
CLI Reference
Core Commands
nomad init # Setup wizard
nomad collect # Start collectors
nomad dashboard # Web interface
nomad dashboard --db file.db # Use specific database
nomad demo # Demo mode with synthetic data
nomad status # System status
Data Readiness & Diagnostics
nomad readiness # Check ML training readiness
nomad readiness -v # Verbose with feature details
nomad diag network # Network performance analysis
nomad diag storage # Storage health and I/O patterns
nomad diag node # Node-level resource bottlenecks
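The readiness check is described as a sample-size and variance analysis. A minimal sketch of that idea follows; the thresholds (`min_samples`, `min_variance`) and the feature names are assumptions for illustration, not values taken from NØMAD.

```python
from statistics import pvariance


def readiness(rows, min_samples=100, min_variance=1e-9):
    """Return (ready, report) for a list of numeric feature dicts.

    A dataset is treated as ready when it has enough samples and no
    feature is (near-)constant. Both thresholds are assumed values.
    """
    report = {
        "n_samples": len(rows),
        "enough_samples": len(rows) >= min_samples,
        "constant_features": [],
    }
    if rows:
        for name in rows[0]:
            values = [row[name] for row in rows]
            if pvariance(values) <= min_variance:
                report["constant_features"].append(name)
    ready = report["enough_samples"] and not report["constant_features"]
    return ready, report


# Hypothetical job records: `cpus` never varies, so it carries no signal.
rows = [{"cpus": 4, "mem_gb": i % 8 + 1} for i in range(120)]
ready, report = readiness(rows)
print(ready, report["constant_features"])
```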
Educational Analytics
nomad edu explain <job_id> # Job analysis with recommendations
nomad edu trajectory <user> # User proficiency over time
nomad edu report <group> # Course/group report
Analysis & Prediction
nomad disk /path # Filesystem trends
nomad jobs --user <user> # Job history
nomad similarity # Network analysis
nomad train # Train ML models
nomad predict # Run predictions
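To make the idea of a job similarity network concrete, the sketch below links jobs whose feature vectors are close. The source does not say which similarity measure NØMAD uses; cosine similarity, the 0.95 threshold, and the three-element feature vectors are stand-ins chosen for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_edges(jobs, threshold=0.95):
    """Return (job_a, job_b, similarity) for every pair above threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(jobs.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges


# Hypothetical normalized features: (cpu fraction, memory fraction, I/O fraction)
jobs = {
    "job1": [0.9, 0.1, 0.0],
    "job2": [0.8, 0.2, 0.0],
    "job3": [0.0, 0.1, 0.9],
}
edges = similarity_edges(jobs)
print(edges)
```

In a network built this way, a failed job's neighbours are the jobs most like it, which is one plausible basis for failure clustering.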
Community & Alerts
nomad community export # Export anonymized data
nomad community preview # Preview export
nomad alerts # View alerts
nomad alerts --unresolved # Unresolved only
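The architecture lists threshold and flapping alerts. One common way to tell them apart is to count how often a metric crosses its limit within a window; the sketch below does that. The function names and the `max_transitions` cutoff are hypothetical, not NØMAD's actual logic.

```python
def threshold_states(values, limit):
    """True for each sample that exceeds the limit."""
    return [value > limit for value in values]


def count_transitions(states):
    """Number of times the state changes between consecutive samples."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def classify(values, limit, max_transitions=3):
    """Label a window of samples as 'ok', 'firing', or 'flapping'."""
    states = threshold_states(values, limit)
    if count_transitions(states) > max_transitions:
        return "flapping"
    return "firing" if states and states[-1] else "ok"


print(classify([70, 95, 70, 96, 71, 97], limit=90))
print(classify([70, 80, 95, 96], limit=90))
print(classify([50, 60], limit=90))
```

Treating flapping as its own state avoids sending a notification on every crossing when a metric hovers around its limit.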
System Dynamics
nomad dyn summary # Full dynamics narrative
nomad dyn diversity # Workload diversity indices
nomad dyn diversity --by partition # By partition
nomad dyn niche # Resource overlap between groups
nomad dyn capacity # Carrying capacity, binding constraint
nomad dyn resilience # Recovery time after disturbances
nomad dyn externality # Inter-group impact scoring
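The source does not name the diversity indices it reports. Two standard ecological ones, the Shannon index and the Simpson index, are sketched below on hypothetical per-job partition labels to show what such a metric measures.

```python
import math
from collections import Counter


def shannon(labels):
    """Shannon index H = -sum(p * ln p) over label proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def simpson(labels):
    """Simpson index 1 - sum(p^2): chance two random jobs differ in label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical workload: jobs split evenly across two partitions.
even = ["gpu", "gpu", "cpu", "cpu"]
print(shannon(even), simpson(even))

# A workload confined to one partition has zero diversity under both indices.
single = ["cpu"] * 4
print(shannon(single), simpson(single))
```

Both indices rise as jobs spread more evenly over more categories, so a falling value can flag a cluster drifting toward a single dominant workload.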
Insight Engine
nomad insights brief # Executive summary
nomad insights full # Comprehensive report
nomad insights signals # Raw signal detection
nomad insights correlations # Cross-signal analysis
nomad insights enrich # Alert enrichment with context
Reference
nomad ref # Browse all 60 topics
nomad ref dyn diversity # Look up any topic
nomad ref search "regime" # Search across documentation
nomad ref alerts thresholds # Alert threshold reference
nomad ref config # Configuration reference
Issue Reporting
nomad issue report # Interactive bug/feature/question form
nomad issue report -c bug -m alerts # Pre-select category and component
nomad issue report --email # Send via email instead of GitHub
nomad issue search disk # Search existing issues
nomad issue info # Preview auto-collected system info
nomad issue info --json # JSON output for scripting
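Since `nomad issue info --json` is meant for scripting, a wrapper along these lines could consume it from Python. The structure of the JSON output is not documented here, so the sketch only parses it and makes no assumption about its keys; the stand-in command lets the example run without NØMAD installed.

```python
import json
import subprocess
import sys


def run_json(cmd):
    """Run a command and parse its standard output as JSON."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Stand-in command so the sketch runs without NØMAD installed.
demo = run_json(
    [sys.executable, "-c", "import json; print(json.dumps({'ok': True}))"]
)
print(demo)

# With NØMAD installed, the command from the reference above would be:
# info = run_json(["nomad", "issue", "info", "--json"])
```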
Dashboard Views
The web dashboard includes multiple views accessible via tabs:
- Cluster Overview: Real-time node status with health rings showing CPU utilization
- Network View: 3D job similarity network with failure clustering analysis
- Resources: CPU-hours, GPU-hours, and usage breakdown by group/user
- Activity: Job submission heatmap showing patterns by day and hour
- Interactive: Active RStudio and Jupyter sessions with memory usage
- Workstations: Departmental machines with CPU, memory, disk, and logged-in users
- Storage: NFS servers with ZFS pool health, capacity, and client connections
- Cloud: AWS, Azure, and GCP resource utilization and cost tracking
- Insights: Operational narratives from multi-signal analysis
- Dynamics: Ecological and economic metrics (diversity, niche, capacity, resilience)
- Report Issue: Submit bugs, feature requests, and questions with auto-populated system info
Toggle between light and dark themes with the Theme button.
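The Activity view's day-by-hour heatmap amounts to binning submission times into a 7 × 24 grid. A minimal sketch of that binning, with made-up timestamps and a hypothetical function name, looks like this:

```python
from datetime import datetime


def submission_heatmap(timestamps):
    """Count submissions per (weekday, hour); Monday is row 0."""
    grid = [[0] * 24 for _ in range(7)]
    for ts in timestamps:
        grid[ts.weekday()][ts.hour] += 1
    return grid


# Hypothetical submission times: two on a Monday morning, one on Tuesday.
submissions = [
    datetime(2026, 1, 5, 9, 15),
    datetime(2026, 1, 5, 9, 40),
    datetime(2026, 1, 6, 14, 0),
]
grid = submission_heatmap(submissions)
print(grid[0][9], grid[1][14])
```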
Installation
From PyPI
pip install nomad-hpc
From Source
git clone https://github.com/jtonini/nomad-hpc
cd nomad-hpc && pip install -e .
Requirements
- Python 3.9+
- SQLite 3.35+
- sysstat package (iostat, mpstat)
- Optional: SLURM, nvidia-smi, nfsiostat
System Check
nomad syscheck
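On a host where NØMAD is not yet installed, the listed requirements can be checked by hand. The sketch below uses only the Python standard library; it is not the implementation of `nomad syscheck`, and the result keys are arbitrary labels.

```python
import shutil
import sqlite3
import sys


def check_requirements():
    """Check the documented requirements; returns {label: bool}."""
    return {
        "python>=3.9": sys.version_info >= (3, 9),
        # Version of the SQLite library linked into this Python build.
        "sqlite>=3.35": sqlite3.sqlite_version_info >= (3, 35, 0),
        # sysstat tools must be on PATH.
        "iostat": shutil.which("iostat") is not None,
        "mpstat": shutil.which("mpstat") is not None,
    }


for name, ok in check_requirements().items():
    print(f"{name:14} {'ok' if ok else 'MISSING'}")
```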
Documentation
- Installation & Configuration
- System Install (--system)
- Dashboard Guide
- Educational Analytics
- Network Methodology
- ML Framework
- Proficiency Scoring
- CLI Reference
- Configuration Options
License
Dual-licensed:
- AGPL v3 — Free for academic, educational, and open-source use
- Commercial License — Available for proprietary deployments
Citation
@software{nomad2026,
author = {Tonini, João Filipe Riva},
title = {NØMAD: Lightweight HPC Monitoring with Machine Learning-Based Failure Prediction},
year = {2026},
url = {https://github.com/jtonini/nomad-hpc},
doi = {10.5281/zenodo.18614517}
}
@article{tonini2026nomad,
author = {Tonini, João Filipe Riva},
title = {NØMAD: Lightweight HPC Monitoring with Machine Learning-Based Failure Prediction},
journal = {Journal of Open Research Software},
volume = {14},
pages = {17},
year = {2026},
doi = {10.5334/jors.686}
}
Contributing
See CONTRIBUTING.md for guidelines.
Contact
- Author: João Tonini
- Email: jtonini@richmond.edu
- Issues: GitHub Issues
File details
Details for the file nomad_hpc-1.3.2.tar.gz.
File metadata
- Download URL: nomad_hpc-1.3.2.tar.gz
- Upload date:
- Size: 443.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.21
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d1b605a8dce26580bc1faaf73bdcf5948bb3b82367299432806b9ad39c5e676 |
| MD5 | 2c15f696bd35c083b4187c1b8dafe7d0 |
| BLAKE2b-256 | a9dcd02391abdd6cc8c4cc491529226d04ed3ff7cd052a38a0ff2941356d7287 |
File details
Details for the file nomad_hpc-1.3.2-py3-none-any.whl.
File metadata
- Download URL: nomad_hpc-1.3.2-py3-none-any.whl
- Upload date:
- Size: 476.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.21
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 01ade89f5cf35b969667951cbedf247212dc32d7795b7b47e1805506d29f1a33 |
| MD5 | 438956ab26e8728ad3a7240b0ac09100 |
| BLAKE2b-256 | 111cae339d41abc9fed94fe48eca1528d22f2ac376b8d1ef933d6306a27b176f |