A lightweight HPC monitoring and predictive analytics tool
NØMAÐ-HPC
NØde Monitoring And Diagnostics — Lightweight HPC monitoring, visualization, and predictive analytics.
"Travels light, adapts to its environment, and doesn't need permanent infrastructure."
📖 Full Documentation — Installation guides, configuration, CLI reference, network methodology, ML framework, and more.
Quick Start
pip install nomad-hpc
nomad demo # Try with synthetic data
For production:
nomad init # Interactive setup wizard
nomad collect # Start data collection
nomad dashboard # Launch web interface
What's New in v1.4.0
Multi-Cluster Monitoring
Monitor multiple clusters, interactive servers, and workstation groups from a single dashboard. The nomad sync command merges databases from remote sites into a combined view with per-cluster tabs, partition-aware layouts, and cross-site insights.
Alert Pipeline
Alerting now runs end to end, from data collection through email notification. The DiskCollector detects filesystem usage above thresholds, the ThresholdChecker fires severity-graded alerts, and the AlertDispatcher persists them to the database with deduplication and cooldown. Daily email reports are sent via system mail, so no SMTP configuration is required.
Per-Cluster Dynamics
The Insight Engine runs diversity, niche overlap, capacity, and resilience computations independently per cluster in combined databases. Each signal is tagged with its cluster name for clear attribution.
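As an illustration of running a diversity computation independently per cluster, the sketch below computes a Shannon index for each cluster in a combined job list. The choice of the Shannon index, the `(cluster, category)` input shape, and the function name are assumptions; the text above does not state which diversity measure NØMAÐ uses.

```python
import math
from collections import Counter, defaultdict


def shannon_by_cluster(jobs):
    """Compute a Shannon diversity index per cluster.

    jobs: iterable of (cluster_name, category) pairs, where category is any
    hashable workload label (a hypothetical grouping such as partition).
    Returns a dict mapping each cluster name to its index H = -sum(p * ln p).
    """
    counts = defaultdict(Counter)
    for cluster, category in jobs:
        counts[cluster][category] += 1

    result = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        result[cluster] = -sum(
            (n / total) * math.log(n / total) for n in counter.values()
        )
    return result
```

Because each result is keyed by cluster name, a signal derived from it can carry that name for attribution.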
Workstation Monitoring
Monitor departmental workstations via SSH from a central machine. The WorkstationCollector gathers CPU load, memory, disk, logged-in users, and process counts, and detects zombie processes. Workstation groups appear on the Workstations page with department-level grouping.
Disk Signals with Derivative Analysis
Filesystem signals now include fill rate and projected days-until-full from derivative analysis. The Insight Engine reads from the filesystems table and surfaces actionable warnings before disks reach critical capacity.
Umbrella Group Filter
Niche overlap analysis excludes groups that contain more than 80% of all users, eliminating false contention warnings from universal groups.
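The filter can be sketched as below. The function name and the input shape are hypothetical, and "all users" is taken here to mean the union of members across the supplied groups, which is an assumption.

```python
def drop_umbrella_groups(
    groups: dict[str, set[str]], max_share: float = 0.80
) -> dict[str, set[str]]:
    """Return only the groups whose membership is at most max_share of all users.

    groups: mapping of group name to the set of usernames in that group.
    """
    all_users = set().union(*groups.values())
    if not all_users:
        return {}
    return {
        name: members
        for name, members in groups.items()
        if len(members) / len(all_users) <= max_share
    }
```

A group holding exactly 80% of users is kept, since the rule excludes only groups with more than 80%.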
Features
| Feature | Description | Command |
|---|---|---|
| Multi-Cluster Dashboard | Real-time monitoring across HPC clusters, interactive servers, and workstations | nomad dashboard |
| Multi-Site Sync | Merge databases from remote sites into a combined view | nomad sync |
| Workstation Monitoring | Track departmental machines via SSH (CPU, memory, disk, users) | Dashboard → Workstations |
| Storage Monitoring | Filesystem health grouped by server with usage bars | Dashboard → Storage |
| Interactive Sessions | Monitor RStudio/Jupyter sessions with memory and idle detection | Dashboard → Interactive |
| Alert Pipeline | Threshold + derivative alerts with email, Slack, and webhook delivery | nomad alerts |
| Insight Engine | Operational narratives from multi-signal, per-cluster analysis | nomad insights brief |
| System Dynamics | Ecological and economic metrics for resource analysis | nomad dyn |
| ML Prediction | Job failure prediction using similarity networks | nomad predict |
| Data Readiness | Assess ML model readiness with sample size and variance analysis | nomad readiness |
| Diagnostics | Analyze network, storage, and node-level bottlenecks | nomad diag |
| Educational Analytics | Track computational proficiency development | nomad edu explain <job> |
| Cloud Monitoring | AWS/Azure/GCP metrics with cost and utilization analysis | nomad cloud status |
| Community Export | Anonymized datasets for cross-institutional research | nomad community export |
| Reference | Built-in documentation, code navigation, and search | nomad ref |
| Developer Toolchain | Scaffolding, validation, and contribution pipeline | nomad dev |
| Issue Reporting | Submit bugs, features, questions from any interface | nomad issue report |
Dashboard Views
The web dashboard includes multiple views accessible via tabs:
- Cluster Overview: Real-time node status with health rings, per-partition layout, and live running/pending counts from queue state
- Network View: 3D job similarity network with failure clustering analysis
- Resources: CPU-hours, GPU-hours, and usage breakdown by group/user with cluster filtering
- Activity: Job submission heatmap showing patterns by day and hour
- Interactive: Active RStudio and Jupyter sessions with memory usage and idle detection
- Workstations: Departmental machines grouped by site and department with CPU, memory, disk, and user counts
- Storage: Filesystem health grouped by server with color-coded usage bars
- Cloud: AWS, Azure, and GCP resource utilization and cost tracking
- Insights: Operational narratives from multi-signal, per-cluster analysis
- Dynamics: Diversity indices, niche overlap, carrying capacity, resilience scoring
- Readiness: Collection health, uptime, cycles, and prediction readiness
- Report Issue: Submit bugs, feature requests, and questions with auto-populated system info
Toggle between light and dark themes with the Theme button.
Architecture
┌────────────────────────────────────────────────────────────────────────────┐
│                                   NØMAÐ                                    │
├───────────────┬───────────────┬───────────────┬───────────┬────────────────┤
│ Collectors    │ Analysis      │ Viz           │ Alerts    │ Intelligence   │
├───────────────┼───────────────┼───────────────┼───────────┼────────────────┤
│ disk          │ derivatives   │ dashboard     │ thresholds│ insights       │
│ iostat        │ similarity    │ network 3D    │ predictive│ dynamics       │
│ nfs           │ community     │ partitions    │ flapping  │ reference      │
│ slurm         │ ML ensemble   │ workstations  │ email     │ edu scoring    │
│ gpu           │ readiness     │ storage       │ slack     │                │
│ workstation   │ diagnostics   │ interactive   │ webhooks  │                │
│ storage       │               │ cloud         │           │                │
│ cloud         │               │ insights      │           │                │
│ groups        │               │ dynamics      │           │                │
│ interactive   │               │ readiness     │           │                │
└───────────────┴───────────────┴───────────────┴───────────┴────────────────┘
                                      │
                      ┌───────────────┴────────────────┐
                      │        SQLite Database         │
                      │ (per-site + combined via sync) │
                      └────────────────────────────────┘
Multi-Site Deployment
NØMAÐ supports monitoring multiple sites from a single dashboard:
# On each site
nomad init # Configure for local environment
nomad collect # Start data collection
# On a central machine
nomad sync # Pull and merge all site databases
nomad dashboard --db combined.db # Unified view
The nomad sync command pulls databases via SCP, merges them with source_site tagging, and copies partition metadata for per-cluster dashboard filtering. Set up a cron job for automatic syncing:
*/10 * * * * /path/to/nomad sync 2>/dev/null
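To illustrate the merge step, the sketch below copies rows from several per-site SQLite files into one combined database and tags each row with its site of origin. Only the source_site column name comes from the description above; the `jobs` table, its other columns, and the `merge_sites` function are hypothetical, and the SCP transfer is assumed to have already happened.

```python
import sqlite3


def merge_sites(combined_path: str, site_dbs: dict[str, str]) -> None:
    """Merge per-site databases into one, tagging each row with source_site.

    site_dbs maps a site name to the local path of that site's database copy.
    The 'jobs' table and its columns are placeholders for illustration.
    """
    out = sqlite3.connect(combined_path)
    out.execute(
        "CREATE TABLE IF NOT EXISTS jobs "
        "(job_id TEXT, state TEXT, source_site TEXT)"
    )
    for site, path in site_dbs.items():
        src = sqlite3.connect(path)
        rows = src.execute("SELECT job_id, state FROM jobs").fetchall()
        src.close()
        with out:  # one transaction per site: commit on success, roll back on error
            # Replace this site's previous rows so repeated syncs do not duplicate data.
            out.execute("DELETE FROM jobs WHERE source_site = ?", (site,))
            out.executemany(
                "INSERT INTO jobs (job_id, state, source_site) VALUES (?, ?, ?)",
                [(job_id, state, site) for job_id, state in rows],
            )
    out.close()
```

Deleting a site's rows before re-inserting them keeps the merge idempotent, which matters when the sync runs from cron every few minutes.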
CLI Reference
Core Commands
nomad init # Interactive setup wizard
nomad collect # Start collectors
nomad collect --once # Single collection cycle
nomad dashboard # Web interface
nomad dashboard --db file.db # Use specific database
nomad sync # Merge remote databases
nomad demo # Demo mode with synthetic data
nomad status # System status
nomad syscheck # Verify environment
Insight Engine
nomad insights brief # Executive summary
nomad insights detail # Comprehensive report
nomad insights json # Machine-readable output
nomad insights slack # Slack-formatted report
System Dynamics
nomad dyn summary # Full dynamics narrative
nomad dyn diversity # Workload diversity indices
nomad dyn diversity --by partition # By partition
nomad dyn niche # Resource overlap between groups
nomad dyn capacity # Carrying capacity, binding constraint
nomad dyn resilience # Recovery time after disturbances
nomad dyn externality # Inter-group impact scoring
Data Readiness & Diagnostics
nomad readiness # Check ML training readiness
nomad readiness -v # Verbose with feature details
nomad diag network # Network performance analysis
nomad diag storage # Storage health and I/O patterns
nomad diag node # Node-level resource bottlenecks
Educational Analytics
nomad edu explain <job_id> # Job analysis with recommendations
nomad edu trajectory <user> # User proficiency over time
nomad edu report <group> # Course/group report
Analysis & Prediction
nomad disk /path # Filesystem trends
nomad jobs --user <user> # Job history
nomad similarity # Network analysis
nomad train # Train ML models
nomad predict # Run predictions
Alerts & Community
nomad alerts # View alerts
nomad alerts --unresolved # Unresolved only
nomad community export # Export anonymized data
nomad community preview # Preview export
Reference
nomad ref # Browse all topics
nomad ref dyn diversity # Look up any topic
nomad ref search "regime" # Search across documentation
Issue Reporting
nomad issue report # Interactive bug/feature/question form
nomad issue report -c bug # Pre-select category
nomad issue search disk # Search existing issues
nomad issue info # Preview system info
Developer Toolchain
nomad dev guide # Interactive contribution wizard
nomad dev new collector zfs # Scaffold a new module
nomad dev check # Validate codebase health
nomad dev check --fix # Auto-fix registration issues
nomad dev test changed # Test only modified files
nomad dev status # Current branch and readiness
nomad dev submit # Full contribution pipeline
nomad dev bump patch # Version management
nomad dev deps collector disk # Module dependency graph
Installation
From PyPI
pip install nomad-hpc
From Source
git clone https://github.com/jtonini/nomad-hpc
cd nomad-hpc && pip install -e .
Requirements
- Python 3.10+
- SQLite 3.35+
- Optional: sysstat (iostat, mpstat), SLURM, nvidia-smi, nfsiostat
System Check
nomad syscheck
Documentation
- Installation & Configuration
- Dashboard Guide
- CLI Reference
- Configuration Options
- Network Methodology
- ML Framework
- System Dynamics
- Educational Analytics
- Cloud Monitoring
- Reference System
- Issue Reporting
License
Dual-licensed:
- AGPL v3 — Free for academic, educational, and open-source use
- Commercial License — Available for proprietary deployments
Citation
@software{nomad2026,
author = {Tonini, João Filipe Riva},
title = {NØMAÐ: Lightweight HPC Monitoring with Machine Learning-Based Failure Prediction},
year = {2026},
url = {https://github.com/jtonini/nomad-hpc},
doi = {10.5281/zenodo.18614517}
}
@article{tonini2026nomad,
author = {Tonini, João Filipe Riva},
title = {NØMAÐ: Lightweight HPC Monitoring with Machine Learning-Based Failure Prediction},
journal = {Journal of Open Research Software},
volume = {14},
pages = {17},
year = {2026},
doi = {10.5334/jors.686}
}
Contributing
See CONTRIBUTING.md for guidelines.
Contact
- Author: João Tonini
- Email: jtonini@richmond.edu
- Issues: GitHub Issues