Read-only multi-cluster SLURM dashboard + agent capacity overview, over CLI and MCP.
Cluster Jobs — terminal dashboard
📖 Documentation — the extended reference (configuration, agent overview, MCP, development).
A read-only terminal dashboard that shows your SLURM jobs across several
clusters and your desktop in one view. It SSHes into each host, runs
squeue --me, and renders a live, colour-coded overview.
For coding agents: one --overview call (CLI or MCP)
returns, per cluster and partition, how many CPUs/GPUs are free — broken down
by GPU type, with the largest free block on a single node — plus your
queued/running jobs and an approximate queueing time. See
Agent overview.
Read-only by design: the only commands ever run are squeue and sinfo
(SLURM hosts) and nvidia-smi + ps (non-SLURM GPU hosts). There is no code
path that can cancel or submit jobs. It uses your existing SSH config and
keys — nothing new is exposed, no server, no stored secrets.
[Image: live TUI with synthetic demo data (--demo).]
Contents
- Install: pip install cluster-job-monitor
- Quick look (no clusters needed): try it with synthetic data
- Real setup: point it at your clusters
- Agent overview: free CPUs/GPUs, GPU types, queue times (CLI + MCP)
- Keys: TUI keybindings
- Layout · Development · License
Install
pip install cluster-job-monitor # the dashboard + CLI
pip install "cluster-job-monitor[mcp]" # also the MCP server (see below)
This installs the cluster-jobs command. Prefer a virtualenv:
python3 -m venv .venv && source .venv/bin/activate
pip install cluster-job-monitor
If pip install fails with an SSL / "ssl module is not available" error, your default python3 was built without OpenSSL. Use Homebrew's instead: /opt/homebrew/bin/python3 -m venv .venv.
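To see whether a given interpreter is affected before you build the virtualenv, a check along these lines may help. It is a minimal sketch: `ssl_status` is a hypothetical helper name and is not part of cluster-job-monitor.

```python
import importlib


def ssl_status():
    """Return the linked OpenSSL version string, or None if ssl is unavailable."""
    try:
        ssl = importlib.import_module("ssl")
    except ImportError:
        # This is the situation in which pip reports the SSL error.
        return None
    return ssl.OPENSSL_VERSION


status = ssl_status()
print(status or "ssl module is not available; pip cannot download over HTTPS")
```

Run it with the interpreter you intend to use for the venv (for example your default python3, then Homebrew's) and compare the results.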
Working from a checkout instead? pip install -e ".[mcp,dev]" and use
cluster-jobs. The legacy entrypoints python run.py … and
python mcp_server.py still work as thin shims.
Quick look (no clusters needed)
cluster-jobs --demo # live TUI with synthetic data
cluster-jobs --once --demo # print one synthetic snapshot and exit
Real setup
- Make sure each cluster is an SSH alias you can reach non-interactively. In
  ~/.ssh/config, e.g.:

      Host mycluster
          HostName login.mycluster.example.edu
          User myuser
          # Reuse one connection so polling every 30s is fast and doesn't re-auth:
          ControlMaster auto
          ControlPath ~/.ssh/cm-%r@%h:%p
          ControlPersist 10m

  Test it: ssh mycluster "squeue --me --noheader | head" should return instantly with no password prompt.

- Create your config from the template:

      cp clusters.example.json clusters.json

  Edit clusters.json: one entry per host. ssh is the ~/.ssh/config alias; set "local": true for the machine you run the tool on (no SSH). Add "minimized": true to start a cluster collapsed (see below). clusters.json is git-ignored.

  Non-SLURM GPU box? Add "scheduler": "gpu" to that host. Instead of squeue it runs nvidia-smi + ps and shows one row per GPU process: the run name (from a --name / --run-name / --experiment arg, else the script name), GPU memory used, elapsed time, and the owner, plus a GPU utilisation/memory line. Works through any login shell (fish, csh, …).

- Run it:

      cluster-jobs                               # uses ./clusters.json
      cluster-jobs --config ~/my-clusters.json

  (python run.py … still works from a checkout.)
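The "reachable non-interactively" requirement from the first step can also be checked from a script. The sketch below is an illustration, not something the tool ships: `ssh_check_command` and `can_reach` are hypothetical names, and it relies on the standard OpenSSH options BatchMode and ConnectTimeout so that a host needing a password fails instead of prompting.

```python
import subprocess


def ssh_check_command(alias, timeout_sec=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                  # never prompt; fail if keys don't work
        "-o", f"ConnectTimeout={timeout_sec}",  # give up quickly on unreachable hosts
        alias,
        # No "| head" here, so the exit status is squeue's own.
        "squeue --me --noheader",
    ]


def can_reach(alias, timeout_sec=5):
    """Return True if the alias runs squeue without any interactive prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(alias, timeout_sec),
            capture_output=True,
            text=True,
            timeout=timeout_sec + 10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is missing, or the host hung past the overall timeout.
        return False
    return result.returncode == 0
```

Calling `can_reach("mycluster")` with the alias from the example above should return True once the SSH config and keys are in place.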
Agent overview
For coding agents (or scripts) that need to decide where to launch a job, there's a one-shot overview that answers, in a single call:
- how many jobs you have queued / running, per cluster and per partition,
- how many CPUs and GPUs are still free — broken down by GPU type (a100/h100/…) and with the largest free block on a single node, and
- an approximate queueing time: SLURM's estimated start for your pending jobs, plus a per-partition pre-submission wait estimate.
cluster-jobs --overview # human-readable capacity table
cluster-jobs --overview --json # machine-readable JSON (for agents)
cluster-jobs --overview --demo --json # try it with synthetic data
[Image: cluster-jobs --overview (synthetic data).]
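A script or agent can consume the `--json` form by shelling out to the command and parsing the result. This is a minimal sketch that assumes the command prints only the JSON document on stdout; `fetch_overview` and its default timeout are illustrative choices, not part of the package.

```python
import json
import subprocess

OVERVIEW_CMD = ("cluster-jobs", "--overview", "--json")


def fetch_overview(cmd=OVERVIEW_CMD, timeout_sec=120):
    """Run the overview command and return its JSON output as a dict."""
    result = subprocess.run(
        list(cmd),
        capture_output=True,
        text=True,
        timeout=timeout_sec,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

Passing `cmd=("cluster-jobs", "--overview", "--demo", "--json")` exercises the same path with synthetic data, so no cluster is needed to try it.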
This is the only place the tool runs sinfo and a cluster-wide squeue (still
read-only). All of it is folded into the same SSH round-trip as
squeue --me, so an overview is one connection per host. The JSON shape:
{
  "generated_at": 1718900000.0,
  "clusters": [
    {
      "name": "Snellius", "ok": true, "error": null, "kind": "slurm",
      "my_jobs": { "running": 2, "pending": 1 },
      "my_pending_jobs": [
        { "jobid": "8123460", "name": "sweep-7", "partition": "gpu_a100",
          "est_start": "2026-06-30T03:00:00" }
      ],
      "free": { "cpus": 224, "gpus": 14 },
      "capacity": { "cpus": 768, "gpus": 48 },
      "partitions": [
        {
          "name": "gpu_a100", "my_running": 1, "my_pending": 2,
          "cpus": { "free": 0, "alloc": 512, "total": 512 },
          "gpus": {
            "free": 0, "alloc": 32, "total": 32,
            "by_type": { "a100": { "free": 0, "alloc": 32, "total": 32 } },
            "max_free_per_node": 0
          },
          "is_default": false,
          "nodes": { "idle": 0, "mixed": 0, "alloc": 8, "other": 0, "total": 8 },
          "queue": {
            "pending": 11, "running": 8,
            "soonest_free_sec": 9300,
            "wait_estimate": ">=2h35m (11 queued)"
          }
        }
      ]
    }
  ]
}
Notes:
- cpus.free is sinfo's idle CPU count (down/drained nodes already excluded); gpus.free (and by_type / max_free_per_node) is counted only on usable nodes (idle/mixed/allocated). max_free_per_node tells you whether a multi-GPU job fits on one node.
- A node shared between partitions counts toward each partition's tally but is counted once in the cluster-level free / capacity totals.
- is_default flags the partition that untargeted (sbatch without -p) jobs land on.
- my_running / my_pending and my_pending_jobs[].est_start come from squeue --me (est_start is SLURM's backfill estimate, null until it's computed). The per-partition queue block comes from a cluster-wide squeue (all users).
- queue.wait_estimate is a hint, not a promise: immediate when GPUs are free now, else ~<t> / >=<t> derived from the soonest-finishing running job (soonest_free_sec) and the pending depth. It's optimistic (it doesn't model scheduler priority), so treat it as "ballpark".
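As an illustration of how an agent might use these fields, the sketch below filters the overview for partitions where a job needing N GPUs fits on a single node right now. It reads only keys shown in the JSON shape above. `partitions_that_fit` is a hypothetical helper, and the `gpu_h100` partition in the sample is invented for the example.

```python
def partitions_that_fit(overview, gpus_needed, gpu_type=None):
    """List (cluster, partition, free_gpus) where the job fits on one node now."""
    matches = []
    for cluster in overview.get("clusters", []):
        if not cluster.get("ok"):
            continue  # skip hosts whose collection failed
        for part in cluster.get("partitions", []):
            gpus = part.get("gpus", {})
            if gpus.get("max_free_per_node", 0) < gpus_needed:
                continue
            if gpu_type is not None:
                # by_type is partition-wide, so this is a necessary condition,
                # not a guarantee that one node has enough GPUs of that type.
                typed_free = gpus.get("by_type", {}).get(gpu_type, {}).get("free", 0)
                if typed_free < gpus_needed:
                    continue
            matches.append((cluster["name"], part["name"], gpus.get("free", 0)))
    # Most free GPUs first, so the roomiest partition is tried first.
    return sorted(matches, key=lambda m: m[2], reverse=True)


sample = {
    "clusters": [
        {
            "name": "Snellius", "ok": True,
            "partitions": [
                {"name": "gpu_a100",
                 "gpus": {"free": 0, "by_type": {"a100": {"free": 0}},
                          "max_free_per_node": 0}},
                {"name": "gpu_h100",  # hypothetical partition, for illustration only
                 "gpus": {"free": 6, "by_type": {"h100": {"free": 6}},
                          "max_free_per_node": 4}},
            ],
        }
    ]
}

print(partitions_that_fit(sample, gpus_needed=4))
```

An empty result means nothing fits immediately; at that point queue.wait_estimate is the field to consult, with the caveats listed in the notes.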
As an MCP tool
The same overview is exposed over MCP so an agent can call it natively, via a thin wrapper that adds no new cluster access:
pip install "cluster-job-monitor[mcp]" # installs the MCP SDK
# Register with Claude Code (point CLUSTER_MONITOR_CONFIG at your config):
claude mcp add cluster-monitor \
-e CLUSTER_MONITOR_CONFIG=/abs/path/to/clusters.json \
-- cluster-jobs-mcp
(cluster-jobs-mcp is installed with the [mcp] extra. Equivalents:
python -m cluster_job_monitor.mcp_server, or python /abs/path/to/mcp_server.py
from a source checkout.)
It serves two tools:
| tool | returns |
|---|---|
| cluster_overview | the JSON above: free CPUs/GPUs + your jobs, per cluster & partition |
| my_jobs | just your jobs per cluster (skips sinfo, lighter) |
Keys
| key | action |
|---|---|
| r | refresh now |
| f | cycle state filter (all → running → …) |
| c | cycle cluster filter |
| p | cycle partition filter |
| / | search by job name (Enter applies) |
| 1–9 | collapse / expand the cluster with that number |
| m | collapse / expand all clusters |
| esc | clear all filters |
| q | quit |
Each cluster shows a number (1 ▾ Snellius); press it to collapse that
cluster to a one-line summary (▸) and again to expand it. Start a cluster
collapsed by adding "minimized": true to its entry in clusters.json.
Auto-refresh interval is refresh_seconds in the config (default 30).
Layout
cluster-job-monitor/
  cluster_job_monitor/     # the import package (pip-installable)
    __init__.py            # public API: Job, Host, Partition, Snapshot, collect, …
    cli.py                 # entry point (--once, --overview, --json, --demo, --config)
    collector.py           # UI-agnostic: SSH + squeue/sinfo -> Snapshot dataclasses
    mcp_server.py          # MCP wrapper exposing the capacity overview to agents
    tui/app.py             # Textual app (live loop, filters, keybindings)
    tui/render.py          # Rich renderables (shared by TUI, --once, --overview)
    tui/sample.py          # synthetic snapshot for --demo
  run.py                   # back-compat shim -> cluster_job_monitor.cli:main
  mcp_server.py            # back-compat shim -> cluster_job_monitor.mcp_server:main
  clusters.example.json    # config template (copy to clusters.json)
  pyproject.toml           # packaging (hatchling) + pytest/coverage config
cluster_job_monitor/collector.py has no third-party dependencies and
returns a Snapshot whose .to_dict() is JSON-ready — that's the seam for a
future web/phone dashboard (push the dict to an authenticated endpoint and
render it in a browser), without changing the collector. Import it directly:
from cluster_job_monitor import collect, build_overview, load_config.
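The "push the dict to an authenticated endpoint" idea could look roughly like the sketch below. Everything here is an assumption for illustration: the endpoint URL, the bearer-token scheme, and the `build_push_request` helper are hypothetical, and the payload is a stand-in for whatever JSON-ready dict the collector returns. Only the standard library is used, and the request is built but not sent.

```python
import json
import urllib.request


def build_push_request(snapshot_dict, url, token):
    """Build (but do not send) an authenticated POST carrying the snapshot."""
    body = json.dumps(snapshot_dict).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_push_request(
    {"generated_at": 1718900000.0, "clusters": []},     # stand-in payload
    url="https://dashboard.example.org/api/snapshots",  # hypothetical endpoint
    token="REPLACE_ME",                                  # hypothetical credential
)
# urllib.request.urlopen(request, timeout=10) would actually send it.
```

Keeping the request construction separate from the send makes the payload easy to inspect in tests, and it leaves the collector itself unchanged.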
Development
pip install -e ".[dev,mcp]" # editable install with test + MCP deps
pytest # run the suite
pytest --cov --cov-report=term-missing # with coverage (~94%)
Tests live in tests/ and mock SSH/subprocess, so they run anywhere — no
cluster access needed. CI (GitHub Actions) runs
them on Python 3.10–3.12 and reports coverage to Codecov.
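For readers unfamiliar with the approach, this is the general shape of a test that mocks the subprocess layer so no SSH connection is made. It is a sketch of the technique, not a copy of the project's tests: `count_my_jobs` is a hypothetical function, and the squeue output lines are synthetic.

```python
import subprocess
from unittest import mock


def count_my_jobs(alias):
    """Hypothetical helper: count lines printed by `squeue --me` on a host."""
    result = subprocess.run(
        ["ssh", alias, "squeue --me --noheader"],
        capture_output=True, text=True, check=True,
    )
    return len([line for line in result.stdout.splitlines() if line.strip()])


def test_count_my_jobs_without_a_cluster():
    # Synthetic squeue output: two jobs, one per line.
    fake = subprocess.CompletedProcess(
        args=[], returncode=0,
        stdout="8123460 gpu_a100 sweep-7\n8123461 gpu_a100 sweep-8\n",
        stderr="",
    )
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert count_my_jobs("mycluster") == 2
    # The fake was called in place of a real ssh process.
    assert run.call_args.args[0][:2] == ["ssh", "mycluster"]


test_count_my_jobs_without_a_cluster()
```

Under pytest the final direct call is unnecessary; it is there so the snippet also runs as a plain script.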
License
MIT © David R Wessels. You're free to use, modify, and redistribute it; just keep the copyright notice and license text in any copies.