Read-only multi-cluster SLURM dashboard + agent capacity overview, over CLI and MCP.

Maintainers

dafidofff

Cluster Jobs — terminal dashboard

📖 Documentation — the extended reference (configuration, agent overview, MCP, development).

A read-only terminal dashboard that shows your SLURM jobs across several clusters and your desktop in one view. It SSHes into each host, runs squeue --me, and renders a live, colour-coded overview.

For coding agents: one --overview call (CLI or MCP) returns, per cluster and partition, how many CPUs/GPUs are free — broken down by GPU type, with the largest free block on a single node — plus your queued/running jobs and an approximate queueing time. See Agent overview.

Read-only by design: the only commands ever run are squeue and sinfo (SLURM hosts) and nvidia-smi + ps (non-SLURM GPU hosts). There is no code path that can cancel or submit jobs. It uses your existing SSH config and keys — nothing new is exposed, no server, no stored secrets.

[Figure: cluster-jobs live dashboard, colour-coded SLURM jobs across clusters. Live TUI with synthetic demo data (--demo).]

- Install: pip install cluster-job-monitor
- Quick look (no clusters needed): try it with synthetic data
- Real setup: point it at your clusters
- Agent overview: free CPUs/GPUs, GPU types, queue times (CLI + MCP)
- Keys: TUI keybindings
- Layout · Development · License

Install

```
pip install cluster-job-monitor            # the dashboard + CLI
pip install "cluster-job-monitor[mcp]"     # also the MCP server (see below)
```

This installs the cluster-jobs command. Prefer a virtualenv:

```
python3 -m venv .venv && source .venv/bin/activate
pip install cluster-job-monitor
```

If pip install fails with an SSL / "ssl module is not available" error, your default python3 was built without OpenSSL. On macOS with Homebrew, use Homebrew's Python instead: /opt/homebrew/bin/python3 -m venv .venv.

Working from a checkout instead? pip install -e ".[mcp,dev]" and use cluster-jobs. The legacy entrypoints python run.py … and python mcp_server.py still work as thin shims.

Quick look (no clusters needed)

```
cluster-jobs --demo          # live TUI with synthetic data
cluster-jobs --once --demo   # print one synthetic snapshot and exit
```

Real setup

Make sure each cluster is an SSH alias you can reach non-interactively. In ~/.ssh/config, e.g.:

```
Host mycluster
    HostName login.mycluster.example.edu
    User myuser
    # Reuse one connection so polling every 30s is fast and doesn't re-auth:
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```

Test it: ssh mycluster "squeue --me --noheader | head" should return instantly with no password prompt.
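The same check can be scripted, which is handy when you have several aliases to verify. This is a minimal sketch, not part of the package: the function names are hypothetical, and it relies on the standard OpenSSH options BatchMode (fail instead of prompting) and ConnectTimeout. It runs squeue without the `| head` pipe so that the exit status reflects squeue itself.

```python
import subprocess


def ssh_check_command(alias: str, timeout_s: int = 5) -> list[str]:
    """Build an ssh command that fails fast instead of prompting for input."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never ask for a password or passphrase
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable hosts
        alias,
        "squeue --me --noheader",             # same read-only query as the manual test
    ]


def can_reach(alias: str, timeout_s: int = 5) -> bool:
    """Return True if the alias answers non-interactively and squeue exits with 0."""
    try:
        result = subprocess.run(
            ssh_check_command(alias, timeout_s),
            capture_output=True,
            text=True,
            timeout=timeout_s + 10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh binary missing, or the connection hung past the hard timeout
        return False
    return result.returncode == 0
```

Calling `can_reach("mycluster")` for each alias in your config before the first run should surface hosts that would otherwise stall on a password prompt.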

Create your config from the template:
```
cp clusters.example.json clusters.json
```
Edit clusters.json — one entry per host. ssh is the ~/.ssh/config alias; set "local": true for the machine you run the tool on (no SSH). Add "minimized": true to start a cluster collapsed (see below). clusters.json is git-ignored.

Non-SLURM GPU box? Add "scheduler": "gpu" to that host. Instead of squeue it runs nvidia-smi + ps and shows one row per GPU process — the run name (from a --name/--run-name/--experiment arg, else the script name), GPU memory used, elapsed time, and the owner — plus a GPU utilisation/memory line. Works through any login shell (fish, csh, …).
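To make the run-name rule concrete, here is a rough sketch of how a name could be derived from a process command line. It is an illustration of the rule as described above, not the tool's actual parser; the function name `run_name` and the fallback details (first non-flag argument, then the executable itself) are assumptions.

```python
import os
import shlex

# Flags named in the description above; checked in command-line order.
NAME_FLAGS = ("--name", "--run-name", "--experiment")


def run_name(cmdline: str) -> str:
    """Pick a display name: a name flag's value if present, else the script name."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args):
        # Form 1: --flag=value
        for flag in NAME_FLAGS:
            if arg.startswith(flag + "="):
                return arg.split("=", 1)[1]
        # Form 2: --flag value
        if arg in NAME_FLAGS and i + 1 < len(args):
            return args[i + 1]
    # Fallback (assumption): the first non-flag argument after the executable
    for arg in args[1:]:
        if not arg.startswith("-"):
            return os.path.basename(arg)
    # Last resort: the executable itself, or empty for an empty command line
    return os.path.basename(args[0]) if args else ""
```

For example, `run_name("python train.py --run-name sweep-7")` yields `sweep-7`, while `run_name("python /home/me/train.py")` falls back to `train.py`.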

Run it:

```
cluster-jobs                 # uses ./clusters.json
cluster-jobs --config ~/my-clusters.json
```

(python run.py … still works from a checkout.)

Agent overview

For coding agents (or scripts) that need to decide where to launch a job, there's a one-shot overview that answers, in a single call:

- how many jobs you have queued / running, per cluster and per partition,
- how many CPUs and GPUs are still free, broken down by GPU type (a100/h100/…) and with the largest free block on a single node, and
- an approximate queueing time: SLURM's estimated start for your pending jobs, plus a per-partition pre-submission wait estimate.

```
cluster-jobs --overview          # human-readable capacity table
cluster-jobs --overview --json   # machine-readable JSON (for agents)
cluster-jobs --overview --demo --json   # try it with synthetic data
```

[Figure: cluster-jobs capacity overview, free CPUs/GPUs per cluster and partition, from cluster-jobs --overview (synthetic data).]

This is the only place the tool runs sinfo and a cluster-wide squeue (still read-only). All of it is folded into the same SSH round-trip as squeue --me, so an overview is one connection per host. The JSON shape:

```
{
  "generated_at": 1718900000.0,
  "clusters": [
    {
      "name": "Snellius", "ok": true, "error": null, "kind": "slurm",
      "my_jobs": { "running": 2, "pending": 1 },
      "my_pending_jobs": [
        { "jobid": "8123460", "name": "sweep-7", "partition": "gpu_a100",
          "est_start": "2026-06-30T03:00:00" }
      ],
      "free":     { "cpus": 224, "gpus": 14 },
      "capacity": { "cpus": 768, "gpus": 48 },
      "partitions": [
        {
          "name": "gpu_a100", "my_running": 1, "my_pending": 1,
          "cpus":  { "free": 0, "alloc": 512, "total": 512 },
          "gpus":  {
            "free": 0, "alloc": 32, "total": 32,
            "by_type": { "a100": { "free": 0, "alloc": 32, "total": 32 } },
            "max_free_per_node": 0
          },
          "is_default": false,
          "nodes": { "idle": 0, "mixed": 0, "alloc": 8, "other": 0, "total": 8 },
          "queue": {
            "pending": 11, "running": 8,
            "soonest_free_sec": 9300,
            "wait_estimate": ">=2h35m (11 queued)"
          }
        }
      ]
    }
  ]
}
```
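A script can read this JSON by shelling out to the CLI. The sketch below assumes only the documented flags (--overview, --json, --demo) and the field names shown in the JSON shape; the helper names `load_overview` and `free_gpus_by_partition` are hypothetical, and missing keys are tolerated with defaults because non-SLURM hosts may not report partitions.

```python
import json
import subprocess


def load_overview(demo: bool = False) -> dict:
    """Run the documented one-shot overview and parse its JSON output."""
    cmd = ["cluster-jobs", "--overview", "--json"]
    if demo:
        cmd.insert(2, "--demo")  # -> cluster-jobs --overview --demo --json
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


def free_gpus_by_partition(overview: dict) -> dict:
    """Map (cluster name, partition name) -> free GPU count, skipping failed hosts."""
    result = {}
    for cluster in overview.get("clusters", []):
        if not cluster.get("ok"):
            continue  # host unreachable or errored; its numbers are not usable
        for part in cluster.get("partitions", []):
            free = part.get("gpus", {}).get("free", 0)
            result[(cluster["name"], part["name"])] = free
    return result
```

With synthetic data, `free_gpus_by_partition(load_overview(demo=True))` gives a quick table to experiment with before pointing the tool at real clusters.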

Notes:

- cpus.free is sinfo's idle CPU count (down/drained nodes already excluded); gpus.free (and by_type/max_free_per_node) is counted only on usable nodes (idle/mixed/allocated). max_free_per_node tells you whether a multi-GPU job fits on one node.
- A node shared between partitions counts toward each partition's tally but is counted once in the cluster-level free/capacity totals. is_default flags the partition that untargeted (sbatch without -p) jobs land on.
- my_running/my_pending and my_pending_jobs[].est_start come from squeue --me (est_start is SLURM's backfill estimate, null until it's computed). The per-partition queue block comes from a cluster-wide squeue (all users).
- queue.wait_estimate is a hint, not a promise: immediate when GPUs are free now, else ~<t>/>=<t> derived from the soonest-finishing running job (soonest_free_sec) and the pending depth. It's optimistic (it doesn't model scheduler priority), so treat it as "ballpark".
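As an illustration of how an agent might use these fields, the sketch below picks partitions where a single node currently has enough free GPUs, and renders soonest_free_sec in the same h/m style as the example wait_estimate. The helper names are hypothetical and the logic is an assumption about reasonable usage, not the tool's own code. Note that max_free_per_node is not split by GPU type in the JSON shape, so the optional type filter only checks that the partition offers that type at all.

```python
from typing import Optional


def format_duration(seconds: int) -> str:
    """Render seconds as e.g. '2h35m', or '5m' when under an hour."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes = remainder // 60
    return f"{hours}h{minutes:02d}m" if hours else f"{minutes}m"


def partitions_that_fit(
    overview: dict, gpus_needed: int, gpu_type: Optional[str] = None
) -> list:
    """List (cluster, partition) pairs where one node has enough free GPUs now."""
    fits = []
    for cluster in overview.get("clusters", []):
        if not cluster.get("ok"):
            continue  # ignore hosts that failed to report
        for part in cluster.get("partitions", []):
            gpus = part.get("gpus", {})
            # Coarse type filter: the partition must list this GPU type at all
            if gpu_type is not None and gpu_type not in gpus.get("by_type", {}):
                continue
            if gpus.get("max_free_per_node", 0) >= gpus_needed:
                fits.append((cluster["name"], part["name"]))
    return fits
```

For the example JSON, `format_duration(9300)` gives `2h35m`, matching the wait_estimate string, and `partitions_that_fit(overview, 2)` returns an empty list because gpu_a100 has no free GPUs on any node.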

As an MCP tool

The same overview is exposed over MCP so an agent can call it natively, via a thin wrapper that adds no new cluster access:

```
pip install "cluster-job-monitor[mcp]"     # installs the MCP SDK

# Register with Claude Code (point CLUSTER_MONITOR_CONFIG at your config):
claude mcp add cluster-monitor \
  -e CLUSTER_MONITOR_CONFIG=/abs/path/to/clusters.json \
  -- cluster-jobs-mcp
```

(cluster-jobs-mcp is installed with the [mcp] extra. Equivalents: python -m cluster_job_monitor.mcp_server, or python /abs/path/to/mcp_server.py from a source checkout.)

It serves two tools:

| tool | returns |
| --- | --- |
| `cluster_overview` | the JSON above: free CPUs/GPUs + your jobs, per cluster & part |
| `my_jobs` | just your jobs per cluster (skips `sinfo`, lighter) |

Keys

| key | action |
| --- | --- |
| `r` | refresh now |
| `f` | cycle state filter (all → running → …) |
| `c` | cycle cluster filter |
| `p` | cycle partition filter |
| `/` | search by job name (Enter applies) |
| `1`–`9` | collapse / expand the cluster with that number |
| `m` | collapse / expand all clusters |
| `esc` | clear all filters |
| `q` | quit |

Each cluster shows a number (1 ▾ Snellius); press it to collapse that cluster to a one-line summary (▸) and again to expand it. Start a cluster collapsed by adding "minimized": true to its entry in clusters.json.

Auto-refresh interval is refresh_seconds in the config (default 30).

Layout

```
cluster-job-monitor/
  cluster_job_monitor/        # the import package (pip-installable)
    __init__.py               # public API: Job, Host, Partition, Snapshot, collect, …
    cli.py                    # entry point (--once, --overview, --json, --demo, --config)
    collector.py              # UI-agnostic: SSH + squeue/sinfo -> Snapshot dataclasses
    mcp_server.py             # MCP wrapper exposing the capacity overview to agents
    tui/app.py                # Textual app (live loop, filters, keybindings)
    tui/render.py             # Rich renderables (shared by TUI, --once, --overview)
    tui/sample.py             # synthetic snapshot for --demo
  run.py                      # back-compat shim -> cluster_job_monitor.cli:main
  mcp_server.py               # back-compat shim -> cluster_job_monitor.mcp_server:main
  clusters.example.json       # config template (copy to clusters.json)
  pyproject.toml              # packaging (hatchling) + pytest/coverage config
```

cluster_job_monitor/collector.py has no third-party dependencies and returns a Snapshot whose .to_dict() is JSON-ready — that's the seam for a future web/phone dashboard (push the dict to an authenticated endpoint and render it in a browser), without changing the collector. Import it directly: from cluster_job_monitor import collect, build_overview, load_config.

Development

```
pip install -e ".[dev,mcp]"  # editable install with test + MCP deps
pytest                       # run the suite
pytest --cov --cov-report=term-missing   # with coverage (~94%)
```

Tests live in tests/ and mock SSH/subprocess, so they run anywhere — no cluster access needed. CI (GitHub Actions) runs them on Python 3.10–3.12 and reports coverage to Codecov.

License

MIT © David R Wessels. You're free to use, modify, and redistribute it; just keep the copyright notice and license text in any copies.


Release history

This version: 0.1.0, released Jun 25, 2026

Download files

Source Distribution

cluster_job_monitor-0.1.0.tar.gz (36.3 kB), uploaded Jun 25, 2026, Source

Built Distribution

cluster_job_monitor-0.1.0-py3-none-any.whl (29.2 kB), uploaded Jun 25, 2026, Python 3

File details

Details for the file cluster_job_monitor-0.1.0.tar.gz.

File metadata

Download URL: cluster_job_monitor-0.1.0.tar.gz
Upload date: Jun 25, 2026
Size: 36.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cluster_job_monitor-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `637f0c4400e3f9116de8285b6b0093dd0a47c27500d20a381891181dc560d779` |
| MD5 | `adaafd93aab3e759a5039b928f514ccd` |
| BLAKE2b-256 | `5c1294373c2478ad951de96abfcf14df29bef3695eb10e7324299287f67f116d` |
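To check a downloaded sdist against the SHA256 digest listed above, something like the following should work. It uses only the standard library; the helper names and the local file path in the usage note are hypothetical.

```python
import hashlib

# SHA256 digest listed above for cluster_job_monitor-0.1.0.tar.gz
EXPECTED_SHA256 = "637f0c4400e3f9116de8285b6b0093dd0a47c27500d20a381891181dc560d779"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the published one (case-insensitive)."""
    return sha256_of(path) == expected.lower()
```

For a file saved in the current directory, `matches_expected("cluster_job_monitor-0.1.0.tar.gz")` should return True if the download is intact.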


Provenance

The following attestation bundles were made for cluster_job_monitor-0.1.0.tar.gz:

Publisher: release.yml on Dafidofff/cluster-job-monitor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cluster_job_monitor-0.1.0.tar.gz
- Subject digest: 637f0c4400e3f9116de8285b6b0093dd0a47c27500d20a381891181dc560d779
- Sigstore transparency entry: 1950422528
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: Dafidofff/cluster-job-monitor@1b538718af06e527d95dd2c99f0c7000951331d3
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Dafidofff
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1b538718af06e527d95dd2c99f0c7000951331d3
- Trigger Event: push

File details

Details for the file cluster_job_monitor-0.1.0-py3-none-any.whl.

File metadata

Download URL: cluster_job_monitor-0.1.0-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 29.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cluster_job_monitor-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `44f964e7f20d97a85f292b495994fb9be643f338e1fafdc15bf57cd7ba299c25` |
| MD5 | `a1e87d69a448f86e110d318a7e7912a9` |
| BLAKE2b-256 | `c2293e2b0411e8e82dd67e1172e754cd61a84ed0aa100ab3d6a21d6c2a044c09` |


Provenance

The following attestation bundles were made for cluster_job_monitor-0.1.0-py3-none-any.whl:

Publisher: release.yml on Dafidofff/cluster-job-monitor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cluster_job_monitor-0.1.0-py3-none-any.whl
- Subject digest: 44f964e7f20d97a85f292b495994fb9be643f338e1fafdc15bf57cd7ba299c25
- Sigstore transparency entry: 1950422626
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: Dafidofff/cluster-job-monitor@1b538718af06e527d95dd2c99f0c7000951331d3
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Dafidofff
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1b538718af06e527d95dd2c99f0c7000951331d3
- Trigger Event: push
