ClusterPilot
AI-assisted SLURM workflow manager for HPC clusters.
Generate, submit, and monitor SLURM jobs on any HPC cluster from a terminal UI.
https://juliafrank.net/clusterpilot/
Built by a computational physics PhD student who got tired of doing this manually.
What it does
ClusterPilot automates the full local → cluster → local research cycle:
- Describe your job in plain English: ClusterPilot sends your description to an AI model to generate a correct, cluster-aware SLURM script
- Upload and submit: files are rsynced to the cluster and `sbatch` is run over an existing SSH ControlMaster socket
- Monitor without babysitting: a background poll daemon checks `squeue` every 5 minutes; no persistent SSH connection is held open
- Get notified (optional): push notifications to your phone on job start, completion, failure, and walltime warnings via ntfy.sh
- Auto-sync results: on completion, output files are rsynced back to your local project directory
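For orientation, the cycle above corresponds roughly to the manual commands it replaces (hostnames, paths, and the job ID below are illustrative placeholders, not ClusterPilot's actual invocations):

```shell
# 1. Upload the project to a job-specific directory on the cluster
rsync -az --exclude='.git/' ./project/ user@cluster.example.edu:clusterpilot_jobs/myjob/

# 2. Submit the generated script
ssh user@cluster.example.edu sbatch clusterpilot_jobs/myjob/job.sh

# 3. Poll job status (ClusterPilot does this every 5 minutes)
ssh user@cluster.example.edu squeue -j 12345

# 4. Sync results back when the job finishes
rsync -az user@cluster.example.edu:clusterpilot_jobs/myjob/ ./project/
```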
Everything runs from a keyboard-driven terminal UI (amber phosphor aesthetic, naturally).
F2 — Describe your job and generate a SLURM script
F1 — Monitor jobs, tail logs in real time, sync results
Supported clusters
ClusterPilot works with any SLURM cluster. Grex and Compute Canada (DRAC) clusters have built-in profiles with cluster-specific prompt tuning. Every other SLURM cluster uses the generic profile, which works out of the box on the vast majority of systems.
| Cluster | cluster_type | Notes |
|---|---|---|
| Grex (UManitoba) | grex | Built-in profile |
| Cedar, Narval, Graham, Beluga (DRAC) | drac | Built-in profile |
| NSF ACCESS, ARCHER2, EuroHPC, university clusters, … | generic | Works out of the box |
Requirements
- Python >= 3.9
- System `ssh` binary with ControlMaster support (standard on macOS/Linux)
- An API key for your chosen AI provider (Anthropic or OpenAI), or a local Ollama installation
- (Optional) A free ntfy.sh topic for push notifications
- Terminal: Konsole, Alacritty, or Kitty on Linux; iTerm2 on macOS. macOS Terminal.app is not supported.
Installation
pip install clusterpilot
clusterpilot
On first run, ClusterPilot creates a starter config at
~/.config/clusterpilot/config.toml, prints its location, and exits.
Edit it to add your cluster username and account, then run clusterpilot again.
Updating
pip install --upgrade clusterpilot
That's it. Your config at ~/.config/clusterpilot/config.toml and job history
are untouched by updates.
Configuration
~/.config/clusterpilot/config.toml:
[defaults]
provider = "anthropic" # "anthropic", "openai", or "ollama"
model = "claude-sonnet-4-6" # model name for the chosen provider
api_key = "" # API key (not required for ollama)
poll_interval = 300 # seconds between job status checks
[[clusters]]
name = "grex"
host = "yak.hpc.umanitoba.ca"
user = "your_username"
account = "def-yoursupervisor" # leave blank if your cluster does not require one
scratch = "$HOME/clusterpilot_jobs"
cluster_type = "grex" # "drac", "grex", or "generic"
[notifications]
backend = "ntfy"
ntfy_topic = "your-topic-string"
ntfy_server = "https://ntfy.sh"
AI providers
| provider | model examples | API key |
|---|---|---|
| anthropic (default) | claude-sonnet-4-6, claude-opus-4-6 | ANTHROPIC_API_KEY env var or api_key in config |
| openai | gpt-4o, gpt-4o-mini, o4-mini | OPENAI_API_KEY env var or api_key in config |
| ollama | llama3.2, qwen2.5-coder, any local model | not required |
For Ollama, ClusterPilot connects to http://localhost:11434 by default. To use a remote Ollama instance, set api_base_url = "http://your-host:11434/v1" in config.
Any OpenAI-compatible API (vLLM, LM Studio, etc.) also works with provider = "openai" and api_base_url pointing at the server.
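For example, a hypothetical [defaults] block pointing at a local vLLM server (the model name and URL are illustrative; vLLM serves an OpenAI-compatible API at /v1 on port 8000 by default):

```toml
[defaults]
provider = "openai"
model = "meta-llama/Llama-3.1-8B-Instruct"  # whatever model the server is serving
api_key = ""                                # many local servers accept any key
api_base_url = "http://localhost:8000/v1"   # vLLM's OpenAI-compatible endpoint
```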
To switch provider or model, edit ~/.config/clusterpilot/config.toml directly, or press EDIT CONFIG on the F9 screen. Changes take effect on the next script generation — no restart needed.
Adding multiple clusters
Add as many [[clusters]] blocks as you need. All configured clusters appear
in the cluster dropdown on the F2 Submit screen and are connected to
automatically on startup.
[[clusters]]
name = "grex"
host = "yak.hpc.umanitoba.ca"
user = "jsmith"
account = "def-supervisor"
scratch = "$HOME/clusterpilot_jobs"
cluster_type = "grex"
[[clusters]]
name = "narval"
host = "narval.computecanada.ca"
user = "jsmith"
account = "def-supervisor"
scratch = "/scratch/jsmith"
cluster_type = "drac"
[[clusters]]
name = "myuni-hpc"
host = "hpc.myuniversity.edu"
user = "jsmith"
account = "" # omit if not required
scratch = "$HOME/jobs"
cluster_type = "generic" # any other SLURM cluster
cluster_type values:
| Value | Use for |
|---|---|
| generic | Any SLURM cluster (default if omitted) |
| drac | Compute Canada / DRAC (Cedar, Narval, Graham, Beluga) |
| grex | University of Manitoba Grex (same as generic in practice) |
ClusterPilot probes $SCRATCH at connection time, so storage advice in
generated scripts is accurate for any cluster without manual configuration:
| What the probe finds | Storage advice injected into the AI prompt |
|---|---|
| $SCRATCH is set (e.g. /scratch/jsmith) | Use $SCRATCH for large output; $SLURM_TMPDIR for temp files |
| $SCRATCH is unset | Use $HOME or the job working directory; $SLURM_TMPDIR for temp files |
| cluster_type = "drac" (regardless of probe) | Hard rule: never $HOME; DRAC home quota is ~50 GB and jobs writing there get killed |
The only reason to set cluster_type = "drac" is to get that hard warning.
For every other cluster, generic is correct — the probe handles the rest.
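The decision logic boils down to something like the following sketch (hypothetical; the function name and advice strings are illustrative, not ClusterPilot's actual code):

```python
from typing import Optional

def storage_advice(scratch: Optional[str], cluster_type: str = "generic") -> str:
    """Pick the storage guidance to inject into the AI prompt.

    scratch: the value of $SCRATCH probed on the cluster, or None if unset.
    """
    if cluster_type == "drac":
        # Hard rule: DRAC home quota is ~50 GB; jobs writing there get killed.
        return "Never write job output to $HOME; use $SCRATCH, and $SLURM_TMPDIR for temp files"
    if scratch:
        # The probe found $SCRATCH set, e.g. /scratch/jsmith
        return f"Use {scratch} for large output; $SLURM_TMPDIR for temp files"
    return "Use $HOME or the job working directory; $SLURM_TMPDIR for temp files"
```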
Upload and download excludes
When uploading a project directory, ClusterPilot excludes files that are not needed on the cluster. When downloading results, it skips source files that are already on your machine and only pulls back output (SLURM logs, data files, etc.).
Both lists are configurable in the [defaults] section:
[defaults]
# Files/dirs excluded from upload to the cluster.
upload_excludes = [
".git/",
"__pycache__/",
"*.pyc",
"*.egg-info/",
".DS_Store",
"CLAUDE.md",
"clusterpilot_jobs/",
]
# Files/dirs excluded when syncing results back from the cluster.
# Everything not matched here is downloaded (SLURM logs, data output, etc.).
download_excludes = [
"src/",
"docs/",
"examples/",
"scripts/",
"*.toml",
"*.md",
"*.sh",
".git/",
"__pycache__/",
".DS_Store",
]
These are rsync glob patterns. If your job writes output to an unusual
location, adjust download_excludes to avoid filtering it out.
Per-project upload exclusions can also be set in a .clusterpilot_ignore file
at the project root (one pattern per line, same syntax as rsync --exclude).
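As an illustration, a hypothetical .clusterpilot_ignore for a project with large local datasets might look like:

```text
# .clusterpilot_ignore - one rsync --exclude pattern per line (example patterns)
data/raw/
*.hdf5
notebooks/
```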
Usage
clusterpilot
That's it for normal use. The TUI monitors your jobs and syncs results automatically while it is open.
Background daemon (optional)
If you want job monitoring and notifications to continue after you close the TUI — for example, you submit a job and close your laptop — you can run the poll daemon separately:
clusterpilot daemon run # run in the foreground (Ctrl-C to stop)
clusterpilot daemon install # install as a systemd user service (Linux)
The daemon polls squeue every 5 minutes, sends ntfy notifications on job events, and syncs results on completion. You do not need it if you keep the TUI open.
TUI screens
| Key | Screen |
|---|---|
| F1 | Job list - status, log tail, cancel, remote cleanup |
| F2 | Submit - describe job, pick partition, generate + review script |
| F9 | Settings - clusters, SSH, notifications, API key |
F1 job actions
Select a job from the queue and use these keys:
| Key | Action |
|---|---|
| R | Rsync results to your local machine |
| T | Tail live output (polls every 5 s while running) |
| L | Fetch full output log |
| C | Clean — delete the remote job directory to reclaim scratch space |
| K | Kill — cancel the job (scancel) |
| D | Delete the job record from local history |
Cleaning up scratch space: once a job has finished, select it and press C. ClusterPilot
deletes the entire job directory from the cluster (clusterpilot_jobs/<job-name>/), including
uploaded project files, the generated script, and all output. If you have not yet synced the
results, you will be warned and asked to press C a second time to confirm. The job record stays
in your local history with a CLEANED marker.
Submitting a job (F2 workflow)
1. Select your cluster from the dropdown
2. Select a partition (populated from a live `sinfo` cache)
3. Type a plain-language description of your job, e.g.:

   Train a small transformer on CIFAR-10 using PyTorch, 1 V100, 4 hours

4. ClusterPilot generates a complete `sbatch` script; review and edit as needed
5. Press Submit: files are uploaded and the job is queued
The partition you select is passed to the model as a hard constraint, not a suggestion, and the generated script uses the correct --gres syntax for that partition's hardware.
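For instance, a generated header for the V100 request above might look like this (the partition, account, and job names are placeholders, not output from ClusterPilot):

```shell
#!/bin/bash
#SBATCH --job-name=cifar10-transformer
#SBATCH --partition=gpu           # the partition selected on the F2 screen
#SBATCH --gres=gpu:v100:1         # GPU type/count matching that partition's hardware
#SBATCH --time=04:00:00
#SBATCH --account=def-supervisor
```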
Job arrays
Fill in the ARRAY field on the F2 screen to submit a SLURM job array instead of a single job. The field accepts standard SLURM array syntax:
| Example | Meaning |
|---|---|
| 0-9 | 10 tasks, indices 0–9 |
| 1-100%5 | 100 tasks, at most 5 running simultaneously |
| 0,2,4,8 | Specific indices only |
The generated script uses $SLURM_ARRAY_TASK_ID to select parameters per
task. Describe how each index maps to your parameter space in the job
description and the AI will generate the selection logic automatically:
Run a hyperparameter sweep over learning rates [1e-4, 1e-3, 1e-2] and batch sizes [32, 64, 128]. Use $SLURM_ARRAY_TASK_ID to index into a flat list of all nine combinations.
Output logs are named <job-name>-<array-id>-<task-id>.out so results from
each task land in separate files.
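For the hyperparameter-sweep example above, the selection logic the AI generates could look like this minimal Python sketch (variable names are illustrative):

```python
import itertools
import os

# The nine combinations, enumerated in a fixed, documented order
learning_rates = [1e-4, 1e-3, 1e-2]
batch_sizes = [32, 64, 128]
combos = list(itertools.product(learning_rates, batch_sizes))

def params_for(task_id):
    """Map a SLURM array index (0-8) to one (learning_rate, batch_size) pair."""
    return combos[task_id]

# Inside the job, the index comes from the environment set by SLURM
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
lr, bs = params_for(task_id)
```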
Project directory mode
If you set PROJECT DIR on the F2 screen, the entire project tree is
rsynced to a job-specific directory on the cluster
($HOME/clusterpilot_jobs/<job-name>/). Each job gets its own isolated copy,
so you can submit multiple jobs from the same local project without them
interfering with each other. Modify a parameter, change the driver script,
and submit again - each submission creates a fresh directory on the cluster.
When results are synced back, only output files are downloaded (SLURM logs, data files). Source code that was uploaded is skipped by default. See Upload and download excludes for details.
How SSH works
ClusterPilot uses your system ssh binary with ControlMaster multiplexing.
You authenticate once (including MFA if required); all subsequent commands
reuse the existing socket with sub-second latency.
No changes to ~/.ssh/config are required. ClusterPilot passes all
ControlMaster flags directly on the command line. Your existing SSH config
is left untouched.
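The multiplexing setup is equivalent to passing standard OpenSSH options on each invocation, roughly like this (the socket path and hostname are illustrative):

```shell
# First call opens the master connection; you authenticate here, MFA included
ssh -o ControlMaster=auto \
    -o ControlPath=~/.ssh/cp-%r@%h:%p \
    -o ControlPersist=4h \
    -o ServerAliveInterval=60 \
    user@cluster.example.edu true

# Subsequent calls reuse the socket with sub-second latency, no re-auth
ssh -o ControlPath=~/.ssh/cp-%r@%h:%p user@cluster.example.edu squeue -u "$USER"
```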
Terminal emulator compatibility
Recommended terminals:
| Platform | Recommended | Works with caveats | Avoid |
|---|---|---|---|
| Linux | Konsole, Alacritty, Kitty | GNOME Terminal | — |
| macOS | iTerm2 | — | Terminal.app |
| Windows | Windows Terminal (WSL2) | — | cmd, PowerShell |
macOS Terminal.app renders many Unicode symbols (arrows, icons, box-drawing variants) at double character width, which breaks the TUI layout. This is a long-standing Terminal.app bug, not a ClusterPilot issue. Use iTerm2 on macOS — it is free and handles everything correctly.
Terminal colours
ClusterPilot uses 24-bit RGB colour throughout. Most modern terminal emulators
support this, but the COLORTERM environment variable must be set to truecolor
for Textual to detect it. Without it, colours fall back to the nearest 16 ANSI
colours, which can look significantly different from the intended amber palette.
macOS (iTerm2): truecolor works out of the box in a local window. No action needed.
Over SSH: the COLORTERM variable is often not forwarded to the remote
session. Fix this by adding the following to ~/.bashrc (or ~/.zshrc) on
the remote machine:
export COLORTERM=truecolor
Then reconnect, or run source ~/.bashrc in the current session.
To verify:
echo $COLORTERM # should print: truecolor
iTerm2 users: you can also forward the variable automatically for all SSH
sessions by adding COLORTERM = truecolor to the environment section of your
iTerm2 profile (Profiles → Session → Environment).
[Screenshots omitted: left, correct truecolor rendering; right, the 16-colour fallback over SSH without COLORTERM set, where the amber backgrounds are approximated as red by the terminal.]
Mouse support over SSH
ClusterPilot is fully keyboard-navigable (Tab, arrow keys, Enter, F1/F2/F9) and this is the recommended way to use it over SSH.
Mouse clicks work in local terminal windows and in most SSH sessions from macOS terminals. However, SSH into a Linux machine running Wayland is a known exception — mouse events are not reliably forwarded through the SSH connection in this configuration, regardless of terminal settings. This is a Wayland limitation, not a ClusterPilot bug, and affects most TUI applications.
Workaround: run ClusterPilot directly on the local machine and point it at
the remote cluster via SSH ControlMaster, which is the intended workflow. If
you need to run it on a remote Linux workstation, switching that session to an
X11 fallback (ssh -X) may restore mouse support.
Notifications (optional)
Push notifications are entirely optional. If you prefer to just leave
the TUI open and check job status from the F1 screen, that works perfectly
well. The SSH connection stays alive as long as the TUI is running
(ControlPersist 4h + ServerAliveInterval 60), the job list refreshes
automatically every 10 seconds, and you can press TAIL or LOG at any time
to see live output. No external service is needed for this workflow.
If you want push notifications to your phone (useful when you close the lid and walk away), ClusterPilot supports ntfy.sh.
Setting up ntfy (if you want it)
-
Pick a topic string - this is just a name, like a channel. Use something unique so strangers cannot read your notifications (e.g.
clusterpilot-jfrank-a8f3, nottest-jobs). -
Add it to your config (
~/.config/clusterpilot/config.toml):[notifications] backend = "ntfy" ntfy_topic = "clusterpilot-jfrank-a8f3" # your unique topic ntfy_server = "https://ntfy.sh" # or a self-hosted server
-
Subscribe on your phone - install the ntfy app (Android / iOS) and subscribe to the same topic string. No account or phone number is required.
That's it. You can also view notifications in a browser at
https://ntfy.sh/your-topic-string.
Disabling notifications
Leave ntfy_topic empty (or remove it) and no notifications will be sent:
[notifications]
backend = "ntfy"
ntfy_topic = ""
Notification events
When enabled, ClusterPilot notifies on:
- Job started (PENDING to RUNNING)
- Job completed - results are syncing
- Job failed - includes the last 6 lines of the SLURM log
- Walltime warning - less than 30 minutes remaining
- ETA update - periodic estimate while running
A self-hosted ntfy server or any HTTP POST webhook also works; set
ntfy_server in the config accordingly.
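A quick way to confirm your topic works, using ntfy's standard HTTP publish API (replace the topic with your own):

```shell
# Publishing is a plain HTTP POST with the message as the body
curl -d "ClusterPilot test notification" https://ntfy.sh/your-topic-string
```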
Architecture
clusterpilot/
ssh/ system ssh/rsync subprocess wrappers (ControlMaster)
cluster/ sinfo/module avail probe + 24h JSON cache
jobs/ AI script generation, sbatch submit, state machine
notify/ ntfy.sh HTTP push
daemon/ async poll loop + systemd service installer
tui/ Textual app (F1 jobs / F2 submit / F9 settings)
config.py ~/.config/clusterpilot/config.toml loader
db.py aiosqlite job history
All cluster-specific SLURM quirks (account requirements, scratch paths, GPU syntax) live in one place and are injected into the AI prompt automatically.
Development
git clone https://github.com/ju-pixel/clusterpilot
cd clusterpilot
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest # 128 tests, no SSH required
ruff check . # lint
Planned
- Support for additional AI providers (OpenAI, local models via Ollama, etc.) (done)
- Job array support in the submission UI (done)
- Hosted tier with managed API key and web dashboard
- conda-forge package for HPC environments that prefer conda
- Windows support (WSL2 path handling, no systemd dependency)
- Remote cleanup from F1: delete synced/terminal job directories on the cluster to reclaim scratch space without SSH-ing in manually (done)
- Cost estimation before submission based on requested resources and account allocation (done)
Support
ClusterPilot is free and open source. If it saves you time, consider sponsoring development.
Licence
MIT - free to use and self-host.
A hosted tier (managed API key, web dashboard) is planned for researchers who want zero setup. Subscribing will also support continued development. The self-hosted version will always be fully functional.