Client-side contention management for Slurm HPC clusters using CSMA/CA-inspired backoff

polite_submit

Requires Python 3.10+. License: MIT.

Overview

polite_submit probes cluster state before job submission and backs off when resources are congested, improving queue health for all users without requiring scheduler modifications.

Key Features:

  • Reduces queue congestion from batch job floods
  • Zero server-side changes required (pure client)
  • Drop-in replacement for sbatch
  • Configurable politeness levels
  • Supports batch and array job chunking
  • Exponential backoff with jitter (like WiFi CSMA/CA)

Installation

pip install polite_submit

Or from source:

git clone https://github.com/ahb-sjsu/polite-submit
cd polite-submit
pip install -e .

Quick Start

# Single job
polite_submit job.sh

# Multiple scripts
polite_submit --batch job1.sh job2.sh job3.sh

# Array job in chunks
polite_submit --array sweep.sh --range 0-99 --chunk 10

# Dry run (see what would happen)
polite_submit --dry-run job.sh

# Skip politeness checks (aggressive mode, e.g. late at night)
polite_submit --aggressive job.sh
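
To make the array example concrete, here is a minimal sketch of how a range such as 0-99 could be split into chunks of 10. The function name chunk_array_range is hypothetical and is not taken from the polite_submit source. It only illustrates the arithmetic, assuming the range is inclusive and each chunk is submitted as its own sub-range.

```python
def chunk_array_range(range_spec: str, chunk: int) -> list[str]:
    """Split an inclusive 'START-END' range into consecutive sub-ranges.

    Hypothetical helper for illustration; not part of polite_submit's API.
    """
    if chunk < 1:
        raise ValueError("chunk must be a positive integer")
    start_text, end_text = range_spec.split("-", 1)
    start, end = int(start_text), int(end_text)
    if end < start:
        raise ValueError("range end must not be smaller than range start")

    pieces = []
    for low in range(start, end + 1, chunk):
        # The last piece may be shorter than `chunk`.
        high = min(low + chunk - 1, end)
        pieces.append(f"{low}-{high}")
    return pieces


chunks = chunk_array_range("0-99", 10)
print(chunks[0], chunks[-1], len(chunks))  # prints: 0-9 90-99 10
```

Under this assumption, each piece could be handed to sbatch as --array=LOW-HIGH, with a politeness check between chunks.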

How It Works

Before each submission, polite_submit:

  1. Probes cluster state via sinfo and squeue
  2. Checks thresholds:
    • Am I running too many jobs? (default: 4)
    • Do I have too many pending? (default: 2)
    • Are too many others waiting in the queue? (default: 10 pending)
    • Is cluster utilization high? (default: 85%)
  3. If any threshold is exceeded: back off with an exponential delay
  4. If clear: Submit via sbatch

This mirrors CSMA/CA (Carrier-Sense Multiple Access with Collision Avoidance) from WiFi protocols.
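
The threshold checks in step 2 can be sketched as a pure function over an already-probed cluster snapshot. This is an illustration, not the tool's actual code: the class and function names are hypothetical, and the use of "greater than or equal" for every comparison is an assumption, since the README only gives the default values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterState:
    """Snapshot of the cluster (hypothetical structure for illustration)."""
    my_running: int       # my jobs currently running
    my_pending: int       # my jobs waiting in the queue
    others_pending: int   # other users' pending jobs
    utilization: float    # fraction of the cluster in use, 0.0 to 1.0


@dataclass
class Politeness:
    """Defaults copied from the thresholds listed above."""
    max_concurrent_jobs: int = 4
    max_pending_jobs: int = 2
    queue_depth_threshold: int = 10
    utilization_threshold: float = 0.85


def reasons_to_back_off(state: ClusterState,
                        limits: Optional[Politeness] = None) -> list[str]:
    """Return the exceeded thresholds; an empty list means clear to submit."""
    if limits is None:
        limits = Politeness()
    reasons = []
    # Assumption: reaching a limit already counts as exceeding it.
    if state.my_running >= limits.max_concurrent_jobs:
        reasons.append("too many running jobs")
    if state.my_pending >= limits.max_pending_jobs:
        reasons.append("too many pending jobs")
    if state.others_pending >= limits.queue_depth_threshold:
        reasons.append("others are waiting")
    if state.utilization >= limits.utilization_threshold:
        reasons.append("cluster utilization is high")
    return reasons


quiet = ClusterState(my_running=1, my_pending=0, others_pending=3, utilization=0.40)
busy = ClusterState(my_running=4, my_pending=0, others_pending=12, utilization=0.90)
print(reasons_to_back_off(quiet))       # prints: []
print(len(reasons_to_back_off(busy)))   # prints: 3
```

Returning the list of reasons rather than a bare boolean makes it easy to report why a submission was delayed, for instance in a dry run.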

Configuration

Create ~/.polite_submit.yaml or polite_submit.yaml in your working directory:

cluster:
  host: hpc                    # SSH host alias (null for local)
  partition: gpu               # Default partition

politeness:
  max_concurrent_jobs: 4       # Max running at once
  max_pending_jobs: 2          # Max waiting in queue
  queue_depth_threshold: 10    # Back off if this many others pending
  utilization_threshold: 0.85  # Back off if cluster this full

peak_hours:
  enabled: true
  schedule:
    - [9, 17]                  # 9 AM - 5 PM
  max_concurrent: 2            # Stricter during peak
  weekend_exempt: true

backoff:
  initial_seconds: 30
  max_seconds: 1800            # 30 minutes
  multiplier: 2.0
  max_attempts: 20
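
As a rough guide to what these values imply, the sketch below computes a delay schedule from the backoff settings and evaluates the peak-hours window. It rests on several assumptions that the configuration alone does not confirm: that the delay for attempt n is initial_seconds * multiplier**n capped at max_seconds, that jitter shortens the delay by a random fraction (the spread parameter is invented for this example), and that each schedule entry is a half-open [start, end) hour window. All function names are hypothetical.

```python
import random
from datetime import datetime


def base_delay(attempt: int, initial_seconds: float = 30.0,
               multiplier: float = 2.0, max_seconds: float = 1800.0) -> float:
    """Capped exponential delay for a zero-based attempt number."""
    return min(initial_seconds * multiplier ** attempt, max_seconds)


def jittered_delay(attempt: int, spread: float = 0.25, rng=random) -> float:
    """Assumed jitter: scale the base delay by a factor in [1 - spread, 1]."""
    return base_delay(attempt) * rng.uniform(1.0 - spread, 1.0)


def is_peak(now: datetime, schedule=((9, 17),), weekend_exempt: bool = True) -> bool:
    """True if `now` falls inside any [start, end) hour window."""
    if weekend_exempt and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return any(start <= now.hour < end for start, end in schedule)


print([base_delay(n) for n in range(8)])
# prints: [30.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1800.0, 1800.0]

monday_morning = datetime(2024, 1, 8, 10, 0)
limit = 2 if is_peak(monday_morning) else 4  # max_concurrent vs max_concurrent_jobs
print(limit)  # prints: 2
```

With the defaults shown, the cap of 1800 seconds is reached on the seventh attempt under these assumptions, and every later attempt waits at most 30 minutes.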

CLI Options

Usage: polite_submit [OPTIONS] [SCRIPT]

Options:
  -b, --batch PATH    Submit multiple scripts (can be repeated)
  -a, --array PATH    Submit as array job
  --range TEXT        Array range (e.g., 0-99). Required with --array
  --chunk INTEGER     Chunk size for array jobs
  --aggressive        Skip politeness checks
  -n, --dry-run       Show what would happen without submitting
  -c, --config PATH   Path to config file
  -p, --partition     Override partition
  -H, --host TEXT     SSH host for remote cluster
  --version           Show version
  --help              Show this message

SSH Setup

For remote clusters, configure SSH:

# ~/.ssh/config
Host hpc
    HostName your-cluster.edu
    User yourusername
    IdentityFile ~/.ssh/id_ed25519

Then use:

polite_submit --host hpc job.sh
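
One plausible way a client-side tool can probe a remote cluster is to prefix the Slurm command with ssh and the host alias. The sketch below only builds the argument list and does not run it. The function name is hypothetical, and whether polite_submit constructs its probes this way is an assumption. The squeue options used (--noheader, --user, --states, --format) are standard Slurm flags.

```python
from typing import Optional


def build_squeue_command(host: Optional[str], user: str,
                         state: str = "PENDING") -> list[str]:
    """Build an squeue invocation, wrapped in ssh when a host alias is given.

    Hypothetical helper for illustration; not part of polite_submit's API.
    """
    command = [
        "squeue",
        "--noheader",         # omit the column header line
        f"--user={user}",     # only this user's jobs
        f"--states={state}",  # e.g. PENDING or RUNNING
        "--format=%i",        # print job IDs only
    ]
    if host:
        # The alias is resolved through ~/.ssh/config, as configured above.
        return ["ssh", host, *command]
    return command


print(build_squeue_command("hpc", "yourusername")[:3])
# prints: ['ssh', 'hpc', 'squeue']
```

The resulting list could be passed to subprocess.run(..., capture_output=True, text=True), and counting the output lines would give the number of matching jobs.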

Theory: Fairness as Gauge Invariance

This tool implements voluntary compliance with fairness constraints. By limiting your own submissions when others are waiting, you preserve approximate user-permutation invariance: the principle that who you are shouldn't change your expected wait time.

For more on the theoretical foundation, see the ErisML library and the SQND framework.

License

MIT License
