
Client-side contention management for Slurm HPC clusters using CSMA/CA-inspired backoff


polite_submit

Python 3.10+ | License: MIT


Overview

polite_submit probes cluster state before job submission and backs off when resources are congested, improving queue health for all users without requiring scheduler modifications.

Key Features:

  • Reduces queue congestion from batch job floods
  • Zero server-side changes required (pure client)
  • Drop-in replacement for sbatch
  • Configurable politeness levels
  • Supports batch and array job chunking
  • Exponential backoff with jitter (like WiFi CSMA/CA)

Installation

pip install polite_submit

Or from source:

git clone https://github.com/ahb-sjsu/polite-submit
cd polite-submit
pip install -e .

Quick Start

# Single job
polite_submit job.sh

# Multiple scripts
polite_submit --batch job1.sh job2.sh job3.sh

# Array job in chunks
polite_submit --array sweep.sh --range 0-99 --chunk 10

# Dry run (see what would happen)
polite_submit --dry-run job.sh

# Skip politeness checks (aggressive mode, e.g. late at night)
polite_submit --aggressive job.sh
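The `--array sweep.sh --range 0-99 --chunk 10` example splits one large array into several smaller submissions. A minimal sketch of how such a range could be cut into Slurm `--array` specifications, assuming a single contiguous `START-END` range with inclusive bounds. The helper `chunk_array_range` is hypothetical and not part of polite_submit's documented API.

```python
def chunk_array_range(range_spec: str, chunk: int) -> list[str]:
    """Split an inclusive 'START-END' range into sbatch --array specs of at most `chunk` tasks.

    Hypothetical helper for illustration; assumes non-negative integer bounds.
    """
    if chunk < 1:
        raise ValueError("chunk must be a positive integer")
    start_text, end_text = range_spec.split("-", 1)
    start, end = int(start_text), int(end_text)
    if end < start:
        raise ValueError(f"invalid range: {range_spec!r}")
    specs = []
    for lo in range(start, end + 1, chunk):
        hi = min(lo + chunk - 1, end)
        # A one-task chunk is written as a single index, e.g. "4".
        specs.append(str(lo) if lo == hi else f"{lo}-{hi}")
    return specs


if __name__ == "__main__":
    for spec in chunk_array_range("0-99", 10):
        # Each chunk would go through the politeness checks and then be submitted on its own.
        print(f"sbatch --array={spec} sweep.sh")
```

Chunking this way lets each slice of the sweep wait its turn, instead of placing all 100 tasks in the queue at once.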

How It Works

Before each submission, polite_submit:

  1. Probes cluster state via sinfo and squeue
  2. Checks thresholds:
    • Am I running too many jobs? (default: 4)
    • Do I have too many jobs pending? (default: 2)
    • Are many other jobs waiting in the queue? (default: 10)
    • Is cluster utilization high? (default: 85%)
  3. If any threshold is exceeded, backs off with an exponentially growing delay and probes again
  4. If all checks pass, submits via sbatch
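The probe-and-check step above can be sketched as follows. This is a hedged illustration, not polite_submit's actual code. It assumes your own jobs come from `squeue -h -u $USER -o %T` (one job state per line, such as `RUNNING` or `PENDING`), other users' pending jobs come from `squeue -h -t PENDING -o %u` (one user name per line), and utilization comes from `sinfo -h -o %C`, which reports CPUs as `allocated/idle/other/total`. The names `ClusterState`, `count_my_jobs`, `count_others_pending`, `parse_cpu_utilization`, and `backoff_reasons` are hypothetical, and treating each threshold as "back off when reached" (`>=`) is also an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClusterState:
    my_running: int
    my_pending: int
    others_pending: int
    utilization: float  # fraction of CPUs allocated, 0.0 to 1.0


def count_my_jobs(squeue_states: str) -> tuple[int, int]:
    """Count RUNNING and PENDING lines in `squeue -h -u $USER -o %T` output."""
    counts = Counter(line.strip() for line in squeue_states.splitlines() if line.strip())
    return counts["RUNNING"], counts["PENDING"]


def count_others_pending(squeue_users: str, me: str) -> int:
    """Count pending jobs owned by other users in `squeue -h -t PENDING -o %u` output."""
    return sum(1 for line in squeue_users.splitlines() if line.strip() and line.strip() != me)


def parse_cpu_utilization(sinfo_output: str) -> float:
    """Turn `sinfo -h -o %C` output ('allocated/idle/other/total') into a fraction.

    Sums every non-empty line in case sinfo prints one line per partition.
    """
    allocated = total = 0
    for line in sinfo_output.splitlines():
        if line.strip():
            alloc, _idle, _other, tot = (int(x) for x in line.strip().split("/"))
            allocated += alloc
            total += tot
    return allocated / total if total else 0.0


def backoff_reasons(state: ClusterState, max_concurrent_jobs: int = 4, max_pending_jobs: int = 2,
                    queue_depth_threshold: int = 10, utilization_threshold: float = 0.85) -> list[str]:
    """Return the thresholds that are exceeded; an empty list means it is clear to submit."""
    reasons = []
    if state.my_running >= max_concurrent_jobs:
        reasons.append("too many running jobs")
    if state.my_pending >= max_pending_jobs:
        reasons.append("too many pending jobs")
    if state.others_pending >= queue_depth_threshold:
        reasons.append("others are waiting")
    if state.utilization >= utilization_threshold:
        reasons.append("cluster utilization high")
    return reasons


if __name__ == "__main__":
    running, pending = count_my_jobs("RUNNING\nRUNNING\nPENDING\n")
    state = ClusterState(running, pending, others_pending=3,
                         utilization=parse_cpu_utilization("880/120/0/1000"))
    print(backoff_reasons(state))  # ['cluster utilization high']
```

Returning the list of reasons, rather than a bare boolean, makes a dry run able to report why a submission would be deferred.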

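This mirrors CSMA/CA (Carrier-Sense Multiple Access with Collision Avoidance), the channel-access scheme used by Wi-Fi (IEEE 802.11): sense whether the medium is busy before transmitting, and if it is, defer for a randomized backoff whose window grows after each failed attempt.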

Configuration

Create ~/.polite_submit.yaml or polite_submit.yaml in your working directory:

cluster:
  host: hpc                    # SSH host alias (null for local)
  partition: gpu               # Default partition

politeness:
  max_concurrent_jobs: 4       # Max running at once
  max_pending_jobs: 2          # Max waiting in queue
  queue_depth_threshold: 10    # Back off if this many other jobs are pending
  utilization_threshold: 0.85  # Back off at this cluster utilization (85%)

peak_hours:
  enabled: true
  schedule:
    - [9, 17]                  # 9 AM - 5 PM
  max_concurrent: 2            # Stricter during peak
  weekend_exempt: true

backoff:
  initial_seconds: 30
  max_seconds: 1800            # 30 minutes
  multiplier: 2.0
  max_attempts: 20
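The `peak_hours` block lowers the concurrency limit during busy hours. A sketch of how the effective limit could be computed, assuming each `schedule` entry is a `[start_hour, end_hour]` pair in local time with the end hour excluded, and that `weekend_exempt: true` means Saturday and Sunday never count as peak. Those interpretations, and the names `EXAMPLE_CONFIG`, `is_peak`, and `effective_max_concurrent`, are assumptions rather than documented behavior.

```python
from collections.abc import Mapping, Sequence
from datetime import datetime

# Hypothetical in-memory form of the YAML above (as yaml.safe_load would return it).
EXAMPLE_CONFIG = {
    "politeness": {"max_concurrent_jobs": 4},
    "peak_hours": {
        "enabled": True,
        "schedule": [[9, 17]],
        "max_concurrent": 2,
        "weekend_exempt": True,
    },
}


def is_peak(now: datetime, schedule: Sequence[Sequence[int]], weekend_exempt: bool) -> bool:
    """True if `now` falls in any [start_hour, end_hour) window of the schedule."""
    if weekend_exempt and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return any(start <= now.hour < end for start, end in schedule)


def effective_max_concurrent(now: datetime, config: Mapping) -> int:
    """Return the stricter peak-hour limit while peak hours are active, else the normal limit."""
    base = config["politeness"]["max_concurrent_jobs"]
    peak = config.get("peak_hours", {})
    if peak.get("enabled") and is_peak(now, peak.get("schedule", []), peak.get("weekend_exempt", False)):
        return min(base, peak.get("max_concurrent", base))
    return base


if __name__ == "__main__":
    print(effective_max_concurrent(datetime.now(), EXAMPLE_CONFIG))
```

For example, with this configuration 2 PM on Wednesday, 10 January 2024 yields a limit of 2, while the same time on Saturday, 13 January 2024 yields 4.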

CLI Options

Usage: polite_submit [OPTIONS] [SCRIPT]

Options:
  -b, --batch PATH    Submit multiple scripts (can be repeated)
  -a, --array PATH    Submit as array job
  --range TEXT        Array range (e.g., 0-99). Required with --array
  --chunk INTEGER     Chunk size for array jobs
  --aggressive        Skip politeness checks
  -n, --dry-run       Show what would happen without submitting
  -c, --config PATH   Path to config file
  -p, --partition     Override partition
  -H, --host TEXT     SSH host for remote cluster
  --version           Show version
  --help              Show this message

SSH Setup

For remote clusters, configure SSH:

# ~/.ssh/config
Host hpc
    HostName your-cluster.edu
    User yourusername
    IdentityFile ~/.ssh/id_ed25519

Then use:

polite_submit --host hpc job.sh

Theory: Fairness as Gauge Invariance

This tool implements voluntary compliance with fairness constraints. By limiting your own submissions when others are waiting, you help preserve approximate user-permutation invariance: the principle that who you are should not change your expected wait time.

For more on the theoretical foundation, see the ErisML library and the SQND framework.

License

MIT License



Download files


Source Distribution

polite_submit-0.1.0.tar.gz (16.7 kB)

Built Distribution


polite_submit-0.1.0-py3-none-any.whl (14.7 kB)

File details

Details for the file polite_submit-0.1.0.tar.gz.

File metadata

  • Download URL: polite_submit-0.1.0.tar.gz
  • Size: 16.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for polite_submit-0.1.0.tar.gz
Algorithm Hash digest
SHA256 fc30722e876005c3d10a12d13da51104197f05209232ff3427a8cd8e3fc64bd0
MD5 b1000cebc5a4c79f4ad2c5b5af622758
BLAKE2b-256 e80e1a14980a1bc2a5e507e6df1662cb91acc1029c7f24163b11df0392c39f9a


File details

Details for the file polite_submit-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: polite_submit-0.1.0-py3-none-any.whl
  • Size: 14.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for polite_submit-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ccd3bfe3449227dda800689fcad8af8a43bbb9b5774f58d8d6a9a5ea9328e0c2
MD5 cfecd454f99f964dc7bff4775adca6cc
BLAKE2b-256 ac80a0c7a1dc55f3716e4ad3388d31bedb03eaa13a66f9b9a83215d38158a414

