
scitex-hpc


Generic SLURM dispatch for the SciTeX ecosystem — srun / sbatch / sync / poll_job / fetch_result with sane defaults for spartan/sapphire and override knobs for any other cluster.

Login nodes never run compute — every command is wrapped in srun or sbatch via a login-shell SSH so the SLURM module loads correctly.
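The wrapping idea can be sketched roughly as follows. This is a simplified illustration of the pattern, not the library's actual implementation; the helper name and quoting strategy are assumptions:

```python
import shlex

def wrap_for_login_shell(host: str, slurm_cmd: list[str]) -> list[str]:
    # Run the SLURM command under `bash -l` so login-shell init files
    # (which typically load the SLURM module) are sourced remotely.
    remote = "bash -l -c " + shlex.quote(shlex.join(slurm_cmd))
    return ["ssh", host, remote]

cmd = wrap_for_login_shell("spartan", ["srun", "--cpus-per-task=16", "hostname"])
# `cmd` is an argv list ready for subprocess.run(cmd)
```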

Install

pip install scitex-hpc

Usage

from scitex_hpc import JobConfig, srun, sbatch, sync, poll_job, fetch_result

cfg = JobConfig(
    project="scitex-dsp",
    command="pip install -e '.[dev]' -q && python -m pytest tests/ -n 16",
    host="spartan",
    partition="sapphire",
    cpus=16,
    time="00:30:00",
    mem="64G",
)

# 1. Sync local sources to the cluster.
sync(cfg)

# 2a. Blocking interactive run via srun.
exit_code = srun(cfg)

# 2b. Async batch submission via sbatch.
job_id = sbatch(cfg)
print(poll_job(cfg, job_id))   # {'state': 'COMPLETED', 'exit_code': '0:0', 'elapsed': '00:01:23'}
fetch_result(cfg, job_id)      # downloads the .out file
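A simple way to drive the async path is to poll until the job reaches a terminal state. Below is a generic sketch with a stub standing in for poll_job(cfg, job_id); the terminal-state set and the stub are assumptions, not part of the scitex-hpc API:

```python
import time

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}

def wait_for_job(poll, interval: float = 30.0, max_polls: int = 1000) -> dict:
    # `poll` is a zero-arg callable, e.g. lambda: poll_job(cfg, job_id).
    for _ in range(max_polls):
        status = poll()
        if status["state"] in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError("job did not reach a terminal state")

# Stub simulating a job that goes PENDING -> RUNNING -> COMPLETED:
states = iter([{"state": "PENDING"},
               {"state": "RUNNING"},
               {"state": "COMPLETED", "exit_code": "0:0"}])
result = wait_for_job(lambda: next(states), interval=0.0)
```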

Reservations (book once, exec many)

For workflows where queue wait dominates iteration time — multi-agent fleets, distributed test runners, jupyter-on-HPC — book a node once and run many short commands inside its allocation:

from scitex_hpc import JobConfig, Reservation

# Book a 7-day allocation
res = Reservation.book(
    JobConfig(
        project="dev-pool",
        host="spartan",
        partition="cascade",
        cpus=8, mem="32G", time="7-0",
    ),
    persistent=True,        # walltime auto-resubmit via SIGUSR1 trap
)

# Run many commands inside the SAME allocation — no queue wait
res.exec("hostname")                          # → "spartan-bm022.hpc..."
res.exec(["python", "-m", "unittest", "discover"])
res.exec("tmux new -d -s helper claude --dangerously-skip-permissions")

# Open an interactive shell on the compute node
res.attach(cmd="bash")

# Or look up later by friendly name (state lives in ~/.scitex/hpc/leases/)
res = Reservation.get("dev-pool")
res.release()                                 # scancel + clear state

Equivalent CLI:

scitex-hpc reservations book dev-pool --host spartan --cpus 8 --mem 32G --time 7-0 --persistent
scitex-hpc reservations list
scitex-hpc reservations exec dev-pool 'hostname'
scitex-hpc reservations attach dev-pool
scitex-hpc reservations release dev-pool

Compatible with bastion-only HPC policies. No daemons, no tunnels, no crontab @reboot. Every exec() is a fresh ssh round-trip. SSH ControlMaster pooling on the calling host amortizes the handshake cost.
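For reference, ControlMaster pooling is a client-side ~/.ssh/config setting on the machine calling exec(); a typical fragment looks like this (the values are examples, not scitex-hpc requirements):

```
Host spartan
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m
```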

Walltime auto-resubmit (persistent=True)

When persistent=True, scitex-hpc:

  1. Adds #SBATCH --signal=B:USR1@3600 so SLURM signals the script 1h before walltime.
  2. Wraps the sbatch script body with a SIGUSR1 trap that calls sbatch "$0" to resubmit itself.
  3. The friendly name (dev-pool) stays stable across resubmits; the SLURM job_id changes.
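Put together, a persistent reservation script follows this shape. This is a hand-written sketch of the pattern, not the script scitex-hpc actually generates:

```shell
#!/bin/bash
#SBATCH --job-name=dev-pool
#SBATCH --signal=B:USR1@3600    # signal the batch shell 1h before walltime

resubmit() {
    echo "walltime approaching: resubmitting"
    sbatch "$0"                 # re-queue this same script
    exit 0
}
trap resubmit USR1

# Keep the allocation alive. Background the sleep and `wait` on it:
# bash only delivers trapped signals between foreground commands, and
# `wait` returns as soon as the trap fires.
sleep infinity &
wait
```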

To pick up the new job_id after a resubmit:

res = Reservation.get("dev-pool")
res.refresh()                                 # squeue --user --name=dev-pool
res.exec("...")                               # uses the new job_id

This is SLURM's documented signaling mechanism — not a custom daemon. Compatible with HPC policies that ban persistent user-space daemons.

Defaults & overrides

Every JobConfig field has a SCITEX_HPC_* env-var override:

| Field       | Default  | Env override           |
|-------------|----------|------------------------|
| host        | spartan  | SCITEX_HPC_HOST        |
| partition   | sapphire | SCITEX_HPC_PARTITION   |
| cpus        | 16       | SCITEX_HPC_CPUS        |
| time        | 00:20:00 | SCITEX_HPC_TIME        |
| mem         | 128G     | SCITEX_HPC_MEM         |
| remote_base | ~/proj   | SCITEX_HPC_REMOTE_BASE |

Resolution priority: explicit JobConfig field → env var → built-in default.
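The resolution order can be expressed as a small helper. This is an illustrative sketch of the documented priority, not the package's internals:

```python
import os

DEFAULTS = {"host": "spartan", "partition": "sapphire", "cpus": "16",
            "time": "00:20:00", "mem": "128G", "remote_base": "~/proj"}

def resolve(field: str, explicit=None) -> str:
    # explicit JobConfig field -> SCITEX_HPC_* env var -> built-in default
    if explicit is not None:
        return explicit
    return os.environ.get(f"SCITEX_HPC_{field.upper()}", DEFAULTS[field])

os.environ["SCITEX_HPC_PARTITION"] = "cascade"
print(resolve("partition"))               # env var beats the default
print(resolve("partition", "sapphire"))   # explicit field beats the env var
print(resolve("mem"))                     # falls through to the default
```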

Status

Standalone module from the SciTeX ecosystem. The public API also surfaces as scitex.hpc (via the umbrella package's sys.modules alias), so you can write from scitex.hpc import srun from any consumer.

License

AGPL-3.0-only.

