
Project description

scitex-hpc


Generic SLURM dispatch for the SciTeX ecosystem — srun / sbatch / sync / poll_job / fetch_result with sane defaults for spartan/sapphire and override knobs for any other cluster.

Login nodes never run compute — every command is wrapped in srun or sbatch via a login-shell SSH so the SLURM module loads correctly.
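
Concretely, an interactive dispatch has roughly the shape of the single SSH command below. This is an illustrative sketch, not the literal string scitex-hpc builds; the host, partition, and resource flags are filled in from the JobConfig fields shown in the next section.

# Illustrative only: the real command is assembled from JobConfig.
ssh spartan "bash -l -c 'srun --partition=sapphire --cpus-per-task=16 --mem=64G --time=00:30:00 <your command>'"

The login shell (bash -l) sources the profile so the SLURM module initialization runs before srun hands the command to a compute node.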

Install

pip install scitex-hpc

Usage

from scitex_hpc import JobConfig, srun, sbatch, sync, poll_job, fetch_result

cfg = JobConfig(
    project="scitex-dsp",
    command="pip install -e '.[dev]' -q && python -m pytest tests/ -n 16",
    host="spartan",
    partition="sapphire",
    cpus=16,
    time="00:30:00",
    mem="64G",
)

# 1. Sync local sources to the cluster.
sync(cfg)

# 2a. Blocking interactive run via srun.
exit_code = srun(cfg)

# 2b. Async batch submission via sbatch.
job_id = sbatch(cfg)
print(poll_job(cfg, job_id))   # {'state': 'COMPLETED', 'exit_code': '0:0', 'elapsed': '00:01:23'}
fetch_result(cfg, job_id)      # downloads the .out file
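
For batch submissions, a small polling loop can wait for a terminal state before pulling the output. The sketch below uses only the calls documented above; the state strings are standard SLURM job states and the 30-second interval is arbitrary.

import time

# Continuing from the snippet above: wait for a terminal state, then fetch.
while True:
    status = poll_job(cfg, job_id)
    if status["state"] in {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}:
        break
    time.sleep(30)              # poll the scheduler at a gentle interval

fetch_result(cfg, job_id)       # pull the job's .out file locally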

Reservations (book once, exec many)

For workflows where queue wait dominates iteration time — multi-agent fleets, distributed test runners, jupyter-on-HPC — book a node once and run many short commands inside its allocation:

from scitex_hpc import JobConfig, Reservation

# Book a 7-day allocation
res = Reservation.book(
    JobConfig(
        project="dev-pool",
        host="spartan",
        partition="cascade",
        cpus=8, mem="32G", time="7-0",
    ),
    persistent=True,        # walltime auto-resubmit via SIGUSR1 trap
)

# Run many commands inside the SAME allocation — no queue wait
res.exec("hostname")                          # → "spartan-bm022.hpc..."
res.exec(["python", "-m", "unittest", "discover"])
res.exec("tmux new -d -s helper claude --dangerously-skip-permissions")

# Open an interactive shell on the compute node
res.attach(cmd="bash")

# Or look up later by friendly name (state lives in ~/.scitex/hpc/leases/)
res = Reservation.get("dev-pool")
res.release()                                 # scancel + clear state

Equivalent CLI:

scitex-hpc reservations book dev-pool --host spartan --cpus 8 --mem 32G --time 7-0 --persistent
scitex-hpc reservations list
scitex-hpc reservations exec dev-pool 'hostname'
scitex-hpc reservations attach dev-pool
scitex-hpc reservations release dev-pool

Compatible with bastion-only HPC policies: no daemons, no tunnels, no crontab @reboot. Every exec() is a fresh SSH round-trip; SSH ControlMaster pooling on the calling host amortizes the handshake cost.
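
A typical ControlMaster stanza on the calling host looks like the one below. This is plain OpenSSH client configuration, not something scitex-hpc manages for you; adjust the host alias and persistence window to taste.

# ~/.ssh/config: reuse one authenticated connection across many exec() calls
Host spartan
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m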

Walltime auto-resubmit (persistent=True)

When persistent=True, scitex-hpc:

  1. Adds #SBATCH --signal=B:USR1@3600 so SLURM signals the script 1h before walltime.
  2. Wraps the sbatch script body with a SIGUSR1 trap that calls sbatch "$0" to resubmit itself.
  3. The friendly name (dev-pool) stays stable across resubmits; the SLURM job_id changes.
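
The generated script body is conceptually equivalent to the sketch below. It is a simplification, assuming the friendly name doubles as the SLURM job name (which is how refresh() finds it via squeue --name); the actual template scitex-hpc emits may differ.

#!/bin/bash
#SBATCH --job-name=dev-pool
#SBATCH --signal=B:USR1@3600     # B = deliver the signal to the batch shell itself

# Resubmit this same script when SLURM warns that walltime is near.
trap 'sbatch "$0"' USR1

# Keep the allocation alive so exec()/attach() have a target. wait is interrupted
# by the trapped signal, so loop until the child (or the whole job) is gone.
sleep infinity &
child=$!
while kill -0 "$child" 2>/dev/null; do
    wait "$child"
done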

To pick up the new job_id after a resubmit:

res = Reservation.get("dev-pool")
res.refresh()                                 # squeue --user --name=dev-pool
res.exec("...")                               # uses the new job_id

This is SLURM's documented signaling mechanism — not a custom daemon. Compatible with HPC policies that ban persistent user-space daemons.

Defaults & overrides

Every JobConfig field has a SCITEX_HPC_* env-var override:

Field         Default     Env override
host          spartan     SCITEX_HPC_HOST
partition     sapphire    SCITEX_HPC_PARTITION
cpus          16          SCITEX_HPC_CPUS
time          00:20:00    SCITEX_HPC_TIME
mem           128G        SCITEX_HPC_MEM
remote_base   ~/proj      SCITEX_HPC_REMOTE_BASE

Resolution priority: explicit JobConfig field → env var → built-in default.
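
For example, an override only applies to fields you did not set explicitly. The values below are illustrative, assuming the environment is read when the config is resolved:

import os
from scitex_hpc import JobConfig

os.environ["SCITEX_HPC_PARTITION"] = "cascade"
os.environ["SCITEX_HPC_MEM"] = "32G"

cfg = JobConfig(project="scitex-dsp", command="pytest -q", mem="64G")
# cfg.partition -> "cascade"   (not set explicitly, env var beats the default "sapphire")
# cfg.mem       -> "64G"       (explicit field beats SCITEX_HPC_MEM)
# cfg.cpus      -> 16          (not set, no env var, built-in default)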

Status

A standalone module of the SciTeX ecosystem. The public API also surfaces as scitex.hpc (via the umbrella package's sys.modules alias), so any consumer can write from scitex.hpc import srun.
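
As a quick sanity check, both import paths should resolve to the same objects (assuming a standard install of both the umbrella package and scitex-hpc):

import scitex_hpc
from scitex.hpc import srun      # resolved through the umbrella package's sys.modules alias

assert srun is scitex_hpc.srun   # same function, different import path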

License

AGPL-3.0-only.



Download files

Download the file for your platform.

Source Distribution

scitex_hpc-0.6.3.tar.gz (42.0 kB)

Built Distribution

scitex_hpc-0.6.3-py3-none-any.whl (35.2 kB)

File details

Details for the file scitex_hpc-0.6.3.tar.gz.

File metadata

  • Download URL: scitex_hpc-0.6.3.tar.gz
  • Size: 42.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scitex_hpc-0.6.3.tar.gz
Algorithm Hash digest
SHA256 e9e4f96ecc9ce66033456a9eedafb2acfe0bfcea006a7ee5dbf50e2d8bcb7c4b
MD5 4b12ab8381371d27b906135326d438b4
BLAKE2b-256 a360f369c3aa32db675638827f2c908dcc1c0ab2843009739b6e9e878e35b683


Provenance

The following attestation bundles were made for scitex_hpc-0.6.3.tar.gz:

Publisher: publish-pypi.yml on ywatanabe1989/scitex-hpc


File details

Details for the file scitex_hpc-0.6.3-py3-none-any.whl.

File metadata

  • Download URL: scitex_hpc-0.6.3-py3-none-any.whl
  • Size: 35.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scitex_hpc-0.6.3-py3-none-any.whl
Algorithm Hash digest
SHA256 1019156a5b7276335f0b0d52279dced07b70dec3a94c23114ba6553f90fd46fb
MD5 e7cd91cede8940ba10b1a443545ef137
BLAKE2b-256 3bb784c896888ac8d85229da6628d01257be28ddc6917b44ffbf1f9a5a6a54e5


Provenance

The following attestation bundles were made for scitex_hpc-0.6.3-py3-none-any.whl:

Publisher: publish-pypi.yml on ywatanabe1989/scitex-hpc

