srunx
A modern Python library for SLURM workload manager integration with workflow orchestration capabilities.
Features
- 🚀 Simple Job Submission: Easy-to-use API for submitting SLURM jobs
- ⚙️ Flexible Configuration: Support for various environments (conda, venv, sqsh)
- 📋 Job Management: Submit, monitor, cancel, and list jobs
- 🧩 Workflow Orchestration: YAML-based workflow definitions with Prefect integration
- 📝 Template System: Customizable Jinja2 templates for SLURM scripts
- 🛡️ Type Safe: Full type hints and mypy compatibility
- 🖥️ CLI Tools: Command-line interfaces for both job management and workflows
Installation
Using uv (Recommended)
uv add srunx
Using pip
pip install srunx
Development Installation
git clone https://github.com/your-username/srunx.git
cd srunx
uv sync --dev
Quick Start
Basic Job Submission
from srunx import Job, JobResource, JobEnvironment, Slurm
# Create a job configuration
job = Job(
    name="my_training_job",
    command=["python", "train.py", "--epochs", "100"],
    resources=JobResource(
        nodes=1,
        gpus_per_node=1,
        memory_per_node="32GB",
        time_limit="4:00:00"
    ),
    environment=JobEnvironment(conda="ml_env")
)
# Submit the job
client = Slurm()
job = client.run(job)
print(f"Submitted job {job.job_id}")
# Monitor job status
job = client.retrieve(job.job_id)
print(f"Job status: {job.status}")
Command Line Usage
Submit a Job
# Basic job submission
srunx submit python train.py --name ml_job
# With resource specifications
srunx submit python train.py \
  --name gpu_job \
  --gpus-per-node 2 \
  --memory 64GB \
  --time 8:00:00
# With environment setup
srunx submit python train.py \
  --conda ml_env \
  --module cuda/11.8 \
  --module gcc/9.3.0
Job Management
# Check job status
srunx status 12345
# List all jobs
srunx list
# Cancel a job
srunx cancel 12345
Workflow Orchestration
Create a workflow YAML file:
# workflow.yaml
name: ml_pipeline
tasks:
  - name: preprocess
    command: ["python", "preprocess.py"]
    nodes: 1
    memory_per_node: "16GB"
  - name: train
    command: ["python", "train.py"]
    depends_on: [preprocess]
    nodes: 1
    gpus_per_node: 2
    memory_per_node: "32GB"
    time_limit: "8:00:00"
    conda: ml_env
  - name: evaluate
    command: ["python", "evaluate.py"]
    depends_on: [train]
    nodes: 1
  - name: notify
    command: ["python", "notify.py"]
    depends_on: [train, evaluate]
    async: true
Execute the workflow:
# Run workflow
srunx flow workflow.yaml
# Validate workflow without execution
srunx flow workflow.yaml --validate-only
# Show execution plan
srunx flow workflow.yaml --dry-run
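The depends_on fields in the workflow above form a dependency graph, and an execution plan is an ordering of that graph in which every task comes after the tasks it depends on. The sketch below illustrates the idea with Python's standard graphlib module. It is not taken from srunx's implementation, and the tasks dictionary is a hand-written mirror of the YAML rather than something loaded by srunx.

```python
from graphlib import TopologicalSorter

# Task name -> names of the tasks it depends on (mirrors depends_on in the YAML).
tasks = {
    "preprocess": [],
    "train": ["preprocess"],
    "evaluate": ["train"],
    "notify": ["train", "evaluate"],
}

# static_order() yields each task only after all of its dependencies.
order = list(TopologicalSorter(tasks).static_order())
print(order)
# ['preprocess', 'train', 'evaluate', 'notify']
```

Because each task here depends on the one before it, only one valid order exists. A cycle in depends_on would make TopologicalSorter raise graphlib.CycleError, which is the kind of problem a validation step can catch before anything is submitted.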
Advanced Usage
Custom Templates
Create a custom SLURM template:
#!/bin/bash
#SBATCH --job-name={{ job_name }}
#SBATCH --nodes={{ nodes }}
{% if gpus_per_node > 0 -%}
#SBATCH --gpus-per-node={{ gpus_per_node }}
{% endif -%}
#SBATCH --time={{ time_limit }}
#SBATCH --output={{ log_dir }}/%x_%j.out
{{ environment_setup }}
srun {{ command }}
Use it with your job:
job = client.run(job, template_path="custom_template.slurm.jinja")
Environment Configuration
Conda Environment
environment = JobEnvironment(
    conda="my_env",
    env_vars={"CUDA_VISIBLE_DEVICES": "0,1"}
)
SquashFS Images
environment = JobEnvironment(
    sqsh="/path/to/container.sqsh",
    env_vars={"OMP_NUM_THREADS": "8"}
)
Programmatic Workflow Execution
from srunx.workflows import WorkflowRunner
runner = WorkflowRunner()
workflow = runner.load_from_yaml("workflow.yaml")
results = runner.execute_workflow(workflow)
print("Job IDs:")
for task_name, job_id in results.items():
    print(f"  {task_name}: {job_id}")
Async Job Submission
# Submit job without waiting
job = client.run(job)
# Later, wait for completion
completed_job = client.monitor(job, poll_interval=30)
print(f"Job completed with status: {completed_job.status}")
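Waiting with a poll interval generally amounts to a loop that checks the job state, sleeps, and repeats until the state is terminal. The following is a generic sketch of that pattern, not srunx's implementation: wait_until_done, get_status, and the TERMINAL_STATES set are hypothetical names chosen for illustration, and the status source is simulated.

```python
import time
from typing import Callable

# Job states treated as terminal in this sketch (an assumption for illustration).
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def wait_until_done(get_status: Callable[[], str], poll_interval: float = 30.0) -> str:
    """Call get_status() repeatedly until it returns a terminal state."""
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_interval)


# Simulated status source standing in for a real scheduler query.
fake_statuses = iter(["PENDING", "RUNNING", "COMPLETED"])
final = wait_until_done(lambda: next(fake_statuses), poll_interval=0.01)
print(final)
# COMPLETED
```

A longer poll interval reduces load on the scheduler at the cost of noticing completion later.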
API Reference
Core Classes
Job
Main job configuration class with resources and environment settings.
JobResource
Resource allocation specification (nodes, GPUs, memory, time).
JobEnvironment
Environment setup (conda, venv, sqsh, environment variables).
Slurm
Main interface for SLURM operations (submit, status, cancel, list).
WorkflowRunner
Workflow execution engine with YAML support.
CLI Commands
Main CLI (srunx)
- submit: Submit SLURM jobs
- status: Check job status
- list: List jobs
- cancel: Cancel jobs
Workflow CLI (srunx flow)
- Execute YAML-defined workflows
- Validate workflow files
- Show execution plans
Configuration
Environment Variables
- SLURM_LOG_DIR: Default directory for SLURM logs (default: logs)
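A default of this kind is commonly resolved with an environment lookup plus a fallback. The sketch below shows that pattern in isolation. resolve_log_dir is a hypothetical helper and the /scratch path is a made-up example, so srunx's own lookup may differ.

```python
import os
from pathlib import Path


def resolve_log_dir(default: str = "logs") -> Path:
    """Return SLURM_LOG_DIR if it is set, otherwise the default directory."""
    return Path(os.environ.get("SLURM_LOG_DIR", default))


# Unset: falls back to the default directory.
os.environ.pop("SLURM_LOG_DIR", None)
print(resolve_log_dir())

# Set: the environment variable takes precedence.
os.environ["SLURM_LOG_DIR"] = "/scratch/slurm_logs"
print(resolve_log_dir())
```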
Template Locations
srunx includes built-in templates:
- base.slurm.jinja: Basic job template
- advanced.slurm.jinja: Full-featured template with all options
Development
Setup Development Environment
git clone https://github.com/your-username/srunx.git
cd srunx
uv sync --dev
Run Tests
uv run pytest
Type Checking
uv run mypy .
Code Formatting
uv run ruff check .
uv run ruff format .
Examples
Machine Learning Pipeline
# Complete ML pipeline example
from srunx import Job, JobResource, JobEnvironment, Slurm
def create_ml_job(script: str, **kwargs) -> Job:
    return Job(
        name=f"ml_{script.replace('.py', '')}",
        command=["python", script] + [f"--{k}={v}" for k, v in kwargs.items()],
        resources=JobResource(
            nodes=1,
            gpus_per_node=1,
            memory_per_node="32GB",
            time_limit="4:00:00"
        ),
        environment=JobEnvironment(conda="pytorch")
    )
client = Slurm()
# Submit preprocessing job
prep_job = create_ml_job("preprocess.py", data_path="/data", output_path="/processed")
prep_job = client.run(prep_job)
# Wait for preprocessing to complete
client.monitor(prep_job)
# Submit training job
train_job = create_ml_job("train.py", data_path="/processed", model_path="/models")
train_job = client.run(train_job)
print(f"Training job {train_job.job_id} submitted")
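The list comprehension in create_ml_job turns each keyword argument into a --key=value flag. Isolated from srunx, the same expression behaves as shown below; build_command is a hypothetical helper name used only for this illustration.

```python
def build_command(script: str, **kwargs: object) -> list[str]:
    """Build an argv list, turning each keyword argument into --key=value."""
    return ["python", script] + [f"--{k}={v}" for k, v in kwargs.items()]


cmd = build_command("train.py", data_path="/processed", model_path="/models")
print(cmd)
# ['python', 'train.py', '--data_path=/processed', '--model_path=/models']
```

Note that the keyword names are passed through unchanged, so the target script has to accept --data_path rather than --data-path.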
Distributed Computing
# Multi-node distributed job
distributed_job = Job(
    name="distributed_training",
    command=[
        "mpirun", "-np", "16",
        "python", "distributed_train.py"
    ],
    resources=JobResource(
        nodes=4,
        ntasks_per_node=4,
        cpus_per_task=8,
        gpus_per_node=2,
        memory_per_node="128GB",
        time_limit="12:00:00"
    ),
    environment=JobEnvironment(
        conda="distributed_ml"
    )
)
job = client.run(distributed_job)
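The -np value passed to mpirun has to agree with the resources requested: 4 nodes with 4 tasks each gives 16 ranks. A small sketch of that arithmetic, using plain variables rather than srunx objects, makes the relationship explicit.

```python
nodes = 4
ntasks_per_node = 4
cpus_per_task = 8

# Total MPI ranks across the whole job; this is the value for mpirun -np.
total_tasks = nodes * ntasks_per_node

# CPU cores each node must provide to satisfy the per-task request.
cpus_per_node = ntasks_per_node * cpus_per_task

print(total_tasks, cpus_per_node)
# 16 32
```

If nodes or ntasks_per_node changes, the -np argument in the command list needs to change with it.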
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Workflow
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Run type checking and tests
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
See CHANGELOG.md for release history.
Support
- 📖 Documentation: docs.example.com/srunx
- 🐞 Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
Acknowledgments