
vLLM MareNostrum Launcher

Super simple vLLM server launcher for SLURM/HPC clusters with nested YAML config support.

Why This Exists

Deploying vLLM on HPC clusters should be one line. This launcher gives you:

  • One config file with clean vLLM/launcher separation
  • One command to launch anywhere
  • Zero complexity - transparent behavior
  • SLURM ready - handles GPU affinity automatically

Quick Start

1. Install

git clone <this-repo>
cd vllm-marenostrum
pip install -e .

2. Create Config

# config/my_model.yaml
vllm:
  model: "/path/to/your/model"
  port: 8000
  dtype: auto

launcher:
  cuda_devices: "0,1,2,3"
  gpu_memory_utilization: 0.9

3. Launch

# Local launch
vllm-marenostrum config/my_model.yaml

# Wait for service to be ready (blocks until ready)
vllm-marenostrum config/my_model.yaml --wait-for-ready

# Launch in background (starts vLLM, waits for ready, then exits)
vllm-marenostrum config/my_model.yaml --background

# SLURM batch job (uses --background for pipelines)
sbatch run_scripts/run_single.sh config/my_model.yaml

# With overrides
vllm-marenostrum config/my_model.yaml --cuda-devices 0,1 --port 8001

That's it! 🎉

Config Structure

The nested config cleanly separates concerns:

vllm:
  # Pure vLLM parameters - passed directly to vllm serve
  model: "/path/to/model"
  port: 8000
  dtype: auto
  max_model_len: 4096
  max_num_seqs: 256

launcher:
  # Deployment parameters - handled by launcher
  cuda_devices: "0,1,2,3"           # Sets CUDA_VISIBLE_DEVICES (or "cpu" for CPU)
  gpu_memory_utilization: 0.9       # Passed to vLLM
  # CPU-only parameters (when cuda_devices: "cpu"):
  cpu_kvcache_space: 64             # Memory (GB) for key-value cache on CPU
  cpu_omp_threads_bind: "0-15"      # CPU cores to bind OpenMP threads to

Why nested? vLLM's --config flag only accepts pure vLLM parameters. The launcher creates a clean temporary config file with only the vllm: section.
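Concretely, given the config above, the launcher's effective behavior is roughly the following sketch (the temp file path is illustrative; the tensor-parallel size comes from the device count):

# Rough sketch of what the launcher ends up running (illustrative paths)
export CUDA_VISIBLE_DEVICES=0,1,2,3
# ...write only the vllm: section to a temporary YAML file, then:
vllm serve --config /tmp/clean_config.yaml --gpu-memory-utilization 0.9 --tensor-parallel-size 4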

Examples

Single GPU Model

# config/small_model.yaml
vllm:
  model: "/models/llama-8b"
  port: 8000

launcher:
  cuda_devices: "0"

Multi-GPU with Custom Settings

# config/large_model.yaml
vllm:
  model: "/models/mistral-72b"
  port: 8001
  dtype: bfloat16
  max_model_len: 8192

launcher:
  cuda_devices: "0,1,2,3"
  gpu_memory_utilization: 0.85

CPU Embeddings (with CPU optimization)

# config/embeddings.yaml
vllm:
  model: "/models/jina-embeddings"
  task: embed
  port: 8002

launcher:
  cuda_devices: "cpu"              # Run on CPU to free up GPUs
  cpu_kvcache_space: 64            # Memory (GB) for key-value cache on CPU
  cpu_omp_threads_bind: "0-15"     # CPU cores to bind OpenMP threads to

SLURM Usage

Interactive Session

salloc -A $USER -t 01:00:00 -q $QUEUE -n 1 -c 80 --gres=gpu:4
vllm-marenostrum config/my_model.yaml

Batch Job (Simple)

sbatch -A $USER -t 01:00:00 -q $QUEUE run_scripts/run_single.sh config/my_model.yaml

Batch Job (With Pipeline)

Use the example pipeline script that starts vLLM then runs your app:

sbatch -A $USER -t 04:00:00 -q $QUEUE run_scripts/example_pipeline.sh config/my_model.yaml

Or create your own pipeline script:

#!/bin/bash
#SBATCH --job-name=my_pipeline
#SBATCH --gres=gpu:4

# Load environment...
vllm-marenostrum config/my_model.yaml --background

# Now run your application
python my_app.py
python another_script.py

Environment Setup (MareNostrum)

module purge && module load mkl intel python/3.12
unset PYTHONPATH
python -m venv venv_mn5
source venv_mn5/bin/activate
pip install -r requirements.txt

Helper Scripts

The repository includes useful helper scripts for common tasks:

Model Download (helpers_scripts/hf_dl.sh)

Downloads Hugging Face models efficiently:

# Download a model (saves to ./huggingface_models/)
bash helpers_scripts/hf_dl.sh mistralai/Mistral-Small-24B-Instruct-2501
bash helpers_scripts/hf_dl.sh meta-llama/Llama-3.1-8B-Instruct

# Models are saved to: ./huggingface_models/{model-name}/

Setup Hugging Face authentication first:

# Option 1: Environment variable
export HUGGINGFACE_HUB_TOKEN=your_token

# Option 2: CLI login
huggingface-cli login
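Once a model is downloaded, point your config's model at the local directory; a sketch (file and directory names below are illustrative):

# config/mistral_small.yaml (illustrative)
vllm:
  model: "./huggingface_models/Mistral-Small-24B-Instruct-2501"
  port: 8000

launcher:
  cuda_devices: "0,1"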

SSH Tunnel (helpers_scripts/bsc_ssh_tunnel.sh)

Create SSH tunnels to access vLLM servers remotely:

# Forward single port
bash helpers_scripts/bsc_ssh_tunnel.sh mn5-acc-4 as05r1b08 8000

# Forward multiple ports
bash helpers_scripts/bsc_ssh_tunnel.sh mn5-acc-4 as05r1b08 8000,8001,8002

# Forward port range
bash helpers_scripts/bsc_ssh_tunnel.sh mn5-acc-4 as05r1b08 8000-8005

Then test locally:

# Test vLLM health
curl http://localhost:8000/health

# Test OpenAI API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/model",
    "prompt": "Barcelona is a",
    "max_tokens": 7
  }'

How It Works

  1. Load config → Split into vllm and launcher sections
  2. Set devices → CUDA_VISIBLE_DEVICES for GPU, or --device cpu for CPU
  3. CPU optimization → Set VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND env vars
  4. Auto tensor-parallel → Calculate from number of CUDA devices (GPU only; see the sketch below)
  5. Create clean config → Temporary file with only vllm section
  6. Launch vLLM → vllm serve --config clean_config.yaml
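For step 4, the tensor-parallel size is just the number of listed devices; a shell equivalent of the calculation (illustrative):

# Counting devices in cuda_devices gives the tensor-parallel size
CUDA_DEVICES="0,1,2,3"
TP_SIZE=$(echo "$CUDA_DEVICES" | tr ',' '\n' | wc -l)
echo "$TP_SIZE"  # -> 4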

That's it. No magic, no complexity.

CLI Overrides

Common parameters can be overridden:

vllm-marenostrum config.yaml --cuda-devices 0,1 --port 8001 --tensor-parallel-size 2

Health Check & Background Options

# Wait for service to be ready (blocks until ready)
vllm-marenostrum config.yaml --wait-for-ready

# Launch in background (perfect for pipelines)
vllm-marenostrum config.yaml --background

# Custom timeout (default: 300 seconds)
vllm-marenostrum config.yaml --background --timeout 600
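The readiness check boils down to polling vLLM's /health endpoint; a hand-rolled bash equivalent (illustrative, assuming port 8000 and the 300-second default timeout):

# Hand-rolled equivalent of --wait-for-ready (port 8000 assumed)
elapsed=0
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 5
  elapsed=$((elapsed + 5))
  [ "$elapsed" -ge 300 ] && { echo "timed out waiting for vLLM" >&2; exit 1; }
done
echo "vLLM is ready"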

Pipeline Usage

The --background flag is perfect for running applications once vLLM is ready:

# Start vLLM, wait for ready, then continue with your app
vllm-marenostrum config.yaml --background
python your_pipeline.py  # This runs after vLLM is ready!

Any unknown arguments are passed directly to vLLM.
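For example, a flag the launcher doesn't define, such as vLLM's --max-num-seqs, is forwarded to vllm serve unchanged:

# Not a launcher flag, so it is passed straight through to vLLM
vllm-marenostrum config.yaml --max-num-seqs 128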

Migrating from Other Projects

Just convert your configs to the nested structure:

Before:

model: "/path/to/model"
port: 8000
cuda_devices: "0,1,2,3"  # ❌ vLLM doesn't understand this

After:

vllm:
  model: "/path/to/model"
  port: 8000

launcher:
  cuda_devices: "0,1,2,3"  # ✅ Handled by launcher

License

MIT
