
vLLM MareNostrum Launcher

Super simple vLLM server launcher for SLURM/HPC clusters with nested YAML config support.

Why This Exists

Deploying vLLM on HPC clusters should be one line. This launcher gives you:

  • One config file with clean vLLM/launcher separation
  • One command to launch anywhere
  • Zero complexity - transparent behavior
  • SLURM ready - handles GPU affinity automatically

Quick Start

1. Install

git clone <this-repo>
cd vllm-marenostrum
pip install -e .

2. Create Config

# config/my_model.yaml
vllm:
  model: "/path/to/your/model"
  port: 8000
  dtype: auto

launcher:
  cuda_devices: "0,1,2,3"
  gpu_memory_utilization: 0.9

3. Launch

# Local launch
vllm-marenostrum config/my_model.yaml

# Wait for service to be ready (blocks until ready)
vllm-marenostrum config/my_model.yaml --wait-for-ready

# Launch in background (starts vLLM, waits for ready, then exits)
vllm-marenostrum config/my_model.yaml --background

# SLURM batch job (uses --background for pipelines)
sbatch run_scripts/run_single.sh config/my_model.yaml

# With overrides
vllm-marenostrum config/my_model.yaml --cuda-devices 0,1 --port 8001

That's it! 🎉

Config Structure

The nested config cleanly separates concerns:

vllm:
  # Pure vLLM parameters - passed directly to vllm serve
  model: "/path/to/model"
  port: 8000
  dtype: auto
  max_model_len: 4096
  max_num_seqs: 256

launcher:
  # Deployment parameters - handled by launcher
  cuda_devices: "0,1,2,3"           # Sets CUDA_VISIBLE_DEVICES (or "cpu" for CPU)
  gpu_memory_utilization: 0.9       # Passed to vLLM
  # CPU-only parameters (when cuda_devices: "cpu"):
  cpu_kvcache_space: 64             # Memory (GB) for key-value cache on CPU
  cpu_omp_threads_bind: "0-15"      # CPU cores to bind OpenMP threads to

Why nested? vLLM's --config flag only accepts pure vLLM parameters. The launcher creates a clean temporary config file with only the vllm: section.
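Concretely, given the config above, the launcher's effective behavior is roughly the following sketch (the temp file path is illustrative; the tensor-parallel size comes from the device count):

# Rough sketch of what the launcher ends up running (illustrative paths)
export CUDA_VISIBLE_DEVICES=0,1,2,3
# ...write only the vllm: section to a temporary YAML file, then:
vllm serve --config /tmp/clean_config.yaml --gpu-memory-utilization 0.9 --tensor-parallel-size 4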

Examples

Single GPU Model

# config/small_model.yaml
vllm:
  model: "/models/llama-8b"
  port: 8000

launcher:
  cuda_devices: "0"

Multi-GPU with Custom Settings

# config/large_model.yaml
vllm:
  model: "/models/mistral-72b"
  port: 8001
  dtype: bfloat16
  max_model_len: 8192

launcher:
  cuda_devices: "0,1,2,3"
  gpu_memory_utilization: 0.85

CPU Embeddings (with CPU optimization)

# config/embeddings.yaml
vllm:
  model: "/models/jina-embeddings"
  task: embed
  port: 8002

launcher:
  cuda_devices: "cpu"              # Run on CPU to free up GPUs
  cpu_kvcache_space: 64            # Memory (GB) for key-value cache on CPU
  cpu_omp_threads_bind: "0-15"     # CPU cores to bind OpenMP threads to

SLURM Usage

Interactive Session

salloc -A $USER -t 01:00:00 -q $QUEUE -n 1 -c 80 --gres=gpu:4
vllm-marenostrum config/my_model.yaml

Batch Job (Simple)

sbatch -A $USER -t 01:00:00 -q $QUEUE run_scripts/run_single.sh config/my_model.yaml

Batch Job (With Pipeline)

Use the example pipeline script that starts vLLM then runs your app:

sbatch -A $USER -t 04:00:00 -q $QUEUE run_scripts/example_pipeline.sh config/my_model.yaml

Or create your own pipeline script:

#!/bin/bash
#SBATCH --job-name=my_pipeline
#SBATCH --gres=gpu:4

# Load environment...
vllm-marenostrum config/my_model.yaml --background

# Now run your application
python my_app.py
python another_script.py

Environment Setup (MareNostrum)

module purge && module load mkl intel python/3.12
unset PYTHONPATH
python -m venv venv_mn5
source venv_mn5/bin/activate
pip install -r requirements.txt

Helper Scripts

The repository includes useful helper scripts for common tasks:

Model Download (helpers_scripts/hf_dl.sh)

Downloads Hugging Face models efficiently:

# Download a model (saves to ./huggingface_models/)
bash helpers_scripts/hf_dl.sh mistralai/Mistral-Small-24B-Instruct-2501
bash helpers_scripts/hf_dl.sh meta-llama/Llama-3.1-8B-Instruct

# Models are saved to: ./huggingface_models/{model-name}/

Setup Hugging Face authentication first:

# Option 1: Environment variable
export HUGGINGFACE_HUB_TOKEN=your_token

# Option 2: CLI login
huggingface-cli login
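Once a model is downloaded, point your config's model at the local directory; a sketch (file and directory names below are illustrative):

# config/mistral_small.yaml (illustrative)
vllm:
  model: "./huggingface_models/Mistral-Small-24B-Instruct-2501"
  port: 8000

launcher:
  cuda_devices: "0,1"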

SSH Tunnel (helpers_scripts/bsc_ssh_tunnel.sh)

Create SSH tunnels to access vLLM servers remotely:

# Forward single port
bash helpers_scripts/bsc_ssh_tunnel.sh mn5-acc-4 as05r1b08 8000

# Forward multiple ports
bash helpers_scripts/bsc_ssh_tunnel.sh mn5-acc-4 as05r1b08 8000,8001,8002

# Forward port range
bash helpers_scripts/bsc_ssh_tunnel.sh mn5-acc-4 as05r1b08 8000-8005

Then test locally:

# Test vLLM health
curl http://localhost:8000/health

# Test OpenAI API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/model",
    "prompt": "Barcelona is a",
    "max_tokens": 7
  }'

How It Works

  1. Load config → Split into vllm and launcher sections
  2. Set devices → CUDA_VISIBLE_DEVICES for GPU, or --device cpu for CPU
  3. CPU optimization → Set VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND env vars
  4. Auto tensor-parallel → Calculate from number of CUDA devices (GPU only; see the sketch below)
  5. Create clean config → Temporary file with only vllm section
  6. Launch vLLM → vllm serve --config clean_config.yaml
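For step 4, the tensor-parallel size is just the number of listed devices; a shell equivalent of the calculation (illustrative):

# Counting devices in cuda_devices gives the tensor-parallel size
CUDA_DEVICES="0,1,2,3"
TP_SIZE=$(echo "$CUDA_DEVICES" | tr ',' '\n' | wc -l)
echo "$TP_SIZE"  # -> 4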

That's it. No magic, no complexity.

CLI Overrides

Common parameters can be overridden:

vllm-marenostrum config.yaml --cuda-devices 0,1 --port 8001 --tensor-parallel-size 2

Health Check & Background Options

# Wait for service to be ready (blocks until ready)
vllm-marenostrum config.yaml --wait-for-ready

# Launch in background (perfect for pipelines)
vllm-marenostrum config.yaml --background

# Custom timeout (default: 300 seconds)
vllm-marenostrum config.yaml --background --timeout 600
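The readiness check boils down to polling vLLM's /health endpoint; a hand-rolled bash equivalent (illustrative, assuming port 8000 and the 300-second default timeout):

# Hand-rolled equivalent of --wait-for-ready (port 8000 assumed)
elapsed=0
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 5
  elapsed=$((elapsed + 5))
  [ "$elapsed" -ge 300 ] && { echo "timed out waiting for vLLM" >&2; exit 1; }
done
echo "vLLM is ready"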

Pipeline Usage

The --background flag is perfect for running applications once vLLM is ready:

# Start vLLM, wait for ready, then continue with your app
vllm-marenostrum config.yaml --background
python your_pipeline.py  # This runs after vLLM is ready!

Any unknown arguments are passed directly to vLLM.
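For example, a flag the launcher doesn't define, such as vLLM's --max-num-seqs, is forwarded to vllm serve unchanged:

# Not a launcher flag, so it is passed straight through to vLLM
vllm-marenostrum config.yaml --max-num-seqs 128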

Migrating from Other Projects

Just convert your configs to the nested structure:

Before:

model: "/path/to/model"
port: 8000
cuda_devices: "0,1,2,3"  # ❌ vLLM doesn't understand this

After:

vllm:
  model: "/path/to/model"
  port: 8000

launcher:
  cuda_devices: "0,1,2,3"  # ✅ Handled by launcher

License

MIT
