vLLM MareNostrum Launcher
Super simple vLLM server launcher for SLURM/HPC clusters with nested YAML config support.
Why This Exists
Deploying vLLM on HPC clusters should be one line. This launcher:
- ✅ One config file with clean vLLM/launcher separation
- ✅ One command to launch anywhere
- ✅ Zero complexity - transparent behavior
- ✅ SLURM ready - handles GPU affinity automatically
Quick Start
1. Install
git clone <this-repo>
cd vllm-marenostrum
pip install -e .
2. Create Config
# config/my_model.yaml
vllm:
  model: "/path/to/your/model"
  port: 8000
  dtype: auto

launcher:
  cuda_devices: "0,1,2,3"
  gpu_memory_utilization: 0.9
3. Launch
# Local launch
vllm-marenostrum config/my_model.yaml
# Wait for service to be ready (blocks until ready)
vllm-marenostrum config/my_model.yaml --wait-for-ready
# Launch in background (starts vLLM, waits for ready, then exits)
vllm-marenostrum config/my_model.yaml --background
# SLURM batch job (uses --background for pipelines)
sbatch run_scripts/run_single.sh config/my_model.yaml
# With overrides
vllm-marenostrum config/my_model.yaml --cuda-devices 0,1 --port 8001
That's it! 🎉
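Once the server reports ready, a quick sanity check against the health endpoint confirms it is serving (assuming the default port 8000 from the config above):

# Returns HTTP 200 once the model is loaded and serving
curl http://localhost:8000/health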
Config Structure
The nested config cleanly separates concerns:
vllm:
  # Pure vLLM parameters - passed directly to vllm serve
  model: "/path/to/model"
  port: 8000
  dtype: auto
  max_model_len: 4096
  max_num_seqs: 256

launcher:
  # Deployment parameters - handled by launcher
  cuda_devices: "0,1,2,3"          # Sets CUDA_VISIBLE_DEVICES (or "cpu" for CPU)
  gpu_memory_utilization: 0.9      # Passed to vLLM

  # CPU-only parameters (when cuda_devices: "cpu"):
  cpu_kvcache_space: 64            # Memory (GB) for key-value cache on CPU
  cpu_omp_threads_bind: "0-15"     # CPU cores to bind OpenMP threads to
Why nested? vLLM's --config flag only accepts pure vLLM parameters. The launcher creates a clean temporary config file with only the vllm: section.
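For the config above, the temporary file handed to vllm serve would contain roughly the following (an illustrative sketch; the file name is generated by the launcher, and launcher-level values such as gpu_memory_utilization are forwarded to vLLM separately):

# clean temporary config - only the vllm: keys survive
model: "/path/to/model"
port: 8000
dtype: auto
max_model_len: 4096
max_num_seqs: 256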
Examples
Single GPU Model
# config/small_model.yaml
vllm:
  model: "/models/llama-8b"
  port: 8000

launcher:
  cuda_devices: "0"
Multi-GPU with Custom Settings
# config/large_model.yaml
vllm:
  model: "/models/mistral-72b"
  port: 8001
  dtype: bfloat16
  max_model_len: 8192

launcher:
  cuda_devices: "0,1,2,3"
  gpu_memory_utilization: 0.85
CPU Embeddings (with CPU optimization)
# config/embeddings.yaml
vllm:
  model: "/models/jina-embeddings"
  task: embed
  port: 8002

launcher:
  cuda_devices: "cpu"              # Run on CPU to free up GPUs
  cpu_kvcache_space: 64            # Memory (GB) for key-value cache on CPU
  cpu_omp_threads_bind: "0-15"     # CPU cores to bind OpenMP threads to
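Once the embeddings server is up, it can be queried through the OpenAI-compatible embeddings endpoint (a hedged example; port 8002 and the model path come from the config above, the input text is arbitrary):

curl http://localhost:8002/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/jina-embeddings",
    "input": "Barcelona is a city in Spain"
  }'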
SLURM Usage
Interactive Session
salloc -A $USER -t 01:00:00 -q $QUEUE -n 1 -c 80 --gres=gpu:4
vllm-marenostrum config/my_model.yaml
Batch Job (Simple)
sbatch -A $USER -t 01:00:00 -q $QUEUE run_scripts/run_single.sh config/my_model.yaml
Batch Job (With Pipeline)
Use the example pipeline script that starts vLLM then runs your app:
sbatch -A $USER -t 04:00:00 -q $QUEUE run_scripts/example_pipeline.sh config/my_model.yaml
Or create your own pipeline script:
#!/bin/bash
#SBATCH --job-name=my_pipeline
#SBATCH --gres=gpu:4
# Load environment...
vllm-marenostrum config/my_model.yaml --background
# Now run your application
python my_app.py
python another_script.py
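Because --background only returns once the health check passes, the pipeline steps can talk to the server immediately on the same node. For example, a quick check before the real work (port 8000 is an assumption carried over from the earlier examples):

# List the models the server exposes, then continue with the pipeline
curl -s http://localhost:8000/v1/models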
Environment Setup (MareNostrum)
module purge && module load mkl intel python/3.12
unset PYTHONPATH
python -m venv venv_mn5
source venv_mn5/bin/activate
pip install -r requirements.txt
Helper Scripts
The repository includes useful helper scripts for common tasks:
Model Download (helpers_scripts/hf_dl.sh)
Downloads Hugging Face models efficiently:
# Download a model (saves to ./huggingface_models/)
bash helpers_scripts/hf_dl.sh mistralai/Mistral-Small-24B-Instruct-2501
bash helpers_scripts/hf_dl.sh meta-llama/Llama-3.1-8B-Instruct
# Models are saved to: ./huggingface_models/{model-name}/
Setup Hugging Face authentication first:
# Option 1: Environment variable
export HUGGINGFACE_HUB_TOKEN=your_token
# Option 2: CLI login
huggingface-cli login
SSH Tunnel (helpers_scripts/bsc_ssh_tunnel.sh)
Create SSH tunnels to access vLLM servers remotely:
# Forward single port
bash helpers_scripts/bsc_ssh_tunnel.sh mn5-acc-4 as05r1b08 8000
# Forward multiple ports
bash helpers_scripts/bsc_ssh_tunnel.sh mn5-acc-4 as05r1b08 8000,8001,8002
# Forward port range
bash helpers_scripts/bsc_ssh_tunnel.sh mn5-acc-4 as05r1b08 8000-8005
Then test locally:
# Test vLLM health
curl http://localhost:8000/health
# Test OpenAI API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/model",
    "prompt": "Barcelona is a",
    "max_tokens": 7
  }'
How It Works
- Load config → Split into vllm and launcher sections
- Set devices → CUDA_VISIBLE_DEVICES for GPU, or --device cpu for CPU
- CPU optimization → Set VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND env vars
- Auto tensor-parallel → Calculate from the number of CUDA devices (GPU only)
- Create clean config → Temporary file with only the vllm section
- Launch vLLM → vllm serve --config clean_config.yaml
That's it. No magic, no complexity.
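In shell terms, the launcher's job boils down to something like this (an illustrative sketch of the steps above, not the actual implementation; the device list and temp file path are just examples):

export CUDA_VISIBLE_DEVICES=0,1,2,3   # from launcher.cuda_devices
# write a temporary YAML containing only the vllm: section, e.g. /tmp/clean_config.yaml
vllm serve --config /tmp/clean_config.yaml --tensor-parallel-size 4   # 4 = number of devices above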
CLI Overrides
Common parameters can be overridden:
vllm-marenostrum config.yaml --cuda-devices 0,1 --port 8001 --tensor-parallel-size 2
Health Check & Background Options
# Wait for service to be ready (blocks until ready)
vllm-marenostrum config.yaml --wait-for-ready
# Launch in background (perfect for pipelines)
vllm-marenostrum config.yaml --background
# Custom timeout (default: 300 seconds)
vllm-marenostrum config.yaml --background --timeout 600
Pipeline Usage
The --background flag is perfect for running applications once vLLM is ready:
# Start vLLM, wait for ready, then continue with your app
vllm-marenostrum config.yaml --background
python your_pipeline.py # This runs after vLLM is ready!
Any unknown arguments are passed directly to vLLM.
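For example, a standard vllm serve option the launcher does not handle itself is simply forwarded (the flag below is just an illustration):

vllm-marenostrum config.yaml --max-num-seqs 128   # forwarded untouched to vllm serve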
Migrating from Other Projects
Just reorganize your existing configs into the nested structure:
Before:
model: "/path/to/model"
port: 8000
cuda_devices: "0,1,2,3" # ❌ vLLM doesn't understand this
After:
vllm:
  model: "/path/to/model"
  port: 8000

launcher:
  cuda_devices: "0,1,2,3"   # ✅ Handled by launcher
License
MIT