vLLM Server Manager
A comprehensive Python package for managing vLLM API server lifecycle with proper logging, process management, and multi-server support.
Features
- Two Startup Modes: Parameterized configuration or full command execution
- Comprehensive Logging: Automatic log files with model name, date, and server identifier
- Graceful Shutdown: Proper process termination with timeout and force kill fallback
- Health Monitoring: Built-in health check and readiness detection
- Multi-Server Management: Cluster support for running multiple instances
- Auto-Restart: Automatic server restart on failure (optional)
- CUDA Management: GPU device selection and resource management
- Context Manager Support: Clean resource management with the `with` statement
Installation
```bash
# Ensure vLLM is installed
pip install vllm
```

```python
# Import the manager
from vllm_manager import VLLMManager
```
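The manager itself is distributed as the `vllm_manager` package shown in the file listing below, so it can presumably be installed with `pip install vllm-manager` before importing.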
Quick Start
Basic Usage
```python
from vllm_manager import VLLMManager

# Create manager and start server
manager = VLLMManager(model="facebook/opt-125m")
manager.start(host="0.0.0.0", port=8000)

# Check status
print(manager.get_status())
# {'running': True, 'pid': 12345, 'model': 'facebook/opt-125m', ...}

# Wait for server to be ready
manager.wait_for_ready(timeout=60)

# Stop when done
manager.stop()
```
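Once ready, the server speaks vLLM's OpenAI-compatible HTTP API, so it can be smoke-tested with `requests`. The route and payload below follow vLLM's standard `/v1/completions` endpoint, not anything specific to this package:

```python
import requests

# Send a completion request to the server started above
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 16},
    timeout=30,
)
print(resp.json()["choices"][0]["text"])
```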
Parameterized Startup
manager = VLLMManager(model="meta-llama/Llama-2-7b-hf")
manager.start(
host="0.0.0.0",
port=8000,
tensor_parallel_size=2, # Multi-GPU
gpu_memory_utilization=0.9,
max_model_len=4096,
dtype="float16",
quantization="awq", # Optional: AWQ quantization
api_key="your-api-key", # Optional: API authentication
)
Full Command Startup
```python
manager = VLLMManager()

# Start with complete command
manager.start_command(
    "vllm serve facebook/opt-125m --port 8000 --tensor-parallel-size 2"
)

# Or as a list
manager.start_command([
    "vllm", "serve", "facebook/opt-125m",
    "--port", "8000",
    "--host", "0.0.0.0",
])
```
Context Manager
```python
from vllm_manager import VLLMManager

with VLLMManager(model="facebook/opt-125m") as manager:
    manager.start(port=8000)
    # Server runs here
# Automatically stopped when exiting the context
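```

The contract here is the usual Python one: cleanup runs even if the body raises. A freestanding sketch of that guarantee, illustrative rather than this package's source:

```python
class ManagedServer:
    """Toy stand-in showing the cleanup guarantee of a context manager."""

    def start(self):
        print("server started")

    def stop(self):
        print("server stopped")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()    # runs even if the with-body raised
        return False   # do not swallow exceptions
```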
Convenience Function
```python
from vllm_manager import serve

# Quick one-liner
manager = serve("facebook/opt-125m", port=8001)
# ... use the server ...
manager.stop()
```
Advanced Features
Multi-Server Cluster
```python
from vllm_manager import VLLMCluster

# Create cluster
cluster = VLLMCluster()

# Add multiple servers
cluster.add_server("small", model="facebook/opt-125m", port=8001)
cluster.add_server("medium", model="facebook/opt-350m", port=8002, auto_restart=True)

# Start all
results = cluster.start_all()

# Health check
health = cluster.health_check()

# Get status
status = cluster.get_status()

# Stop all
cluster.stop_all()
```
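Individual servers can also be cycled without touching the rest of the cluster, using the per-server methods listed in the API reference below:

```python
# Restart one server while the others keep serving
cluster.stop_server("medium")
cluster.start_server("medium")
```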
Configuration Persistence
```python
from vllm_manager import VLLMCluster

# Save configuration
cluster = VLLMCluster()
cluster.add_server("model1", model="facebook/opt-125m", port=8001)
cluster.save_config("config.json")

# Load configuration
cluster = VLLMCluster.load_config("config.json")
```
GPU Selection
```python
from vllm_manager import VLLMManager

# Select specific GPUs
manager = VLLMManager(model="facebook/opt-125m")
manager.start(
    port=8000,
    cuda_devices=[0, 1],     # Use GPUs 0 and 1
    tensor_parallel_size=2,
)
```
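GPU selection of this kind is conventionally implemented by setting `CUDA_VISIBLE_DEVICES` in the child process's environment before launching `vllm serve`. A minimal sketch of that pattern, as an illustration rather than this package's actual code:

```python
import os
import subprocess

devices = [0, 1]
env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)  # expose only GPUs 0 and 1

# The child process sees only the restricted GPU set
proc = subprocess.Popen(
    ["vllm", "serve", "facebook/opt-125m",
     "--port", "8000", "--tensor-parallel-size", "2"],
    env=env,
)
```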
Health Monitoring
```python
from vllm_manager import VLLMManager, health_monitor
import threading

manager = VLLMManager(model="facebook/opt-125m")
manager.start(port=8000)

# Start health monitoring in a background thread
monitor_thread = threading.Thread(
    target=health_monitor,
    args=(manager,),
    kwargs={"interval": 30, "max_failures": 3},
    daemon=True,
)
monitor_thread.start()
```
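The internals of `health_monitor` are not documented here; a hand-rolled equivalent would poll the server's `/health` endpoint (which vLLM's OpenAI-compatible server exposes) and count consecutive failures. A sketch, with the recovery action left to the caller:

```python
import time
import requests

def poll_health(url="http://localhost:8000/health", interval=30, max_failures=3):
    """Return once the server fails `max_failures` consecutive health checks."""
    failures = 0
    while failures < max_failures:
        try:
            ok = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        failures = 0 if ok else failures + 1  # reset on any healthy response
        time.sleep(interval)
    # Caller decides what to do next, e.g. manager.stop() then manager.start()
```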
API Reference
VLLMManager
Initialization
```python
VLLMManager(
    model: Optional[str] = None,       # Model name/path
    log_dir: Optional[Path] = None,    # Log directory
    server_id: Optional[str] = None,   # Custom server ID
)
```
Methods
- `start(**kwargs)` - Start server with parameters
- `start_command(cmd)` - Start server with full command
- `stop(timeout=30, force=False)` - Stop server gracefully
- `is_running()` - Check if server is running
- `get_pid()` - Get server process ID
- `get_uptime()` - Get server uptime string
- `wait_for_ready(timeout=60)` - Wait for server readiness
- `get_log_file()` - Get log file path
- `get_status()` - Get comprehensive status dict
Start Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | str | - | Model name or path |
| `host` | str | "0.0.0.0" | Bind address |
| `port` | int | 8000 | Port number |
| `tensor_parallel_size` | int | 1 | GPU count for parallelism |
| `dtype` | str | "auto" | Data type (auto, float16, etc.) |
| `max_model_len` | int | None | Maximum context length |
| `gpu_memory_utilization` | float | 0.9 | GPU memory ratio |
| `swap_space` | int | 4 | CPU swap space (GB) |
| `quantization` | str | None | Quantization method (awq, gptq, etc.) |
| `api_key` | str | None | API authentication key |
| `cuda_devices` | List[int] | None | GPU device IDs |
| `additional_args` | List[str] | [] | Extra command-line arguments |
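These parameters presumably translate one-for-one into `vllm serve` flags (e.g. `tensor_parallel_size=2` becomes `--tensor-parallel-size 2`). A rough illustration of that kind of mapping, not the package's actual code:

```python
def build_cli_args(**params):
    """Turn snake_case keyword arguments into vllm-serve-style CLI flags.

    Rough illustration only: real code would special-case booleans and lists.
    """
    args = []
    for key, value in params.items():
        if value is None:
            continue  # unset options are omitted entirely
        args += ["--" + key.replace("_", "-"), str(value)]
    return args

# build_cli_args(port=8000, tensor_parallel_size=2)
# -> ['--port', '8000', '--tensor-parallel-size', '2']
```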
VLLMCluster
Methods
- `add_server(name, model, port, **kwargs)` - Add server instance
- `start_server(name)` - Start specific server
- `stop_server(name)` - Stop specific server
- `start_all()` - Start all servers
- `stop_all()` - Stop all servers
- `health_check(timeout=5)` - Check all servers' health
- `get_status()` - Get all servers' status
- `save_config(path)` - Save configuration to JSON
- `load_config(path)` - Load configuration from JSON
Log Files
Logs are stored in `./vllm_logs/` by default, with the naming format:

```
vllm_{model_name}_{YYYYMMDD}_{server_id}.log
```

Example:

```
vllm_facebook_opt-125m_20260226_srv_20260226105451.log
```

Log format:

```
2026-02-26 10:54:51 [INFO] [vllm.srv_20260226105451] VLLMManager initialized
2026-02-26 10:54:52 [INFO] [vllm.srv_20260226105451] Starting vLLM server: vllm serve ...
2026-02-26 10:54:53 [INFO] [vllm.srv_20260226105451] vLLM server started with PID: 12345
```
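`get_log_file()` returns this path, so the log can be inspected from the managing process (continuing the Quick Start `manager`):

```python
# Print the tail of the server log
with open(manager.get_log_file()) as f:
    print(f.read()[-2000:])  # last ~2 kB
```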
Error Handling
```python
from vllm_manager import VLLMManager

manager = VLLMManager(model="facebook/opt-125m")

try:
    manager.start(port=8000)
except RuntimeError as e:
    # Server already running
    print(f"Error: {e}")
except ValueError as e:
    # Model not specified
    print(f"Error: {e}")
except Exception as e:
    # Other errors
    print(f"Error: {e}")
finally:
    manager.stop()
```
Examples
Example 1: Simple Server
```python
from vllm_manager import VLLMManager
import time

# Start server
manager = VLLMManager(model="facebook/opt-125m")
manager.start(port=8000)

# Keep running
while manager.is_running():
    time.sleep(1)
```
Example 2: Production Deployment
```python
from vllm_manager import VLLMManager
import logging
import time

# Setup
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Start with production config
manager = VLLMManager(
    model="meta-llama/Llama-2-70b-hf",
    log_dir="/var/log/vllm",
)

try:
    manager.start(
        host="0.0.0.0",
        port=8000,
        tensor_parallel_size=4,
        gpu_memory_utilization=0.95,
        api_key="secret-key",
        disable_log_requests=False,
    )

    # Wait for readiness
    if not manager.wait_for_ready(timeout=120):
        raise RuntimeError("Server failed to start")
    logger.info("Server is ready!")

    # Keep running
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    logger.info("Shutting down...")
    manager.stop(timeout=60)
except Exception as e:
    logger.error(f"Error: {e}")
    manager.stop(force=True)
    raise
```
Example 3: Multiple Models
```python
from vllm_manager import VLLMCluster
import time

# Create cluster for multiple models
cluster = VLLMCluster(log_dir="/var/log/vllm")

# Add models
cluster.add_server(
    "llama-7b",
    model="meta-llama/Llama-2-7b-hf",
    port=8001,
    auto_restart=True,
)
cluster.add_server(
    "llama-70b",
    model="meta-llama/Llama-2-70b-hf",
    port=8002,
    tensor_parallel_size=4,
)

# Start all
cluster.start_all()

# Monitor
while True:
    status = cluster.get_status()
    print(f"Status: {status}")
    time.sleep(10)
```
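`health_check()` can drive a simple supervision loop on top of this example. The sketch below assumes it returns a name-to-boolean mapping, which this README does not actually specify:

```python
# Continuing the example above: restart any server that reports unhealthy
while True:
    for name, healthy in cluster.health_check().items():
        if not healthy:
            print(f"{name} is unhealthy, restarting")
            cluster.stop_server(name)
            cluster.start_server(name)
    time.sleep(30)
```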
License
MIT License
Contributing
Contributions welcome! Please feel free to submit issues and pull requests.
File details
Details for the file vllm_manager-0.1.0.tar.gz.
File metadata
- Download URL: vllm_manager-0.1.0.tar.gz
- Size: 18.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f49b12781a99bc7b3938f2776d8e3593c41840d85ce7f4fe8cf3f6972ab7b3a8` |
| MD5 | `de3c685dab6b63908c2a11bbb6bd13fd` |
| BLAKE2b-256 | `fc78a345fe944356583dccbe34e0f5e45e9dd6e7c616093cc496d6387ac5c0c7` |
File details
Details for the file vllm_manager-0.1.0-py3-none-any.whl.
File metadata
- Download URL: vllm_manager-0.1.0-py3-none-any.whl
- Size: 15.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `8c3b633f701712133f6ac5a4ba60d2a3a019a71f2b34e8aea368c62cb0ef3a45` |
| MD5 | `e7ca6828b0c8c1049d9719adc1ef438b` |
| BLAKE2b-256 | `ce645363596638c67068587b7f96083b65ab17e17b0cb2b856615f460e06b2d1` |