vLLM MCP Server
A Model Context Protocol (MCP) server that exposes vLLM capabilities to AI assistants like Claude, Cursor, and other MCP-compatible clients.
Features
- Chat & Completion: Send chat messages and text completions to vLLM
- Model Management: List and inspect available models
- Server Monitoring: Check server health and performance metrics
- Platform-Aware Container Control: Supports both Podman and Docker. Automatically detects your platform (Linux/macOS/Windows) and GPU availability, selecting the appropriate container image
- Benchmarking: Run GuideLLM benchmarks (optional)
- Pre-defined Prompts: Use curated system prompts for common tasks
Installation
Using uvx (Recommended)
uvx vllm-mcp-server
Using pip
pip install vllm-mcp-server
From Source
git clone https://github.com/micytao/vllm-mcp-server.git
cd vllm-mcp-server
pip install -e .
Quick Start
1. Start a vLLM Server
You can either start a vLLM server manually or let the MCP server manage it via Docker.
Option A: Let MCP Server Manage Docker (Recommended)
The MCP server can automatically start/stop vLLM containers with platform detection. Just configure your MCP client (step 2) and use the start_vllm tool.
Option B: Manual Container Setup (Podman or Docker)
Replace podman with docker if using Docker.
Linux/Windows with NVIDIA GPU:
podman run --device nvidia.com/gpu=all -p 8000:8000 \
vllm/vllm-openai:latest \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0
macOS (Apple Silicon / Intel):
podman run -p 8000:8000 \
quay.io/rh_ee_micyang/vllm-service:macos \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0
Linux/Windows CPU-only:
podman run -p 8000:8000 \
quay.io/rh_ee_micyang/vllm-service:cpu \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--device cpu --dtype float32
Option C: Native vLLM Installation
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
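Whichever option you choose, it can help to confirm the server is reachable before wiring up your MCP client. A minimal check using only Python's standard library (assuming the default port 8000 from the examples above):

import json
import urllib.request

# vLLM exposes an OpenAI-compatible API; GET /v1/models lists the models being served.
with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=10) as resp:
    models = json.load(resp)

for model in models.get("data", []):
    print(model["id"])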
2. Configure Your MCP Client
Cursor
Add to ~/.cursor/mcp.json:
{
"mcpServers": {
"vllm": {
"command": "uvx",
"args": ["vllm-mcp-server"],
"env": {
"VLLM_BASE_URL": "http://localhost:8000",
"VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
}
}
}
Claude Desktop
Add to your Claude Desktop configuration file (claude_desktop_config.json):
{
"mcpServers": {
"vllm": {
"command": "uvx",
"args": ["vllm-mcp-server"],
"env": {
"VLLM_BASE_URL": "http://localhost:8000"
}
}
}
}
3. Use the Tools
Once configured, you can use these tools in your AI assistant:
Server Management:
- start_vllm - Start a vLLM container (auto-detects platform & GPU)
- stop_vllm - Stop a running container
- get_platform_status - Check platform, Docker, and GPU status
- vllm_status - Check vLLM server health
Inference:
- vllm_chat - Send chat messages
- vllm_complete - Generate text completions
Model Management:
- list_models - List available models
- get_model_info - Get model details
Configuration
Configure the server using environment variables:
| Variable | Description | Default |
|---|---|---|
| VLLM_BASE_URL | vLLM server URL | http://localhost:8000 |
| VLLM_API_KEY | API key (if required) | None |
| VLLM_MODEL | Default model to use | None (auto-detect) |
| VLLM_DEFAULT_TEMPERATURE | Default temperature | 0.7 |
| VLLM_DEFAULT_MAX_TOKENS | Default max tokens | 1024 |
| VLLM_DEFAULT_TIMEOUT | Request timeout (seconds) | 60.0 |
| VLLM_CONTAINER_RUNTIME | Container runtime (podman, docker, or auto) | None (auto-detect, prefers Podman) |
| VLLM_DOCKER_IMAGE | Container image (GPU mode) | vllm/vllm-openai:latest |
| VLLM_DOCKER_IMAGE_MACOS | Container image (macOS) | quay.io/rh_ee_micyang/vllm-service:macos |
| VLLM_DOCKER_IMAGE_CPU | Container image (CPU mode) | quay.io/rh_ee_micyang/vllm-service:cpu |
| VLLM_CONTAINER_NAME | Container name | vllm-server |
| VLLM_GPU_MEMORY_UTILIZATION | GPU memory fraction | 0.9 |
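For example, several of these variables can be passed through the env block of your MCP client configuration (the values below are illustrative, mirroring the Quick Start defaults):

{
  "env": {
    "VLLM_BASE_URL": "http://localhost:8000",
    "VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "VLLM_CONTAINER_RUNTIME": "podman",
    "VLLM_DEFAULT_MAX_TOKENS": "2048"
  }
}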
Available Tools
P0 (Core)
vllm_chat
Send chat messages to vLLM with multi-turn conversation support.
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 1024
}
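Since vLLM serves an OpenAI-compatible API, this tool call corresponds roughly to a direct request against /v1/chat/completions. A sketch of that equivalent request (the model name and URL are taken from the Quick Start, not required by the tool itself):

import json
import urllib.request

payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "max_tokens": 1024,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# The assistant's reply is in choices[0].message.content of the OpenAI-style response.
with urllib.request.urlopen(req, timeout=60) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])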
vllm_complete
Generate text completions.
{
"prompt": "def fibonacci(n):",
"max_tokens": 200,
"stop": ["\n\n"]
}
P1 (Model Management)
list_models
List all available models on the vLLM server.
get_model_info
Get detailed information about a specific model.
{
"model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
P2 (Status)
vllm_status
Check the health and status of the vLLM server.
P3 (Server Control - Platform Aware)
The server control tools support both Podman (preferred) and Docker, automatically detecting your platform and GPU availability:
| Platform | GPU Support | Container Image |
|---|---|---|
| Linux (GPU) | NVIDIA | vllm/vllm-openai:latest |
| Linux (CPU) | None | quay.io/rh_ee_micyang/vllm-service:cpu |
| macOS (Apple Silicon) | None | quay.io/rh_ee_micyang/vllm-service:macos |
| macOS (Intel) | None | quay.io/rh_ee_micyang/vllm-service:macos |
| Windows (GPU) | NVIDIA | vllm/vllm-openai:latest |
| Windows (CPU) | None | quay.io/rh_ee_micyang/vllm-service:cpu |
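The detection itself boils down to inspecting the operating system, the available container runtime, and whether an NVIDIA GPU is visible. A simplified sketch of that kind of logic (illustrative only, not the project's actual implementation):

import platform
import shutil
import subprocess

def detect_environment():
    """Illustrative platform/runtime/GPU detection, not the project's actual code."""
    system = platform.system()  # "Linux", "Darwin" (macOS), or "Windows"
    # Prefer Podman, fall back to Docker, mirroring the behavior described above.
    runtime = next((r for r in ("podman", "docker") if shutil.which(r)), None)
    # Treat a working `nvidia-smi` as evidence of a usable NVIDIA GPU.
    has_gpu = (
        shutil.which("nvidia-smi") is not None
        and subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0
    )
    return system, runtime, has_gpu

print(detect_environment())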
start_vllm
Start a vLLM server in a Docker container with automatic platform detection.
{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"port": 8000,
"gpu_memory_utilization": 0.9,
"cpu_only": false,
"tensor_parallel_size": 1,
"max_model_len": 4096,
"dtype": "auto"
}
stop_vllm
Stop a running vLLM Docker container.
{
"container_name": "vllm-server",
"remove": true,
"timeout": 10
}
restart_vllm
Restart a vLLM container.
list_vllm_containers
List all vLLM Docker containers.
{
"all": true
}
get_vllm_logs
Get container logs to monitor loading progress.
{
"container_name": "vllm-server",
"tail": 100
}
get_platform_status
Get detailed platform, Docker, and GPU status information.
run_benchmark
Run a GuideLLM benchmark against the server.
{
"rate": "sweep",
"max_seconds": 120,
"data": "emulated"
}
Resources
The server exposes these MCP resources:
- vllm://status - Current server status
- vllm://metrics - Performance metrics
- vllm://config - Current configuration
- vllm://platform - Platform, Docker, and GPU information
Prompts
Pre-defined prompts for common tasks:
- coding_assistant - Expert coding help
- code_reviewer - Code review feedback
- technical_writer - Documentation writing
- debugger - Debugging assistance
- architect - System design help
- data_analyst - Data analysis
- ml_engineer - ML/AI development
Development
Setup
# Clone the repository
git clone https://github.com/micytao/vllm-mcp-server.git
cd vllm-mcp-server
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # or `.venv\Scripts\activate` on Windows
# Install with dev dependencies
uv pip install -e ".[dev]"
Running Tests
uv run pytest
Code Formatting
uv run ruff check --fix .
uv run ruff format .
Architecture
vllm-mcp-server/
├── src/vllm_mcp_server/
│   ├── server.py              # Main MCP server entry point
│   ├── tools/                 # MCP tool implementations
│   │   ├── chat.py            # Chat/completion tools
│   │   ├── models.py          # Model management tools
│   │   ├── server_control.py  # Docker container control
│   │   └── benchmark.py       # GuideLLM integration
│   ├── resources/             # MCP resource implementations
│   │   ├── server_status.py   # Server health resource
│   │   └── metrics.py         # Prometheus metrics resource
│   ├── prompts/               # Pre-defined prompts
│   │   └── system_prompts.py  # Curated system prompts
│   └── utils/                 # Utilities
│       ├── config.py          # Configuration management
│       └── vllm_client.py     # vLLM API client
├── tests/                     # Test suite
├── examples/                  # Configuration examples
├── pyproject.toml             # Project configuration
└── README.md                  # This file
License
Apache License 2.0 - see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
Download files
File details
Details for the file vllm_mcp_server-0.1.1.tar.gz.
File metadata
- Download URL: vllm_mcp_server-0.1.1.tar.gz
- Upload date:
- Size: 268.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ef2e7408cf541f2f7e950e6b37b7985d44598422fbd641806985cca07f942361 |
| MD5 | efb9ed8963a95971370f6ad47dd56126 |
| BLAKE2b-256 | e8f7100e273de6ec8793d8065864e5f9ee802c97272c942da5da6856a0d377b3 |
File details
Details for the file vllm_mcp_server-0.1.1-py3-none-any.whl.
File metadata
- Download URL: vllm_mcp_server-0.1.1-py3-none-any.whl
- Upload date:
- Size: 31.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 09744c20366fa1f0ed6c26134c4dbb1334e71b3d8509145e539922fc9b9de044 |
| MD5 | 5e44f3012f7cf5bc2e222f74ae57b469 |
| BLAKE2b-256 | 8af1e2994eb24247c70a6d54082ded70a8f3e4f521fa3bef04c95ba6ab50322c |