
vLLM MCP Server

Python 3.10+ License: Apache 2.0

A Model Context Protocol (MCP) server that exposes vLLM capabilities to AI assistants like Claude, Cursor, and other MCP-compatible clients.

Features

  • 🚀 Chat & Completion: Send chat messages and text completions to vLLM
  • 📋 Model Management: List and inspect available models
  • 📊 Server Monitoring: Check server health and performance metrics
  • 🐳 Platform-Aware Container Control: Supports both Podman and Docker; automatically detects your platform (Linux/macOS/Windows) and GPU availability and selects the appropriate container image
  • 📈 Benchmarking: Run GuideLLM benchmarks (optional)
  • 💬 Pre-defined Prompts: Use curated system prompts for common tasks

Installation

Using uvx (Recommended)

uvx vllm-mcp-server

Using pip

pip install vllm-mcp-server

From Source

git clone https://github.com/micytao/vllm-mcp-server.git
cd vllm-mcp-server
pip install -e .

Quick Start

1. Start a vLLM Server

You can either start a vLLM server manually or let the MCP server manage containers for you via Podman or Docker.

Option A: Let the MCP Server Manage the Container (Recommended)

The MCP server can automatically start and stop vLLM containers with platform detection. Just configure your MCP client (step 2) and use the start_vllm tool.

Option B: Manual Container Setup (Podman or Docker)

Replace podman with docker if using Docker.

Linux/Windows with NVIDIA GPU:

podman run --device nvidia.com/gpu=all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

macOS (Apple Silicon / Intel):

podman run -p 8000:8000 \
  quay.io/rh_ee_micyang/vllm-service:macos \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

Linux/Windows CPU-only:

podman run -p 8000:8000 \
  quay.io/rh_ee_micyang/vllm-service:cpu \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --device cpu --dtype float32

Option C: Native vLLM Installation

vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
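Whichever option you choose, the server exposes vLLM's OpenAI-compatible HTTP API on port 8000. As a quick sanity check before wiring up an MCP client, you can poll the server's /health endpoint; this is a minimal standard-library sketch (function names here are illustrative, not part of this package):

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the vLLM /health endpoint URL from a base URL."""
    return base_url.rstrip("/") + "/health"


def vllm_is_ready(base_url: str = "http://localhost:8000") -> bool:
    """Return True if the vLLM server answers its health check."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    print("vLLM ready:", vllm_is_ready())
```

Model loading can take a while on first start, so expect the check to fail until the weights are downloaded and loaded.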

2. Configure Your MCP Client

Cursor

Add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "vllm": {
      "command": "uvx",
      "args": ["vllm-mcp-server"],
      "env": {
        "VLLM_BASE_URL": "http://localhost:8000",
        "VLLM_MODEL": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
      }
    }
  }
}

Claude Desktop

Add to your Claude Desktop configuration:

{
  "mcpServers": {
    "vllm": {
      "command": "uvx",
      "args": ["vllm-mcp-server"],
      "env": {
        "VLLM_BASE_URL": "http://localhost:8000"
      }
    }
  }
}

3. Use the Tools

Once configured, you can use these tools in your AI assistant:

Server Management:

  • start_vllm - Start a vLLM container (auto-detects platform & GPU)
  • stop_vllm - Stop a running container
  • get_platform_status - Check platform, Docker, and GPU status
  • vllm_status - Check vLLM server health

Inference:

  • vllm_chat - Send chat messages
  • vllm_complete - Generate text completions

Model Management:

  • list_models - List available models
  • get_model_info - Get model details

Configuration

Configure the server using environment variables:

| Variable | Description | Default |
|---|---|---|
| VLLM_BASE_URL | vLLM server URL | http://localhost:8000 |
| VLLM_API_KEY | API key (if required) | None |
| VLLM_MODEL | Default model to use | None (auto-detect) |
| VLLM_DEFAULT_TEMPERATURE | Default temperature | 0.7 |
| VLLM_DEFAULT_MAX_TOKENS | Default max tokens | 1024 |
| VLLM_DEFAULT_TIMEOUT | Request timeout (seconds) | 60.0 |
| VLLM_CONTAINER_RUNTIME | Container runtime (podman, docker, or auto) | None (auto-detect, prefers Podman) |
| VLLM_DOCKER_IMAGE | Container image (GPU mode) | vllm/vllm-openai:latest |
| VLLM_DOCKER_IMAGE_MACOS | Container image (macOS) | quay.io/rh_ee_micyang/vllm-service:macos |
| VLLM_DOCKER_IMAGE_CPU | Container image (CPU mode) | quay.io/rh_ee_micyang/vllm-service:cpu |
| VLLM_CONTAINER_NAME | Container name | vllm-server |
| VLLM_GPU_MEMORY_UTILIZATION | GPU memory fraction | 0.9 |
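For example, a local setup might export a few of these variables in the shell that launches the server (the values below are illustrative):

```shell
# Point the MCP server at a local vLLM instance
export VLLM_BASE_URL="http://localhost:8000"
export VLLM_MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Prefer Podman explicitly instead of auto-detection
export VLLM_CONTAINER_RUNTIME="podman"

# Lower the GPU memory fraction to leave headroom for other workloads
export VLLM_GPU_MEMORY_UTILIZATION="0.8"

# Then launch the server in this environment, e.g.:
# uvx vllm-mcp-server
```

When launching through an MCP client instead, put the same key/value pairs in the client config's "env" block, as shown in the Cursor and Claude Desktop examples above.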

Available Tools

P0 (Core)

vllm_chat

Send chat messages to vLLM with multi-turn conversation support.

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 1024
}
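Under the hood, a request like this maps onto vLLM's OpenAI-compatible /v1/chat/completions endpoint. The following standard-library sketch shows the same call made directly; the function names are illustrative and not this package's internals:

```python
import json
import urllib.request


def build_chat_payload(messages, model=None, temperature=0.7, max_tokens=1024):
    """Assemble a /v1/chat/completions request body with the tool's defaults."""
    payload = {"messages": messages, "temperature": temperature, "max_tokens": max_tokens}
    if model is not None:
        payload["model"] = model
    return payload


def vllm_chat(base_url, messages, **kwargs):
    """POST a chat request to a vLLM server and return the reply text."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(messages, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

If VLLM_MODEL is unset, the MCP server auto-detects a model; when calling the HTTP API directly as above, vLLM generally expects the model field to be supplied.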

vllm_complete

Generate text completions.

{
  "prompt": "def fibonacci(n):",
  "max_tokens": 200,
  "stop": ["\n\n"]
}

P1 (Model Management)

list_models

List all available models on the vLLM server.

get_model_info

Get detailed information about a specific model.

{
  "model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}

P2 (Status)

vllm_status

Check the health and status of the vLLM server.

P3 (Server Control - Platform Aware)

The server control tools support both Podman (preferred) and Docker, automatically detecting your platform and GPU availability:

| Platform | GPU Support | Container Image |
|---|---|---|
| Linux (GPU) | ✅ NVIDIA | vllm/vllm-openai:latest |
| Linux (CPU) | ❌ | quay.io/rh_ee_micyang/vllm-service:cpu |
| macOS (Apple Silicon) | ❌ | quay.io/rh_ee_micyang/vllm-service:macos |
| macOS (Intel) | ❌ | quay.io/rh_ee_micyang/vllm-service:macos |
| Windows (GPU) | ✅ NVIDIA | vllm/vllm-openai:latest |
| Windows (CPU) | ❌ | quay.io/rh_ee_micyang/vllm-service:cpu |
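The selection behavior described above can be sketched roughly as follows. This is a simplified illustration, not the server's actual code; has_nvidia_gpu and select_image are hypothetical names:

```python
import platform
import shutil
import subprocess

# Default images, matching the table above
IMAGES = {
    "gpu": "vllm/vllm-openai:latest",
    "cpu": "quay.io/rh_ee_micyang/vllm-service:cpu",
    "macos": "quay.io/rh_ee_micyang/vllm-service:macos",
}


def has_nvidia_gpu() -> bool:
    """Heuristic: an NVIDIA GPU is usable if nvidia-smi exists and runs."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0
    except OSError:
        return False


def select_image(system=None, gpu=None):
    """Pick a container image based on operating system and GPU availability."""
    system = system or platform.system()
    if system == "Darwin":  # macOS always uses the macOS image
        return IMAGES["macos"]
    if gpu is None:
        gpu = has_nvidia_gpu()
    return IMAGES["gpu"] if gpu else IMAGES["cpu"]
```

The real tools also honor the VLLM_DOCKER_IMAGE* environment variables from the Configuration section, so any of these defaults can be overridden.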

start_vllm

Start a vLLM server in a container (Podman or Docker) with automatic platform and GPU detection.

{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "port": 8000,
  "gpu_memory_utilization": 0.9,
  "cpu_only": false,
  "tensor_parallel_size": 1,
  "max_model_len": 4096,
  "dtype": "auto"
}
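For reference, those parameters translate roughly into a container invocation like the manual ones in Quick Start. The sketch below is a simplified, hypothetical illustration of that mapping (the real tool handles more cases, such as runtime auto-detection and macOS images):

```python
def build_run_command(
    model,
    port=8000,
    gpu_memory_utilization=0.9,
    cpu_only=False,
    tensor_parallel_size=1,
    max_model_len=None,
    dtype="auto",
    runtime="podman",
    name="vllm-server",
):
    """Translate start_vllm-style parameters into a podman/docker run argv."""
    cmd = [runtime, "run", "-d", "--name", name, "-p", f"{port}:8000"]
    if cpu_only:
        image = "quay.io/rh_ee_micyang/vllm-service:cpu"
    else:
        cmd += ["--device", "nvidia.com/gpu=all"]  # CDI-style GPU passthrough
        image = "vllm/vllm-openai:latest"
    cmd += [image, "--model", model, "--tensor-parallel-size", str(tensor_parallel_size)]
    if cpu_only:
        cmd += ["--device", "cpu", "--dtype", "float32"]
    else:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization), "--dtype", dtype]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd
```

Everything after the image name is passed through to vLLM itself, which is why the flags mirror vllm serve options.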

stop_vllm

Stop a running vLLM container.

{
  "container_name": "vllm-server",
  "remove": true,
  "timeout": 10
}

restart_vllm

Restart a vLLM container.

list_vllm_containers

List all vLLM containers.

{
  "all": true
}

get_vllm_logs

Get container logs to monitor loading progress.

{
  "container_name": "vllm-server",
  "tail": 100
}

get_platform_status

Get detailed platform, Docker, and GPU status information.

run_benchmark

Run a GuideLLM benchmark against the server.

{
  "rate": "sweep",
  "max_seconds": 120,
  "data": "emulated"
}

Resources

The server exposes these MCP resources:

  • vllm://status - Current server status
  • vllm://metrics - Performance metrics
  • vllm://config - Current configuration
  • vllm://platform - Platform, Docker, and GPU information

Prompts

Pre-defined prompts for common tasks:

  • coding_assistant - Expert coding help
  • code_reviewer - Code review feedback
  • technical_writer - Documentation writing
  • debugger - Debugging assistance
  • architect - System design help
  • data_analyst - Data analysis
  • ml_engineer - ML/AI development

Development

Setup

# Clone the repository
git clone https://github.com/micytao/vllm-mcp-server.git
cd vllm-mcp-server

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows

# Install with dev dependencies
uv pip install -e ".[dev]"

Running Tests

uv run pytest

Code Formatting

uv run ruff check --fix .
uv run ruff format .

Architecture

vllm-mcp-server/
├── src/vllm_mcp_server/
│   ├── server.py              # Main MCP server entry point
│   ├── tools/                 # MCP tool implementations
│   │   ├── chat.py            # Chat/completion tools
│   │   ├── models.py          # Model management tools
│   │   ├── server_control.py  # Docker container control
│   │   └── benchmark.py       # GuideLLM integration
│   ├── resources/             # MCP resource implementations
│   │   ├── server_status.py   # Server health resource
│   │   └── metrics.py         # Prometheus metrics resource
│   ├── prompts/               # Pre-defined prompts
│   │   └── system_prompts.py  # Curated system prompts
│   └── utils/                 # Utilities
│       ├── config.py          # Configuration management
│       └── vllm_client.py     # vLLM API client
├── tests/                     # Test suite
├── examples/                  # Configuration examples
├── pyproject.toml             # Project configuration
└── README.md                  # This file

License

Apache License 2.0 - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

  • vLLM - Fast LLM inference engine
  • MCP - Model Context Protocol
  • GuideLLM - LLM benchmarking tool


Download files


Source Distribution

vllm_mcp_server-0.1.2.tar.gz (268.9 kB)


Built Distribution


vllm_mcp_server-0.1.2-py3-none-any.whl (31.5 kB)


File details

Details for the file vllm_mcp_server-0.1.2.tar.gz.

File metadata

  • Download URL: vllm_mcp_server-0.1.2.tar.gz
  • Upload date:
  • Size: 268.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.17

File hashes

Hashes for vllm_mcp_server-0.1.2.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | c710587d1018d0f2e7b8af2a70ef1b72503864f97862e7a5669f99ab15642f5d |
| MD5 | 8e6b08b98068237abdbf41f68e51584e |
| BLAKE2b-256 | 13ea58f225c34b948826483618ed62e9a362d2a0523d192448926484d95840c7 |


File details

Details for the file vllm_mcp_server-0.1.2-py3-none-any.whl.

File hashes

Hashes for vllm_mcp_server-0.1.2-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 07219a62aefd438eaf207cc9d606a85d9da3aed1fe554e62e1feda3aa05dcbd7 |
| MD5 | e5e8f88a1e454a8706269fa9c7df1924 |
| BLAKE2b-256 | 923d0e08b50289b85ca35d88af731e46ccf3d99ae0cbe6f6b790ef4c0a6bde92 |

