vmcluster-mcp
An MCP server for autonomous multi-VM cluster orchestration on libvirt/QEMU. Manages the full lifecycle of KVM virtual machine clusters — provisioning, starting, stopping, snapshotting, SSH execution, artifact distribution, and fault injection — through a structured tool interface designed for AI agents.
Table of Contents
- Overview
- Prerequisites
- Installation
- Quick Start (First 15 Minutes)
- Configuration
- Topology Files
- Integration: VS Code (GitHub Copilot)
- Integration: Claude CLI
- Available Tools
- Canonical Agent Workflow
- Troubleshooting
- Development
Overview
vmcluster-mcp is a general-purpose MCP server. It manages clusters of KVM/QEMU virtual machines and produces a ClusterHandle — a typed descriptor passed to downstream consumers for direct SSH access. The server has no knowledge of what runs inside VMs; it knows nodes, networks, snapshots, and artifacts.
Design principles:
- Topology-as-data — cluster shape is declared in a YAML file, not constructed imperatively
- Structured outputs — all tools return typed Pydantic models serialized as JSON; no free-text parsing
- Stateless server — all persistent state lives in libvirt and on disk; safe to restart at any time
- Idempotent operations — `cluster_define` and related tools are safe to call multiple times
Prerequisites
- Linux host with KVM/QEMU and libvirt installed (`libvirtd` running)
- Python 3.11+
- `qemu-img` available in `PATH`
- `genisoimage` or `mkisofs` for cloud-init ISO generation
- `iptables`, `tc` (from `iproute2`), and `rsync` for fault/artifact tools
- Permission to run libvirt and host network commands (`sudo` access is usually required)
- `uv` (recommended) or `pip` for installation
Typical package set on Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y \
qemu-kvm libvirt-daemon-system libvirt-clients \
qemu-utils cloud-image-utils genisoimage \
iproute2 iptables rsync
# Verify libvirt access
virsh list --all
# Verify qemu-img
qemu-img --version
# Verify tc and iptables
tc -V
iptables --version
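The same checks can be scripted. A small self-check sketch using only the Python standard library; the binary list simply mirrors the prerequisites above:
import shutil

# Binaries the server shells out to; genisoimage and mkisofs are interchangeable.
REQUIRED = ["virsh", "qemu-img", "iptables", "tc", "rsync"]
ISO_TOOLS = ["genisoimage", "mkisofs"]

missing = [b for b in REQUIRED if shutil.which(b) is None]
if not any(shutil.which(b) for b in ISO_TOOLS):
    missing.append(" or ".join(ISO_TOOLS))
if missing:
    raise SystemExit(f"Missing host binaries: {', '.join(missing)}")
print("All required host binaries found.")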
Installation
From PyPI (recommended)
Install the latest release:
pip install vmcluster-mcp
Or run directly without installing (via uv):
uvx vmcluster-mcp
Prerequisite: `libvirt-python` requires system-level development headers. Install them before running `pip install`:
# Ubuntu/Debian
sudo apt-get install -y libvirt-dev pkg-config gcc
# Fedora/RHEL
sudo dnf install -y libvirt-devel pkgconf-pkg-config gcc
# Arch Linux
sudo pacman -S libvirt pkgconf gcc
From source (development)
git clone https://github.com/chompinbits/vmcluster-mcp.git
cd vmcluster-mcp
# Create virtual environment and install
uv venv
uv pip install -e .
Quick Start (First 15 Minutes)
This path is for first-time setup on a single Linux host.
- Create required directories and SSH key:
sudo mkdir -p /etc/vmcluster/topologies /etc/vmcluster/ssh
sudo mkdir -p /var/lib/vmcluster/{overlays,artifacts/trees,faults}
sudo ssh-keygen -t ed25519 -f /etc/vmcluster/ssh/vmcluster_id_ed25519 -N ""
- Create `/etc/vmcluster/config.yaml`:
topology_dir: /etc/vmcluster/topologies
overlay_dir: /var/lib/vmcluster/overlays
artifact_registry: /var/lib/vmcluster/artifacts/registry.json
artifact_store_dir: /var/lib/vmcluster/artifacts/trees
fault_registry: /var/lib/vmcluster/faults/registry.json
ssh_key_path: /etc/vmcluster/ssh/vmcluster_id_ed25519
ssh_user: root
libvirt_uri: qemu:///system
log_level: INFO
- Prepare a base image used by the example topology:
sudo mkdir -p /var/lib/vmcluster/images
sudo wget -O /tmp/ubuntu-24.04-server-cloudimg-amd64.img \
https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img
sudo qemu-img convert -f qcow2 -O qcow2 \
/tmp/ubuntu-24.04-server-cloudimg-amd64.img \
/var/lib/vmcluster/images/ubuntu-6.8-base.qcow2
sudo qemu-img info /var/lib/vmcluster/images/ubuntu-6.8-base.qcow2
- Add your first topology file in `/etc/vmcluster/topologies/` (see example below).
- Run the server locally to verify it starts:
VMCLUSTER_CONFIG=/etc/vmcluster/config.yaml .venv/bin/python -m vmcluster_mcp
- Connect from your MCP client (VS Code or Claude) and run this smoke flow (a scripted version appears after this list):
cluster_define("example-3node")
cluster_start("example-3node", wait_for_ssh=True)
cluster_status("example-3node")
node_exec("example-3node", "controller", "uname -r")
snapshot_create("example-3node", "baseline")
cluster_stop("example-3node")
- Clean up when finished:
cluster_destroy("example-3node", remove_overlays=True)
Configuration
The server can be configured via a YAML file and/or environment variables. Environment variables take precedence over the config file, which takes precedence over defaults.
Config file
Default location: /etc/vmcluster/config.yaml. Override with VMCLUSTER_CONFIG env var.
# /etc/vmcluster/config.yaml
topology_dir: /etc/vmcluster/topologies # Where topology YAML files live
overlay_dir: /var/lib/vmcluster/overlays # Where per-node qcow2 overlays are created
artifact_registry: /var/lib/vmcluster/artifacts/registry.json
artifact_store_dir: /var/lib/vmcluster/artifacts/trees
fault_registry: /var/lib/vmcluster/faults/registry.json
ssh_key_path: /etc/vmcluster/ssh/vmcluster_id_ed25519
ssh_user: root
libvirt_uri: qemu:///system
log_level: INFO
Environment variables
| Variable | Config key | Default |
|---|---|---|
| `VMCLUSTER_CONFIG` | (config file path) | `/etc/vmcluster/config.yaml` |
| `VMCLUSTER_TOPOLOGY_DIR` | `topology_dir` | `/etc/vmcluster/topologies` |
| `VMCLUSTER_OVERLAY_DIR` | `overlay_dir` | `/var/lib/vmcluster/overlays` |
| `VMCLUSTER_ARTIFACT_REGISTRY` | `artifact_registry` | `/var/lib/vmcluster/artifacts/registry.json` |
| `VMCLUSTER_ARTIFACT_STORE_DIR` | `artifact_store_dir` | `/var/lib/vmcluster/artifacts/trees` |
| `VMCLUSTER_FAULT_REGISTRY` | `fault_registry` | `/var/lib/vmcluster/faults/registry.json` |
| `VMCLUSTER_SSH_KEY_PATH` | `ssh_key_path` | `/etc/vmcluster/ssh/vmcluster_id_ed25519` |
| `VMCLUSTER_SSH_USER` | `ssh_user` | `root` |
| `VMCLUSTER_LIBVIRT_URI` | `libvirt_uri` | `qemu:///system` |
| `VMCLUSTER_LOG_LEVEL` | `log_level` | `INFO` |
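Precedence is simple: an environment variable beats the config file, which beats the built-in default. A sketch of that resolution logic (illustrative; the real `config.py` may differ):
import os

# Defaults mirror the table above (abridged).
DEFAULTS = {
    "topology_dir": "/etc/vmcluster/topologies",
    "ssh_user": "root",
    "libvirt_uri": "qemu:///system",
    "log_level": "INFO",
}

def resolve(key: str, file_config: dict[str, str]) -> str:
    """Env var > config file > default, per the table above."""
    env_value = os.environ.get(f"VMCLUSTER_{key.upper()}")
    return env_value or file_config.get(key) or DEFAULTS[key]

# With VMCLUSTER_LOG_LEVEL=DEBUG set, this prints DEBUG even if the file says INFO.
print(resolve("log_level", {"log_level": "INFO"}))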
Quick setup
# Create directories
sudo mkdir -p /etc/vmcluster/topologies /etc/vmcluster/ssh
sudo mkdir -p /var/lib/vmcluster/{overlays,artifacts/trees,faults}
# Generate SSH key for VM access
sudo ssh-keygen -t ed25519 -f /etc/vmcluster/ssh/vmcluster_id_ed25519 -N ""
Topology Files
Topology files are YAML files placed in topology_dir. The agent references topologies by filename (without .yaml).
# /etc/vmcluster/topologies/example-3node.yaml
cluster_name: example-3node
base_image: /var/lib/vmcluster/images/ubuntu-6.8-base.qcow2
overlay_dir: /var/lib/vmcluster/overlays/
network:
name: clusternet-example
bridge: virbr-example0
subnet: 192.168.100.0/24
nodes:
- name: controller
role: control
vcpus: 2
memory_mb: 2048
ip: 192.168.100.10
extra_disks:
- path: /var/lib/vmcluster/disks/data0.qcow2
size_gb: 20
bus: virtio
- name: worker-0
role: worker
vcpus: 2
memory_mb: 2048
ip: 192.168.100.11
- name: client-0
role: client
vcpus: 2
memory_mb: 1024
ip: 192.168.100.20
ssh:
key_path: /etc/vmcluster/ssh/vmcluster_id_ed25519
user: root
connect_timeout_s: 30
snapshots:
baseline: clean-boot # Logical name for snapshot_revert("baseline")
Node IPs are configured statically via cloud-init — no DHCP is used. Each node gets a NoCloud ISO injected at first boot.
The libvirt bridge named under network.bridge is created when cluster_define defines the topology network, so it does not need to pre-exist on the host.
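Because topology is data, it can be validated before the server ever sees it. A minimal sketch using PyYAML and Pydantic; the models here are illustrative, not the server's actual `topology/schema.py`:
from ipaddress import IPv4Address
from pathlib import Path

import yaml
from pydantic import BaseModel

class NodeSpec(BaseModel):
    name: str
    role: str
    vcpus: int
    memory_mb: int
    ip: IPv4Address

class TopologySpec(BaseModel):
    cluster_name: str
    base_image: Path
    nodes: list[NodeSpec]

raw = yaml.safe_load(Path("/etc/vmcluster/topologies/example-3node.yaml").read_text())
topo = TopologySpec(
    cluster_name=raw["cluster_name"],
    base_image=raw["base_image"],
    nodes=raw["nodes"],  # extra per-node keys (e.g. extra_disks) are ignored here
)
print(topo.cluster_name, [str(n.ip) for n in topo.nodes])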
Integration: VS Code (GitHub Copilot)
Add the server to your VS Code MCP configuration. Open Settings → MCP or edit .vscode/mcp.json in your workspace (or ~/.vscode/mcp.json globally).
If you installed from source into a virtualenv (recommended):
{
"servers": {
"vmcluster-mcp": {
"type": "stdio",
"command": "/path/to/vmcluster-mcp/.venv/bin/python",
"args": ["-m", "vmcluster_mcp"],
"env": {
"VMCLUSTER_CONFIG": "/etc/vmcluster/config.yaml"
}
}
}
}
If you prefer ephemeral launch with uv run --with:
{
"servers": {
"vmcluster-mcp": {
"type": "stdio",
"command": "uv",
"args": [
"run",
"--with", "git+https://github.com/chompinbits/vmcluster-mcp.git",
"python", "-m", "vmcluster_mcp"
],
"env": {
"VMCLUSTER_TOPOLOGY_DIR": "/etc/vmcluster/topologies",
"VMCLUSTER_OVERLAY_DIR": "/var/lib/vmcluster/overlays",
"VMCLUSTER_SSH_KEY_PATH": "/etc/vmcluster/ssh/vmcluster_id_ed25519",
"VMCLUSTER_LIBVIRT_URI": "qemu:///system"
}
}
}
}
After saving, restart the MCP server from the VS Code MCP panel. The tools will appear in Copilot Chat under the vmcluster-mcp server.
Integration: Claude CLI
claude (Anthropic Claude CLI / Claude Desktop)
Add to ~/.claude/claude_desktop_config.json (Claude Desktop) or ~/.config/claude/config.json (Claude CLI):
{
"mcpServers": {
"vmcluster-mcp": {
"command": "python",
"args": ["-m", "vmcluster_mcp"],
"env": {
"VMCLUSTER_CONFIG": "/etc/vmcluster/config.yaml"
}
}
}
}
If using a virtualenv:
{
"mcpServers": {
"vmcluster-mcp": {
"command": "/path/to/vmcluster-mcp/.venv/bin/python",
"args": ["-m", "vmcluster_mcp"],
"env": {
"VMCLUSTER_CONFIG": "/etc/vmcluster/config.yaml"
}
}
}
}
For the claude CLI (interactive terminal), register the server persistently:
claude mcp add vmcluster-mcp -- python -m vmcluster_mcp
Verify the server is loaded:
claude mcp list
Available Tools
All tools return ToolResult[T] — a structured JSON object with success: bool, result: T | null, and error: { code, message, recoverable } | null.
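As a mental model, the envelope maps onto a small generic Pydantic model. This is an illustrative sketch, not the server's exact definition in `models.py`:
from typing import Generic, TypeVar
from pydantic import BaseModel

T = TypeVar("T")

class ToolError(BaseModel):
    code: str
    message: str
    recoverable: bool

class ToolResult(BaseModel, Generic[T]):
    success: bool
    result: T | None = None
    error: ToolError | None = None

# A failing call carries a machine-readable error instead of free text.
# The error code here is made up for illustration.
failure = ToolResult[dict](
    success=False,
    error=ToolError(code="SSH_TIMEOUT", message="worker-0 unreachable", recoverable=True),
)
print(failure.model_dump_json())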
Cluster Lifecycle and Recovery
| Tool | Description |
|---|---|
| `cluster_define(topology_name)` | Provision a cluster from a topology file: create the network, per-node overlay disks, cloud-init ISOs, and libvirt domain definitions. Idempotent. |
| `cluster_start(cluster_name, wait_for_ssh, ssh_timeout_s)` | Boot all stopped nodes. Optionally waits for SSH on all nodes (strict: one failure means `success=False`). |
| `cluster_stop(cluster_name, mode)` | Stop all running nodes. `mode="shutdown"` (ACPI) or `mode="destroy"` (force-off). |
| `cluster_destroy(cluster_name, remove_overlays)` | Undefine all domains and destroy the network. Optionally delete overlay disk files. |
| `cluster_status(cluster_name)` | Return per-node domain state and SSH reachability. SSH is checked in parallel, only for running nodes. |
| `cluster_handle(cluster_name)` | Return a ClusterHandle with node SSH descriptors, `artifact_path`, and `kernel_version` (fetched via SSH). Requires a running cluster. |
| `node_crash(cluster_name, node, restart_after, wait_for_ssh, ssh_timeout_s)` | Simulate an unclean node failure (`virsh destroy`) and optionally restart and wait for SSH. |
Remote Command Execution
| Tool | Description |
|---|---|
| `node_exec(cluster_name, node_name, command, timeout_s)` | Run a command on one node and return structured stdout/stderr/exit metadata. |
| `node_exec_all(cluster_name, command, nodes, require_all, timeout_s)` | Run a command on many nodes in parallel, with per-node results and a failure map. |
Snapshot Management
| Tool | Description |
|---|---|
| `snapshot_create(cluster_name, snapshot_name, include_memory)` | Create disk snapshots for all nodes in the cluster. |
| `snapshot_list(cluster_name)` | List snapshots with per-node disk metadata. |
| `snapshot_revert(cluster_name, snapshot_name, restart_after, wait_for_ssh, ssh_timeout_s)` | Revert all nodes to a named snapshot and optionally restart and verify SSH. |
| `snapshot_delete(cluster_name, snapshot_name)` | Delete a named snapshot across all nodes (best effort, with per-node status). |
Artifact Management
| Tool | Description |
|---|---|
| `artifact_register(source_path, build_type, kernel_version, metadata)` | Register a local build tree and get a content-addressed artifact id. |
| `artifact_list()` | List registered artifacts. |
| `artifact_diff(artifact_id_a, artifact_id_b)` | Diff modules/binaries between two artifacts. |
| `artifact_sync(cluster_name, artifact_id, nodes, force, dest_base)` | Sync artifact content to target nodes over SSH/rsync. |
| `artifact_install(cluster_name, artifact_id, nodes, install_mode, dest_base)` | Install synced artifacts on nodes, with structured per-node install status. |
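"Content-addressed" means the artifact id is derived from the bytes of the registered tree, so re-registering an identical tree yields the same id. The server's exact scheme is internal; a sketch of the general idea:
import hashlib
from pathlib import Path

def tree_digest(root: str) -> str:
    """Fold each file's relative path and contents into one stable digest."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()[:16]

# Identical trees always produce identical ids; any changed byte changes the id.
print(tree_digest("/path/to/build/tree"))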
Network Fault Injection
| Tool | Description |
|---|---|
| `net_partition(cluster_name, partition_a, partition_b)` | Insert symmetric iptables partition rules between node groups. |
| `net_impair(cluster_name, source_node, target_node, latency_ms, jitter_ms, loss_pct, corrupt_pct, reorder_pct)` | Apply `tc netem` impairment on a source node's tap interface. |
| `net_heal(cluster_name, fault_handle)` | Remove a specific fault and deregister its handle. |
| `net_heal_all(cluster_name)` | Remove all active faults for a cluster. |
| `net_fault_list(cluster_name)` | List all active fault handles and parameters from the fault registry. |
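Under the hood, `net_impair` parameters map onto a `tc netem` qdisc on the source node's tap device. A rough sketch of the kind of command involved; the tap name is hypothetical and the real implementation may differ:
def netem_cmd(tap: str, latency_ms: int, jitter_ms: int = 0, loss_pct: float = 0.0) -> list[str]:
    """Build a tc netem command roughly equivalent to a net_impair call."""
    cmd = ["tc", "qdisc", "add", "dev", tap, "root", "netem", "delay", f"{latency_ms}ms"]
    if jitter_ms:
        cmd.append(f"{jitter_ms}ms")  # jitter rides along with the delay value
    if loss_pct:
        cmd += ["loss", f"{loss_pct}%"]
    return cmd

# net_impair(..., source_node="worker-0", latency_ms=150, jitter_ms=20) roughly maps to:
print(" ".join(netem_cmd("vnet-worker0", 150, 20)))  # tap name is illustrative
# -> tc qdisc add dev vnet-worker0 root netem delay 150ms 20ms
# Applying it requires root, e.g. subprocess.run(netem_cmd(...), check=True).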
Kernel Observability
| Tool | Description |
|---|---|
| `dmesg_mark(cluster_name, nodes)` | Write a shared marker into /dev/kmsg on target nodes. |
| `dmesg_collect(cluster_name, nodes, since_marker, filter_level)` | Collect and classify dmesg lines (all, warn+, err+). |
Return types
ClusterStatus — returned by cluster_define, cluster_start, cluster_stop, cluster_destroy, cluster_status:
{
"cluster_name": "example-3node",
"network_active": true,
"nodes": [
{
"name": "controller",
"role": "control",
"ip": "192.168.100.10",
"domain_state": "running",
"ssh_reachable": true
}
]
}
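Because the payload is typed, callers can gate on it without parsing free text. For example, a quick readiness check over the ClusterStatus payload above (a sketch):
# `status` holds the ClusterStatus payload shown above, already parsed from JSON.
status = {
    "cluster_name": "example-3node",
    "network_active": True,
    "nodes": [{"name": "controller", "role": "control", "ip": "192.168.100.10",
               "domain_state": "running", "ssh_reachable": True}],
}
ready = status["network_active"] and all(
    n["domain_state"] == "running" and n["ssh_reachable"] for n in status["nodes"]
)
print("cluster ready:", ready)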
ClusterHandle — returned by cluster_handle:
{
"cluster_name": "example-3node",
"artifact_path": "/opt/vmcluster/artifacts",
"kernel_version": "6.8.0-51-generic",
"nodes": [
{
"name": "controller",
"role": "control",
"ip": "192.168.100.10",
"ssh_port": 22,
"ssh_user": "root",
"ssh_key_path": "/etc/vmcluster/ssh/vmcluster_id_ed25519"
}
]
}
Most non-lifecycle tools follow the same envelope with their own typed result
payload (for example ExecResult, SnapshotInfo, NetFaultInfo, SyncStatus).
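The handle is all a downstream consumer needs for direct shell access; the server is out of the loop from there. A minimal sketch using the system `ssh` client and the handle fields shown above:
import subprocess

# Field values taken from the ClusterHandle example above.
node = {
    "ip": "192.168.100.10",
    "ssh_port": 22,
    "ssh_user": "root",
    "ssh_key_path": "/etc/vmcluster/ssh/vmcluster_id_ed25519",
}
out = subprocess.run(
    ["ssh", "-i", node["ssh_key_path"], "-p", str(node["ssh_port"]),
     "-o", "StrictHostKeyChecking=no",
     f'{node["ssh_user"]}@{node["ip"]}', "uname -r"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # e.g. 6.8.0-51-generic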
Canonical Agent Workflow
# 1. Define the cluster (idempotent — safe to call multiple times)
cluster_define("example-3node")
# 2. Start all nodes and wait for SSH
cluster_start("example-3node", wait_for_ssh=True)
# 3. Get cluster handle for downstream SSH use
handle = cluster_handle("example-3node")
# 4. Check status at any time
cluster_status("example-3node")
# 5. Graceful shutdown
cluster_stop("example-3node", mode="shutdown")
# 6. Full teardown (remove overlays too)
cluster_destroy("example-3node", remove_overlays=True)
Extended workflow (artifacts + faults + observability, pseudo-notation)
The flow below shows the intended sequence of tool calls.
# Register and deploy build artifacts
artifact_id = artifact_register("/path/to/build/tree").result.artifact_id
artifact_sync("example-3node", artifact_id)
artifact_install("example-3node", artifact_id)
# Add a network impairment and inspect active faults
fault = net_impair("example-3node", source_node="worker-0", latency_ms=150)
net_fault_list("example-3node")
# Mark and collect dmesg around your test window
markers = dmesg_mark("example-3node")
dmesg_collect("example-3node", since_marker=markers["worker-0"], filter_level="warn+")
# Heal injected faults
net_heal("example-3node", fault.result.handle_id)
Troubleshooting
cluster_define fails creating overlays
- Ensure base image path in topology exists and is readable.
- Validate host tool availability: `qemu-img --version`.
- Confirm the overlay directory is writable by the user running the MCP server.
SSH timeouts in cluster_start or snapshot_revert
- Confirm cloud-init configured the static IPs expected by the topology.
- Verify the key/user pair: `VMCLUSTER_SSH_KEY_PATH`, `VMCLUSTER_SSH_USER`.
- Increase `ssh_timeout_s` for cold boots.
Fault tools fail (iptables/tc errors)
- Ensure the MCP process has required privileges for host networking commands.
- Confirm `iptables` and `tc` are installed and executable.
- Validate that the libvirt bridge name in the topology matches the active host interface.
artifact_sync or artifact_install partial failures
- Use `node_exec_all(..., command="df -h")` to verify remote disk space.
- Verify SSH connectivity and remote path permissions under `dest_base`.
- Re-run with a narrowed `nodes=[...]` to isolate problematic hosts.
Snapshot delete blocked
- `snapshot_delete` refuses to remove active backing snapshots by design.
- Revert or switch the active disk chain first, then delete the snapshot.
Useful host checks
virsh list --all
virsh net-list --all
ip -br link
sudo iptables -S | head
sudo tc qdisc show
Development
# Clone and install with dev dependencies
git clone https://github.com/chompinbits/vmcluster-mcp.git
cd vmcluster-mcp
uv venv && uv pip install -e '.[dev]'
# Run tests
.venv/bin/pytest
# Lint
.venv/bin/ruff check vmcluster_mcp/
# Run the server directly (stdio mode)
.venv/bin/python -m vmcluster_mcp
Project structure
vmcluster_mcp/
cluster/ # Cluster lifecycle tools (define, start, stop, destroy, status, handle, crash)
libvirt_client.py # Thread-safe async libvirt wrapper
domain_builder.py # KVM domain XML generation
network_builder.py # libvirt NAT network XML generation
cloud_init.py # cloud-init NoCloud ISO generation
exec/ # Remote command execution tools (node_exec, node_exec_all)
ssh.py # SSH client and connection pool management
snapshot/ # Snapshot tools (create, list, revert, delete)
manager.py # Snapshot operations
artifact/ # Artifact tools (register, list, diff, sync, install)
installer.py # Remote artifact installation
registry.py # Content-addressed artifact registry
syncer.py # rsync-based artifact synchronization
net/ # Network fault tools (partition, impair, heal, list)
fault_registry.py # Persistent fault registry
fault.py # iptables/tc fault implementation
observe/ # Kernel observability tools (dmesg_mark, dmesg_collect)
classifier.py # dmesg line classification
dmesg.py # dmesg collection and parsing
topology/ # Topology YAML parsing and schema
parser.py # Topology loader
schema.py # Topology models
models.py # Shared Pydantic models (ToolResult, ClusterStatus, ClusterHandle, …)
config.py # Configuration loading (YAML + env vars)
server.py # FastMCP server instance and structured_tool_handler