

vmcluster-mcp

An MCP server for autonomous multi-VM cluster orchestration on libvirt/QEMU. Manages the full lifecycle of KVM virtual machine clusters — provisioning, starting, stopping, snapshotting, SSH execution, artifact distribution, and fault injection — through a structured tool interface designed for AI agents.

Overview

vmcluster-mcp is a general-purpose MCP server. It manages clusters of KVM/QEMU virtual machines and produces a ClusterHandle — a typed descriptor passed to downstream consumers for direct SSH access. The server has no knowledge of what runs inside VMs; it knows nodes, networks, snapshots, and artifacts.

Design principles:

  • Topology-as-data — cluster shape is declared in a YAML file, not constructed imperatively
  • Structured outputs — all tools return typed Pydantic models serialized as JSON; no free-text parsing
  • Stateless server — all persistent state lives in libvirt and on disk; safe to restart at any time
  • Idempotent operations — cluster_define and related tools are safe to call multiple times

Prerequisites

  • Linux host with KVM/QEMU and libvirt installed (libvirtd running)
  • Python 3.11+
  • qemu-img available in PATH
  • genisoimage or mkisofs for cloud-init ISO generation
  • iptables, tc (from iproute2), and rsync for fault/artifact tools
  • Permission to run libvirt and host network commands (sudo access is usually required)
  • uv (recommended) or pip for installation

Typical package set on Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y \
  qemu-kvm libvirt-daemon-system libvirt-clients \
  qemu-utils cloud-image-utils genisoimage \
  iproute2 iptables rsync
# Verify libvirt access
virsh list --all

# Verify qemu-img
qemu-img --version

# Verify tc and iptables
tc -V
iptables --version

Installation

From PyPI (recommended)

Install the latest release:

pip install vmcluster-mcp

Or run directly without installing (via uv):

uvx vmcluster-mcp

Prerequisite: libvirt-python requires system-level development headers. Install them before running pip install:

# Ubuntu/Debian
sudo apt-get install -y libvirt-dev pkg-config gcc

# Fedora/RHEL
sudo dnf install -y libvirt-devel pkgconf-pkg-config gcc

# Arch Linux
sudo pacman -S libvirt pkgconf gcc

From source (development)

git clone https://github.com/chompinbits/vmcluster-mcp.git
cd vmcluster-mcp

# Create virtual environment and install
uv venv
uv pip install -e .

Quick Start (First 15 Minutes)

This path is for first-time setup on a single Linux host.

  1. Create required directories and SSH key:
sudo mkdir -p /etc/vmcluster/topologies /etc/vmcluster/ssh
sudo mkdir -p /var/lib/vmcluster/{overlays,artifacts/trees,faults}
sudo ssh-keygen -t ed25519 -f /etc/vmcluster/ssh/vmcluster_id_ed25519 -N ""
  2. Create /etc/vmcluster/config.yaml:
topology_dir: /etc/vmcluster/topologies
overlay_dir: /var/lib/vmcluster/overlays
artifact_registry: /var/lib/vmcluster/artifacts/registry.json
artifact_store_dir: /var/lib/vmcluster/artifacts/trees
fault_registry: /var/lib/vmcluster/faults/registry.json
ssh_key_path: /etc/vmcluster/ssh/vmcluster_id_ed25519
ssh_user: root
libvirt_uri: qemu:///system
log_level: INFO
  3. Prepare a base image used by the example topology:
sudo mkdir -p /var/lib/vmcluster/images
sudo wget -O /tmp/ubuntu-24.04-server-cloudimg-amd64.img \
  https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img
sudo qemu-img convert -f qcow2 -O qcow2 \
  /tmp/ubuntu-24.04-server-cloudimg-amd64.img \
  /var/lib/vmcluster/images/ubuntu-6.8-base.qcow2
sudo qemu-img info /var/lib/vmcluster/images/ubuntu-6.8-base.qcow2
  4. Add your first topology file in /etc/vmcluster/topologies/ (see example below).

  5. Run the server locally to verify it starts:

VMCLUSTER_CONFIG=/etc/vmcluster/config.yaml .venv/bin/python -m vmcluster_mcp
  6. Connect from your MCP client (VS Code or Claude) and run this smoke flow:
cluster_define("example-3node")
cluster_start("example-3node", wait_for_ssh=True)
cluster_status("example-3node")
node_exec("example-3node", "controller", "uname -r")
snapshot_create("example-3node", "baseline")
cluster_stop("example-3node")
  7. Clean up when finished:
cluster_destroy("example-3node", remove_overlays=True)

Configuration

The server can be configured via a YAML file and/or environment variables. Environment variables take precedence over the config file, which takes precedence over defaults.

Config file

Default location: /etc/vmcluster/config.yaml. Override with VMCLUSTER_CONFIG env var.

# /etc/vmcluster/config.yaml

topology_dir: /etc/vmcluster/topologies      # Where topology YAML files live
overlay_dir: /var/lib/vmcluster/overlays     # Where per-node qcow2 overlays are created
artifact_registry: /var/lib/vmcluster/artifacts/registry.json
artifact_store_dir: /var/lib/vmcluster/artifacts/trees
fault_registry: /var/lib/vmcluster/faults/registry.json
ssh_key_path: /etc/vmcluster/ssh/vmcluster_id_ed25519
ssh_user: root
libvirt_uri: qemu:///system
log_level: INFO

Environment variables

Each variable overrides the corresponding config key; defaults are shown in parentheses:

  • VMCLUSTER_CONFIG — path to the config file itself (default: /etc/vmcluster/config.yaml)
  • VMCLUSTER_TOPOLOGY_DIR — topology_dir (default: /etc/vmcluster/topologies)
  • VMCLUSTER_OVERLAY_DIR — overlay_dir (default: /var/lib/vmcluster/overlays)
  • VMCLUSTER_ARTIFACT_REGISTRY — artifact_registry (default: /var/lib/vmcluster/artifacts/registry.json)
  • VMCLUSTER_ARTIFACT_STORE_DIR — artifact_store_dir (default: /var/lib/vmcluster/artifacts/trees)
  • VMCLUSTER_FAULT_REGISTRY — fault_registry (default: /var/lib/vmcluster/faults/registry.json)
  • VMCLUSTER_SSH_KEY_PATH — ssh_key_path (default: /etc/vmcluster/ssh/vmcluster_id_ed25519)
  • VMCLUSTER_SSH_USER — ssh_user (default: root)
  • VMCLUSTER_LIBVIRT_URI — libvirt_uri (default: qemu:///system)
  • VMCLUSTER_LOG_LEVEL — log_level (default: INFO)
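
As a reading of the precedence rule above (environment variable over config file over built-in default), here is a minimal resolution sketch; the helper name and structure are illustrative, not the server's actual loader in config.py:

import os
import yaml

DEFAULTS = {"topology_dir": "/etc/vmcluster/topologies", "log_level": "INFO"}

def resolve(key: str, env_var: str, config_path: str | None = None) -> str:
    """Resolve one setting: environment variable > config file > built-in default."""
    if env_var in os.environ:
        return os.environ[env_var]
    config_path = config_path or os.environ.get("VMCLUSTER_CONFIG", "/etc/vmcluster/config.yaml")
    if os.path.exists(config_path):
        with open(config_path) as f:
            file_cfg = yaml.safe_load(f) or {}
        if key in file_cfg:
            return str(file_cfg[key])
    return DEFAULTS[key]

print(resolve("topology_dir", "VMCLUSTER_TOPOLOGY_DIR"))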

Quick setup

# Create directories
sudo mkdir -p /etc/vmcluster/topologies /etc/vmcluster/ssh
sudo mkdir -p /var/lib/vmcluster/{overlays,artifacts/trees,faults}

# Generate SSH key for VM access
sudo ssh-keygen -t ed25519 -f /etc/vmcluster/ssh/vmcluster_id_ed25519 -N ""

Topology Files

Topology files are YAML files placed in topology_dir. The agent references topologies by filename (without .yaml).

# /etc/vmcluster/topologies/example-3node.yaml

cluster_name: example-3node
base_image: /var/lib/vmcluster/images/ubuntu-6.8-base.qcow2
overlay_dir: /var/lib/vmcluster/overlays/

network:
  name: clusternet-example
  bridge: virbr-example0
  subnet: 192.168.100.0/24

nodes:
  - name: controller
    role: control
    vcpus: 2
    memory_mb: 2048
    ip: 192.168.100.10
    extra_disks:
      - path: /var/lib/vmcluster/disks/data0.qcow2
        size_gb: 20
        bus: virtio

  - name: worker-0
    role: worker
    vcpus: 2
    memory_mb: 2048
    ip: 192.168.100.11

  - name: client-0
    role: client
    vcpus: 2
    memory_mb: 1024
    ip: 192.168.100.20

ssh:
  key_path: /etc/vmcluster/ssh/vmcluster_id_ed25519
  user: root
  connect_timeout_s: 30

snapshots:
  baseline: clean-boot   # Logical name for snapshot_revert("baseline")

Node IPs are configured statically via cloud-init — no DHCP is used. Each node gets a NoCloud ISO injected at first boot. The libvirt bridge named under network.bridge is created when cluster_define defines the topology network, so it does not need to pre-exist on the host.
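
For reference, a NoCloud seed ISO of the kind described above can be produced with a short script. This is a hedged sketch of the general technique only — the file contents, interface name, and output path are illustrative, and it is not the server's cloud_init.py implementation:

import subprocess
from pathlib import Path

def build_seed_iso(workdir: Path, hostname: str, ip: str, gateway: str, ssh_pubkey: str) -> Path:
    """Write NoCloud user-data/meta-data/network-config and pack them into a 'cidata' ISO."""
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / "meta-data").write_text(f"instance-id: {hostname}\nlocal-hostname: {hostname}\n")
    (workdir / "user-data").write_text(
        "#cloud-config\n"
        "ssh_authorized_keys:\n"
        f"  - {ssh_pubkey}\n"
    )
    (workdir / "network-config").write_text(
        "version: 2\n"
        "ethernets:\n"
        "  eth0:\n"
        "    dhcp4: false\n"
        f"    addresses: [{ip}/24]\n"
        f"    gateway4: {gateway}\n"
    )
    iso = workdir / f"{hostname}-seed.iso"
    # genisoimage with volume id "cidata" is what cloud-init's NoCloud datasource expects
    subprocess.run(
        ["genisoimage", "-output", str(iso), "-volid", "cidata", "-joliet", "-rock",
         str(workdir / "user-data"), str(workdir / "meta-data"), str(workdir / "network-config")],
        check=True,
    )
    return iso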


Integration: VS Code (GitHub Copilot)

Add the server to your VS Code MCP configuration. Open Settings → MCP or edit .vscode/mcp.json in your workspace (or ~/.vscode/mcp.json globally).

If you installed from source into a virtualenv (recommended):

{
  "servers": {
    "vmcluster-mcp": {
      "type": "stdio",
      "command": "/path/to/vmcluster-mcp/.venv/bin/python",
      "args": ["-m", "vmcluster_mcp"],
      "env": {
        "VMCLUSTER_CONFIG": "/etc/vmcluster/config.yaml"
      }
    }
  }
}

If you prefer ephemeral launch with uv run --with:

{
  "servers": {
    "vmcluster-mcp": {
      "type": "stdio",
      "command": "uv",
      "args": [
        "run",
        "--with", "git+https://github.com/chompinbits/vmcluster-mcp.git",
        "python", "-m", "vmcluster_mcp"
      ],
      "env": {
        "VMCLUSTER_TOPOLOGY_DIR": "/etc/vmcluster/topologies",
        "VMCLUSTER_OVERLAY_DIR": "/var/lib/vmcluster/overlays",
        "VMCLUSTER_SSH_KEY_PATH": "/etc/vmcluster/ssh/vmcluster_id_ed25519",
        "VMCLUSTER_LIBVIRT_URI": "qemu:///system"
      }
    }
  }
}

After saving, restart the MCP server from the VS Code MCP panel. The tools will appear in Copilot Chat under the vmcluster-mcp server.


Integration: Claude CLI

claude (Anthropic Claude CLI / Claude Desktop)

Add to ~/.claude/claude_desktop_config.json (Claude Desktop) or ~/.config/claude/config.json (Claude CLI):

{
  "mcpServers": {
    "vmcluster-mcp": {
      "command": "python",
      "args": ["-m", "vmcluster_mcp"],
      "env": {
        "VMCLUSTER_CONFIG": "/etc/vmcluster/config.yaml"
      }
    }
  }
}

If using a virtualenv:

{
  "mcpServers": {
    "vmcluster-mcp": {
      "command": "/path/to/vmcluster-mcp/.venv/bin/python",
      "args": ["-m", "vmcluster_mcp"],
      "env": {
        "VMCLUSTER_CONFIG": "/etc/vmcluster/config.yaml"
      }
    }
  }
}

For the claude CLI (interactive terminal), register the server persistently with the mcp subcommand:

claude mcp add vmcluster-mcp -- python -m vmcluster_mcp

Verify the server is loaded:

claude mcp list

Available Tools

All tools return ToolResult[T] — a structured JSON object with success: bool, result: T | null, and error: { code, message, recoverable } | null.
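
Since tool outputs are typed Pydantic models, the envelope maps naturally onto a generic model. A sketch of the shape, assuming exactly the field names described above (not necessarily the classes defined in models.py):

from typing import Generic, Optional, TypeVar
from pydantic import BaseModel

T = TypeVar("T")

class ToolError(BaseModel):
    code: str
    message: str
    recoverable: bool

class ToolResult(BaseModel, Generic[T]):
    # success=False implies error is populated and result is null
    success: bool
    result: Optional[T] = None
    error: Optional[ToolError] = None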

Cluster Lifecycle and Recovery

  • cluster_define(topology_name) — Provision a cluster from a topology file: create the network, per-node overlay disks, cloud-init ISOs, and libvirt domain definitions. Idempotent.
  • cluster_start(cluster_name, wait_for_ssh, ssh_timeout_s) — Boot all stopped nodes. Optionally waits for SSH on all nodes (strict: one failure means success=False).
  • cluster_stop(cluster_name, mode) — Stop all running nodes. mode="shutdown" (ACPI) or mode="destroy" (force-off).
  • cluster_destroy(cluster_name, remove_overlays) — Undefine all domains and destroy the network. Optionally delete overlay disk files.
  • cluster_status(cluster_name) — Return per-node domain state and SSH reachability. SSH is checked in parallel, and only for running nodes.
  • cluster_handle(cluster_name) — Return a ClusterHandle with node SSH descriptors, artifact_path, and kernel_version (fetched via SSH). Requires a running cluster.
  • node_crash(cluster_name, node, restart_after, wait_for_ssh, ssh_timeout_s) — Simulate an unclean node failure (virsh destroy) and optionally restart and wait for SSH.
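
For example, an unclean failure and recovery of a single node, written in the same pseudo-notation used by the workflow sections below (cluster and node names come from the example topology):

# Kill worker-0 uncleanly, bring it back, and confirm the cluster is healthy again
node_crash("example-3node", "worker-0", restart_after=True, wait_for_ssh=True)
cluster_status("example-3node")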

Remote Command Execution

  • node_exec(cluster_name, node_name, command, timeout_s) — Run a command on one node and return structured stdout/stderr/exit metadata.
  • node_exec_all(cluster_name, command, nodes, require_all, timeout_s) — Run a command on many nodes in parallel with per-node results and a failure map.
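
For example, a fan-out over a subset of nodes in the same pseudo-notation (the command string is arbitrary):

# Assumed semantics: with require_all=False, per-node failures land in the failure map
# without failing the whole call
node_exec_all("example-3node", "uname -r", nodes=["worker-0", "client-0"], require_all=False, timeout_s=30)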

Snapshot Management

  • snapshot_create(cluster_name, snapshot_name, include_memory) — Create disk snapshots for all nodes in the cluster.
  • snapshot_list(cluster_name) — List snapshots with per-node disk metadata.
  • snapshot_revert(cluster_name, snapshot_name, restart_after, wait_for_ssh, ssh_timeout_s) — Revert all nodes to a named snapshot and optionally restart and verify SSH.
  • snapshot_delete(cluster_name, snapshot_name) — Delete a named snapshot across all nodes (best effort, with per-node status).
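
A typical checkpoint/rollback loop around a destructive test, again in pseudo-notation:

snapshot_create("example-3node", "pre-test")
# ... run the destructive experiment ...
snapshot_revert("example-3node", "pre-test", restart_after=True, wait_for_ssh=True)
snapshot_delete("example-3node", "pre-test")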

Artifact Management

  • artifact_register(source_path, build_type, kernel_version, metadata) — Register a local build tree and get a content-addressed artifact id.
  • artifact_list() — List registered artifacts.
  • artifact_diff(artifact_id_a, artifact_id_b) — Diff modules/binaries between two artifacts.
  • artifact_sync(cluster_name, artifact_id, nodes, force, dest_base) — Sync artifact content to target nodes over SSH/rsync.
  • artifact_install(cluster_name, artifact_id, nodes, install_mode, dest_base) — Install synced artifacts on nodes with structured per-node install status.
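
"Content-addressed" here means the artifact id is derived from the contents of the registered tree, so identical trees map to the same id. A minimal sketch of how such an id could be computed (illustrative only, not registry.py's actual scheme):

import hashlib
from pathlib import Path

def content_id(tree: Path) -> str:
    """Hash every file's relative path and bytes in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in tree.rglob("*") if p.is_file()):
        digest.update(str(path.relative_to(tree)).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()[:16]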

Network Fault Injection

  • net_partition(cluster_name, partition_a, partition_b) — Insert symmetric iptables partition rules between node groups.
  • net_impair(cluster_name, source_node, target_node, latency_ms, jitter_ms, loss_pct, corrupt_pct, reorder_pct) — Apply tc netem impairment on a source node's tap interface.
  • net_heal(cluster_name, fault_handle) — Remove a specific fault and deregister its handle.
  • net_heal_all(cluster_name) — Remove all active faults for a cluster.
  • net_fault_list(cluster_name) — List all active fault handles and parameters from the fault registry.
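
The extended workflow further down shows net_impair; a partition and its cleanup look like this in the same pseudo-notation (node names from the example topology):

# Isolate the controller from the worker and client nodes, then remove all active faults
net_partition("example-3node", partition_a=["controller"], partition_b=["worker-0", "client-0"])
net_heal_all("example-3node")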

Kernel Observability

  • dmesg_mark(cluster_name, nodes) — Write a shared marker into /dev/kmsg on target nodes.
  • dmesg_collect(cluster_name, nodes, since_marker, filter_level) — Collect and classify dmesg lines (all, warn+, err+).

Return types

ClusterStatus — returned by cluster_define, cluster_start, cluster_stop, cluster_destroy, cluster_status:

{
  "cluster_name": "example-3node",
  "network_active": true,
  "nodes": [
    {
      "name": "controller",
      "role": "control",
      "ip": "192.168.100.10",
      "domain_state": "running",
      "ssh_reachable": true
    }
  ]
}

ClusterHandle — returned by cluster_handle:

{
  "cluster_name": "example-3node",
  "artifact_path": "/opt/vmcluster/artifacts",
  "kernel_version": "6.8.0-51-generic",
  "nodes": [
    {
      "name": "controller",
      "role": "control",
      "ip": "192.168.100.10",
      "ssh_port": 22,
      "ssh_user": "root",
      "ssh_key_path": "/etc/vmcluster/ssh/vmcluster_id_ed25519"
    }
  ]
}

Most non-lifecycle tools follow the same envelope with their own typed result payload (for example ExecResult, SnapshotInfo, NetFaultInfo, SyncStatus).
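
A downstream consumer can use the ClusterHandle fields directly for SSH. A minimal sketch, assuming the handle JSON above has been saved to a local file (the file path and command are illustrative):

import json
import subprocess

with open("handle.json") as f:
    handle = json.load(f)

# Pick a node by name and run a command over SSH using the handle's connection fields
node = next(n for n in handle["nodes"] if n["name"] == "controller")
subprocess.run(
    ["ssh",
     "-i", node["ssh_key_path"],
     "-p", str(node["ssh_port"]),
     "-o", "StrictHostKeyChecking=no",
     f"{node['ssh_user']}@{node['ip']}",
     "uname -r"],
    check=True,
)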


Canonical Agent Workflow

# 1. Define the cluster (idempotent — safe to call multiple times)
cluster_define("example-3node")

# 2. Start all nodes and wait for SSH
cluster_start("example-3node", wait_for_ssh=True)

# 3. Get cluster handle for downstream SSH use
handle = cluster_handle("example-3node")

# 4. Check status at any time
cluster_status("example-3node")

# 5. Graceful shutdown
cluster_stop("example-3node", mode="shutdown")

# 6. Full teardown (remove overlays too)
cluster_destroy("example-3node", remove_overlays=True)

Extended workflow (artifacts + faults + observability, pseudo-notation)

The flow below shows the intended sequence of tool calls.

# Register and deploy build artifacts
artifact_id = artifact_register("/path/to/build/tree").result.artifact_id
artifact_sync("example-3node", artifact_id)
artifact_install("example-3node", artifact_id)

# Add a network impairment and inspect active faults
fault = net_impair("example-3node", source_node="worker-0", latency_ms=150)
net_fault_list("example-3node")

# Mark and collect dmesg around your test window
markers = dmesg_mark("example-3node")
dmesg_collect("example-3node", since_marker=markers["worker-0"], filter_level="warn+")

# Heal injected faults
net_heal("example-3node", fault.result.handle_id)

Troubleshooting

cluster_define fails creating overlays

  • Ensure base image path in topology exists and is readable.
  • Validate host tool availability: qemu-img --version.
  • Confirm overlay directory is writable by the user running the MCP server.

SSH timeouts in cluster_start or snapshot_revert

  • Confirm cloud-init configured the static IPs expected by the topology.
  • Verify key/user pair: VMCLUSTER_SSH_KEY_PATH, VMCLUSTER_SSH_USER.
  • Increase ssh_timeout_s for cold boots.

Fault tools fail (iptables/tc errors)

  • Ensure the MCP process has required privileges for host networking commands.
  • Confirm iptables and tc are installed and executable.
  • Validate libvirt bridge name in topology matches the active host interface.

artifact_sync or artifact_install partial failures

  • Use node_exec_all(..., command="df -h") to verify remote disk space.
  • Verify SSH connectivity and remote path permissions under dest_base.
  • Re-run with narrowed nodes=[...] to isolate problematic hosts.

Snapshot delete blocked

  • snapshot_delete refuses to remove active backing snapshots by design.
  • Revert or switch active disk chain first, then delete snapshot.

Useful host checks

virsh list --all
virsh net-list --all
ip -br link
sudo iptables -S | head
sudo tc qdisc show

Development

# Clone and install with dev dependencies
git clone https://github.com/chompinbits/vmcluster-mcp.git
cd vmcluster-mcp
uv venv && uv pip install -e '.[dev]'

# Run tests
.venv/bin/pytest

# Lint
.venv/bin/ruff check vmcluster_mcp/

# Run the server directly (stdio mode)
.venv/bin/python -m vmcluster_mcp

Project structure

vmcluster_mcp/
  cluster/          # Cluster lifecycle tools (define, start, stop, destroy, status, handle, crash)
    libvirt_client.py   # Thread-safe async libvirt wrapper
    domain_builder.py   # KVM domain XML generation
    network_builder.py  # libvirt NAT network XML generation
    cloud_init.py       # cloud-init NoCloud ISO generation
  exec/             # Remote command execution tools (node_exec, node_exec_all)
    ssh.py          # SSH client and connection pool management
  snapshot/         # Snapshot tools (create, list, revert, delete)
    manager.py      # Snapshot operations
  artifact/         # Artifact tools (register, list, diff, sync, install)
    installer.py    # Remote artifact installation
    registry.py     # Content-addressed artifact registry
    syncer.py       # rsync-based artifact synchronization
  net/              # Network fault tools (partition, impair, heal, list)
    fault_registry.py   # Persistent fault registry
    fault.py        # iptables/tc fault implementation
  observe/          # Kernel observability tools (dmesg_mark, dmesg_collect)
    classifier.py   # dmesg line classification
    dmesg.py        # dmesg collection and parsing
  topology/         # Topology YAML parsing and schema
    parser.py       # Topology loader
    schema.py       # Topology models
  models.py         # Shared Pydantic models (ToolResult, ClusterStatus, ClusterHandle, …)
  config.py         # Configuration loading (YAML + env vars)
  server.py         # FastMCP server instance and structured_tool_handler
