Server Guardian MCP
A server management MCP with 63 tools, 8 connection types, and 16 modules: log search, access log APM, SLO tracking, anomaly detection, auto-remediation playbooks, CIS benchmarks, CVE scanning, database monitoring, network monitoring, file integrity, live web dashboard, compliance reports, public status pages, team RBAC, and PagerDuty/Telegram/OpsGenie integrations, all through Claude. No agents. Just SSH.
"The AI SRE that lives in your terminal. SSH into any server, diagnose any problem, fix it automatically — all through a conversation with Claude. No agents. No SaaS bills. No PromQL."
Live Dashboard
python -m server_guardian_mcp dashboard # start on port 8080
python -m server_guardian_mcp dashboard --port 9090
Real-time web UI with auto-refresh every 30 seconds. Dark theme, Chart.js charts for CPU/memory/disk trends, active alerts feed, incident timeline.
Why Server Guardian?
| What you say to Claude | What happens |
|---|---|
| "Is my server okay?" | SSH in, check CPU/RAM/disk/temp, detect anomalies vs baseline |
| "Why is production slow?" | Check processes, disk, logs, access log APM, identify the bottleneck |
| "Search logs for OOM errors" | Index logs in SQLite, search with pattern detection, show error rates |
| "Show me endpoint latency" | Parse nginx access logs: p50/p95/p99 latency, error rates, slowest endpoints |
| "Are we meeting our SLOs?" | Track uptime/latency/error targets, calculate error budget remaining |
| "What happened overnight?" | Generate incident narrative from alerts, service events, playbook runs |
| "Fix it automatically" | Run playbooks: clear disk, restart services, renew SSL certs |
| "Run a security audit" | 61 CIS benchmark checks + CVE scan + rootkit detection + FIM |
| "Generate a compliance report" | Branded HTML report with score (A-F) for SOC2/ISO prep |
| "How's the database?" | Slow query analysis, connection counts, replication lag, table sizes |
| "Am I overpaying?" | Rightsizing analysis: "CPU at 0.4%, memory at 7.7%: downsize to save 50%" |
| "What connects to what?" | Map service dependencies from active network connections |
| "Write the postmortem" | Auto-generate structured postmortem from incident timeline |
| "Create a status page" | Public-facing uptime page for customers (replaces $29/mo tools) |
Benchmarks vs Alternatives
| Feature | Server Guardian | ssh-mcp | mcp-ssh-manager | HomeButler |
|---|---|---|---|---|
| Total tools | 63 | 2 | 37 | 20 |
| Connection types | 8 | 1 | 1 | 1 |
| Log search + pattern detection | Yes | - | - | - |
| Access log APM (p50/p95/p99) | Yes | - | - | - |
| SLO tracking + error budgets | Yes | - | - | - |
| Smart anomaly detection | Yes | - | - | - |
| Auto-remediation playbooks | Yes | - | - | - |
| CIS benchmark (61 checks) | Yes | - | - | - |
| CVE scanning + rootkit detection | Yes | - | - | - |
| File integrity monitoring | Yes | - | - | - |
| Database monitoring (MySQL/PG) | Yes | - | - | - |
| Network bandwidth monitoring | Yes | - | - | - |
| Service dependency mapping | Yes | - | - | - |
| Root cause correlation | Yes | - | - | - |
| Resource rightsizing | Yes | - | - | - |
| Multi-step API tests | Yes | - | - | - |
| Maintenance windows | Yes | - | - | - |
| Public status page | Yes | - | - | - |
| AI postmortem generation | Yes | - | - | - |
| Live web dashboard (Chart.js) | Yes | - | - | - |
| Compliance report (SOC2/ISO) | Yes | - | - | - |
| Team RBAC (admin/operator/viewer) | Yes | - | - | - |
| PagerDuty / Telegram / OpsGenie | Yes | - | - | - |
| Background watchdog daemon | Yes | - | - | Yes |
| Email / Slack / Discord alerts | Yes | - | - | Yes |
| Multi-cloud (AWS/GCP/Azure) | Yes | - | - | - |
| Docker container management | Yes | - | Yes | Yes |
Quick Install
Claude Code (recommended)
claude mcp add server-guardian -- uvx server-guardian-mcp
pip
pip install server-guardian-mcp
claude mcp add server-guardian -- python -m server_guardian_mcp
From source
pip install -e .
claude mcp add server-guardian -- python -m server_guardian_mcp
Setup (2 minutes)
1. Create your .env
cp .env.example .env
2. Add your servers
# SSH (most common)
SERVER_PROD=ssh,203.0.113.10,22,deploy,key,~/.ssh/prod_key,Production
# Local machine
SERVER_LOCAL=local,,,,,My Machine
# Docker / Kubernetes / AWS SSM / GCP / Azure / WinRM also supported
3. Auto-discover existing servers
Tell Claude: "Discover my SSH servers". It reads ~/.ssh/config and prints ready-to-paste .env lines.
4. Add aliases (optional)
SERVER_ALIASES=prod:PROD,stg:STAGING,dev:DEV
All 63 Tools
Core Server Management (6)
| Tool | What it does |
|---|---|
| list_all_servers | Show all servers with online/offline status and latency |
| check_server_health | Full snapshot: CPU, RAM, disk, swap, temp, load, top processes, network |
| run_shell_commands | Run one or more shell commands on any server |
| run_shell_script | Run multi-line bash scripts with shared variables |
| fetch_system_logs | Fetch dmesg/syslog/journal/auth/nginx/custom logs with grep filter |
| list_running_processes | Processes sorted by CPU or memory, with name filter |
Service Management (5)
| Tool | What it does |
|---|---|
| manage_systemd_service | Start/stop/restart/enable/disable/status/logs for any systemd service |
| list_all_services | List ALL systemd services, filter by running/failed/inactive |
| find_failed_services | Find every crashed/failed service in one call |
| restart_failed_services | Bulk restart failed services: pass names or "ALL_FAILED" |
| watch_service_status | Quick is-active + is-enabled check for specific services |
Monitoring & Alerting (5)
| Tool | What it does |
|---|---|
| check_ssl_certificate | SSL cert expiry, chain, issuer for any domain (no SSH) |
| check_http_endpoint | HTTP status, response time, headers for any URL (no SSH) |
| monitor_server_health | Health check + store in SQLite + auto-alert on thresholds |
| monitor_endpoints | Check HTTP/SSL targets + store + alert on failures |
| get_active_alerts | Show unresolved alerts grouped by severity |
Log Search & APM (2)
| Tool | What it does |
|---|---|
| search_logs | Index logs in SQLite, search with pattern detection, extract error rates |
| analyze_access_logs | Nginx/Apache APM: per-endpoint p50/p95/p99 latency, error rates, throughput, top IPs |
SLO Tracking & Reporting (4)
| Tool | What it does |
|---|---|
| manage_slos | Define uptime/latency/error rate targets, track compliance, error budgets |
| generate_postmortem_tool | Structured incident postmortem from alerts, services, playbook data |
| generate_status_page_tool | Public-facing status page for customers (replaces Better Stack $29/mo) |
| get_weekly_report | Weekly health summary for email or team review |
Database Monitoring (2)
| Tool | What it does |
|---|---|
| query_database | Run SQL queries on MySQL, PostgreSQL, or SQLite on any server |
| monitor_database | Slow queries, connections, replication lag, table sizes (MySQL/PostgreSQL auto-detected) |
Network Monitoring (2)
| Tool | What it does |
|---|---|
| inspect_network | Listening ports, active connections, interfaces, DNS, routing |
| monitor_network | Bandwidth per interface, connection states, TCP retransmissions, throughput rates |
Security & Compliance (6)
| Tool | What it does |
|---|---|
| run_security_audit | 10-point security check (SSH, firewall, logins, updates, sudo) |
| run_cis_benchmark | 61 CIS Linux Benchmark checks across filesystem, network, SSH, PAM, logging |
| scan_vulnerabilities | CVE scanning (package versions), rootkit detection, crypto miner detection |
| check_file_integrity | FIM: hash critical files (/etc/passwd, sshd_config, etc.), detect unauthorized changes |
| manage_firewall | UFW/iptables: status, allow, deny, delete rules, enable/disable |
| generate_compliance_report_tool | Branded HTML report with score (A-F), suitable for SOC2/ISO |
Docker (2)
| Tool | What it does |
|---|---|
| list_docker_containers | Containers with CPU, memory, network, block I/O stats |
| fetch_docker_logs | Container logs with grep filter and time range |
Disk & Files (4)
| Tool | What it does |
|---|---|
| analyze_disk_usage | Find largest items, files >100MB, inode usage |
| read_remote_file | Read files on server (tail/head/all) with metadata |
| upload_file_to_server | SFTP upload with size verification |
| download_file_from_server | SFTP download |
Multi-Server (2)
| Tool | What it does |
|---|---|
| run_on_all_servers | Same commands on multiple servers; pass ["ALL"] for all |
| compare_across_servers | Spot config drift: same command, side-by-side results |
System Administration (4)
| Tool | What it does |
|---|---|
| manage_cron_jobs | List, add, remove cron jobs on any server |
| manage_users | List users, user info, add SSH keys, list keys, who is logged in |
| manage_packages | List/install/remove/upgrade packages (apt, yum, dnf, apk auto-detected) |
| manage_nginx | Status, list sites, show config, test, reload, restart, access/error logs |
Git Deploy (1)
| Tool | What it does |
|---|---|
| git_deploy | Status, pull, log, branch, switch, stash, diff on server git repos |
Discovery (1)
| Tool | What it does |
|---|---|
| discover_ssh_servers | Auto-discover servers from ~/.ssh/config with ready-to-paste .env lines |
Dashboard & Analytics (6)
| Tool | What it does |
|---|---|
| multi_server_dashboard | One-call summary of ALL servers: health, CPU, RAM, disk, failed services |
| get_monitoring_history | Query health trends, service events, endpoint checks from SQLite |
| get_incident_timeline | Chronological event log for a server |
| forecast_disk_usage | Predict when disk will be full based on growth rate |
| generate_html_dashboard | Self-contained HTML status page that opens in any browser |
| resolve_alert | Mark an alert as resolved |
Intelligence & Automation (3)
| Tool | What it does |
|---|---|
| detect_anomalies_tool | Statistical anomaly detection: flags metrics >2.5 sigma from baseline |
| replay_incident | Generate chronological narrative from alerts, service events, playbook runs |
| manage_playbooks | Auto-remediation: disk cleanup, service restart, SSL renewal, custom playbooks |
Team & Integrations (3)
| Tool | What it does |
|---|---|
| team_manage | RBAC user management: admin/operator/viewer roles with API keys |
| check_integrations | Status and test for PagerDuty, Telegram, OpsGenie |
| live_dashboard_info | How to start the live web dashboard and available API endpoints |
Advanced Operations (5)
| Tool | What it does |
|---|---|
| run_api_test_tool | Multi-step API tests with variable extraction and assertions |
| manage_maintenance_windows | Suppress alerts during planned work |
| get_rightsizing_recommendations | Identify over/under-provisioned resources to save costs |
| map_service_dependencies | Discover service topology from active network connections |
| analyze_root_cause | Correlate anomalies across metrics, services, alerts for root cause analysis |
Access Log APM
Tell Claude: "analyze access logs on PROD"
80% of APM value with zero agent install. Parses nginx/Apache access logs for:
- Per-endpoint latency percentiles (p50, p95, p99)
- Error rates (4xx, 5xx) per endpoint
- Throughput (requests per endpoint)
- Slowest endpoints ranked
- Status code breakdown
- Top IPs by request volume
- URL normalization (replaces IDs/UUIDs with placeholders)
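The latency and normalization bullets above can be sketched roughly as follows. This is a minimal illustration, not Server Guardian's actual parser: the function names (`normalize_url`, `percentile`, `summarize`), the placeholder strings, and the nearest-rank percentile method are all assumptions, and the input is taken to be already-parsed `(path, status, latency_seconds)` tuples.

```python
import re
from collections import defaultdict

# Hypothetical normalization rules: UUIDs and all-digit path segments are
# replaced with placeholders so /users/1 and /users/2 group together.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
ID_RE = re.compile(r"(?<=/)\d+(?=/|$)")


def normalize_url(path):
    """Drop the query string, then mask UUIDs and numeric IDs."""
    path = path.split("?", 1)[0]
    path = UUID_RE.sub("{uuid}", path)
    return ID_RE.sub("{id}", path)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a non-empty ascending list (pct: int 1..100)."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def summarize(requests):
    """Group requests by normalized endpoint and compute latency stats."""
    latencies = defaultdict(list)
    errors_5xx = defaultdict(int)
    for path, status, latency in requests:
        endpoint = normalize_url(path)
        latencies[endpoint].append(latency)
        if status >= 500:
            errors_5xx[endpoint] += 1
    report = {}
    for endpoint, values in latencies.items():
        values.sort()
        report[endpoint] = {
            "count": len(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
            "error_rate_5xx": errors_5xx[endpoint] / len(values),
        }
    return report
```

Normalizing before grouping is what keeps per-endpoint percentiles meaningful; without it, every distinct ID would show up as its own endpoint with a handful of samples.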
Log Search & Pattern Detection
Tell Claude: "search logs on PROD for OOM" or "show me log patterns"
- Fetches logs via SSH, indexes in SQLite for future searching
- Pattern detection — clusters similar log lines, shows frequency
- Error rate extraction (log-to-metrics)
- Supports journal, syslog, auth, nginx, or any custom log path
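One common way to cluster similar log lines is to mask the variable parts (numbers, IPs, hex values) and count the resulting templates. The sketch below shows that idea only; the masking rules and the names `to_pattern` and `top_patterns` are assumptions, and the tool's own clustering may work differently.

```python
import re
from collections import Counter

# Hypothetical masking rules, applied in order: IPs first so their
# octets are not masked individually by the generic number rule.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def to_pattern(line):
    """Replace variable tokens with placeholders to get a line template."""
    for regex, placeholder in _MASKS:
        line = regex.sub(placeholder, line)
    return line


def top_patterns(lines, limit=5):
    """Return the most frequent templates as (pattern, count) pairs."""
    counts = Counter(to_pattern(l.strip()) for l in lines if l.strip())
    return counts.most_common(limit)
```

For example, two OOM-killer lines that differ only in PID collapse into one pattern with a count of 2.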
SLO Tracking & Error Budgets
Tell Claude: "create an SLO for 99.9% uptime on PROD"
Tell Claude: "show me SLO status"
- Define uptime, latency, or error rate targets
- Track compliance from stored health/endpoint data
- Calculate error budget remaining and burn rate
- Configurable measurement windows (7d, 30d, 90d)
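The error budget arithmetic behind these bullets is small enough to show directly. The sketch below assumes an availability-style SLO measured over a count of health checks; the function name `slo_status` and the returned keys are hypothetical, and burn rate is taken here as observed failure rate divided by allowed failure rate.

```python
def slo_status(target_pct, total_checks, failed_checks):
    """Compute error budget figures for an availability-style SLO.

    target_pct: SLO target such as 99.9 (must be strictly between 0 and 100).
    total_checks / failed_checks: counts over the measurement window.
    """
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be between 0 and 100 (exclusive)")
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")

    allowed_failure_rate = 1 - target_pct / 100
    observed_failure_rate = failed_checks / total_checks
    allowed_failures = total_checks * allowed_failure_rate

    return {
        "availability_pct": 100 * (1 - observed_failure_rate),
        # Negative once the budget is overspent.
        "budget_remaining_pct": 100 * (1 - failed_checks / allowed_failures),
        # 1.0 means the budget is being consumed exactly as fast as allowed.
        "burn_rate": observed_failure_rate / allowed_failure_rate,
        "meeting_slo": observed_failure_rate <= allowed_failure_rate,
    }
```

With a 99.9% target and 43,200 one-minute checks (30 days), about 43 failed checks are allowed; 10 failures leave roughly 77% of the budget.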
CIS Benchmark & Vulnerability Scanning
Tell Claude: "run CIS benchmark on PROD"
Tell Claude: "scan for vulnerabilities on PROD"
- 61 CIS Linux Benchmark checks across: filesystem, software updates, boot security, process hardening, network config, SSH, PAM, user management, logging, cron
- CVE scanning — lists installed packages, checks for security updates
- Rootkit detection — hidden processes, suspicious kernel modules, SUID files, crypto miners, suspicious cron jobs
- File integrity monitoring — hashes critical files, alerts on unauthorized changes
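The file integrity bullet boils down to hashing files, storing the hashes, and comparing later. A minimal local sketch follows; the choice of SHA-256, the function names, and running against local paths (rather than over SSH with results stored in SQLite, as the tool describes) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def hash_file(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each existing file path to its current hash."""
    return {str(p): hash_file(p) for p in paths if Path(p).is_file()}


def detect_changes(baseline, current):
    """Compare two snapshots and report changed, missing, and added files."""
    return {
        "changed": sorted(
            p for p in baseline if p in current and baseline[p] != current[p]
        ),
        "missing": sorted(p for p in baseline if p not in current),
        "added": sorted(p for p in current if p not in baseline),
    }
```

A baseline taken with `snapshot(["/etc/passwd", "/etc/ssh/sshd_config"])` can then be compared against a later snapshot of the same paths.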
Database Monitoring
Tell Claude: "monitor database on PROD"
- MySQL: slow query log, connection stats, replication lag, table sizes, processlist
- PostgreSQL: pg_stat_statements, connections, replication, table sizes, lock analysis, cache hit ratio
- Auto-detects which database is installed
Network Monitoring
Tell Claude: "monitor network on PROD"
- Bandwidth per interface (bytes/sec, Mbps)
- Connection state tracking (ESTABLISHED, TIME_WAIT, CLOSE_WAIT)
- TCP retransmission rates
- Historical trends stored in SQLite
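Per-interface bandwidth is typically derived from two readings of a cumulative byte counter taken some seconds apart. The sketch below shows only that arithmetic; the function name `bandwidth` and the handling of counter resets are assumptions, not the tool's documented behavior.

```python
def bandwidth(prev_bytes, curr_bytes, interval_seconds):
    """Convert two cumulative byte-counter readings into a rate.

    Returns bytes/sec and Mbps (decimal megabits, 1 Mbps = 1,000,000 bit/s).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # Counter went backwards (reboot or wrap): assume it restarted at 0.
        delta = curr_bytes
    bytes_per_sec = delta / interval_seconds
    return {"bytes_per_sec": bytes_per_sec, "mbps": bytes_per_sec * 8 / 1_000_000}
```

For instance, a counter moving from 1,000,000 to 2,250,000 bytes over 10 seconds corresponds to 125,000 bytes/sec, or 1 Mbps.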
Resource Rightsizing
Tell Claude: "rightsizing recommendations for PROD"
- Analyzes CPU, memory, disk usage over time
- Identifies over-provisioned resources ("CPU at 0.4% — downsize from 16 to 8 cores")
- Identifies under-provisioned resources ("Memory at 92% — upgrade RAM")
- Cost savings estimates
Service Dependency Mapping
Tell Claude: "map dependencies on PROD"
- Parses active TCP connections to discover what processes talk to what
- Groups by process (nginx -> database:5432, app -> redis:6379)
- Stored in SQLite for historical tracking
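To illustrate how process-to-peer grouping can work, the sketch below parses lines shaped like the output of `ss -tnp` (state, queues, local address, peer address, process info). The exact command Server Guardian runs is not stated, so the input format and the name `map_dependencies` are assumptions.

```python
import re
from collections import defaultdict

# Matches the first process name in a column like: users:(("nginx",pid=123,fd=7))
_PROC_RE = re.compile(r'users:\(\("([^"]+)"')


def map_dependencies(ss_lines):
    """Group established TCP peers by the local process that owns the socket."""
    deps = defaultdict(set)
    for line in ss_lines:
        fields = line.split()
        # Expect: State Recv-Q Send-Q Local:Port Peer:Port Process
        if len(fields) < 6 or fields[0] != "ESTAB":
            continue
        match = _PROC_RE.search(line)
        if match:
            deps[match.group(1)].add(fields[4])
    return {proc: sorted(peers) for proc, peers in deps.items()}
```

Only established connections are considered, so listening sockets do not show up as dependencies.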
Root Cause Analysis
Tell Claude: "analyze root cause on PROD"
- Correlates metric spikes with service failures and alerts
- Detects cascading failure patterns
- Identifies resource exhaustion as cause of service crashes
- Temporal correlation across all monitoring data
Smart Anomaly Detection
Tell Claude: "detect anomalies on PROD"
- Builds baselines per metric grouped by hour and day of week
- Flags values >2.5 standard deviations from the mean
- No ML dependencies — pure statistics from SQLite data
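The three bullets above describe a per-slot z-score check, which can be sketched with the standard library alone. The function names, the use of population standard deviation, and the minimum of two samples per slot are assumptions; only the (day of week, hour) grouping and the 2.5 sigma threshold come from the description.

```python
from collections import defaultdict
from statistics import mean, pstdev

SIGMA_THRESHOLD = 2.5  # threshold stated in the feature description


def build_baselines(samples):
    """samples: iterable of (datetime, value).

    Returns {(weekday, hour): (mean, stdev)} for slots with at least 2 samples.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        slot: (mean(values), pstdev(values))
        for slot, values in buckets.items()
        if len(values) >= 2
    }


def is_anomaly(baselines, ts, value):
    """True if value is more than SIGMA_THRESHOLD deviations from its slot mean."""
    baseline = baselines.get((ts.weekday(), ts.hour))
    if baseline is None:
        return False  # no history for this slot, so nothing to compare against
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > SIGMA_THRESHOLD
```

Grouping by hour and weekday means a Monday 09:00 CPU spike is compared against previous Monday mornings rather than against quiet weekend nights.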
Auto-Remediation Playbooks
5 built-in playbooks:
| Playbook | Trigger | Action |
|---|---|---|
| disk_cleanup | Disk > 90% | Clear journal, /tmp, old logs, package cache |
| restart_failed_services | Failed services detected | Restart each failed service |
| high_memory_cleanup | Memory > 95% | Drop filesystem caches |
| high_cpu_investigation | CPU load > 3x cores | Log top CPU consumers |
| ssl_renewal | SSL cert < 7 days | Run certbot renew, reload nginx |
Custom playbooks: drop JSON files in ~/.server-guardian-mcp/playbooks/
Public Status Page
Tell Claude: "generate a status page"
- Self-hosted uptime page for customers
- Shows server and endpoint health
- Active incidents section
- Auto-refreshes every 60 seconds
- Replaces Better Stack ($29/mo) and Instatus ($20/mo) — free
Multi-Step API Tests
Tell Claude: "test my API"
- Chain API calls: login -> extract token -> call API with token -> verify response
- Variable extraction from JSON responses
- Assertions: status code, body content, response time
- Save and re-run named tests
Maintenance Windows
Tell Claude: "create maintenance window for PROD for 2 hours"
- Suppress alerts during planned work
- Configurable duration
- List and delete windows
Compliance Reports
Tell Claude: "generate a compliance report for PROD"
- Security score (0-100) with letter grade (A-F)
- Detailed check results with pass/fail/warning badges
- Active alerts section
- Print-friendly, works in any browser
- Suitable for SOC2/ISO prep and client deliverables
Team Mode (RBAC)
GUARDIAN_TEAM_MODE=true
GUARDIAN_API_KEY=sg_your_api_key_here
| Role | Permissions |
|---|---|
| admin | Full access: all tools, user management |
| operator | Run commands, restart services, deploy; no user management |
| viewer | Read-only: view health, logs, alerts, dashboards |
External Integrations
PAGERDUTY_ROUTING_KEY=your-routing-key
TELEGRAM_BOT_TOKEN=your-bot-token
TELEGRAM_CHAT_ID=your-chat-id
OPSGENIE_API_KEY=your-api-key
Background Watchdog
Runs independently of Claude — no AI, no API cost. Monitors 24/7 and sends alerts via email, Slack, Discord.
python -m server_guardian_mcp watchdog # run forever
python -m server_guardian_mcp watchdog --once # run one cycle
Alert thresholds
| Condition | Severity |
|---|---|
| Disk > 90% | Critical |
| Disk > 80% | Warning |
| CPU load > 2x cores | Warning |
| Temperature > 85C | Warning |
| Server unreachable | Critical |
| Failed services | Warning |
| HTTP endpoint down | Critical |
| SSL cert < 7 days | Critical |
| SSL cert < 30 days | Warning |
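The numeric thresholds in this table translate into a simple rule check. The sketch below is illustrative only: the function name `evaluate_alerts` and the snapshot keys are hypothetical, and the reachability, failed-service, and HTTP checks are left out because they are not numeric thresholds.

```python
def evaluate_alerts(snapshot):
    """Return (condition, severity) pairs for one health snapshot.

    snapshot keys (all hypothetical): disk_pct, load_1m, cores,
    temp_c, ssl_days_left. Missing keys are skipped.
    """
    alerts = []

    disk = snapshot.get("disk_pct")
    if disk is not None:
        # Check the stricter threshold first so only one disk alert fires.
        if disk > 90:
            alerts.append(("disk", "critical"))
        elif disk > 80:
            alerts.append(("disk", "warning"))

    load, cores = snapshot.get("load_1m"), snapshot.get("cores")
    if load is not None and cores and load > 2 * cores:
        alerts.append(("cpu_load", "warning"))

    temp = snapshot.get("temp_c")
    if temp is not None and temp > 85:
        alerts.append(("temperature", "warning"))

    days = snapshot.get("ssl_days_left")
    if days is not None:
        if days < 7:
            alerts.append(("ssl_cert", "critical"))
        elif days < 30:
            alerts.append(("ssl_cert", "warning"))

    return alerts
```

Testing the critical threshold before the warning one keeps a 92% disk from raising both alerts at once.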
Connection Types
| Type | Connects to | Requires |
|---|---|---|
| ssh | Linux/Mac servers | paramiko (included) |
| local | Your own machine | nothing |
| docker | Docker containers | docker CLI |
| winrm | Windows servers | pip install pywinrm |
| k8s | Kubernetes pods | kubectl CLI |
| aws-ssm | AWS EC2 instances | aws CLI |
| gcloud | GCP Compute Engine | gcloud CLI |
| azure | Azure VMs | az CLI |
Security
- Command blocklist — blocks rm -rf, fork bombs, reverse shells
- Sensitive file protection — blocks .pem, .key, .env, /etc/shadow
- SQL safety — read-only by default
- Read-only mode: set GUARDIAN_MODE=readonly
- Rate limiting — 30 calls/min per tool
- Audit logging — all invocations logged with sensitive param redaction
- Shell injection prevention — shlex.quote on all inputs
- Output capped at 512KB per command
- File integrity monitoring — detect unauthorized file changes
- CIS benchmark compliance — 61 security checks
- CVE + rootkit scanning — detect known vulnerabilities and malware
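As an illustration of the shell injection bullet, the sketch below shows how `shlex.quote` from the Python standard library neutralizes user-supplied values before they are interpolated into a command string. The helper `build_grep_command` is hypothetical and is not taken from the Server Guardian codebase.

```python
import shlex


def build_grep_command(pattern, log_path, max_lines=100):
    """Build a grep pipeline with every user-supplied value quoted.

    shlex.quote wraps unsafe strings in single quotes, so characters like
    ';', '|' and '$' reach grep as literal text instead of shell syntax.
    """
    return (
        f"grep -E -- {shlex.quote(pattern)} {shlex.quote(log_path)}"
        f" | tail -n {int(max_lines)}"  # int() rejects non-numeric input
    )
```

A pattern such as `error; rm -rf /` ends up as a single quoted argument to grep, and the `--` stops a pattern beginning with `-` from being read as an option.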
Architecture
- 63 MCP tools across 16 modules
- 8 connection adapters (SSH, Local, Docker, WinRM, K8s, AWS SSM, GCloud, Azure)
- 15 SQLite tables (health, services, endpoints, alerts, audit, baselines, playbooks, users, logs, SLOs, file hashes, network, maintenance, API tests, dependencies)
- Background watchdog with email/Slack/Discord/PagerDuty/Telegram/OpsGenie alerts
- Live web dashboard (Starlette + Chart.js)
- Statistical anomaly detection engine
- Auto-remediation playbook engine
- Access log APM parser
- CIS benchmark + CVE scanner
- Database monitoring (MySQL + PostgreSQL)
- Network monitoring with bandwidth tracking
- SLO tracking with error budgets
- Team RBAC (admin/operator/viewer)
- Compliance report generator
- Public status page generator
Requirements
- Python 3.10+
- mcp>=1.0.0
- paramiko>=3.0.0
- uvicorn>=0.27.0
- starlette>=0.36.0
License
Proprietary — Copyright (c) 2026 Md Nazish Arman. All rights reserved.
Free for personal, non-commercial evaluation only. Commercial use, business use, or any revenue-generating use requires a paid license. See LICENSE for full terms.
Author
Md Nazish Arman