Automated MongoDB cluster health checks via the Ops Manager API — for incident triage and proactive review.

om-health-check

A CLI tool that queries the MongoDB Ops Manager API to produce a structured health assessment of MongoDB clusters. Designed for use during incidents and proactive health checks.

What it does

Runs 9 categories of checks against one or more clusters via the Ops Manager API:

  1. Connectivity & Infrastructure — API reachability, node status, agent status, active alerts, network throughput
  2. Compute Resources — CPU (user, iowait, process), memory, swap; deeper CPU breakdown when issues detected
  3. Disk Resources — read/write latency, IOPS, partition space, iowait correlation
  4. Cache Resources — WiredTiger cache used bytes, dirty bytes, cache read/write rates
  5. Database Activity & Workload — query targeting, scan and order, opcounters, document metrics, execution times, global lock queues, Performance Advisor
  6. Replication — replication lag, oplog window, oplog rate
  7. Connections — connection count, zero-connection detection, connection storm correlation
  8. Backup — backup config status, snapshot schedule adherence, capture lag
  9. Version Information — version consistency across nodes, known-bad version detection (CVEs)

Each metric is compared against both an absolute threshold and a 1-week baseline (same day-of-week, same hour) to reduce false positives from normal workload variance.

Installation

pip install om-health-check

Requires Python 3.9+.

API key permissions

The API key must have the Project Read Only role on each project being checked. This provides read access to deployments, measurements, alerts, agents, backup status, and Performance Advisor data.

No write permissions are required. The tool never modifies any Ops Manager configuration.

If the API key lacks sufficient permissions, affected sections will report a RED status with a message indicating which permission is missing.

Usage

export OPS_MANAGER_USER=your-public-key
export OPS_MANAGER_API_KEY=your-private-key

om-health-check --om-url https://ops-manager.example.com --project "My Project"

Options

| Flag | Required | Description |
|---|---|---|
| --om-url | Yes | Ops Manager base URL |
| --project | Yes | Project name (repeatable for multiple projects) |
| --cluster | No | Cluster name filter; omit to check all clusters in the project(s) |
| --format | No | txt (default), json, html, or a comma-separated list (e.g. txt,html) |
| --config | No | Path to YAML config file for threshold overrides |

Output formats

  • txt — plain text, suitable for pasting into incident tickets
  • json — machine-readable, for downstream tooling or dashboards
  • html — self-contained HTML with color-coded status and collapsible sections

Examples

Check all clusters in a project:

om-health-check --om-url https://om.example.com --project "Production"

Check a specific cluster across two projects, output as text and HTML:

om-health-check --om-url https://om.example.com \
  --project "Production" --project "Staging" \
  --cluster "rs0" \
  --format txt,html
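
When the tool is driven from a script rather than a shell, the same invocation can be assembled programmatically. The following is a minimal sketch that uses only the flags documented above; the helper name `build_command` is hypothetical and is not part of om-health-check.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(om_url: str,
                  projects: Sequence[str],
                  cluster: Optional[str] = None,
                  formats: Sequence[str] = ("txt",)) -> List[str]:
    """Assemble an argv list for om-health-check (hypothetical helper)."""
    cmd = ["om-health-check", "--om-url", om_url]
    # --project is repeatable, so emit one flag per project.
    for project in projects:
        cmd += ["--project", project]
    # --cluster is optional; omitting it checks every cluster.
    if cluster:
        cmd += ["--cluster", cluster]
    # --format accepts a comma-separated list such as "txt,html".
    cmd += ["--format", ",".join(formats)]
    return cmd


if __name__ == "__main__":
    argv = build_command("https://om.example.com",
                         ["Production", "Staging"],
                         cluster="rs0",
                         formats=("txt", "html"))
    # Print a shell-safe rendering of the command for inspection.
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run`, which inherits `OPS_MANAGER_USER` and `OPS_MANAGER_API_KEY` from the calling environment.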

Threshold configuration

Every metric has a default threshold. To override defaults, create a YAML config file.

The tool looks for config in this order:

  1. Path passed via --config
  2. OM_HEALTH_CHECK_CONFIG environment variable
  3. ~/.om-health-check.yaml
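
The lookup order above can be sketched as follows. This is an illustration, not the tool's actual code: the function name `resolve_config_path` is hypothetical, and the choice to check for file existence only on the home-directory fallback is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_path(cli_path: Optional[str] = None,
                        env: Mapping[str, str] = os.environ,
                        home: Optional[Path] = None) -> Optional[Path]:
    """Return the first config location that applies, or None."""
    # 1. An explicit --config path wins.
    if cli_path:
        return Path(cli_path)
    # 2. Next, the OM_HEALTH_CHECK_CONFIG environment variable.
    env_path = env.get("OM_HEALTH_CHECK_CONFIG")
    if env_path:
        return Path(env_path)
    # 3. Finally, ~/.om-health-check.yaml, used only if it exists
    #    (assumption: a missing default file means built-in defaults).
    default = (home or Path.home()) / ".om-health-check.yaml"
    return default if default.is_file() else None
```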

Only metrics you want to change need to be specified. Unspecified fields retain their defaults.

thresholds:
  CONNECTIONS:
    red: 30000
    warn: 25000
  SYSTEM_NORMALIZED_CPU_USER:
    red: 90.0
    mode: "or"
  DISK_PARTITION_LATENCY_READ:
    red: 15.0
    warn: 8.0

See examples/all-thresholds.yaml for a reference file listing every metric with its built-in defaults — copy, subset, and edit to produce a custom config.

See examples/low-thresholds.yaml for a smoke-test config with aggressively low thresholds designed to trigger RED on a healthy cluster — useful for verifying the tool runs end-to-end.

Threshold fields

| Field | Type | Description |
|---|---|---|
| red | float | Value that triggers RED status |
| warn | float | Value that triggers WARN status |
| direction | string | "above" (RED when value >= red) or "below" (RED when value <= red) |
| deviation | float | Baseline multiplier (e.g. 3.0 = RED if current >= 3x baseline) |
| mode | string | How threshold and baseline interact (see below) |
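
The `direction` and `deviation` fields can be read as the two comparisons sketched below. This is an illustration under stated assumptions: the function names are hypothetical, and applying the same inclusive comparison to `warn` as to `red` is an assumption, since the table only specifies the RED comparison.

```python
from typing import Optional


def classify(value: float, red: float, warn: Optional[float] = None,
             direction: str = "above") -> str:
    """Compare a value against red/warn thresholds in the given direction."""
    if direction == "above":
        if value >= red:
            return "RED"
        if warn is not None and value >= warn:  # assumed inclusive, like red
            return "WARN"
    elif direction == "below":
        if value <= red:
            return "RED"
        if warn is not None and value <= warn:  # assumed inclusive, like red
            return "WARN"
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return "GREEN"


def deviates(current: float, baseline: float, deviation: float) -> bool:
    """True when current is at least `deviation` times the baseline."""
    return current >= deviation * baseline
```

With the `DISK_PARTITION_LATENCY_READ` override shown earlier (`red: 15.0`, `warn: 8.0`), a value of 10.0 would classify as WARN in this sketch.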

Evaluation modes

| Mode | Behavior |
|---|---|
| absolute | RED if value crosses threshold. Baseline is informational. |
| baseline | RED only if value deviates from baseline by the configured multiplier. No absolute threshold. |
| and | RED only if value crosses threshold AND deviates from baseline. Suppresses false positives from stable elevated values. |
| or | RED if value crosses threshold OR deviates from baseline. Catches both absolute danger and unusual spikes. |
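
The four modes reduce to a small truth table over two booleans: whether the value crossed its absolute threshold and whether it deviated from baseline. A minimal sketch, with the hypothetical function name `is_red`:

```python
def is_red(mode: str, crosses_threshold: bool, deviates_from_baseline: bool) -> bool:
    """Combine the threshold and baseline signals according to the mode."""
    if mode == "absolute":
        # Baseline is informational only.
        return crosses_threshold
    if mode == "baseline":
        # No absolute threshold is consulted.
        return deviates_from_baseline
    if mode == "and":
        return crosses_threshold and deviates_from_baseline
    if mode == "or":
        return crosses_threshold or deviates_from_baseline
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, a value that sits permanently above its threshold but matches last week's level is RED under `absolute` and `or`, but not under `and` or `baseline`.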

Baseline comparison

Current metric values are compared against the same hour, same day of week, one week prior. This accounts for recurring workload patterns (business hours vs nights vs weekends) and avoids flagging normal variance as anomalous.

Current values are fetched at PT1M granularity over the past hour and averaged, producing a 1-hour rolling average. This sidesteps Ops Manager's PT1H rollup, which for rate-based metrics (CPU %, network bytes/sec) is not populated until the hour boundary and is therefore unavailable mid-hour.

Baseline values are fetched at PT1H granularity from the 1-hour window one week ago. Ops Manager retains hourly data for 2 months by default.

Comparing two hourly averages keeps the check apples-to-apples and resistant to single-minute spikes.
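
The two windows and the averaging step can be sketched as below. This is an illustration only: the function names are hypothetical, the sketch does not call the Ops Manager API, and it assumes the baseline window is simply the current window shifted back by seven days (whether the tool aligns it to hour boundaries is not stated here).

```python
from datetime import datetime, timedelta, timezone
from statistics import fmean
from typing import Iterable, Optional, Tuple

Window = Tuple[datetime, datetime]


def comparison_windows(now: datetime) -> Tuple[Window, Window]:
    """Return (current, baseline) one-hour windows, exactly a week apart."""
    current = (now - timedelta(hours=1), now)
    week = timedelta(weeks=1)
    baseline = (current[0] - week, current[1] - week)
    return current, baseline


def rolling_average(samples: Iterable[Optional[float]]) -> Optional[float]:
    """Average per-minute samples, skipping gaps; None if nothing usable."""
    values = [v for v in samples if v is not None]
    return fmean(values) if values else None


now = datetime(2024, 5, 15, 14, 30, tzinfo=timezone.utc)
current, baseline = comparison_windows(now)
```

Shifting by exactly seven days keeps both the weekday and the hour of day identical, which is what makes the comparison like-for-like.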

Graceful degradation when data is missing

The tool is resilient to gaps in OM data:

  • No current data available → reported as INFO (e.g., no read activity means no DISK_PARTITION_LATENCY_READ sample)
  • No baseline data available (cluster is less than 1 week old) → behavior depends on evaluation mode:
    • absolute — works unchanged (baseline is informational)
    • baseline — reports INFO with the current value and "no baseline yet (cluster < 1 week old)"
    • and / or — degrades to threshold-only evaluation, with a "no baseline yet" note appended to the message
  • Metric not exposed by the OM API version → batched fetch falls back to per-metric calls; unavailable metrics are summarized once on stderr
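
The missing-data rules for current and baseline values can be sketched as a single function. This is a simplified illustration, not the tool's code: the name `evaluate` is hypothetical, it assumes `direction` is "above", and it ignores the WARN level to keep the branches readable.

```python
from typing import Optional, Tuple


def evaluate(mode: str, current: Optional[float], baseline: Optional[float],
             red: float, deviation: float) -> Tuple[str, str]:
    """Return (status, note), degrading gracefully when data is missing."""
    if current is None:
        return "INFO", "no current data"
    crosses = current >= red  # assumes direction "above"
    if baseline is None:
        if mode == "baseline":
            # Nothing to compare against, so report the value only.
            return "INFO", f"{current}: no baseline yet (cluster < 1 week old)"
        # absolute works unchanged; and/or degrade to threshold-only.
        note = "" if mode == "absolute" else "no baseline yet"
        return ("RED" if crosses else "GREEN"), note
    deviates = current >= deviation * baseline
    red_hit = {
        "absolute": crosses,
        "baseline": deviates,
        "and": crosses and deviates,
        "or": crosses or deviates,
    }[mode]
    return ("RED" if red_hit else "GREEN"), ""
```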

Status rollup

Each check produces one of four statuses:

  • GREEN — healthy
  • WARN — approaching threshold
  • RED — threshold crossed or baseline significantly deviated
  • INFO — informational only (missing data, advisory alerts, degraded evaluation)

Section, cluster, and overall status roll up the worst status among their children — with one important rule: INFO never bubbles up. A cluster with only INFO items still reports overall GREEN. This keeps the headline color honest about operational health without hiding informational details.

Certain advisory alerts (e.g., HOST_SECURITY_CHECKUP_NOT_MET, which commonly fires as a false positive for deployments using external auth like LDAP) are classified as INFO so they are visible but do not color the overall report.
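
The rollup rule can be sketched as a worst-of reduction that skips INFO. The function name `roll_up` is hypothetical, and returning GREEN for an empty list of children is an assumption.

```python
from typing import Iterable

# Severity ordering for statuses that participate in the rollup.
SEVERITY = {"GREEN": 0, "WARN": 1, "RED": 2}


def roll_up(statuses: Iterable[str]) -> str:
    """Return the worst child status; INFO never bubbles up."""
    worst = "GREEN"  # assumed result when there are no non-INFO children
    for status in statuses:
        if status == "INFO":
            continue
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst
```

The same function applies at each level: checks roll up into a section, sections into a cluster, and clusters into the overall status.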

Monitoring agents

Ops Manager uses leader election for monitoring agents: exactly one agent per project is ACTIVE, the rest are STANDBY (ready to take over if the active agent fails). The tool reports a single GREEN "Agent status" check when at least one agent is ACTIVE, and RED only if no ACTIVE agent exists (which means monitoring data is not being collected).
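
The agent rule is a one-line check over the agent states. In this sketch the function name `agent_status` is hypothetical, and the input is assumed to be a plain list of state strings already extracted from the API response.

```python
from typing import Iterable


def agent_status(agent_states: Iterable[str]) -> str:
    """GREEN if any monitoring agent is ACTIVE, otherwise RED."""
    return "GREEN" if any(state == "ACTIVE" for state in agent_states) else "RED"
```

STANDBY agents therefore never degrade the result on their own; only the absence of an ACTIVE agent does.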

Dependencies

License

Apache 2.0
