Skip to main content

Automated MongoDB cluster health checks via the Ops Manager API — for incident triage and proactive review.

Project description

om-health-check

A CLI tool that queries the MongoDB Ops Manager API to produce a structured health assessment of MongoDB clusters. Designed for use during incidents and proactive health checks.

What it does

Runs 9 categories of checks against one or more clusters via the Ops Manager API:

  1. Connectivity & Infrastructure — API reachability, node status, agent status, active alerts, network throughput
  2. Compute Resources — CPU (user, iowait, process), memory, swap; deeper CPU breakdown when issues detected
  3. Disk Resources — read/write latency, IOPS, partition space, iowait correlation
  4. Cache Resources — WiredTiger cache used bytes, dirty bytes, cache read/write rates
  5. Database Activity & Workload — query targeting, scan and order, opcounters, document metrics, execution times, global lock queues, Performance Advisor
  6. Replication — replication lag, oplog window, oplog rate
  7. Connections — connection count, zero-connection detection, connection storm correlation
  8. Backup — backup config status, snapshot schedule adherence, capture lag
  9. Version Information — version consistency across nodes, known-bad version detection (CVEs)

Each metric is compared against both an absolute threshold and a 1-week baseline (same day-of-week, same hour) to reduce false positives from normal workload variance.

Installation

pip install om-health-check

Requires Python 3.9+.

API key permissions

The API key must have the Project Read Only role on each project being checked. This provides read access to deployments, measurements, alerts, agents, and backup status — covering 8 of the 9 check sections.

No write permissions are required. The tool never modifies any Ops Manager configuration.

Performance Advisor section — additional role required

The Performance Advisor section calls endpoints that require the Project Data Access Read Only role (or higher). Per the Ops Manager docs, the allowed roles are: Project Owner, Project Data Access Admin, Project Data Access Read/Write, or Project Data Access Read Only.

The minimum role granting this access (Project Data Access Read Only) also grants the holder read access to database contents. There is no narrower read-only-observability role for Performance Advisor in Ops Manager.

For security-conscious deployments where most personnel should not have database read access, run the tool with a Project Read Only key. The Performance Advisor section will report an INFO message — "Performance Advisor access denied — requires Project Data Access Read Only role or higher" — and the other 8 sections work normally. To minimize API load when access is denied, the script makes only one Performance Advisor call per cluster and reuses the message for the remaining hosts.

If the API key lacks sufficient permissions, affected checks report a clear message indicating which permission is missing rather than failing the whole report.

Usage

export OPS_MANAGER_USER=your-public-key
export OPS_MANAGER_API_KEY=your-private-key

om-health-check --om-url https://ops-manager.example.com --project "My Project"

Options

Flag Required Description
--om-url Yes Ops Manager base URL
--project Yes Project name (repeatable for multiple projects)
--cluster No Cluster name filter; omit to check all clusters in the project(s)
--format No txt (default), json, html, or comma-separated (e.g. txt,html)
--config No Path to YAML config file for threshold overrides

Output formats

  • txt — plain text, suitable for pasting into incident tickets
  • json — machine-readable, for downstream tooling or dashboards
  • html — self-contained HTML with color-coded status and collapsible sections

Examples

Check all clusters in a project:

om-health-check --om-url https://om.example.com --project "Production"

Check a specific cluster across two projects, output as text and HTML:

om-health-check --om-url https://om.example.com \
  --project "Production" --project "Staging" \
  --cluster "rs0" \
  --format txt,html

Threshold configuration

Every metric has a default threshold. To override defaults, create a YAML config file.

The tool looks for config in this order:

  1. Path passed via --config
  2. OM_HEALTH_CHECK_CONFIG environment variable
  3. ~/.om-health-check.yaml

Only metrics you want to change need to be specified. Unspecified fields retain their defaults.

thresholds:
  CONNECTIONS:
    red: 30000
    warn: 25000
  SYSTEM_NORMALIZED_CPU_USER:
    red: 90.0
    mode: "or"
  DISK_PARTITION_LATENCY_READ:
    red: 15.0
    warn: 8.0

See examples/all-thresholds.yaml for a reference file listing every metric with its built-in defaults — copy, subset, and edit to produce a custom config.

See examples/low-thresholds.yaml for a smoke-test config with aggressively low thresholds designed to trigger RED on a healthy cluster — useful for verifying the tool runs end-to-end.

Threshold fields

Field Type Description
red float Value that triggers RED status
warn float Value that triggers WARN status
direction string "above" (RED when value >= red) or "below" (RED when value <= red)
deviation float Baseline multiplier (e.g. 3.0 = RED if current >= 3x baseline)
mode string How threshold and baseline interact (see below)

Evaluation modes

Mode Behavior
absolute RED if value crosses threshold. Baseline is informational.
baseline RED only if value deviates from baseline by the configured multiplier. No absolute threshold.
and RED only if value crosses threshold AND deviates from baseline. Suppresses false positives from stable elevated values.
or RED if value crosses threshold OR deviates from baseline. Catches both absolute danger and unusual spikes.

Baseline comparison

Current metric values are compared against the same hour, same day of week, one week prior. This accounts for recurring workload patterns (business hours vs nights vs weekends) and avoids flagging normal variance as anomalous.

Current values are fetched at PT1M granularity over the past hour and averaged, producing a 1-hour rolling average. This sidesteps Ops Manager's mid-hour PT1H rollup, which is not yet populated for rate-based metrics (CPU %, network bytes/sec) until the hour boundary.

Baseline values are fetched at PT1H granularity from the 1-hour window one week ago. Ops Manager retains hourly data for 2 months by default.

Comparing two hourly averages keeps the check apples-to-apples and resistant to single-minute spikes.

Graceful degradation when data is missing

The tool is resilient to gaps in OM data:

  • No current data available → reported as INFO (e.g., no read activity means no DISK_PARTITION_LATENCY_READ sample)
  • No baseline data available (cluster is less than 1 week old) → behavior depends on evaluation mode:
    • absolute — works unchanged (baseline is informational)
    • baseline — reports INFO with the current value and "no baseline yet (cluster < 1 week old)"
    • and / or — degrades to threshold-only evaluation, with a "no baseline yet" note appended to the message
  • Metric not exposed by the OM API version → batched fetch falls back to per-metric calls; unavailable metrics are summarized once on stderr

Status rollup

Each check produces one of four statuses:

  • GREEN — healthy
  • WARN — approaching threshold
  • RED — threshold crossed or baseline significantly deviated
  • INFO — informational only (missing data, advisory alerts, degraded evaluation)

Section, cluster, and overall status roll up the worst status among their children — with one important rule: INFO never bubbles up. A cluster with only INFO items still reports overall GREEN. This keeps the headline color honest about operational health without hiding informational details.

Certain advisory alerts (e.g., HOST_SECURITY_CHECKUP_NOT_MET, which commonly fires as a false positive for deployments using external auth like LDAP) are classified as INFO so they are visible but do not color the overall report.

Monitoring agents

Ops Manager uses leader election for monitoring agents: exactly one agent per project is ACTIVE, the rest are STANDBY (ready to take over if the active agent fails). The tool reports a single GREEN "Agent status" check when at least one agent is ACTIVE, and RED only if no ACTIVE agent exists (which means monitoring data is not being collected).

Dependencies

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

om_health_check-0.4.0.tar.gz (46.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

om_health_check-0.4.0-py3-none-any.whl (45.2 kB view details)

Uploaded Python 3

File details

Details for the file om_health_check-0.4.0.tar.gz.

File metadata

  • Download URL: om_health_check-0.4.0.tar.gz
  • Upload date:
  • Size: 46.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for om_health_check-0.4.0.tar.gz
Algorithm Hash digest
SHA256 99c4b7c610f1127f3ab4ce52c0a7e51274f90bf3f4abccad82f04b9bbe2bf459
MD5 d307ee1596a0173f75244d7f312f8c55
BLAKE2b-256 249239b2374fa4edd57291e63bb9b554505479c37191e7b97b2c11dcbf078815

See more details on using hashes here.

File details

Details for the file om_health_check-0.4.0-py3-none-any.whl.

File metadata

File hashes

Hashes for om_health_check-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 19b6e0b83259fdba935fe1ad191c1a969ab79599d126aa9be9c0170784d6e8b9
MD5 3cdf5342a7377975afadd1fc552fbb7c
BLAKE2b-256 4bb52978e3e40852b6fcdaf668f950e81145139f05358d2e60fb0184fce084ef

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page