Automated MongoDB cluster health checks via the Ops Manager API — for incident triage and proactive review.

om-health-check

A CLI tool that queries the MongoDB Ops Manager API to produce a structured health assessment of MongoDB clusters. Designed for use during incidents and proactive health checks.

What it does

Runs 9 categories of checks against one or more clusters via the Ops Manager API:

Connectivity & Infrastructure — API reachability, node status, agent status, active alerts, network throughput
Compute Resources — CPU (user, iowait, process), memory, swap; deeper CPU breakdown when issues detected
Disk Resources — read/write latency, IOPS, partition space, iowait correlation
Cache Resources — WiredTiger cache used bytes, dirty bytes, cache read/write rates
Database Activity & Workload — query targeting, scan and order, opcounters, document metrics, execution times, global lock queues, Performance Advisor
Replication — replication lag, oplog window, oplog rate
Connections — connection count, zero-connection detection, connection storm correlation
Backup — backup config status, snapshot schedule adherence, capture lag
Version Information — version consistency across nodes, known-bad version detection (CVEs)

Each metric is compared against both an absolute threshold and a 1-week baseline (same day-of-week, same hour) to reduce false positives from normal workload variance.

Installation

pip install om-health-check

Requires Python 3.9+.

API key permissions

The API key must have the Project Read Only role on each project being checked. This provides read access to deployments, measurements, alerts, agents, backup status, and Performance Advisor data.

No write permissions are required. The tool never modifies any Ops Manager configuration.

If the API key lacks sufficient permissions, affected sections will report a RED status with a message indicating which permission is missing.

Usage

export OPS_MANAGER_USER=your-public-key
export OPS_MANAGER_API_KEY=your-private-key

om-health-check --om-url https://ops-manager.example.com --project "My Project"

Options

Flag	Required	Description
`--om-url`	Yes	Ops Manager base URL
`--project`	Yes	Project name (repeatable for multiple projects)
`--cluster`	No	Cluster name filter; omit to check all clusters in the project(s)
`--format`	No	`txt` (default), `json`, `html`, or comma-separated (e.g. `txt,html`)
`--config`	No	Path to YAML config file for threshold overrides

Output formats

txt — plain text, suitable for pasting into incident tickets
json — machine-readable, for downstream tooling or dashboards
html — self-contained HTML with color-coded status and collapsible sections

Examples

Check all clusters in a project:

om-health-check --om-url https://om.example.com --project "Production"

Check a specific cluster across two projects, output as text and HTML:

om-health-check --om-url https://om.example.com \
  --project "Production" --project "Staging" \
  --cluster "rs0" \
  --format txt,html

Threshold configuration

Every metric has a default threshold. To override defaults, create a YAML config file.

The tool looks for config in this order:

Path passed via --config
OM_HEALTH_CHECK_CONFIG environment variable
~/.om-health-check.yaml

Only metrics you want to change need to be specified. Unspecified fields retain their defaults.

thresholds:
  CONNECTIONS:
    red: 30000
    warn: 25000
  SYSTEM_NORMALIZED_CPU_USER:
    red: 90.0
    mode: "or"
  DISK_PARTITION_LATENCY_READ:
    red: 15.0
    warn: 8.0

See examples/all-thresholds.yaml for a reference file listing every metric with its built-in defaults — copy, subset, and edit to produce a custom config.

See examples/low-thresholds.yaml for a smoke-test config with aggressively low thresholds designed to trigger RED on a healthy cluster — useful for verifying the tool runs end-to-end.
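The lookup order and the "unspecified fields retain their defaults" behavior can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the function names, the `DEFAULTS` values, and the choice to check the home-directory file for existence are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# Hypothetical built-in defaults; the real values ship with the tool.
DEFAULTS = {
    "CONNECTIONS": {"red": 20000.0, "warn": 15000.0, "direction": "above"},
}

def resolve_config_path(cli_path: Optional[str], env: dict, home: Path) -> Optional[Path]:
    """Return the first configured source, in the documented order."""
    if cli_path:                                  # 1. --config
        return Path(cli_path)
    env_path = env.get("OM_HEALTH_CHECK_CONFIG")  # 2. environment variable
    if env_path:
        return Path(env_path)
    default = home / ".om-health-check.yaml"      # 3. home-directory file
    return default if default.exists() else None

def merge_thresholds(defaults: dict, overrides: dict) -> dict:
    """Overlay per-metric overrides; fields not overridden keep their defaults."""
    merged = {name: dict(fields) for name, fields in defaults.items()}
    for name, fields in overrides.items():
        merged.setdefault(name, {}).update(fields)
    return merged
```

In real use the resolver would be called with `os.environ` and `Path.home()`, and the overrides would come from the `thresholds` key of the parsed YAML file.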

Threshold fields

Field	Type	Description
`red`	float	Value that triggers RED status
`warn`	float	Value that triggers WARN status
`direction`	string	`"above"` (RED when value >= red) or `"below"` (RED when value <= red)
`deviation`	float	Baseline multiplier (e.g. `3.0` = RED if current >= 3x baseline)
`mode`	string	How threshold and baseline interact (see below)

Evaluation modes

Mode	Behavior
`absolute`	RED if value crosses threshold. Baseline is informational.
`baseline`	RED only if value deviates from baseline by the configured multiplier. No absolute threshold.
`and`	RED only if value crosses threshold AND deviates from baseline. Suppresses false positives from stable elevated values.
`or`	RED if value crosses threshold OR deviates from baseline. Catches both absolute danger and unusual spikes.
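As a rough illustration of how the four modes combine the two signals, the following Python sketch evaluates RED for a single metric. It is not the tool's implementation: the function names are hypothetical, WARN handling is omitted, and the deviation test is written only for the documented "current >= multiplier x baseline" case.

```python
def crosses(value: float, limit: float, direction: str = "above") -> bool:
    """True when value is on the bad side of limit for the given direction."""
    return value >= limit if direction == "above" else value <= limit

def is_red(value: float, red: float, baseline: float, deviation: float,
           mode: str, direction: str = "above") -> bool:
    """Combine the absolute threshold and the baseline deviation per mode."""
    over_threshold = crosses(value, red, direction)
    # Assumption: deviation is a plain multiplier on the baseline value.
    deviates = value >= deviation * baseline
    if mode == "absolute":
        return over_threshold
    if mode == "baseline":
        return deviates
    if mode == "and":
        return over_threshold and deviates
    if mode == "or":
        return over_threshold or deviates
    raise ValueError(f"unknown mode: {mode}")
```

For example, a CPU value of 95 with `red: 90` and a baseline of 92 is RED under `absolute` and `or`, but not under `and`, because the value is elevated but stable relative to last week.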

Baseline comparison

Current metric values are compared against the same hour, same day of week, one week prior. This accounts for recurring workload patterns (business hours vs nights vs weekends) and avoids flagging normal variance as anomalous.

Current values are fetched at PT1M granularity over the past hour and averaged, producing a 1-hour rolling average. This avoids relying on Ops Manager's PT1H rollup for the current hour, which is not populated for rate-based metrics (CPU %, network bytes/sec) until the hour boundary.

Baseline values are fetched at PT1H granularity from the 1-hour window one week ago. Ops Manager retains hourly data for 2 months by default.

Comparing two hourly averages keeps the check apples-to-apples and resistant to single-minute spikes.
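The two time windows and the averaging step can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes the baseline window is the current window shifted back exactly seven days, and that gaps in the PT1M series arrive as `None`; the tool may align windows or handle gaps differently.

```python
from datetime import datetime, timedelta, timezone
from statistics import fmean

def comparison_windows(now: datetime):
    """Return ((start, end) for the past hour, (start, end) one week earlier)."""
    current = (now - timedelta(hours=1), now)
    week = timedelta(weeks=1)
    baseline = (current[0] - week, current[1] - week)
    return current, baseline

def rolling_average(samples):
    """Average the PT1M samples that are present; None if the hour is empty."""
    present = [s for s in samples if s is not None]
    return fmean(present) if present else None
```

Shifting by exactly seven days keeps both the hour and the day of week aligned, which is what makes the two averages comparable.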

Graceful degradation when data is missing

The tool is resilient to gaps in OM data:

No current data available → reported as INFO (e.g., no read activity means no DISK_PARTITION_LATENCY_READ sample)
No baseline data available (cluster is less than 1 week old) → behavior depends on evaluation mode:
- absolute — works unchanged (baseline is informational)
- baseline — reports INFO with the current value and "no baseline yet (cluster < 1 week old)"
- and / or — degrades to threshold-only evaluation, with a "no baseline yet" note appended to the message
Metric not exposed by the OM API version → batched fetch falls back to per-metric calls; unavailable metrics are summarized once on stderr
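The first two degradation rules can be sketched as a single evaluation function. This is an illustrative sketch under stated assumptions, not the tool's code: the function name and message strings are hypothetical, only the `"above"` direction is handled, WARN is omitted, and the batched-fetch fallback is not modeled.

```python
from typing import Optional, Tuple

def evaluate(value: Optional[float], baseline: Optional[float],
             red: float, deviation: float, mode: str) -> Tuple[str, str]:
    """Return (status, note), degrading gracefully when data is missing."""
    if value is None:
        return "INFO", "no current data"
    over = value >= red
    if baseline is None:
        if mode == "baseline":
            # Nothing to compare against: report the value as information only.
            return "INFO", f"current value {value}; no baseline yet (cluster < 1 week old)"
        # absolute works unchanged; and/or fall back to threshold-only.
        note = "" if mode == "absolute" else "no baseline yet"
        return ("RED" if over else "GREEN"), note
    deviates = value >= deviation * baseline
    hit = {
        "absolute": over,
        "baseline": deviates,
        "and": over and deviates,
        "or": over or deviates,
    }[mode]
    return ("RED" if hit else "GREEN"), ""
```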

Status rollup

Each check produces one of four statuses:

GREEN — healthy
WARN — approaching threshold
RED — threshold crossed or value deviated significantly from baseline
INFO — informational only (missing data, advisory alerts, degraded evaluation)

Section, cluster, and overall status roll up the worst status among their children — with one important rule: INFO never bubbles up. A cluster with only INFO items still reports overall GREEN. This keeps the headline color honest about operational health without hiding informational details.

Certain advisory alerts (e.g., HOST_SECURITY_CHECKUP_NOT_MET, which commonly fires as a false positive for deployments using external auth like LDAP) are classified as INFO so they are visible but do not color the overall report.
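The rollup rule is small enough to sketch directly. The names below are hypothetical, and treating an empty list of children as GREEN is an assumption made for the example.

```python
SEVERITY = {"GREEN": 0, "WARN": 1, "RED": 2}

def rollup(statuses):
    """Return the worst child status, ignoring INFO so it never bubbles up."""
    ranked = [s for s in statuses if s != "INFO"]
    # With only INFO children (or none), the parent stays GREEN.
    return max(ranked, key=SEVERITY.__getitem__, default="GREEN")
```

The same function can be applied at each level: check statuses roll up into a section, section statuses into a cluster, and cluster statuses into the overall result.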

Monitoring agents

Ops Manager uses leader election for monitoring agents: exactly one agent per project is ACTIVE, the rest are STANDBY (ready to take over if the active agent fails). The tool reports a single GREEN "Agent status" check when at least one agent is ACTIVE, and RED only if no ACTIVE agent exists (which means monitoring data is not being collected).
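A minimal sketch of that agent check, assuming the agent states have already been extracted into a list of strings (the input shape, function name, and messages are assumptions for illustration):

```python
def agent_status(agent_states):
    """GREEN if at least one monitoring agent is ACTIVE, otherwise RED."""
    active = sum(1 for state in agent_states if state == "ACTIVE")
    standby = sum(1 for state in agent_states if state == "STANDBY")
    if active >= 1:
        return "GREEN", f"{active} ACTIVE, {standby} STANDBY"
    return "RED", "no ACTIVE monitoring agent; monitoring data is not being collected"
```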

Dependencies

opsmanager — Ops Manager API client
Jinja2 — HTML report templating
packaging — version comparison
PyYAML — config file parsing

License

Apache 2.0

Release history

0.4.0: May 14, 2026
0.3.1: May 11, 2026
0.3.0 (this version): May 11, 2026
0.2.0: Apr 17, 2026

Download files

Source Distribution

om_health_check-0.3.0.tar.gz (44.2 kB), uploaded May 11, 2026, Source

Built Distribution

om_health_check-0.3.0-py3-none-any.whl (43.2 kB), uploaded May 11, 2026, Python 3

File details

Details for the file om_health_check-0.3.0.tar.gz.

File metadata

Download URL: om_health_check-0.3.0.tar.gz
Upload date: May 11, 2026
Size: 44.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for om_health_check-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`c730ea8e7d51bc0417a4e961efd692cdb56dad25da10844234048ad31d573f6a`
MD5	`6bb610dfd39655faa5c6ed958caf32e0`
BLAKE2b-256	`2627e939a19298cb2a77539a71ec168aee6ac6deefe0a622cac73ee4199fe717`

File details

Details for the file om_health_check-0.3.0-py3-none-any.whl.

File metadata

Download URL: om_health_check-0.3.0-py3-none-any.whl
Upload date: May 11, 2026
Size: 43.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.25

File hashes

Hashes for om_health_check-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0eaa3198eb11a524323ea8fcad6449db4df16b49feaa5d097bf5483351b71cb4`
MD5	`3efd97087cd982fc7a7ded71c03aaba3`
BLAKE2b-256	`d0b221506c02f5de79ac237a32adda285ef550655fae242e5ee7af954134f332`
