nthlayer-correlate

Situational awareness through automated signal correlation.

Status: Phase 2 Tier 1
License: Apache 2.0

Enterprise-scale distributed systems produce an enormous volume of observability signals: metrics at 15-second intervals across thousands of services, structured logs on every request, distributed traces, alerts from multiple monitoring systems, change events from CI/CD pipelines, feature flag changes, and infrastructure scaling events. Together, these amount to millions of events per minute. No human can correlate across all of these signals during an incident by reading dashboards, and no existing tool pre-processes these signals into a form that AI agents (or humans) can consume efficiently.

nthlayer-correlate solves this by continuously pre-correlating signals in the background so that when something goes wrong, the correlated picture is already built. Rather than querying raw events at incident time (which is too slow and too noisy at scale), nthlayer-correlate groups related signals, computes temporal proximity, identifies co-occurring changes, and maintains a rolling window of pre-correlated state. When an incident happens, generating a situational snapshot takes seconds rather than minutes of ad-hoc querying across Prometheus, Loki, Jaeger, and your change history.

Phase 2 Tier 1 is fully implemented. The design documented below reflects the implemented architecture.


The Problem

Prometheus handles metrics. Loki handles logs. Jaeger handles traces. But correlating across all three plus change events plus quality scores at enterprise scale is an unsolved problem that most teams handle manually during incidents (or don't handle at all). A company running thousands of services (think SaaS platforms like Workday, Stripe, Twilio) needs something that continuously watches across signal types and has the correlated view ready before anyone asks for it.

Agentic systems add additional volume on top of the enterprise baseline. AI agent decisions, quality scores, model version changes, prompt updates, and adapter deployments all produce signals that existing observability infrastructure wasn't designed to handle. nthlayer-correlate exists because raw observability data at enterprise scale is unusable without a pre-correlation layer.


Pre-Correlation

This is the core architectural concept. Pre-correlation is the difference between "let me spend 20 minutes querying four different systems during an incident" and "here's what happened, already correlated, in 3 seconds."

nthlayer-correlate continuously runs in the background, grouping related signals by service, time window, and topology. When a metric anomaly appears near a deployment event for a related service, nthlayer-correlate has already noted the temporal proximity before anyone asks. The pre-correlated data is indexed and ready for snapshot generation at any time.

Pre-correlation itself is transport (deterministic grouping, windowing, counting). Interpreting what the correlations mean is judgment (the model decides whether a temporal correlation is likely causal). This follows Zero Framework Cognition.
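The transport side of pre-correlation (grouping and temporal proximity) can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `Signal` record, `group_by_service`, `proximity_pairs`, and the 15-minute `max_gap` are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Signal:
    # Hypothetical minimal record; real signals carry far more detail.
    service: str
    type: str
    timestamp: datetime


def group_by_service(signals):
    """Group signals per service, each group sorted oldest first."""
    groups = defaultdict(list)
    for signal in signals:
        groups[signal.service].append(signal)
    return {svc: sorted(items, key=lambda s: s.timestamp) for svc, items in groups.items()}


def proximity_pairs(ordered, max_gap=timedelta(minutes=15)):
    """Return (earlier, later, gap) for differently typed signals within max_gap.

    `ordered` must be sorted oldest first, as group_by_service guarantees.
    """
    pairs = []
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            gap = later.timestamp - earlier.timestamp
            if gap > max_gap:
                break  # sorted input: every later signal is even further away
            if earlier.type != later.type:
                pairs.append((earlier, later, gap))
    return pairs


utc = timezone.utc
signals = [
    Signal("webapp", "quality_degradation", datetime(2026, 3, 6, 14, 18, tzinfo=utc)),
    Signal("webapp", "deploy", datetime(2026, 3, 6, 14, 11, tzinfo=utc)),
    Signal("billing", "deploy", datetime(2026, 3, 6, 13, 0, tzinfo=utc)),
]
# Pre-computed index: candidate pairs per service, ready before anyone asks.
index = {svc: proximity_pairs(group) for svc, group in group_by_service(signals).items()}
```

Nothing here decides whether the deploy caused the degradation. The code only records that the two signals sit 7 minutes apart on the same service, which is the kind of fact the judgment step would then interpret.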


Situational Snapshots

A snapshot is a point-in-time document that answers "what's happening right now?" with structured evidence. Every snapshot follows the same schema regardless of how it was triggered:

snapshot:
  id: sitrep-2026-03-06T14:23:00Z
  triggered_by: alert | schedule | manual
  window: 15m
  severity: info | warning | critical
  summary: "model-generated natural language summary"
  signals:
    - source: arbiter
      type: quality_degradation
      detail: "worker ace-mjxwfy7e rejection rate 0.33 (threshold 0.20)"
      timestamp: 2026-03-06T14:18:00Z
    - source: otel
      type: deploy
      detail: "model version updated on rig-webapp 12m ago"
      timestamp: 2026-03-06T14:11:00Z
  correlations:
    - signals: [0, 1]
      confidence: 0.82
      interpretation: "quality degradation started within 7m of model version change"
  topology:
    affected_services: [webapp, api-gateway]
    dependency_chain: [webapp -> api-gateway -> database]
  recommended_actions:
    - "investigate model version change on rig-webapp"
    - "check if other workers on same rig are affected"

The schema captures what happened (signals), what's related (correlations with confidence scores), what's affected (topology from OpenSRM manifests), and what to do next (recommended actions). The signals and topology sections are transport (structured data from known sources). The summary, correlation interpretation, and recommended actions are judgment (model-generated).
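The transport/judgment split in the schema can be made concrete with a small Python sketch. It assumes a hypothetical `build_snapshot` helper and a `judge` callable standing in for the model; neither name comes from the project. For brevity the stub returns whole correlation entries, although identifying which signals to pair is transport work.

```python
from datetime import datetime, timezone

# Fields the schema description attributes to the model (judgment).
JUDGMENT_FIELDS = {"severity", "summary", "correlations", "recommended_actions"}


def build_snapshot(now, triggered_by, window, signals, topology, judge):
    """Assemble the transport fields, then merge in the judgment fields.

    `judge` is a hypothetical callable (for example a model wrapper) that
    receives the transport part and returns a dict with JUDGMENT_FIELDS.
    """
    transport = {
        "id": f"sitrep-{now.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        "triggered_by": triggered_by,
        "window": window,
        "signals": signals,
        "topology": topology,
    }
    judgment = judge(transport)
    missing = JUDGMENT_FIELDS - judgment.keys()
    if missing:
        raise ValueError(f"judge omitted fields: {sorted(missing)}")
    return {"snapshot": {**transport, **judgment}}


def stub_judge(transport):
    # Stand-in for a model call; the values are copied from the example above.
    return {
        "severity": "warning",
        "summary": "model-generated natural language summary",
        "correlations": [{"signals": [0, 1], "confidence": 0.82}],
        "recommended_actions": ["investigate model version change on rig-webapp"],
    }


doc = build_snapshot(
    now=datetime(2026, 3, 6, 14, 23, tzinfo=timezone.utc),
    triggered_by="alert",
    window="15m",
    signals=[{"source": "arbiter", "type": "quality_degradation"},
             {"source": "otel", "type": "deploy"}],
    topology={"affected_services": ["webapp", "api-gateway"]},
    judge=stub_judge,
)
```

Keeping the judgment fields behind one callable means the deterministic part can be tested without a model in the loop.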


Event Ingestion

At enterprise scale, nthlayer-correlate needs a streaming/queuing layer between event producers and the correlation engine. Raw events from OTel collectors, monitoring systems, CI/CD pipelines, change event sources, and quality score producers flow through a message queue that handles backpressure, replay, and fan-out.

  • Enterprise scale: Kafka, with partitioning by service and topics by signal type. Kafka's compaction and replay capabilities are designed for exactly this volume, and its consumer group model maps naturally to having multiple ecosystem components (nthlayer-correlate, nthlayer-measure, nthlayer-respond) each consuming the same event stream independently.
  • Smaller deployments: NATS provides a lighter-weight alternative for teams that don't need Kafka's full feature set.

nthlayer-correlate consumes from the queue, pre-correlates, and stores the results. This decouples event production rate from correlation processing rate, which is essential when thousands of services are each producing metrics, logs, and traces continuously.
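The "topics by signal type, partitioning by service" layout can be illustrated without a broker. The sketch below is an assumption-laden example: the `signals.<type>` topic naming, the `route` function, and the partition count are hypothetical, and SHA-256 is only a stand-in for whatever hash a real client's partitioner applies to the message key.

```python
import hashlib


def route(event, num_partitions=12):
    """Map an event to a (topic, partition) pair.

    Topic is derived from the signal type; the partition is derived from
    the service name so all events for one service stay ordered together.
    """
    topic = f"signals.{event['signal_type']}"
    digest = hashlib.sha256(event["service"].encode("utf-8")).digest()
    partition = int.from_bytes(digest[:8], "big") % num_partitions
    return topic, partition


metric = {"signal_type": "metric", "service": "webapp"}
change = {"signal_type": "change", "service": "webapp"}

metric_topic, metric_partition = route(metric)
change_topic, change_partition = route(change)
```

Because the partition depends only on the service name, the metric and the change event for `webapp` land on the same partition number in their respective topics, which keeps per-service ordering intact for each consumer.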


Generation Modes

nthlayer-correlate generates snapshots in three modes, each producing the same schema but with different urgency and depth:

  • Batch (periodic): Lightweight summaries every N minutes (default: 5 minutes in WATCHING state) for continuous situational awareness. These snapshots capture the ambient state of the system.
  • Incident-triggered: On alert firing, nthlayer-correlate pulls in more context and performs deeper correlation. These snapshots are richer and more detailed, designed to give an incident responder (human or agent) immediate context.
  • Refresh (on-demand): When a human or agent requests an updated picture, nthlayer-correlate generates a fresh snapshot incorporating any new information that arrived since the last one. During active incidents, refresh snapshots run on a 1-minute cycle.

Agent States

nthlayer-correlate operates in distinct states that affect its behaviour:

State    | Trigger                                  | Behaviour
WATCHING | Normal operations                        | Background correlation, 5-minute snapshot cycle
ALERT    | Elevated signal detected                 | Increased correlation frequency, broader signal ingestion
INCIDENT | Incident declared                        | Continuous reassessment, 1-minute snapshots, pushes context to nthlayer-respond
DEGRADED | Own judgment SLO metrics below threshold | Conservative mode, reduced confidence in correlations, flags for human review

The DEGRADED state is important: nthlayer-correlate monitors its own quality and reduces confidence when it detects its correlations are less reliable. This is self-awareness as a feature, not an afterthought.
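The four states can be modelled as a small state selector, sketched here in Python. Only the 5-minute (WATCHING) and 1-minute (INCIDENT) intervals come from the text above. The ALERT and DEGRADED intervals, the `next_state` function, and the priority order in which conditions are checked are assumptions made for illustration.

```python
from datetime import timedelta
from enum import Enum


class AgentState(Enum):
    WATCHING = "watching"
    ALERT = "alert"
    INCIDENT = "incident"
    DEGRADED = "degraded"


SNAPSHOT_INTERVAL = {
    AgentState.WATCHING: timedelta(minutes=5),  # stated in the document
    AgentState.INCIDENT: timedelta(minutes=1),  # stated in the document
    AgentState.ALERT: timedelta(minutes=2),     # assumption: "increased frequency"
    AgentState.DEGRADED: timedelta(minutes=5),  # assumption: conservative default
}


def next_state(*, judgment_slo_ok, incident_declared, elevated_signal):
    """Pick the state from current conditions.

    Assumed priority: a failing judgment SLO wins over everything else,
    because correlations produced in that condition are less trustworthy.
    """
    if not judgment_slo_ok:
        return AgentState.DEGRADED
    if incident_declared:
        return AgentState.INCIDENT
    if elevated_signal:
        return AgentState.ALERT
    return AgentState.WATCHING


state = next_state(judgment_slo_ok=True, incident_declared=True, elevated_signal=True)
interval = SNAPSHOT_INTERVAL[state]
```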


Change Attribution

When quality degrades (signalled by nthlayer-measure), nthlayer-correlate looks for recent changes that temporally correlate with the degradation. It consumes changes via the standardised change event schema defined in the OpenSRM spec, which means all change sources (deploys, config updates, model version swaps, prompt changes, adapter deployments, formula revisions) arrive in a uniform format:

change_event:
  id: chg-2026-03-06-001
  timestamp: "2026-03-06T14:11:00Z"
  type: model_version
  scope:
    service: webapp
    environment: production
    rig: rig-webapp
  source: model-registry
  actor: deploy-pipeline
  detail:
    from_version: "claude-sonnet-4-20250514"
    to_version: "claude-sonnet-4-20250715"
  risk: low
  rollback_available: true

nthlayer-correlate doesn't need per-source integrations because the change event schema normalises everything. The pre-correlation layer continuously maintains a rolling window of changes, so when a quality signal fires, the candidate causes are already indexed. Identifying the candidate set is transport (pre-computed by the correlation engine). Evaluating whether a temporal correlation is causal is judgment (the model decides).
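A rolling window of changes with candidate lookup might look like the following Python sketch. The `ChangeWindow` class, the 30-minute horizon, and the assumption that events arrive in timestamp order are all illustrative. Only the `id`, `timestamp`, `type`, and `scope.service` fields are taken from the change event example above.

```python
from collections import deque
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


class ChangeWindow:
    """Keeps recent change events so candidate causes are ready on demand."""

    def __init__(self, horizon=timedelta(minutes=30)):
        self.horizon = horizon
        self._changes = deque()  # oldest first; assumes in-order arrival

    def add(self, change):
        self._changes.append(change)
        newest = parse_ts(change["timestamp"])
        # Evict everything that has fallen out of the rolling window.
        while newest - parse_ts(self._changes[0]["timestamp"]) > self.horizon:
            self._changes.popleft()

    def candidates(self, service, at):
        """Changes on `service` within the horizon before `at`, newest first."""
        hits = [
            c for c in self._changes
            if c["scope"]["service"] == service
            and timedelta(0) <= at - parse_ts(c["timestamp"]) <= self.horizon
        ]
        return sorted(hits, key=lambda c: c["timestamp"], reverse=True)


window = ChangeWindow()
window.add({"id": "chg-old", "timestamp": "2026-03-06T13:00:00Z",
            "type": "config", "scope": {"service": "webapp"}})
window.add({"id": "chg-2026-03-06-001", "timestamp": "2026-03-06T14:11:00Z",
            "type": "model_version", "scope": {"service": "webapp"}})

found = window.candidates("webapp", at=parse_ts("2026-03-06T14:18:00Z"))
```

The lookup returns candidates only. Whether `chg-2026-03-06-001` actually caused the degradation is left to the judgment step.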


Signal Sources

nthlayer-correlate consumes signals from multiple source types:

  • OTel metrics and traces via OTel Collector (Prometheus remote write, OTLP)
  • Alerts from Alertmanager (webhook)
  • Change events from all sources, normalised via the OpenSRM change event schema (GitHub, ArgoCD, LaunchDarkly, model registries, prompt management systems)
  • Quality scores from nthlayer-measure (OTel metrics)
  • Deployment records from CI/CD pipelines

OpenSRM Integration

nthlayer-correlate reads service topology from OpenSRM manifests to understand dependency relationships when correlating signals. A quality drop in service A that depends on service B (as declared in the manifest) triggers nthlayer-correlate to check service B's signals automatically. The manifest provides the dependency graph that makes topology-aware correlation possible.
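Topology-aware correlation reduces to a graph walk over declared dependencies. The Python sketch below assumes the manifest has already been parsed into a plain dict mapping each service to the services it depends on. The `services_to_check` function and the `max_depth` limit are hypothetical, not part of OpenSRM.

```python
from collections import deque


def services_to_check(dependencies, service, max_depth=2):
    """Breadth-first walk over declared dependencies of `service`.

    Returns dependencies in the order they should be inspected, nearest
    first. A `seen` set guards against cycles in the declared graph.
    """
    seen = {service}
    order = []
    queue = deque([(service, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for dep in dependencies.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append((dep, depth + 1))
    return order


# Dependency chain from the snapshot example: webapp -> api-gateway -> database
deps = {"webapp": ["api-gateway"], "api-gateway": ["database"]}

to_check = services_to_check(deps, "webapp")
```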


Self-Measurement

nthlayer-correlate has its own judgment SLOs, measured through nthlayer-measure's governance framework:

  • Correlation accuracy: What percentage of nthlayer-correlate's 'related change' assessments do humans agree with?
  • False positive rate: How often does nthlayer-correlate flag a change as incident-related when it isn't?

Every correlation assessment emits a gen_ai.decision.* OTel event, and human disagreements emit gen_ai.override.* events that feed nthlayer-correlate's own quality measurement. If nthlayer-correlate's correlation quality drops, nthlayer-measure's governance layer can reduce nthlayer-correlate's confidence levels or flag it for human review.
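Both SLOs can be computed from paired assessments and human labels. The Python sketch below treats the human label as ground truth and uses the standard false positive rate, FP / (FP + TN). The record shape (`predicted_related`, `human_related`) and the `judgment_slos` function are assumptions for illustration and do not reflect the actual gen_ai.decision.* or gen_ai.override.* event payloads.

```python
def judgment_slos(records):
    """Compute agreement rate and false positive rate.

    Each record is a dict with two booleans:
      predicted_related: the assessment flagged the change as incident-related
      human_related:     a human reviewer judged it incident-related
    Returns None for a metric whose denominator is zero.
    """
    total = len(records)
    agreed = sum(r["predicted_related"] == r["human_related"] for r in records)
    false_pos = sum(r["predicted_related"] and not r["human_related"] for r in records)
    true_neg = sum(not r["predicted_related"] and not r["human_related"] for r in records)
    negatives = false_pos + true_neg
    return {
        "correlation_accuracy": agreed / total if total else None,
        "false_positive_rate": false_pos / negatives if negatives else None,
    }


records = [
    {"predicted_related": True, "human_related": True},    # agreement
    {"predicted_related": True, "human_related": False},   # override: false positive
    {"predicted_related": False, "human_related": False},  # agreement
    {"predicted_related": False, "human_related": False},  # agreement
]

slos = judgment_slos(records)
```

With these four records, humans agree with three assessments, and one of the three unrelated changes was flagged.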


OpenSRM Ecosystem

nthlayer-correlate is one component in the OpenSRM ecosystem. Each component solves a complete problem independently, and they compose when used together through shared OpenSRM manifests and OTel telemetry conventions.

                        ┌─────────────────────────┐
                        │     OpenSRM Manifest     │
                        │  (the shared contract)   │
                        └────────────┬────────────┘
                                     │
                    reads            │           reads
               ┌─────────────┬──────┴──────┬─────────────┐
               ▼             ▼             ▼             ▼
         ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
         │ MEASURE  │ │ NthLayer │ │>CORRELATE│ │ RESPOND  │
         │          │ │          │ │          │ │          │
         │ quality  │ │ generate │ │correlate │ │ incident │
         │+govern   │ │monitoring│ │ signals  │ │ response │
         │+cost     │ │          │ │          │ │          │
         └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
              │             │             │             │
              └─────────────┴──────┬──────┴─────────────┘
                                   ▼
                     ┌──────────────────────────┐
                     │      Verdict Store       │
                     │  (shared data substrate) │
                     │ create · resolve · link  │
                     │ accuracy · gaming-check  │
                     └────────────┬─────────────┘
                                  │ OTel side-effects
                                  ▼
                     ┌──────────────────────────┐
                     │    OTel Collector /      │
                     │   Prometheus / Grafana   │
                     └──────────────────────────┘

              Learning loop (post-incident):
              nthlayer-respond findings → manifest updates
              → NthLayer regenerates → nthlayer-measure
              refines → nthlayer-correlate improves → OpenSRM

How nthlayer-correlate fits in:

  • nthlayer-correlate emits correlation verdicts for every correlation assessment, stored in the shared Verdict Store. nthlayer-respond consumes these verdicts (with confidence scores and lineage) as the starting context for incident response, so there is no direct coupling between components.
  • nthlayer-correlate consumes nthlayer-measure quality verdicts as events and correlates them with other signals (deployments, config changes, model version swaps) to identify what caused quality degradation.
  • nthlayer-correlate reads service topology from OpenSRM manifests (via NthLayer's topology export) to understand dependency relationships when correlating.
  • nthlayer-correlate's correlation accuracy improves over time as the learning loop feeds post-incident findings back into its models.

Each component works alone. Someone who just needs signal correlation adopts nthlayer-correlate without needing nthlayer-measure, NthLayer, or nthlayer-respond.

Component          | What it does                                                         | Link
OpenSRM            | Specification for declaring service reliability requirements         | OpenSRM
nthlayer-learn     | Data primitive for recording AI judgments and measuring correctness  | nthlayer-learn
nthlayer-measure   | Quality measurement and governance for AI agents                     | nthlayer-measure
NthLayer           | Generate monitoring infrastructure from manifests                    | nthlayer
nthlayer-correlate | Situational awareness through signal correlation (this repo)         | nthlayer-correlate
nthlayer-respond   | Multi-agent incident response                                        | nthlayer-respond

Architecture

nthlayer-correlate follows Zero Framework Cognition. The boundary is clear:

Transport (code): Ingesting events from the streaming layer, grouping signals by service and time window, maintaining the rolling pre-correlation index, computing temporal proximity between signals, generating the structured snapshot schema, publishing snapshots via API and SSE.

Judgment (model): Interpreting what correlations mean, assessing whether a temporal correlation is likely causal, generating the natural language summary, recommending actions, deciding the snapshot severity level.


Status

Phase 2 Tier 1 of nthlayer-correlate is fully implemented. The design documented here reflects the implemented architecture. The pre-correlation concept has been validated in the existing OpenSRM ecosystem design (see the nthlayer-correlate technical appendix in the OpenSRM repo).


Contributing

See CONTRIBUTING.md for guidelines.


License

Apache License 2.0. See LICENSE.
