DEPRECATED — Functionality moved to nthlayer-workers (LearnOutcomeModule, LearnRetrospectiveModule) as of v1.5; verdict model migrated to nthlayer-common.verdicts. Install nthlayer-workers instead.
nthlayer-learn — The Atomic Unit of AI Judgment
AI systems today are fire-and-forget. They make thousands of decisions per day (approving code, correlating signals, triaging incidents, moderating content, generating recommendations) and almost never find out whether those decisions were right. Quality degrades silently, confidence doesn't track reality, and the same mistakes repeat indefinitely because there is no feedback loop.
Verdicts close that loop. A verdict is a structured record of an AI decision that tracks what was evaluated, what was decided, how confident the AI was, and (critically) whether the decision turned out to be correct. The outcome phase is what makes verdicts different from logging: it turns isolated decisions into measurable, improvable judgment quality.
Verdicts are independent of the OpenSRM ecosystem, independent of any specific agent framework, and independent of any specific model provider. Any system where an AI makes decisions can emit verdicts.
The Problem
Most AI systems record what they decided but never measure whether those decisions were good. The consequences compound:
- Silent degradation. A model update, prompt change, or context shift reduces judgment quality by 15%. Nobody notices for weeks because there is no accuracy measurement, only activity logs.
- Uncalibrated confidence. An agent says "0.85 confidence" on every decision regardless of difficulty. Is 0.85 reliable? Nobody knows, because confidence has never been compared against actual outcomes.
- No learning. A code reviewer approves a change that causes an incident. The reviewer keeps reviewing with the same prompt, the same blind spot, the same failure mode, because there is no signal flowing back from the outcome to the judgment.
- Invisible gaming. An agent optimises for the evaluation rubric rather than actual correctness. Its scores are high but its real-world outcomes are poor. Without tracking both sides, the divergence is invisible.
The root cause is the same in every case: the decision and its outcome are never connected. Verdicts connect them.
The Three Phases
Every verdict has three phases:
- Judgment (filled at decision time): what was the input, what did the AI decide, why, and how confident is it?
- Outcome (filled later): was the decision confirmed, overridden, or contradicted by downstream evidence?
- Lineage (optional): which other verdicts informed this one, and which verdicts does this one inform?
The outcome phase is what makes verdicts powerful. It transforms AI decisions from opaque events into measurable data points with known correctness.
How the Loop Closes at Scale
The obvious question: if an AI makes thousands of decisions per day, who confirms or overrides all of them? The answer is that the feedback loop doesn't depend on a human reviewing every verdict. Outcomes resolve through multiple mechanisms, most of which are automatic:
Downstream outcome signals
Real-world consequences resolve verdicts without human intervention. When an AI approves a code change and that change causes a test failure in CI, that's automatic ground truth. When an approved deploy triggers a latency spike, that's a signal. When code gets reverted within 7 days, that's a signal. These downstream events flow back and resolve the original verdict's outcome automatically.
outcome_tracking:
  sources:
    - type: deployment_metrics
      signal: "error_rate increase > 2x within 1h of deploy"
    - type: test_results
      signal: "test failure within same PR"
    - type: revert
      signal: "git revert of the commit within 7 days"
    - type: incident_correlation
      signal: "nthlayer-correlate correlation linking this change to an incident"
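To make the mechanism concrete, here is a minimal sketch in plain Python (not the nthlayer-learn API) of how a downstream CI event could resolve a pending verdict automatically. The verdict shape, event shape, and `resolve_from_signal` helper are illustrative assumptions:

```python
# Illustrative only: dict-shaped verdicts, not the nthlayer-learn API.
# A downstream event matching the rules above resolves a pending verdict.

def resolve_from_signal(verdict, event):
    """Resolve a pending verdict when a downstream event matches its subject."""
    if verdict["outcome"]["status"] != "pending":
        return verdict  # already resolved; later events are ignored
    if event["ref"] != verdict["subject"]["ref"]:
        return verdict  # event concerns a different change
    # A revert or test failure contradicts an "approve" judgment.
    if event["type"] in ("revert", "test_failure"):
        verdict["outcome"] = {"status": "overridden", "source": event["type"]}
    elif event["type"] == "deploy_healthy":
        verdict["outcome"] = {"status": "confirmed", "source": event["type"]}
    return verdict

verdict = {
    "subject": {"type": "review", "ref": "git:abc123"},
    "judgment": {"action": "approve", "confidence": 0.82},
    "outcome": {"status": "pending"},
}
# CI reports a revert of the same commit within the 7-day window:
resolved = resolve_from_signal(verdict, {"type": "revert", "ref": "git:abc123"})
print(resolved["outcome"]["status"])  # overridden
```

No human looked at this verdict; the revert alone supplied the ground truth.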
Lineage propagation
When verdicts are linked through lineage (a correlation verdict informs an investigation verdict, which informs a remediation verdict), one human override at any point in the chain propagates calibration signals to every verdict upstream. A single human action can resolve an entire chain of verdicts. This is the efficiency multiplier that makes human review scalable: humans review the decisions that matter most (remediations, governance actions), and lineage carries that signal back through the entire decision chain.
nthlayer-correlate correlation verdict (vrd-001)
  "deploy v2.3.1 caused latency spike, confidence 0.71"
      |
      +-> nthlayer-respond investigation verdict (vrd-002, context: [vrd-001])
            "root cause: connection pooling removed, confidence 0.65"
                |
                +-> nthlayer-respond remediation verdict (vrd-003, parent: vrd-002)
                |     "rollback to v2.3.0, confidence 0.90"
                |
                +-> Human override (vrd-004, parent: vrd-002)
                      "root cause correct, but hotfix not rollback"
                      (overrides vrd-003, confirms vrd-002)

Result: nthlayer-correlate gets a positive signal, the investigation agent gets a positive signal, and the remediation agent gets a negative signal. All from one human action.
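The propagation above can be sketched in plain Python. The dict-based verdicts and `propagate` helper are hypothetical, not the library's lineage implementation:

```python
# Illustrative sketch, not the library API: one human action fans out
# calibration signals through a parent-linked chain of verdicts.

verdicts = {
    "vrd-001": {"parent": None},       # correlation
    "vrd-002": {"parent": "vrd-001"},  # investigation
    "vrd-003": {"parent": "vrd-002"},  # remediation
}

def propagate(verdicts, confirmed_id, overridden_id):
    """Confirm one verdict (and everything upstream of it), override another."""
    signals = {}
    # Walking the parent chain turns one confirmation into a positive
    # signal for every upstream verdict that informed it.
    node = confirmed_id
    while node is not None:
        signals[node] = "positive"
        node = verdicts[node]["parent"]
    signals[overridden_id] = "negative"
    return signals

# The human confirms the investigation (vrd-002) but overrides the remediation.
signals = propagate(verdicts, confirmed_id="vrd-002", overridden_id="vrd-003")
print(signals)  # {'vrd-002': 'positive', 'vrd-001': 'positive', 'vrd-003': 'negative'}
```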
Calibration sampling
Not every verdict needs direct validation. A configurable percentage (default 5%) of auto-approved outputs are randomly sampled and sent through full evaluation. If the sample consistently confirms the original judgments, the system is calibrated. If the sample finds problems, the approval criteria tighten. Statistical confidence without exhaustive review.
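A minimal sketch of the sampling step, assuming a seeded random draw; the function name and shape are illustrative, not the library API:

```python
import random

# Illustrative sketch: sample ~5% of auto-approved verdicts for full
# re-evaluation. SAMPLE_RATE matches the default mentioned above; the
# seeding choice (reproducible audits) is an assumption.

SAMPLE_RATE = 0.05

def sample_for_review(verdict_ids, rate=SAMPLE_RATE, seed=0):
    rng = random.Random(seed)  # deterministic so an audit can be re-run
    return [vid for vid in verdict_ids if rng.random() < rate]

ids = [f"vrd-{i:04d}" for i in range(1000)]
sampled = sample_for_review(ids)
print(len(sampled))  # roughly 50 of 1000 at the 5% default
```

If the sampled verdicts keep confirming the original judgments, the auto-approval path is trusted; if not, the approval criteria tighten.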
Score-outcome divergence
The gaming-check query compares an agent's average judgment score against its actual outcome confirmation rate over a rolling window. An agent scoring 0.88 on average but with only 71% of its decisions confirmed by outcomes has a 17-point divergence, which is a signal that its scores don't reflect reality. This surfaces problems across thousands of verdicts without reviewing any of them individually.
nthlayer-learn gaming-check --producer arbiter --agent code-reviewer --window 90d
# code-reviewer: score 0.88, outcome confirmation 0.71, divergence 0.17 -> ALERT
# doc-writer: score 0.79, outcome confirmation 0.81, divergence -0.02 -> OK
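The arithmetic behind the gaming check fits in a few lines of plain Python; the `divergence` helper and the 0.15 alert threshold are illustrative assumptions, not the CLI's implementation:

```python
# Illustrative sketch: mean judgment score vs. the fraction of outcomes
# that confirmed the judgment, over one agent's resolved verdicts.

def divergence(scores, outcomes, threshold=0.15):
    mean_score = sum(scores) / len(scores)
    confirmation_rate = outcomes.count("confirmed") / len(outcomes)
    gap = mean_score - confirmation_rate
    return round(gap, 2), "ALERT" if gap > threshold else "OK"

# code-reviewer: high scores, but only 71% of decisions confirmed downstream
scores = [0.88] * 100
outcomes = ["confirmed"] * 71 + ["overridden"] * 29
print(divergence(scores, outcomes))  # (0.17, 'ALERT')
```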
Time-based expiry
A verdict that remains unresolved past its TTL (default 90 days) expires with a weak negative signal. This doesn't mean the decision was wrong. It means the feedback loop is broken for this verdict, which is itself useful information: if 60% of verdicts expire unresolved, the system isn't generating enough outcome signal to calibrate.
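A sketch of the expiry rule, assuming pending verdicts carry a creation timestamp; plain Python, not the library API:

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch: a pending verdict past its TTL expires with a weak
# negative signal. TTL matches the 90-day default mentioned above.

TTL = timedelta(days=90)

def expire_if_stale(verdict, now):
    if verdict["status"] == "pending" and now - verdict["created_at"] > TTL:
        verdict["status"] = "expired"  # feedback loop broken for this verdict
    return verdict

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
stale = {"status": "pending", "created_at": now - timedelta(days=120)}
fresh = {"status": "pending", "created_at": now - timedelta(days=10)}
print(expire_if_stale(stale, now)["status"])  # expired
print(expire_if_stale(fresh, now)["status"])  # pending
```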
Quick Start
Install
pip install nthlayer-learn
Create a verdict
from nthlayer_learn import create, resolve

# Record a judgment
v = create(
    subject={"type": "review", "ref": "git:abc123", "summary": "Auth middleware change"},
    judgment={"action": "approve", "confidence": 0.82, "reasoning": "Auth check is sound"},
    producer={"system": "my-reviewer", "model": "claude-sonnet-4-20250514"},
)

# Later, when you know if the decision was right
v = resolve(v, status="confirmed")
Measure accuracy
from nthlayer_learn import MemoryStore, AccuracyFilter
store = MemoryStore()
store.put(v)
report = store.accuracy(AccuracyFilter(producer_system="my-reviewer"))
print(f"Confirmation rate: {report.confirmation_rate}")
print(f"Override rate: {report.override_rate}")
print(f"Calibration gap: {report.mean_confidence_on_confirmed - report.confirmation_rate}")
Serialise and deserialise
from nthlayer_learn import to_json, from_json
json_str = to_json(v)
v2 = from_json(json_str)
CLI
# Show accuracy report for a producer
nthlayer-learn accuracy --producer my-reviewer --window 30d
# List recent verdicts
nthlayer-learn list --producer my-reviewer --status pending --limit 20
What You Get
Replay. Every verdict contains a reference to its input and a content hash for integrity verification. Change a prompt, swap a model, adjust context, then replay historical verdicts and diff: X improved, Y regressed, Z unchanged. This is regression testing for judgment quality.
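One way to sketch the integrity check behind replay, assuming sorted-key JSON as the canonical form (the library's actual canonicalisation may differ):

```python
import hashlib
import json

# Illustrative sketch: hash the canonicalised input at decision time,
# re-hash the same input at replay time, and compare before diffing
# judgments. The canonicalisation choice is an assumption.

def content_hash(payload):
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

recorded = content_hash({"diff": "auth middleware change", "base": "abc123"})
replayed = content_hash({"diff": "auth middleware change", "base": "abc123"})
print(recorded == replayed)  # True: the replay ran against the same input
```

If the hashes differ, the replay was not against the original input and the judgment diff is meaningless.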
Calibration. The accuracy() query computes confirmation rate, override rate, calibration gap (the difference between an agent's confidence and its actual accuracy), and per-dimension breakdowns from resolved verdicts. Track these over rolling windows and you know exactly how your AI's judgment quality is trending.
Gaming detection. The gap between judgment.score and outcome.status across a population of verdicts is the gaming signal. High scores with poor real-world outcomes mean something is wrong, whether that's prompt misalignment, evaluation-rubric blind spots, or deliberate optimisation for the rubric rather than actual correctness.
Lineage. Trace how one decision influenced another across components, teams, or systems. When components exchange verdicts instead of bespoke formats, the provenance of every judgment is traceable and the calibration signal from any override flows through the entire chain.
No framework required. No ecosystem required. Just a schema and a small library.
Schema
See SPEC.md for the full schema specification, or schema/verdict.json for the JSON Schema.
Outcome statuses
| Status | Meaning | Calibration Signal |
|---|---|---|
| pending | No outcome yet. Judgment stands but hasn't been validated. | None (not counted in accuracy until resolved) |
| confirmed | Human or downstream signal confirmed the judgment was correct. | Positive |
| overridden | Human or downstream signal contradicted the judgment. | Negative |
| partial | Judgment was partially correct. Some dimensions right, others wrong. | Mixed (per-dimension) |
| superseded | A newer verdict on the same subject replaced this one. | None |
| expired | TTL elapsed without resolution. | Weak negative (feedback loop may be broken) |
OTel Integration
Verdicts are the data primitive. OpenTelemetry is the transmission format. The verdict library maps between the two automatically.
When a verdict is created, resolved, or overridden, it emits OTel events using the gen_ai.* semantic conventions that the broader OTel community is developing. This means verdict data flows into any OTel-compatible backend (Prometheus, Grafana, Datadog, Honeycomb) without custom integrations.
Events
| OTel Event | Trigger |
|---|---|
| gen_ai.decision.created | verdict.create() |
| gen_ai.decision.confirmed | verdict.resolve(status: confirmed) |
| gen_ai.override.recorded | verdict.resolve(status: overridden) |
Metrics (via OTel Collector -> Prometheus)
| Metric | Type | What It Measures |
|---|---|---|
| gen_ai_decision_total | counter | Total judgments produced |
| gen_ai_decision_score | gauge | Quality score per judgment |
| gen_ai_decision_confidence | gauge | Producer confidence per judgment |
| gen_ai_override_reversal_total | counter | Judgments overridden by humans |
| gen_ai_override_correction_total | counter | Judgments partially corrected |
| gen_ai_decision_cost_tokens | counter | Token consumption |
| gen_ai_decision_cost_currency | gauge | Estimated cost in USD |
All metrics carry system, agent, and environment labels derived from the verdict's producer and subject fields.
Attribute Mapping
Key verdict fields map to standard OTel attributes:
| Verdict Field | OTel Attribute |
|---|---|
| producer.system | gen_ai.system |
| producer.model | gen_ai.request.model |
| judgment.action | gen_ai.decision.action |
| judgment.confidence | gen_ai.decision.confidence |
| subject.service | service.name |
| outcome.override.by | gen_ai.override.actor |
For systems using verdicts outside of generative AI contexts (traditional ML, rule-based systems, manual decision tracking), the library can emit using a decision.* namespace instead. The verdict schema is the same regardless of namespace.
See conventions/ for the full semantic convention specifications.
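As an illustration of the mapping table above, here is a plain-Python flattening of a dict-shaped verdict into OTel attribute keys; the `to_otel_attributes` helper is hypothetical, not the library's emitter:

```python
# Illustrative sketch of the field-to-attribute mapping, applied to a
# dict-shaped verdict. Missing fields are simply omitted from the output.

MAPPING = {
    ("producer", "system"): "gen_ai.system",
    ("producer", "model"): "gen_ai.request.model",
    ("judgment", "action"): "gen_ai.decision.action",
    ("judgment", "confidence"): "gen_ai.decision.confidence",
    ("subject", "service"): "service.name",
}

def to_otel_attributes(verdict):
    """Flatten the mapped verdict fields into an OTel attribute dict."""
    attrs = {}
    for (section, field), attr in MAPPING.items():
        value = verdict.get(section, {}).get(field)
        if value is not None:
            attrs[attr] = value
    return attrs

verdict = {
    "producer": {"system": "my-reviewer", "model": "claude-sonnet-4-20250514"},
    "judgment": {"action": "approve", "confidence": 0.82},
    "subject": {"service": "auth-api"},
}
print(to_otel_attributes(verdict)["gen_ai.system"])  # my-reviewer
```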
OpenSRM Ecosystem
Verdicts are independent of any specific ecosystem, but within the OpenSRM reliability stack, the Verdict Store is the shared data substrate that all judgment-producing components communicate through:
Static Layer (Data + Tools)
    OpenSRM Manifests → NthLayer → Generated Artifacts
        │
        ▼
Verdict Layer (Data Primitive)  ← this library
    verdict.create()   verdict.resolve()   verdict.query()
        │
        ▼
Agent Layer (Reasoning)
    nthlayer-correlate → [verdict] → nthlayer-respond agents ← [verdict.accuracy()] → nthlayer-measure
    All agents emit verdicts with lineage.
        │ OTel side-effects
        ▼
Semantic Conventions (OTel)
    Change Events │ Decision Telemetry │ Outcomes
Verdicts with lineage are the primary cross-component communication mechanism. nthlayer-correlate emits correlation verdicts, nthlayer-respond agents emit triage/investigation/remediation verdicts linked via lineage, and nthlayer-measure queries verdict.accuracy() for self-calibration. One human override at any point in the chain propagates calibration signals to every verdict upstream.
| Component | How it uses Verdict |
|---|---|
| nthlayer-measure | Produces agent_output verdicts for every evaluation; queries accuracy() for self-calibration |
| nthlayer-correlate | Produces correlation verdicts; ingests nthlayer-measure quality verdicts as events |
| nthlayer-respond | Produces triage, investigation, communication, remediation verdicts; consumes nthlayer-correlate verdicts as context |
| NthLayer | Queries Prometheus metrics that originate from verdict OTel emission |
Storage
| Tier | Store | Use Case |
|---|---|---|
| Tier 1 | SQLite | Single file, zero dependencies. Default. |
| Tier 2 | PostgreSQL | Concurrent access, full-text search, real-time consumption. |
| Tier 3 | ClickHouse | Analytics over millions of verdicts. |
| Any | Git | Evaluation datasets (curated verdicts with known outcomes). |
Project Structure
nthlayer-learn/
├── SPEC.md # Full schema specification
├── schema/ # JSON Schema and annotated examples
├── conventions/ # OTel semantic conventions
├── lib/ # Transport libraries (Python, Go, TypeScript)
├── stores/ # Storage implementations
├── cli/ # CLI tools
└── eval/ # Example evaluation datasets
License
Apache 2.0