Trust protocol for AI agents. Prove capability through calibration, not claims.

caliber

The Problem

Every agent registry — Google's A2A, Microsoft's Entra, Salesforce's MuleSoft — faces the same problem: agents describe what they can do, not how well they do it. Agent Cards are LinkedIn profiles with no work history.

When Agent A asks Agent B for help, there's no way to know if B is actually good at the task. B says it can review code. Can it? With what accuracy? Is it overconfident? Does it know its own blind spots?

The Solution

caliber tracks predictions with confidence levels and generates Trust Cards — machine-readable credentials that prove an agent's calibration through accumulated evidence.

A Trust Card answers:

Overall: How accurate is this agent?
By confidence: When it says "80% sure," is it right 80% of the time?
By domain: Where is it strong? Where is it weak?
Danger zones: Confidence ranges where the agent is systematically overconfident.

Quick Start

pip install caliber-trust

For a step-by-step walkthrough aimed at first-time users, see GETTING_STARTED.md.

Python API

from caliber import TrustTracker

tracker = TrustTracker("my-code-reviewer", store_path="./trust-data")

# Record a prediction before checking
pid = tracker.predict(
    claim="this function has a SQL injection vulnerability",
    confidence=0.85,
    domain="security"
)

# After verifying
tracker.verify(pid, correct=True, notes="Found in line 42")

# Generate a Trust Card
card = tracker.generate_card()
print(card.summary())
print(card.to_json())  # Machine-readable

CLI

# Make a prediction
caliber -a my-agent predict "this endpoint returns JSON" -c 90 -d api

# Verify it
caliber -a my-agent verify <prediction-id> --correct

# Generate Trust Card
caliber -a my-agent card
caliber -a my-agent card --json

# Quick progress check
caliber -a my-agent summary

# Show calibration trajectory over time
caliber -a my-agent trajectory --interval 10

# Check the record for gaming signatures
caliber -a my-agent integrity

# Import existing calibration data
caliber -a my-agent import CALIBRATE.md

Try It Now

Make 3 predictions about your codebase before checking:

caliber predict "src/ has more than 10 Python files" -c 70 -d codebase
caliber predict "package.json has a test script" -c 85 -d codebase
caliber predict "the main module uses asyncio" -c 60 -d architecture

Then verify each one:

caliber verify <id1> --correct   # or --incorrect
caliber verify <id2> --correct
caliber verify <id3> --incorrect

After 3 predictions, run caliber summary. After 20, run caliber card.

Trust Card Format

{
  "trust_version": "0.1",
  "agent_name": "my-code-reviewer",
  "generated": "2026-03-26T00:00:00Z",
  "calibration": {
    "total_predictions": 77,
    "total_verified": 77,
    "overall_accuracy": 0.766,
    "mean_confidence": 0.708,
    "mean_calibration_gap": -0.058,
    "confidence_buckets": {
      "50-59": {"predictions": 4, "correct": 2, "accuracy": 0.5, "calibration_gap": 0.045, "insufficient_data": true},
      "60-69": {"predictions": 25, "correct": 16, "accuracy": 0.64, "calibration_gap": 0.005, "significant": false},
      "70-79": {"predictions": 29, "correct": 24, "accuracy": 0.828, "calibration_gap": -0.083, "significant": false},
      "80-89": {"predictions": 18, "correct": 16, "accuracy": 0.889, "calibration_gap": -0.044, "significant": false},
      "90-99": {"predictions": 1, "correct": 1, "accuracy": 1.0, "calibration_gap": -0.055, "insufficient_data": true}
    },
    "domains": {
      "architecture": {"predictions": 21, "accuracy": 0.81},
      "behavior": {"predictions": 25, "accuracy": 0.64},
      "codebase": {"predictions": 20, "accuracy": 0.75}
    },
    "strength_zones": ["50-59"]
  }
}

The Trust Card above is real — generated from 77 calibration predictions made by Claude Opus during the MY UNIVERSE project.

What the numbers reveal: This agent is well-calibrated overall, leaning slightly underconfident (mean calibration gap of -0.058). Buckets with enough data carry a significant field (binomial test, p<0.05), and buckets with too few predictions are flagged insufficient_data instead. No bucket shows statistically significant miscalibration, so the agent's stated confidence is consistent with its accuracy. Behavior predictions (64%) are its weakest domain.
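
A consumer of a Trust Card may want to recompute the headline numbers from the per-bucket counts before relying on them. The sketch below does that for a trimmed copy of the card above; the helper name recompute_headline is hypothetical and not part of caliber.

```python
import json

# Trimmed copy of the card above: only the fields needed for the check.
card = json.loads("""
{
  "calibration": {
    "total_predictions": 77,
    "overall_accuracy": 0.766,
    "mean_confidence": 0.708,
    "mean_calibration_gap": -0.058,
    "confidence_buckets": {
      "50-59": {"predictions": 4, "correct": 2},
      "60-69": {"predictions": 25, "correct": 16},
      "70-79": {"predictions": 29, "correct": 24},
      "80-89": {"predictions": 18, "correct": 16},
      "90-99": {"predictions": 1, "correct": 1}
    }
  }
}
""")

def recompute_headline(calibration):
    """Rebuild totals, accuracy, and gap from the bucket counts (hypothetical helper)."""
    buckets = calibration["confidence_buckets"].values()
    total = sum(b["predictions"] for b in buckets)
    correct = sum(b["correct"] for b in buckets)
    accuracy = correct / total
    gap = calibration["mean_confidence"] - accuracy
    return total, round(accuracy, 3), round(gap, 3)

cal = card["calibration"]
total, accuracy, gap = recompute_headline(cal)
print(total == cal["total_predictions"])      # bucket counts add up to the total
print(accuracy == cal["overall_accuracy"])    # 59 correct out of 77
print(gap == cal["mean_calibration_gap"])     # mean confidence minus accuracy
```

All three checks agree with the published card: the buckets sum to 77 predictions, 59 of them correct.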

Key Concepts

Confidence Buckets

The core insight: overall accuracy is meaningless without calibration. An agent that's "75% accurate" could be perfectly calibrated (right 75% of the time at 75% confidence) or dangerously miscalibrated (right 50% of the time while claiming 90% confidence).

Confidence buckets break accuracy down by confidence level, revealing where the agent knows its limits and where it doesn't.
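
As a rough illustration of the idea, the sketch below groups verified predictions into ten-point confidence bands and reports accuracy per band. The function names and the exact banding rule are assumptions made for the example, not caliber's internals.

```python
from collections import defaultdict

def bucket_label(confidence: float) -> str:
    """Map a confidence in [0, 1] to a ten-point band label such as '70-79'."""
    pct = int(confidence * 100 + 1e-9)       # floor, with a guard against float error
    low = min(max(pct // 10 * 10, 0), 90)    # 100% is folded into the '90-99' band
    return f"{low}-{low + 9}"

def bucket_accuracy(records):
    """records: iterable of (confidence, correct) pairs for verified predictions."""
    counts = defaultdict(lambda: {"predictions": 0, "correct": 0})
    for confidence, correct in records:
        bucket = counts[bucket_label(confidence)]
        bucket["predictions"] += 1
        bucket["correct"] += int(correct)
    return {
        label: {**c, "accuracy": c["correct"] / c["predictions"]}
        for label, c in sorted(counts.items())
    }

# Made-up records: overall accuracy is 5/8, but the two bands behave very differently.
records = [
    (0.90, True), (0.90, False), (0.92, True), (0.95, False),
    (0.60, True), (0.65, True), (0.62, False), (0.68, True),
]
buckets = bucket_accuracy(records)
for label, stats in buckets.items():
    print(label, stats)
```

In this made-up data the 90-99 band is right only half the time while the 60-69 band is right three times out of four, which the single overall figure hides.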

Calibration Gap

The difference between stated confidence and observed accuracy in each confidence bucket (confidence minus accuracy):

Positive gap = overconfident (accuracy < confidence)
Negative gap = underconfident (accuracy > confidence)
Near zero = well-calibrated
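
The sign convention can be checked against the Trust Card shown earlier, where the mean confidence is 0.708 and the overall accuracy is 0.766. The two small functions here are illustrative names, not caliber's API.

```python
def calibration_gap(mean_confidence: float, accuracy: float) -> float:
    """Stated confidence minus observed accuracy (positive means overconfident)."""
    return mean_confidence - accuracy

def direction(gap: float) -> str:
    """Name the direction of a gap; says nothing about statistical significance."""
    if gap > 0:
        return "overconfident"
    if gap < 0:
        return "underconfident"
    return "exact"

overall = calibration_gap(0.708, 0.766)
print(round(overall, 3), direction(overall))  # -0.058 underconfident
```

The direction alone does not say whether a gap matters; that is what the per-bucket significance flags are for.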

Danger Zones

Confidence ranges where the calibration gap exceeds 10 percentage points with at least 3 data points. These are the ranges where the agent's self-assessment is unreliable.
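
A minimal sketch of that rule, assuming a danger zone means a positive (overconfident) gap above the threshold. The function name, parameter names, and example buckets are hypothetical.

```python
def danger_zones(buckets, gap_threshold=0.10, min_predictions=3):
    """Return bucket labels whose overconfidence gap exceeds the threshold.

    Assumption: only positive gaps (accuracy below confidence) count as danger.
    """
    return [
        label
        for label, stats in buckets.items()
        if stats["predictions"] >= min_predictions
        and stats["calibration_gap"] > gap_threshold
    ]

# Hypothetical buckets, not taken from a real Trust Card.
example = {
    "60-69": {"predictions": 25, "calibration_gap": 0.005},
    "80-89": {"predictions": 12, "calibration_gap": 0.21},
    "90-99": {"predictions": 2, "calibration_gap": 0.45},  # too few data points
}
print(danger_zones(example))  # ['80-89']
```

The 90-99 bucket has the largest gap but is skipped because two predictions are below the minimum sample size.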

Gaming Detection

Calibration alone can be farmed: predict "this file exists" at 99% a hundred times and the Trust Card looks flawless. caliber integrity detects that signature with deterministic statistics — no claim judging, no LLM:

caliber integrity            # human-readable report
caliber integrity --json     # machine-readable
caliber card --with-integrity  # attach it to the Trust Card

The core is the Murphy decomposition of the Brier score (reliability - resolution + uncertainty). A farmer can fake reliability (calibration), but not resolution — discriminating outcomes requires taking real predictive risk — and not uncertainty: if nearly every prediction came true, the outcome set was a foregone conclusion and the card proves little.
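
To make the decomposition concrete, here is a small sketch that groups predictions by their stated confidence and computes the three terms. It is written for illustration and is not taken from caliber's source; the two example records are invented.

```python
from collections import defaultdict

def murphy_decomposition(records):
    """records: list of (confidence, outcome) pairs with outcome 0 or 1."""
    n = len(records)
    base_rate = sum(outcome for _, outcome in records) / n
    groups = defaultdict(list)
    for confidence, outcome in records:
        groups[confidence].append(outcome)

    reliability = 0.0  # how far each stated confidence sits from its hit rate
    resolution = 0.0   # how far each group's hit rate sits from the base rate
    for confidence, outcomes in groups.items():
        hit_rate = sum(outcomes) / len(outcomes)
        reliability += len(outcomes) * (confidence - hit_rate) ** 2
        resolution += len(outcomes) * (hit_rate - base_rate) ** 2

    return {
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1 - base_rate),
        "brier": sum((c - o) ** 2 for c, o in records) / n,
    }

# A farmer: 100 near-certain claims at 99%, 99 of them true.
farmer = [(0.99, 1)] * 99 + [(0.99, 0)]
# A risk-taker: confident calls and doubtful calls, both honestly calibrated.
risk_taker = [(0.9, 1)] * 45 + [(0.9, 0)] * 5 + [(0.2, 1)] * 10 + [(0.2, 0)] * 40

for name, records in [("farmer", farmer), ("risk_taker", risk_taker)]:
    print(name, {k: round(v, 4) for k, v in murphy_decomposition(records).items()})
```

Both records have near-zero reliability error, so both look calibrated. Only the risk-taker has meaningful resolution (about 0.12) and uncertainty (about 0.25); the farmer's resolution is zero and its uncertainty is about 0.01.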

Supporting signals: confidence concentration in the top bucket, domain concentration (Herfindahl index), duplicate claims, predict→verify latency (instant verification suggests the answer was already known), and batch-import share (history without witnessed timing).
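
Of these signals, domain concentration is the simplest to show. The sketch below computes a Herfindahl index over domain labels; the function name is hypothetical, and the counts are the three domains listed in the Trust Card above.

```python
from collections import Counter

def herfindahl(domains):
    """Sum of squared domain shares: 1/k for k equal domains, 1.0 for a single domain."""
    counts = Counter(domains)
    total = sum(counts.values())
    return sum((count / total) ** 2 for count in counts.values())

# Domain counts from the card above (architecture 21, behavior 25, codebase 20).
spread = ["architecture"] * 21 + ["behavior"] * 25 + ["codebase"] * 20
farmed = ["codebase"] * 66

print(round(herfindahl(spread), 3))  # 0.337
print(herfindahl(farmed))            # 1.0
```

A value close to 1/3 for three domains means the predictions are spread almost evenly; a value of 1.0 means every prediction came from one domain.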

There is also a too-good-to-be-true check: a forger who fabricates outcomes to match stated confidence evades every behavioral signal, but real binomial outcomes scatter — observed accuracy that tracks confidence more tightly than chance permits raises SUSPICIOUSLY_PERFECT (the same lower-tail test that exposed Mendel's pea data). The adversarial strategies and their countermeasures are encoded in tests/test_integrity_adversarial.py.
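
One way to picture a lower-tail check is a simulation: draw outcomes at the stated confidence many times and ask how often chance alone produces a fit at least as tight as the observed one. The sketch below is a Monte Carlo stand-in written for this README, not caliber's actual test; the function names, bucket data, simulation count, and seed are all assumptions.

```python
import random

def tightness(buckets):
    """Sum of squared standardized deviations of hits from n * p (smaller = tighter fit)."""
    return sum((k - n * p) ** 2 / (n * p * (1 - p)) for n, p, k in buckets)

def lower_tail_p(buckets, simulations=2000, seed=0):
    """Share of simulated records that fit stated confidence at least as tightly."""
    rng = random.Random(seed)
    observed = tightness(buckets)
    at_least_as_tight = 0
    for _trial in range(simulations):
        simulated = [
            (n, p, sum(rng.random() < p for _draw in range(n)))  # binomial draw
            for n, p, _k in buckets
        ]
        if tightness(simulated) <= observed:
            at_least_as_tight += 1
    return at_least_as_tight / simulations

# Each tuple is (predictions, stated confidence, correct). Invented data.
forged = [(50, 0.6, 30), (50, 0.7, 35), (50, 0.8, 40), (50, 0.9, 45)]
plausible = [(50, 0.6, 27), (50, 0.7, 38), (50, 0.8, 37), (50, 0.9, 46)]

print("forged:", lower_tail_p(forged))
print("plausible:", lower_tail_p(plausible))
```

The forged record hits its stated confidence exactly in every bucket, which honest binomial outcomes almost never do, so its lower-tail share comes out very small. The plausible record scatters by a few predictions per bucket and lands well inside the ordinary range.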

Findings are advisory flags with evidence, gated on minimum sample sizes. There is deliberately no aggregate integrity score — a single number would itself become the gaming target. Signals that cannot distinguish gaming from honest bulk use (e.g. templated claims) are reported as metrics, never flags.

Origin

caliber emerged from MY UNIVERSE, a cognitive workspace where Claude Opus tracks its own predictions and calibration. 87 predictions across 3 sessions validated the approach — and revealed that early "danger zone" findings were small-sample artifacts, corrected by caliber's own statistical significance tests.

The thesis: if calibration tracking works for self-improvement, it works for trust between agents. caliber includes the statistical honesty features because we learned the hard way that small samples lie.

Roadmap

v0.1 (current): Core tracker, CLI, MCP server, Trust Card generation with statistical significance tests, import, trajectory support, gaming-signature detection (caliber integrity)
v0.2: Trust Card verification (detect fabricated cards: distribution consistency checks, commitment audits)
v0.3: A2A Agent Card extension
v1.0: Signed cards, trust registry, cross-agent trust queries

MCP Server

For AI agents that want to track calibration natively:

python -m caliber.mcp_server

Print the MCP config snippet:

caliber mcp-config --cwd /path/to/caliber

Or install it into .mcp.json. If the file already exists, a timestamped backup is made first:

caliber mcp-config --install --path ~/.mcp.json --cwd /path/to/caliber

The installed entry has this shape:

{
  "mcpServers": {
    "caliber": {
      "command": "python3",
      "args": ["-m", "caliber.mcp_server"],
      "cwd": "/path/to/caliber"
    }
  }
}
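
For readers who manage .mcp.json by hand, the sketch below shows one way to merge that entry into an existing file while keeping a timestamped backup. It is an illustration of the behavior described above, not the code behind caliber mcp-config; the function name and the backup file naming are assumptions.

```python
import json
import shutil
import tempfile
import time
from pathlib import Path

def install_mcp_entry(config_path, cwd):
    """Add the caliber server entry to an MCP config, backing up any existing file."""
    path = Path(config_path).expanduser()
    config, backup = {}, None
    if path.exists():
        # Hypothetical backup naming: <file>.<timestamp>.bak next to the original.
        backup = path.with_name(f"{path.name}.{time.strftime('%Y%m%d-%H%M%S')}.bak")
        shutil.copy2(path, backup)
        config = json.loads(path.read_text() or "{}")
    config.setdefault("mcpServers", {})["caliber"] = {
        "command": "python3",
        "args": ["-m", "caliber.mcp_server"],
        "cwd": str(cwd),
    }
    path.write_text(json.dumps(config, indent=2) + "\n")
    return backup

# Demo in a temporary directory so no real config file is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".mcp.json"
    target.write_text(json.dumps({"mcpServers": {"other": {"command": "node"}}}))
    backup_path = install_mcp_entry(target, "/path/to/caliber")
    servers = sorted(json.loads(target.read_text())["mcpServers"])
    backup_made = backup_path is not None and backup_path.exists()
    print(servers, backup_made)
```

Existing server entries are preserved; only the caliber key is added or overwritten.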

Tools: caliber_predict, caliber_verify, caliber_card, caliber_summary, caliber_list, caliber_trajectory, caliber_integrity.

The prediction log doubles as a decision audit trail — observability as a side effect of calibration.

Statistical Honesty

Trust Cards include per-bucket significance tests (binomial, p<0.05) and flag insufficient data (<5 predictions per bucket). This prevents treating small-sample noise as calibration patterns — a real problem we discovered building this.
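
The sketch below shows what such a per-bucket check could look like, using an exact two-sided binomial test from the standard library. The README only states "binomial, p<0.05" and "<5 predictions", so the two-sided form, the comparison against the bucket's mean confidence, and the function names are assumptions. The bucket figures come from the Trust Card above, with mean confidence inferred as accuracy plus calibration gap.

```python
from math import comb

def binomial_two_sided_p(correct, predictions, p0):
    """Exact two-sided p-value: total probability of outcomes no likelier than the observed one."""
    pmf = [
        comb(predictions, i) * p0 ** i * (1 - p0) ** (predictions - i)
        for i in range(predictions + 1)
    ]
    observed = pmf[correct]
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

def assess_bucket(correct, predictions, mean_confidence, alpha=0.05, min_predictions=5):
    """Flag small buckets instead of testing them; otherwise report significance."""
    if predictions < min_predictions:
        return {"insufficient_data": True}
    p_value = binomial_two_sided_p(correct, predictions, mean_confidence)
    return {"significant": p_value < alpha, "p_value": p_value}

print(assess_bucket(2, 4, 0.545))     # 50-59 bucket: too few predictions to test
print(assess_bucket(24, 29, 0.745))   # 70-79 bucket: gap of -0.083 is within noise
print(assess_bucket(20, 40, 0.90))    # hypothetical bucket: 50% accuracy at 90% confidence
```

Under these assumptions the two real buckets reproduce the flags on the card (insufficient_data for 50-59, significant: false for 70-79), while the hypothetical overconfident bucket is flagged as significant.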

License

MIT

Release history

0.2.0 (this version): Jun 10, 2026
0.1.0: Mar 26, 2026

Download files

Source Distribution

caliber_trust-0.2.0.tar.gz (43.5 kB)

Uploaded Jun 10, 2026 Source

Built Distribution

caliber_trust-0.2.0-py3-none-any.whl (30.9 kB)

Uploaded Jun 10, 2026 Python 3

File details

Details for the file caliber_trust-0.2.0.tar.gz.

File metadata

Download URL: caliber_trust-0.2.0.tar.gz
Upload date: Jun 10, 2026
Size: 43.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for caliber_trust-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`dfc266f804d822b40c8e9ff1fbc33cc8c9197b8963c8874b0ad7897e563efef3`
MD5	`e790d44bc9b8f99d404930b3d3f8c05e`
BLAKE2b-256	`417716b8671fe39e5dfb69fb109bd9aa103aa80cd51300a5ca28a0be09cfbbb7`

File details

Details for the file caliber_trust-0.2.0-py3-none-any.whl.

File metadata

Download URL: caliber_trust-0.2.0-py3-none-any.whl
Upload date: Jun 10, 2026
Size: 30.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for caliber_trust-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f1cf9da9f9c15483e0a138a93edfc23c74f4de20fb0ac30db8c7e81474379a8e`
MD5	`2c2f82db7431bc124ac4ccb35ea6ec0f`
BLAKE2b-256	`3cd9ca09ee06ada5ea7b15c28b40d503a9c0bdcfa7e42f49dd170b22bdf2d1a9`
