Skip to main content

Trust protocol for AI agents. Prove capability through calibration, not claims.

Project description

caliber

Trust protocol for AI agents. Prove capability through calibration, not claims.

The Problem

Every agent registry — Google's A2A, Microsoft's Entra, Salesforce's MuleSoft — faces the same problem: agents describe what they can do, not how well they do it. Agent Cards are LinkedIn profiles with no work history.

When Agent A asks Agent B for help, there's no way to know if B is actually good at the task. B says it can review code. Can it? With what accuracy? Is it overconfident? Does it know its own blind spots?

The Solution

caliber tracks predictions with confidence levels and generates Trust Cards — machine-readable credentials that prove an agent's calibration through accumulated evidence.

A Trust Card answers:

  • Overall: How accurate is this agent?
  • By confidence: When it says "80% sure," is it right 80% of the time?
  • By domain: Where is it strong? Where is it weak?
  • Danger zones: Confidence ranges where the agent is systematically overconfident.

Quick Start

pip install agent-trust

Python API

from caliber import TrustTracker

tracker = TrustTracker("my-code-reviewer", store_path="./trust-data")

# Record a prediction before checking
pid = tracker.predict(
    claim="this function has a SQL injection vulnerability",
    confidence=0.85,
    domain="security"
)

# After verifying
tracker.verify(pid, correct=True, notes="Found in line 42")

# Generate a Trust Card
card = tracker.generate_card()
print(card.summary())
print(card.to_json())  # Machine-readable

CLI

# Make a prediction
caliber -a my-agent predict "this endpoint returns JSON" -c 90 -d api

# Verify it
caliber -a my-agent verify <prediction-id> --correct

# Generate Trust Card
caliber -a my-agent card
caliber -a my-agent card --json

# Quick progress check
caliber -a my-agent summary

Try It Now

Make 3 predictions about your codebase before checking:

caliber predict "src/ has more than 10 Python files" -c 70 -d codebase
caliber predict "package.json has a test script" -c 85 -d codebase
caliber predict "the main module uses asyncio" -c 60 -d architecture

Then verify each one:

caliber verify <id1> --correct   # or --incorrect
caliber verify <id2> --correct
caliber verify <id3> --incorrect

After 3 predictions: caliber summary. After 20: caliber card.

Trust Card Format

{
  "trust_version": "0.1",
  "agent_name": "my-code-reviewer",
  "generated": "2026-03-26T00:00:00Z",
  "calibration": {
    "total_predictions": 77,
    "total_verified": 77,
    "overall_accuracy": 0.766,
    "mean_confidence": 0.708,
    "mean_calibration_gap": -0.058,
    "confidence_buckets": {
      "50-59": {"predictions": 4, "correct": 2, "accuracy": 0.5, "calibration_gap": 0.045, "insufficient_data": true},
      "60-69": {"predictions": 25, "correct": 16, "accuracy": 0.64, "calibration_gap": 0.005, "significant": false},
      "70-79": {"predictions": 29, "correct": 24, "accuracy": 0.828, "calibration_gap": -0.083, "significant": false},
      "80-89": {"predictions": 18, "correct": 16, "accuracy": 0.889, "calibration_gap": -0.044, "significant": false},
      "90-99": {"predictions": 1, "correct": 1, "accuracy": 1.0, "calibration_gap": -0.055, "insufficient_data": true}
    },
    "domains": {
      "architecture": {"predictions": 21, "accuracy": 0.81},
      "behavior": {"predictions": 25, "accuracy": 0.64},
      "codebase": {"predictions": 20, "accuracy": 0.75}
    },
    "strength_zones": ["50-59"]
  }
}

The Trust Card above is real — generated from 77 calibration predictions made by Claude Opus during the MY UNIVERSE project.

What the numbers reveal: This agent is well-calibrated overall. Each bucket includes a significant field (binomial test, p<0.05) and flags insufficient_data for small samples. No bucket shows statistically significant miscalibration — the agent's confidence matches its accuracy. Behavior predictions (64%) are its weakest domain.

Key Concepts

Confidence Buckets

The core insight: overall accuracy is meaningless without calibration. An agent that's "75% accurate" could be perfectly calibrated (right 75% of the time at 75% confidence) or dangerously miscalibrated (right 50% of the time while claiming 90% confidence).

Confidence buckets break accuracy down by confidence level, revealing where the agent knows its limits and where it doesn't.

Calibration Gap

The difference between expected and actual accuracy for each confidence bucket:

  • Positive gap = overconfident (accuracy < confidence)
  • Negative gap = underconfident (accuracy > confidence)
  • Near zero = well-calibrated

Danger Zones

Confidence ranges where the calibration gap exceeds 10 percentage points with at least 3 data points. These are the ranges where the agent's self-assessment is unreliable.

Origin

caliber emerged from MY UNIVERSE, a cognitive workspace where Claude Opus tracks its own predictions and calibration. 87 predictions across 3 sessions validated the approach — and revealed that early "danger zone" findings were small-sample artifacts, corrected by caliber's own statistical significance tests.

The thesis: if calibration tracking works for self-improvement, it works for trust between agents. caliber includes the statistical honesty features because we learned the hard way that small samples lie.

Roadmap

  • v0.1 (current): Core tracker, CLI, MCP server, Trust Card generation with statistical significance tests
  • v0.2: Trust Card verification (detect fabricated/gamed cards), trajectory support
  • v0.3: A2A Agent Card extension, commitment scheme (prediction anchoring)
  • v1.0: Signed cards, trust registry, cross-agent trust queries

MCP Server

For AI agents that want to track calibration natively:

python -m caliber.mcp_server

Or add to .mcp.json:

{
  "mcpServers": {
    "caliber": {
      "command": "python3",
      "args": ["-m", "caliber.mcp_server"],
      "cwd": "/path/to/caliber"
    }
  }
}

Tools: caliber_predict, caliber_verify, caliber_card, caliber_summary, caliber_list.

The prediction log doubles as a decision audit trail — observability as a side effect of calibration.

Statistical Honesty

Trust Cards include per-bucket significance tests (binomial, p<0.05) and flag insufficient data (<5 predictions per bucket). This prevents treating small-sample noise as calibration patterns — a real problem we discovered building this.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

caliber_trust-0.1.0.tar.gz (26.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

caliber_trust-0.1.0-py3-none-any.whl (21.4 kB view details)

Uploaded Python 3

File details

Details for the file caliber_trust-0.1.0.tar.gz.

File metadata

  • Download URL: caliber_trust-0.1.0.tar.gz
  • Upload date:
  • Size: 26.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for caliber_trust-0.1.0.tar.gz
Algorithm Hash digest
SHA256 f1b9debccfd1cc670db0a8a8b7b366c28babdc25ed31fbb5bc0912985c85a828
MD5 f8cd42bf426965104e0f8834e5825d4b
BLAKE2b-256 2b4bc8aeab7bf9f86e130dda48104d200589e4097dd170800bbb1a048f7f3572

See more details on using hashes here.

File details

Details for the file caliber_trust-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: caliber_trust-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 21.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.2

File hashes

Hashes for caliber_trust-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9fcd7989e2098b3aa8a739c50f702c9686e0a20454b4c8f6bc689ba3de6d07ff
MD5 3fbcca5fa723609df604c4c8cc12e035
BLAKE2b-256 509f850242787290867e9fc7b9b4b1fffde3c8b267c62f43458be9bbc254760a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page