
Black-box AI reliability certification via self-consistency sampling and conformal calibration


Know if your AI is ready to ship — one number, one guarantee.


TrustGate certifies the reliability of any AI endpoint — LLMs, agents, RAG pipelines, or any system you can ask a question to. It uses self-consistency sampling and conformal prediction to produce a single reliability level (e.g., 94.6%) backed by a formal statistical guarantee. Not a vibe, not a leaderboard score — a mathematical proof.

What's included:

  • Self-consistency sampling — ask the same question K times, measure agreement
  • Conformal calibration — formal coverage guarantee, distribution-free
  • LLM semantic canonicalization — groups equivalent answers via a lightweight LLM
  • Human calibration — shareable HTML questionnaire for domain experts (no server needed)
  • Automated LLM judge — calibration without human reviewers (--auto-judge)
  • Runtime trust layer — wrap any endpoint with reliability metadata
  • Sequential stopping — Hoeffding bounds reduce API costs by ~50%
  • Dynamic time estimation — measures your API latency before running
  • Profile diagnostics — automatic detection of canonicalization failures

[!NOTE] Part of the theaios ecosystem. New here? Start with the End-to-End Test Guide — it walks you through every feature from a fresh install.

How to Use TrustGate

pip install theaios-trustgate

Step 1: Connect your AI system

Set your API key (works with any OpenAI-compatible provider):

# macOS / Linux
export LLM_API_KEY="sk-your-key-here"

# Windows (PowerShell)
$env:LLM_API_KEY="sk-your-key-here"

You need two things in trustgate.yaml: the endpoint to test (your AI system) and a judge LLM (a cheap model for canonicalization and calibration matching).

# trustgate.yaml

# The AI system you're certifying (any OpenAI-compatible endpoint)
endpoint:
  url: "https://api.openai.com/v1/chat/completions"
  model: "gpt-4.1-mini"
  api_key_env: "LLM_API_KEY"               # reads from environment variable
  # Or use custom auth headers for LiteLLM, Azure, etc.:
  # headers:
  #   API-Key: "your-key-here"

# The judge LLM — used for canonicalization (grouping answers)
# and calibration (matching ground truth to canonical answers).
# Use a cheap, fast model. Can be the same or different provider.
canonicalization:
  type: "llm"
  judge_endpoint:
    url: "https://api.openai.com/v1/chat/completions"
    model: "gpt-4.1-nano"
    api_key_env: "LLM_API_KEY"
    # Or custom auth (same headers option as endpoint):
    # headers:
    #   API-Key: "your-key-here"

Works with any OpenAI-compatible API: OpenAI, Together, Ollama, LiteLLM, Azure OpenAI, vLLM, Mistral, etc. For providers with non-standard auth, use the headers field instead of api_key_env — works on both endpoint and judge_endpoint.

For custom endpoints (agents, RAG, internal APIs):

endpoint:
  url: "https://my-agent.example.com/api/ask"
  temperature: null
  request_template:
    query: "{{question}}"
  response_path: "answer"
  cost_per_request: 0.03      # measure this first from your billing

canonicalization:
  type: "llm"
  judge_endpoint:
    url: "https://api.openai.com/v1/chat/completions"
    model: "gpt-4.1-nano"
    api_key_env: "LLM_API_KEY"

[!IMPORTANT] The judge_endpoint is always required. It powers both canonicalization (grouping semantically equivalent answers) and calibration (matching ground truth to canonical answers). Use a cheap model like gpt-4.1-nano — it only needs to extract short strings and compare meanings, not reason.

For custom endpoints, also set cost_per_request — TrustGate cannot estimate cost without it.

Step 2: Prepare your questions

You need questions that represent what your system faces in production. Three ways:

  • Generate with AI — ask an LLM to produce realistic questions for your use case
  • Extract from production logs — pull real queries from Langfuse, LangSmith, Datadog, or your own logs
  • Use built-in benchmarks — load_mmlu(), load_gsm8k() for standard tasks

If you have correct answers, add them as acceptable_answers in the CSV. If not, you'll use human calibration in step 4.
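
For example, a minimal questions file might look like this (column names are inferred from the description above; the full guide below has the exact schema):

question,acceptable_answers
"What is the capital of France?","Paris"
"What is 12 * 8?","96"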

Full guide: Getting Your Questions

Step 3: Certify

trustgate certify

TrustGate measures your API latency, shows a cost/time estimate, and asks for confirmation:

     Pre-flight Estimate
┌──────────────────────────┬───────────────────────────────┐
│ Questions                │ 120                           │
│ Samples per question (K) │ 10                            │
│ Requests                 │ 600                           │
│ Sequential stopping      │ enabled (~50% fewer requests) │
│ Est. cost                │ $0.53                         │
│ Measured latency         │ 0.8s per call                 │
│ Est. time                │ ~1.2 min                      │
└──────────────────────────┴───────────────────────────────┘
              Cost / Reliability Tradeoff
┌────┬──────────┬───────────┬───────────┬────────────┐
│  K │ Requests │ Est. Cost │ Est. Time │ Resolution │
│  3 │      180 │ $0.16     │ ~20s      │   coarse   │
│ 10←│      600 │ $0.53     │ ~1.2 min  │    fine    │
│ 20 │    1,200 │ $1.06     │ ~2.3 min  │    fine    │
└────┴──────────┴───────────┴───────────┴────────────┘
Proceed? Enter Y, N, or a number to change K [Y]:
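
(Request counts here already include sequential stopping's ~50% savings: roughly questions × K halved, e.g. 120 × 10 ÷ 2 = 600.)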

Then the result:

     TrustGate Certification Result
┌──────────────────────────┬───────┐
│ Reliability Level        │ 98.0% │
│ M* (at 95% confidence)   │ 1     │
│ Empirical Coverage       │ 1.000 │
│ Capability Gap           │ 0.0%  │
│ Status                   │ PASS  │
└──────────────────────────┴───────┘

Reliability Level: your AI's top answer is correct for 98.0% of
questions — the highest level that still carries a formal guarantee.

M* = 1: at 95% confidence, the top answer alone is sufficient.

Use --alpha to change the confidence level for M*:

trustgate certify --alpha 0.01   # 99% confidence (stricter, M* may be larger)
trustgate certify --alpha 0.05   # 95% confidence (default)
trustgate certify --alpha 0.10   # 90% confidence (looser, M* may be smaller)
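
For intuition, M* comes from a conformal quantile over the calibration data. The sketch below illustrates the standard split-conformal recipe, not TrustGate's internals: record the rank of the ground-truth answer among each calibration question's canonical answers, then take the ⌈(n+1)(1−α)⌉-th smallest rank as M*.

import math

def conformal_set_size(ranks, alpha=0.05):
    """Smallest prediction-set size M that covers the truth with
    probability >= 1 - alpha on exchangeable questions.

    ranks[i] is the rank (1 = top) of the ground-truth answer among
    the canonical answers for calibration question i. Illustrative
    split-conformal sketch, not TrustGate's actual code.
    """
    n = len(ranks)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)  # conformal quantile index
    return sorted(ranks)[k - 1]

ranks = [1] * 18 + [2, 3]                  # 20 calibration questions
print(conformal_set_size(ranks, 0.05))     # -> 3 (stricter, larger set)
print(conformal_set_size(ranks, 0.10))     # -> 2 (looser, smaller set)

A stricter alpha pushes the quantile deeper into the tail of the rank distribution, which is why M* can grow as alpha shrinks.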

Step 4: Calibrate (if no ground truth)

If you don't have correct answers, a domain expert provides them. Three options:

Share a questionnaire (recommended — works across organizations):

trustgate calibrate --export questionnaire.html
# Share via email/Slack → reviewer opens in browser → downloads labels.json
trustgate certify --ground-truth labels.json

Local web UI (reviewer on your network):

trustgate calibrate --serve --port 8080

Automated LLM judge (fast but less rigorous):

trustgate certify --auto-judge

[!TIP] For the full theory, see our paper: Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration (Mouzouni, 2026).


Using TrustGate in Production

Deployment gate — certify before shipping

# CI/CD: fail the build if reliability < 90%
trustgate certify --min-reliability 90 --yes
# Exit code 0 = PASS (≥90%), exit code 1 = FAIL (<90%)
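# (--yes skips the interactive pre-flight confirmation from Step 3)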
# GitHub Actions
- name: AI Reliability Gate
  run: trustgate certify --min-reliability 90 --yes
  env:
    LLM_API_KEY: ${{ secrets.LLM_API_KEY }}

Runtime trust layer — confidence on every query

from theaios.trustgate import TrustGate, certify

result = certify(config_path="trustgate.yaml")

# Passthrough (1 call/query): reliability metadata attached
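# (`config` here is assumed to be your loaded trustgate.yaml configuration)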
gate = TrustGate(config=config, certification=result)
response = gate.query("What is the treatment for X?")
response.reliability_level  # 0.946

# Sampled (K calls/query): per-query prediction set
gate = TrustGate(config=config, certification=result, mode="sampled")
response = gate.query("What is the treatment for X?")
response.prediction_set  # ["Aspirin + PCI"]
response.consensus       # 0.8

Periodic recalibration

Re-run certification on a schedule (cron, CI) to detect drift:

trustgate certify --yes --output json --output-file latest.json
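
For example, a weekly cron entry (schedule and working directory are placeholders):

0 3 * * 1  cd /path/to/project && trustgate certify --yes --output json --output-file latest.json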

Where Do the Questions Come From?

You don't need a gold-standard dataset to use TrustGate. Three ways to get started:

1. Generate questions with AI. Ask an LLM to produce realistic questions for your system:

"Generate 100 realistic customer support questions that users would ask
our e-commerce chatbot, covering orders, returns, shipping, and products."

2. Extract from production logs. Pull real queries from your observability stack (Datadog, Langfuse, LangSmith, custom logs). These are the actual questions your system faces — the most representative test set possible.

3. Use built-in benchmarks. For standard tasks, TrustGate ships dataset loaders:

from theaios.trustgate.datasets import load_gsm8k, load_mmlu
questions = load_mmlu(subjects=["abstract_algebra"], n=100)

No ground truth labels? No problem — use human calibration. A domain expert reviews 50 items in 10 minutes.

Full guide: Getting Your Questions

Works With Any Endpoint

LLMs, agents, RAG pipelines — anything with an HTTP API.

For known LLMs — TrustGate auto-estimates cost from its pricing table:

endpoint:
  url: "https://api.openai.com/v1/chat/completions"
  model: "gpt-4.1"
  api_key_env: "LLM_API_KEY"

For custom endpoints (agents, RAG, internal APIs) — you control the endpoint, so you need to tell TrustGate the cost per request. Measure it first (check your billing dashboard or estimate from your infrastructure costs), then set cost_per_request:

endpoint:
  url: "https://my-agent.example.com/api/ask"
  temperature: null                          # endpoint controls randomness
  request_template:
    query: "{{question}}"
  response_path: "answer"
  cost_per_request: 0.03                     # USD per request — YOU must provide this

Or pass it via CLI:

trustgate certify --cost-per-request 0.03

[!IMPORTANT] For custom endpoints, always set cost_per_request before running certification. Without it, TrustGate cannot show you a cost estimate in the pre-flight check, and you risk unexpected charges. TrustGate will make questions × K API calls (reduced ~50% by sequential stopping).
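
For intuition, here is a minimal sketch of Hoeffding-bound early stopping (an illustration, not TrustGate's implementation; the threshold and delta below are placeholders): after each sample, check whether the confidence interval around the observed agreement rate has cleared the decision threshold, and stop sampling that question once it has.

import math

def hoeffding_radius(n, delta=0.05):
    """Half-width of a (1 - delta) Hoeffding interval for the mean
    of n samples bounded in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

def can_stop_early(agreements, n, threshold=0.5, delta=0.05):
    """True once the agreement rate is provably above or below
    `threshold` at confidence 1 - delta. Illustrative sketch only."""
    p_hat = agreements / n
    r = hoeffding_radius(n, delta)
    return (p_hat - r > threshold) or (p_hat + r < threshold)

# 8 of 8 samples agree: radius ~ 0.48, and 1.0 - 0.48 > 0.5, so stop.
print(can_stop_early(agreements=8, n=8))   # -> True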

Certify Each Component, Not Just the Final Output

Complex AI systems are pipelines — retriever, reranker, reasoning, generation. Don't certify the whole pipeline as a black box. Certify each component independently to find exactly where reliability breaks down.

Query → [Retriever] → [Reranker] → [Generator] → Answer
            ↑              ↑             ↑
      certify: 94%    certify: 91%   certify: 87%
      "right docs?"  "right order?" "right answer?"

Each component is just an endpoint — TrustGate certifies it independently with its own questions, canonicalization, and reliability level. This lets you:

  • Pinpoint failures: the generator is the weak link, not the retriever
  • Iterate efficiently: improve one component, re-certify just that one
  • Stay agnostic: document changes don't invalidate the retriever certification

| Component           | Certify on              | Canonicalization |
|---------------------|-------------------------|------------------|
| RAG retriever       | Retrieved document IDs  | Exact match      |
| SQL agent           | Generated SQL query     | Normalized SQL   |
| Classification step | Category label          | MCQ              |
| Reasoning step      | Intermediate conclusion | llm or custom    |
| Final answer        | Short structured output | Numeric / MCQ    |
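
As a sketch, a retriever-only certification config might look like this (the URL and response field are hypothetical, and the "exact" type name is assumed from the table above; check the docs for the real value):

endpoint:
  url: "https://my-rag.example.com/retrieve"   # hypothetical retriever API
  request_template:
    query: "{{question}}"
  response_path: "doc_ids"                     # assumed response field
  cost_per_request: 0.001                      # measure from your billing

canonicalization:
  type: "exact"    # exact match on document IDs; type name assumed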

TrustGate warns you automatically when outputs are too long or diverse for meaningful self-consistency measurement.

Tuning Performance

Use --concurrency to control how many API requests run in parallel:

# Safe for rate-limited APIs (default)
trustgate certify --concurrency 10

# Faster for high-throughput APIs
trustgate certify --concurrency 30

# Very conservative for strict rate limits
trustgate certify --concurrency 3


Additional Resources

  • End-to-End Test Guide — Try every feature from a fresh install (26 steps, works on macOS/Linux/Windows)
  • RAG Agent Example — Certify a real agent with retriever + calculator tools
  • Examples — Working certification scripts
  • FAQ — Common questions
  • Paper — The research behind TrustGate

Why TrustGate?

  • Formal guarantee — conformal coverage bound, not a heuristic score
  • Black-box — no model internals, no logprobs, just API access
  • Any endpoint — LLMs, agents, RAG, custom APIs
  • Human-in-the-loop — shareable questionnaire, no server needed
  • Cost-aware — pre-flight estimates with measured latency, sequential stopping saves ~50%
  • Tunable — --concurrency, --min-reliability, --auto-judge for any workflow
  • Production-ready — CI/CD gating, runtime trust layer, periodic recalibration

Citation

@article{mouzouni2026trustgate,
  title={TrustGate: Black-Box AI Reliability Certification via
         Self-Consistency Sampling and Conformal Calibration},
  author={Mouzouni, Charafeddine},
  year={2026}
}

License

Apache 2.0. See LICENSE.
