
Kalibra
Regression detection and CI quality gates for AI agents.

You change a prompt, swap a model, or refactor a tool — did the agent get better or worse?

Kalibra compares two populations of traces and tells you, with statistical rigor, what changed — success rate, cost, latency, tokens, per-task regressions, per-span breakdowns. Two dependencies. One command.

pip install kalibra
kalibra demo
  Kalibra Compare
  ──────────────────────────────────────────────────────────
  Baseline       100 traces   (baseline.jsonl)
  Current        100 traces   (current.jsonl)
  Direction ~ MIXED

  Trace metrics

  ▲ Success rate      50.0% → 75.0%  +25.0 pp   (p=0.000)
  ▲ Cost              $0.0358 → $0.0213 median  -40.5%
  ▼ Duration          7.6s → 15.2s median  +99.1%
  ≈ Steps             4 → 4 steps/trace (median)  +0.0%
  ▼ Error rate        0.2% → 4.3%  +4.1 pp
  ≈ Token usage       7,746 → 7,738 tokens/trace (median)  -0.1%
  ≈ Token efficiency  8,443 → 7,090 tokens/success (median)  -16.0%
  ▲ Cost / quality    $0.0385 → $0.0189 per success (median)  -51.0%

  Trace breakdown

  ~ Per trace         20 matched — ✓ 10 improved, ✗ 5 regressed

  Span breakdown

  ▼ Per span          5 matched — ✗ 1 regressed, ~ 4 mixed

  ──────────────────────────────────────────────────────────
  ~ MIXED — no quality gates configured

Add -v for per-task outcome changes, per-span breakdowns, and confidence intervals.

Why Kalibra

Aggregate metrics hide task-level regressions. Your success rate went from 80% to 82% — great. Except five tasks that used to pass now fail, masked by eight new easy ones that pass. Kalibra catches this.
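
To make that failure mode concrete, here is a toy sketch (not Kalibra's implementation) of matching traces by task id: the aggregate rate rises, yet a previously passing task now fails.

# Toy example: aggregate success rate rises while a matched task regresses.
# Task ids and outcomes are made up for illustration.

def rate(outcomes):
    return sum(outcomes.values()) / len(outcomes)

baseline = {"t1": True, "t2": True, "t3": True, "t4": True, "t5": False}
current = {"t1": True, "t2": False, "t3": True, "t4": True, "t5": True,
           "t6": True, "t7": True}  # new, easy tasks inflate the aggregate

matched = baseline.keys() & current.keys()
regressed = sorted(t for t in matched if baseline[t] and not current[t])
improved = sorted(t for t in matched if not baseline[t] and current[t])

print(f"success rate: {rate(baseline):.0%} -> {rate(current):.0%}")  # 80% -> 86%
print("regressed:", regressed)  # ['t2'], hidden by the aggregate
print("improved:", improved)    # ['t5']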

  • 10 metrics — success rate, cost, duration, steps, error rate, tokens, token efficiency, cost/quality, per-task breakdown, per-span breakdown
  • Statistical rigor — bootstrap 95% CIs on continuous metrics, two-proportion z-test on rates, noise thresholds to ignore jitter (minimal sketch after this list)
  • Quality gates — require: success_rate_delta >= -5 fails your CI pipeline (exit 1) when thresholds are violated
  • Any JSONL — flat traces, nested spans, non-standard field names. Use --suggest to auto-detect field mappings
  • Three output formats — terminal (human), markdown (PR comments), JSON (automation)
  • Two dependencies — click + pyyaml. No ML frameworks, no API keys
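
The statistics behind those deltas are standard tools. A minimal sketch of both, assuming nothing about Kalibra's internals and using made-up numbers, looks like this in plain Python:

import math
import random

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def bootstrap_median_delta_ci(baseline, current, n_boot=10_000, seed=0):
    # Percentile bootstrap 95% CI for median(current) - median(baseline).
    rng = random.Random(seed)
    deltas = sorted(
        median([rng.choice(current) for _ in current])
        - median([rng.choice(baseline) for _ in baseline])
        for _ in range(n_boot)
    )
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    # z statistic for H0: the two success rates are equal.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (successes_b / n_b - successes_a / n_a) / se

# Made-up per-trace costs, plus the demo's 50/100 vs 75/100 success counts.
print(bootstrap_median_delta_ci([0.031, 0.042, 0.038, 0.029, 0.044],
                                [0.021, 0.025, 0.019, 0.022, 0.027]))
print(two_proportion_z(50, 100, 75, 100))  # ~3.65, p well below 0.05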

Quickstart

1. Install

pip install kalibra

2. Try the demo

kalibra demo

This creates a kalibra-demo/ directory with sample traces and runs a comparison. Afterwards:

kalibra compare kalibra-demo/baseline.jsonl kalibra-demo/current.jsonl -v

3. Compare your own data

If your fields don't match the defaults, let Kalibra figure it out:

kalibra inspect your-traces.jsonl --suggest

This scans your data and prints a copy-pasteable compare command with the right --outcome, --cost, --trace-id flags.
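
For example, if your traces kept the outcome under metadata.result and the cost under usage.total_cost (a hypothetical schema), the suggested command would look roughly like this:

kalibra compare baseline.jsonl current.jsonl --outcome metadata.result --cost usage.total_cost --trace-id metadata.task_name

Treat the flags and paths as an illustration of the shape of the suggestion, not as verbatim output.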

Quality gates for CI

# kalibra.yml
baseline:
  path: ./baselines/production.jsonl
current:
  path: ./eval-output/canary.jsonl

require:
  - success_rate_delta >= -2     # max 2pp success rate drop
  - regressions <= 5             # max 5 tasks regressed
  - cost_delta_pct <= 20         # max 20% cost increase
  - span_regressions <= 3        # max 3 span types regressed

kalibra compare        # reads kalibra.yml, exits 1 on failure

GitHub Actions

- uses: khan5v/kalibra-action@v1
  with:
    baseline: baselines/production.jsonl
    current: current.jsonl
    config: kalibra.yml

Posts a markdown report as a PR comment. Exits 1 on gate failure.

Full workflow example

name: Agent Quality Gate
on: [pull_request]

jobs:
  kalibra:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
      - run: python eval.py --output current.jsonl
      - uses: khan5v/kalibra-action@v1
        with:
          baseline: baselines/production.jsonl
          current: current.jsonl
          config: kalibra.yml
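
Prefer to post the report yourself instead of using the action? One possible wiring uses --format markdown with the GitHub CLI; gh and a PR_NUMBER variable are assumed to be available in your CI environment:

kalibra compare baselines/production.jsonl current.jsonl --format markdown > report.md
gh pr comment "$PR_NUMBER" --body-file report.md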

Field mapping

Kalibra works with any JSONL shape. Map your fields in config or on the command line:

# kalibra.yml
fields:
  outcome: metadata.result
  cost: agent_cost.total_cost
  task_id: metadata.task_name

kalibra compare a.jsonl b.jsonl --outcome metadata.result --cost usage.total_cost
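
For illustration, a trace line that the mapping above would understand might look like this (a made-up record showing only the mapped fields):

{"metadata": {"result": "success", "task_name": "checkout-flow"}, "agent_cost": {"total_cost": 0.0213}}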

Comparing files with different schemas? Override fields per source:

baseline:
  path: ./langfuse.jsonl
  fields: { outcome: metadata.result, cost: usage.total_cost }
current:
  path: ./braintrust.jsonl
  fields: { outcome: scores.correctness, cost: metrics.cost }

Python API

from kalibra.loader import load_traces
from kalibra.engine import compare
from kalibra.renderers import render

baseline = load_traces("baseline.jsonl")
current = load_traces("current.jsonl")

result = compare(baseline, current, require=["success_rate_delta >= -5"])
print(render(result, "terminal", verbose=True))
print("passed:", result.passed)

Commands

kalibra compare [a.jsonl b.jsonl]     Compare traces — flags, config, or positional args
kalibra compare -v                    Verbose — CIs, per-task/per-span detail
kalibra compare --format markdown     Markdown for PR comments
kalibra compare --format json         Machine-readable JSON
kalibra compare --metrics             List all threshold fields
kalibra inspect traces.jsonl          Show data coverage and fields
kalibra inspect traces.jsonl --suggest  Auto-detect field mappings
kalibra init                          Create kalibra.yml interactively
kalibra demo                          Run comparison on built-in sample data

Development

git clone https://github.com/khan5v/kalibra.git
cd kalibra
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
