
Kalibra
Regression detection and CI quality gates for AI agents.


You change a prompt, swap a model, or refactor a tool — did the agent get better or worse?

Kalibra compares two populations of traces and tells you, with statistical rigor, what changed — success rate, cost, latency, tokens, per-task regressions, per-span breakdowns. Two dependencies. One command.

pip install kalibra
kalibra demo
  Kalibra Compare
  ──────────────────────────────────────────────────────────
  Baseline       100 traces   (baseline.jsonl)
  Current        100 traces   (current.jsonl)
  Direction ~ MIXED

  Trace metrics

  ▲ Success rate      50.0% → 75.0%  +25.0 pp   (p=0.000)
  ▲ Cost              $0.0358 → $0.0213 median  -40.5%
  ▼ Duration          7.6s → 15.2s median  +99.1%
  ≈ Steps             4 → 4 steps/trace (median)  +0.0%
  ▼ Error rate        0.2% → 4.3%  +4.1 pp
  ≈ Token usage       7,746 → 7,738 tokens/trace (median)  -0.1%
  ≈ Token efficiency  8,443 → 7,090 tokens/success (median)  -16.0%
  ▲ Cost / quality    $0.0385 → $0.0189 per success (median)  -51.0%

  Trace breakdown

  ~ Per trace         20 matched — ✓ 10 improved, ✗ 5 regressed

  Span breakdown

  ▼ Per span          5 matched — ✗ 1 regressed, ~ 4 mixed

  ──────────────────────────────────────────────────────────
  ~ MIXED — no quality gates configured

Add -v for per-task outcome changes, per-span breakdowns, and confidence intervals.

Why Kalibra

Aggregate metrics hide task-level regressions. Your success rate went from 80% to 82% — great. Except that five tasks which used to pass now fail, masked by eight previously-failing tasks that flipped to passing. Kalibra catches this.
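The masking scenario is plain arithmetic. A toy reconstruction in Python (invented task counts, not Kalibra's code):

```python
# Toy data: 150 tasks, same task set before and after a change.
baseline = {f"task-{i:03d}": i < 120 for i in range(150)}   # 120/150 pass -> 80%

current = dict(baseline)
for i in range(5):                  # five tasks that used to pass now fail
    current[f"task-{i:03d}"] = False
for i in range(120, 128):           # eight previously-failing tasks now pass
    current[f"task-{i:03d}"] = True

rate = lambda d: sum(d.values()) / len(d)
regressed = [t for t, ok in baseline.items() if ok and not current[t]]

print(f"{rate(baseline):.0%} -> {rate(current):.0%}")       # 80% -> 82%
print(f"{len(regressed)} regressed tasks hidden in the aggregate")
```

The aggregate rate goes up even though five tasks got worse, which is exactly what a per-task breakdown is for.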

  • 10 metrics — success rate, cost, duration, steps, error rate, tokens, token efficiency, cost/quality, per-task breakdown, per-span breakdown
  • Statistical rigor — bootstrap 95% CIs on continuous metrics, two-proportion z-test on rates, noise thresholds to ignore jitter
  • Quality gates — require: success_rate_delta >= -5 fails your CI pipeline (exit 1) when thresholds are violated
  • Any JSONL — flat traces, nested spans, non-standard field names. Use --suggest to auto-detect field mappings
  • Three output formats — terminal (human), markdown (PR comments), JSON (automation)
  • Two dependencies — click + pyyaml. No ML frameworks, no API keys
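For the curious, the percentile bootstrap mentioned above can be sketched in a few lines (illustrative only, with invented durations; Kalibra's internals may differ):

```python
import random

def bootstrap_ci(values, stat=lambda xs: sorted(xs)[len(xs) // 2],
                 n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a statistic (default: median)."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

durations = [7.1, 7.6, 8.0, 7.4, 9.2, 7.8, 6.9, 8.5, 7.2, 7.7]
lo, hi = bootstrap_ci(durations)
print(f"median duration 95% CI: [{lo}, {hi}]")
```

Resampling the data with replacement and taking the 2.5th/97.5th percentiles of the resampled statistic gives an interval that needs no distributional assumptions, which is why it suits noisy agent traces.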

Quickstart

1. Install

pip install kalibra

2. Try the demo

kalibra demo

This creates a kalibra-demo/ directory with sample traces and runs a comparison. Afterwards:

kalibra compare kalibra-demo/baseline.jsonl kalibra-demo/current.jsonl -v

3. Compare your own data

If your fields don't match the defaults, let Kalibra figure it out:

kalibra inspect your-traces.jsonl --suggest

This scans your data and prints a copy-pasteable compare command with the right --outcome, --cost, --trace-id flags.
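What --suggest does internally isn't documented here; conceptually, though, detecting an outcome field amounts to scanning records for paths whose values look like outcomes, e.g. booleans. A naive, purely illustrative version:

```python
def boolean_paths(records, prefix=""):
    """Dot-paths whose value is a bool in every record -- a naive stand-in
    for the kind of scan --suggest performs (hypothetical logic)."""
    keys = set().union(*(r.keys() for r in records))
    found = []
    for key in sorted(keys):
        values = [r[key] for r in records if key in r]
        if values and all(isinstance(v, dict) for v in values):
            found += boolean_paths(values, prefix=f"{prefix}{key}.")
        elif values and all(isinstance(v, bool) for v in values):
            found.append(f"{prefix}{key}")
    return found

traces = [
    {"metadata": {"result": True}, "usage": {"total_cost": 0.03}},
    {"metadata": {"result": False}, "usage": {"total_cost": 0.05}},
]
print(boolean_paths(traces))  # ['metadata.result']
```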

Quality gates for CI

# kalibra.yml
baseline:
  path: ./baselines/production.jsonl
current:
  path: ./eval-output/canary.jsonl

require:
  - success_rate_delta >= -2     # max 2pp success rate drop
  - regressions <= 5             # max 5 tasks regressed
  - cost_delta_pct <= 20         # max 20% cost increase
  - span_regressions <= 3        # max 3 span types regressed

kalibra compare        # reads kalibra.yml, exits 1 on failure
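The require rules read like simple comparisons. A minimal sketch of how such gates could be evaluated (assumed semantics, not Kalibra's parser):

```python
import operator
import re

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def check_gate(expr, metrics):
    """Evaluate a rule like 'success_rate_delta >= -2' against computed metrics."""
    name, op, value = re.match(r"(\w+)\s*(>=|<=|>|<)\s*(-?[\d.]+)", expr).groups()
    return OPS[op](metrics[name], float(value))

metrics = {"success_rate_delta": -1.2, "regressions": 7, "cost_delta_pct": 4.0}
rules = ["success_rate_delta >= -2", "regressions <= 5", "cost_delta_pct <= 20"]
failures = [r for r in rules if not check_gate(r, metrics)]
print("exit", 1 if failures else 0, failures)  # exit 1 ['regressions <= 5']
```

Any violated rule produces a non-zero exit, which is all a CI job needs to block the merge.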

Field mapping

Kalibra works with any JSONL shape. Map your fields in config or on the command line:

# kalibra.yml
fields:
  outcome: metadata.result
  cost: agent_cost.total_cost
  task_id: metadata.task_name

kalibra compare a.jsonl b.jsonl --outcome metadata.result --cost usage.total_cost

Comparing files with different schemas? Override fields per source:

baseline:
  path: ./langfuse.jsonl
  fields: { outcome: metadata.result, cost: usage.total_cost }
current:
  path: ./braintrust.jsonl
  fields: { outcome: scores.correctness, cost: metrics.cost }
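The dotted names (metadata.result, usage.total_cost) suggest path lookups through nested JSON. A minimal resolver, for illustration only:

```python
from functools import reduce

def resolve(record, path, default=None):
    """Follow a dot-separated path ('metadata.result') through nested dicts."""
    def step(node, key):
        return node.get(key, default) if isinstance(node, dict) else default
    return reduce(step, path.split("."), record)

trace = {"metadata": {"result": True}, "usage": {"total_cost": 0.031}}
print(resolve(trace, "metadata.result"))     # True
print(resolve(trace, "usage.total_cost"))    # 0.031
print(resolve(trace, "scores.correctness"))  # None
```

A missing segment anywhere along the path falls back to the default instead of raising, which is what you want when two trace sources have different schemas.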

Python API

from kalibra.loader import load_traces
from kalibra.engine import compare
from kalibra.renderers import render

baseline = load_traces("baseline.jsonl")
current = load_traces("current.jsonl")

result = compare(baseline, current, require=["success_rate_delta >= -5"])
print(render(result, "terminal", verbose=True))
print("passed:", result.passed)

Commands

kalibra compare [a.jsonl b.jsonl]     Compare traces — flags, config, or positional args
kalibra compare -v                    Verbose — CIs, per-task/per-span detail
kalibra compare --format markdown     Markdown for PR comments
kalibra compare --format json         Machine-readable JSON
kalibra compare --metrics             List all threshold fields
kalibra inspect traces.jsonl          Show data coverage and fields
kalibra inspect traces.jsonl --suggest  Auto-detect field mappings
kalibra init                          Create kalibra.yml interactively
kalibra demo                          Run comparison on built-in sample data

Development

git clone https://github.com/khan5v/kalibra.git
cd kalibra
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
