
Kalibra
Regression detection and CI quality gates for AI agents.


You change a prompt, swap a model, or refactor a tool — did the agent get better or worse?

Kalibra compares two populations of traces and tells you, with statistical rigor, what changed — success rate, cost, latency, tokens, per-task regressions, per-span breakdowns. Two dependencies. One command.

pip install kalibra
kalibra demo
  Kalibra Compare
  ──────────────────────────────────────────────────────────
  Baseline       100 traces   (baseline.jsonl)
  Current        100 traces   (current.jsonl)
  Direction ~ MIXED

  Trace metrics

  ▲ Success rate      50.0% → 75.0%  +25.0 pp   (p=0.000)
  ▲ Cost              $0.0358 → $0.0213 median  -40.5%
  ▼ Duration          7.6s → 15.2s median  +99.1%
  ≈ Steps             4 → 4 steps/trace (median)  +0.0%
  ▼ Error rate        0.2% → 4.3%  +4.1 pp
  ≈ Token usage       7,746 → 7,738 tokens/trace (median)  -0.1%
  ≈ Token efficiency  8,443 → 7,090 tokens/success (median)  -16.0%
  ▲ Cost / quality    $0.0385 → $0.0189 per success (median)  -51.0%

  Trace breakdown

  ~ Per trace         20 matched — ✓ 10 improved, ✗ 5 regressed

  Span breakdown

  ▼ Per span          5 matched — ✗ 1 regressed, ~ 4 mixed

  ──────────────────────────────────────────────────────────
  ~ MIXED — no quality gates configured

Add -v for per-task outcome changes, per-span breakdowns, and confidence intervals.

Why Kalibra

Aggregate metrics hide task-level regressions. Your success rate went from 80% to 82% — great. Except that five tasks which used to pass now fail, masked by eight previously-failing tasks that flipped to passing. Kalibra catches this.
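The masking scenario is plain arithmetic. A toy reconstruction in Python (invented task counts, not Kalibra's code):

```python
# Toy data: 150 tasks, same task set before and after a change.
baseline = {f"task-{i:03d}": i < 120 for i in range(150)}   # 120/150 pass -> 80%

current = dict(baseline)
for i in range(5):                  # five tasks that used to pass now fail
    current[f"task-{i:03d}"] = False
for i in range(120, 128):           # eight previously-failing tasks now pass
    current[f"task-{i:03d}"] = True

rate = lambda d: sum(d.values()) / len(d)
regressed = [t for t, ok in baseline.items() if ok and not current[t]]

print(f"{rate(baseline):.0%} -> {rate(current):.0%}")       # 80% -> 82%
print(f"{len(regressed)} regressed tasks hidden in the aggregate")
```

The aggregate rate goes up even though five tasks got worse, which is exactly what a per-task breakdown is for.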

  • 10 metrics — success rate, cost, duration, steps, error rate, tokens, token efficiency, cost/quality, per-task breakdown, per-span breakdown
  • Statistical rigor — bootstrap 95% CIs on continuous metrics, two-proportion z-test on rates, noise thresholds to ignore jitter
  • Quality gates — require: success_rate_delta >= -5 fails your CI pipeline (exit 1) when thresholds are violated
  • Any JSONL — flat traces, nested spans, non-standard field names. Use --suggest to auto-detect field mappings
  • Three output formats — terminal (human), markdown (PR comments), JSON (automation)
  • Two dependencies — click + pyyaml. No ML frameworks, no API keys
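For the curious, the percentile bootstrap mentioned above can be sketched in a few lines (illustrative only, with invented durations; Kalibra's internals may differ):

```python
import random

def bootstrap_ci(values, stat=lambda xs: sorted(xs)[len(xs) // 2],
                 n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a statistic (default: median)."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

durations = [7.1, 7.6, 8.0, 7.4, 9.2, 7.8, 6.9, 8.5, 7.2, 7.7]
lo, hi = bootstrap_ci(durations)
print(f"median duration 95% CI: [{lo}, {hi}]")
```

Resampling the data with replacement and taking the 2.5th/97.5th percentiles of the resampled statistic gives an interval that needs no distributional assumptions, which is why it suits noisy agent traces.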

Quickstart

1. Install

pip install kalibra

2. Try the demo

kalibra demo

This creates a kalibra-demo/ directory with sample traces and runs a comparison. Afterwards:

kalibra compare kalibra-demo/baseline.jsonl kalibra-demo/current.jsonl -v

3. Compare your own data

If your fields don't match the defaults, let Kalibra figure it out:

kalibra inspect your-traces.jsonl --suggest

This scans your data and prints a copy-pasteable compare command with the right --outcome, --cost, --trace-id flags.
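What --suggest does internally isn't documented here; conceptually, though, detecting an outcome field amounts to scanning records for paths whose values look like outcomes, e.g. booleans. A naive, purely illustrative version:

```python
def boolean_paths(records, prefix=""):
    """Dot-paths whose value is a bool in every record -- a naive stand-in
    for the kind of scan --suggest performs (hypothetical logic)."""
    keys = set().union(*(r.keys() for r in records))
    found = []
    for key in sorted(keys):
        values = [r[key] for r in records if key in r]
        if values and all(isinstance(v, dict) for v in values):
            found += boolean_paths(values, prefix=f"{prefix}{key}.")
        elif values and all(isinstance(v, bool) for v in values):
            found.append(f"{prefix}{key}")
    return found

traces = [
    {"metadata": {"result": True}, "usage": {"total_cost": 0.03}},
    {"metadata": {"result": False}, "usage": {"total_cost": 0.05}},
]
print(boolean_paths(traces))  # ['metadata.result']
```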

Quality gates for CI

# kalibra.yml
baseline:
  path: ./baselines/production.jsonl
current:
  path: ./eval-output/canary.jsonl

require:
  - success_rate_delta >= -2     # max 2pp success rate drop
  - regressions <= 5             # max 5 tasks regressed
  - cost_delta_pct <= 20         # max 20% cost increase
  - span_regressions <= 3        # max 3 span types regressed

kalibra compare        # reads kalibra.yml, exits 1 on failure
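The require rules read like simple comparisons. A minimal sketch of how such gates could be evaluated (assumed semantics, not Kalibra's parser):

```python
import operator
import re

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

def check_gate(expr, metrics):
    """Evaluate a rule like 'success_rate_delta >= -2' against computed metrics."""
    name, op, value = re.match(r"(\w+)\s*(>=|<=|>|<)\s*(-?[\d.]+)", expr).groups()
    return OPS[op](metrics[name], float(value))

metrics = {"success_rate_delta": -1.2, "regressions": 7, "cost_delta_pct": 4.0}
rules = ["success_rate_delta >= -2", "regressions <= 5", "cost_delta_pct <= 20"]
failures = [r for r in rules if not check_gate(r, metrics)]
print("exit", 1 if failures else 0, failures)  # exit 1 ['regressions <= 5']
```

Any violated rule produces a non-zero exit, which is all a CI job needs to block the merge.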

Field mapping

Kalibra works with any JSONL shape. Map your fields in config or on the command line:

# kalibra.yml
fields:
  outcome: metadata.result
  cost: agent_cost.total_cost
  task_id: metadata.task_name

kalibra compare a.jsonl b.jsonl --outcome metadata.result --cost usage.total_cost

Comparing files with different schemas? Override fields per source:

baseline:
  path: ./langfuse.jsonl
  fields: { outcome: metadata.result, cost: usage.total_cost }
current:
  path: ./braintrust.jsonl
  fields: { outcome: scores.correctness, cost: metrics.cost }
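The dotted names (metadata.result, usage.total_cost) suggest path lookups through nested JSON. A minimal resolver, for illustration only:

```python
from functools import reduce

def resolve(record, path, default=None):
    """Follow a dot-separated path ('metadata.result') through nested dicts."""
    def step(node, key):
        return node.get(key, default) if isinstance(node, dict) else default
    return reduce(step, path.split("."), record)

trace = {"metadata": {"result": True}, "usage": {"total_cost": 0.031}}
print(resolve(trace, "metadata.result"))     # True
print(resolve(trace, "usage.total_cost"))    # 0.031
print(resolve(trace, "scores.correctness"))  # None
```

A missing segment anywhere along the path falls back to the default instead of raising, which is what you want when two trace sources have different schemas.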

Python API

from kalibra.loader import load_traces
from kalibra.engine import compare
from kalibra.renderers import render

baseline = load_traces("baseline.jsonl")
current = load_traces("current.jsonl")

result = compare(baseline, current, require=["success_rate_delta >= -5"])
print(render(result, "terminal", verbose=True))
print("passed:", result.passed)

Commands

kalibra compare [a.jsonl b.jsonl]     Compare traces — flags, config, or positional args
kalibra compare -v                    Verbose — CIs, per-task/per-span detail
kalibra compare --format markdown     Markdown for PR comments
kalibra compare --format json         Machine-readable JSON
kalibra compare --metrics             List all threshold fields
kalibra inspect traces.jsonl          Show data coverage and fields
kalibra inspect traces.jsonl --suggest  Auto-detect field mappings
kalibra init                          Create kalibra.yml interactively
kalibra demo                          Run comparison on built-in sample data

Development

git clone https://github.com/khan5v/kalibra.git
cd kalibra
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
