Regression detection and CI quality gates for AI agents.

Kalibra
The diff tool for AI agent runs.
The CLI that catches what the dashboard misses.

Kalibra catching a hidden regression — success rate flat at 80%, but 2 task types regressed

Success rate: 80% → 80%. Duration: flat. Tokens: flat. Everything looks the same — but 2 task types that always passed started failing, and 2 that always failed started passing. The aggregate hid it. The per-task breakdown caught it.
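
The effect is easy to reproduce by hand. The sketch below is not Kalibra's implementation; it uses made-up data (ten hypothetical task types, one run each) and illustrative helper names to show how two runs can share an 80% success rate while four tasks change outcome.

```python
from collections import defaultdict

def success_by_task(runs):
    """Return {task_id: success_rate} for a list of (task_id, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # task_id -> [passes, run count]
    for task_id, passed in runs:
        totals[task_id][0] += int(passed)
        totals[task_id][1] += 1
    return {task: passes / count for task, (passes, count) in totals.items()}

def overall_rate(runs):
    """Aggregate success rate across all runs."""
    return sum(int(passed) for _, passed in runs) / len(runs)

def flipped_tasks(baseline, current):
    """Split shared tasks into (regressed, improved) by per-task success rate."""
    before, after = success_by_task(baseline), success_by_task(current)
    shared = sorted(before.keys() & after.keys())
    regressed = [t for t in shared if after[t] < before[t]]
    improved = [t for t in shared if after[t] > before[t]]
    return regressed, improved

# Hypothetical data: 8 of 10 tasks pass in each run, but not the same 8.
baseline = [(f"task-{i}", i < 8) for i in range(10)]   # task-8 and task-9 fail
current = [(f"task-{i}", i >= 2) for i in range(10)]   # task-0 and task-1 fail

regressed, improved = flipped_tasks(baseline, current)
print("aggregate:", overall_rate(baseline), "->", overall_rate(current))
print("regressed:", regressed)
print("improved:", improved)
```

The aggregate line shows no movement; only the per-task lists reveal the change.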

"Unsuccessful AI products almost always share a common root cause: a failure to create robust evaluation systems." — Hamel Husain, Your AI Product Needs Evals


pip install kalibra
kalibra compare baseline.jsonl current.jsonl -v
kalibra demo    # try it with sample data

Who this is for

  • Teams running agent evals in CI who want a regression gate without adopting a dashboard
  • Anyone who's been burned by averages hiding regressions
  • Developers who prefer a CLI and a config file over another UI to log into

What it doesn't do

  • Not a tracing backend. It reads Phoenix, OTel GenAI, Langfuse, and flat JSONL exports.
  • Not a dashboard. Output is terminal text, markdown, or JSON.
  • Not an LLM judge. No model calls, no API keys, no evaluator prompts.
  • Doesn't replace Phoenix or Langfuse. It compares the traces they produce.

What it does

  • Statistically transparent — two-proportion z-test on rates, percentile bootstrap (n=1000) on continuous metrics. Every number has a named method behind it.
  • Significance-gated thresholds: success_rate_delta >= -2 fails your CI pipeline (exit 1) only when the change is statistically significant; insignificant deltas skip the gate instead of failing it
  • Per-task and per-span breakdown — catches regressions that cancel out in the aggregate
  • Two dependencies — click + pyyaml. No ML frameworks, no API keys, no LLM calls
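
For readers who want to see what the two named methods compute, here is a minimal standard-library sketch. It is not Kalibra's code. The function names, the pooled standard error, the choice of the median as the bootstrapped statistic, and the 95% interval are all assumptions made for illustration; only the method names and n=1000 come from the list above.

```python
import math
import random
import statistics

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test with a pooled standard error.

    Returns (z, p_value) for the difference rate_b - rate_a.
    """
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both samples are all-pass or all-fail
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return z, p_value

def bootstrap_median_delta_ci(baseline, current, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for median(current) - median(baseline)."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample each population with replacement, keeping its size.
        resampled_b = rng.choices(baseline, k=len(baseline))
        resampled_c = rng.choices(current, k=len(current))
        deltas.append(statistics.median(resampled_c) - statistics.median(resampled_b))
    deltas.sort()
    lo = deltas[round((alpha / 2) * n_resamples)]
    hi = deltas[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Identical rates: no evidence of a change.
print(two_proportion_z_test(80, 100, 80, 100))
# 90/100 -> 70/100: a drop large enough to be significant at the 5% level.
print(two_proportion_z_test(90, 100, 70, 100))
```

If the bootstrap interval for a continuous metric excludes zero, the change would be treated as significant under this sketch.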

Quality gates for CI

# kalibra.yml
baseline:
  path: ./baselines/production.jsonl
current:
  path: ./eval-output/canary.jsonl

require:
  - success_rate_delta >= -2     # max 2pp success rate drop
  - regressions <= 5             # max 5 tasks regressed
  - cost_delta_pct <= 20         # max 20% cost increase

kalibra compare        # reads kalibra.yml, exits 1 on failure
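
To make the gate semantics concrete, the following sketch evaluates `require` expressions of the form shown above. It is a hypothetical model, not Kalibra's parser: the `evaluate_gate` and `exit_code` helpers, the pass/fail/skip labels, and the hand-supplied `metrics` and `significant` dictionaries are assumptions. Metrics without a significance flag (such as the `regressions` count) are assumed to be always gated.

```python
import operator
import re

OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
# "<metric> <operator> <number>", for example "success_rate_delta >= -2"
GATE_RE = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")

def evaluate_gate(expr, metrics, significant):
    """Return 'pass', 'fail', or 'skip' for one require expression."""
    match = GATE_RE.match(expr)
    if match is None:
        raise ValueError(f"cannot parse gate: {expr!r}")
    name, op, threshold = match.group(1), match.group(2), float(match.group(3))
    # Assumption: an insignificant change skips the gate rather than failing it.
    if not significant.get(name, True):
        return "skip"
    return "pass" if OPS[op](metrics[name], threshold) else "fail"

def exit_code(results):
    """Exit 1 if any gate failed; skipped gates do not fail the run."""
    return 1 if "fail" in results else 0

gates = ["success_rate_delta >= -2", "regressions <= 5", "cost_delta_pct <= 20"]
# Hypothetical comparison output.
metrics = {"success_rate_delta": -3.0, "regressions": 2, "cost_delta_pct": 25.0}
significant = {"success_rate_delta": False, "cost_delta_pct": True}

results = [evaluate_gate(g, metrics, significant) for g in gates]
print(list(zip(gates, results)), "exit", exit_code(results))
```

In this made-up run the 3pp success-rate drop is not significant, so its gate is skipped, while the significant 25% cost increase fails the pipeline.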

GitHub Actions

- uses: khan5v/kalibra-action@v1
  with:
    baseline: baselines/production.jsonl
    current: current.jsonl
    config: kalibra.yml

Posts a markdown report as a PR comment. Exits 1 on gate failure.

Full workflow example

name: Agent Quality Gate
on: [pull_request]

jobs:
  kalibra:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
      - run: python eval.py --output current.jsonl
      - uses: khan5v/kalibra-action@v1
        with:
          baseline: baselines/production.jsonl
          current: current.jsonl
          config: kalibra.yml

Integrations

Kalibra auto-detects trace formats. Each tutorial works without an API key.

| Integration | Trace format | Demo scenario | Tutorial |
| --- | --- | --- | --- |
| Phoenix / OpenInference | llm.*, openinference.* | Multi-step agent with span tree aggregation | Open in Colab |
| OTel GenAI | gen_ai.* | Truncation regression hidden by aggregate improvement | Open in Colab |
| CrewAI | Flat JSONL | Failure redistribution and cost explosion | Open in Colab |

Filtering with where

Split a single trace file into populations using Prometheus-style matchers:

sources:
  baseline:
    path: ./traces.jsonl
    where:
      - variant == baseline
  current:
    path: ./traces.jsonl
    where:
      - variant == current

Operators: == (equal), != (not equal), =~ (regex match), !~ (regex not match). Multiple matchers are ANDed. Traces missing the field are excluded.
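
The matcher rules can be sketched in a few lines of Python. This is an illustration of the semantics described above, not Kalibra's code: the helper names are hypothetical, field lookup is limited to top-level keys, and regex matchers are assumed to be fully anchored, as they are in Prometheus.

```python
import re

# "<field> <operator> <value>", for example "variant == baseline"
MATCHER_RE = re.compile(r"^\s*([\w.]+)\s*(==|!=|=~|!~)\s*(.+?)\s*$")

def parse_matcher(text):
    """Split a matcher string into (field, operator, value)."""
    match = MATCHER_RE.match(text)
    if match is None:
        raise ValueError(f"cannot parse matcher: {text!r}")
    return match.groups()

def matches(trace, matcher):
    """Apply one matcher to one trace (a flat dict in this sketch)."""
    field, op, value = matcher
    if field not in trace:
        return False  # traces missing the field are excluded
    actual = str(trace[field])
    if op == "==":
        return actual == value
    if op == "!=":
        return actual != value
    # Assumption: regexes must match the whole value (Prometheus convention).
    found = re.fullmatch(value, actual) is not None
    return found if op == "=~" else not found

def filter_traces(traces, where):
    """Keep traces that satisfy every matcher (matchers are ANDed)."""
    parsed = [parse_matcher(w) for w in where]
    return [t for t in traces if all(matches(t, m) for m in parsed)]

# Hypothetical traces; "agent" is an illustrative field name.
traces = [
    {"id": 1, "variant": "baseline", "agent": "planner"},
    {"id": 2, "variant": "current", "agent": "planner"},
    {"id": 3, "agent": "planner"},  # no variant field
]
print([t["id"] for t in filter_traces(traces, ["variant == baseline"])])
print([t["id"] for t in filter_traces(traces, ["variant != baseline"])])
print([t["id"] for t in filter_traces(traces, ["variant =~ base.*", "agent == planner"])])
```

Note that trace 3 is dropped even by the `!=` matcher, because it has no `variant` field at all.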

Field mapping

Kalibra works with any JSONL shape. Map your fields in config or on the command line:

fields:
  outcome: metadata.result
  cost: agent_cost.total_cost
  task_id: metadata.task_name

kalibra compare a.jsonl b.jsonl --outcome metadata.result --cost usage.total_cost
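
A dotted path such as `metadata.result` names a key nested inside each JSONL record. The sketch below shows one plausible way to resolve such paths. The `get_path` and `normalize` helpers and the sample record are hypothetical, and returning `None` for a missing key is an assumption rather than documented Kalibra behavior.

```python
import json

def get_path(record, dotted_path, default=None):
    """Follow a dotted path through nested dicts; return default if any key is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

def normalize(record, fields):
    """Project a raw record onto canonical field names using a path mapping."""
    return {name: get_path(record, path) for name, path in fields.items()}

# The mapping from the config example above.
fields = {
    "outcome": "metadata.result",
    "cost": "agent_cost.total_cost",
    "task_id": "metadata.task_name",
}

# One hypothetical JSONL line with that shape.
line = '{"metadata": {"result": "success", "task_name": "refund"}, "agent_cost": {"total_cost": 0.042}}'
print(normalize(json.loads(line), fields))
```

Per-source overrides would amount to calling `normalize` with a different `fields` mapping for each file.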

Override fields per source for different schemas:

baseline:
  path: ./langfuse.jsonl
  fields: { outcome: metadata.result, cost: usage.total_cost }
current:
  path: ./braintrust.jsonl
  fields: { outcome: scores.correctness, cost: metrics.cost }

Python API

from kalibra.loader import load_traces
from kalibra.engine import compare
from kalibra.renderers import render

baseline = load_traces("baseline.jsonl")
current = load_traces("current.jsonl")

result = compare(baseline, current, require=["success_rate_delta >= -5"])
print(render(result, "terminal", verbose=True))
print("passed:", result.passed)

Development

git clone https://github.com/khan5v/kalibra.git
cd kalibra
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
