Regression detection and CI quality gates for AI agents.

Kalibra
The diff tool for AI agent runs.
The CLI that catches what the dashboard misses.

Kalibra catching a hidden regression — success rate flat at 80%, but 2 task types regressed

Success rate: 80% → 80%. Duration: flat. Tokens: flat. Everything looks the same — but 2 task types that always passed started failing, and 2 that always failed started passing. The aggregate hid it. The per-task breakdown caught it.
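The effect described above can be reproduced in a few lines of plain Python. This is an illustrative sketch, not Kalibra's implementation: the record shape (`task_id`, `success`), the helper names, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def success_rate(runs):
    """Fraction of runs whose 'success' flag is true."""
    return sum(r["success"] for r in runs) / len(runs)


def per_task_rates(runs):
    """Map each task_id to its own success rate."""
    grouped = defaultdict(list)
    for r in runs:
        grouped[r["task_id"]].append(r)
    return {task: success_rate(rs) for task, rs in grouped.items()}


def task_flips(baseline, current):
    """Return (regressed, improved) task ids.

    regressed: always passed in baseline, always fails in current.
    improved:  always failed in baseline, always passes in current.
    """
    before, after = per_task_rates(baseline), per_task_rates(current)
    shared = sorted(before.keys() & after.keys())
    regressed = [t for t in shared if before[t] == 1.0 and after[t] == 0.0]
    improved = [t for t in shared if before[t] == 0.0 and after[t] == 1.0]
    return regressed, improved


# Hypothetical data: 10 task types, one run each, 80% success in both files.
baseline = [{"task_id": f"t{i}", "success": i >= 2} for i in range(10)]  # t0, t1 fail
current = [{"task_id": f"t{i}", "success": i < 8} for i in range(10)]    # t8, t9 fail

regressed, improved = task_flips(baseline, current)
```

Both files have the same aggregate success rate, yet `regressed` and `improved` each contain two task ids. A check on the aggregate alone would not see them.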

"Unsuccessful AI products almost always share a common root cause: a failure to create robust evaluation systems." — Hamel Husain, Your AI Product Needs Evals


pip install kalibra
kalibra compare baseline.jsonl current.jsonl -v
kalibra demo    # try it with sample data

Who this is for

  • Teams running agent evals in CI who want a regression gate without adopting a dashboard
  • Anyone who's been burned by averages hiding regressions
  • Developers who prefer a CLI and a config file over another UI to log into

What it doesn't do

  • Not a tracing backend. It reads Phoenix, OTel GenAI, Langfuse, and flat JSONL exports.
  • Not a dashboard. Output is terminal text, markdown, or JSON.
  • Not an LLM judge. No model calls, no API keys, no evaluator prompts.
  • Doesn't replace Phoenix or Langfuse. It compares the traces they produce.

What it does

  • Statistically transparent — two-proportion z-test on rates, percentile bootstrap (n=1000) on continuous metrics. Every number has a named method behind it.
  • Quality gates: a rule such as regressions <= 2 fails your CI pipeline (exit 1) when thresholds are violated
  • Per-task and per-span breakdown — catches regressions that cancel out in the aggregate
  • Two dependencies — click + pyyaml. No ML frameworks, no API keys, no LLM calls
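For readers unfamiliar with the two named methods, the following standard-library sketch shows what each one computes. It is not taken from Kalibra's source. The function names, the pooled standard error, the two-sided p-value, and the use of the mean as the bootstrapped statistic are assumptions; only the method names and n=1000 come from the list above.

```python
import math
import random
from statistics import NormalDist, mean


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both samples are all-pass or all-fail
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def bootstrap_ci_mean_diff(a, b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(b) - mean(a).

    Assumption: the statistic is the difference of means; a real tool
    might bootstrap a median or another summary instead.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(a, k=len(a))  # sample with replacement
        resample_b = rng.choices(b, k=len(b))
        diffs.append(mean(resample_b) - mean(resample_a))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical counts: 90/100 successes before, 70/100 after.
z, p = two_proportion_z_test(90, 100, 70, 100)
```

The z-test applies to rates such as success rate. The bootstrap applies to continuous metrics such as duration, tokens, or cost, where no distributional assumption is made.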

Quality gates for CI

# kalibra.yml
baseline:
  path: ./baselines/production.jsonl
current:
  path: ./eval-output/canary.jsonl

require:
  - success_rate_delta >= -2     # max 2pp success rate drop
  - regressions <= 5             # max 5 tasks regressed
  - cost_delta_pct <= 20         # max 20% cost increase

kalibra compare        # reads kalibra.yml, exits 1 on failure
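To make the gate semantics concrete, here is a minimal sketch of how expressions like the ones under `require` could be checked against a set of computed metrics. It is an assumption-laden illustration, not Kalibra's parser: the regular expression, the supported operators, the helper names, and the metric values are all hypothetical.

```python
import operator
import re

# Comparison operators a gate expression may use (assumed set).
_OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
    "==": operator.eq,
}
# "<metric> <op> <number>", e.g. "success_rate_delta >= -2"
_GATE_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def check_gate(expression, metrics):
    """Return True if the named metric satisfies the threshold."""
    match = _GATE_RE.match(expression)
    if match is None:
        raise ValueError(f"cannot parse gate: {expression!r}")
    name, op, threshold = match.groups()
    return _OPS[op](metrics[name], float(threshold))


def exit_code(gates, metrics):
    """0 when every gate passes, 1 otherwise."""
    return 0 if all(check_gate(g, metrics) for g in gates) else 1


gates = ["success_rate_delta >= -2", "regressions <= 5", "cost_delta_pct <= 20"]
# Hypothetical metric values for one comparison.
metrics = {"success_rate_delta": -1.5, "regressions": 7, "cost_delta_pct": 4.0}

failed = [g for g in gates if not check_gate(g, metrics)]
```

With these sample values only the `regressions <= 5` gate fails, which is enough to produce a non-zero exit code and stop the pipeline.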

GitHub Actions

- uses: khan5v/kalibra-action@v1
  with:
    baseline: baselines/production.jsonl
    current: current.jsonl
    config: kalibra.yml

Posts a markdown report as a PR comment. Exits 1 on gate failure.

Full workflow example

name: Agent Quality Gate
on: [pull_request]

jobs:
  kalibra:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
      - run: python eval.py --output current.jsonl
      - uses: khan5v/kalibra-action@v1
        with:
          baseline: baselines/production.jsonl
          current: current.jsonl
          config: kalibra.yml

Integrations

Kalibra auto-detects trace formats. Each tutorial works without an API key.

| Integration | Trace format | Demo scenario | Tutorial |
| --- | --- | --- | --- |
| Phoenix / OpenInference | llm.*, openinference.* | Multi-step agent with span tree aggregation | Open in Colab |
| OTel GenAI | gen_ai.* | Truncation regression hidden by aggregate improvement | Open in Colab |
| CrewAI | Flat JSONL | Failure redistribution and cost explosion | Open in Colab |

Filtering with where

Split a single trace file into populations using Prometheus-style matchers:

sources:
  baseline:
    path: ./traces.jsonl
    where:
      - variant == baseline
  current:
    path: ./traces.jsonl
    where:
      - variant == current

Operators: == (equal), != (not equal), =~ (regex match), !~ (regex not match). Multiple matchers are ANDed. Traces missing the field are excluded.
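The matcher rules above can be sketched as a small filter. This is an illustration under stated assumptions, not Kalibra's code: it only looks at top-level fields, compares values as strings, and anchors regexes to the whole value (as Prometheus does), none of which is confirmed by this README. All function names are hypothetical.

```python
import re

# "<field> <op> <value>", e.g. "variant == baseline"
_MATCHER_RE = re.compile(r"^\s*([\w.]+)\s*(==|!=|=~|!~)\s*(.+?)\s*$")


def parse_matcher(text):
    """Split a matcher string into (field, operator, value)."""
    m = _MATCHER_RE.match(text)
    if m is None:
        raise ValueError(f"cannot parse matcher: {text!r}")
    return m.groups()


def matches(trace, matcher):
    """Apply one matcher to one trace (a flat dict in this sketch)."""
    field, op, expected = parse_matcher(matcher)
    if field not in trace:
        return False  # traces missing the field are excluded
    actual = str(trace[field])
    if op == "==":
        return actual == expected
    if op == "!=":
        return actual != expected
    # Assumption: regexes must match the whole value.
    hit = re.fullmatch(expected, actual) is not None
    return hit if op == "=~" else not hit


def where(traces, matchers):
    """Keep traces that satisfy every matcher (logical AND)."""
    return [t for t in traces if all(matches(t, m) for m in matchers)]


# Hypothetical traces; the last one has no 'variant' field.
traces = [
    {"variant": "baseline", "task_id": "a"},
    {"variant": "current", "task_id": "a"},
    {"task_id": "b"},
]
baseline = where(traces, ["variant == baseline"])
current = where(traces, ["variant =~ cur.*", "task_id == a"])
```

Note that the trace without a `variant` field is dropped even by a negative matcher such as `variant != baseline`, which follows from the "missing the field" rule.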

Field mapping

Kalibra works with any JSONL shape. Map your fields in config or on the command line:

fields:
  outcome: metadata.result
  cost: agent_cost.total_cost
  task_id: metadata.task_name

kalibra compare a.jsonl b.jsonl --outcome metadata.result --cost usage.total_cost
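A dotted path such as `metadata.result` is most naturally read as a walk through nested JSON objects. The sketch below shows that reading. It is an assumption about how such a mapping could work, not Kalibra's resolver, and the helper names and the sample record are hypothetical.

```python
import json


def get_path(record, dotted_path, default=None):
    """Follow a dotted path such as 'metadata.result' through nested dicts."""
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default  # path does not exist in this record
        node = node[key]
    return node


def normalize(record, fields):
    """Build a flat record keyed by the canonical names in `fields`."""
    return {name: get_path(record, path) for name, path in fields.items()}


# Same mapping as the config example above.
fields = {
    "outcome": "metadata.result",
    "cost": "agent_cost.total_cost",
    "task_id": "metadata.task_name",
}
# Hypothetical JSONL line.
line = (
    '{"metadata": {"result": "success", "task_name": "refund"}, '
    '"agent_cost": {"total_cost": 0.12}}'
)
row = normalize(json.loads(line), fields)
```

Here `row` ends up with the three canonical keys regardless of how deeply the source values were nested. A missing path yields `None` rather than raising.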

Override fields per source for different schemas:

baseline:
  path: ./langfuse.jsonl
  fields: { outcome: metadata.result, cost: usage.total_cost }
current:
  path: ./braintrust.jsonl
  fields: { outcome: scores.correctness, cost: metrics.cost }

Python API

from kalibra.loader import load_traces
from kalibra.engine import compare
from kalibra.renderers import render

baseline = load_traces("baseline.jsonl")
current = load_traces("current.jsonl")

result = compare(baseline, current, require=["success_rate_delta >= -5"])
print(render(result, "terminal", verbose=True))
print("passed:", result.passed)

Development

git clone https://github.com/khan5v/kalibra.git
cd kalibra
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
