Regression detection and CI quality gates for AI agents.

Kalibra

Kalibra catching a hidden regression — success rate flat at 80%, but 2 task types regressed

Success rate: 80% → 80%. Duration: flat. Tokens: flat. Everything looks the same — but 2 task types that always passed started failing, and 2 that always failed started passing. The aggregate hid it. The per-task breakdown caught it.
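To make the scenario above concrete, here is a minimal sketch of a per-task breakdown on made-up data. The record shape (`task_id`, `success`) and the helper names are assumptions for illustration only, not Kalibra's internal representation.

```python
from collections import defaultdict

def success_by_task(traces):
    """Return {task_id: success_rate} for records shaped like {"task_id", "success"}."""
    totals = defaultdict(lambda: [0, 0])  # task_id -> [successes, runs]
    for trace in traces:
        totals[trace["task_id"]][0] += int(trace["success"])
        totals[trace["task_id"]][1] += 1
    return {task: ok / runs for task, (ok, runs) in totals.items()}

def overall_rate(traces):
    """Aggregate success rate across all traces."""
    return sum(int(t["success"]) for t in traces) / len(traces)

# Hypothetical data: 10 task types, one run each.
baseline = [{"task_id": f"task-{i}", "success": i >= 2} for i in range(10)]  # task-0, task-1 fail
current = [{"task_id": f"task-{i}", "success": i < 8} for i in range(10)]    # task-8, task-9 fail

base_rates = success_by_task(baseline)
curr_rates = success_by_task(current)
regressed = sorted(t for t in base_rates if curr_rates.get(t, 0.0) < base_rates[t])
improved = sorted(t for t in base_rates if curr_rates.get(t, 0.0) > base_rates[t])

print(overall_rate(baseline), overall_rate(current))  # 0.8 0.8
print(regressed)  # ['task-8', 'task-9']
print(improved)   # ['task-0', 'task-1']
```

The aggregate is identical in both runs, yet two tasks regressed and two improved, which is only visible once results are grouped by task.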


pip install kalibra
kalibra compare baseline.jsonl current.jsonl -v
kalibra demo    # try it with sample data

  • Statistical rigor: bootstrap 95% CIs on continuous metrics, two-proportion z-test on rates
  • Quality gates: a rule such as regressions <= 2 fails your CI pipeline (exit 1) when its threshold is violated
  • Per-task and per-span breakdown: catches regressions that cancel out in the aggregate
  • Two dependencies: click + pyyaml. No ML frameworks, no API keys, no LLM calls
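For readers unfamiliar with the two statistical methods named above, the following standard-library sketch shows what a pooled two-proportion z-test and a percentile bootstrap confidence interval compute. It is a textbook illustration, not Kalibra's implementation; the function names, the resample count, and the sample data are all assumptions.

```python
import math
import random
import statistics

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both samples all-pass or all-fail: no evidence of a difference
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

def bootstrap_mean_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(a, k=len(a))  # sample with replacement
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resample_b) - statistics.fmean(resample_a))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical success counts: 80/100 in the baseline, 65/100 in the current run.
z, p = two_proportion_z_test(80, 100, 65, 100)

# Hypothetical durations in seconds; the current run is uniformly 1 s slower.
base_durations = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1]
curr_durations = [d + 1.0 for d in base_durations]
lo, hi = bootstrap_mean_diff_ci(base_durations, curr_durations)
print(f"z={z:.2f} p={p:.4f} duration diff 95% CI=({lo:.2f}, {hi:.2f})")
```

A confidence interval that excludes zero, or a small p-value, suggests the change is unlikely to be sampling noise.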

Quality gates for CI

# kalibra.yml
baseline:
  path: ./baselines/production.jsonl
current:
  path: ./eval-output/canary.jsonl

require:
  - success_rate_delta >= -2     # max 2pp success rate drop
  - regressions <= 5             # max 5 tasks regressed
  - cost_delta_pct <= 20         # max 20% cost increase

kalibra compare        # reads kalibra.yml, exits 1 on failure
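Conceptually, each `require` entry is a metric name, a comparison operator, and a numeric threshold. The sketch below evaluates such expressions against a dictionary of computed metrics. The parser, the accepted operator set, and the metric values are assumptions made for illustration; Kalibra's own gate grammar may differ.

```python
import operator
import re

# Comparison operators accepted in a gate expression (assumed set).
_OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}
# Two-character operators come first so "<=" is not parsed as "<".
_GATE = re.compile(r"^\s*(\w+)\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")

def check_gates(metrics, gates):
    """Return the gate expressions that the given metrics violate."""
    failures = []
    for gate in gates:
        match = _GATE.match(gate)
        if match is None:
            raise ValueError(f"unparseable gate: {gate!r}")
        name, op, threshold = match.groups()
        if not _OPS[op](metrics[name], float(threshold)):
            failures.append(gate)
    return failures

# Hypothetical metric values for one comparison.
metrics = {"success_rate_delta": -3.5, "regressions": 2, "cost_delta_pct": 12.0}
gates = ["success_rate_delta >= -2", "regressions <= 5", "cost_delta_pct <= 20"]

failed = check_gates(metrics, gates)
exit_code = 1 if failed else 0  # a CI wrapper would pass this to sys.exit()
print(failed, exit_code)  # ['success_rate_delta >= -2'] 1
```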

GitHub Actions

- uses: khan5v/kalibra-action@v1
  with:
    baseline: baselines/production.jsonl
    current: current.jsonl
    config: kalibra.yml

The action posts a Markdown report as a PR comment and exits 1 on gate failure.

Full workflow example

name: Agent Quality Gate
on: [pull_request]

jobs:
  kalibra:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
      - run: python eval.py --output current.jsonl
      - uses: khan5v/kalibra-action@v1
        with:
          baseline: baselines/production.jsonl
          current: current.jsonl
          config: kalibra.yml

Integrations

Kalibra auto-detects trace formats. Each tutorial works without an API key.

| Integration | Trace format | Demo scenario | Tutorial |
| --- | --- | --- | --- |
| Phoenix / OpenInference | llm.*, openinference.* | Multi-step agent with span tree aggregation | Open in Colab |
| OTel GenAI | gen_ai.* | Truncation regression hidden by aggregate improvement | Open in Colab |
| CrewAI | Flat JSONL | Failure redistribution and cost explosion | Open in Colab |

Filtering with where

Split a single trace file into populations using Prometheus-style matchers:

sources:
  baseline:
    path: ./traces.jsonl
    where:
      - variant == baseline
  current:
    path: ./traces.jsonl
    where:
      - variant == current

Operators: == (equal), != (not equal), =~ (regex match), !~ (regex not match). Multiple matchers are ANDed. Traces missing the field are excluded.
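The matcher semantics described above can be sketched in a few lines. This is an illustration under stated assumptions, not Kalibra's code: fields are looked up at the top level of each trace, values are compared as strings, and regexes are fully anchored as they are in Prometheus. The helper names and sample traces are hypothetical.

```python
import re

_MATCHER = re.compile(r"^\s*([\w.]+)\s*(==|!=|=~|!~)\s*(.+?)\s*$")

def parse_matcher(expr):
    """Split 'field OP value' into a (field, op, value) tuple."""
    match = _MATCHER.match(expr)
    if match is None:
        raise ValueError(f"unparseable matcher: {expr!r}")
    return match.groups()

def matches(trace, matchers):
    """True only if the trace satisfies every matcher (matchers are ANDed)."""
    for field, op, value in matchers:
        if field not in trace:
            return False  # traces missing the field are excluded
        actual = str(trace[field])
        if op == "==":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        else:  # "!~"
            ok = re.fullmatch(value, actual) is None
        if not ok:
            return False
    return True

# Hypothetical traces sharing one file.
traces = [
    {"variant": "baseline", "model": "small-v1"},
    {"variant": "current", "model": "small-v2"},
    {"model": "small-v2"},  # no "variant" field, so it is excluded
]
where = [parse_matcher("variant == current"), parse_matcher("model =~ small-.*")]
selected = [t for t in traces if matches(t, where)]
print(selected)  # [{'variant': 'current', 'model': 'small-v2'}]
```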

Field mapping

Kalibra works with any JSONL shape. Map your fields in config or on the command line:

fields:
  outcome: metadata.result
  cost: agent_cost.total_cost
  task_id: metadata.task_name

kalibra compare a.jsonl b.jsonl --outcome metadata.result --cost usage.total_cost
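A dotted path such as `metadata.result` names a value nested inside each JSONL record. The sketch below shows one plausible way to resolve such paths; the helper names and the sample record are assumptions for illustration, and Kalibra's actual resolver may handle lists or missing keys differently.

```python
def get_path(record, dotted_path, default=None):
    """Walk nested dicts along a dotted path, returning default if any key is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

def apply_field_map(record, fields):
    """Project a raw record onto canonical field names using a {name: dotted_path} map."""
    return {name: get_path(record, path) for name, path in fields.items()}

# Mapping taken from the config example; the raw record is hypothetical.
fields = {
    "outcome": "metadata.result",
    "cost": "agent_cost.total_cost",
    "task_id": "metadata.task_name",
}
raw = {
    "metadata": {"result": "success", "task_name": "refund-flow"},
    "agent_cost": {"total_cost": 0.042},
}
mapped = apply_field_map(raw, fields)
print(mapped)  # {'outcome': 'success', 'cost': 0.042, 'task_id': 'refund-flow'}
```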

Override fields per source for different schemas:

baseline:
  path: ./langfuse.jsonl
  fields: { outcome: metadata.result, cost: usage.total_cost }
current:
  path: ./braintrust.jsonl
  fields: { outcome: scores.correctness, cost: metrics.cost }

Python API

from kalibra.loader import load_traces
from kalibra.engine import compare
from kalibra.renderers import render

baseline = load_traces("baseline.jsonl")
current = load_traces("current.jsonl")

result = compare(baseline, current, require=["success_rate_delta >= -5"])
print(render(result, "terminal", verbose=True))
print("passed:", result.passed)

Development

git clone https://github.com/khan5v/kalibra.git
cd kalibra
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
