kensa

The open source agent evals harness

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

satyaborg

These details have not been verified by PyPI

Project links

Homepage

Project description

kensa - the open source agent evals harness

Tell your coding agent to evaluate an agent. Get a working eval suite in minutes.

Agent evals have a cold-start problem. Manual prompting is noisy and non-deterministic, but building a full eval harness from scratch is too much overhead for fast-shipping teams.

kensa is an opinionated CLI with bundled skills for evaluating agent codebases with tools your team already uses, like Claude Code and Codex.

That gets you from zero to an eval loop without building a custom harness first.

1. Tell your coding agent to evaluate the repo.
2. It reads the codebase and traces, identifies failure modes, and writes scenarios and judges.
3. It runs the evals with kensa.
4. You review and approve fixes, then it runs the evals again.

Works with all major coding agents that support skills and bash commands.

Installation

Skills + CLI (recommended)

npx skills add satyaborg/kensa   # install eval skills
uv add kensa                     # or: pip install kensa

This is the recommended setup for Codex, Cursor, OpenCode, Gemini CLI, and other agents. It installs five skills (audit-evals, generate-scenarios, generate-judges, validate-judge, diagnose-errors) plus the kensa CLI runtime.

Claude Code plugin

If you primarily use Claude Code, you can install kensa as a plugin instead:

/plugin marketplace add satyaborg/kensa
/plugin install kensa

Provider extras

Install the extra that matches your stack:

uv add "kensa[anthropic]"
uv add "kensa[openai]"
uv add "kensa[langchain]"
uv add "kensa[all]"

We're continuing to add more model providers and frameworks.

Quickstart

1. Ask your coding agent to evaluate the repo. For example:

> evaluate this agent

That is the primary workflow. The bundled skills inspect the codebase, scaffold .kensa/, and generate scenarios and judges.

2. Add instrumentation if needed

Coding agents will add missing instrumentation automatically, but you can also set it up manually by calling instrument() before importing your LLM SDK:

from kensa import instrument

instrument()

Manual setup mainly applies if you use kensa without the skills flow.

3. Run the evals

kensa eval

That runs the scenarios, applies checks, calls the judge if needed, and writes results you can inspect with kensa report.

Deterministic checks run before the LLM judge, so failures short-circuit without spending tokens.
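The gating described above can be sketched in a few lines. This is an illustration of the idea only, not kensa's implementation: `Verdict`, `evaluate`, `not_empty`, and `fake_judge` are hypothetical names, and the judge is a stand-in function rather than a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    passed: bool
    reason: str
    judge_called: bool


def evaluate(
    output: str,
    checks: List[Callable[[str], Optional[str]]],
    judge: Optional[Callable[[str], bool]] = None,
) -> Verdict:
    """Run cheap checks first; call the judge only if every check passes."""
    for check in checks:
        failure = check(output)  # a check returns None on success, else a reason
        if failure is not None:
            # Short-circuit: the judge is never invoked, so no tokens are spent.
            return Verdict(False, failure, judge_called=False)
    if judge is None:
        return Verdict(True, "all checks passed", judge_called=False)
    ok = judge(output)
    return Verdict(ok, "judge passed" if ok else "judge failed", judge_called=True)


def not_empty(output: str) -> Optional[str]:
    """Hypothetical deterministic check."""
    return None if output.strip() else "output is empty"


judge_inputs: List[str] = []


def fake_judge(output: str) -> bool:
    """Stand-in for an LLM judge; records what it was asked to grade."""
    judge_inputs.append(output)
    return True


print(evaluate("", [not_empty], fake_judge))    # fails a check, judge skipped
print(evaluate("P1", [not_empty], fake_judge))  # checks pass, judge runs
```

The point of the ordering is cost: a failed regex or budget check is free, while every judge call is a paid model request.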

4. Fix and iterate

Use the report output to tighten prompts, tools, and guards, then ask the coding agent to update the evals or diagnose failures.

Manual CLI Workflow

If you want to author evals and set up instrumentation by hand:

kensa init --blank
kensa doctor

Then create a scenario file, for example .kensa/scenarios/classify_ticket.yaml:

id: classify_ticket
name: Support ticket triage
description: Classify a support ticket by severity.
source: user

input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: python agent.py {{input}}

expected_outcome: Agent returns the correct priority label.

checks:
  - type: output_matches
    params: { pattern: "^P[123]$" }
    description: Output must be exactly P1, P2, or P3.
  - type: max_cost
    params: { max_usd: 0.05 }
    description: Stay under five cents.

criteria: |
  P1 is for outages or data loss affecting multiple users.
  The agent must classify based on business impact, not tone.
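To see what the `output_matches` pattern in this scenario accepts, here is a small standalone sketch using Python's `re` module. It does not call kensa: the helper name `output_matches` is borrowed from the check type for readability, and stripping surrounding whitespace before matching is an assumption, not documented kensa behavior.

```python
import re

# Same pattern as in the scenario file above.
PATTERN = re.compile(r"^P[123]$")


def output_matches(output: str) -> bool:
    # Assumption: surrounding whitespace is stripped before matching.
    return PATTERN.search(output.strip()) is not None


for sample in ["P1", "P3\n", "P4", "Priority: P1"]:
    print(repr(sample), output_matches(sample))
```

Only the bare labels P1, P2, and P3 pass. An agent that answers "Priority: P1" fails the check, which is usually what you want when downstream code parses the label.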

What You Type -> What Happens

$ npx skills add satyaborg/kensa
→ Installs the coding-agent skills that drive the eval workflow

$ uv add kensa
→ Adds the runtime that executes scenarios, judges, and reports

$ "evaluate this agent"
→ Your coding agent inspects the repo, writes evals, and helps run them

$ kensa eval
→ Runs scenarios, applies checks, calls the judge when needed, and writes a report

$ kensa analyze
→ Surfaces slow, expensive, flaky, or error-prone traces

Core Commands

Command	What it does
`kensa init`	Scaffold with an example agent and scenario
`kensa init --blank`	Scaffold directories only
`kensa doctor`	Check instrumentation, config, and environment readiness
`kensa run`	Execute scenarios and capture traces
`kensa judge`	Run deterministic checks and, if configured, an LLM judge
`kensa report`	Generate terminal, Markdown, JSON, or HTML output
`kensa eval`	Run + judge + report in one command
`kensa analyze`	Flag cost, latency, error, and looping anomalies in traces

kensa helps you test an agent the same way you test the rest of your software: with repeatable scenarios, clear pass/fail signals, and reports you can use in CI.

Architecture

In practice, agent evals often force a choice between vibes and infrastructure: either you test the agent manually with ad hoc prompts, or you spend weeks building a harness before you learn anything.

kensa exists to close that gap. The coding agent reads the codebase, identifies failure modes from past traces, and writes scenarios. The CLI runs those scenarios in subprocesses, captures traces, applies deterministic checks, and only calls the LLM judge when the cheap checks pass. The bundled skills connect those steps into a usable eval loop instead of leaving you to wire it together by hand.

That split is intentional:

- the coding agent decides what is worth testing
- the CLI executes the eval suite consistently
- the skills drive setup, scenario generation, judge authoring, and iteration
- deterministic checks gate expensive judge calls
- reports and traces make the results usable in CI and iteration
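As a rough picture of what "runs those scenarios in subprocesses" means, the sketch below executes a command in a child process and records its output, exit code, and duration. It is an assumption-laden illustration rather than kensa's code: `run_scenario` and the result keys are hypothetical, and a one-line Python command stands in for `python agent.py ...`.

```python
import subprocess
import sys
import time
from typing import Dict, List, Union


def run_scenario(argv: List[str], timeout_s: float = 30.0) -> Dict[str, Union[str, int, float]]:
    """Run one scenario command in a child process and capture the basics."""
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    return {
        "output": proc.stdout.strip(),
        "exit_code": proc.returncode,
        "duration_s": time.perf_counter() - start,
    }


# Stand-in for an agent: a child Python process that prints a priority label.
result = run_scenario([sys.executable, "-c", "print('P1')"])
print(result["output"], result["exit_code"])
```

Running each scenario in its own process keeps agent state isolated between runs and makes duration limits straightforward to enforce.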

Scenario format

Scenarios live in .kensa/scenarios/*.yaml. You can write a single input by hand, or point at a dataset so one scenario definition expands into many runs.

Example dataset-driven scenario:

id: booking_variations
name: Booking across routes
dataset: data/routes.jsonl
input_field: query
run_command: python agent.py {{input}}

checks:
  - type: tool_called
    params: { name: search_flights }
  - type: max_turns
    params: { max: 5 }

criteria: |
  The agent must confirm with the user before booking.
  The final answer must include a confirmation number.
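The expansion from one dataset-driven scenario into many runs can be pictured as follows. This is a sketch under stated assumptions: `expand_runs` is a hypothetical helper, the sample rows are invented (the contents of data/routes.jsonl are not shown in this document), and shell-quoting the substituted value is a design choice made here, not confirmed kensa behavior.

```python
import io
import json
import shlex
from typing import Iterable, List


def expand_runs(dataset_lines: Iterable[str], input_field: str, run_command: str) -> List[str]:
    """Turn each JSONL row into one concrete command line."""
    commands = []
    for line in dataset_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dataset
        row = json.loads(line)
        value = str(row[input_field])
        # Assumption: the value is shell-quoted when substituted for {{input}}.
        commands.append(run_command.replace("{{input}}", shlex.quote(value)))
    return commands


# Invented rows standing in for data/routes.jsonl.
sample = io.StringIO(
    '{"query": "SFO to JFK next Friday"}\n'
    '{"query": "Cheapest flight to Tokyo"}\n'
)
for command in expand_runs(sample, "query", "python agent.py {{input}}"):
    print(command)
```

Each row yields its own run, so the checks and criteria in the scenario are applied once per dataset entry.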

Built-in checks:

Check	What it tests
`output_contains`	Output includes a string or pattern
`output_matches`	Output matches a regex
`tool_called`	A specific tool was invoked
`tool_not_called`	A specific tool was not invoked
`tool_order`	Tools were called in the expected sequence
`max_cost`	Total cost stays under a threshold
`max_turns`	LLM call count stays under a limit
`max_duration`	Execution time stays under a limit
`no_repeat_calls`	Duplicate tool calls with identical args are rejected

A scenario passes when every configured check passes and any configured judge passes.
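The tool-related checks in the table are simple predicates over the sequence of tool calls in a trace. The sketch below shows one plausible reading of three of them. The function names mirror the check types for readability only, the trace shape (a list of name and arguments pairs) is an assumption, and the tool names other than `search_flights` are made up.

```python
import json
from typing import Any, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]


def tool_called(calls: List[ToolCall], name: str) -> bool:
    return any(tool == name for tool, _ in calls)


def tool_order(calls: List[ToolCall], expected: List[str]) -> bool:
    # Assumption: "expected sequence" means an ordered subsequence, so
    # unrelated calls may appear between the expected ones.
    names = iter(tool for tool, _ in calls)
    return all(name in names for name in expected)


def no_repeat_calls(calls: List[ToolCall]) -> bool:
    seen = set()
    for tool, args in calls:
        key = (tool, json.dumps(args, sort_keys=True))  # order-insensitive args
        if key in seen:
            return False
        seen.add(key)
    return True


# Hypothetical trace for the booking scenario above.
trace: List[ToolCall] = [
    ("search_flights", {"origin": "SFO", "destination": "JFK"}),
    ("confirm_with_user", {}),
    ("book_flight", {"flight_id": "UA100"}),
]

print(tool_called(trace, "search_flights"))
print(tool_order(trace, ["search_flights", "book_flight"]))
print(no_repeat_calls(trace))
```

All three return a plain boolean, which is what makes them cheap enough to run before any judge call.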

Examples

The repo includes five example agents under examples/:

Example	What it tests
`sql-analyst`	Multi-tool SQL analysis with soft-delete, currency, and aggregation traps
`incident-triage`	Operational diagnosis across runbooks, deploys, metrics, and paging
`code-reviewer`	Review behavior, false positives, and missed security issues
`customer-support`	Refunds, policy checks, ticket routing, and flaky downstream tools
`sdr-qualifier`	Qualification logic, CRM hygiene, and competitor signals

Try one:

git clone https://github.com/satyaborg/kensa.git
cd kensa
uv sync --extra openai
cd examples/sql-analyst

Then ask your coding agent to evaluate it, or write scenarios yourself and run kensa eval.

CI

- name: Run evals
  run: uv run kensa eval --format markdown

If you only use deterministic checks, you do not need API keys. If your scenarios include criteria or a judge, add the judge provider's API key as a secret in CI.

Contributing

See CONTRIBUTING.md for the full guide.

git clone https://github.com/satyaborg/kensa.git
cd kensa
uv sync --extra dev
pre-commit install
pytest -m "not integration"
ruff check src/ tests/
ruff format --check src/ tests/
uv run ty check

Judge model resolution

KENSA_JUDGE_MODEL -> explicit override
ANTHROPIC_API_KEY -> claude-sonnet-4-6
OPENAI_API_KEY -> gpt-5.4-mini
Neither -> setup error
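Read top to bottom, the list above amounts to a small lookup over environment variables. The sketch below expresses that reading in Python. It is not kensa's source: `resolve_judge_model` is a hypothetical name, the error type is arbitrary, and preferring Anthropic when both API keys are set is an assumption based on the list order.

```python
import os
from typing import Mapping, Optional


def resolve_judge_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a judge model from the environment, mirroring the list above."""
    env = os.environ if env is None else env
    if env.get("KENSA_JUDGE_MODEL"):
        return env["KENSA_JUDGE_MODEL"]  # explicit override
    if env.get("ANTHROPIC_API_KEY"):
        return "claude-sonnet-4-6"
    if env.get("OPENAI_API_KEY"):
        return "gpt-5.4-mini"
    # Assumption: the "setup error" is modelled here as a RuntimeError.
    raise RuntimeError(
        "No judge model configured: set KENSA_JUDGE_MODEL, "
        "ANTHROPIC_API_KEY, or OPENAI_API_KEY."
    )


print(resolve_judge_model({"OPENAI_API_KEY": "sk-example"}))
```

Passing the environment as a mapping keeps the function easy to test without touching real credentials.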

OpenTelemetry notes

kensa writes standard OpenTelemetry spans as JSONL and works well with OpenInference instrumentors. Auto-instrumentation currently supports Anthropic, OpenAI, and LangChain. If you already export spans yourself, you can still feed JSONL traces into kensa via KENSA_TRACE_DIR.
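If you want to sanity-check a directory of JSONL traces before pointing KENSA_TRACE_DIR at it, a few lines of standard-library Python are enough. The sketch below counts spans by name. The exact span layout kensa writes is not documented here, so the `name` field and the `*.jsonl` file pattern are assumptions, and `count_spans_by_name` is a hypothetical helper.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_spans_by_name(trace_dir: str) -> Counter:
    """Count spans per name across every .jsonl file in a directory."""
    counts: Counter = Counter()
    for path in sorted(Path(trace_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    span = json.loads(line)  # one JSON object per line
                    counts[span.get("name", "<unnamed>")] += 1
    return counts


# Self-contained demo with invented spans written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run.jsonl").write_text(
        '{"name": "llm.call", "trace_id": "t1"}\n'
        '{"name": "tool.search", "trace_id": "t1"}\n'
        '{"name": "llm.call", "trace_id": "t1"}\n',
        encoding="utf-8",
    )
    print(count_spans_by_name(tmp))
```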

Homepage · Issues · MIT License


Release history

0.8.0

May 5, 2026

0.7.0

May 1, 2026

0.6.2

Apr 27, 2026

0.6.1

Apr 24, 2026

0.6.0

Apr 24, 2026

0.5.2

Apr 18, 2026

0.5.1

Apr 18, 2026

0.5.0

Apr 15, 2026

0.4.0

Apr 13, 2026

0.3.0

Apr 10, 2026

0.2.0

Apr 8, 2026

This version

0.1.0

Apr 7, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kensa-0.1.0.tar.gz (52.7 kB)

Uploaded Apr 7, 2026 Source

Built Distribution


kensa-0.1.0-py3-none-any.whl (55.1 kB)

Uploaded Apr 7, 2026 Python 3

File details

Details for the file kensa-0.1.0.tar.gz.

File metadata

Download URL: kensa-0.1.0.tar.gz
Upload date: Apr 7, 2026
Size: 52.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kensa-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8e72d8f4e4e533f9652a7e2ac4ad4681565f76b10121e6f6de9b67c61b6f5a8e`
MD5	`34cfd0b41116fc07ec38ab9b66fc5ed4`
BLAKE2b-256	`caa5d025ecec83bc58703d290a66ecdd3d587f28dc9394efdc2c7bd13d83ec5b`
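A downloaded archive can be checked against the SHA256 digest above with Python's standard `hashlib`. The sketch below streams the file in chunks so large archives do not need to fit in memory. The helper names `sha256_of` and `verify` are illustrative, and the script assumes the archive sits in the current directory.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for kensa-0.1.0.tar.gz.
EXPECTED_SHA256 = "8e72d8f4e4e533f9652a7e2ac4ad4681565f76b10121e6f6de9b67c61b6f5a8e"


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected: str = EXPECTED_SHA256) -> bool:
    return sha256_of(path) == expected.lower()


archive = Path("kensa-0.1.0.tar.gz")  # assumed download location
if archive.exists():
    print("OK" if verify(archive) else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```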


Provenance

The following attestation bundles were made for kensa-0.1.0.tar.gz:

Publisher: release.yml on satyaborg/kensa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kensa-0.1.0.tar.gz
- Subject digest: 8e72d8f4e4e533f9652a7e2ac4ad4681565f76b10121e6f6de9b67c61b6f5a8e
- Sigstore transparency entry: 1247128862
- Sigstore integration time: Apr 7, 2026
Source repository:
- Permalink: satyaborg/kensa@b80ec54d70aeef06b85692417affe812a50e00ff
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/satyaborg
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b80ec54d70aeef06b85692417affe812a50e00ff
- Trigger Event: push

File details

Details for the file kensa-0.1.0-py3-none-any.whl.

File metadata

Download URL: kensa-0.1.0-py3-none-any.whl
Upload date: Apr 7, 2026
Size: 55.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kensa-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f07dc00f2fcf4e6c34bef247942222b65c38a4705cf0496bb898ad5276e7b6ce`
MD5	`4e29e4a947ed2c7b7b4bbc8ac7e541f1`
BLAKE2b-256	`859e61eb84bf37a478d4cf7ee0fdb466494afb20587a28f894744f018638d7d9`


Provenance

The following attestation bundles were made for kensa-0.1.0-py3-none-any.whl:

Publisher: release.yml on satyaborg/kensa

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: kensa-0.1.0-py3-none-any.whl
- Subject digest: f07dc00f2fcf4e6c34bef247942222b65c38a4705cf0496bb898ad5276e7b6ce
- Sigstore transparency entry: 1247128867
- Sigstore integration time: Apr 7, 2026
Source repository:
- Permalink: satyaborg/kensa@b80ec54d70aeef06b85692417affe812a50e00ff
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/satyaborg
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b80ec54d70aeef06b85692417affe812a50e00ff
- Trigger Event: push

kensa 0.1.0
