The open source agent evals harness
Agents are non-deterministic. Prompts drift. Tools change. Models behave differently. Any change can make them slower, more expensive, or just plain unreliable.
kensa gives coding agents, like Claude Code, a repeatable loop to evaluate your
agents and catch regressions every time you make a change.
Installation
Paste this into your coding agent
Open your coding agent and paste:
Install Kensa for agent-driven evals with `uvx kensa init --cli --agent all`,
then evaluate this agent using Kensa's skills. Start with audit-evals, let it
route to the right next step, and follow the eval lifecycle: generate scenarios,
calibrate judges if needed, run `kensa eval`, diagnose failures, and recommend
whether to fix the agent, the scenarios, or the judge.
Your agent does the setup, writes or updates evals, runs them, and reports what to fix.
Or run it yourself
uvx kensa init
Adds kensa to your dev deps, scaffolds `.kensa/`, and adds 5 skills for the
complete evals workflow. Works with Claude Code, Codex, Cursor, and other coding
agents. For non-interactive setup or CI, run `uvx kensa init --cli --agent all --blank`.
Quickstart
Tell your coding agent what you want:
| You say | Kensa does |
|---|---|
| "Evaluate this agent" | Audit setup, create or reuse scenarios, and run evals. |
| "Why are evals failing?" | Inspect results and traces, then diagnose the root cause. |
| "Add coverage for tool use" | Write scenario YAML with tool or trajectory checks. |
| "The judge seems wrong" | Create or validate structured judge prompts. |
How it works
- Zero to eval: your coding agent drafts scenarios; you review them.
- Runs become traces: each scenario runs in a subprocess with LLM calls, tool use, tokens, cost, and latency captured.
- Checks gate judges: deterministic checks run before any LLM judge call.
- Ship with evidence: reports show verdicts, traces, cost, latency, and failure details.
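The "runs become traces" and "checks gate judges" steps above can be sketched in Python. This is an illustrative sketch under stated assumptions, not kensa's implementation: `run_scenario`, `output_matches`, `evaluate`, and the `judge` callable are hypothetical names. The only behavior taken from the text is that the agent runs in a subprocess and that deterministic checks run before any LLM judge call.

```python
import re
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_scenario(run_command: Sequence[str], scenario_input: str) -> tuple[str, float]:
    """Run the agent in a subprocess and return (stdout, latency in seconds)."""
    start = time.perf_counter()
    proc = subprocess.run(
        [*run_command, scenario_input],  # the input is passed as the last argument
        capture_output=True,
        text=True,
        check=True,  # raise if the agent exits with a non-zero status
    )
    return proc.stdout.strip(), time.perf_counter() - start


def output_matches(output: str, pattern: str) -> bool:
    """Deterministic check: the output must match the regular expression."""
    return re.search(pattern, output) is not None


def evaluate(
    output: str,
    checks: Sequence[Callable[[str], bool]],
    judge: Optional[Callable[[str], bool]] = None,
) -> str:
    """Return "pass" or "fail"; the judge is only consulted if every check passes."""
    for check in checks:
        if not check(output):
            return "fail"  # short-circuit: no judge call, so no judge cost
    if judge is None:
        return "pass"
    return "pass" if judge(output) else "fail"
```

The point of the ordering is cost: a run that fails a cheap deterministic check never reaches the judge, so no judge tokens are spent on it.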
Instrumentation
Zero code changes. kensa captures LLM calls, tool use, tokens, cost, and latency without modifying your agent. OpenTelemetry (OTel) compatible.
Provider extras
uv add "kensa[anthropic]"
uv add "kensa[openai]"
uv add "kensa[langchain]"
uv add "kensa[all]"
Core commands
| Command | What it does |
|---|---|
| `kensa init --blank` | Scaffold `.kensa/` without example content |
| `kensa doctor` | Check instrumentation, config, and environment readiness |
| `kensa generate` | Synthesize scenario YAMLs from captured traces via an LLM |
| `kensa eval` | Run + judge + report in one command |
| `kensa report` | Show the latest results in terminal, Markdown, JSON, or HTML |
See the CLI docs for `run`, `judge`, `analyze`, `mcp`, and the full command reference.
MCP server
One-liner for Claude Code (run from your project root):
claude mcp add kensa -- uvx kensa-mcp
For other JSON-based MCP clients, add to your project's .mcp.json or
.cursor/mcp.json:
{
  "mcpServers": {
    "kensa": {
      "command": "uvx",
      "args": ["kensa-mcp"]
    }
  }
}
For Codex, add to your project-scoped .codex/config.toml:
[mcp_servers.kensa]
command = "uvx"
args = ["kensa-mcp"]
See the MCP server docs for tools, resources, and manual config.
Manual workflow
If you want to author evals yourself:
kensa init --blank
kensa doctor
Scenarios live in `.kensa/scenarios/*.yaml` and point at your agent entrypoint with `run_command`.
id: classify_ticket
input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: [python, agent.py]  # input is appended as the final argv element
checks:
  - type: trajectory
    params:
      steps:
        - tool: classify_ticket
      max_steps: 1
      max_tokens: 2000
  - type: output_matches
    params: { pattern: "^P[123]$" }
criteria: |
  P1 is for outages or data loss affecting multiple users.
For complete examples, see `examples/`. See the scenario docs and checks docs
for the full field and check reference.
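For orientation, an entrypoint compatible with the `run_command` contract in the scenario above might look like the following. This is a hypothetical `agent.py`, not one of the project's examples: the keyword rules and the `OUTAGE_TERMS` list stand in for a real model and tool call, and the only details taken from the text are that the scenario input arrives as the final argv element and that the output is expected to match `^P[123]$`.

```python
import sys

# Hypothetical keyword list standing in for a real model decision.
OUTAGE_TERMS = ("outage", "can't log in", "cannot log in", "data loss", "502")


def classify_ticket(text: str) -> str:
    """Return a priority label (P1, P2, or P3) for a support ticket."""
    lowered = text.lower()
    if any(term in lowered for term in OUTAGE_TERMS):
        return "P1"  # outages or data loss
    if "error" in lowered or "bug" in lowered:
        return "P2"
    return "P3"


# Only act when an input argument was actually supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    # The scenario input is the final argv element.
    print(classify_ticket(sys.argv[-1]))
```

The sketch only illustrates the argv contract and the output format. It does not make an instrumented tool call, so it says nothing about how the trajectory check inspects a run.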
CI
- name: Run evals
  run: uv run kensa eval --format markdown
If you only use deterministic checks, you do not need API keys. If any scenario
uses `criteria` or `judge`, add the judge's LLM provider secrets to CI.
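A pre-flight script can make that rule explicit before the eval step runs. The sketch below is an illustration under assumptions, not a kensa feature: it treats each scenario as an already-parsed mapping, assumes `criteria` and `judge` appear as top-level scenario keys, and uses hypothetical function names. The environment variable name in the usage note is only an example.

```python
import os
from collections.abc import Iterable, Mapping

# Assumption: these top-level scenario keys are what trigger an LLM judge.
JUDGE_KEYS = ("criteria", "judge")


def needs_judge(scenario: Mapping) -> bool:
    """True if the scenario sets a non-empty judge-related key."""
    return any(scenario.get(key) for key in JUDGE_KEYS)


def missing_secrets(
    scenarios: Iterable[Mapping],
    required_env: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> list[str]:
    """List required variables that are unset, or [] if no scenario needs a judge."""
    if not any(needs_judge(scenario) for scenario in scenarios):
        return []  # deterministic checks only: no API keys needed
    return [name for name in required_env if not environ.get(name)]
```

For example, `missing_secrets([{"id": "a", "criteria": "P1 is for outages."}], ["ANTHROPIC_API_KEY"], environ={})` returns `["ANTHROPIC_API_KEY"]`, while a scenario list with no `criteria` or `judge` entries returns an empty list.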
Need more?
- Docs
- `examples/` has sample agents and scenarios
- `CONTRIBUTING.md` covers local development
- Homepage
License