The open source agent evals harness

kensa - the open source agent evals harness

Kensa is the open source harness for evaluating agents.


Agents are non-deterministic. Prompts drift, tools change, and models behave differently. Any change can make an agent slower, more expensive, or less reliable.

kensa gives coding agents, such as Claude Code, a repeatable loop to evaluate your agents and catch regressions every time you make a change.

Installation

Paste this into your coding agent

Open your coding agent and paste:

Install Kensa for agent-driven evals with `uvx kensa init --cli --agent all`,
then evaluate this agent using Kensa's skills. Start with audit-evals, let it
route to the right next step, and follow the eval lifecycle: generate scenarios,
calibrate judges if needed, run `kensa eval`, diagnose failures, and recommend
whether to fix the agent, the scenarios, or the judge.

Your agent does the setup, writes or updates evals, runs them, and reports what to fix.

Or run it yourself

uvx kensa init

Adds kensa to your dev deps, scaffolds .kensa/, and adds 5 skills for the complete evals workflow. Works with Claude Code, Codex, Cursor, and other coding agents. For non-interactive setup or CI: uvx kensa init --cli --agent all --blank.

Quickstart

Tell your coding agent what you want:

You say                       Kensa does
"Evaluate this agent"         Audit setup, create or reuse scenarios, and run evals.
"Why are evals failing?"      Inspect results and traces, then diagnose the root cause.
"Add coverage for tool use"   Write scenario YAML with tool or trajectory checks.
"The judge seems wrong"       Create or validate structured judge prompts.

How it works

  • Zero to eval: your coding agent drafts scenarios; you review them.
  • Runs become traces: each scenario runs in a subprocess with LLM calls, tool use, tokens, cost, and latency captured.
  • Checks gate judges: deterministic checks run before any LLM judge call.
  • Ship with evidence: reports show verdicts, traces, cost, latency, and failure details.
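
The subprocess and gating steps above can be sketched in a few lines of Python. This is an illustration of the idea, not kensa's implementation: `run_scenario`, `evaluate`, and the stand-in agent and judge are hypothetical names, and the assumption that a failed deterministic check skips the judge entirely is one reading of "checks gate judges".

```python
import re
import subprocess
import sys
import time
from typing import Callable


def run_scenario(run_command: list[str], scenario_input: str) -> dict:
    """Run the agent in a subprocess, passing the input as the last argument."""
    start = time.perf_counter()
    proc = subprocess.run(
        [*run_command, scenario_input],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return {
        "output": proc.stdout.strip(),
        "exit_code": proc.returncode,
        "latency_s": time.perf_counter() - start,
    }


def evaluate(result: dict, pattern: str, judge: Callable[[str], bool]) -> str:
    """Deterministic checks run first; the judge is called only if they pass."""
    if result["exit_code"] != 0:
        return "fail: non-zero exit"
    if re.search(pattern, result["output"]) is None:
        return "fail: output does not match pattern"
    return "pass" if judge(result["output"]) else "fail: judge rejected output"


# Stand-in agent (hypothetical): a one-line program that prints a fixed priority.
fake_agent = [sys.executable, "-c", "print('P1')"]

# Stand-in judge (hypothetical): records each call so gating can be observed.
judge_calls: list[str] = []


def fake_judge(output: str) -> bool:
    judge_calls.append(output)
    return True


verdict = evaluate(run_scenario(fake_agent, "SSO is down"), r"^P[123]$", fake_judge)
print(verdict)
```

Because the cheap checks run first, a malformed output never reaches the judge, which keeps judge cost proportional to the number of runs that are at least structurally valid.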

Instrumentation

Zero code changes. kensa captures LLM calls, tool use, tokens, cost, and latency without modifying your agent. OpenTelemetry (OTel) compatible.

Provider extras
uv add "kensa[anthropic]"
uv add "kensa[openai]"
uv add "kensa[langchain]"
uv add "kensa[all]"

Core commands

Command              What it does
kensa init --blank   Scaffold .kensa/ without example content
kensa doctor         Check instrumentation, config, and environment readiness
kensa generate       Synthesize scenario YAMLs from captured traces via an LLM
kensa eval           Run + judge + report in one command
kensa report         Show the latest results in terminal, Markdown, JSON, or HTML

See the CLI docs for run, judge, analyze, mcp, and the full command reference.

MCP server

One-liner for Claude Code (run from your project root):

claude mcp add kensa -- uvx kensa-mcp

For other JSON-based MCP clients, add to your project's .mcp.json or .cursor/mcp.json:

{
  "mcpServers": {
    "kensa": {
      "command": "uvx",
      "args": ["kensa-mcp"]
    }
  }
}
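
If the project already has an MCP config with other servers, the entry above can be merged in rather than pasted over the file. The following is a minimal sketch using only the standard library; `add_kensa_server` is a hypothetical helper written for this example, not part of kensa, and it assumes the file, when present, contains valid JSON.

```python
import json
from pathlib import Path

# The server entry shown above, expressed as a Python dict.
KENSA_ENTRY = {"command": "uvx", "args": ["kensa-mcp"]}


def add_kensa_server(config_path: Path) -> dict:
    """Add the kensa entry to an MCP client config, keeping other servers."""
    if config_path.exists():
        config = json.loads(config_path.read_text())
    else:
        config = {}
    # Create the "mcpServers" mapping if it is missing, then set the entry.
    config.setdefault("mcpServers", {})["kensa"] = KENSA_ENTRY
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_kensa_server(Path(".mcp.json"))` from the project root would write the project-level file; passing `Path(".cursor/mcp.json")` would target the other location mentioned above.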

For Codex, add to your project-scoped .codex/config.toml:

[mcp_servers.kensa]
command = "uvx"
args = ["kensa-mcp"]

See the MCP server docs for tools, resources, and manual config.

Manual workflow

If you want to author evals yourself:

kensa init --blank
kensa doctor

Scenarios live in .kensa/scenarios/*.yaml and point at your agent entrypoint with run_command.

id: classify_ticket
input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: [python, agent.py]   # input is appended as the final argv element

checks:
  - type: trajectory
    params:
      steps:
        - tool: classify_ticket
      max_steps: 1
      max_tokens: 2000
  - type: output_matches
    params: { pattern: "^P[123]$" }

criteria: |
  P1 is for outages or data loss affecting multiple users.

For complete examples, see examples/. See the scenario docs and checks docs for the full field and check reference.
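
To make the `run_command` contract concrete, here is a hypothetical `agent.py` that reads the scenario input from the final argv element and prints a priority that fits the `^P[123]$` pattern. The keyword rules are invented for illustration only. A real agent would call an LLM and tools, and this toy version makes no tool calls, so it would not be expected to satisfy the trajectory check in the scenario above.

```python
import re
import sys

# Hypothetical keyword rules; a real agent would call an LLM and tools instead.
P1_PATTERN = re.compile(r"entire team|outage|data loss|\b50[0-9]\b")


def classify(ticket: str) -> str:
    """Map ticket text to a priority label: P1, P2, or P3."""
    text = ticket.lower()
    if P1_PATTERN.search(text):
        return "P1"
    if "error" in text or "slow" in text:
        return "P2"
    return "P3"


if __name__ == "__main__" and len(sys.argv) > 1:
    # The scenario input arrives as the final argv element.
    print(classify(sys.argv[-1]))
```

Printing only the label, with no extra text, is what lets an anchored pattern such as `^P[123]$` match the output.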

CI

- name: Run evals
  run: uv run kensa eval --format markdown

If you only use deterministic checks, you do not need API keys. If a scenario uses criteria or judge, add the judge LLM provider's secrets to your CI environment.
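
One way to find out whether a CI job needs secrets is to scan the scenario files for those fields. The sketch below is a plain text scan, not a YAML parser, and `scenarios_needing_judge` is a hypothetical helper. It assumes `criteria` and `judge` appear as top-level (unindented) keys, as `criteria` does in the scenario example in this README.

```python
import re
from pathlib import Path

# Matches an unindented "criteria:" or "judge:" key at the start of a line.
JUDGE_FIELDS = re.compile(r"^(criteria|judge)\s*:", re.MULTILINE)


def scenarios_needing_judge(scenario_dir: Path) -> list[str]:
    """Return the names of scenario files that declare criteria or judge."""
    return sorted(
        path.name
        for path in scenario_dir.glob("*.yaml")
        if JUDGE_FIELDS.search(path.read_text())
    )
```

Running `scenarios_needing_judge(Path(".kensa/scenarios"))` and getting an empty list would suggest the job can run without provider keys, under the assumptions stated above.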
