kensa - the open source agent evals harness

Tell your coding agent to evaluate an agent. Get a working eval suite in minutes.



kensa is an open source eval harness for agent codebases. It gives coding agents an opinionated CLI and bundled skills to generate scenarios, run them in subprocesses, judge results, and report failures.

Note: kensa is under active development, and behavior may change between minor versions as the harness stabilizes. Pin your version if you need predictability. If something breaks, open an issue. Your feedback shapes evals.

Installation

Skills + CLI (recommended)

npx skills add satyaborg/kensa
uv add kensa

Works for Claude Code, Codex, Cursor, OpenCode, Gemini CLI, and similar coding agents.

Claude Code plugin

If you primarily use Claude Code, you can install it as a plugin:

/plugin marketplace add satyaborg/kensa
/plugin install kensa

Quickstart

Tell your coding agent:

evaluate this agent

That gives you the basic loop:

  • your coding agent inspects the repo, sets up instrumentation and writes evals
  • it runs kensa to execute scenarios and capture traces
  • deterministic checks run first
  • the LLM judge only runs when those pass
  • reports show what failed and why
  • you review changes, approve fixes and iterate
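
The gating in the loop above (deterministic checks first, LLM judge only when they pass) can be sketched in a few lines of Python. This is an illustration of the idea, not kensa's implementation: `Verdict`, `evaluate`, `priority_label`, and `fake_judge` are hypothetical names, and the judge is a stub standing in for a real model call.

```python
import re
from dataclasses import dataclass
from typing import Callable

# A check or judge takes the agent output and returns (passed, reason).
Check = Callable[[str], tuple[bool, str]]


@dataclass
class Verdict:
    passed: bool
    stage: str  # "deterministic" or "judge"
    reason: str


def evaluate(output: str, checks: list[Check], judge: Check) -> Verdict:
    """Run cheap deterministic checks first; call the judge only if all pass."""
    for check in checks:
        passed, reason = check(output)
        if not passed:
            # Short-circuit: the judge is never called for this output.
            return Verdict(False, "deterministic", reason)
    passed, reason = judge(output)
    return Verdict(passed, "judge", reason)


def priority_label(output: str) -> tuple[bool, str]:
    """Hypothetical deterministic check: output must be P1, P2, or P3."""
    ok = re.fullmatch(r"P[123]", output.strip()) is not None
    return ok, ("ok" if ok else f"unexpected label: {output!r}")


judge_calls: list[str] = []


def fake_judge(output: str) -> tuple[bool, str]:
    """Stand-in for an LLM call; records each invocation."""
    judge_calls.append(output)
    return True, "meets criteria"


bad = evaluate("urgent", [priority_label], fake_judge)
good = evaluate("P1", [priority_label], fake_judge)
print(bad.stage, good.stage, len(judge_calls))
```

Only the second call reaches the judge, which is why a suite with strict deterministic checks can fail fast without spending judge tokens.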

Instrumentation

Zero code changes. kensa captures LLM calls, tool use, tokens, cost, and latency without modifying your agent.

Provider extras
uv add "kensa[anthropic]"
uv add "kensa[openai]"
uv add "kensa[langchain]"
uv add "kensa[all]"

Core commands

Command             What it does
kensa init --blank  Scaffold .kensa/ without example content
kensa doctor        Check instrumentation, config, and environment readiness
kensa generate      Synthesize scenario YAMLs from captured traces via an LLM
kensa eval          Run + judge + report in one command
kensa report        Show the latest results in terminal, Markdown, JSON, or HTML
kensa analyze       Flag slow, expensive, anomalous, or error-prone traces
kensa mcp           Serve kensa over MCP for LLM clients (stdio or HTTP)

MCP server

kensa ships an MCP server that exposes the eval workflow to any MCP-aware client, such as Claude Code, Cursor, Codex, OpenCode, Gemini CLI, or Claude Desktop.

One-liner for Claude Code (run from your project root):

claude mcp add kensa -- uvx kensa-mcp

uvx pulls kensa-mcp from PyPI into an isolated environment on first launch. No pre-install needed. The server reads .kensa/ relative to the cwd it inherits from Claude Code.

Tools (7): init, doctor, run, judge, eval, report, analyze.

Resources (8): read-only data under the kensa:// namespace.

kensa://runs                          # list of recent runs
kensa://runs/{id}                     # manifest + summary for one run
kensa://runs/{id}/results             # full judged results
kensa://runs/{id}/trace/{scenario}/{index}  # spans for one scenario execution
kensa://scenarios                     # list of scenarios
kensa://scenarios/{id}                # full scenario YAML
kensa://judges                        # list of judge prompt names
kensa://judges/{name}                 # judge prompt spec

Long-running tools (run, judge, eval) return a compact summary plus a results_uri. Fetch detail via the resource only when you need it. Errors come back as a typed MCPError envelope ({error, code, hint}) with stable code values so clients can branch on failure type.
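
A client can branch on the envelope described above roughly as follows. This is a hedged sketch: the field names `error`, `code`, `hint`, and `results_uri` come from the description, but the specific code values (`not_initialized`, `scenario_not_found`) and the `next_action` helper are hypothetical placeholders, since the stable codes are not listed here.

```python
from typing import Any


def next_action(payload: dict[str, Any]) -> str:
    """Decide what a client might do with a tool result or an error envelope."""
    if "code" in payload and "error" in payload:
        code = payload["code"]
        hint = payload.get("hint") or "no hint provided"
        if code == "not_initialized":  # hypothetical code value
            return f"call the init tool, then retry ({hint})"
        if code == "scenario_not_found":  # hypothetical code value
            return f"list kensa://scenarios and pick a valid id ({hint})"
        # Unknown code: fall back to showing the message to the user.
        return f"surface to the user: {payload['error']} [{code}]"
    uri = payload.get("results_uri")
    # Fetch detail lazily: only read the resource when it is actually needed.
    return f"read {uri} for full results" if uri else "nothing further to fetch"


print(next_action({"results_uri": "kensa://runs/abc/results"}))
print(next_action({"error": "boom", "code": "something_else", "hint": ""}))
```

Branching on `code` rather than on the human-readable `error` text keeps client logic stable if messages are reworded.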

Manual config (Cursor, Codex, Claude Desktop, etc.)

Add to your MCP client config (e.g. ~/.claude.json or a project-local .mcp.json):

{
  "mcpServers": {
    "kensa": {
      "command": "uvx",
      "args": ["kensa-mcp"],
      "cwd": "/absolute/path/to/your/project"
    }
  }
}

Already have kensa installed in the project? Add the extra (uv add "kensa[mcp]") and use the built-in kensa mcp subcommand instead of the shim:

{
  "mcpServers": {
    "kensa": {
      "command": "uv",
      "args": ["run", "kensa", "mcp"],
      "cwd": "/absolute/path/to/your/project"
    }
  }
}

For local kensa development from a source checkout:

{
  "mcpServers": {
    "kensa": {
      "command": "uv",
      "args": ["run", "--extra", "mcp", "kensa", "mcp"],
      "cwd": "/absolute/path/to/kensa"
    }
  }
}

Manual workflow

If you want to author evals yourself:

kensa init --blank
kensa doctor

Scenarios live in .kensa/scenarios/*.yaml and point at your agent entrypoint with run_command.

id: classify_ticket
input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: [python, agent.py]   # input is appended as the final argv element

checks:
  - type: trajectory
    params:
      steps:
        - tool: classify_ticket
      max_steps: 1
      max_tokens: 2000
  - type: output_matches
    params: { pattern: "^P[123]$" }

criteria: |
  P1 is for outages or data loss affecting multiple users.

For complete examples, see examples/.
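
As a rough illustration of the run_command contract in the scenario above, an entrypoint can read the scenario input from the last argv element and print its answer. This sketch assumes the agent's answer is whatever it prints to stdout, and it uses a hypothetical keyword heuristic in place of a real model call. It makes no tool calls, so it would not satisfy the trajectory check shown above.

```python
import sys

# Hypothetical keyword list standing in for a real model call.
OUTAGE_HINTS = ("entire team", "outage", "data loss", "502")


def classify(ticket: str) -> str:
    """Return a priority label (P1, P2, or P3) for a support ticket."""
    text = ticket.lower()
    if any(hint in text for hint in OUTAGE_HINTS):
        return "P1"
    if "error" in text or "can't" in text:
        return "P2"
    return "P3"


def main(argv: list[str]) -> None:
    # The scenario input is appended as the final argv element, so
    # argv[-1] is the ticket text when invoked as: python agent.py "<input>"
    ticket = argv[-1] if len(argv) > 1 else ""
    print(classify(ticket))


if __name__ == "__main__":
    main(sys.argv)
```

Running `python agent.py "Our entire team can't log in."` prints a single label, which is the shape the output_matches pattern in the scenario expects.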

trajectory is the deterministic path check for tool-call correctness. V1 supports:

  • ordering: exact | any_order
  • args: exact | ignore
  • min_accuracy
  • inline budgets: max_steps, max_tokens, max_duration_seconds

When present, reports surface trajectory_accuracy and step_efficiency alongside pass/fail.
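
The difference between the two ordering modes can be sketched as below. This is one plausible reading, not kensa's code: it assumes exact means the same tools in the same order, any_order means the same tools regardless of order, and max_steps counts tool calls. The helper names and the `search_kb` tool are hypothetical.

```python
from collections import Counter


def tools_match(expected: list[str], actual: list[str], ordering: str = "exact") -> bool:
    """Compare expected tool names with the tools the agent actually called."""
    if ordering == "exact":
        # Same tools, same order, nothing extra.
        return expected == actual
    if ordering == "any_order":
        # Same tools the same number of times, order ignored.
        return Counter(expected) == Counter(actual)
    raise ValueError(f"unknown ordering: {ordering!r}")


def within_step_budget(actual: list[str], max_steps: int) -> bool:
    """Budget check: the agent may take at most max_steps tool calls."""
    return len(actual) <= max_steps


expected = ["search_kb", "classify_ticket"]  # search_kb is a made-up tool name
actual = ["classify_ticket", "search_kb"]

print(tools_match(expected, actual, "exact"))      # False: order differs
print(tools_match(expected, actual, "any_order"))  # True: same tools
print(within_step_budget(actual, max_steps=1))     # False: two calls made
```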

When you run the same scenario multiple times, aggregate reports also surface estimated 3-run and 5-run pass rates assuming independent runs.
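
One way to read the multi-run estimate: if a scenario passes a single run with probability p and runs are independent, the chance that all k runs pass is p^k. The sketch below assumes that interpretation, which may not match kensa's exact formula. The function name and the 8-out-of-10 figure are made up for illustration.

```python
def all_pass_probability(pass_rate: float, runs: int) -> float:
    """Probability that every one of `runs` independent runs passes."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return pass_rate ** runs


observed = 8 / 10  # hypothetical: 8 passes out of 10 recorded runs
for k in (3, 5):
    print(f"{k}-run pass rate: {all_pass_probability(observed, k):.3f}")
# 3-run pass rate: 0.512
# 5-run pass rate: 0.328
```

Under this reading, a scenario that looks healthy at 80% per run passes five consecutive runs only about a third of the time, which is why the multi-run figures are useful for spotting flaky agents.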

If you need custom deterministic assertions beyond the built-ins, add a Python check via CHECK_REGISTRY rather than embedding logic in scenario YAML.
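
The registry pattern behind this advice looks roughly like the sketch below. It is only an illustration: kensa's real CHECK_REGISTRY (its import path, expected function signature, and return type) is not documented here, so the dictionary, the `register` decorator, the check name, and the `min_reason_chars` parameter are all hypothetical stand-ins.

```python
import re
from typing import Callable

# Local stand-in only. kensa's real CHECK_REGISTRY may use a different
# import path, function signature, and return type.
CHECK_REGISTRY_STAND_IN: dict[str, Callable[[str, dict], bool]] = {}


def register(name: str):
    """Decorator that stores a check function under a name."""
    def decorator(fn):
        CHECK_REGISTRY_STAND_IN[name] = fn
        return fn
    return decorator


@register("valid_priority_with_reason")  # hypothetical check name
def valid_priority_with_reason(output: str, params: dict) -> bool:
    # Logic that would be awkward in YAML: a label plus a minimum-length reason.
    match = re.fullmatch(r"(P[123]): (.+)", output.strip())
    return bool(match) and len(match.group(2)) >= params.get("min_reason_chars", 10)


check = CHECK_REGISTRY_STAND_IN["valid_priority_with_reason"]
print(check("P1: SSO outage affects the whole team", {}))
print(check("P1", {}))
```

Keeping the logic in a named Python function lets the scenario YAML stay declarative: it refers to the check by name and passes parameters, rather than embedding code.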

CI

- name: Run evals
  run: uv run kensa eval --format markdown

If you only use deterministic checks, you do not need API keys. If you use criteria or judge, add judge provider secrets in CI.

