k-eval

Context-aware evaluation framework for AI agents using MCP.

Quick Start

k-eval uses uv for dependency management. Install it first if you don't have it:

curl -LsSf https://astral.sh/uv/install.sh | sh

Run k-eval

k-eval runs are configured with YAML configuration files (see Configuration).

Once an evaluation is defined in a YAML file, invoke k-eval like this:

uvx --python 3.13 "k-eval[all]" run /path/to/config.yaml

See docs/run-configuration.md for authentication setup and all CLI options.

CLI Commands

$ uvx --python 3.13 "k-eval[all]" --help

 Usage: k-eval [OPTIONS] COMMAND [ARGS]...

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                  │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────╮
│ run   Run a k-eval evaluation from a YAML config file.                       │
│ view  Open a k-eval results file in the interactive browser viewer.          │
╰──────────────────────────────────────────────────────────────────────────────╯

Understanding the Output

Each run produces two files in ./results/ (or wherever you point --output-dir):

results/
  my-eval_20260225_a1b2c3d4.json           # aggregate scores per condition
  my-eval_20260225_a1b2c3d4.detailed.jsonl # one line per (question, condition) pair

{name}_{date}_{run_id}.json — the summary. One entry per condition with mean and standard deviation for each of the three metrics across all questions and repetitions. Use this to compare conditions at a glance.

This file is intended to be mostly compliant with the Every Eval Ever schema. Notably, k-eval does not aggregate the three metrics into a single score. Thus, the individual metrics are written to score_details.details, and score_details.score is left null.

{name}_{date}_{run_id}.detailed.jsonl — the full record. One JSON object per (question, condition) pair containing the agent's raw responses for every repetition, per-repetition judge scores and reasoning, unverified claims, and token usage. Use this if you want to dig into why a condition scored the way it did.
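
The naming scheme above can be handled programmatically. The following is a minimal sketch, not part of k-eval: `parse_result_name` and `read_jsonl` are hypothetical helpers, and the pattern assumes an 8-digit date and an 8-character hexadecimal run ID, as in the example file names. The fields inside each JSON object are not assumed here; inspect a real record to see them.

```python
import json
import re
from pathlib import Path

# Matches {name}_{date}_{run_id}.json and {name}_{date}_{run_id}.detailed.jsonl.
# Assumption: the date is 8 digits and the run ID is 8 hex characters,
# as in the example file names (my-eval_20260225_a1b2c3d4).
RESULT_NAME = re.compile(
    r"^(?P<name>.+)_(?P<date>\d{8})_(?P<run_id>[0-9a-f]{8})"
    r"(?P<suffix>\.json|\.detailed\.jsonl)$"
)


def parse_result_name(filename: str) -> dict[str, str]:
    """Split a results file name into name, date, run_id, and suffix."""
    match = RESULT_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a recognized results file name: {filename}")
    return match.groupdict()


def read_jsonl(path: Path) -> list[dict]:
    """Load one JSON object per non-empty line of a .jsonl file."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

For example, `read_jsonl(Path("results/my-eval_20260225_a1b2c3d4.detailed.jsonl"))` would return one dictionary per (question, condition) pair.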

The three metrics are scored 1-5 by the judge model:

| Metric | What it measures |
| --- | --- |
| factual_adherence | Does the response stick to facts in the golden answer? |
| completeness | Does it cover all the essential points? |
| helpfulness_and_clarity | Is it well-structured and easy to act on? |

See evaluation-methodology for more details.

Interactive Results Viewer

k-eval comes bundled with a web-based interactive results viewer. The viewer can be invoked via the k-eval command:

uvx k-eval view /path/to/results.detailed.jsonl

Note: After an evaluation finishes, k-eval prints the corresponding k-eval view ... command for easy copy/paste.

Configuration

Important: For MCP servers that require authentication, see docs/run-configuration.md.

A config file defines your dataset, agent, judge, MCP servers, and the conditions you want to compare:

name: "my-eval"
version: "1"

dataset:
  # JSONL file with your questions and golden answers
  path: "./questions.jsonl"
  # The name of the key used to reference the question within the JSONL file.
  question_key: "question"
  # The key used to reference the golden (reference) answer within the JSONL file.
  answer_key: "answer"

agent:
  type: "claude_code_sdk" # currently the only supported type
  model: "claude-sonnet-4-5"

judge:
  model: "vertex_ai/claude-opus-4-5" # any LiteLLM-compatible model string (See: https://models.litellm.ai/)
  temperature: 0.0

mcp_servers:
  graph:
    type: "stdio"
    command: "python"
    args: ["-m", "my_mcp_server"]

conditions:
  baseline:
    mcp_servers: []
    system_prompt: |
        Answer using your own knowledge.
  with_graph:
    mcp_servers: [graph]
    system_prompt: |
        Use the graph tool to answer the question.
    # Abort this triple if the agent makes no MCP tool calls.
    # Prevents silently scoring runs where the MCP server was unreachable.
    require_mcp_tool_use: true
    # Abort this triple if every MCP tool call returned an error.
    # Use alongside require_mcp_tool_use to validate
    # that MCP tools are working correctly.
    require_mcp_tool_success: true

execution:
  # How many times each (question, condition) pair is evaluated.
  # This is useful for managing variance in agent responses. Standard
  # deviation between scores will be reported if num_repetitions >= 3
  num_repetitions: 3
  # (question, condition, repetition) tuples can be evaluated concurrently
  # to reduce total evaluation time. The upper bound of this number is determined
  # only by the resources on your computer and by the rate limit configuration
  # of the agent and model providers.
  #
  # In practice, numbers even as high as 50 seem to be well tolerated 
  # when using Vertex AI.
  max_concurrent: 5
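
The `dataset` section above points at a JSONL file whose objects carry the keys named by `question_key` and `answer_key`. As a minimal sketch, such a file could be produced as follows. The rows are made-up placeholders and `write_dataset` is a hypothetical helper, not something shipped with k-eval.

```python
import json
from pathlib import Path

# Placeholder rows. The keys match question_key ("question") and
# answer_key ("answer") from the example configuration.
ROWS = [
    {"question": "What is the capital of France?", "answer": "Paris."},
    {"question": "How many days are in a leap year?", "answer": "366."},
]


def write_dataset(path: Path, rows: list[dict[str, str]]) -> None:
    """Write one JSON object per line, which is the JSONL layout."""
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Calling `write_dataset(Path("questions.jsonl"), ROWS)` would create a file matching the `path` used in the example configuration.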

See docs/run-configuration.md for the full reference including authentication setup.
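
To illustrate what `num_repetitions` and `max_concurrent` mean for the size and pacing of a run, the sketch below enumerates every (question, condition, repetition) tuple and evaluates them with a bounded number in flight. This is not k-eval's implementation: `run_all` and the `evaluate` callback are hypothetical, and a plain `asyncio.Semaphore` stands in for whatever scheduling k-eval actually uses.

```python
import asyncio
from itertools import product


async def run_all(questions, conditions, num_repetitions, max_concurrent, evaluate):
    """Evaluate every (question, condition, repetition) tuple.

    At most `max_concurrent` calls to `evaluate` run at the same time.
    `evaluate` is a hypothetical async callable supplied by the caller.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(triple):
        # Wait for a free slot before starting this evaluation.
        async with semaphore:
            return await evaluate(*triple)

    # Total work is len(questions) * len(conditions) * num_repetitions.
    triples = list(product(questions, conditions, range(num_repetitions)))
    return await asyncio.gather(*(bounded(triple) for triple in triples))
```

With 2 questions, 2 conditions, and `num_repetitions: 3`, this schedules 12 evaluations, of which no more than `max_concurrent` run simultaneously.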
