Context-aware evaluation framework for AI agents using MCP.

Maintainers

jsell-rh

k-eval

Context-aware evaluation framework for AI agents using MCP.

Quick Start

k-eval uses uv for dependency management. Install it first if you don't have it:

curl -LsSf https://astral.sh/uv/install.sh | sh

Install `k-eval`

git clone https://github.com/jsell-rh/k-eval.git
cd k-eval/src/k-eval

# Core dependencies
uv sync

# With Vertex AI provider support
uv sync --extra vertex_ai

# All provider dependencies
uv sync --extra all

Run `k-eval`

k-eval runs are configured with YAML files (see Configuration).

Once an evaluation is defined in a YAML file, invoke k-eval as follows:

cd src/k-eval
uv run python -m k_eval.cli.main /path/to/config.yaml

See docs/run-configuration.md for authentication setup and all CLI options.

CLI Options

src/k-eval$ uv run python -m k_eval.cli.main --help

 Usage: python -m cli.main [OPTIONS] CONFIG_PATH

 Run a k-eval evaluation from a YAML config file.
                                                                                                                                    
╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    config_path      PATH  Path to evaluation config YAML [required]                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --output-dir  -o      PATH  Directory for output files [default: results]                                                        │
│ --log-format          TEXT  Log format: 'console' or 'json' [default: console]                                                   │
│ --quiet       -q            Suppress debug and info logs; show only the progress bar plus warnings/errors.                       │
│ --help                      Show this message and exit.                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
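The documented flags can also be assembled programmatically, for example when launching several runs from a script. The following is a minimal sketch, not part of k-eval: `build_command` is a hypothetical helper, and it uses only the flags shown in the help output above.

```python
def build_command(config_path, output_dir=None, log_format=None, quiet=False):
    """Assemble the argv list for a k-eval run (hypothetical helper)."""
    cmd = ["uv", "run", "python", "-m", "k_eval.cli.main"]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if log_format is not None:
        # The help text lists exactly these two values.
        if log_format not in ("console", "json"):
            raise ValueError("log_format must be 'console' or 'json'")
        cmd += ["--log-format", log_format]
    if quiet:
        cmd.append("--quiet")
    # CONFIG_PATH is the single required positional argument.
    cmd.append(str(config_path))
    return cmd


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    print(" ".join(build_command("config.yaml", output_dir="results", quiet=True)))
```

To execute the result, the list could be passed to `subprocess.run(cmd, cwd="src/k-eval")`, assuming the repository layout from the Quick Start.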

Understanding the Output

Each run produces two files in ./results/ (or wherever you point --output-dir):

results/
  my-eval_20260225_a1b2c3d4.json           # aggregate scores per condition
  my-eval_20260225_a1b2c3d4.detailed.jsonl # one line per (question, condition) pair

{name}_{date}_{run_id}.json — the summary. One entry per condition with mean and standard deviation for each of the three metrics across all questions and repetitions. Use this to compare conditions at a glance.

This file is intended to be mostly compliant with the Every Eval Ever schema. Notably, k-eval does not aggregate the three metrics into a single score. Thus, the individual metrics are written to score_details.details, and score_details.score is left null.

{name}_{date}_{run_id}.detailed.jsonl — the full record. One JSON object per (question, condition) pair containing the agent's raw responses for every repetition, per-repetition judge scores and reasoning, unverified claims, and token usage. Use this if you want to dig into why a condition scored the way it did.
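As a rough illustration of working with these files, the sketch below splits a result filename into its `{name}`, `{date}`, and `{run_id}` parts and reads a `.detailed.jsonl` file line by line. Both helpers are hypothetical and not part of k-eval. The sketch assumes that the date and run ID never contain underscores (as in the example filenames above), and it makes no assumption about the field names inside each JSON object.

```python
import json
from pathlib import Path


def parse_result_name(path):
    """Split '{name}_{date}_{run_id}' out of a summary or detailed filename."""
    filename = Path(path).name
    for suffix in (".detailed.jsonl", ".json"):
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            break
    else:
        raise ValueError(f"not a k-eval result file: {filename}")
    # Split from the right so an eval name containing underscores stays intact.
    name, date, run_id = stem.rsplit("_", 2)
    return {"name": name, "date": date, "run_id": run_id}


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Inspecting `sorted(record.keys())` on the first object from `read_jsonl` is a simple way to discover the actual record layout before writing any analysis code.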

The judge model scores each of the three metrics on a scale of 1 to 5:

Metric	What it measures
`factual_adherence`	Does the response stick to facts in the golden answer?
`completeness`	Does it cover all the essential points?
`helpfulness_and_clarity`	Is it well-structured and easy to act on?

See evaluation-methodology for more details.
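To make the per-condition summary concrete, here is a small sketch of how a mean and standard deviation per metric could be computed from per-repetition scores. It is not k-eval's implementation: `summarize` is a hypothetical function, the input shape (a list of dicts keyed by metric name) is assumed, and the choice of the sample standard deviation is an assumption. The `min_repetitions=3` default mirrors the configuration comment that standard deviation is reported when `num_repetitions >= 3`.

```python
from statistics import mean, stdev

METRICS = ("factual_adherence", "completeness", "helpfulness_and_clarity")


def summarize(repetitions, min_repetitions=3):
    """Return {metric: (mean, stdev or None)} for a list of score dicts."""
    summary = {}
    for metric in METRICS:
        scores = [rep[metric] for rep in repetitions]
        # Report a spread only when there are enough repetitions to be meaningful.
        spread = stdev(scores) if len(scores) >= min_repetitions else None
        summary[metric] = (mean(scores), spread)
    return summary


if __name__ == "__main__":
    # Made-up scores for illustration only.
    reps = [
        {"factual_adherence": 4, "completeness": 5, "helpfulness_and_clarity": 3},
        {"factual_adherence": 5, "completeness": 5, "helpfulness_and_clarity": 3},
        {"factual_adherence": 3, "completeness": 5, "helpfulness_and_clarity": 3},
    ]
    for metric, (avg, spread) in summarize(reps).items():
        print(metric, avg, spread)
```

Keeping the three metrics separate, rather than averaging them into one number, matches the choice described above of leaving score_details.score null.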

Configuration

A config file defines your dataset, agent, judge, MCP servers, and the conditions you want to compare.

Important: for MCP servers that require authentication, see docs/run-configuration.md.

An example configuration:

name: "my-eval"
version: "1"

dataset:
  # JSONL file with your questions and golden answers
  path: "./questions.jsonl"
  # The key that holds the question in each line of the JSONL file.
  question_key: "question"
  # The key that holds the golden (reference) answer in each line of the JSONL file.
  answer_key: "answer"

agent:
  type: "claude_code_sdk" # currently the only supported type
  model: "claude-sonnet-4-5"

judge:
  model: "vertex_ai/claude-opus-4-5" # any LiteLLM-compatible model string (See: https://models.litellm.ai/)
  temperature: 0.0

mcp_servers:
  graph:
    type: "stdio"
    command: "python"
    args: ["-m", "my_mcp_server"]

conditions:
  baseline:
    mcp_servers: []
    system_prompt: |
        Answer using your own knowledge.
  with_graph:
    mcp_servers: [graph]
    system_prompt: |
        Use the graph tool to answer the question.

execution:
  # How many times each (question, condition) pair is evaluated.
  # This is useful for managing variance in agent responses. Standard
  # deviation between scores will be reported if num_repetitions >= 3
  num_repetitions: 3
  # (question, condition, repetition) tuples can be evaluated concurrently
  # to reduce total evaluation time. The upper bound of this number is determined
  # only by the resources on your computer and by the rate limit configuration
  # of the agent and model providers.
  #
  # In practice, numbers even as high as 50 seem to be well tolerated 
  # when using Vertex AI.
  max_concurrent: 5

See docs/run-configuration.md for the full reference including authentication setup.
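For the `dataset` section, each line of the JSONL file is one JSON object containing the keys named by `question_key` and `answer_key`. The sketch below writes such a file. `write_dataset` is a hypothetical helper and the sample question is invented for illustration. Whether k-eval accepts additional keys per line is not stated here, so the sketch writes only the two configured ones.

```python
import json


def write_dataset(path, pairs, question_key="question", answer_key="answer"):
    """Write (question, golden answer) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for question, answer in pairs:
            record = {question_key: question, answer_key: answer}
            handle.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Invented sample content; replace with your own questions and golden answers.
    write_dataset(
        "questions.jsonl",
        [("Which service owns the billing database?", "The payments service.")],
    )
```

If the keys in your file differ from the defaults, pass the same names here that you set in `question_key` and `answer_key` in the config.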

Release history

Version	Release date
1.2.1	Apr 24, 2026
1.2.0	Apr 24, 2026
1.1.3	Mar 24, 2026
1.1.2	Feb 27, 2026
1.1.1	Feb 27, 2026
1.1.0	Feb 27, 2026
1.0.0	Feb 26, 2026
0.3.2	Feb 26, 2026
0.3.1	Feb 25, 2026
0.3.0 (this version)	Feb 25, 2026

Download files

Source Distribution

k_eval-0.3.0.tar.gz (36.1 kB), uploaded Feb 25, 2026, Source

Built Distribution

k_eval-0.3.0-py3-none-any.whl (50.1 kB), uploaded Feb 25, 2026, Python 3

File details

Details for the file k_eval-0.3.0.tar.gz.

File metadata

Download URL: k_eval-0.3.0.tar.gz
Upload date: Feb 25, 2026
Size: 36.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for k_eval-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`dd488d83722503a2057845c486fb4ad787bac8692b134c67e92b53a56252a8ce`
MD5	`f846d6d64fbab2cf2087c14a76a035a0`
BLAKE2b-256	`9ed7c4b511cbb01df2b0968093b158dde733d7993479575bc215c5b4c314b114`

Provenance

The following attestation bundles were made for k_eval-0.3.0.tar.gz:

Publisher: publish.yml on jsell-rh/k-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: k_eval-0.3.0.tar.gz
- Subject digest: dd488d83722503a2057845c486fb4ad787bac8692b134c67e92b53a56252a8ce
- Sigstore transparency entry: 992583663
- Sigstore integration time: Feb 25, 2026
Source repository:
- Permalink: jsell-rh/k-eval@0e83692325498a8f326902bb0dd0f4c08d24d7a1
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/jsell-rh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0e83692325498a8f326902bb0dd0f4c08d24d7a1
- Trigger Event: release

File details

Details for the file k_eval-0.3.0-py3-none-any.whl.

File metadata

Download URL: k_eval-0.3.0-py3-none-any.whl
Upload date: Feb 25, 2026
Size: 50.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for k_eval-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9a09536ac8c6a2eefaa022959c4ca595b86b662c0910108fa736135535170477`
MD5	`891312624bfb293c1af8dee15cf5ec49`
BLAKE2b-256	`6c3f927ede48e20ad67c7057666022bf9c28b100adb32b8b9c434f12a713d105`

Provenance

The following attestation bundles were made for k_eval-0.3.0-py3-none-any.whl:

Publisher: publish.yml on jsell-rh/k-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: k_eval-0.3.0-py3-none-any.whl
- Subject digest: 9a09536ac8c6a2eefaa022959c4ca595b86b662c0910108fa736135535170477
- Sigstore transparency entry: 992583665
- Sigstore integration time: Feb 25, 2026
Source repository:
- Permalink: jsell-rh/k-eval@0e83692325498a8f326902bb0dd0f4c08d24d7a1
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/jsell-rh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0e83692325498a8f326902bb0dd0f4c08d24d7a1
- Trigger Event: release
