Open-ended tool use evaluation framework

mcpx-eval

A framework for evaluating open-ended tool use across various large language models.

mcpx-eval compares the output of different LLMs given the same prompt for a given task using mcp.run tools. The goal is to assess not only the quality of the output, but also how helpful each model is when presented with real-world tools.

Test configs

The tests/ directory contains predefined evals.

Installation

uv tool install mcpx-eval

Or from git:

uv tool install git+https://github.com/dylibso/mcpx-eval

Or using uvx without installation:

uvx mcpx-eval

mcp.run Setup

You will need to get an mcp.run session ID by running:

npx --yes -p @dylibso/mcpx gen-session --write

This will generate a new session and write the session ID to a configuration file that can be used by mcpx-eval.

If you need to store the session ID in an environment variable you can run gen-session without the --write flag:

npx --yes -p @dylibso/mcpx gen-session

which should output something like:

Login successful!
Session: kabA7w6qH58H7kKOQ5su4v3bX_CeFn4k.Y4l/s/9dQwkjv9r8t/xZFjsn2fkLzf+tkve89P1vKhQ

Then set the MCP_RUN_SESSION_ID environment variable:

export MCP_RUN_SESSION_ID=kabA7w6qH58H7kKOQ5su4v3bX_CeFn4k.Y4l/s/9dQwkjv9r8t/xZFjsn2fkLzf+tkve89P1vKhQ
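
When the session ID is supplied through the environment rather than the configuration file, a wrapper script may want to confirm the variable is present before starting an eval. The following is a minimal sketch, not part of mcpx-eval: the helper name `require_session_id` is hypothetical, and only the `MCP_RUN_SESSION_ID` variable name comes from the instructions above.

```python
import os


def require_session_id(env=None):
    """Return the mcp.run session ID from the environment, or raise if it is unset.

    Hypothetical helper for illustration; mcpx-eval does not provide it.
    """
    # Allow a plain dict to be passed in for testing; default to the real environment.
    env = os.environ if env is None else env
    session_id = env.get("MCP_RUN_SESSION_ID", "").strip()
    if not session_id:
        raise RuntimeError(
            "MCP_RUN_SESSION_ID is not set; run gen-session and export the value first"
        )
    return session_id
```

Calling `require_session_id()` at the top of a script fails early with a readable message instead of partway through an eval.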

Usage

Run an eval comparing all mcp.run task runs for my-task:

mcpx-eval test --task my-task --task-run all

Only evaluate the latest task run:

mcpx-eval test --task my-task --task-run latest

Or trigger a new task run:

mcpx-eval test --task my-task --task-run new
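
The three invocations above differ only in the value passed to `--task-run`. As a rough sketch of how a script might assemble them, the hypothetical helper `build_test_command` below builds the argument list using only the flags shown above; it is an illustration, not an API offered by mcpx-eval.

```python
import shlex


def build_test_command(task, task_run):
    """Build the argument list for `mcpx-eval test` for one task and task run.

    Hypothetical helper; only the `--task` and `--task-run` flags are taken
    from the usage examples.
    """
    return ["mcpx-eval", "test", "--task", task, "--task-run", str(task_run)]


# Print the command line for each task-run mode shown in the usage examples.
for mode in ("all", "latest", "new"):
    print(shlex.join(build_test_command("my-task", mode)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues when a task name contains spaces.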

Run an mcp.run task locally with a different set of models:

mcpx-eval test --model .. --model .. --task my-task --iter 10

Generate an HTML scoreboard for all evals:

mcpx-eval gen --html results.html --show

Test file

A test file is a TOML file containing the following fields:

  • name - name of the test
  • task - optional, the name of the mcp.run task to use
  • task-run - optional, one of latest, new, all or the name/index of the task run to analyze
  • prompt - the prompt to test, passed to the LLM under test; can be left blank if task is set
  • check - prompt for the judge, used to determine the quality of the test output
  • expected-tools - list of tool names that might be used
  • ignored-tools - optional, list of tools to ignore; they will not be available to the LLM
  • import - optional, includes fields from another test TOML file
  • vars - optional, a dict of variables that will be used to format the prompt
