Open-ended tool use evaluation framework

mcpx-eval

A framework for evaluating open-ended tool use across various large language models.

mcpx-eval compares the output of different LLMs given the same prompt for a given task, using mcp.run tools. We are interested not only in the quality of the output but also in how helpful each model is when presented with real-world tools.

Test configs

The tests/ directory contains predefined evals.

Installation

uv tool install mcpx-eval

Or from git:

uv tool install git+https://github.com/dylibso/mcpx-eval

Or using uvx without installation:

uvx mcpx-eval

Usage

Run the my-test test for 10 iterations:

mcpx-eval test --model ... --model ... --config my-test.toml --iter 10

Or run a task directly from mcp.run:

mcpx-eval test --model ... --model ... --task my-task --iter 10
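
When scripting several runs, the repeated --model flag can be generated from a list. The following is a minimal sketch, not part of mcpx-eval itself: build_test_command is a hypothetical helper, "model-a" and "model-b" are placeholder model names, and the sketch assumes a run uses either --config or --task, as in the two commands above.

```python
import shlex


def build_test_command(models, iterations, config=None, task=None):
    """Build the argument list for `mcpx-eval test`, one --model flag per model."""
    # Assumption: a run is driven by a config file or a task, not both.
    if (config is None) == (task is None):
        raise ValueError("pass exactly one of config or task")
    args = ["mcpx-eval", "test"]
    for model in models:
        args += ["--model", model]
    args += ["--config", config] if config is not None else ["--task", task]
    args += ["--iter", str(iterations)]
    return args


# "model-a" and "model-b" are placeholders, not real model identifiers.
command = build_test_command(["model-a", "model-b"], 10, config="my-test.toml")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once mcpx-eval is installed.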

Generate an HTML scoreboard for all evals:

mcpx-eval gen --html results.html --show

Test file

A test file is a TOML file containing the following fields:

  • name - name of the test
  • task - optional, the name of the mcp.run task to use
  • prompt - prompt passed to the LLM under test; can be left blank if task is set
  • check - prompt for the judge, used to determine the quality of the test output
  • expected-tools - list of tool names that might be used
  • ignore-tools - optional, list of tools to ignore; they will not be available to the LLM
  • import - optional, includes fields from another test TOML file
  • vars - optional, a dict of variables used to format the prompt

Download files

Source distribution: mcpx_eval-0.2.1.tar.gz (23.5 kB)

Built distribution: mcpx_eval-0.2.1-py3-none-any.whl (22.7 kB, Python 3)