Agent evaluation toolkit

agent-eval

A utility for evaluating agents on a suite of Inspect-formatted evals, with the following primary benefits:

  1. Specifies task suites as config.
  2. Extracts the token usage of the agent from log files, and computes cost using litellm.
  3. Submits task suite results to a leaderboard, with submission metadata and easy upload to a HuggingFace repo for distribution of scores and logs.
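
The cost computation in item 2 can be illustrated with a minimal sketch. The toolkit itself takes prices from litellm; the sketch below uses a hard-coded, hypothetical price table so that it stays self-contained. The names `TokenUsage`, `EXAMPLE_PRICES`, and `estimate_cost` are assumptions made for illustration and are not part of agent-eval.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Token counts as they might be extracted from a log file (hypothetical shape)."""
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per token: (input price, output price).
# These are placeholders, not real rates; the toolkit looks prices up via litellm.
EXAMPLE_PRICES = {"example-model": (3e-6, 15e-6)}


def estimate_cost(model: str, usage: TokenUsage, prices=EXAMPLE_PRICES) -> float:
    """Return the estimated cost in USD for one model's token usage."""
    input_price, output_price = prices[model]
    return usage.input_tokens * input_price + usage.output_tokens * output_price


usage = TokenUsage(input_tokens=1000, output_tokens=500)
print(f"{estimate_cost('example-model', usage):.4f} USD")
```

The point is only the arithmetic: each token count is multiplied by the per-token price for its direction, and the two products are summed.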

Installation

To install from PyPI, run pip install agent-eval.

Usage

Run evaluation suite

agenteval eval --config-path CONFIG_PATH --split SPLIT LOG_DIR

Evaluate an agent on the supplied eval suite configuration. Results are written to agenteval.json in the log directory.

See sample-config.yml for a sample configuration file.

For aggregation in a leaderboard, each task specifies a primary_metric as {scorer_name}/{metric_name}. The scoring utilities then look for a corresponding stderr metric: another metric with the same scorer_name whose metric_name contains the string "stderr".
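
The lookup rule described above can be sketched as follows. This is an illustration of the rule as stated, not the toolkit's actual implementation; the function name `find_stderr_metric` and the flat `{scorer_name}/{metric_name}` dictionary of metrics are assumptions.

```python
from typing import Optional


def find_stderr_metric(primary_metric: str, metrics: dict[str, float]) -> Optional[str]:
    """Return the key of the stderr metric that matches primary_metric, if any.

    Keys are assumed to have the form "{scorer_name}/{metric_name}".
    """
    scorer_name, _, _ = primary_metric.partition("/")
    for key in metrics:
        key_scorer, _, key_metric = key.partition("/")
        # Same scorer, a different metric, and "stderr" somewhere in the metric name.
        if key_scorer == scorer_name and key != primary_metric and "stderr" in key_metric:
            return key
    return None


# Hypothetical metrics for one task.
example_metrics = {
    "accuracy_scorer/accuracy": 0.80,
    "accuracy_scorer/stderr": 0.05,
    "other_scorer/stderr": 0.10,
}
print(find_stderr_metric("accuracy_scorer/accuracy", example_metrics))
```

Matching on the scorer name first keeps the stderr of an unrelated scorer (here `other_scorer/stderr`) from being paired with the primary metric.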

Score results

agenteval score [OPTIONS] LOG_DIR

Compute scores for the results in agenteval.json and update the file with the computed scores.

Publish scores

agenteval publish [OPTIONS] LOG_DIR

Upload the scored results to HuggingFace datasets.

Administer the HuggingFace datasets

Prior to publishing scores, two HuggingFace datasets should be set up, one for full submissions and one for results files.

If you want to call load_dataset() on the results dataset (e.g., to populate a leaderboard), you will probably want to tell HuggingFace the schema and dataset structure explicitly; otherwise, HuggingFace may fail to auto-convert the data to Parquet properly. To do so, update the configs attribute in the YAML metadata block at the top of the README.md file at the root of the results dataset (the metadata block is delimited by lines containing just --- above and below it). This attribute should contain a list of configs, each of which specifies the schema (under the features key) and the dataset structure (under the data_files key). See sample-config-hf-readme-metadata.yml for a sample metadata block corresponding to sample-config.yml (note that the metadata references the raw schema data, which must be copied).
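
To check what a dataset's README.md currently declares, the metadata block can be separated from the rest of the file with a few lines of standard-library Python. This is a minimal sketch: the helper `split_front_matter` and the sample README contents (config name, split, and path) are hypothetical, and the real structure should be taken from sample-config-hf-readme-metadata.yml.

```python
def split_front_matter(readme_text: str) -> tuple[str, str]:
    """Split README text into (metadata_yaml, body).

    The metadata block must start on the first line and is delimited by
    lines containing only '---'. If no complete block is found, the
    metadata is returned as an empty string and the body is the full text.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", readme_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", readme_text


# Hypothetical README.md contents, for illustration only.
README_SAMPLE = """---
configs:
- config_name: example
  data_files:
  - split: test
    path: example/test/*.json
---
# Results dataset
"""

metadata, body = split_front_matter(README_SAMPLE)
print(metadata)
```

The returned metadata string can then be passed to a YAML parser of your choice to inspect the configs attribute.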

To facilitate initializing new configs, agenteval publish will automatically add this metadata if it is missing.

Development

See Development.md for development instructions.
