Agent evaluation toolkit

Project description

agent-eval

A utility for evaluating agents on a suite of Inspect-formatted evals, with the following primary benefits:

Task suite specifications as config.
Extracts the token usage of the agent from log files, and computes cost using litellm.
Submits task suite results to a leaderboard, with submission metadata and easy upload to a HuggingFace repo for distribution of scores and logs.
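As a rough illustration of the cost step, the sketch below multiplies token counts by per-token prices. The Usage class, the usage_cost and total_cost helpers, the model name, and the price table are all hypothetical stand-ins; agent-eval itself takes prices from litellm, and this is not its actual implementation.

```python
from dataclasses import dataclass

# Hypothetical per-token prices in USD. The real tool relies on litellm
# for pricing; this table only stands in for it.
PRICES_PER_TOKEN = {
    "example-model": {"input": 2e-06, "output": 6e-06},
}


@dataclass
class Usage:
    """Token usage for one model call (assumed shape, for illustration)."""
    model: str
    input_tokens: int
    output_tokens: int


def usage_cost(usage, prices=PRICES_PER_TOKEN):
    """Cost of a single usage record: tokens times per-token price."""
    price = prices[usage.model]
    return (usage.input_tokens * price["input"]
            + usage.output_tokens * price["output"])


def total_cost(usages, prices=PRICES_PER_TOKEN):
    """Sum the cost over all usage records extracted from the logs."""
    return sum(usage_cost(u, prices) for u in usages)


records = [Usage("example-model", 1000, 500), Usage("example-model", 2000, 0)]
print(f"total cost: ${total_cost(records):.6f}")
```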

Installation

To install from PyPI, run pip install agent-eval.

For leaderboard extras, use pip install agent-eval[leaderboard].

Usage

Run evaluation suite

agenteval eval --config-path CONFIG_PATH --split SPLIT LOG_DIR

Evaluate an agent on the supplied eval suite configuration. Results are written to agenteval.json in the log directory.
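For scripting, the command above can be assembled in Python. This is a minimal sketch: build_eval_command and results_path are hypothetical helper names, and the split name and log directory are placeholder values. Only the flags and the agenteval.json file name come from the text above.

```python
import shlex
from pathlib import Path
from typing import List


def build_eval_command(config_path: str, split: str, log_dir: str) -> List[str]:
    """Build the argv list for `agenteval eval` as documented above."""
    return ["agenteval", "eval",
            "--config-path", config_path,
            "--split", split,
            log_dir]


def results_path(log_dir: str) -> Path:
    """Location of the results file written into the log directory."""
    return Path(log_dir) / "agenteval.json"


# Placeholder arguments; substitute your own config, split, and log directory.
cmd = build_eval_command("sample-config.yml", "validation", "logs/run1")
print(shlex.join(cmd))
print(results_path("logs/run1"))
```

The list can be passed to subprocess.run(cmd, check=True) on a machine where agent-eval is installed.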

See sample-config.yml for a sample configuration file.

For aggregation in a leaderboard, each task specifies a primary_metric as {scorer_name}/{metric_name}. The scoring utilities then look for a corresponding stderr metric: another metric with the same scorer_name whose metric_name contains the string "stderr".
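The lookup rule described above can be sketched as follows. The function name find_stderr_metric and the example metric names are hypothetical, and the package's own scoring code may differ in detail.

```python
from typing import List, Optional


def find_stderr_metric(primary_metric: str, available: List[str]) -> Optional[str]:
    """Return the first metric that shares the primary metric's scorer name
    and whose metric name contains "stderr", or None if there is none."""
    scorer_name, _, _metric_name = primary_metric.partition("/")
    for candidate in available:
        cand_scorer, sep, cand_metric = candidate.partition("/")
        if (sep
                and candidate != primary_metric
                and cand_scorer == scorer_name
                and "stderr" in cand_metric):
            return candidate
    return None


# Illustrative metric names in {scorer_name}/{metric_name} form.
metrics = ["accuracy_scorer/mean", "accuracy_scorer/stderr", "other_scorer/stderr"]
print(find_stderr_metric("accuracy_scorer/mean", metrics))
```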

Weighted Macro Averaging with Tags

Tasks can be grouped using tags for computing summary statistics. The tags support weighted macro averaging, allowing you to assign different weights to tasks within a tag group.

Tags are specified as simple strings on tasks. To adjust weights for specific tag-task combinations, use the macro_average_weight_adjustments field at the split level. Tasks not specified in the adjustments default to a weight of 1.0.
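To make the weighting concrete, here is a minimal sketch of a weighted macro average for one tag. The function name, the task and tag names, and the nested-dict shape assumed for macro_average_weight_adjustments (tag, then task, then weight) are illustrative assumptions; see sample-config.yml for the actual format.

```python
def weighted_macro_average(scores, tag, adjustments=None):
    """Weighted mean of per-task scores within one tag group.

    scores:      task name -> score, for the tasks carrying `tag`
    adjustments: assumed shape {tag: {task: weight}}; tasks that are
                 not listed default to a weight of 1.0
    """
    tag_adjustments = (adjustments or {}).get(tag, {})
    weighted_sum = 0.0
    weight_sum = 0.0
    for task, score in scores.items():
        weight = tag_adjustments.get(task, 1.0)
        weighted_sum += weight * score
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError(f"no weighted tasks for tag {tag!r}")
    return weighted_sum / weight_sum


# Hypothetical tasks under a hypothetical tag "lit".
scores = {"task_a": 0.5, "task_b": 1.0}
print(weighted_macro_average(scores, "lit"))
print(weighted_macro_average(scores, "lit", {"lit": {"task_b": 3.0}}))
```

With equal weights the two tasks average to 0.75. Giving task_b a weight of 3.0 moves the result to (0.5 + 3.0) / 4 = 0.875.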

See sample-config.yml for an example of the tag and weight adjustment format.

Score results

agenteval score [OPTIONS] LOG_DIR

Compute scores for the results in agenteval.json and update the file with the computed scores.

Publish scores to leaderboard

agenteval lb publish [OPTIONS] LOG_DIR

Upload the scored results to HuggingFace datasets.

View leaderboard scores

agenteval lb view [OPTIONS]

View results from the leaderboard.

Administer the leaderboard

Before publishing scores, set up two HuggingFace datasets: one for full submissions and one for results files.

If you want to call load_dataset() on the results dataset (e.g., to populate a leaderboard), you probably want to tell HuggingFace about the schema and dataset structure explicitly (otherwise, HuggingFace may fail to properly auto-convert to Parquet). To do so, update the configs attribute in the YAML metadata block at the top of the README.md file at the root of the results dataset (the metadata block is delimited by lines containing just --- above and below it). This attribute should contain a list of configs, each of which specifies the schema (under the features key) and dataset structure (under the data_files key). See sample-config-hf-readme-metadata.yml for a sample metadata block corresponding to sample-config.yml (note that the metadata references the raw schema data, which must be copied).

To facilitate initializing new configs, agenteval lb publish will automatically add this metadata if it is missing.
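To check by hand whether a results dataset README already carries this metadata, a small standard-library sketch like the one below can locate the block between the --- lines. The helper names and the sample README text are hypothetical, and this is not how agenteval lb publish itself performs the check.

```python
from typing import Optional


def extract_metadata_block(readme_text: str) -> Optional[str]:
    """Return the YAML text between the leading '---' lines, or None
    if the README does not start with a metadata block."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i])
    return None  # opening '---' without a closing one


def has_configs(readme_text: str) -> bool:
    """True if the metadata block has a top-level 'configs:' key."""
    block = extract_metadata_block(readme_text)
    if block is None:
        return False
    return any(line.startswith("configs:") for line in block.splitlines())


# Illustrative README content; real metadata would also carry
# the features and data_files keys described above.
sample = "---\nconfigs:\n- config_name: example\n---\n# Results\n"
print(has_configs(sample))
print(has_configs("# Results\n"))
```

A full check would parse the block with a YAML library. The line-based test here only detects whether the key is present.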

Development

See Development.md for development instructions.

Project details

Release history

Version	Release date
0.1.50	Apr 27, 2026
0.1.49	Apr 25, 2026
0.1.48	Apr 21, 2026
0.1.47	Apr 2, 2026
0.1.46	Mar 26, 2026
0.1.45	Mar 24, 2026
0.1.44	Jan 14, 2026
0.1.43	Aug 28, 2025
0.1.42	Aug 22, 2025
0.1.41	Aug 22, 2025
0.1.40	Aug 21, 2025
0.1.39	Aug 20, 2025
0.1.38	Aug 20, 2025
0.1.37	Aug 20, 2025
0.1.36	Aug 19, 2025
0.1.35	Aug 19, 2025
0.1.34	Aug 14, 2025
0.1.33	Aug 12, 2025
0.1.32	Aug 7, 2025
0.1.31	Aug 7, 2025
0.1.30	Aug 7, 2025
0.1.29	Aug 7, 2025
0.1.28	Aug 6, 2025
0.1.27	Aug 6, 2025
0.1.26	Aug 6, 2025
0.1.25	Aug 6, 2025
0.1.24	Aug 1, 2025
0.1.23	Aug 1, 2025
0.1.22	Jul 31, 2025
0.1.21	Jul 31, 2025
0.1.20	Jul 30, 2025
0.1.19	Jul 29, 2025
0.1.18	Jul 24, 2025
0.1.17	Jul 24, 2025
0.1.16	Jul 23, 2025
0.1.15	Jul 17, 2025
0.1.14	Jul 16, 2025
0.1.13	Jul 3, 2025
0.1.12	Jul 3, 2025
0.1.11	Jul 2, 2025 (this version)
0.1.10	Jun 27, 2025
0.1.9	Jun 23, 2025
0.1.8	Jun 21, 2025
0.1.7	Jun 19, 2025
0.1.6	Jun 18, 2025
0.1.5	Jun 11, 2025
0.1.4	Jun 6, 2025
0.1.3	Jun 4, 2025
0.1.1	May 12, 2025

Download files

Source Distribution

agent_eval-0.1.11.tar.gz (23.2 kB view details)

Uploaded Jul 2, 2025 Source

Built Distribution

agent_eval-0.1.11-py3-none-any.whl (24.9 kB view details)

Uploaded Jul 2, 2025 Python 3

File details

Details for the file agent_eval-0.1.11.tar.gz.

File metadata

Download URL: agent_eval-0.1.11.tar.gz
Upload date: Jul 2, 2025
Size: 23.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for agent_eval-0.1.11.tar.gz
Algorithm	Hash digest
SHA256	`47c3e716a76b4109cd42b2c6352790b641541879d8a969ebe9cdd515e9587667`
MD5	`cc6c0580d03c15afd9acc52d69303a86`
BLAKE2b-256	`db31291a073e9dd90537b395f4c55c77d72371adc8616d98cce50a21f6c40f74`

File details

Details for the file agent_eval-0.1.11-py3-none-any.whl.

File metadata

Download URL: agent_eval-0.1.11-py3-none-any.whl
Upload date: Jul 2, 2025
Size: 24.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for agent_eval-0.1.11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dbbd69014c8c5b62a9bb7fd6a561f8cb3fcfe9d62fb9fdeb96c18c617af11bd5`
MD5	`ba4b26669b7a1fb71df45b2004057650`
BLAKE2b-256	`05073e38e4f51693102bdc14960242de78a931636ed1fbb2cc091ef17142c64c`
