Small Benchmarks for LM Agents

SmallBench

Small, simple agent task environments for training and evaluation.

Designed to challenge a broad spectrum of LM agent abilities.

Spinning Up - Use

uv add smallbench

pip install smallbench

Spinning Up - Dev

uv venv smallbench-dev
source smallbench-dev/bin/activate
uv sync
uv run ruff format .

Easy Benchmarks

BigCodeBench - Agent Harness

This benchmark provides a stateful environment in which LM-based agents solve coding problems from the BigCodeBench dataset. Agents are given a scratchpad (coming soon), a way to prepare and run unit tests, editing handlers, and a way to submit their solution.
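To make the interaction loop concrete, here is a minimal sketch of a stateful coding environment with the same kinds of handlers (edit, add unit test, run tests, submit). The class `ToyCodingEnv` and all of its method names are hypothetical and exist only for illustration; they are not the smallbench API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCodingEnv:
    """Hypothetical stand-in for a stateful coding environment (not the smallbench API)."""

    solution: str = ""
    tests: list = field(default_factory=list)
    submitted: bool = False

    def edit_solution(self, new_source: str) -> None:
        # Editing handler: replace the current draft with new source code.
        self.solution = new_source

    def add_test(self, assertion: str) -> None:
        # Unit-test handler: store one bare `assert ...` statement.
        self.tests.append(assertion)

    def run_tests(self) -> list:
        # Run every stored test against the current draft in a fresh namespace.
        results = []
        for test in self.tests:
            namespace = {}
            try:
                exec(self.solution, namespace)
                exec(test, namespace)
                results.append(True)
            except Exception:
                results.append(False)
        return results

    def submit(self) -> bool:
        # Submission handler: mark the episode finished and report success.
        self.submitted = True
        results = self.run_tests()
        return bool(results) and all(results)


# A two-step episode: a buggy first draft, then a corrected one.
env = ToyCodingEnv()
env.add_test("assert add(2, 3) == 5")
env.edit_solution("def add(a, b):\n    return a - b")
first_attempt = env.run_tests()
env.edit_solution("def add(a, b):\n    return a + b")
passed = env.submit()
print(first_attempt, passed)
```

The state (draft, tests, submission flag) persists between calls, which is what lets an agent iterate on test feedback before submitting. A real harness would run the code in a sandbox such as Docker or Modal rather than calling `exec` in-process.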

Please see the BigCodeBench page for more information about the underlying dataset.

Get Started

Local

Add GROQ_API_KEY, plus any other API keys supported by the apropos-ai library, to the .env file.

Note: Groq, Google, and possibly other providers offer free tiers.
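Before running anything, it can help to confirm that the .env file actually contains the keys you expect. The sketch below uses only the standard library; the helpers `parse_env` and `missing_keys` are hypothetical names introduced here, and the parser assumes a simple `KEY=value` layout rather than the full dotenv syntax.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict, required: list) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_keys(parse_env(env_path.read_text()), ["GROQ_API_KEY"])
        print("Missing keys:", missing or "none")
    else:
        print("No .env file found in the current directory.")
```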

If you use the Docker backend, make sure Docker is running. If you use Modal, add the necessary credentials.
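One way to check that Docker is reachable before starting a run is to call the `docker info` command, which exits with a non-zero status when the daemon cannot be contacted. The function name `docker_available` is a hypothetical helper, not part of smallbench.

```python
import shutil
import subprocess


def docker_available(timeout: float = 10.0) -> bool:
    """Return True if the docker CLI exists and the daemon responds."""
    if shutil.which("docker") is None:
        return False  # CLI is not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


print("Docker reachable:", docker_available())
```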

Then, run the test script:

uv run python -m src.smallbench.benchmarks.bcb_a.test

Colab

Check out the Colab if you prefer to run the benchmark in the cloud.

Medium Benchmarks

TBD

Hard Benchmarks

TBD

Difficult Benchmarks

TBD

Scores

BigCodeBench - Agent Harness (ReAct)

LM	Number Correct - Train	Number Correct - Test	Sample Size	Avg. Cost Per Run
gpt-4o-mini-2024-07-18-ft*	???	21	100	$0.006
gpt-4o-2024-08-06	17	18	100	$0.057
gpt-4o-mini-2024-07-18-ft**	???	13	100	$0.006
deepseek-v2.5	12	???	100	$0.0029
gpt-4o-mini-2024-07-18	12	8	100	$0.003
gemini-1.5-flash-latest	6	7	100	$0.0018
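
The table reports raw counts and per-run cost separately; dividing one by the other gives a rough cost per correct test solution. The sketch below copies the test-column figures from the table and rests on two assumptions that the table does not state: that the sample size of 100 applies to the test split, and that "Avg. Cost Per Run" is the average cost of one problem attempt.

```python
from typing import Optional

SAMPLE_SIZE = 100  # assumption: the sample size applies to the test split

# (model, number correct on test, average cost per run in USD).
# None marks an entry shown as "???" in the table.
ROWS = [
    ("gpt-4o-mini-2024-07-18-ft*", 21, 0.006),
    ("gpt-4o-2024-08-06", 18, 0.057),
    ("gpt-4o-mini-2024-07-18-ft**", 13, 0.006),
    ("deepseek-v2.5", None, 0.0029),
    ("gpt-4o-mini-2024-07-18", 8, 0.003),
    ("gemini-1.5-flash-latest", 7, 0.0018),
]


def cost_per_correct(
    avg_cost: float, correct: Optional[int], sample_size: int = SAMPLE_SIZE
) -> Optional[float]:
    """Total spend over the sample divided by the number of solved problems."""
    if not correct:
        return None  # unknown or zero correct: the ratio is undefined
    return avg_cost * sample_size / correct


for name, correct, cost in ROWS:
    ratio = cost_per_correct(cost, correct)
    accuracy = "n/a" if correct is None else f"{correct / SAMPLE_SIZE:.0%}"
    ratio_text = "n/a" if ratio is None else f"${ratio:.3f}"
    print(f"{name:30s} {accuracy:>5s} {ratio_text:>8s}")
```

Under these assumptions, the fine-tuned gpt-4o-mini variants come out at roughly a tenth of the cost per correct solution of gpt-4o-2024-08-06.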

"*" fine-tuned on a minimal subset (<500k tokens) of trajectories using a variation of the Filtered Behavioral Cloning approach (2024-09-17).
"**" fine-tuned on a minimal subset (<500k tokens) of trajectories using a variation of the Filtered Behavioral Cloning approach (2024-09-08).

Animation credits: ZZ
