
Small Benchmarks for LM Agents

SmallBench

Small, simple agent task environments for training and evaluation.

Designed to challenge a broad spectrum of LM-agent abilities.

Spinning Up - Use

uv add smallbench

or

pip install smallbench

Spinning Up - Dev

uv venv smallbench-dev
source smallbench-dev/bin/activate
uv sync
uv run ruff format .

Easy Benchmarks

BigCodeBench - Agent Harness

This benchmark provides a stateful environment in which LM-based agents solve coding problems from the BigCodeBench dataset. Agents are given a scratchpad (coming soon), a way to prepare and run unit tests, editing handlers, and a way to submit their solution.

Please see the BigCodeBench page for more information about the underlying dataset.
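To make the agent loop described above more concrete, here is a minimal toy sketch of a stateful coding environment with an editing handler, agent-written unit tests, and a submit step. Everything in it (the ToyCodingEnv class, its method names, and the convention that a solution defines a function called solve) is hypothetical and chosen for illustration; it is not the SmallBench API, and it runs code in-process with no sandboxing.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCodingEnv:
    """Hypothetical stand-in for a stateful coding environment (NOT the SmallBench API)."""

    hidden_tests: list  # (args, expected) pairs, checked only at submission
    solution_src: str = ""
    agent_tests: list = field(default_factory=list)
    submitted: bool = False

    def edit_solution(self, src: str) -> None:
        """Editing handler: replace the current solution source."""
        if self.submitted:
            raise RuntimeError("episode is over; no edits after submission")
        self.solution_src = src

    def add_unit_test(self, args: tuple, expected) -> None:
        """Let the agent register its own test case."""
        self.agent_tests.append((args, expected))

    def _passes(self, args: tuple, expected) -> bool:
        namespace = {}
        try:
            exec(self.solution_src, namespace)  # toy only: no sandboxing
            return namespace["solve"](*args) == expected
        except Exception:
            return False  # syntax errors, missing solve(), runtime errors

    def run_unit_tests(self) -> list:
        """Run the agent's own tests against the current solution."""
        return [self._passes(a, e) for a, e in self.agent_tests]

    def submit(self) -> bool:
        """End the episode and score the solution on the hidden tests."""
        self.submitted = True
        return all(self._passes(a, e) for a, e in self.hidden_tests)


if __name__ == "__main__":
    env = ToyCodingEnv(hidden_tests=[((2, 3), 5)])
    env.edit_solution("def solve(a, b):\n    return a - b")  # buggy draft
    env.add_unit_test((1, 1), 2)
    print(env.run_unit_tests())
    env.edit_solution("def solve(a, b):\n    return a + b")  # fixed draft
    print(env.run_unit_tests())
    print(env.submit())
```

The point of the sketch is the state carried between calls: the agent can edit, test, and edit again before committing to a single submission.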

Get Started

Local

Add GROQ_API_KEY and any other API keys supported by the apropos-ai library to the .env file.

  • Note: Groq, Google, and possibly other providers offer free tiers.
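Before running anything, it can help to confirm that the keys are actually visible. The following is a minimal standard-library sketch that reads KEY=VALUE lines from a .env file and reports which required keys are missing. The helper names load_env_file and missing_keys are hypothetical, and the parsing is deliberately simple (it is an assumption that your .env uses plain KEY=VALUE lines); it is not how smallbench or apropos-ai load configuration.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; skip blanks, comments, and malformed lines."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(required, path=".env"):
    """Return the required keys that are unset or empty in both .env and the process env."""
    available = {**load_env_file(path), **os.environ}
    return [key for key in required if not available.get(key)]


if __name__ == "__main__":
    missing = missing_keys(["GROQ_API_KEY"])
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All required keys found.")
```

Values already set in the process environment take precedence over the file in this sketch; swap the merge order if you want the opposite.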

If you use a Docker backend, ensure you have the Docker app running. If you use Modal, please add all necessary credentials.

Then, run the test script:

uv run python -m src.smallbench.benchmarks.bcb_a.test

Colab

Check out the Colab if you prefer to run the benchmark in the cloud.

Medium Benchmarks

TBD

Hard Benchmarks

TBD

Difficult Benchmarks

TBD

Scores

BigCodeBench - Agent Harness (ReAct)

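| LM                          | Number Correct - Train | Number Correct - Test | Sample Size | Avg. Cost Per Run |
|-----------------------------|------------------------|-----------------------|-------------|-------------------|
| gpt-4o-mini-2024-07-18-ft*  | ???                    | 21                    | 100         | $0.006            |
| gpt-4o-2024-08-06           | 17                     | 18                    | 100         | $0.057            |
| gpt-4o-mini-2024-07-18-ft** | ???                    | 13                    | 100         | $0.006            |
| deepseek-v2.5               | 12                     | ???                   | 100         | $0.0029           |
| gpt-4o-mini-2024-07-18      | 12                     | 8                     | 100         | $0.003            |
| gemini-1.5-flash-latest     | 6                      | 7                     | 100         | $0.0018           |
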
  • "*" fine-tuned on a minimal subset (<500k tokens) of trajectories using a variation of the Filtered Behavioral Cloning approach (2024-09-17).
  • "**" fine-tuned on a minimal subset (<500k tokens) of trajectories using a variation of the Filtered Behavioral Cloning approach (2024-09-08).
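
One way to compare the rows above is by test accuracy and by cost per correct test solution. The sketch below copies the test-split figures from the scores table and assumes that "Avg. Cost Per Run" means the cost of one task attempt and that the sample size applies to the test split; neither is stated explicitly, so treat the derived numbers as rough. Rows with an unknown ("???") test score are skipped rather than estimated.

```python
# (model, test_correct or None if unknown, sample_size, avg_cost_per_run_usd)
ROWS = [
    ("gpt-4o-mini-2024-07-18-ft*", 21, 100, 0.006),
    ("gpt-4o-2024-08-06", 18, 100, 0.057),
    ("gpt-4o-mini-2024-07-18-ft**", 13, 100, 0.006),
    ("deepseek-v2.5", None, 100, 0.0029),
    ("gpt-4o-mini-2024-07-18", 8, 100, 0.003),
    ("gemini-1.5-flash-latest", 7, 100, 0.0018),
]


def summarize(rows):
    """Return (model, test_accuracy, usd_per_correct_solution) for rows with a known score."""
    summary = []
    for model, correct, sample_size, cost_per_run in rows:
        if correct is None:
            continue  # unknown test score: do not guess
        accuracy = correct / sample_size
        # Assumption: total spend = cost per run * number of runs in the sample.
        cost_per_correct = cost_per_run * sample_size / correct
        summary.append((model, accuracy, cost_per_correct))
    return summary


if __name__ == "__main__":
    for model, accuracy, cost_per_correct in summarize(ROWS):
        print(f"{model:30s} acc={accuracy:.2f}  $/correct={cost_per_correct:.4f}")
```

Under these assumptions the fine-tuned gpt-4o-mini rows cost roughly a tenth as much per correct solution as gpt-4o-2024-08-06, even though their per-run cost is double that of the base gpt-4o-mini.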

Animation credits: ZZ
