Small Benchmarks for LM Agents
SmallBench
Small, simple agent task environments for training and evaluation.
Designed to challenge a broad spectrum of LM-agent abilities.
Spinning Up - Use
uv add smallbench
or
pip install smallbench
Spinning Up - Dev
uv venv smallbench-dev
source smallbench-dev/bin/activate
uv sync
uv run ruff format .
Easy Benchmarks
BigCodeBench - Agent Harness
This benchmark provides a stateful environment in which LM-based agents solve coding problems from the BigCodeBench dataset. Agents are given a scratchpad (soon), a way to prepare and use unit tests, editing handlers, and a way to submit their solution.
Please see the BigCodeBench page for more information about the underlying dataset.
Get Started
Local
Add GROQ_API_KEY, plus keys for any other providers supported by the apropos-ai library that you plan to use, to the .env file.
- Note: Groq, Google, and possibly other providers offer free tiers.
If you use a Docker backend, ensure you have the Docker app running. If you use Modal, please add all necessary credentials.
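The prerequisites above can be checked before a run with a small preflight script. This is a minimal sketch, not part of SmallBench: the helper names `parse_env_text`, `missing_keys`, and `docker_daemon_running` are hypothetical, the `.env` parsing handles only simple `KEY=VALUE` lines, and the Docker check assumes the `docker` CLI is on your PATH and relies on `docker info` exiting non-zero when the daemon is unreachable.

```python
import os
import shutil
import subprocess
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(required: list[str], env_path: str = ".env") -> list[str]:
    """Return required keys set neither in the environment nor in the .env file."""
    path = Path(env_path)
    file_values = parse_env_text(path.read_text()) if path.exists() else {}
    return [k for k in required if not (os.environ.get(k) or file_values.get(k))]


def docker_daemon_running(timeout_s: float = 10.0) -> bool:
    """Return True if the docker CLI exists and `docker info` succeeds."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # GROQ_API_KEY is the key named above; add others as needed.
    print("Missing keys:", missing_keys(["GROQ_API_KEY"]) or "none")
    # Only relevant if you use the Docker backend.
    print("Docker daemon reachable:", docker_daemon_running())
```

The Docker check matters only for the Docker backend; Modal credentials are not covered by this sketch.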
Then, run the test script:
uv run python -m src.smallbench.benchmarks.bcb_a.test
Colab
Check out the Colab if you prefer to run the benchmark in the cloud.
Medium Benchmarks
TBD
Hard Benchmarks
TBD
Difficult Benchmarks
TBD
Scores
BigCodeBench - Agent Harness (ReAct)
LM | Number Correct - Train | Number Correct - Test | Sample Size | Avg. Cost Per Run |
---|---|---|---|---|
gpt-4o-mini-2024-07-18-ft* | ??? | 21 | 100 | $0.006 |
gpt-4o-2024-08-06 | 17 | 18 | 100 | $0.057 |
gpt-4o-mini-2024-07-18-ft** | ??? | 13 | 100 | $0.006 |
deepseek-v2.5 | 12 | ??? | 100 | $0.0029 |
gpt-4o-mini-2024-07-18 | 12 | 8 | 100 | $0.003 |
gemini-1.5-flash-latest | 6 | 7 | 100 | $0.0018 |
- "*" fine-tuned on a minimal subset (<500k tokens) of trajectories using a variation of the Filtered Behavioral Cloning approach (2024-09-17).
- "**" fine-tuned on a minimal subset (<500k tokens) of trajectories using a variation of the Filtered Behavioral Cloning approach (2024-09-08).
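To compare models on both accuracy and cost, the table can be reduced to a test accuracy and a rough cost per correct test solution. The sketch below is illustrative only. It assumes that the Sample Size of 100 applies to the test column and that "Avg. Cost Per Run" is the average cost of one attempt at one problem; neither is confirmed by the table. The `summarize` helper is hypothetical, and deepseek-v2.5 is left out because its test score is not reported.

```python
SAMPLE_SIZE = 100  # from the "Sample Size" column

# (model, number correct on test, average cost per run in USD), copied from the table.
ROWS = [
    ("gpt-4o-mini-2024-07-18-ft*", 21, 0.006),
    ("gpt-4o-2024-08-06", 18, 0.057),
    ("gpt-4o-mini-2024-07-18-ft**", 13, 0.006),
    ("gpt-4o-mini-2024-07-18", 8, 0.003),
    ("gemini-1.5-flash-latest", 7, 0.0018),
]


def summarize(correct: int, cost_per_run: float, sample_size: int = SAMPLE_SIZE) -> tuple[float, float]:
    """Return (accuracy, cost per correct solution in USD).

    Assumes one run is one attempt at one problem, so the total spend
    is cost_per_run * sample_size.
    """
    accuracy = correct / sample_size
    cost_per_correct = cost_per_run * sample_size / correct
    return accuracy, cost_per_correct


if __name__ == "__main__":
    for model, correct, cost in ROWS:
        accuracy, cost_per_correct = summarize(correct, cost)
        print(f"{model:30s} accuracy={accuracy:.2f} cost/correct=${cost_per_correct:.4f}")
```

Under these assumptions, a model with a higher per-run cost can still be compared fairly against a cheaper one, since the second number folds the success rate into the price.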
Animation credits: ZZ