Small Benchmarks for LM Agents
SmallBench
Small, simple agent task environments for training and evaluation.
Designed to challenge a broad spectrum of LM-agent abilities.
Spinning Up
uv venv smallbench-dev
source smallbench-dev/bin/activate
uv sync
uv run ruff format .
Easy Benchmarks
BigCodeBench - Agent Harness
This benchmark provides a stateful environment in which LM-based agents solve coding problems from the BigCodeBench dataset. Agents are given a scratchpad (coming soon), a way to prepare and use unit tests, editing handlers, and a way to submit their solution.
Please see the BigCodeBench page for more information about the underlying dataset.
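To make the idea of a stateful coding environment concrete, here is a toy sketch of the edit / test / submit loop described above. Every name in it (`ToyCodingEnv`, `edit_solution`, `add_test`, `run_tests`, `submit`) is hypothetical and chosen for illustration only; it is not SmallBench's actual interface. It also runs code in-process with `exec`, which is unsafe for untrusted code and stands in for the Docker or Modal backends a real harness would use.

```python
# Hypothetical sketch: none of these names come from SmallBench itself.
from dataclasses import dataclass, field


@dataclass
class ToyCodingEnv:
    """Toy stateful environment holding a draft solution and agent-written tests."""

    solution: str = ""
    tests: list[str] = field(default_factory=list)
    submitted: bool = False

    def edit_solution(self, new_source: str) -> None:
        # Editing handler: replace the current draft with new source code.
        self.solution = new_source

    def add_test(self, test_source: str) -> None:
        # Store a unit test written as plain assert statements.
        self.tests.append(test_source)

    def run_tests(self) -> list[bool]:
        # Run the draft and then each test in a fresh namespace; record pass/fail.
        results = []
        for test in self.tests:
            namespace: dict = {}
            try:
                exec(self.solution, namespace)
                exec(test, namespace)
                results.append(True)
            except Exception:
                results.append(False)
        return results

    def submit(self) -> str:
        # Submission ends the episode and returns the final draft.
        self.submitted = True
        return self.solution


env = ToyCodingEnv()
env.add_test("assert add(2, 3) == 5")
env.edit_solution("def add(a, b):\n    return a - b")
first_attempt = env.run_tests()   # the buggy draft fails the test
env.edit_solution("def add(a, b):\n    return a + b")
second_attempt = env.run_tests()  # the corrected draft passes
final_solution = env.submit()
```

The point of the sketch is that state (the draft and the tests) persists between actions, so an agent can iterate on feedback before committing to a submission.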
Get Started
Local
Add GROQ_API_KEY and any other API keys supported by the apropos-ai library to the .env file.
- Note: Groq, Google, and possibly other providers offer free tiers.
If you use a Docker backend, ensure you have the Docker app running. If you use Modal, please add all necessary credentials.
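Before running anything, it can help to confirm that the key is actually visible. The following is a minimal standard-library sketch that reads a simple `KEY=VALUE` style `.env` file and reports which required keys are missing. The helper names `parse_env_text` and `missing_keys` are hypothetical, and the simple file format is an assumption; the project may load its configuration differently.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(required: list[str], env_path: str = ".env") -> list[str]:
    """Return required keys found neither in the .env file nor in os.environ."""
    path = Path(env_path)
    file_values = parse_env_text(path.read_text()) if path.exists() else {}
    return [k for k in required if not (file_values.get(k) or os.environ.get(k))]


if __name__ == "__main__":
    missing = missing_keys(["GROQ_API_KEY"])
    print("Missing keys:", missing if missing else "none")
```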
Then, run the test script:
uv run python -m src.smallbench.benchmarks.bcb_a.test
Colab
Check out the Colab if you prefer to run the benchmark in the cloud.
Medium Benchmarks
TBD
Hard Benchmarks
TBD
Difficult Benchmarks
TBD
Scores
BigCodeBench - Agent Harness
LM | Number Correct | Success Rate | Sample Size | Avg. Cost Per Run |
---|---|---|---|---|
gpt-4o-2024-08-06 | 8 | 16% | 50 | $0.057 |
deepseek-v2.5 | 5 | 10% | 50 | $0.0029 |
gpt-4o-mini-2024-07-18 | 5 | 10% | 50 | $0.003 |
gemini-1.5-flash-latest | 3 | 6% | 50 | $0.0018 |
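The success rate column is simply the number correct divided by the sample size. The short sketch below recomputes it from the rows of the table above and also derives a rough cost per solved task. That second figure rests on an assumption not stated in the table: that one "run" is a single task attempt, so total spend is the average cost per run times the sample size.

```python
# Rows copied from the table above:
# (model, number correct, sample size, avg. cost per run in USD).
SCORES = [
    ("gpt-4o-2024-08-06", 8, 50, 0.057),
    ("deepseek-v2.5", 5, 50, 0.0029),
    ("gpt-4o-mini-2024-07-18", 5, 50, 0.003),
    ("gemini-1.5-flash-latest", 3, 50, 0.0018),
]


def success_rate(correct: int, sample_size: int) -> float:
    """Fraction of sampled tasks solved."""
    return correct / sample_size


def cost_per_solved(correct: int, sample_size: int, cost_per_run: float) -> float:
    """Rough spend per solved task.

    ASSUMPTION: one "run" is a single task attempt, so the total spend
    is cost_per_run * sample_size.
    """
    return cost_per_run * sample_size / correct


for model, correct, n, cost in SCORES:
    rate = success_rate(correct, n)
    per_solved = cost_per_solved(correct, n, cost)
    print(f"{model}: {rate:.0%} solved, ~${per_solved:.3f} per solved task")
```

With only 50 samples per model, differences of a few solved tasks are within what small-sample noise could produce, so the ranking is best read as indicative.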
Animation credits: ZZ