Small Benchmarks for LM Agents
SmallBench
Small, simple agent task environments for training and evaluation.
Designed to challenge a broad spectrum of LM-agent abilities.
Spinning Up
uv venv smallbench-dev
source smallbench-dev/bin/activate
uv sync
uv run ruff format .
Easy Benchmarks
BigCodeBench - Agent Harness
This benchmark provides a stateful environment in which LM-based agents solve coding problems from the BigCodeBench dataset. Agents are given a scratchpad (coming soon), a way to write and run unit tests, editing handlers, and a way to submit their solution.
Please see the BigCodeBench page for more information about the underlying dataset.
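To make the idea of a stateful environment concrete, here is a minimal sketch of the loop described above: edit a draft, register unit tests, run them, then submit. `ToyCodingEnv` and all of its method names are hypothetical and invented for illustration; they are not the SmallBench API, and the real harness may be structured quite differently.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ToyCodingEnv:
    """Hypothetical stand-in for a stateful coding environment (not the SmallBench API)."""

    question: str
    solution: str = ""  # the agent's current draft
    unit_tests: dict[str, str] = field(default_factory=dict)  # test name -> assert code
    submitted: bool = False

    def _check_open(self) -> None:
        # State is frozen once the agent has submitted.
        if self.submitted:
            raise RuntimeError("environment is closed after submission")

    def edit_solution(self, new_source: str) -> None:
        """Editing handler: replace the draft solution."""
        self._check_open()
        self.solution = new_source

    def add_unit_test(self, name: str, test_source: str) -> None:
        """Register a unit test, written as plain assert statements."""
        self._check_open()
        self.unit_tests[name] = test_source

    def run_unit_tests(self) -> dict[str, bool]:
        """Run each test against the draft in a fresh subprocess; True means it passed."""
        results: dict[str, bool] = {}
        for name, test_source in self.unit_tests.items():
            with tempfile.TemporaryDirectory() as tmp:
                script = Path(tmp) / "run_test.py"
                script.write_text(self.solution + "\n\n" + test_source + "\n")
                proc = subprocess.run(
                    [sys.executable, str(script)], capture_output=True, timeout=30
                )
            results[name] = proc.returncode == 0
        return results

    def submit(self) -> str:
        """Close the environment and return the final solution."""
        self._check_open()
        self.submitted = True
        return self.solution
```

An agent would typically alternate between `edit_solution` and `run_unit_tests` until its own tests pass, then call `submit`. Note that this toy version runs code in a local subprocess with no isolation; the Docker and Modal backends mentioned below exist to avoid exactly that.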
Get Started
Add GROQ_API_KEY and any other API keys supported by the apropos-ai library to the .env file.
- Note: Groq, Google, and possibly other providers offer free tiers.
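Before running anything, it can help to confirm that the .env file actually contains the keys you expect. The following is a small sketch using only the standard library; `read_env_file` and `missing_keys` are hypothetical helpers, not part of SmallBench or apropos-ai, and the parser only handles simple `KEY=VALUE` lines.

```python
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments.

    Assumption: no `export` prefixes, multi-line values, or inline comments.
    """
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(required: list[str], path: str = ".env") -> list[str]:
    """Return the required keys that are absent or empty in the .env file."""
    values = read_env_file(path)
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(["GROQ_API_KEY"])` returns an empty list when the key is present and non-empty, and `["GROQ_API_KEY"]` otherwise.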
If you use a Docker backend, ensure you have the Docker app running. If you use Modal, please add all necessary credentials.
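If you are unsure whether Docker is actually up, a quick pre-flight check like the sketch below can save a confusing failure later. `docker_is_running` is a hypothetical helper, not something SmallBench provides; it assumes the standard `docker` CLI is installed and relies on `docker info` exiting with a non-zero status when the daemon is unreachable, which holds for recent Docker releases but may not for very old ones.

```python
import shutil
import subprocess


def docker_is_running(timeout_s: float = 10.0) -> bool:
    """Return True if the `docker` CLI is on PATH and `docker info` succeeds."""
    if shutil.which("docker") is None:
        return False  # CLI not installed or not on PATH
    try:
        proc = subprocess.run(
            ["docker", "info"], capture_output=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # daemon hung or the binary could not be executed
    return proc.returncode == 0
```

Calling `docker_is_running()` before launching the benchmark lets you fail fast with a clear message instead of partway through a run.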
Then, run the test script:
uv run python -m src.smallbench.benchmarks.bcb_a.test
Medium Benchmarks
TBD
Hard Benchmarks
TBD
Caveats
- This repository is still under very active development.
- In particular, details of the agent-computer interface contexts are subject to change, and the response models are still somewhat unstable. If you run into problems, please open an issue on the GitHub repository.
- For this reason, scores will likely be artificially low until further notice. Don't take them too seriously.
Scores - Extremely Preliminary
BigCodeBench - Agent Harness
LM | Score (out of 1) |
---|---|
4o | ??? |
4o-mini | ??? |