featurebench

FeatureBench Pipeline - A test-driven data generation pipeline for building and evaluating feature-level coding benchmarks

FeatureBench is a test-driven data generation and evaluation pipeline for feature-level coding benchmarks. It provides a unified CLI to run inference, evaluation, and dataset generation.

📰 News

📊 2026.05.18: We added lite split evaluation results to the leaderboard for frontier models, including GPT-5.5, Claude Opus 4.7, DeepSeek-V4, GLM-5.1, Kimi-2.6, and Mimo-V2.5-Pro.

🚀 2026.03.27: We released the fast split, which contains 100 instances (a subset of the full split). These instances require no GPU and are optimized for rapid evaluation. On an Intel Xeon Platinum 8457C with 944 GB of RAM, evaluating the gold patches takes 57.2 seconds per instance on average.

🎁 2026.02.06: We now support one-click inference for mainstream agent frameworks, including OpenHands, Claude Code, Codex, Gemini CLI, and mini-swe-agent. All supported agent frameworks can be found here. We have also open-sourced the FeatureBench data pipeline.

🏆 Leaderboard

Full interactive leaderboard with tabs, filters, and sorting.

Lite split results, ranked by %PASSED

Rank	Model	Scaffold	%PASSED	%RESOLVED
1	Claude Opus 4.7	OpenHands	78.2	46.7
2	GPT-5.5	OpenHands	69.8	26.7
3	Claude Opus 4.6	OpenHands	69.5	20.0
4	Claude Opus 4.5	OpenHands	67.2	20.0
5	GPT-5.4	OpenHands	66.2	23.3
6	GPT-5.1-Codex	Codex	60.2	20.0
7	DeepSeek-V4-Pro	OpenHands	59.6	26.7
8	Claude Opus 4.5	Claude Code	59.1	20.0
9	Kimi-2.6	OpenHands	49.4	20.0
10	Mimo-V2.5-Pro	OpenHands	47.8	13.3
11	Gemini-3-Pro-Preview	OpenHands	45.1	10.0
12	GLM-5.1	OpenHands	44.2	13.3
13	Gemini-3-Pro-Preview	Gemini-CLI	43.4	10.0
14	DeepSeek-V4-Flash	OpenHands	41.9	16.7
15	MiniMax M2.1	Mini-SWE-Agent	41.9	10.0
16	GLM 4.7	Mini-SWE-Agent	41.2	6.7
17	Qwen3-Coder-480B-A35B-Instruct	OpenHands	38.3	6.7
18	DeepSeek V3.2	OpenHands	35.9	6.7
19	Qwen3.5-27B	OpenHands	34.8	10.0
20	Qwen3-Coder-30B-A3B-Instruct	OpenHands	23.0	3.3
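
The table is ordered by %PASSED, so the %RESOLVED ordering is easy to overlook. The sketch below re-sorts the same rows by either column. `ROWS` and `rank_by` are illustrative names introduced here; they are not part of FeatureBench.

```python
# Rows copied from the lite split table: (model, scaffold, %PASSED, %RESOLVED).
ROWS = [
    ("Claude Opus 4.7", "OpenHands", 78.2, 46.7),
    ("GPT-5.5", "OpenHands", 69.8, 26.7),
    ("Claude Opus 4.6", "OpenHands", 69.5, 20.0),
    ("Claude Opus 4.5", "OpenHands", 67.2, 20.0),
    ("GPT-5.4", "OpenHands", 66.2, 23.3),
    ("GPT-5.1-Codex", "Codex", 60.2, 20.0),
    ("DeepSeek-V4-Pro", "OpenHands", 59.6, 26.7),
    ("Claude Opus 4.5", "Claude Code", 59.1, 20.0),
    ("Kimi-2.6", "OpenHands", 49.4, 20.0),
    ("Mimo-V2.5-Pro", "OpenHands", 47.8, 13.3),
    ("Gemini-3-Pro-Preview", "OpenHands", 45.1, 10.0),
    ("GLM-5.1", "OpenHands", 44.2, 13.3),
    ("Gemini-3-Pro-Preview", "Gemini-CLI", 43.4, 10.0),
    ("DeepSeek-V4-Flash", "OpenHands", 41.9, 16.7),
    ("MiniMax M2.1", "Mini-SWE-Agent", 41.9, 10.0),
    ("GLM 4.7", "Mini-SWE-Agent", 41.2, 6.7),
    ("Qwen3-Coder-480B-A35B-Instruct", "OpenHands", 38.3, 6.7),
    ("DeepSeek V3.2", "OpenHands", 35.9, 6.7),
    ("Qwen3.5-27B", "OpenHands", 34.8, 10.0),
    ("Qwen3-Coder-30B-A3B-Instruct", "OpenHands", 23.0, 3.3),
]

PASSED, RESOLVED = 2, 3  # column indexes into each row


def rank_by(rows, column):
    """Sort rows by the given column, highest first.

    Python's sort is stable, so rows that tie keep their table order.
    """
    return sorted(rows, key=lambda row: row[column], reverse=True)


for model, scaffold, passed, resolved in rank_by(ROWS, RESOLVED)[:3]:
    print(f"{model} ({scaffold}): {resolved}% resolved, {passed}% passed")
```

Ranked by %RESOLVED, the top three entries are Claude Opus 4.7, GPT-5.5, and DeepSeek-V4-Pro (the last two tie at 26.7), which differs from the %PASSED ordering.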

🚀 Quickstart

Prerequisites:

uv for Python environment management
docker for reproducible builds and evaluation

# pypi
pip install featurebench
# or uv add featurebench

# local
git clone https://github.com/LiberCoders/FeatureBench.git
cd FeatureBench
uv sync
source .venv/bin/activate

Configure:

cp config_example.toml config.toml

See docs/config.md for a comprehensive reference (harness, infer, data pipeline) with examples.

Optional: pre-pull images to reduce network variance:

fb pull --mode lite                 # lite split image list (13 images)
fb pull --mode fast                 # fast split image list (18 images)
fb pull --mode full                 # full split image list (24 images)
fb pull --mode /path/to/images.txt  # one image name per line

# full list: featurebench/resources/constants/full_images.txt
# lite list: featurebench/resources/constants/lite_images.txt
# fast list: featurebench/resources/constants/fast_images.txt
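
For a custom list, the file format is simply one image name per line. As a rough illustration of what pre-pulling such a list involves, the sketch below reads the file and calls `docker pull` for each entry. It is not FeatureBench's implementation: `read_image_list` and `pull_images` are hypothetical helpers, and skipping blank lines is an assumption about the file format.

```python
import subprocess
from pathlib import Path


def read_image_list(path):
    """Return image names from a text file with one image name per line."""
    names = []
    for line in Path(path).read_text().splitlines():
        name = line.strip()
        if name:  # assumption: blank lines carry no meaning
            names.append(name)
    return names


def pull_images(names, runner=subprocess.run):
    """Run `docker pull <name>` for each image and return the names that failed.

    `runner` defaults to subprocess.run; it is injectable so the logic can be
    exercised without a Docker daemon.
    """
    failed = []
    for name in names:
        result = runner(["docker", "pull", name])
        if result.returncode != 0:
            failed.append(name)
    return failed


# Example (hypothetical usage):
#     failed = pull_images(read_image_list("/path/to/images.txt"))
```

In practice `fb pull --mode /path/to/images.txt` is the supported route; the sketch only makes the expected file format concrete.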

Run inference:

fb infer \
    --config-path config.toml \
    --agent mini_swe_agent \
    --model openai/qwen3-coder-480b-a35b-instruct \
    --split fast

Run evaluation:

fb eval \
    -p runs/<timestamp>/output.jsonl \
    --split fast
    # use -p gold to verify the gold patches
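
The evaluation command needs the path of a specific run's `output.jsonl`. A small helper like the one sketched below can locate the most recent run and give a quick summary of the file before it is passed to `fb eval`. It assumes only the `runs/<timestamp>/output.jsonl` layout shown above; the record schema is not documented here, so the sketch reports whatever top-level keys it finds. The function names are hypothetical.

```python
import json
from pathlib import Path


def latest_output(runs_dir="runs"):
    """Return the most recently modified <runs_dir>/*/output.jsonl, or None."""
    candidates = list(Path(runs_dir).glob("*/output.jsonl"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def summarize_jsonl(path):
    """Return (record_count, sorted_top_level_keys) for a JSON Lines file."""
    count, keys = 0, set()
    with Path(path).open() as fh:
        for line in fh:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            count += 1
            if isinstance(record, dict):
                keys.update(record)
    return count, sorted(keys)


# Example (hypothetical usage):
#     path = latest_output()
#     if path is not None:
#         print(path, summarize_jsonl(path))
```

The returned path can then be supplied as the `-p` argument of `fb eval`.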

🧭 CLI Overview

fb provides three core commands:

fb infer runs featurebench.infer.run_infer (docs: docs/infer_cli_arg.md)
fb eval runs featurebench.harness.run_evaluation (docs: docs/harness_cli_arg.md)
fb data runs featurebench.pipeline (docs: docs/pipeline.md)

✍️ Citation

If you find FeatureBench useful, please cite it as:

@article{zhou2026featurebench,
  title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
  author={Zhou, Qixing and Zhang, Jiacheng and Wang, Haiyang and Hao, Rui and Wang, Jiahe and Han, Minghao and Yang, Yuxue and Wu, Shuzhe and Pan, Feiyang and Fan, Lue and others},
  journal={arXiv preprint arXiv:2602.10975},
  year={2026}
}

📧 Contact

If you have any questions, feel free to contact qixingzhou1125@gmail.com or zjcheng2022@gmail.com.

Project details

Release history

0.2.1 (this version): Jun 6, 2026
0.2.0: Jun 1, 2026
0.1.0: Feb 12, 2026

Download files

Source Distribution

featurebench-0.2.1.tar.gz (1.3 MB)

Uploaded Jun 6, 2026 Source

Built Distribution

featurebench-0.2.1-py3-none-any.whl (366.8 kB)

Uploaded Jun 6, 2026 Python 3

File details

Details for the file featurebench-0.2.1.tar.gz.

File metadata

Download URL: featurebench-0.2.1.tar.gz
Upload date: Jun 6, 2026
Size: 1.3 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for featurebench-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`74c0aaf82b40ecef4f60e286a9019265467c9951b409349dcabd9f6bfb522a2e`
MD5	`786167667d54d2aa80aeb6179894e32a`
BLAKE2b-256	`532a8833e3e5e8fc44336acffa0d6c025ff93e840a53d4f6df99b30b3f7da352`
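
A downloaded archive can be checked against the published digest before installation. The sketch below computes a file's SHA256 with Python's standard `hashlib` and compares it with the value from the table above. The function names and the local file path in the usage comment are assumptions for illustration.

```python
import hashlib

# SHA256 digest published above for featurebench-0.2.1.tar.gz.
EXPECTED_SHA256 = "74c0aaf82b40ecef4f60e286a9019265467c9951b409349dcabd9f6bfb522a2e"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True if the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Example (hypothetical local path):
#     print(matches("featurebench-0.2.1.tar.gz", EXPECTED_SHA256))
```

The same helper works for the wheel by substituting its file name and the SHA256 digest listed for it.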

File details

Details for the file featurebench-0.2.1-py3-none-any.whl.

File metadata

Download URL: featurebench-0.2.1-py3-none-any.whl
Upload date: Jun 6, 2026
Size: 366.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for featurebench-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`75d492e5c7a4671baa636e2576fbfe337b86459995b86932ab32716f056da442`
MD5	`89c40f70f5dfb346df3dab267a62c014`
BLAKE2b-256	`017b87c3c7eb512bc5334fbd16e31a1f7030f17f58d55711bfd0743ae2853786`
