FeatureBench Pipeline - A test-driven data generation pipeline for building and evaluating feature-level coding benchmarks

FeatureBench is a test-driven data generation and evaluation pipeline for feature-level coding benchmarks. It provides a unified CLI to run inference, evaluation, and dataset generation.

📰 News

📊 2026.05.18: We added lite-split evaluation results to the leaderboard for frontier models, including GPT-5.5, Claude Opus 4.7, DeepSeek-V4, GLM-5.1, Kimi-2.6, Mimo-V2.5-Pro, and more.

🚀 2026.03.27: We released the fast split, which contains 100 instances (a subset of the full split). These instances require no GPU and are optimized for rapid evaluation. On an Intel Xeon Platinum 8457C with 944 GB RAM, the average evaluation time per instance using gold patches is 57.2 seconds.

🎁 2026.02.06: We now support one-click inference for mainstream agent frameworks, including OpenHands, Claude Code, Codex, Gemini CLI, and mini-swe-agent. All supported agent frameworks can be found here. We have also open-sourced the FeatureBench data pipeline.
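
The fast-split timing quoted in the news above can be turned into a rough wall-clock estimate. The sketch below is back-of-the-envelope arithmetic only: the two constants come from the news item, while the worker counts and the assumption of ideal parallelism with equal-length instances are hypothetical and not something this README states about fb.

```python
import math

# Figures quoted in the 2026.03.27 news item (gold patches, fast split).
FAST_SPLIT_INSTANCES = 100
AVG_SECONDS_PER_INSTANCE = 57.2


def estimated_minutes(workers: int) -> float:
    """Estimate wall-clock minutes, assuming every instance takes the
    average time and instances are spread evenly over `workers`."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # With N workers, the slowest worker handles ceil(instances / N) instances.
    batches = math.ceil(FAST_SPLIT_INSTANCES / workers)
    return batches * AVG_SECONDS_PER_INSTANCE / 60


# Hypothetical worker counts, for illustration only.
for workers in (1, 4, 8):
    print(f"{workers} worker(s): ~{estimated_minutes(workers):.1f} min")
```

Real runs will deviate from this estimate, since per-instance times vary and the quoted average was measured on the specific hardware listed above.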

🏆 Leaderboard

Full interactive leaderboard with tabs, filters, and sorting.

Lite split results, ranked by %PASSED

Rank | Model | Scaffold | %PASSED | %RESOLVED
1 | Claude Opus 4.7 | OpenHands | 78.2 | 46.7
2 | GPT-5.5 | OpenHands | 69.8 | 26.7
3 | Claude Opus 4.6 | OpenHands | 69.5 | 20.0
4 | Claude Opus 4.5 | OpenHands | 67.2 | 20.0
5 | GPT-5.4 | OpenHands | 66.2 | 23.3
6 | GPT-5.1-Codex | Codex | 60.2 | 20.0
7 | DeepSeek-V4-Pro | OpenHands | 59.6 | 26.7
8 | Claude Opus 4.5 | Claude Code | 59.1 | 20.0
9 | Kimi-2.6 | OpenHands | 49.4 | 20.0
10 | Mimo-V2.5-Pro | OpenHands | 47.8 | 13.3
11 | Gemini-3-Pro-Preview | OpenHands | 45.1 | 10.0
12 | GLM-5.1 | OpenHands | 44.2 | 13.3
13 | Gemini-3-Pro-Preview | Gemini-CLI | 43.4 | 10.0
14 | DeepSeek-V4-Flash | OpenHands | 41.9 | 16.7
15 | MiniMax M2.1 | Mini-SWE-Agent | 41.9 | 10.0
16 | GLM 4.7 | Mini-SWE-Agent | 41.2 | 6.7
17 | Qwen3-Coder-480B-A35B-Instruct | OpenHands | 38.3 | 6.7
18 | DeepSeek V3.2 | OpenHands | 35.9 | 6.7
19 | Qwen3.5-27B | OpenHands | 34.8 | 10.0
20 | Qwen3-Coder-30B-A3B-Instruct | OpenHands | 23.0 | 3.3
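
The table is ordered by %PASSED, and the %RESOLVED column does not always follow the same order. As a small illustration, the sketch below re-ranks five rows copied from the table by %RESOLVED. The tie-break on %PASSED is an assumption made for this example, not a rule taken from the leaderboard.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    scaffold: str
    passed: float    # %PASSED column
    resolved: float  # %RESOLVED column


# Five rows copied from the lite split table above.
ROWS = [
    Row("Claude Opus 4.7", "OpenHands", 78.2, 46.7),
    Row("GPT-5.5", "OpenHands", 69.8, 26.7),
    Row("Claude Opus 4.6", "OpenHands", 69.5, 20.0),
    Row("GPT-5.4", "OpenHands", 66.2, 23.3),
    Row("DeepSeek-V4-Pro", "OpenHands", 59.6, 26.7),
]

# Sort by %RESOLVED descending; break ties by %PASSED descending
# (the tie-break is an assumption of this sketch).
by_resolved = sorted(ROWS, key=lambda r: (-r.resolved, -r.passed))

for rank, row in enumerate(by_resolved, start=1):
    print(f"{rank}. {row.model} ({row.scaffold}): "
          f"{row.resolved}% resolved, {row.passed}% passed")
```

In this subset, DeepSeek-V4-Pro moves ahead of GPT-5.4 and Claude Opus 4.6 once the rows are ordered by %RESOLVED, even though it has the lowest %PASSED of the five.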

🚀 Quickstart

Prerequisites:

  • uv for Python environment management
  • docker for reproducible builds and evaluation

Install:

# pypi
pip install featurebench
# or uv add featurebench

# local
git clone https://github.com/LiberCoders/FeatureBench.git
cd FeatureBench
uv sync
source .venv/bin/activate

Configure:

cp config_example.toml config.toml

See docs/config.md for a comprehensive reference (harness, infer, data pipeline) with examples.

Optional: pre-pull images to reduce network variance:

fb pull --mode lite                 # lite split image list (13 images)
fb pull --mode fast                 # fast split image list (18 images)
fb pull --mode full                 # full split image list (24 images)
fb pull --mode /path/to/images.txt  # one image name per line

# full list: featurebench/resources/constants/full_images.txt
# lite list: featurebench/resources/constants/lite_images.txt
# fast list: featurebench/resources/constants/fast_images.txt
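
Before passing a custom list to fb pull, it can help to check what the file actually contains. The sketch below parses a one-image-per-line list and prints the equivalent docker pull commands for inspection. It does not reimplement fb pull: the image names are hypothetical, and skipping blank lines and "#" comments is an assumption of this sketch, since the README does not say whether fb pull tolerates them.

```python
import shlex


def read_image_list(text: str) -> list:
    """Return image names from a one-name-per-line list.

    Blank lines and lines starting with '#' are skipped; that leniency is
    an assumption of this sketch, not documented fb behavior.
    """
    images = []
    for line in text.splitlines():
        name = line.strip()
        if name and not name.startswith("#"):
            images.append(name)
    return images


def docker_pull_commands(images: list) -> list:
    """Build one `docker pull <image>` argv list per image name."""
    return [["docker", "pull", image] for image in images]


# Hypothetical image names, for illustration only.
SAMPLE = """\
# custom image list
example/featurebench-a:latest

example/featurebench-b:latest
"""

images = read_image_list(SAMPLE)
print(f"{len(images)} image(s) listed")
for command in docker_pull_commands(images):
    print(shlex.join(command))
```

To check a real list, read the file contents (for example with pathlib.Path(...).read_text()) and pass them to read_image_list in place of SAMPLE.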

Run inference:

fb infer \
    --config-path config.toml \
    --agent mini_swe_agent \
    --model openai/qwen3-coder-480b-a35b-instruct \
    --split fast

Run evaluation:

fb eval \
    -p runs/<timestamp>/output.jsonl \
    --split fast
    # use -p gold to verify the gold patches
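
A quick sanity check on the predictions file before evaluation can catch truncated or malformed output. The sketch below assumes only that output.jsonl is in JSON Lines format (one JSON value per line), as the extension suggests. It counts records and top-level keys without relying on any particular field names; the sample records and their keys are hypothetical.

```python
import json
from collections import Counter


def summarize_jsonl(lines):
    """Count records and how often each top-level key appears.

    `lines` is any iterable of strings, such as an open file object.
    """
    records = 0
    key_counts = Counter()
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
        records += 1
        if isinstance(record, dict):
            key_counts.update(record.keys())
    return records, key_counts


# Hypothetical records; the real field names in output.jsonl may differ.
SAMPLE = [
    '{"id": "example-1", "patch": "..."}',
    '{"id": "example-2", "patch": "..."}',
]

records, key_counts = summarize_jsonl(SAMPLE)
print(f"{records} record(s); keys: {dict(key_counts)}")
```

For a real run, open the file produced by fb infer and pass the file object to summarize_jsonl instead of SAMPLE.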

🧭 CLI Overview

fb provides three core commands:

✍️ Citation

If you find FeatureBench useful, please cite it as:

@article{zhou2026featurebench,
  title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
  author={Zhou, Qixing and Zhang, Jiacheng and Wang, Haiyang and Hao, Rui and Wang, Jiahe and Han, Minghao and Yang, Yuxue and Wu, Shuzhe and Pan, Feiyang and Fan, Lue and others},
  journal={arXiv preprint arXiv:2602.10975},
  year={2026}
}

📧 Contact

If you have any questions, feel free to contact qixingzhou1125@gmail.com or zjcheng2022@gmail.com.
