FeatureBench Pipeline - A test-driven data generation pipeline for building and evaluating feature-level coding benchmarks


FeatureBench is a test-driven data generation and evaluation pipeline for feature-level coding benchmarks. It provides a unified CLI to run inference, evaluation, and dataset generation.

📰 News

🎁 2026.02.06: FeatureBench now supports one-click inference for mainstream agent frameworks, including OpenHands, Claude Code, Codex, Gemini CLI, and mini-swe-agent. See the project documentation for the full list of supported agent frameworks. The FeatureBench data pipeline is now open source as well.

🚀 Quickstart

Prerequisites:

  • uv for Python environment management
  • docker for reproducible builds and evaluation

Install:

# pypi
pip install featurebench
# or uv add featurebench

# local
git clone https://github.com/LiberCoders/FeatureBench.git
cd FeatureBench
uv sync

Configure:

cp config_example.toml config.toml

See docs/config.md for a comprehensive reference (harness, infer, data pipeline) with examples.

Optional: pre-pull images to reduce network variance:

fb pull --mode lite                 # lite split image list (13 images)
fb pull --mode full                 # full split image list (24 images)
fb pull --mode /path/to/images.txt  # one image name per line

# full list: featurebench/resources/constants/full_images.txt
# lite list: featurebench/resources/constants/lite_images.txt
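
The path form of fb pull takes a plain text file with one image name per line. As a minimal sketch, a custom list could be derived from one of the shipped lists by filtering it. Here make_image_list is a hypothetical helper (not part of fb), and PATTERN stands for whatever substring or basic regular expression you want to keep.

```shell
# Hypothetical helper, not part of the fb CLI.
# Usage: make_image_list SRC PATTERN DEST
make_image_list() {
    local src pattern dest
    src=$1
    pattern=$2
    dest=$3
    # Drop blank lines and '#' comment lines, then keep only the
    # image names that match PATTERN. One image name per line.
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$src" \
        | grep -e "$pattern" > "$dest"
}

# Example (run from a local checkout; PATTERN is a placeholder):
#   make_image_list featurebench/resources/constants/lite_images.txt 'PATTERN' my_images.txt
#   fb pull --mode my_images.txt
```

The function returns a non-zero status when nothing matches, which makes an empty list easy to detect before calling fb pull.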

Run inference:

fb infer \
    --config-path config.toml \
    --agent mini_swe_agent \
    --model openai/qwen3-coder-480b-a35b-instruct \
    --split lite

Run evaluation:

fb eval \
    -p runs/<timestamp>/output.jsonl \
    --split lite
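
Because inference results land in runs/<timestamp>/output.jsonl, it can be handy to locate the most recent run automatically. The sketch below assumes the timestamp directory names sort chronologically when sorted as text, which this README does not state, so check it against your own runs directory. latest_run_output is a hypothetical helper, not part of fb.

```shell
# Hypothetical helper, not part of the fb CLI.
# Usage: latest_run_output [RUNS_DIR]   (RUNS_DIR defaults to "runs")
latest_run_output() {
    local runs_dir f latest
    runs_dir=${1:-runs}
    latest=
    # Globs expand in sorted order, so the last existing match is the
    # newest run, ASSUMING timestamp names sort chronologically.
    for f in "$runs_dir"/*/output.jsonl; do
        if [ -f "$f" ]; then
            latest=$f
        fi
    done
    if [ -n "$latest" ]; then
        printf '%s\n' "$latest"
    else
        return 1
    fi
}

# Example:
#   fb eval -p "$(latest_run_output)" --split lite
```

Run directories without an output.jsonl are skipped, and the function returns a non-zero status when no output file exists at all.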

🧭 CLI Overview

fb provides three core commands, one each for inference (fb infer), evaluation (fb eval), and dataset generation.

✍️ Citation

If you find FeatureBench useful, please cite it as:

@misc{zhou2026featurebenchbenchmarkingagenticcoding,
      title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development}, 
      author={Qixing Zhou and Jiacheng Zhang and Haiyang Wang and Rui Hao and Jiahe Wang and Minghao Han and Yuxue Yang and Shuzhe Wu and Feiyang Pan and Lue Fan and Dandan Tu and Zhaoxiang Zhang},
      year={2026},
      eprint={2602.10975},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2602.10975}, 
}

📧 Contact

If you have any questions, feel free to contact qixingzhou1125@gmail.com or zjcheng2022@gmail.com.
