Skip to main content

"Evaluation package for BigCodeBench"

Project description

BigCodeBench

BigCodeBench

📰 News🔥 Quick Start🚀 Remote Evaluation💻 LLM-generated Code📜 Citation

📰 News

  • [2024-10-06] We are releasing bigcodebench==v0.2.0!
  • [2024-10-05] We create a public code execution API on the Hugging Face space.
  • [2024-10-01] We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the leaderboard!
  • [2024-08-19] To make the evaluation fully reproducible, we add a real-time code execution session to the leaderboard. It can be viewed here.
  • [2024-08-02] We release bigcodebench==v0.1.9.
More News :: click to expand ::
  • [2024-07-18] We announce a subset of BigCodeBench, BigCodeBench-Hard, which includes 148 tasks that are more aligned with the real-world programming tasks. The details are available in this blog post. The dataset is available here. The new release is bigcodebench==v0.1.8.
  • [2024-06-28] We release bigcodebench==v0.1.7.
  • [2024-06-27] We release bigcodebench==v0.1.6.
  • [2024-06-19] We start the Hugging Face BigCodeBench Leaderboard! The leaderboard is available here.
  • [2024-06-18] We release BigCodeBench, a new benchmark for code generation with 1140 software-engineering-oriented programming tasks. Preprint is available here. PyPI package is available here with the version 0.1.5.

🌸 About

BigCodeBench

BigCodeBench is an easy-to-use benchmark for solving practical and challenging tasks via code. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.

Why BigCodeBench?

BigCodeBench focuses on task automation via code generation with diverse function calls and complex instructions, with:

  • Precise evaluation & ranking: See our leaderboard for latest LLM rankings before & after rigorous evaluation.
  • Pre-generated samples: BigCodeBench accelerates code intelligence research by open-sourcing LLM-generated samples for various models -- no need to re-run the expensive benchmarks!

🔥 Quick Start

To get started, please first set up the environment:

# By default, you will use the remote evaluation API to execute the output samples.
pip install bigcodebench[generate] --upgrade

# You are suggested to use `flash-attn` for generating code samples.
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problem, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases
⏬ Install nightly version :: click to expand ::
# Install to use bigcodebench.generate
pip install "git+https://github.com/bigcode-project/bigcodebench.git#egg=bigcodebench[generate]" --upgrade

🚀 Remote Evaluation

We use the greedy decoding as an example to show how to evaluate the generated code samples via remote API.

[!Note]

Remotely executing on BigCodeBench-Full typically takes 6-7 minutes, and on BigCodeBench-Hard typically takes 4-5 minutes.

bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --split [complete|instruct] \
  --subset [full|hard] \
  --backend [vllm|openai|anthropic|google|mistral|hf]
  • All the resulted files will be stored in a folder named bcb_results.
  • The generated code samples will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl.
  • The evaluation results will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json.
  • The pass@k results will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json.

[!Note]

BigCodeBench uses different prompts for base and chat models. By default it is detected by tokenizer.chat_template when using hf/vllm as backend. For other backends, only chat mode is allowed.

Therefore, if your base models come with a tokenizer.chat_template, please add --direct_completion to avoid being evaluated in a chat mode.

Access OpenAI APIs from OpenAI Console

export OPENAI_API_KEY=<your_openai_api_key>

Access Anthropic APIs from Anthropic Console

export ANTHROPIC_API_KEY=<your_anthropic_api_key>

Access Mistral APIs from Mistral Console

export MISTRAL_API_KEY=<your_mistral_api_key>

Access Gemini APIs from Google AI Studio

export GOOGLE_API_KEY=<your_google_api_key>

💻 LLM-generated Code

We share pre-generated code samples from LLMs we have evaluated:

  • See the attachment of our v0.2.0. We include sanitized_samples_calibrated.zip for your convenience.

Advanced Usage

Please refer to the ADVANCED USAGE for more details.

📜 Citation

@article{zhuo2024bigcodebench,
  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
  author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
  journal={arXiv preprint arXiv:2406.15877},
  year={2024}
}

🙏 Acknowledgement

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bigcodebench-0.2.0.tar.gz (69.6 kB view details)

Uploaded Source

Built Distribution

bigcodebench-0.2.0-py3-none-any.whl (44.3 kB view details)

Uploaded Python 3

File details

Details for the file bigcodebench-0.2.0.tar.gz.

File metadata

  • Download URL: bigcodebench-0.2.0.tar.gz
  • Upload date:
  • Size: 69.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.0

File hashes

Hashes for bigcodebench-0.2.0.tar.gz
Algorithm Hash digest
SHA256 a475543afde1b33765439d982fc20943ed8f922f9b6dfe29dbff7e60a9585051
MD5 b79d6f7f3774d2564a419a059a847130
BLAKE2b-256 276f2e4bbbb04bd67aefae82b073c6fe827736ea9d9298404b5a8d3ead129317

See more details on using hashes here.

File details

Details for the file bigcodebench-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for bigcodebench-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 319a9a16e44ef51ddf0cfd7052b8dbfdeb0925f59624f87ad7e00e4385a0d005
MD5 514b901a33a45b8ec4bff8540818103c
BLAKE2b-256 ca9ef7ceb979dea9f1d412349ecf832d1d71e934c1b66b2e2759cc0dfa830de1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page