"Evaluation package for BigCodeBench"

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

BigCodeBench

💥 Impact • 📰 News • 🔥 Quick Start • 🚀 Remote Evaluation • 💻 LLM-generated Code • 🧑 Advanced Usage • 📰 Result Submission • 📜 Citation

🎉 Check out our latest work!
🌟 SWE Arena 🌟
🚀 Open Evaluation Platform on AI for Software Engineering 🚀
✨ 100% free to use the latest frontier models! ✨

💥 Impact

BigCodeBench has been trusted by many LLM teams including:

Zhipu AI
Alibaba Qwen
DeepSeek
Amazon AWS AI
Snowflake AI Research
ServiceNow Research
Meta AI
Cohere AI
Sakana AI
Allen Institute for Artificial Intelligence (AI2)

📰 News

[2025-01-22] We are releasing bigcodebench==v0.2.2.dev2, with 163 models evaluated!
[2024-10-06] We are releasing bigcodebench==v0.2.0!
[2024-10-05] We created a public code execution API on the Hugging Face Space.
[2024-10-01] We have evaluated 139 models on BigCodeBench-Hard so far. Take a look at the leaderboard!
[2024-08-19] To make the evaluation fully reproducible, we added a real-time code execution session to the leaderboard. It can be viewed here.
[2024-08-02] We release bigcodebench==v0.1.9.

More News

[2024-07-18] We announce BigCodeBench-Hard, a subset of BigCodeBench with 148 tasks that are more closely aligned with real-world programming tasks. The details are available in this blog post. The dataset is available here. The new release is bigcodebench==v0.1.8.
[2024-06-28] We release bigcodebench==v0.1.7.
[2024-06-27] We release bigcodebench==v0.1.6.
[2024-06-19] We start the Hugging Face BigCodeBench Leaderboard! The leaderboard is available here.
[2024-06-18] We release BigCodeBench, a new benchmark for code generation with 1140 software-engineering-oriented programming tasks. Preprint is available here. PyPI package is available here with the version 0.1.5.

🌸 About

BigCodeBench

BigCodeBench is an easy-to-use benchmark for solving practical and challenging tasks via code. It aims to evaluate the true programming capabilities of large language models (LLMs) in a more realistic setting. The benchmark is designed for HumanEval-like function-level code generation tasks, but with much more complex instructions and diverse function calls.

There are two splits in BigCodeBench:

Complete: This split is designed for code completion based on comprehensive docstrings.
Instruct: This split works only for instruction-tuned and chat models, which are asked to generate a code snippet from natural language instructions. The instructions contain only the necessary information and require more complex reasoning.

Why BigCodeBench?

BigCodeBench focuses on task automation via code generation with diverse function calls and complex instructions, with:

✨ Precise evaluation & ranking: See our leaderboard for latest LLM rankings before & after rigorous evaluation.
✨ Pre-generated samples: BigCodeBench accelerates code intelligence research by open-sourcing LLM-generated samples for various models -- no need to re-run the expensive benchmarks!

🔥 Quick Start

To get started, please first set up the environment:

# By default, you will use the remote evaluation API to execute the output samples.
pip install bigcodebench --upgrade

# We suggest using `flash-attn` when generating code samples.
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you run into installation problems, consider using the pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

⏬ Install nightly version

# Install to use bigcodebench.generate
pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade

🚀 Remote Evaluation

We use greedy decoding as an example to show how to evaluate generated code samples via the remote API.

[!Warning]

To simplify generation, we use batch inference by default. However, batch inference results can vary across batch sizes and across versions, at least for the vLLM backend. If you want more deterministic results for greedy decoding, please set --bs to 1.

[!Note]

The gradio backend typically takes 6-7 minutes on BigCodeBench-Full and 4-5 minutes on BigCodeBench-Hard. The e2b backend with the default machine typically takes 25-30 minutes on BigCodeBench-Full and 15-20 minutes on BigCodeBench-Hard.

bigcodebench.evaluate \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --execution [e2b|gradio|local] \
  --split [complete|instruct] \
  --subset [full|hard] \
  --backend [vllm|openai|anthropic|google|mistral|hf|hf-inference]
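The bracketed values in the command above are alternatives, and exactly one must be chosen per flag. As a minimal sketch, the helper below (a hypothetical `build_evaluate_command`, not part of the package) validates the choices listed above and assembles the argument list without running anything. The optional `bs` argument maps to the `--bs` flag mentioned in the warning.

```python
from typing import List, Optional

# Allowed values copied from the command shown above.
CHOICES = {
    "execution": {"e2b", "gradio", "local"},
    "split": {"complete", "instruct"},
    "subset": {"full", "hard"},
    "backend": {"vllm", "openai", "anthropic", "google", "mistral", "hf", "hf-inference"},
}


def build_evaluate_command(
    model: str,
    execution: str,
    split: str,
    subset: str,
    backend: str,
    bs: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: build the argv list for bigcodebench.evaluate."""
    selected = {
        "execution": execution,
        "split": split,
        "subset": subset,
        "backend": backend,
    }
    for flag, value in selected.items():
        if value not in CHOICES[flag]:
            allowed = ", ".join(sorted(CHOICES[flag]))
            raise ValueError(f"--{flag} must be one of: {allowed} (got {value!r})")

    argv = ["bigcodebench.evaluate", "--model", model]
    for flag, value in selected.items():
        argv += [f"--{flag}", value]
    if bs is not None:
        # --bs 1 is the setting suggested above for more deterministic greedy decoding.
        argv += ["--bs", str(bs)]
    return argv


if __name__ == "__main__":
    print(" ".join(build_evaluate_command(
        "meta-llama/Meta-Llama-3.1-8B-Instruct", "gradio", "instruct", "hard", "vllm", bs=1
    )))
```

The returned list can be passed to `subprocess.run` once the package is installed; the sketch itself only constructs and checks the arguments.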

All resulting files will be stored in a folder named bcb_results.
The generated code samples will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl.
The evaluation results will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json.
The pass@k results will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_pass_at_k.json.
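To locate these files programmatically, the naming pattern can be expressed as a small function. This is a sketch that mirrors the pattern exactly as written above. The helper name `result_file_names` is hypothetical, and replacing `/` in the model name with `--` is an assumption made here so the name is a valid file name; check the actual files in bcb_results before relying on it.

```python
from typing import Dict


def result_file_names(
    model_name: str, split: str, backend: str, temp: str, n_samples: int
) -> Dict[str, str]:
    """Hypothetical helper: expected output file names for one run."""
    # Assumption: slashes in the model id are replaced so the name is path-safe.
    safe_model = model_name.replace("/", "--")
    stem = (
        f"{safe_model}--bigcodebench-{split}--{backend}-{temp}-{n_samples}"
        "-sanitized_calibrated"
    )
    return {
        "samples": f"{stem}.jsonl",
        "eval_results": f"{stem}_eval_results.json",
        "pass_at_k": f"{stem}_pass_at_k.json",
    }


if __name__ == "__main__":
    names = result_file_names(
        "meta-llama/Meta-Llama-3.1-8B-Instruct", "instruct", "vllm", "0", 1
    )
    for kind, name in names.items():
        print(f"{kind}: bcb_results/{name}")
```

The temperature is passed as a string because its exact rendering in the file name (for example `0` versus `0.0`) is not specified above.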

[!Note]

The gradio backend is hosted on the Hugging Face Space by default. The default Space can sometimes be slow, so we recommend using the gradio backend with a cloned bigcodebench-evaluator endpoint for faster evaluation. Alternatively, you can use the e2b sandbox for evaluation, which is also fairly slow on the default machine.

[!Note]

BigCodeBench uses different prompts for base and chat models. By default, the model type is detected from tokenizer.chat_template when using hf or vllm as the backend. For other backends, only chat mode is allowed.

Therefore, if your base model comes with a tokenizer.chat_template, please add --direct_completion to avoid being evaluated in chat mode.
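The rule described in this note can be written down as a small decision function. This is only a sketch of the behaviour as stated above, not the package's actual code: the function name `uses_chat_mode` and its parameters are hypothetical.

```python
def uses_chat_mode(
    backend: str, has_chat_template: bool, direct_completion: bool = False
) -> bool:
    """Hypothetical sketch: decide whether a model is prompted in chat mode."""
    # For backends other than hf/vllm, only chat mode is allowed.
    if backend not in ("hf", "vllm"):
        return True
    # --direct_completion overrides detection for base models that ship a chat template.
    if direct_completion:
        return False
    # Otherwise the presence of tokenizer.chat_template decides.
    return has_chat_template


if __name__ == "__main__":
    # A base model whose tokenizer happens to include a chat template:
    print(uses_chat_mode("vllm", has_chat_template=True))
    print(uses_chat_mode("vllm", has_chat_template=True, direct_completion=True))
```

The two calls show why the flag matters: without `--direct_completion`, such a base model would be evaluated with chat prompts.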

To use E2B, you need to set up an account and get an API key from E2B.

export E2B_API_KEY=<your_e2b_api_key>

Access OpenAI APIs from OpenAI Console

export OPENAI_API_KEY=<your_openai_api_key>

Access Anthropic APIs from Anthropic Console

export ANTHROPIC_API_KEY=<your_anthropic_api_key>

Access Mistral APIs from Mistral Console

export MISTRAL_API_KEY=<your_mistral_api_key>

Access Gemini APIs from Google AI Studio

export GOOGLE_API_KEY=<your_google_api_key>

Access the Hugging Face Serverless Inference API

export HF_INFERENCE_API_KEY=<your_hf_api_key>

Please make sure your HF access token has the Make calls to inference providers permission.
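Before launching a long run, it can help to confirm that the relevant key is set. The sketch below maps each backend and the e2b execution option to the environment variable named above and reports what is missing. The helper `missing_keys` is hypothetical, and the assumption that the vllm and hf backends and the gradio and local execution options need none of these keys is mine, not stated in the text.

```python
import os
from typing import List, Mapping, Optional

# Environment variable names as given in the export commands above.
BACKEND_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
    "google": "GOOGLE_API_KEY",
    "hf-inference": "HF_INFERENCE_API_KEY",
}
EXECUTION_KEYS = {"e2b": "E2B_API_KEY"}


def missing_keys(
    backend: str, execution: str, env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Hypothetical helper: list required variables that are unset or empty."""
    env = os.environ if env is None else env
    needed = []
    if backend in BACKEND_KEYS:
        needed.append(BACKEND_KEYS[backend])
    if execution in EXECUTION_KEYS:
        needed.append(EXECUTION_KEYS[execution])
    return [name for name in needed if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys("openai", "e2b")
    if missing:
        print("Please export: " + ", ".join(missing))
    else:
        print("All required API keys are set.")
```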

💻 LLM-generated Code

We share pre-generated code samples from LLMs we have evaluated on the full set:

See the attachment of our v0.2.4. We include sanitized_samples_calibrated.zip for your convenience.

🧑 Advanced Usage

Please refer to the ADVANCED USAGE for more details.

📰 Result Submission

Please email both the generated code samples and the execution results to terry.zhuo@monash.edu if you would like to contribute your model to the leaderboard. Note that the file names should be in the format of [model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated.jsonl and [model_name]--[revision]--[bigcodebench|bigcodebench-hard]-[instruct|complete]--[backend]-[temp]-[n_samples]-sanitized_calibrated_eval_results.json. You can file an issue to remind us if we do not respond to your email within 3 days.
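To check file names before emailing them, the required format can be encoded as a regular expression. This is a sketch under stated assumptions: the function `parse_submission_name` is hypothetical, the backend is assumed to consist of lowercase letters and hyphens, the temperature of digits and dots, and the number of samples of digits. The source only gives the bracketed pattern, so treat a failed match as a prompt to double-check rather than a definitive rejection.

```python
import re
from typing import Dict, Optional

# Pattern derived from the bracketed format above; character classes are assumptions.
SUBMISSION_RE = re.compile(
    r"(?P<model_name>.+)--(?P<revision>.+?)--"
    r"(?P<benchmark>bigcodebench(?:-hard)?)-(?P<split>instruct|complete)--"
    r"(?P<backend>[a-z-]+)-(?P<temp>[0-9.]+)-(?P<n_samples>\d+)"
    r"-sanitized_calibrated(?P<suffix>\.jsonl|_eval_results\.json)"
)


def parse_submission_name(file_name: str) -> Optional[Dict[str, str]]:
    """Hypothetical helper: return the name's fields, or None if it does not match."""
    match = SUBMISSION_RE.fullmatch(file_name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    example = (
        "meta-llama--Meta-Llama-3.1-8B-Instruct--main--"
        "bigcodebench-hard-instruct--vllm-0-1-sanitized_calibrated.jsonl"
    )
    print(parse_submission_name(example))
```

The example file name, including the `main` revision, is illustrative only.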

📜 Citation

@article{zhuo2024bigcodebench,
  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
  author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
  journal={arXiv preprint arXiv:2406.15877},
  year={2024}
}

🙏 Acknowledgement

EvalPlus

Release history

- 0.2.5 (Mar 31, 2025), this version
- 0.2.4 (Feb 23, 2025)
- 0.2.3.post5 (Feb 14, 2025)
- 0.2.3.post4 (Feb 13, 2025)
- 0.2.3.post3 (Feb 12, 2025)
- 0.2.3.post2 (Feb 8, 2025)
- 0.2.3.post1 (Feb 1, 2025)
- 0.2.3 (Jan 31, 2025)
- 0.2.2 (Jan 23, 2025)
- 0.2.2.dev3, pre-release (Jan 22, 2025)
- 0.2.2.dev2, pre-release (Jan 22, 2025)
- 0.2.2.dev1, pre-release (Jan 22, 2025)
- 0.2.2.dev0, pre-release (Jan 22, 2025)
- 0.2.1.post7 (Dec 20, 2024)
- 0.2.1.post6 (Dec 20, 2024)
- 0.2.1.post5 (Dec 20, 2024)
- 0.2.1.post4 (Dec 12, 2024)
- 0.2.1.post3 (Dec 7, 2024)
- 0.2.1.post2 (Nov 12, 2024)
- 0.2.1.post1 (Nov 11, 2024)
- 0.2.1 (Nov 9, 2024)
- 0.2.0.post3 (Oct 6, 2024)
- 0.2.0.post2 (Oct 6, 2024)
- 0.2.0.post1 (Oct 6, 2024)
- 0.2.0 (Oct 5, 2024)
- 0.1.9 (Aug 2, 2024)
- 0.1.8.post2 (Jul 29, 2024)
- 0.1.8.post1 (Jul 29, 2024)
- 0.1.8 (Jul 17, 2024)
- 0.1.8rc2, pre-release (Jul 17, 2024)
- 0.1.8rc1, pre-release (Jul 17, 2024)
- 0.1.7.post2 (Jul 1, 2024)
- 0.1.7.post1 (Jul 1, 2024)
- 0.1.7 (Jun 27, 2024)
- 0.1.6 (Jun 26, 2024)
- 0.1.5 (Jun 18, 2024)
- 0.1.5rc2, pre-release (Jun 18, 2024)
- 0.1.4 (Jun 13, 2024)
- 0.1.3 (Jun 11, 2024)
- 0.1.2 (Jun 8, 2024)
- 0.1.1 (Jun 4, 2024)
- 0.1.0 (Jun 2, 2024)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bigcodebench-0.2.5.tar.gz (77.5 kB)

Uploaded Mar 31, 2025 Source

Built Distribution

bigcodebench-0.2.5-py3-none-any.whl (49.2 kB)

Uploaded Mar 31, 2025 Python 3

File details

Details for the file bigcodebench-0.2.5.tar.gz.

File metadata

Download URL: bigcodebench-0.2.5.tar.gz
Upload date: Mar 31, 2025
Size: 77.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.13.1

File hashes

Hashes for bigcodebench-0.2.5.tar.gz
Algorithm	Hash digest
SHA256	`1e3897b1f2052dfe25100cc242f905ce526f35b93822a06f8cd4d85a9454410b`
MD5	`2500cf46a3b0477537134fe07e09139b`
BLAKE2b-256	`12c92962fe6beedd9a8c579910f38f4916618901a910ecc58868f892e03db524`
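A downloaded archive can be checked against the SHA256 digest listed above using only the Python standard library. The sketch below is a generic integrity check, not something shipped with the package. The helper names and the assumption that the file sits in the current directory are illustrative.

```python
import hashlib

# SHA256 digest for bigcodebench-0.2.5.tar.gz, copied from the table above.
EXPECTED_SHA256 = "1e3897b1f2052dfe25100cc242f905ce526f35b93822a06f8cd4d85a9454410b"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's digest with the expected value, ignoring case."""
    return sha256_of(path) == expected.strip().lower()


if __name__ == "__main__":
    # Assumed location of the downloaded source distribution.
    archive = "bigcodebench-0.2.5.tar.gz"
    try:
        print("OK" if matches_expected(archive) else "MISMATCH")
    except FileNotFoundError:
        print(f"{archive} not found; download it first.")
```

pip can perform a similar check itself when given hashes in a requirements file, which may be preferable for automated installs.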

File details

Details for the file bigcodebench-0.2.5-py3-none-any.whl.

File metadata

Download URL: bigcodebench-0.2.5-py3-none-any.whl
Upload date: Mar 31, 2025
Size: 49.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.13.1

File hashes

Hashes for bigcodebench-0.2.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7adda739687317102db71b3088d2b6e8350c887a16245a7e7c5272f893c11ae3`
MD5	`afd2ed64f03f223c04686e97e4684ef1`
BLAKE2b-256	`3c4d74bc478759e9c8363abf3b9fcb254ae1e41f10e5a8b9f23675496f20fade`

