Skip to main content

"ENAMEL: A benchmark for evaluating the capability of LLMs in generating efficient code"

Project description

ENAMEL

Our paper on arXiv Our dataset on HuggingFace Our Python library on PyPI

Getting Started | Library Usage | LLM Leaderboard | Acknowledgements

What is ENAMEL?

ENAMEL is a rigorous and high-standard benchmark for evaluating the capability of large language models (LLMs) in generating efficient code. We provide:

  • A new metric $\text{eff}@k$ characterizing the relationship between code efficiency and sample size $k$;
  • A problem set consisting of 142 high-quality problems selected from OpenAI HumanEval;
  • Expert-written efficient reference solutions, setting a high-standard for efficiency evaluation;
  • Expert-written strong test case generators, enabling a rigorous evaluation of both correctness and efficiency;
  • A Python library enam for easily evaluating the efficiency of LLM-generated code.

If you are interested in our work, please feel free to check our paper for detail.

Illustration of ENAMEL

Getting Started

Dependencies

Before running the code, please ensure the following dependencies:

  • Python >= 3.10
  • Tqdm >= 3.1.4
  • NumPy >= 1.4.0
  • Pandas >= 1.0

Using our generated test cases and LLM-generated code samples

To facilitate reproduction, we share on HuggingFace our generated test cases and LLM-generated code samples used in our evaluation. Please download eval~tests.pkl into the cache/ folder and download the code samples into the samples/ folder.

To reproduce our results, please run demo.py, where --load_name specifies the file name of code samples (without file extension), and --tests specifies the generated test cases. For example, to evaluate the HumanEval+ canonical solutions, please run:

python3 demo.py --load_name humanevalplus-canonical --tests cache/eval~tests.pkl

Evaluating zipped code samples provided by EvalPlus

Our demo also supports the zipped code samples provided by EvalPlus. Please put their .zip files into our samples/ folder without renaming the files. For example, to evaluate the GPT-4 code samples gpt-4_temp_0.0.zip from EvalPlus, please run:

python3 demo.py --load_name gpt-4_temp_0.0 --tests cache/eval~tests.pkl

Warning: It is known to us that our evaluator might be unable to kill a code sample if the code uses try ... except ... within an infinity loop because the killing signal will be caught. We have decided not to resolve this issue because resolving it with multiprocessing will significantly slow down the evaluation process. If you do encounter this issue, please consider removing such code samples. (This issue indeed happens for two code samples provided by EvalPlus, so our demo will automatically handle it if you use the zipped code samples from EvalPlus.)

Evaluating new code samples

If you want to evaluate your own code samples, please organize them as a .json file, put it in the samples/ folder, and run demo.py. For example, if the code samples are in the file samples/codes.json, please run:

python3 demo.py --load_name codes --tests cache/eval~tests.pkl

The .json file should be a dict of lists such that codes[str(i)][j] is the $j$-th code sample of problem $i$.

Library Usage

Our benchmark is also available as a Python library. Please see demo.py for an example usage of our library. The detailed usage will be posted soon.

Installation

Our library enam can be installed via pip:

pip install enam --upgrade

Note: To distinguish from our benchmark ENAMEL, we name our library enam.

LLM Leaderboard

The following table is a leaderboard of 30 LLMs (under greedy decoding) as well as HumanEval/HumanEval+ canonical solutions. Results show that LLMs fall short of generating expert-level efficient code. For more results, please refer to our paper.

We welcome LLM developers to submit their results to enrich this leaderboard. If you would like to submit your results, please organize your generated code samples into a .json file as described above and contact Ruizhong Qiu (rq5 AT illinois DOT edu).

No. Name eff@1 pass@1
1 HumanEval+ 0.517 0.958
2 GPT-4 Turbo (Nov 2023) 0.470 0.796
3 HumanEval 0.458 0.908
4 GPT-4 (Jun 2023) 0.454 0.831
5 Llama 3 70B Instruct 0.421 0.746
6 Mixtral 8x22B Instruct 0.408 0.746
7 Claude 3 Opus 0.401 0.789
8 Phind Code Llama V2 0.394 0.683
9 Claude 3 Haiku 0.386 0.739
10 ChatGPT 0.364 0.683
11 Claude 3 Sonnet 0.345 0.662
12 Llama 3 8B Instruct 0.344 0.592
13 Code Llama 34B Python 0.268 0.458
14 Mixtral 8x7B Instruct 0.266 0.444
15 Code Llama 70B Python 0.264 0.500
16 Code Llama 7B Python 0.247 0.373
17 Code Llama 13B Python 0.216 0.408
18 StarCoder 0.195 0.352
19 CodeGen 6B 0.193 0.296
20 CodeGen 16B 0.169 0.310
21 CodeT5+ 16B 0.160 0.317
22 CodeGen 2B 0.153 0.254
23 Mistral 7B 0.152 0.275
24 Vicuna 13B 0.123 0.176
25 SantaCoder 0.100 0.141
26 Incoder 6B 0.091 0.127
27 GPT-J 0.083 0.106
28 Incoder 1B 0.066 0.092
29 Vicuna 7B 0.061 0.099
30 GPT-Neo 2B 0.043 0.056
31 PolyCoder 0.037 0.049
32 StableLM 7B 0.020 0.021

Acknowledgements

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

enam-0.1.0.tar.gz (16.5 kB view details)

Uploaded Source

Built Distribution

enam-0.1.0-py3-none-any.whl (14.8 kB view details)

Uploaded Python 3

File details

Details for the file enam-0.1.0.tar.gz.

File metadata

  • Download URL: enam-0.1.0.tar.gz
  • Upload date:
  • Size: 16.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.12

File hashes

Hashes for enam-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9ed479c8c15bdd03490a948354eb250ebd5b8fcf89b330072861f7c75fbd6654
MD5 4f795c53b889a38c8c6ba4cccfaab19a
BLAKE2b-256 90d6a9f6772a7380cdcbe3db305fd0a5b2a266b2720672e424208d5268e90114

See more details on using hashes here.

File details

Details for the file enam-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: enam-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 14.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.12

File hashes

Hashes for enam-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a220130ba15c4529aba961e4601154228bf142913e34addb50109f80f9e76b14
MD5 32137e0241e100440471a88d62742d22
BLAKE2b-256 fc98bf886c20e042e9f28b547b01c69724e32f40e8e0ef07f06994549e34c1bd

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page