
"ENAMEL: A benchmark for evaluating the capability of LLMs in generating efficient code"


ENAMEL

Our paper on arXiv | Our dataset on HuggingFace | Our Python library on PyPI

Getting Started | Library Usage | LLM Leaderboard | Acknowledgements

What is ENAMEL?

ENAMEL is a rigorous and high-standard benchmark for evaluating the capability of large language models (LLMs) in generating efficient code. We provide:

  • A new metric $\text{eff}@k$ characterizing the relationship between code efficiency and sample size $k$ (see the sketch after this list);
  • A problem set consisting of 142 high-quality problems selected from OpenAI HumanEval;
  • Expert-written efficient reference solutions, setting a high standard for efficiency evaluation;
  • Expert-written strong test case generators, enabling a rigorous evaluation of both correctness and efficiency;
  • A Python library enam for easily evaluating the efficiency of LLM-generated code.
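
For intuition, here is a minimal sketch (not the enam API; see our paper for the precise definition), assuming that $\text{eff}@k$ is the expected maximum efficiency score among $k$ samples of a problem and that it is estimated without bias from $n \ge k$ generated samples, in the same spirit as the standard pass@k estimator:

```python
# Minimal sketch (assumed definition, NOT the enam API): estimate
# eff@k = E[max efficiency score among a random k-subset of the n samples]
# without bias, analogously to the standard pass@k estimator.
from math import comb

def eff_at_k(scores: list[float], k: int) -> float:
    """Estimate eff@k from n >= k per-sample efficiency scores of one problem."""
    n = len(scores)
    assert n >= k, "need at least k samples"
    e = sorted(scores)  # e[0] <= ... <= e[n-1]
    # E[max over a uniformly random k-subset]
    #   = sum_{j=k..n} C(j-1, k-1) * e_(j) / C(n, k)
    return sum(comb(j - 1, k - 1) * e[j - 1] for j in range(k, n + 1)) / comb(n, k)

# Example: 4 samples of one problem with efficiency scores in [0, 1]
print(eff_at_k([0.2, 0.9, 0.5, 0.0], k=2))  # 0.65
```

In this example, the estimate equals the average of the maximum score over all $\binom{4}{2} = 6$ two-sample subsets.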

If you are interested in our work, please feel free to check out our paper for details.

Illustration of ENAMEL

Getting Started

Dependencies

Before running the code, please ensure that the following dependencies are installed:

  • Python >= 3.10
  • Tqdm >= 3.1.4
  • NumPy >= 1.4.0
  • Pandas >= 1.0

Using our generated test cases and LLM-generated code samples

To facilitate reproduction, we share on HuggingFace our generated test cases and LLM-generated code samples used in our evaluation. Please download eval~tests.pkl into the cache/ folder and download the code samples into the samples/ folder.
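
For reference, the commands below assume a layout like the following (the sample file name and the .json extension here are only illustrative, inferred from the --load_name examples):

```
cache/
    eval~tests.pkl                   # generated test cases
samples/
    humanevalplus-canonical.json     # downloaded code samples (illustrative name)
```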

To reproduce our results, please run demo.py, where --load_name specifies the file name of code samples (without file extension), and --tests specifies the generated test cases. For example, to evaluate the HumanEval+ canonical solutions, please run:

python3 demo.py --load_name humanevalplus-canonical --tests cache/eval~tests.pkl

Evaluating zipped code samples provided by EvalPlus

Our demo also supports the zipped code samples provided by EvalPlus. Please put their .zip files into our samples/ folder without renaming the files. For example, to evaluate the GPT-4 code samples gpt-4_temp_0.0.zip from EvalPlus, please run:

python3 demo.py --load_name gpt-4_temp_0.0 --tests cache/eval~tests.pkl

Warning: We are aware that our evaluator might be unable to kill a code sample if the code uses try ... except ... within an infinite loop, because the killing signal will be caught. We have decided not to resolve this issue because resolving it with multiprocessing would significantly slow down the evaluation process. If you do encounter this issue, please consider removing such code samples. (This issue does occur for two code samples provided by EvalPlus, so our demo will automatically handle it if you use the zipped code samples from EvalPlus.)
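
For illustration, a (hypothetical) code sample of the following shape cannot be stopped by our evaluator, because the bare except inside the infinite loop catches the killing signal:

```python
# Hypothetical pathological sample: the bare except inside an infinite loop
# swallows the exception raised to kill the sample, so it never terminates.
def solution(x):
    while True:
        try:
            pass  # never makes progress
        except:
            pass  # catches (and ignores) the killing signal
```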

Evaluating new code samples

If you want to evaluate your own code samples, please organize them as a .json file, put it in the samples/ folder, and run demo.py. For example, if the code samples are in the file samples/codes.json, please run:

python3 demo.py --load_name codes --tests cache/eval~tests.pkl

The .json file should be a dict of lists such that codes[str(i)][j] is the $j$-th code sample of problem $i$.
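
For instance, a minimal samples/codes.json with two code samples for problem 0 could be produced as follows (the problem ID and code strings here are placeholders):

```python
import json

# Illustrative structure only: a dict that maps each problem ID (as a string)
# to a list of code-sample strings.
codes = {
    "0": [
        "def solution(x):\n    return x + 1\n",  # codes["0"][0]
        "def solution(x):\n    return 1 + x\n",  # codes["0"][1]
    ],
}
with open("samples/codes.json", "w") as f:
    json.dump(codes, f)
```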

Library Usage

Our benchmark is also available as a Python library. Please see demo.py for an example usage of our library.

Notice: DO NOT use multiple threads or processes during efficiency evaluation; doing so might negatively affect the efficiency results.

Installation

Our library enam can be installed via pip:

pip install enam --upgrade

Note: To distinguish it from our benchmark ENAMEL, we name our library enam.

LLM Leaderboard

The following table is a leaderboard of 30 LLMs (under greedy decoding) as well as the HumanEval/HumanEval+ canonical solutions. The results show that LLMs fall short of generating expert-level efficient code. For more results, please refer to our paper.

We welcome LLM developers to submit their results to enrich this leaderboard. If you would like to submit your results, please organize your generated code samples into a .json file as described above and contact Ruizhong Qiu (rq5 AT illinois DOT edu).

| No. | Name                    | eff@1 | pass@1 |
|-----|-------------------------|-------|--------|
| 1   | HumanEval+              | 0.517 | 0.958  |
| 2   | GPT-4 Turbo (Nov 2023)  | 0.470 | 0.796  |
| 3   | HumanEval               | 0.458 | 0.908  |
| 4   | GPT-4 (Jun 2023)        | 0.454 | 0.831  |
| 5   | Llama 3 70B Instruct    | 0.421 | 0.746  |
| 6   | Mixtral 8x22B Instruct  | 0.408 | 0.746  |
| 7   | Claude 3 Opus           | 0.401 | 0.789  |
| 8   | Phind Code Llama V2     | 0.394 | 0.683  |
| 9   | Claude 3 Haiku          | 0.386 | 0.739  |
| 10  | ChatGPT                 | 0.364 | 0.683  |
| 11  | Claude 3 Sonnet         | 0.345 | 0.662  |
| 12  | Llama 3 8B Instruct     | 0.344 | 0.592  |
| 13  | Code Llama 34B Python   | 0.268 | 0.458  |
| 14  | Mixtral 8x7B Instruct   | 0.266 | 0.444  |
| 15  | Code Llama 70B Python   | 0.264 | 0.500  |
| 16  | Code Llama 7B Python    | 0.247 | 0.373  |
| 17  | Code Llama 13B Python   | 0.216 | 0.408  |
| 18  | StarCoder               | 0.195 | 0.352  |
| 19  | CodeGen 6B              | 0.193 | 0.296  |
| 20  | CodeGen 16B             | 0.169 | 0.310  |
| 21  | CodeT5+ 16B             | 0.160 | 0.317  |
| 22  | CodeGen 2B              | 0.153 | 0.254  |
| 23  | Mistral 7B              | 0.152 | 0.275  |
| 24  | Vicuna 13B              | 0.123 | 0.176  |
| 25  | SantaCoder              | 0.100 | 0.141  |
| 26  | Incoder 6B              | 0.091 | 0.127  |
| 27  | GPT-J                   | 0.083 | 0.106  |
| 28  | Incoder 1B              | 0.066 | 0.092  |
| 29  | Vicuna 7B               | 0.061 | 0.099  |
| 30  | GPT-Neo 2B              | 0.043 | 0.056  |
| 31  | PolyCoder               | 0.037 | 0.049  |
| 32  | StableLM 7B             | 0.020 | 0.021  |

Acknowledgements
