"EvalPlus for rigorous evaluation of LLM-synthesized code"

EvalPlus(📖) => 📚

🔥Quick Start • 💻LLM code • 🔨Tools • 📜Citation • 🙏Acknowledgement


📢 Who is the best LLM coder? Take a look at the EvalPlus leaderboard 🏆! 📢
🤗 Request for independent model evaluation is open!



🚨 Evaluating LLM-generated code over datasets with "3 test-cases" is **NOT** enough! 🚨

To address this, we started the EvalPlus project -- a rigorous evaluation framework for LLM4Code that:

  • ✨ improves code benchmarks by adding up to thousands of new tests! (80x for HumanEval and 35x for MBPP!)
  • ✨ crafts a set of utility tools to sanitize, visualize and inspect LLM-generated code and evaluation results!
  • ✨ accelerates LLM4Code research by open-sourcing LLM-generated samples for 20+ models -- no need to re-run the expensive benchmarks!

Want to know more? Please read our NeurIPS'23 paper!

🔥 Quick Start

To get started, please first set up the environment:

pip install evalplus --upgrade
⏬ Install nightly version :: click to expand ::
pip install "git+" --upgrade
⏬ Using EvalPlus as a local repo? :: click to expand ::
git clone
cd evalplus
pip install -r requirements.txt

Code generation

Implement the GEN_SOLUTION function by calling the LLM to produce the complete solution (include the code) and save the samples to samples.jsonl:

from import get_[human_eval|mbpp]_plus, write_jsonl

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
🤔 Structure of `problem`? :: click to expand ::
  • task_id is the identifier string for the task
  • entry_point is the name of the function
  • prompt is the function signature with docstring
  • canonical_solution is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
  • base_input holds the test inputs from the original HumanEval
  • plus_input holds the test inputs added by EvalPlus


Expected Schema of samples.jsonl

  1. task_id: Task ID, which must be one of the keys of get_[human_eval|mbpp]_plus()
  2. solution (optional): Self-contained solution (usually including the prompt)
    • Example: {"task_id": "HumanEval/?", "solution": "def f():\n return 1"}
  3. completion (optional): Function body without prompt
    • Example: {"task_id": "HumanEval/?", "completion": " return 1"}

Only one of solution and completion is required. If both are provided, solution will be used. We also accept solutions in the form of a directory, i.e., --samples ${SAMPLE_DIR} where ${SAMPLE_DIR} is organized as: ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py (${TASK_ID} = task_id.replace("/", "_")).
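
As a rough illustration of the two accepted layouts, here is a minimal sketch that uses only the Python standard library. The task IDs and the helper names write_samples_jsonl and sample_path are hypothetical and not part of evalplus; in practice you can use write_jsonl as shown above.

```python
import json
from pathlib import Path

# Hypothetical samples for illustration; real task IDs come from the dataset.
samples = [
    {"task_id": "HumanEval/0", "solution": "def f():\n    return 1"},
    {"task_id": "HumanEval/1", "completion": "    return 1"},
]


def write_samples_jsonl(path, samples):
    """Write one JSON object per line (hypothetical helper, not an evalplus API)."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")


def sample_path(sample_dir, task_id, sample_id):
    """Build ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py, replacing "/" with "_" in the task ID."""
    return Path(sample_dir) / task_id.replace("/", "_") / f"{sample_id}.py"
```

For example, sample_path("out", "HumanEval/0", 3) points at out/HumanEval_0/3.py.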

Code evaluation

We strongly recommend using a sandbox such as Docker:

docker run -v $(pwd):/app ganler/evalplus:latest --dataset [humaneval|mbpp] --samples samples.jsonl

...Or if you want to try it locally regardless of the risks ⚠️:

evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl


Do you use a very slow machine?

LLM solutions are regarded as failed on timeout (and OOM etc.). Specifically, we set the timeout $T=\max(T_{base}, T_{gt}\times k)$, where:

  • $T_{base}$ is the minimal timeout (configurable by --min-time-limit; defaults to 0.2s);
  • $T_{gt}$ is the runtime of the ground-truth solutions (obtained via profiling);
  • $k$ is a configurable factor --gt-time-limit-factor (defaults to 4).

If your machine is too slow and you are getting high-variance results, try a larger $k$ and $T_{base}$.
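
The timeout rule above can be sketched in a few lines. This is only an illustration of the formula, not the evaluator's actual code; the function name time_limit is hypothetical, and the defaults mirror the ones stated for --min-time-limit and --gt-time-limit-factor.

```python
def time_limit(gt_runtime, min_time_limit=0.2, gt_time_limit_factor=4.0):
    """Return T = max(T_base, T_gt * k) in seconds (hypothetical helper)."""
    return max(min_time_limit, gt_runtime * gt_time_limit_factor)


# A fast ground truth (10 ms) is clamped to the minimal timeout,
# while a slower one (100 ms) is scaled by the factor k.
fast = time_limit(0.01)
slow = time_limit(0.1)
```

With the defaults, a ground-truth runtime below 0.05s always yields the 0.2s floor, so raising $T_{base}$ mostly helps short tests, while raising $k$ helps long-running ones.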

Additionally, we do NOT recommend overloading your test bed while running the evaluation. For example, using --parallel 64 on a 4-core machine or running other workloads during evaluation is a bad idea.

🤔 Evaluate with local GitHub repo? :: click to expand ::
python evalplus/ --dataset humaneval --samples samples.jsonl
⌨️ More command-line flags :: click to expand ::
  • --parallel: by default half of the cores
  • --base-only (store_true): only run base HumanEval tests
  • --i-just-wanna-run: force a re-run

The output should look like the following (a GPT-4 greedy decoding example):

Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
  • Base is the pass@k for the original HumanEval
  • Base + Extra is the pass@k for our HumanEval+ (with extra tests)
  • The "k" includes [1, 10, 100], of which only the k values <= the sample size will be used
  • A cache file named like samples_eval_results.jsonl will be created. Remove it to re-run the evaluation
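
For readers unfamiliar with pass@k, the sketch below shows the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), together with the rule that only k values no larger than the sample size are reported. This document does not state which estimator the evaluator implements, so treat the formula as an assumption; the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct (assumed estimator)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def usable_ks(n, candidates=(1, 10, 100)):
    """Keep only the k values that do not exceed the sample size n."""
    return [k for k in candidates if k <= n]
```

With greedy decoding there is one sample per task, so usable_ks(1) leaves only k = 1 and pass@1 is simply the fraction of tasks solved.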
🤔 How long would it take? :: click to expand ::

If you do greedy decoding where there is only one sample for each task, the evaluation should take just a few seconds. When running 200 samples x 164 tasks x ~700+ tests, it can take around 2-10 minutes with --parallel 64 and --test-details. Here are some tips to speed up the evaluation:

  • Use --parallel $(nproc)
  • Do NOT use --test-details if you just want to quickly get pass@k: --test-details runs all tests (700+ on average for each task), whereas without it the testing for a sample stops as soon as it fails a test.
  • Use our pre-evaluated results (see LLM-generated code)
  • Use HumanEval+ Mini
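
To see why --test-details costs more, here is a toy sketch contrasting fail-stop testing with running every test. It is not the evaluator's implementation (there is no sandboxing, timeout, or parallelism here), and run_tests, reference, and buggy are hypothetical names.

```python
def run_tests(candidate, reference, inputs, fail_fast=True):
    """Compare candidate against reference; return (passed, tests_executed)."""
    executed = 0
    passed = True
    for args in inputs:
        executed += 1
        try:
            ok = candidate(*args) == reference(*args)
        except Exception:
            ok = False  # a crashing candidate counts as a failed test
        if not ok:
            passed = False
            if fail_fast:
                break  # fail-stop: skip the remaining tests
    return passed, executed


reference = abs            # stand-in for a ground-truth solution
buggy = lambda x: x        # wrong for negative inputs
inputs = [(1,), (-1,), (2,), (-3,)]

quick = run_tests(buggy, reference, inputs, fail_fast=True)
full = run_tests(buggy, reference, inputs, fail_fast=False)
```

Both modes agree that the sample fails, but the fail-stop run executes 2 tests while the full run executes all 4.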


🚀 Try out HumanEvalPlus-Mini! It selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add the --mini flag and the evaluation can run 23+% faster (even faster if you evaluate all tests without fail-stop using --test-details).

docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
# ...Or locally ⚠️
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini

💻 LLM-generated code

We also share pre-generated code samples from LLMs we have evaluated:

  • HumanEval+: See the attachment of our v0.1.0 release.
  • MBPP+: See the attachment of our v0.2.0 release (TBD).

Each sample file is packaged in a zip file named like ${model_name}_temp_${temperature}.zip. You can unzip them to a folder named like ${model_name}_temp_${temperature} and run the evaluation from scratch with:

evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}

🔨 Useful tools

To use these tools, please first install the repository from GitHub:

git clone
cd evalplus
pip install -r requirements-tools.txt

Syntax checker for LLM-generated code

Check LLM-produced code and answer the following questions:

  1. Is the generation entirely done for all samples / all problems in the dataset?
  2. Is the LLM-generated code compilable? (if not, something could be wrong and you'd better check)
python tools/ --folder /path/to/[model]-[??]b_temp_[??] --dataset [humaneval|mbpp]
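
If you only want a quick compilability check of a single Python sample, a minimal sketch with the standard ast module looks like this. It is not the implementation of the tool above, and is_compilable is a hypothetical name.

```python
import ast


def is_compilable(code):
    """Return True if the code parses as Python source (hypothetical helper)."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers e.g. null bytes.
        return False
    return True


good = is_compilable("def f():\n    return 1")
bad = is_compilable("def f(:\n    return 1")
```

Parsing only catches syntax problems; a sample that parses can still fail at runtime or produce wrong results.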

Post code sanitizer

LLM-generated code may contain syntax errors, but some of them can be easily fixed with simple post-processing. This tool makes LLM-generated code cleaner and more likely to compile by applying such post-processing, for example trimming at additional "magical" EOFs and removing garbage non-code tokens.

python tools/ --eof --folder /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
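
The idea of trimming at EOF markers can be sketched as follows. The marker list and the function trim_at_eof are assumptions made for illustration; the markers the tool actually uses are not listed in this document.

```python
# Hypothetical stop markers; the real tool's list is not given in this document.
EOF_MARKERS = ["\nif __name__", "\nprint(", "\nassert "]


def trim_at_eof(code, markers=EOF_MARKERS):
    """Cut the code at the earliest marker and normalize the trailing newline."""
    cut = len(code)
    for marker in markers:
        idx = code.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return code[:cut].rstrip() + "\n"


raw = "def f():\n    return 1\n\nprint(f())\n"
cleaned = trim_at_eof(raw)
```

Here the trailing top-level print call, which is typical of chatty model output, is dropped while the function definition is kept.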

Render pass@k results to rich and LaTeX tables

python tools/ --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`

Perform test input generation from scratch (TBD)

Name convention

  • evalplus is the package name.
  • ${DATASET}_plus is the name of a dataset augmented by evalplus.

📜 Citation

  title={Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author={Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang},
  journal={arXiv preprint arXiv:2305.01210},

🙏 Acknowledgement

