
"EvalPlus for rigourous evaluation of LLM-synthesized code"

Project description

EvalPlus(📖) => 📚

🔥 Quick Start • 💻 LLM code • 📜 Papers • 🔨 Tools • 👷 Development • 🙏 Acknowledgement

Warning

🚨 Evaluating LLM-generated code over datasets with "3 test-cases" is **NOT** enough! 🚨

To address this, we started the EvalPlus project -- a rigorous evaluation framework for LLM4Code that:

  • ✨ improves code benchmarks by adding up to thousands of new tests! (81x new tests for HumanEval!)
  • ✨ crafts a set of utility tools to sanitize, visualize, and inspect LLM-generated code and evaluation results!
  • ✨ accelerates LLM4Code research by open-sourcing LLM-generated samples for 14+ models -- no need to re-run the expensive benchmarks!

🔥 Quick Start

To get started, please first set up the environment:

pip install evalplus --upgrade

...Or you can try out the latest development version:

pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
🤔 Want to use a local GitHub repo?
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
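
A quick way to verify the setup is to load the dataset from Python (a minimal sketch; it assumes get_human_eval_plus downloads and caches the HumanEval+ data on first use):

from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()           # downloads/caches the dataset on first call
print(f"Loaded {len(problems)} problems")  # HumanEval+ has 164 tasks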

HumanEval+

The usage is just like the original HumanEval: you only need to implement the generate_one_completion function!

from evalplus.data import get_human_eval_plus, write_jsonl

problems = get_human_eval_plus()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
🤔 What is in a `problem`?
  • task_id is the identifier string for the task
  • entry_point is the name of the function
  • prompt is the function signature with docstring
  • canonical_solution is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
  • base_input is the test inputs from the original HumanEval
  • plus_input is the extra test inputs added by EvalPlus
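
As a quick end-to-end sanity check (before wiring in a real model), you can submit each task's ground-truth canonical_solution as its completion; since the expected outputs are derived from these reference implementations, pass@1 should come out at (or very close to) 1.0. This is a minimal sketch, and samples_canonical.jsonl is just a hypothetical file name:

from evalplus.data import get_human_eval_plus, write_jsonl

problems = get_human_eval_plus()

# Use the ground-truth implementations as "completions" to exercise the
# evaluation pipeline end to end (sanity check only, not a real model).
samples = [
    dict(task_id=task_id, completion=problems[task_id]["canonical_solution"])
    for task_id in problems
]
write_jsonl("samples_canonical.jsonl", samples)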

To evaluate the samples:

We strongly recommend using a sandbox such as Docker:

docker run -v $(pwd):/app ganler/evalplus:v0.1.1 --dataset humaneval --samples samples.jsonl

...Or if you want to try it locally regardless of the risks ⚠️:

evalplus.evaluate --dataset humaneval --samples samples.jsonl

🚀 Try out HumanEvalPlus-Mini, which selects a minimal set of the highest-quality additional tests, achieving almost the same effectiveness as the full version. Just add the --mini flag; it can run 23+% faster! (Even faster if you evaluate all tests regardless of fail-stop.)

docker run -v $(pwd):/app ganler/evalplus:v0.1.1 --dataset humaneval --samples samples.jsonl --mini
# ...Or locally ⚠️
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini
🤔 Want to use a local GitHub repo?
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
⌨️ More command-line flags
  • --parallel: by default half of the cores
  • --base-only (store_true): only run the base HumanEval tests
  • --i-just-wanna-run: force a re-run
🤔 How long does it take?

When running 200 samples x 164 tasks x ~775 tests, it can take around 4-8 minutes using --parallel 64 and --test-details. Here are some tips to speed up the evaluation:

  • Use --parallel $(nproc)
  • Do not use --test-details if you just want pass@k quickly: with --test-details, all tests (~775 on average per task) are run for every sample, whereas without it, testing of a sample stops at the first failing test.
  • Use our pre-evaluated results (see LLM-generated code)
  • We will release a distilled version of HumanEval+ soon. Stay tuned!

The output should look like the following (below is a GPT-4 greedy-decoding example):

Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.75}
  • Base is the pass@k for the original HumanEval
  • Base + Extra is the pass@k for our HumanEval+ (with extra tests)
  • The k values include [1, 10, 100]; only k values <= the number of samples are used (a sketch of the typical pass@k estimator follows this list)
  • A cache file named like samples_eval_results.jsonl will be created. Remove it to re-run the evaluation
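
For reference, pass@k is usually computed with the unbiased estimator introduced by the original HumanEval paper. The sketch below illustrates that formula; it is an illustration, not necessarily EvalPlus's exact implementation:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n = samples per task, c = correct samples, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g., 200 samples for a task, 150 of them passing:
print(pass_at_k(200, 150, 1), pass_at_k(200, 150, 10))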

MBPP+ (TBD)

💻 LLM-generated code

Please find the pre-generated LLM code samples in the attachment of our v0.1.0 release. Each sample set is packaged in a zip file named like ${model_name}_temp_${temperature}.zip. You can unzip it to a folder named like ${model_name}_temp_${temperature} and run the evaluation from scratch with:

evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}

📜 Papers

Read our paper for more detailed findings!

@article{evalplus,
  title={Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author={Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang},
  journal={arXiv preprint arXiv:2305.01210},
  year={2023},
}

🔨 Useful tools

To use these tools, please first install the repository from GitHub:

git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r requirements-tools.txt

Syntax checker for LLM-generated code

Check LLM-produced code and answer the following questions:

  1. Is generation complete for all samples and all problems in the dataset?
  2. Is the LLM-generated code compilable? (If not, something could be wrong and you should check; a rough sketch of such a check is shown below.)
python tools/checker.py --folder /path/to/[model]-[??]b_temp_[??] --dataset humaneval
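
Conceptually, the compilability check amounts to parsing each sample with Python's own parser. A rough, hypothetical illustration (not the actual tools/checker.py logic):

import ast

def is_compilable(code: str) -> bool:
    """Return True if the code is syntactically valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False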

Post code sanitizer

LLM-generated code may contain syntax errors, but some of them are easily fixable with simple post-processing. This tool makes LLM-generated code cleaner and more compilable by applying post-processing such as trimming at additional magical EOF markers and removing garbage non-code tokens (a rough sketch of the trimming idea follows the command below).

python tools/sanitize.py --eof --folder /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
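
For intuition, the trimming step works roughly like the sketch below; the EOF markers listed here are hypothetical examples, and the real logic lives in tools/sanitize.py:

# Hypothetical "magical EOF" markers; the actual sanitizer may use different ones.
EOF_MARKERS = ["\nif __name__", "\nassert ", "\nprint(", "\n```"]

def trim_completion(completion: str) -> str:
    """Cut a completion at the first marker that signals non-solution content."""
    end = len(completion)
    for marker in EOF_MARKERS:
        idx = completion.find(marker)
        if idx != -1:
            end = min(end, idx)
    return completion[:end]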

Render pass@k results to rich and LaTeX tables

python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`

Perform test input generation from scratch (TBD)

👷 Development

Before you start:

pip install pre-commit
pre-commit install
export PYTHONPATH=$PYTHONPATH:$(pwd)

Naming convention

  • evalplus is the package name.
  • ${DATASET}_plus is the name of a dataset augmented by EvalPlus.

🙏 Acknowledgement

