
"EvalPlus for rigourous evaluation of LLM-synthesized code"


EvalPlus(📖) => 📚

🔥 Quick Start | 📜 Papers | 🔨 Useful tools | 👷 Development | 🙏 Acknowledgement

Warning

🚨 Evaluating LLM-generated code over datasets with "3 test-cases" is **NOT** enough! 🚨

To address this, we started the EvalPlus project -- a rigorous evaluation framework for LLM4Code that:

  • ✨ improves code benchmarks by adding up to thousands of new tests! (81x new tests for HumanEval!)
  • ✨ crafts a set of utility tools to sanitize, visualize, and inspect LLM-generated code and evaluation results!
  • ✨ accelerates LLM4Code research by open-sourcing LLM-generated samples for 14+ models -- no need to re-run the expensive benchmarks!

🔥 Quick Start

To get started, please first set up the environment:

pip install evalplus --upgrade

...Or you can try out the latest development version:

pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
🤔 Want to use a local GitHub repo? :: click to expand ::
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

HumanEval+

The usage is just like the original HumanEval: you only need to implement the generate_one_completion function!

from evalplus.data import get_human_eval_plus, write_jsonl

problems = get_human_eval_plus()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
🤔 What is in a `problem`? :: click to expand ::
  • "task_id" is the identifier string for the task
  • "entry_point": name of the function
  • "prompt" is the function signature with docstring
  • "canonical_solution" is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
  • "base_input" is the test inputs in original HumanEval
  • "plus_input" is the test inputs brought by EvalPlus

To evaluate the samples:

evalplus.evaluate --dataset humaneval --samples samples.jsonl
🤔 Want to use a local GitHub repo? :: click to expand ::
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
⌨️ More command-line flags :: click to expand ::
  • --parallel: by default half of the cores
  • --base-only (store_true): only run base HumanEval tests
  • --i-just-wanna-run: force a re-run

MBPP+ (TBD)

📜 Papers

Read our paper for more detailed findings!

@article{evalplus,
  title={Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author={Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang},
  journal={arXiv preprint arXiv:2305.01210},
  year={2023},
}

🔨 Useful tools

To use these tools, please first install the repository from GitHub:

git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r requirements-tools.txt

Syntax checker for LLM-generated code

Check LLM-produced code and answer the following questions:

  1. Is generation complete for all samples / all problems in the dataset?
  2. Is the LLM-generated code compilable? (If not, something could be wrong and you should double-check.)
python tools/checker.py --folder /path/to/[model]-[??]b_temp_[??] --dataset humaneval
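
The compilability question boils down to whether each sample parses as Python; a rough standalone approximation of that check (not the tool's actual implementation):

import ast

def is_compilable(code: str) -> bool:
    # Try to parse the sample as Python source.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_compilable("def add(a, b):\n    return a + b"))  # True
print(is_compilable("def add(a, b):\n    return a +"))    # False (truncated)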

Post code sanitizer

LLM-generated code may contain syntax errors, but some of them are easily fixable with simple post-processing. This tool makes LLM-generated code cleaner and more compilable by applying post-processing such as trimming at additional "magic" EOF markers and removing garbage non-code tokens.

python tools/sanitize.py --eof --folder /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
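
For intuition, the trimming step can be approximated by cutting each completion at the earliest of a few common end-of-completion markers; a hedged sketch (the marker list is an assumption, not the tool's exact behavior):

# Illustrative markers at which a HumanEval-style completion is often truncated.
EOF_MARKERS = ["\ndef ", "\nclass ", "\nif __name__", "\nprint(", "```"]

def trim_completion(completion: str) -> str:
    # Keep everything before the earliest marker that appears, if any.
    cut = len(completion)
    for marker in EOF_MARKERS:
        idx = completion.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]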

Render pass@k results to rich and LaTeX tables

python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`
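
The pass@k numbers these tables show are conventionally computed with the unbiased estimator from the original HumanEval paper, given n samples per task of which c pass; shown here for reference rather than as EvalPlus's exact code:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n-c, k) / C(n, k), computed stably as a product.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for a task, 37 of them pass; estimate pass@10.
print(pass_at_k(n=200, c=37, k=10))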

Perform test input generation from scratch (TBD)

👷 Development

Before you start:

pip install pre-commit
pre-commit install
export PYTHONPATH=$PYTHONPATH:$(pwd)

Naming convention

  • evalplus is the package name.
  • ${DATASET}_plus is the name of a dataset extended with EvalPlus.

🙏 Acknowledgement
