EvalPlus for rigorous evaluation of LLM-synthesized code
EvalPlus(📖) => 📚
🔥Quick Start • 📜Papers • 🔨Useful tools • 👷Development • 🙏Acknowledgement
Warning
🚨 Evaluating LLM-generated code over datasets with "3 test-cases" is **NOT** enough! 🚨
To address this, we started the EvalPlus project -- a rigorous evaluation framework for LLM4Code that:
- ✨ improves code benchmarks by adding up to thousands of new tests! (81x new tests for HumanEval!)
- ✨ crafts a set of utility tools to sanitize, visualize, and inspect LLM-generated code and evaluation results!
- ✨ accelerates LLM4Code research by open-sourcing LLM-generated samples for 14+ models -- no need to re-run the expensive benchmarks!
🔥 Quick Start
To get started, please first set up the environment:
pip install evalplus --upgrade
...Or you can try out the latest development version:
pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
🤔 Want to use local GitHub repo? :: click to expand ::
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
HumanEval+
The usage is just like the original HumanEval: you only need to implement the `generate_one_completion` function!
from evalplus.data import get_human_eval_plus, write_jsonl
problems = get_human_eval_plus()
num_samples_per_task = 200
samples = [
dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
for task_id in problems
for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
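Note that `generate_one_completion` is something you supply yourself. A minimal sketch (with a canned function body standing in for a real model call) might look like:

```python
# Minimal sketch: `generate_one_completion` is user-supplied.
# A real implementation would query your model; here a canned
# function body stands in for the model's output.
def generate_one_completion(prompt: str) -> str:
    # The completion should be code that continues `prompt`,
    # i.e. the body of the function being synthesized.
    return "    return sorted(numbers)\n"

completion = generate_one_completion("def sort_numbers(numbers):\n")
print(completion)
```

In real use, the returned string is whatever your model generates given the problem's `prompt`.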
🤔 What is in a `problem`? :: click to expand ::
- `task_id` is the identifier string for the task
- `entry_point` is the name of the function
- `prompt` is the function signature with docstring
- `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
- `base_input` is the test inputs in original HumanEval
- `plus_input` is the test inputs brought by EvalPlus
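To illustrate how these fields fit together (the values below are made up for illustration; only the field layout follows the description above), a runnable program is simply the `prompt` concatenated with a solution:

```python
# Illustrative problem dict -- values are invented, layout follows
# the field descriptions above.
problem = {
    "task_id": "HumanEval/0",
    "entry_point": "add",
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "base_input": [[1, 2], [0, 0]],
    "plus_input": [[10**9, -1]],
}

# A runnable program is the prompt plus a (canonical or generated) solution.
program = problem["prompt"] + problem["canonical_solution"]
namespace = {}
exec(program, namespace)
fn = namespace[problem["entry_point"]]
print(fn(1, 2))  # 3
```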
To evaluate the samples:
We strongly recommend using a sandbox such as Docker:
docker run -v $(pwd):/app ganler/evalplus:v0.1.1 --dataset humaneval --samples samples.jsonl
...Or if you want to try it locally regardless of the risks ⚠️:
evalplus.evaluate --dataset humaneval --samples samples.jsonl
🤔 Want to use local GitHub repo? :: click to expand ::
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
⌨️ More command-line flags :: click to expand ::
- `--parallel`: number of parallel processes (by default half of the cores)
- `--base-only` (store_true): only run the base HumanEval tests
- `--i-just-wanna-run`: force a re-run
The output should look like the following (GPT-4 greedy decoding example):
Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.75}
- `Base` is the `pass@k` for the original HumanEval
- `Base + Extra` is the `pass@k` for our HumanEval+ (with extra tests)
- The "k" includes `[1, 10, 100]`; k values `<=` the sample size will be used
- A cache file named like `samples_eval_results.jsonl` will be created. Remove it to re-run the evaluation.
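For context, `pass@k` is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021); a minimal sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples, drawn without replacement from n samples of which
    c are correct, passes all tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 150, 1))  # 0.75
```

With greedy decoding (a single sample per task), `pass@1` reduces to the plain fraction of tasks solved.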
MBPP+ (TBD)
📜 Papers
Read our paper for more detailed findings!
@article{evalplus,
title={Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
author={Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang},
journal={arXiv preprint arXiv:2305.01210},
year={2023},
}
🔨 Useful tools
To use these tools, please first install the repository from GitHub:
git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r requirements-tools.txt
Syntax checker for LLM-generated code
Check LLM-produced code and answer the following questions:
- Is generation complete for all samples / all problems in the dataset?
- Is the LLM-generated code compilable? (If not, something could be wrong and you should check.)
python tools/checker.py --folder /path/to/[model]-[??]b_temp_[??] --dataset humaneval
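Conceptually, the compilability check can be done with Python's built-in `compile` (a sketch of the idea, not necessarily the tool's exact implementation):

```python
def is_compilable(code: str) -> bool:
    """Return True if `code` parses as valid Python syntax."""
    try:
        compile(code, "<sample>", "exec")
        return True
    except SyntaxError:
        return False

print(is_compilable("def add(a, b):\n    return a + b\n"))  # True
print(is_compilable("def add(a, b:\n    return a + b\n"))   # False
```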
Post code sanitizer
LLM-generated code may contain syntax errors, but some of them are easily fixable with simple post-processing. This tool makes LLM-generated code cleaner and more compilable by applying post-processing such as trimming at additional EOF markers and removing garbage non-code tokens.
python tools/sanitize.py --eof --folder /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
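As an illustration of the trimming idea (the marker list and function name below are hypothetical, not evalplus's actual implementation):

```python
# Hypothetical stop markers after which a completion is likely
# garbage; the real sanitizer's marker set may differ.
STOP_MARKERS = ["\ndef ", "\nclass ", "\nif __name__", "\nprint(", "```"]

def trim_completion(completion: str) -> str:
    """Cut the completion at the earliest stop marker, if any."""
    cut = len(completion)
    for marker in STOP_MARKERS:
        idx = completion.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

print(repr(trim_completion("    return x\n\nprint(test())")))
```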
Render `pass@k` results to `rich` and LaTeX tables
python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`
Perform test input generation from scratch (TBD)
👷 Development
Before you start:
pip install pre-commit
pre-commit install
export PYTHONPATH=$PYTHONPATH:$(pwd)
Name convention
- `evalplus` is the package name.
- `${DATASET}_plus` is the name of a dataset applied with `evalplus`.
🙏 Acknowledgement