"EvalPlus for rigourous evaluation of LLM-synthesized code"
EvalPlus(📖) => 📚
🔥Quick Start • 💻LLM code • 🔨Tools • 📜Citation • 🙏Acknowledgement
About
EvalPlus is a rigorous evaluation framework for LLM4Code, with:
- ✨ HumanEval+: 80x more tests than the original HumanEval!
- ✨ MBPP+: 35x more tests than the original MBPP!
- ✨ Evaluation framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.
Why EvalPlus?
- ✨ Precise evaluation & ranking: see our leaderboard for the latest LLM rankings before & after rigorous evaluation.
- ✨ Coding rigorousness: look at the score differences before and after applying the EvalPlus tests! A smaller drop indicates more rigorous and less lax code generation, while a large drop means the generated code tends to be fragile.
- ✨ Pre-generated samples: EvalPlus accelerates LLM4Code research by open-sourcing LLM-generated samples for various models -- no need to re-run the expensive benchmarks!
Want to know more details? Read our NeurIPS'23 paper as well as our Google Slides!
🔥 Quick Start
To get started, please first set up the environment:
```bash
pip install evalplus --upgrade
```
⏬ Install nightly version :: click to expand ::
pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
⏬ Using EvalPlus as a local repo? :: click to expand ::
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```
Code generation
Implement the `GEN_SOLUTION` function by calling the LLM to produce the complete solution (including the code) and save the samples to `samples.jsonl`:
```python
from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_[human_eval|mbpp]_plus().items()
]
write_jsonl("samples.jsonl", samples)
```
🤔 Structure of `problem`? :: click to expand ::
- `task_id` is the identifier string for the task
- `entry_point` is the name of the function
- `prompt` is the function signature with docstring
- `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
- `base_input` is the test inputs in the original HumanEval
- `plus_input` is the test inputs brought by EvalPlus
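As a quick way to see these fields, the small sketch below (using the HumanEval+ variant) prints the entry point and prompt of the first problem:
```python
# Small sketch: inspect the fields of one HumanEval+ problem.
from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()
task_id, problem = next(iter(problems.items()))
print(task_id)                     # e.g. "HumanEval/0"
print(problem["entry_point"])      # name of the function to implement
print(problem["prompt"])           # function signature + docstring
print(len(problem["plus_input"]))  # number of extra EvalPlus test inputs
```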
[!Note]
Expected schema of `samples.jsonl`:
- `task_id`: Task ID, which are the keys of `get_[human_eval|mbpp]_plus()`
- `solution` (optional): Self-contained solution (usually including the prompt)
  - Example: `{"task_id": "HumanEval/?", "solution": "def f():\n return 1"}`
- `completion` (optional): Function body without the prompt
  - Example: `{"task_id": "HumanEval/?", "completion": " return 1"}`

Only one of `solution` and `completion` is required. If both are provided, `solution` will be used. We also accept solutions in the form of a directory, i.e., `--samples ${SAMPLE_DIR}`, where `${SAMPLE_DIR}` is organized as `${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py` (`${TASK_ID} = task_id.replace("/", "_")`).
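For example, here is a small sketch of writing samples in the directory form described above; the directory name and sample IDs are illustrative:
```python
# Illustrative sketch: write one solution per file in the layout
# ${SAMPLE_DIR}/${TASK_ID}/${SAMPLE_ID}.py, with TASK_ID = task_id.replace("/", "_").
import os

SAMPLE_DIR = "my_samples"  # illustrative directory name


def save_sample(task_id: str, sample_id: int, solution: str) -> None:
    task_dir = os.path.join(SAMPLE_DIR, task_id.replace("/", "_"))
    os.makedirs(task_dir, exist_ok=True)
    with open(os.path.join(task_dir, f"{sample_id}.py"), "w") as f:
        f.write(solution)


# save_sample("HumanEval/0", 0, "def f():\n    return 1")
```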
Code post-processing
LLM-generated text may not be compilable code, since it can include natural-language lines or incomplete extra code.
We provide a tool named `evalplus.sanitize` to clean up the code:
```bash
# 💡 If you are storing codes in jsonl:
evalplus.sanitize --samples samples.jsonl --dataset [humaneval|mbpp]
# Sanitized code will be produced to `samples-sanitized.jsonl`

# 💡 If you are storing codes in directories:
evalplus.sanitize --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
```
Note that the post-processing may not be perfect; we suggest using `evalplus.syncheck` to check the code validity before and after sanitization, which will print erroneous code snippets:
```bash
# 💡 If you are storing codes in jsonl:
evalplus.syncheck --samples samples.jsonl --dataset [humaneval|mbpp]

# 💡 If you are storing codes in directories:
evalplus.syncheck --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp]
```
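For a rough intuition of what such a validity check involves, the sketch below simply verifies that each sanitized solution still parses as Python; this is a simplified illustration, not the actual `evalplus.syncheck` implementation, and it assumes the jsonl samples use the `solution` field:
```python
# Simplified illustration of a syntax check over a samples.jsonl file
# (not the actual evalplus.syncheck implementation).
import ast
import json

with open("samples-sanitized.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        try:
            ast.parse(sample["solution"])  # assumes the "solution" field is present
        except SyntaxError as err:
            print(f"{sample['task_id']}: syntax error -> {err}")
```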
Code evaluation
We strongly recommend using a sandbox such as Docker:
```bash
docker run -v $(pwd):/app ganler/evalplus:latest --dataset [humaneval|mbpp] --samples samples.jsonl
```
...Or if you want to try it locally regardless of the risks ⚠️:
```bash
evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl
```
[!Tip]
Do you use a very slow machine?

LLM solutions are regarded as failed on timeout (and OOM, etc.). Specifically, we set the timeout $T = \max(T_{base}, T_{gt} \times k)$, where:
- $T_{base}$ is the minimal timeout (configurable by `--min-time-limit`; defaults to 1s);
- $T_{gt}$ is the runtime of the ground-truth solutions (obtained via profiling);
- $k$ is a configurable factor `--gt-time-limit-factor` (defaults to 4).

If your machine is too slow and you are getting high-variance results, try using a larger $k$ and $T_{base}$.

Additionally, you are NOT encouraged to over-stress your test-bed while running the evaluation. For example, using `--parallel 64` on a 4-core machine or doing other heavy work during the evaluation are bad ideas...
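As a concrete illustration of the timeout rule (the numbers below are made up):
```python
# Made-up example of the per-task timeout T = max(T_base, T_gt * k).
T_base = 1.0  # --min-time-limit (seconds)
T_gt = 0.4    # profiled ground-truth runtime (seconds)
k = 4         # --gt-time-limit-factor

T = max(T_base, T_gt * k)
print(T)  # 1.6 seconds for this task
```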
🤔 Evaluate with local GitHub repo? :: click to expand ::
```bash
export PYTHONPATH=$PYTHONPATH:$(pwd)
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
```
⌨️ More command-line flags :: click to expand ::
- `--parallel`: by default, half of the cores
- `--base-only` (store_true): only run the base HumanEval tests
- `--i-just-wanna-run`: force a re-run
The output should look like the following (below is a GPT-4 greedy-decoding example):
```
Computing expected output...
Expected outputs computed in 15.18s
Reading samples...
164it [00:04, 37.79it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
Base
{'pass@1': 0.8841463414634146}
Base + Extra
{'pass@1': 0.768}
```
- `Base` is the `pass@k` for the original HumanEval
- `Base + Extra` is the `pass@k` for our HumanEval+ (with extra tests)
- The "k" includes `[1, 10, 100]`, where k values `<=` the sample size will be used
- A cache file named like `samples_eval_results.jsonl` will be cached. Remove it to re-run the evaluation
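For reference, `pass@k` in HumanEval-style evaluation is typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021); the sketch below shows that formula for illustration, not EvalPlus's exact internal code:
```python
# Unbiased pass@k estimator (Chen et al., 2021), shown for illustration.
# n = total samples per task, c = number of correct samples, k = evaluation budget.
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


print(pass_at_k(n=200, c=50, k=10))  # example numbers
```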
🤔 How long would it take? :: click to expand ::
If you do greedy decoding, where there is only one sample for each task, the evaluation should take just a few seconds.
When running 200 samples x 164 tasks x ~700+ tests, it can take around 2-10 minutes with `--parallel 64` and `--test-details`.

Here are some tips to speed up the evaluation:
- Use `--parallel $(nproc)`.
- Do NOT use `--test-details` if you just want to quickly get pass@k, as `--test-details` will run all tests (700+ on average for each task), while without `--test-details` the testing for a sample stops immediately when it fails the first test.
- Use our pre-evaluated results (see LLM-generated code).
- Use HumanEval+ Mini.
[!Tip]
🚀 Try out `HumanEvalPlus-Mini`! It selects a minimal set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add a `--mini` flag, and it can run 23+% faster! (Even faster if you evaluate all tests without fail-stop using `--test-details`.)
```bash
docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini
# ...Or locally ⚠️
# evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini
```
💻 LLM-generated code
We also share pre-generated code samples from LLMs we have evaluated:
- HumanEval+: See the attachment of our v0.1.0 release.
- MBPP+: See the attachment of our v0.2.0 release.
Each sample set is packaged in a zip file named like `${model_name}_temp_${temperature}.zip`. You can unzip it to a folder named like `${model_name}_temp_${temperature}` and run the evaluation from scratch with:
```bash
evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature}
```
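For example, a small sketch that unzips one of these archives before evaluation; the archive name below is a placeholder following the `${model_name}_temp_${temperature}.zip` pattern:
```python
# Illustrative sketch: unzip a pre-generated sample archive before evaluation.
import zipfile

archive = "gpt-4_temp_0.0.zip"  # placeholder archive name
with zipfile.ZipFile(archive) as zf:
    zf.extractall("gpt-4_temp_0.0")  # then pass this folder to evalplus.evaluate --samples
```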
🔨 Useful tools
To use these tools, please first install the repository from GitHub:
```bash
git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r tools/requirements.txt
```
Code generation
We have configured the code generation of a wide range of LLMs (see support details in codegen/models.py). Example to run greedy generation on StarCoderBase-7B:
```bash
python codegen/generate.py --model starcoderbase-7b --bs 1 --temperature 0 --n_samples 1 --resume --greedy --root [result_path] --dataset [mbpp|humaneval]
```
Test input generation using EvalPlus
Please check `evalplus/inputgen.py`.
📜 Citation
```bibtex
@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}
```
🙏 Acknowledgement