
"EvalPlus for rigourous evaluation of LLM-synthesized code"


EvalPlus(📖) => 📚

🔥 Quick Start | 📜 Papers | 🔨 Useful tools | 👷 Development | 🙏 Acknowledgement

Warning

🚨 Evaluating LLM-generated code over datasets with "3 test-cases" is **NOT** enough! 🚨

To address this, we started the EvalPlus project -- a rigorous evaluation framework for LLM4Code that:

  • ✨ improves code benchmarks by adding up to thousands of new tests! (81x new tests for HumanEval!)
  • ✨ crafts a set of utility tools to sanitize, visualize, and inspect LLM-generated code and evaluation results!
  • ✨ accelerates LLM4Code research by open-sourcing LLM-generated samples for 14+ models -- no need to re-run the expensive benchmarks!

🔥 Quick Start

To get started, please first set up the environment:

pip install evalplus --upgrade

...Or you can try out the latest development version:

pip install "git+https://github.com/evalplus/evalplus.git" --upgrade
🤔 Want to use a local GitHub repo? :: click to expand ::
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

HumanEval+

The usage is just like the original HumanEval: you only need to implement the generate_one_completion function!

from evalplus.data import get_human_eval_plus, write_jsonl

problems = get_human_eval_plus()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
🤔 What is in a `problem`? :: click to expand ::
  • "task_id" is the identifier string for the task
  • "entry_point": name of the function
  • "prompt" is the function signature with docstring
  • "canonical_solution" is the ground-truth implementation (re-implemented to fix bugs in HumanEval)
  • "base_input" is the test inputs in original HumanEval
  • "plus_input" is the test inputs brought by EvalPlus

To evaluate the samples:

evalplus.evaluate --dataset humaneval --samples samples.jsonl
🤔 Want to use a local GitHub repo? :: click to expand ::
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl
⌨️ More command-line flags :: click to expand ::
  • --parallel: by default half of the cores
  • --base-only (store_true): only run base HumanEval tests
  • --i-just-wanna-run: force a re-run

MBPP+ (TBD)

📜 Papers

Read our paper for more detailed findings!

@article{evalplus,
  title={Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author={Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang},
  journal={arXiv preprint arXiv:2305.01210},
  year={2023},
}

🔨 Useful tools

To use these tools, please first install the repository from GitHub:

git clone https://github.com/evalplus/evalplus.git
cd evalplus
pip install -r requirements-tools.txt

Syntax checker for LLM-generated code

Check LLM-produced code and answer the following questions:

  1. Is generation complete for all samples / all problems in the dataset?
  2. Is the LLM-generated code compilable? (If not, something could be wrong and you should double-check.)
python tools/checker.py --folder /path/to/[model]-[??]b_temp_[??] --dataset humaneval
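
The compilability question boils down to whether each sample parses as Python; a rough standalone approximation of that check (not the tool's actual implementation):

import ast

def is_compilable(code: str) -> bool:
    # Try to parse the sample as Python source.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(is_compilable("def add(a, b):\n    return a + b"))  # True
print(is_compilable("def add(a, b):\n    return a +"))    # False (truncated)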

Post code sanitizer

LLM-generated code may contain syntax errors, but some of them are easily fixable with simple post-processing. This tool makes LLM-generated code cleaner and more compilable by applying post-processing such as trimming at additional "magic" EOF markers and removing garbage non-code tokens.

python tools/sanitize.py --eof --folder /path/to/vicuna-[??]b_temp_[??]
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
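
For intuition, the trimming step can be approximated by cutting each completion at the earliest of a few common end-of-completion markers; a hedged sketch (the marker list is an assumption, not the tool's exact behavior):

# Illustrative markers at which a HumanEval-style completion is often truncated.
EOF_MARKERS = ["\ndef ", "\nclass ", "\nif __name__", "\nprint(", "```"]

def trim_completion(completion: str) -> str:
    # Keep everything before the earliest marker that appears, if any.
    cut = len(completion)
    for marker in EOF_MARKERS:
        idx = completion.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]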

Render pass@k results to rich and LaTeX tables

python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]`
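
The pass@k numbers these tables show are conventionally computed with the unbiased estimator from the original HumanEval paper, given n samples per task of which c pass; shown here for reference rather than as EvalPlus's exact code:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n-c, k) / C(n, k), computed stably as a product.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for a task, 37 of them pass; estimate pass@10.
print(pass_at_k(n=200, c=37, k=10))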

Perform test input generation from scratch (TBD)

👷 Development

Before you start:

pip install pre-commit
pre-commit install
export PYTHONPATH=$PYTHONPATH:$(pwd)

Naming convention

  • evalplus is the package name.
  • ${DATASET}_plus is the name of a dataset extended with EvalPlus.

🙏 Acknowledgement
