
"EvalPlus for rigourous evaluation of LLM-synthesized code"


EvalPlus(📖) => 📚

📰 News • 🔥 Quick Start • 🚀 LLM Backends • 📚 Documents • 📜 Citation • 🙏 Acknowledgement

About

EvalPlus is a rigorous evaluation framework for LLM4Code, with:

  • HumanEval+: 80x more tests than the original HumanEval!
  • MBPP+: 35x more tests than the original MBPP!
  • EvalPerf: evaluating the efficiency of LLM-generated code!
  • Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Why EvalPlus?

  • Precise evaluation & ranking: See our leaderboard for latest LLM rankings before & after rigorous evaluation.
  • Coding rigorousness: Look at the score differences before and after applying the EvalPlus tests! A smaller drop means the generated code is more robust, while a larger drop means it tends to be fragile.
  • Code efficiency: Beyond correctness, our EvalPerf dataset evaluates the efficiency of LLM-generated code via performance-exercising coding tasks and test inputs.

Want to know more details? Read our papers & materials!

📰 News

Notable updates of EvalPlus are tracked below:

  • [2024-10-20 v0.3.1]: EvalPlus v0.3.1 is officially released! Release highlights include (i) code efficiency evaluation via EvalPerf, (ii) one command to run the whole pipeline (generation + post-processing + evaluation), and (iii) support for more inference backends such as Google Gemini & Anthropic.
  • [2024-06-09 pre v0.3.0]: Improved ground-truth solutions for MBPP+ tasks (IDs: 459, 102, 559). Thanks to EvalArena.
  • [2024-04-17 pre v0.3.0]: MBPP+ is upgraded to v0.2.0 by removing some broken tasks (399 -> 378 tasks). A ~4pp pass@1 improvement can be expected.
  • Earlier:
    • (v0.2.1) You can use EvalPlus datasets via bigcode-evaluation-harness! HumanEval+ oracle fixes (32).
    • (v0.2.0) MBPP+ is released! HumanEval contract & input fixes (0/3/9/148/114/1/2/99/28/32/35/160).
    • (v0.1.7) Leaderboard release; HumanEval+ contract and input fixes (32/166/126/6)
    • (v0.1.6) Configurable and by-default-conservative timeout settings; HumanEval+ contract & ground-truth fixes (129/148/75/53/0/3/9/140)
    • (v0.1.5) HumanEval+ mini is released for ultra-fast evaluation when you have too many samples!
    • (v0.1.1) Improved user experience: evaluation speed, PyPI package, Docker, etc.
    • (v0.1.0) HumanEval+ is released!

🔥 Quick Start

  • Code correctness evaluation: HumanEval(+) or MBPP(+)
pip install --upgrade "evalplus[vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[vllm]" --upgrade` for the latest stable release

evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --greedy
Code execution within Docker:
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset humaneval                    \
                 --backend vllm                         \
                 --greedy

# Code execution within Docker
docker run --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evaluate --dataset humaneval                       \
           --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl
  • Code efficiency evaluation: EvalPerf (*nix only)
pip install --upgrade "evalplus[perf,vllm] @ git+https://github.com/evalplus/evalplus"
# Or `pip install "evalplus[perf,vllm]" --upgrade` for the latest stable release

sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid' # Enable perf
evalplus.evalperf --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --backend vllm
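
To confirm the setting took effect before running the evaluation:

cat /proc/sys/kernel/perf_event_paranoid   # expected to print 0 after the command above
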
Code execution within Docker:
# Local generation
evalplus.codegen --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                 --dataset evalperf                     \
                 --backend vllm                         \
                 --temperature 1.0                      \
                 --n-samples 100

# Code execution within Docker
docker run --cap-add PERFMON --rm -v $(pwd)/evalplus_results:/app ganler/evalplus:latest \
           evalplus.evalperf --samples /app/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_1.0.jsonl

🚀 LLM Backends

HuggingFace models

  • transformers backend:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend hf                           \
                  --greedy

Note:

EvalPlus uses different prompts for base and chat models. By default, the prompt mode is detected via tokenizer.chat_template when using hf/vllm as the backend. For other backends, only chat mode is allowed.

Therefore, if your base model comes with a tokenizer.chat_template, please add --force-base-prompt to avoid it being evaluated in chat mode.
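
For example, a base model whose tokenizer happens to ship a chat_template could be evaluated as follows (the model name is only a placeholder):

evalplus.evaluate --model "your-org/your-base-model" \
                  --dataset humaneval                \
                  --backend vllm                     \
                  --force-base-prompt                \
                  --greedy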

Enable Flash Attention 2:
# Install Flash Attention 2
pip install packaging ninja
pip install flash-attn --no-build-isolation
# Note: if you have installation problem, consider using pre-built
# wheels from https://github.com/Dao-AILab/flash-attention/releases

# Run evaluation with FA2
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B"         \
                  --dataset [humaneval|mbpp]                     \
                  --backend hf                                   \
                  --attn-implementation [flash_attention_2|sdpa] \
                  --greedy
  • vllm backend:
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend vllm                         \
                  --tp [TENSOR_PARALLEL_SIZE]            \
                  --greedy
  • OpenAI-compatible servers (e.g., vLLM):
# Launch a model server first: e.g., https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html
evalplus.evaluate --model "ise-uiuc/Magicoder-S-DS-6.7B" \
                  --dataset [humaneval|mbpp]             \
                  --backend openai                       \
                  --base-url http://localhost:8000/v1    \
                  --greedy

OpenAI models

export OPENAI_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gpt-4o"            \
                  --dataset [humaneval|mbpp]  \
                  --backend openai            \
                  --greedy

Anthropic models

export ANTHROPIC_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "claude-3-haiku-20240307" \
                  --dataset [humaneval|mbpp]        \
                  --backend anthropic               \
                  --greedy

Google Gemini models

export GOOGLE_API_KEY="[YOUR_API_KEY]"
evalplus.evaluate --model "gemini-1.5-pro"    \
                  --dataset [humaneval|mbpp]  \
                  --backend google            \
                  --greedy

You can check out the generations and results at evalplus_results/[humaneval|mbpp]/.
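
For example, to list and peek at the samples produced by the Quick Start run above (the exact file name depends on the model, backend, and sampling temperature):

ls evalplus_results/humaneval/
head -n 1 evalplus_results/humaneval/ise-uiuc--Magicoder-S-DS-6.7B_vllm_temp_0.0.jsonl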

⏬ Using EvalPlus as a local repo?
git clone https://github.com/evalplus/evalplus.git
cd evalplus
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
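
A quick way to confirm that Python picks up the local checkout rather than a previously installed copy:

python -c "import evalplus; print(evalplus.__file__)"   # should point into the cloned repository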

📚 Documents

To learn more about how to use EvalPlus, please refer to the documentation in the EvalPlus repository.

📜 Citation

@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title = {Evaluating Language Models for Efficient Code Generation},
  author = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year = {2024},
  url = {https://openreview.net/forum?id=IBCBMeAhmC},
}

🙏 Acknowledgement
