EvoEval: Evolving Coding Benchmarks via LLM

⚡Quick Start | 🔠Benchmarks | 🤖LLM Generated Code | 📝Citation | 🙏Acknowledgement

About

EvoEval¹ is a holistic benchmark suite created by evolving HumanEval problems:

  • 🔥 Contains 828 new problems across 5 🌠 semantic-altering and 2 ⭐ semantic-preserving benchmarks
  • 🔮 Allows evaluation/comparison across different dimensions and problem types (e.g., Difficult, Creative or Tool Use problems). See our visualization tool for ready-to-use comparison
  • 🏆 Comes complete with a leaderboard, groundtruth solutions, robust test cases and evaluation scripts that fit easily into your evaluation pipeline
  • 🤖 Includes generated LLM code samples from >50 different models to save you time in running experiments

¹ Coincidentally similar in pronunciation to 😈 EvilEval

Check out our 📃 paper and webpage for more details!

⚡ Quick Start

Directly install the package:

pip install evoeval --upgrade

⏬ Nightly Version

pip install "git+https://github.com/evo-eval/evoeval.git" --upgrade

⏬ Local Repository

git clone https://github.com/evo-eval/evoeval.git
cd evoeval
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt

Now you are ready to download EvoEval benchmarks and perform evaluation!

🧑‍💻 Code generation

To download our benchmarks, simply use the following code snippet:

from evoeval.data import get_evo_eval

evoeval_benchmark = "EvoEval_difficult" # you can pick from 7 different benchmarks!

problems = get_evo_eval(evoeval_benchmark)

For code generation and evaluation, we adopt the same style as HumanEval+ and HumanEval.

Implement the GEN_SOLUTION function by calling the LLM to produce the complete solution (including the function header and body), then save the samples to {benchmark}_samples.jsonl:

from evoeval.data import get_evo_eval, write_jsonl

evoeval_benchmark = "EvoEval_difficult"

samples = [
    dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"]))
    for task_id, problem in get_evo_eval(evoeval_benchmark).items()
]
write_jsonl(f"{evoeval_benchmark}_samples.jsonl", samples)

[!TIP]

EvoEval samples.jsonl expects the solution field to contain the complete code implementation. This is slightly different from the original HumanEval, where the solution field contains only the function body.

If you want to follow the HumanEval setup exactly, check out our 🤗 Hugging Face datasets, which can be run directly with the HumanEval evaluation script.
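The difference between the two formats can be bridged with a small helper. The following is a minimal sketch, assuming a HumanEval-style completion that holds only the function body; `to_full_solution` and the toy problem are hypothetical names for illustration and are not part of the evoeval package:

```python
import json


def to_full_solution(prompt: str, completion: str) -> str:
    """Join a prompt (signature + docstring) with a body-only completion."""
    # Make sure exactly one newline separates the prompt from the body.
    return prompt.rstrip("\n") + "\n" + completion


# Hypothetical problem and body-only completion, for illustration only.
prompt = 'def add(a, b):\n    """Return a + b."""\n'
completion = "    return a + b\n"

# The "solution" field now carries the complete, self-contained function.
sample = {"task_id": "Toy/0", "solution": to_full_solution(prompt, completion)}

# JSONL stores one JSON object per line.
line = json.dumps(sample)
print(line)
```

Going the other way (stripping the header to recover a body-only completion) is more fragile, which is one reason to store the complete implementation.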

🕵️ Evaluation

You can use our provided docker image:

docker run -v $(pwd):/app evoeval/evoeval:latest --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl

Or run it locally:

evoeval.evaluate --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl

Or if you are using it as a local repository:

export PYTHONPATH=$PYTHONPATH:$(pwd)
python evoeval/evaluate.py --dataset EvoEval_difficult --samples EvoEval_difficult_samples.jsonl

You should expect to see the following output (when evaluated on GPT-4):

Computing expected output...
Expected outputs computed in 11.24s
Reading samples...
100it [00:00, 164.16it/s]
100%|████████████████████████████████████████████████████████████████| 100/100 [00:07<00:00, 12.77it/s]
EvoEval_difficult
pass@1: 0.520 # for reference GPT-4 solves more than 80% of problems in HumanEval

This shows the pass@1 score for the EvoEval_difficult benchmark. Use --i-just-wanna-run to recompute the evaluation result.
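For readers unfamiliar with the metric, pass@k can be sketched with the standard unbiased estimator used for HumanEval-style evaluation. This is an illustrative reimplementation, not necessarily the code path evoeval uses internally; the function names and the toy `results` mapping are assumptions:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: dict, k: int = 1) -> float:
    """Average the per-problem pass@k over all problems in a benchmark."""
    scores = [pass_at_k(len(r), sum(r), k) for r in results.values()]
    return sum(scores) / len(scores)


# Toy results: one greedy sample per problem, True means all tests passed.
results = {"Toy/0": [True], "Toy/1": [False], "Toy/2": [True], "Toy/3": [False]}
print(f"pass@1: {benchmark_pass_at_k(results):.3f}")  # prints "pass@1: 0.500"
```

With a single greedy sample per problem, pass@1 reduces to the fraction of problems whose sample passes every test.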

🔠 Benchmarks

EvoEval contains 7 different benchmarks, each with a unique set of problems evolved from the original HumanEval problems. 🌠 denotes semantic-altering benchmarks, while ⭐ denotes semantic-preserving benchmarks:

🌠EvoEval_difficult:

Introduce complexity by adding additional constraints and requirements, replacing commonly used requirements with less common ones, or adding additional reasoning steps to the original problem.

🌠EvoEval_creative:

Generate a more creative problem compared to the original through the use of stories or uncommon narratives.

🌠EvoEval_subtle:

Make a subtle and minor change to the original problem such as inverting or replacing a requirement.

🌠EvoEval_combine:

Combine two different problems by integrating the concepts from both problems. To select problems that make sense to combine, we apply a simple heuristic: only problems of the same type are combined, where the type is determined by the types of the input arguments in the original problem.
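One way such a type-based grouping could look is sketched below. This is a guess at the general idea, not the authors' implementation: it assumes prompts carry type annotations, and `input_type_key`, the grouping rule, and the toy problems are all hypothetical.

```python
import ast
from collections import defaultdict
from itertools import combinations


def input_type_key(prompt: str, entry_point: str) -> tuple:
    """Return the annotated argument types of the entry point, sorted."""
    # A signature followed by a docstring parses as a complete function.
    for node in ast.walk(ast.parse(prompt)):
        if isinstance(node, ast.FunctionDef) and node.name == entry_point:
            return tuple(sorted(
                ast.unparse(arg.annotation) if arg.annotation else "Any"
                for arg in node.args.args
            ))
    raise ValueError(f"{entry_point!r} not found in prompt")


# Toy problems standing in for HumanEval entries (illustration only).
problems = {
    "Toy/0": ("total", 'def total(xs: List[int]) -> int:\n    """Sum xs."""\n'),
    "Toy/1": ("largest", 'def largest(nums: List[int]) -> int:\n    """Max."""\n'),
    "Toy/2": ("shout", 'def shout(s: str) -> str:\n    """Uppercase s."""\n'),
}

# Group task ids by the types of their input arguments.
groups = defaultdict(list)
for task_id, (entry_point, prompt) in problems.items():
    groups[input_type_key(prompt, entry_point)].append(task_id)

# Only problems within the same group are candidates for combination.
pairs = [p for members in groups.values() for p in combinations(members, 2)]
print(pairs)  # [('Toy/0', 'Toy/1')]
```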

🌠EvoEval_tool_use:

Produce a new problem containing a main problem and one or more helper functions which can be used to solve it. Each helper function is fully implemented and provides hints or useful functionality for solving the main problem. The main problem does not explicitly reference individual helper functions, and we do not require the model to use the provided helpers.

⭐EvoEval_verbose:

Reword the original docstring to be more verbose. These verbose docstrings can use more descriptive language to illustrate the problem, include a detailed explanation of the example output, and provide additional hints.

⭐EvoEval_concise:

Reword the original docstring to be more concise by removing unnecessary details and using concise language. Furthermore, simple examples that are not required to demonstrate edge cases may be removed.

For each problem in each EvoEval benchmark, we include the complete groundtruth as well as test cases for functional evaluation.

[!Note]

Problem Structure

{
  "task_id": "identifier string for the task",
  "entry_point": "name of the function",
  "prompt": "function signature with docstring",
  "canonical_solution": "groundtruth implementation",
  "inputs": "test inputs for each problem",
  "parent": "original HumanEval problem it evolved from",
  "main": "special field of EvoEval_tool_use to show just the main problem description",
  "helpers": "special field of EvoEval_tool_use to show the helper functions"
}
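The fields above are enough for differential testing: run the groundtruth on each input to get expected outputs, then compare a candidate solution against them. The sketch below uses a toy record and makes two assumptions that are not confirmed by this README: that `prompt` + `canonical_solution` forms a complete function, and that `inputs` is a list of positional-argument lists. `run_on_inputs` is a hypothetical helper, not an evoeval API.

```python
def run_on_inputs(code: str, entry_point: str, inputs: list) -> list:
    """Execute `code`, then call its entry point on every argument list."""
    namespace: dict = {}
    exec(code, namespace)  # no sandboxing here: only run code you trust
    func = namespace[entry_point]
    outputs = []
    for args in inputs:
        try:
            outputs.append(func(*args))
        except Exception as exc:
            # Record the exception type so a crash never equals a real value.
            outputs.append(("error", type(exc).__name__))
    return outputs


# Toy record mirroring the problem structure (illustration only).
problem = {
    "task_id": "Toy/0",
    "entry_point": "clamp",
    "prompt": 'def clamp(x, lo, hi):\n    """Clamp x to [lo, hi]."""\n',
    "canonical_solution": "    return max(lo, min(x, hi))\n",
    "inputs": [[5, 0, 10], [-3, 0, 10], [42, 0, 10]],
}

# Hypothetical model output containing the complete implementation.
candidate = "def clamp(x, lo, hi):\n    return min(hi, max(lo, x))\n"

groundtruth = problem["prompt"] + problem["canonical_solution"]
expected = run_on_inputs(groundtruth, problem["entry_point"], problem["inputs"])
actual = run_on_inputs(candidate, problem["entry_point"], problem["inputs"])

passed = expected == actual
print(problem["task_id"], "passed" if passed else "failed")
```

A real harness would also add timeouts and process isolation, which this sketch omits.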

🤖 LLM Generated Code

To view the performance of >50 LLMs on the EvoEval benchmarks, we provide a complete leaderboard as well as a visualization tool to compare the performance of different models.

Further, we also provide all code samples from LLMs on the EvoEval benchmarks:

Each LLM generation is packaged in a zip file named like {model_name}_temp_0.0.zip. You can unzip the folder and obtain the LLM generation for each of our 7 benchmarks + the original HumanEval problems. Note that we only evaluate the greedy output for each LLM.
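When handling many of these archives, the naming pattern can be parsed to recover the model name and sampling temperature. A minimal sketch, assuming file names follow the `{model_name}_temp_{temperature}.zip` pattern described above; `parse_archive_name` and the example path are hypothetical:

```python
import re
from pathlib import Path

# Greedy model group so that only the last "_temp_" is treated as the separator.
_ARCHIVE_RE = re.compile(r"^(?P<model>.+)_temp_(?P<temp>\d+(?:\.\d+)?)\.zip$")


def parse_archive_name(path: str) -> tuple:
    """Split an archive file name into (model_name, temperature)."""
    match = _ARCHIVE_RE.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected archive name: {path!r}")
    return match["model"], float(match["temp"])


model, temperature = parse_archive_name("downloads/gpt-4_temp_0.0.zip")
print(model, temperature)  # gpt-4 0.0
```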

📝 Citation

@article{evoeval,
  author    = {Xia, Chunqiu Steven and Deng, Yinlin and Zhang, Lingming},
  title     = {Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM},
  year      = {2024},
  journal   = {arXiv preprint},
}

[!Note]

The first two authors contributed equally to this work, with author order determined via Nigiri.

🙏 Acknowledgement
