CodeMMLU Evaluator: A framework for evaluating language models on the CodeMMLU MCQ benchmark.
CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities
📰 News • 🚀 Quick Start • 📋 Evaluation • 📌 Citation
📌 About
CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It builds upon the structure of multiple-choice question answering (MCQA) to cover a wide range of programming tasks and domains, including code generation, defect detection, software engineering principles, and much more.
Why CodeMMLU?
- CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair, across various domains and more than 10 programming languages.
- Precise and comprehensive: Check out our LEADERBOARD for the latest LLM rankings.
📰 News
[2024-10-13] We are releasing the CodeMMLU benchmark v0.0.1 and our preprint report HERE!
🚀 Quick Start
Install CodeMMLU and set up its dependencies via pip:
pip install codemmlu
Generate responses for the CodeMMLU MCQ benchmark:
code_mmlu.generate --model_name <your_model_name_or_path> \
--subset <subset> \
--backend <backend> \
--output_dir <your_output_dir>
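For instance, a minimal quick-start run might look like the following (the model name here is illustrative; substitute any Hugging Face model name or local path):
code_mmlu.generate --model_name meta-llama/Meta-Llama-3-8B-Instruct \
    --subset programming_syntax \
    --backend hf \
    --output_dir ./results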
📋 Evaluation
Build codemmlu from source:
git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .
[!Note]
If you prefer the vllm backend, we highly recommend installing vllm from the official project before installing codemmlu.
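For example, a plain pip install of vLLM (consult the vLLM documentation for hardware- and CUDA-specific instructions):
pip install vllm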
Start evaluating your model via codemmlu:
code_mmlu.generate \
--model_name <your_model_name_or_path> \
--peft_model <your_peft_model_name_or_path> \
--subset all \
--batch_size 16 \
--backend [vllm|hf] \
--max_new_tokens 1024 \
--temperature 0.0 \
--output_dir <your_output_dir> \
--instruction_prefix <special_prefix> \
--assistant_prefix <special_prefix> \
--cache_dir <your_cache_dir>
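As a concrete sketch, a full run over all subsets with the vllm backend might look like this (the model name is illustrative, and we assume the prefix and cache flags may be omitted when not needed):
code_mmlu.generate \
    --model_name deepseek-ai/deepseek-coder-6.7b-instruct \
    --subset all \
    --batch_size 16 \
    --backend vllm \
    --max_new_tokens 1024 \
    --temperature 0.0 \
    --output_dir ./results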
List of CodeMMLU subsets:

| Subject | Subset |
|---|---|
| Syntactic test | programming_syntax |
| | api_frameworks |
| Semantic test | software_principles |
| | dbms_sql |
| | others |
| Real-world problem | code_completion |
| | fill_in_the_middle |
| | code_repair |
| | defect_detection |
List of supported backends:

| Backend | Decoder Model | LoRA |
|---|---|---|
| Transformers (hf) | ✅ | ✅ |
| vLLM (vllm) | ✅ | ✅ |
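Since both backends support LoRA, evaluating a fine-tuned adapter is a matter of passing --peft_model alongside the base model; a sketch (the model and adapter names below are placeholders):
code_mmlu.generate \
    --model_name meta-llama/Meta-Llama-3-8B-Instruct \
    --peft_model your-org/your-lora-adapter \
    --subset code_repair \
    --backend hf \
    --output_dir ./results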
📌 Citation
If you find this repository useful, please consider citing our paper:
@article{nguyen2024codemmlu,
title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
journal={arXiv preprint},
year={2024}
}