CodeMMLU Evaluator: A framework for evaluating language models on the CodeMMLU benchmark.
CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities
📌 About
CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It uses a multiple-choice question answering (MCQA) format to cover a wide range of programming tasks and domains, including code generation, defect detection, and software engineering principles.
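To make the MCQA setup concrete, the sketch below shows one simple way to score multiple-choice responses: extract an option letter from each model response and compare it with the gold answer. This is an illustration only, not the parser CodeMMLU uses. The helper names, the four-option A-D assumption, and the regular expression are all hypothetical.

```python
import re
from typing import List, Optional

# Assumption: each question has options labeled A-D. The pattern takes the
# first standalone capital letter in that range, which is a deliberately
# naive heuristic (it would misread a response starting with the word "A").
ANSWER_PATTERN = re.compile(r"\b([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter in the response, if any."""
    match = ANSWER_PATTERN.search(response.strip())
    return match.group(1) if match else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


responses = ["The answer is B.", "C", "(D) because ...", "I am not sure"]
gold = ["B", "C", "A", "D"]
score = accuracy(responses, gold)
print(score)
```

Responses with no recognizable letter count as wrong here. A stricter evaluator might instead require a fixed answer template.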
Why CodeMMLU?
- CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.
- Precise and comprehensive: Check out our leaderboard for the latest LLM rankings.
📰 News
[2024-10-13] We are releasing the CodeMMLU benchmark v0.0.1 and its preprint report!
🚀 Quick Start
Install CodeMMLU and set up its dependencies via pip:

    pip install codemmlu
Generate responses for the CodeMMLU MCQ benchmark:

    codemmlu --model_name <your_model_name_or_path> \
        --subset <subset> \
        --backend <backend> \
        --output_dir <your_output_dir>
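If you drive the CLI from a script, the same invocation can be assembled as an argument list. The following is a minimal sketch that uses only the flags shown above. The function names, the model identifier, and the output directory are placeholders chosen for illustration, and the `hf` default mirrors the default stated in the CLI help.

```python
import shlex
import subprocess
from typing import List


def build_codemmlu_command(
    model_name: str,
    subset: str,
    backend: str = "hf",
    output_dir: str = "./output",
) -> List[str]:
    """Assemble the quick-start command as a list suitable for subprocess."""
    return [
        "codemmlu",
        "--model_name", model_name,
        "--subset", subset,
        "--backend", backend,
        "--output_dir", output_dir,
    ]


def run_codemmlu(command: List[str]) -> None:
    """Run the command; requires the codemmlu package to be installed."""
    subprocess.run(command, check=True)


# "my-org/my-model" is a hypothetical model identifier.
cmd = build_codemmlu_command("my-org/my-model", subset="all")
print(shlex.join(cmd))
```

Passing a list rather than a single string avoids shell quoting problems when a model path contains spaces. Call `run_codemmlu(cmd)` once the package is installed.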
📋 Evaluation
Build codemmlu from source:

    git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
    cd CodeMMLU
    pip install -e .
> [!Note]
> If you prefer the vllm backend, we highly recommend installing vllm from the official project before installing codemmlu.
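Because vllm is installed separately, a script may want to check for it before requesting that backend. The sketch below is one possible approach, not part of CodeMMLU: `pick_backend` is a hypothetical helper, and falling back to `hf` is an assumed policy based on `hf` being the documented default backend.

```python
import importlib.util


def pick_backend(preferred: str = "vllm") -> str:
    """Return the preferred backend, or "hf" if vllm is requested but missing."""
    # find_spec returns None when a top-level package is not importable.
    if preferred == "vllm" and importlib.util.find_spec("vllm") is None:
        return "hf"
    return preferred


backend = pick_backend("vllm")
print(f"Using backend: {backend}")
```

The returned value can be passed to the `--backend` flag.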
Generate responses for CodeMMLU questions:

    codemmlu --model_name <your_model_name_or_path> \
        --peft_model <your_peft_model_name_or_path> \
        --subset all \
        --batch_size 16 \
        --backend [vllm|hf] \
        --max_new_tokens 1024 \
        --temperature 0.0 \
        --output_dir <your_output_dir> \
        --instruction_prefix <special_prefix> \
        --assistant_prefix <special_prefix> \
        --cache_dir <your_cache_dir>
API Usage

    codemmlu [-h] [-V] [--subset SUBSET] [--batch_size BATCH_SIZE] [--instruction_prefix INSTRUCTION_PREFIX]
             [--assistant_prefix ASSISTANT_PREFIX] [--output_dir OUTPUT_DIR] [--model_name MODEL_NAME]
             [--peft_model PEFT_MODEL] [--backend BACKEND] [--max_new_tokens MAX_NEW_TOKENS]
             [--temperature TEMPERATURE] [--prompt_mode PROMPT_MODE] [--cache_dir CACHE_DIR] [--trust_remote_code]

    ==================== CodeMMLU ====================

    optional arguments:
      -h, --help            show this help message and exit
      -V, --version         Get version
      --subset SUBSET       Select evaluate subset
      --batch_size BATCH_SIZE
      --instruction_prefix INSTRUCTION_PREFIX
      --assistant_prefix ASSISTANT_PREFIX
      --output_dir OUTPUT_DIR
                            Save generation and result path
      --model_name MODEL_NAME
                            Local path or Huggingface Hub link to load model
      --peft_model PEFT_MODEL
                            Lora config
      --backend BACKEND     LLM generation backend (default: hf)
      --max_new_tokens MAX_NEW_TOKENS
                            Number of max new tokens
      --temperature TEMPERATURE
      --prompt_mode PROMPT_MODE
                            Prompt available: zeroshot, fewshot, cot_zs, cot_fs
      --cache_dir CACHE_DIR
                            Cache for save model download checkpoint and dataset
      --trust_remote_code
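The help text lists four prompt modes (zeroshot, fewshot, cot_zs, cot_fs). To compare them for one model, a script could prepare one command per mode, each writing to its own output directory. The sketch below assumes that layout. The function name, the model identifier, and the `runs` directory are hypothetical, and only flags from the help text above are used.

```python
from pathlib import Path
from typing import Dict, List

# Prompt modes as listed in the CLI help text.
PROMPT_MODES = ["zeroshot", "fewshot", "cot_zs", "cot_fs"]


def commands_for_all_modes(
    model_name: str,
    base_output_dir: str,
    subset: str = "all",
) -> Dict[str, List[str]]:
    """Build one codemmlu command per prompt mode, keyed by mode name."""
    commands = {}
    for mode in PROMPT_MODES:
        # Assumed layout: a separate subdirectory per prompt mode.
        output_dir = Path(base_output_dir) / mode
        commands[mode] = [
            "codemmlu",
            "--model_name", model_name,
            "--subset", subset,
            "--prompt_mode", mode,
            "--output_dir", str(output_dir),
        ]
    return commands


# "my-org/my-model" and "runs" are placeholder values.
mode_commands = commands_for_all_modes("my-org/my-model", "runs")
for mode, command in mode_commands.items():
    print(mode, " ".join(command))
```

Each command list can then be passed to `subprocess.run(command, check=True)` in turn.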
List of supported backends:
| Backend | Decoder model | LoRA |
|---|---|---|
| Transformers (hf) | ✅ | ✅ |
| vLLM (vllm) | ✅ | ✅ |
Leaderboard
To evaluate your model and submit your results to the leaderboard, please follow the instructions in data/README.md.
📌 Citation
If you find this repository useful, please consider citing our paper:
@article{nguyen2024codemmlu,
title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
journal={arXiv preprint},
year={2024}
}