
CodeMMLU Evaluator: A framework for evaluating language models on the CodeMMLU multiple-choice question (MCQ) benchmark.


CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities


📌 About


CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It builds upon the structure of multiple-choice question answering (MCQA) to cover a wide range of programming tasks and domains, including code generation, defect detection, software engineering principles, and much more.
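To make the multiple-choice format concrete, here is a minimal sketch of how MCQA responses are commonly scored: extract the first standalone option letter from each model response and compare it with the answer key. The function names, the A-D option range, and the sample data are illustrative assumptions and are not part of the CodeMMLU package.

```python
import re
from typing import Optional, Sequence

# Assumption: options are labelled A-D. A standalone capital letter is taken
# as the model's choice, so a response starting with the article "A" would
# be misread; real extractors are usually stricter.
_OPTION = re.compile(r"\b([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter in a response, or None."""
    match = _OPTION.search(response)
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], answer_key: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the key."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    if not answer_key:
        return 0.0
    correct = sum(extract_choice(r) == k for r, k in zip(responses, answer_key))
    return correct / len(answer_key)


# Made-up data for illustration only.
sample_responses = ["The answer is B.", "C", "I am not sure."]
sample_key = ["B", "C", "A"]
print(accuracy(sample_responses, sample_key))
```

A response with no recognizable letter counts as incorrect in this sketch, which keeps the denominator equal to the number of questions.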

Why CodeMMLU?

  • CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.

  • Precise and comprehensive: Check out our LEADERBOARD for the latest LLM rankings.

📰 News

[2024-10-13] We are releasing the CodeMMLU benchmark v0.0.1 and a preprint report!

🚀 Quick Start

Install CodeMMLU and set up its dependencies via pip:

pip install codemmlu

Generate responses for the CodeMMLU MCQ benchmark:

code_mmlu.generate --model_name <your_model_name_or_path> \
  --subset <subset> \
  --backend <backend> \
  --output_dir <your_output_dir>

📋 Evaluation

Build codemmlu from source:

git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .

[!Note]

If you prefer the vllm backend, we highly recommend installing vllm from the official project before installing codemmlu.

Start evaluating your model via codemmlu:

code_mmlu.generate \
  --model_name <your_model_name_or_path> \
  --peft_model <your_peft_model_name_or_path> \
  --subset all \
  --batch_size 16 \
  --backend [vllm|hf] \
  --max_new_tokens 1024 \
  --temperature 0.0 \
  --output_dir <your_output_dir> \
  --instruction_prefix <special_prefix> \
  --assistant_prefix <special_prefix> \
  --cache_dir <your_cache_dir>
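When running many evaluations from a script, it can help to assemble the command programmatically. The sketch below builds the argument list for `code_mmlu.generate` using only flags that appear in the command above. The helper name `build_generate_command`, the model id, and the output path are placeholders invented for illustration. The sketch also assumes `--peft_model` may be omitted, as it is in the Quick Start command.

```python
import shlex
from typing import Dict, List


def build_generate_command(options: Dict[str, object]) -> List[str]:
    """Turn {"subset": "all", ...} into ["code_mmlu.generate", "--subset", "all", ...]."""
    command = ["code_mmlu.generate"]
    for flag, value in options.items():
        if value is None:  # skip options the caller left unset
            continue
        command += [f"--{flag}", str(value)]
    return command


# Placeholder values; every key is a flag shown in the command above.
options = {
    "model_name": "my-org/my-model",     # hypothetical model id
    "peft_model": None,                  # no LoRA adapter in this run
    "subset": "all",
    "batch_size": 16,
    "backend": "hf",
    "max_new_tokens": 1024,
    "temperature": 0.0,
    "output_dir": "./codemmlu_outputs",  # hypothetical path
}

command = build_generate_command(options)
print(shlex.join(command))  # shell-quoted form, handy for logging
# To actually launch the run: subprocess.run(command, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell-quoting problems with model paths that contain spaces.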

List of CodeMMLU subsets:

Subject              Subset
Syntactic test       programming_syntax
                     api_frameworks
Semantic test        software_principles
                     dbms_sql
                     others
Real-world problem   code_completion
                     fill_in_the_middle
                     code_repair
                     defect_detection
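For scripts that loop over subsets or validate a `--subset` argument, the subset list above can be captured as a small lookup table. This is only a sketch: the names `SUBSETS_BY_SUBJECT`, `all_subsets`, and `subject_of` are hypothetical and not provided by the codemmlu package.

```python
from typing import Dict, List

# Subject -> subset names, transcribed from the subset list above.
SUBSETS_BY_SUBJECT: Dict[str, List[str]] = {
    "Syntactic test": ["programming_syntax", "api_frameworks"],
    "Semantic test": ["software_principles", "dbms_sql", "others"],
    "Real-world problem": [
        "code_completion",
        "fill_in_the_middle",
        "code_repair",
        "defect_detection",
    ],
}


def all_subsets() -> List[str]:
    """Flatten the table into a single list, preserving the listed order."""
    return [name for names in SUBSETS_BY_SUBJECT.values() for name in names]


def subject_of(subset: str) -> str:
    """Return the subject a subset belongs to, or raise ValueError."""
    for subject, names in SUBSETS_BY_SUBJECT.items():
        if subset in names:
            return subject
    raise ValueError(f"unknown subset: {subset!r}")


for name in all_subsets():
    print(f"{subject_of(name)}: {name}")
```

Each name from `all_subsets()` can be passed as the `--subset` value to evaluate one subset at a time instead of using `all`.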

List of supported backends:

Backend             Decoder model   LoRA
Transformers (hf)
vLLM (vllm)

📌 Citation

If you find this repository useful, please consider citing our paper:

@article{nguyen2024codemmlu,
  title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
  author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
  journal={arXiv preprint},
  year={2024}
}

