CodeMMLU Evaluator: A framework for evaluating language models on the CodeMMLU multiple-choice question (MCQ) benchmark.

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities

📰 News | 🚀 Quick Start | 📋 Evaluation | 📌 Citation

📌 About

CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It builds upon the structure of multiple-choice question answering (MCQA) to cover a wide range of programming tasks and domains, including code generation, defect detection, software engineering principles, and much more.
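To make the MCQA setup concrete, the sketch below shows one generic way to score multiple-choice responses: pull an option letter out of the model's free-text reply and compare it with the gold answer. The record layout, the `extract_choice` and `accuracy` helpers, and the A-D option range are all assumptions made for illustration; this is not CodeMMLU's actual data format or scoring code.

```python
import re
from typing import Dict, List, Optional

# Hypothetical records: a gold answer letter plus a raw model response.
# This layout is illustrative only, not the CodeMMLU data or output format.
SAMPLES: List[Dict[str, str]] = [
    {"answer": "B", "response": "The answer is B."},
    {"answer": "D", "response": "Answer: (D) because the loop never terminates."},
    {"answer": "A", "response": "I think C is correct."},
]

# Match a standalone capital letter in the assumed option range A-D.
_CHOICE = re.compile(r"\b([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter in the response, if any."""
    match = _CHOICE.search(response)
    return match.group(1) if match else None


def accuracy(samples: List[Dict[str, str]]) -> float:
    """Fraction of samples whose extracted letter equals the gold answer."""
    if not samples:
        return 0.0
    correct = sum(extract_choice(s["response"]) == s["answer"] for s in samples)
    return correct / len(samples)


if __name__ == "__main__":
    print(f"accuracy = {accuracy(SAMPLES):.3f}")
```

A first-match rule like this is deliberately naive (a response that begins with the article "A" would be misread), so a real harness would likely constrain the output format or use a stricter pattern.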

Why CodeMMLU?

  • CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.

  • Precise and comprehensive: Check out our LEADERBOARD for the latest LLM rankings.

📰 News

[2024-10-13] We are releasing the CodeMMLU benchmark v0.0.1 and its preprint report.

🚀 Quick Start

Install CodeMMLU and set up its dependencies via pip:

pip install codemmlu

Generate responses for the CodeMMLU MCQ benchmark:

code_mmlu.generate --model_name <your_model_name_or_path> \
  --subset <subset> \
  --backend <backend> \
  --output_dir <your_output_dir>

📋 Evaluation

Build codemmlu from source:

git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .

Note: If you prefer the vllm backend, we highly recommend installing vllm from the official project before installing codemmlu.

Start evaluating your model via codemmlu:

code_mmlu.generate \
  --model_name <your_model_name_or_path> \
  --peft_model <your_peft_model_name_or_path> \
  --subset all \
  --batch_size 16 \
  --backend [vllm|hf] \
  --max_new_tokens 1024 \
  --temperature 0.0 \
  --output_dir <your_output_dir> \
  --instruction_prefix <special_prefix> \
  --assistant_prefix <special_prefix> \
  --cache_dir <your_cache_dir>
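When the same evaluation has to be launched repeatedly, it can help to assemble the command programmatically. The sketch below is one way to do that in Python. It uses only flags that appear in the command above; the `GenerateConfig` class and `build_command` helper are hypothetical names, the default values simply mirror the example rather than the tool's real defaults, and it assumes that `--peft_model` and `--cache_dir` may be omitted.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GenerateConfig:
    """Hypothetical container for flags shown in the command above."""
    model_name: str
    output_dir: str
    subset: str = "all"
    backend: str = "hf"          # "vllm" or "hf", as listed in the command
    batch_size: int = 16         # mirrors the example value, not a known default
    max_new_tokens: int = 1024   # mirrors the example value
    temperature: float = 0.0     # mirrors the example value
    peft_model: Optional[str] = None  # assumed optional
    cache_dir: Optional[str] = None   # assumed optional


def build_command(cfg: GenerateConfig) -> List[str]:
    """Turn a config into an argument list for code_mmlu.generate."""
    if cfg.backend not in ("vllm", "hf"):
        raise ValueError(f"unsupported backend: {cfg.backend!r}")
    cmd = [
        "code_mmlu.generate",
        "--model_name", cfg.model_name,
        "--subset", cfg.subset,
        "--batch_size", str(cfg.batch_size),
        "--backend", cfg.backend,
        "--max_new_tokens", str(cfg.max_new_tokens),
        "--temperature", str(cfg.temperature),
        "--output_dir", cfg.output_dir,
    ]
    # Only pass the assumed-optional flags when a value was provided.
    if cfg.peft_model:
        cmd += ["--peft_model", cfg.peft_model]
    if cfg.cache_dir:
        cmd += ["--cache_dir", cfg.cache_dir]
    return cmd


if __name__ == "__main__":
    # Placeholder model name and output directory, for illustration only.
    cfg = GenerateConfig(model_name="my-org/my-model", output_dir="outputs")
    # Print the shell-quoted command; pass the list to subprocess.run to execute it.
    print(shlex.join(build_command(cfg)))
```

Building the command as a list avoids shell-quoting problems with model paths that contain spaces.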

List of CodeMMLU subsets:

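| Subject | Subset |
|---|---|
| Syntactic test | programming_syntax |
| Syntactic test | api_frameworks |
| Semantic test | software_principles |
| Semantic test | dbms_sql |
| Semantic test | others |
| Real-world problem | code_completion |
| Real-world problem | fill_in_the_middle |
| Real-world problem | code_repair |
| Real-world problem | defect_detection |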
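The subset names above can be kept in a small lookup so that a value passed to `--subset` is checked before a long run starts. This is a minimal sketch: the grouping of subsets under each subject is read off the table, and `SUBSETS_BY_SUBJECT`, `all_subsets`, and `subject_of` are hypothetical helper names, not part of the codemmlu package.

```python
from typing import Dict, List

# Subject -> subsets, as read from the table above.
SUBSETS_BY_SUBJECT: Dict[str, List[str]] = {
    "Syntactic test": ["programming_syntax", "api_frameworks"],
    "Semantic test": ["software_principles", "dbms_sql", "others"],
    "Real-world problem": [
        "code_completion",
        "fill_in_the_middle",
        "code_repair",
        "defect_detection",
    ],
}


def all_subsets() -> List[str]:
    """Flatten the mapping into a single list of subset names."""
    return [name for names in SUBSETS_BY_SUBJECT.values() for name in names]


def subject_of(subset: str) -> str:
    """Return the subject a subset belongs to, or raise ValueError if unknown."""
    for subject, names in SUBSETS_BY_SUBJECT.items():
        if subset in names:
            return subject
    raise ValueError(f"unknown subset: {subset!r}")


if __name__ == "__main__":
    for name in all_subsets():
        print(f"{name}: {subject_of(name)}")
```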

List of supported backends:

| Backend | DecoderModel | LoRA |
|---|---|---|
| Transformers (hf) | | |
| vLLM (vllm) | | |

📌 Citation

If you find this repository useful, please consider citing our paper:

@article{nguyen2024codemmlu,
  title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
  author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
  journal={arXiv preprint},
  year={2024}
}
