CodeMMLU Evaluator: A framework for evaluating language models on the CodeMMLU multiple-choice question (MCQ) benchmark.

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities

📰 News • 🚀 Quick Start • 📋 Evaluation • 📌 Citation

📌 About

CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It uses a multiple-choice question answering (MCQA) format to cover a wide range of programming tasks and domains, including code generation, defect detection, and software engineering principles.

Why CodeMMLU?

CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.
Precise and comprehensive: Check out our LEADERBOARD for the latest LLM rankings.

📰 News

[2024-10-13] We are releasing the CodeMMLU benchmark v0.0.1 and a preprint report HERE!

🚀 Quick Start

Install CodeMMLU and set up its dependencies via pip:

pip install codemmlu

Generate responses for the CodeMMLU MCQ benchmark:

code_mmlu.generate --model_name <your_model_name_or_path> \
  --subset <subset> \
  --backend <backend> \
  --output_dir <your_output_dir>

📋 Evaluation

Build codemmlu from source:

git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .

Note: If you prefer the vllm backend, we highly recommend installing vllm from the official project before installing codemmlu.

Start evaluating your model via codemmlu:

code_mmlu.generate \
  --model_name <your_model_name_or_path> \
  --peft_model <your_peft_model_name_or_path> \
  --subset all \
  --batch_size 16 \
  --backend [vllm|hf] \
  --max_new_tokens 1024 \
  --temperature 0.0 \
  --output_dir <your_output_dir> \
  --instruction_prefix <special_prefix> \
  --assistant_prefix <special_prefix> \
  --cache_dir <your_cache_dir>

List of CodeMMLU subsets:

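Subject	Subset
Syntactic test	programming_syntax
Syntactic test	api_frameworks
Semantic test	software_principles
Semantic test	dbms_sql
Semantic test	others
Real-world problem	code_completion
Real-world problem	fill_in_the_middle
Real-world problem	code_repair
Real-world problem	defect_detection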
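
To run one subset at a time rather than `--subset all`, the command line can be assembled programmatically. The following is a minimal sketch, not part of codemmlu itself: `build_command` is a hypothetical helper, `my-model` and `results/` are placeholder values, and it assumes `--subset` accepts the subset names exactly as listed in the table above. It only prints the commands and does not run them.

```python
import shlex

# Subset names copied from the subset table in this README.
SUBSETS = [
    "programming_syntax",
    "api_frameworks",
    "software_principles",
    "dbms_sql",
    "others",
    "code_completion",
    "fill_in_the_middle",
    "code_repair",
    "defect_detection",
]


def build_command(model_name, subset, backend, output_dir):
    """Return the argument list for one code_mmlu.generate run (hypothetical helper)."""
    if subset != "all" and subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset}")
    # Only flags shown in the Quick Start section are used here.
    return [
        "code_mmlu.generate",
        "--model_name", model_name,
        "--subset", subset,
        "--backend", backend,
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # "my-model" and "results/" are placeholders, not defaults of the tool.
    for subset in SUBSETS:
        cmd = build_command("my-model", subset, "hf", f"results/{subset}")
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the list returned by `build_command` can be passed to `subprocess.run` once codemmlu is installed.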

List of supported backends:

Backend	Decoder model	LoRA
Transformers (hf)	✅	✅
vLLM (vllm)	✅	✅
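
Since vllm is an optional install, a script that wraps the evaluator may want to choose the `--backend` value based on what is available. A minimal sketch, assuming that a locally importable `vllm` package is a good enough signal; `pick_backend` is a hypothetical helper and not part of codemmlu:

```python
import importlib.util


def pick_backend(find_spec=importlib.util.find_spec):
    """Return "vllm" if the vllm package can be found, otherwise "hf".

    The find_spec parameter exists only so the lookup can be replaced in tests.
    """
    return "vllm" if find_spec("vllm") is not None else "hf"


if __name__ == "__main__":
    # The result is one of the two --backend values listed above.
    print(pick_backend())
```

The check only confirms that the package can be located; it does not verify that vllm works with the local GPU setup.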

📌 Citation

If you find this repository useful, please consider citing our paper:

@article{nguyen2024codemmlu,
  title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
  author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
  journal={arXiv preprint},
  year={2024}
}

This version

0.0.1

Oct 14, 2024

Download files

Source Distribution

codemmlu-0.0.1.tar.gz (15.8 kB)

Uploaded Oct 14, 2024 Source

Built Distribution

codemmlu-0.0.1-py3-none-any.whl (18.0 kB)

Uploaded Oct 14, 2024 Python 3

Hashes for codemmlu-0.0.1.tar.gz

Algorithm	Hash digest
SHA256	`836ca159af5e8407e8fa4e5342be4c644f2a968a0c69b924c47f6f73f961186e`
MD5	`0e304acf5f2ce781277483a133a3bc31`
BLAKE2b-256	`fd91185ec5a2a922caca4c76b101d5ce50a7857cea629a5173e854d6c296c805`

Hashes for codemmlu-0.0.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`6eb0a33d5bd486d4a0e19ad7bf6d8384b1fd99f701ede0031cf32c27f7f020a9`
MD5	`bb868a914ece315d28dd7026ea1e8d75`
BLAKE2b-256	`4ec802f48575ce7d777f4e64d6222d89ce0a5e9617f6094188e62683f33fb19e`
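
The SHA256 digests listed above can be used to check a manually downloaded file. A minimal sketch using only the Python standard library; `sha256_of` and `verify` are hypothetical helper names, and the script assumes the files sit in the current directory under their published names:

```python
import hashlib

# SHA256 digests copied from the hash tables above.
EXPECTED_SHA256 = {
    "codemmlu-0.0.1.tar.gz":
        "836ca159af5e8407e8fa4e5342be4c644f2a968a0c69b924c47f6f73f961186e",
    "codemmlu-0.0.1-py3-none-any.whl":
        "6eb0a33d5bd486d4a0e19ad7bf6d8384b1fd99f701ede0031cf32c27f7f020a9",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA256 digest matches expected_hex."""
    return sha256_of(path) == expected_hex.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_SHA256.items():
        try:
            status = "OK" if verify(name, expected) else "MISMATCH"
        except FileNotFoundError:
            status = "not downloaded"
        print(f"{name}: {status}")
```

A mismatch means the file differs from the one published for version 0.0.1 and should not be installed.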