CodeMMLU Evaluator: A framework for evaluating language models on CodeMMLU benchmark.

These details have not been verified by PyPI

Project links

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities

📰 News • 🚀 Quick Start • 📋 Evaluation • 📌 Citation

📌 About

CodeMMLU

CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It builds upon the structure of multiple-choice question answering (MCQA) to cover a wide range of programming tasks and domains, including code generation, defect detection, software engineering principles, and much more.

Why CodeMMLU?

CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.
Precise and comprehensive: Checkout our LEADERBOARD for latest LLM rankings.

📰 News

[2024-10-13] We are releasing CodeMMLU benchmark v0.0.1 and preprint report HERE!

🚀 Quick Start

Install CodeMMLU and setup dependencies via pip:

pip install codemmlu

Generate response for CodeMMLU MCQs benchmark:

code_mmlu --model_name <your_model_name_or_path> \
  --subset <subset> \
  --backend <backend> \
  --output_dir <your_output_dir>

📋 Evaluation

Build codemmlu from source:

git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .

[!Note]

If you prefer vllm backend, we highly recommend you install vllm from official project before install codemmlu.

Generating with CodeMMLU questions:

codemmlu --model_name <your_model_name_or_path> \
  --peft_model <your_peft_model_name_or_path> \
  --subset all \
  --batch_size 16 \
  --backend [vllm|hf] \
  --max_new_tokens 1024 \
  --temperature 0.0 \
  --output_dir <your_output_dir> \
  --instruction_prefix <special_prefix> \
  --assistant_prefix <special_prefix> \
  --cache_dir <your_cache_dir>

⏬ API Usage :: click to expand ::

codemmlu [-h] [-V] [--subset SUBSET] [--batch_size BATCH_SIZE] [--instruction_prefix INSTRUCTION_PREFIX]
                [--assistant_prefix ASSISTANT_PREFIX] [--output_dir OUTPUT_DIR] [--model_name MODEL_NAME]
                [--peft_model PEFT_MODEL] [--backend BACKEND] [--max_new_tokens MAX_NEW_TOKENS]
                [--temperature TEMPERATURE] [--prompt_mode PROMPT_MODE] [--cache_dir CACHE_DIR] [--trust_remote_code]

==================== CodeMMLU ====================

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         Get version
  --subset SUBSET       Select evaluate subset
  --batch_size BATCH_SIZE
  --instruction_prefix INSTRUCTION_PREFIX
  --assistant_prefix ASSISTANT_PREFIX
  --output_dir OUTPUT_DIR
                        Save generation and result path
  --model_name MODEL_NAME
                        Local path or Huggingface Hub link to load model
  --peft_model PEFT_MODEL
                        Lora config
  --backend BACKEND     LLM generation backend (default: hf)
  --max_new_tokens MAX_NEW_TOKENS
                        Number of max new tokens
  --temperature TEMPERATURE
  --prompt_mode PROMPT_MODE
                        Prompt available: zeroshot, fewshot, cot_zs, cot_fs
  --cache_dir CACHE_DIR
                        Cache for save model download checkpoint and dataset
  --trust_remote_code

List of supported backends:

Backend	DecoderModel	LoRA
Transformers (hf)	✅	✅
Vllm (vllm)	✅	✅

Leaderboard

To evaluate your model and submit your results to the leaderboard, please follow the instruction in data/README.md.

📌 Citation

If you find this repository useful, please consider citing our paper:

@article{nguyen2024codemmlu,
  title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
  author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
  journal={arXiv preprint},
  year={2024}
}

Project details

These details have not been verified by PyPI

Project links

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

0.0.2.1

Dec 4, 2024

This version

0.0.2

Dec 4, 2024

0.0.1

Oct 14, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

codemmlu-0.0.2.tar.gz (17.1 kB view details)

Uploaded Dec 4, 2024 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

codemmlu-0.0.2-py3-none-any.whl (19.2 kB view details)

Uploaded Dec 4, 2024 Python 3

File details

Details for the file codemmlu-0.0.2.tar.gz.

File metadata

Download URL: codemmlu-0.0.2.tar.gz
Upload date: Dec 4, 2024
Size: 17.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for codemmlu-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`598ce557888e4174a48d20d95a49408e859f367f3afe354c0ac25e30e041c5d8`
MD5	`30ad643d5e0ce04a217baa21c71506bb`
BLAKE2b-256	`f63eada75c6939971c48d324ab85fd575dc9684774d31c0926f888c4dd61aba3`

See more details on using hashes here.

File details

Details for the file codemmlu-0.0.2-py3-none-any.whl.

File metadata

Download URL: codemmlu-0.0.2-py3-none-any.whl
Upload date: Dec 4, 2024
Size: 19.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.10.14

File hashes

Hashes for codemmlu-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`611e396b8cdfcff31c9f57a19c81015e6078ff107dd24b2e0285170c93790e3e`
MD5	`38dadadacab19898ac010ad29a209008`
BLAKE2b-256	`908b2270dfd9a2d632b992f403286d3100f0681f3d87bef679a939649f705115`

See more details on using hashes here.

codemmlu 0.0.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities

📌 About

CodeMMLU

Why CodeMMLU?

📰 News

🚀 Quick Start

📋 Evaluation

Leaderboard

📌 Citation

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes