CodeMMLU Evaluator: A framework for evaluating language models on the CodeMMLU benchmark.

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities

📰 News | 🚀 Quick Start | 📋 Evaluation | 📌 Citation

📌 About

CodeMMLU is a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in coding and software knowledge. It builds upon the structure of multiple-choice question answering (MCQA) to cover a wide range of programming tasks and domains, including code generation, defect detection, software engineering principles, and much more.

Why CodeMMLU?

  • CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources. It covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.

  • Precise and comprehensive: Check out our LEADERBOARD for the latest LLM rankings.

📰 News

[2024-10-13] We are releasing the CodeMMLU benchmark v0.0.1 and the preprint report HERE!

🚀 Quick Start

Install CodeMMLU and set up its dependencies via pip:

pip install codemmlu

Generate responses for the CodeMMLU MCQ benchmark:

codemmlu --model_name <your_model_name_or_path> \
  --subset <subset> \
  --backend <backend> \
  --output_dir <your_output_dir>
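
For scripted runs, the same command can be assembled from Python. The following is a minimal sketch and not part of the package: `build_codemmlu_command` is a hypothetical helper, the executable name `codemmlu` is taken from the usage text in the Evaluation section, and the model, subset, and path values in the example are placeholders.

```python
import shlex
import subprocess


def build_codemmlu_command(model_name, subset, backend, output_dir,
                           executable="codemmlu"):
    """Assemble the quick-start command as an argv list.

    Hypothetical helper: only flags shown in this README are used.
    """
    return [
        executable,
        "--model_name", model_name,
        "--subset", subset,
        "--backend", backend,
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own model, subset, and paths.
    cmd = build_codemmlu_command("path/to/model", "all", "hf", "./output")
    print(shlex.join(cmd))
    # Uncomment to launch the run (requires codemmlu to be installed):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a model path or output directory contains spaces.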

📋 Evaluation

Build codemmlu from source:

git clone https://github.com/Fsoft-AI4Code/CodeMMLU.git
cd CodeMMLU
pip install -e .

Note: If you prefer the vllm backend, we highly recommend installing vllm from the official project before installing codemmlu.
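
One way to confirm what is installed before a run is to ask the interpreter which distributions it can see. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper, and the distribution names `vllm` and `codemmlu` are assumed to match the pip package names.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report on the two packages mentioned in the note above.
    for name in ("vllm", "codemmlu"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```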

Generate responses to CodeMMLU questions:

codemmlu --model_name <your_model_name_or_path> \
  --peft_model <your_peft_model_name_or_path> \
  --subset all \
  --batch_size 16 \
  --backend [vllm|hf] \
  --max_new_tokens 1024 \
  --temperature 0.0 \
  --output_dir <your_output_dir> \
  --instruction_prefix <special_prefix> \
  --assistant_prefix <special_prefix> \
  --cache_dir <your_cache_dir>

API usage:

codemmlu [-h] [-V] [--subset SUBSET] [--batch_size BATCH_SIZE] [--instruction_prefix INSTRUCTION_PREFIX]
                [--assistant_prefix ASSISTANT_PREFIX] [--output_dir OUTPUT_DIR] [--model_name MODEL_NAME]
                [--peft_model PEFT_MODEL] [--backend BACKEND] [--max_new_tokens MAX_NEW_TOKENS]
                [--temperature TEMPERATURE] [--prompt_mode PROMPT_MODE] [--cache_dir CACHE_DIR] [--trust_remote_code]

==================== CodeMMLU ====================

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         Get version
  --subset SUBSET       Select evaluate subset
  --batch_size BATCH_SIZE
  --instruction_prefix INSTRUCTION_PREFIX
  --assistant_prefix ASSISTANT_PREFIX
  --output_dir OUTPUT_DIR
                        Save generation and result path
  --model_name MODEL_NAME
                        Local path or Huggingface Hub link to load model
  --peft_model PEFT_MODEL
                        Lora config
  --backend BACKEND     LLM generation backend (default: hf)
  --max_new_tokens MAX_NEW_TOKENS
                        Number of max new tokens
  --temperature TEMPERATURE
  --prompt_mode PROMPT_MODE
                        Prompt available: zeroshot, fewshot, cot_zs, cot_fs
  --cache_dir CACHE_DIR
                        Cache for save model download checkpoint and dataset
  --trust_remote_code
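
The help text above maps onto a standard argparse definition. The sketch below mirrors the documented flags so that a run configuration can be validated in a script before the real CLI is launched. It is an illustration reconstructed from the help text, not the package's own parser. Only the `hf` default for `--backend` is documented; the argument types and the restriction of `--prompt_mode` to the four listed values are assumptions.

```python
import argparse

# Prompt modes listed in the help text.
PROMPT_MODES = ["zeroshot", "fewshot", "cot_zs", "cot_fs"]


def build_parser():
    """Build a parser that mirrors the documented codemmlu flags (illustrative)."""
    parser = argparse.ArgumentParser(prog="codemmlu")
    parser.add_argument("--subset", help="Select evaluate subset")
    parser.add_argument("--batch_size", type=int)  # type is an assumption
    parser.add_argument("--instruction_prefix")
    parser.add_argument("--assistant_prefix")
    parser.add_argument("--output_dir", help="Save generation and result path")
    parser.add_argument("--model_name",
                        help="Local path or Huggingface Hub link to load model")
    parser.add_argument("--peft_model", help="Lora config")
    parser.add_argument("--backend", default="hf",
                        help="LLM generation backend (default: hf)")
    parser.add_argument("--max_new_tokens", type=int)  # type is an assumption
    parser.add_argument("--temperature", type=float)  # type is an assumption
    parser.add_argument("--prompt_mode", choices=PROMPT_MODES)
    parser.add_argument("--cache_dir")
    parser.add_argument("--trust_remote_code", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model_name", "path/to/model", "--prompt_mode", "cot_zs"]
    )
    print(args.backend, args.prompt_mode)
```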

List of supported backends:

Backend DecoderModel LoRA
Transformers (hf)
Vllm (vllm)

Leaderboard

To evaluate your model and submit your results to the leaderboard, please follow the instructions in data/README.md.

📌 Citation

If you find this repository useful, please consider citing our paper:

@article{nguyen2024codemmlu,
  title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities},
  author={Nguyen, Dung Manh and Phan, Thang Chau and Le, Nam Hai and Doan, Thong T. and Nguyen, Nam V. and Pham, Quang and Bui, Nghi D. Q.},
  journal={arXiv preprint},
  year={2024}
}
