Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models’ ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.
We introduce CodeMMLU, a novel benchmark designed to evaluate CodeLLMs' ability to understand and comprehend code through multiple-choice question answering (MCQA). This approach enables a deeper assessment of how CodeLLMs grasp coding concepts, moving beyond mere generation capabilities. Inspired by the MMLU dataset from natural language understanding, CodeMMLU offers a robust and easily evaluable methodology. Its key features, which also summarize our key contributions, are:

- **Scale and coverage:** over 10,000 multiple-choice questions spanning syntactic knowledge, semantic knowledge, and real-world software tasks.
- **Diverse tasks and languages:** code analysis, defect detection, and software engineering principles across multiple programming languages.
- **Understanding over generation:** an MCQA format that probes models' ability to reason about code rather than merely generate it, and that can be scored automatically and reliably.
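To make the MCQA format concrete, here is a minimal sketch of loading one CodeMMLU item and printing it as a multiple-choice question. The Hugging Face dataset ID (`Fsoft-AIC/CodeMMLU`), the split name, and the field names (`question`, `choices`, `answer`) are assumptions for illustration only; consult the dataset card for the actual schema.

```python
# Minimal sketch: inspect one CodeMMLU item as a multiple-choice question.
# Assumptions (not confirmed by this README): the benchmark is hosted on the
# Hugging Face Hub as "Fsoft-AIC/CodeMMLU" and each record exposes
# "question", "choices", and "answer" fields.
from datasets import load_dataset

dataset = load_dataset("Fsoft-AIC/CodeMMLU", split="test")  # hypothetical ID/split

sample = dataset[0]
print(sample["question"])                       # the code-understanding question
for letter, choice in zip("ABCD", sample["choices"]):
    print(f"  ({letter}) {choice}")             # the answer options
print("Gold answer:", sample["answer"])         # the reference label, e.g. "B"
```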
CodeMMLU reveals significant performance differences across models, as shown in the table below (scores are accuracy, in %). OpenAI's GPT-4o outperforms all evaluated models on CodeMMLU, showing consistently strong results across task types. Notably, despite not being the latest release, Meta-Llama-3-70B-Instruct achieves the highest score among the open-source models drawn from eight model families. While LLMs perform well on knowledge-based tasks, they struggle with real-world software tasks, particularly defect detection.
| Model family | Model name | Size (B) | Syntactic knowledge | Semantic knowledge | Real-world tasks | CodeMMLU |
|---|---|---|---|---|---|---|
| **Closed-source models** | | | | | | |
| Anthropic | Claude-3-sonnet@20240229 | - | 67.22 | 66.08 | 38.26 | 53.97 |
| OpenAI | GPT-4o-2024-05-13 | - | 60.41 | 57.82 | 77.18 | 67.00 |
| OpenAI | GPT-3.5-turbo-0613 | - | 61.68 | 53.64 | 45.26 | 51.70 |
| **Open-source models** | | | | | | |
| Meta Llama | CodeLlama-34b-Instruct-hf | 34 | 56.81 | 46.93 | 23.55 | 38.73 |
| Meta Llama | Meta-Llama-3-70B | 70 | 63.38 | 57.64 | 35.29 | 48.98 |
| Meta Llama | Meta-Llama-3-70B-Instruct | 70 | 64.90 | 62.96 | 60.84 | 62.45 |
| Meta Llama | Meta-Llama-3.1-70B | 70 | 64.09 | 59.00 | 8.22 | 37.56 |
| Meta Llama | Meta-Llama-3.1-70B-Instruct | 70 | 64.42 | 62.25 | 56.11 | 60.00 |
| Mistral | Mistral-7B-Instruct-v0.3 | 7 | 54.42 | 51.25 | 31.85 | 43.33 |
| Mistral | Mixtral-8x7B-Instruct-v0.1 | 46.7 | 61.17 | 54.89 | 24.90 | 42.96 |
| Mistral | Codestral-22B-v0.1 | 22 | 60.34 | 52.11 | 37.86 | 47.60 |
| Phi | Phi-3-medium-128k-instruct | 14 | 58.54 | 54.56 | 37.89 | 48.03 |
| Phi | Phi-3-mini-128k-instruct | 3.8 | 53.01 | 48.65 | 22.36 | 37.93 |
| Qwen | Qwen2-57B-A14B-Instruct | 57 | 61.34 | 57.48 | 30.48 | 46.34 |
| Qwen | CodeQwen1.5-7B-Chat | 7 | 49.66 | 46.58 | 56.37 | 49.82 |
| Yi | Yi-1.5-34B-Chat | 34 | 58.32 | 55.59 | 40.27 | 49.39 |
| Yi | Yi-1.5-9B-Chat | 9 | 55.64 | 55.06 | 37.15 | 47.23 |
| DeepSeek | DeepSeek-coder-7b-instruct-v1.5 | 7 | 56.67 | 47.90 | 28.46 | 41.21 |
| DeepSeek | DeepSeek-coder-33b-instruct | 33 | 53.65 | 46.11 | 21.47 | 36.60 |
| DeepSeek | DeepSeek-moe-16b-chat | 16.4 | 31.74 | 35.43 | 27.33 | 31.01 |
| DeepSeek | DeepSeek-Coder-V2-Lite-Instruct | 16 | 59.91 | 54.76 | 33.62 | 46.51 |
| InternLM | InternLM2-5-20b-chat | 20 | 57.85 | 55.51 | 30.44 | 44.89 |
| StarCoder2 | StarCoder2-15b-instruct-v0.1 | 15 | 56.58 | 49.07 | 42.79 | 47.94 |
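For context on how scores like those above are produced: MCQA benchmarks such as CodeMMLU are typically scored by extracting the predicted option letter from each model reply and computing accuracy against the gold label. The sketch below illustrates that scoring idea; it is a simplified stand-in, not the official CodeMMLU evaluation harness, and the letter-extraction heuristic is an assumption.

```python
import re
from typing import Iterable


def extract_choice(reply: str) -> str | None:
    """Pull the first standalone option letter (A-D) out of a model reply."""
    match = re.search(r"\b([A-D])\b", reply.strip().upper())
    return match.group(1) if match else None


def mcqa_accuracy(replies: Iterable[str], gold_labels: Iterable[str]) -> float:
    """Fraction of replies whose extracted letter matches the gold label."""
    pairs = list(zip(replies, gold_labels))
    correct = sum(extract_choice(reply) == gold.upper() for reply, gold in pairs)
    return correct / len(pairs) if pairs else 0.0


# Toy usage: two of the three replies match the gold labels (prints ~0.667).
print(mcqa_accuracy(["The answer is B.", "A", "I think (D)."], ["B", "C", "D"]))
```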
For more benchmark details, please check 👉 HERE 👈
@article{dung2024codemmlu,
title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs},
author={Manh, Dung Nguyen and Chau, Thang Phan and Hai, Nam Le and Doan, Thong T and Nguyen, Nam V and Pham, Quang and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2410.01999},
year={2024}
}