CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs

1FPT Software AI Center, 2Fulbright University, Viet Nam, 3Hanoi University of Science and Technology, 4VNU-HCM University of Science, Viet Nam

Abstract

Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models’ ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.

Overview

We introduce CodeMMLU, a novel benchmark designed to evaluate CodeLLMs' ability to understand and comprehend code through multiple-choice question answering (MCQA). This approach enables a deeper assessment of how CodeLLMs grasp coding concepts, moving beyond mere generation capabilities. Inspired by the MMLU dataset from natural language understanding, CodeMMLU offers a robust and easily evaluable methodology with the following key features:

  • Comprehensiveness: CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources, mitigating potential bias from limited evaluation data.
  • Diversity in task, domain, and language: The dataset covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.
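
To make the MCQA setup concrete, the sketch below shows one way an LLM could be scored on CodeMMLU-style items. The prompt wording, answer-extraction rule, and field names are illustrative assumptions, not the official CodeMMLU evaluation harness.

```python
import re

# A minimal sketch of MCQA-style scoring, assuming each item stores a question,
# lettered answer choices, and the gold answer letter.
def build_prompt(item):
    choices = "\n".join(f"({letter}) {text}" for letter, text in zip("ABCD", item["choices"]))
    return f"{item['question']}\n{choices}\nAnswer with a single letter (A, B, C, or D)."

def extract_answer(completion):
    # Take the first standalone capital letter A-D in the model's reply.
    match = re.search(r"\b([A-D])\b", completion)
    return match.group(1) if match else None

def accuracy(items, generate):
    # `generate` is any callable mapping a prompt string to a completion string,
    # e.g. a wrapper around an LLM API or a local model.
    correct = sum(extract_answer(generate(build_prompt(item))) == item["answer"] for item in items)
    return correct / len(items)

# Toy usage with a dummy "model" that always answers (A).
toy_items = [
    {"question": "Which keyword defines a function in Python?",
     "choices": ["def", "class", "lambda", "import"],
     "answer": "A"},
]
print(accuracy(toy_items, lambda prompt: "The answer is (A)."))  # -> 1.0
```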

CodeMMLU enables us to assess LLMs' capabilities in coding and software tasks from a novel perspective, extending beyond traditional code generation and completion. Our analysis reveals several notable findings: (1) previously unexplored bias issues in CodeLLMs, aligning with those observed in natural language MCQA tasks; (2) GPT-4 consistently achieves the highest average performance among closed-source models, while (3) the Meta-Llama family demonstrates the greatest accuracy among open-source models; (4) scaling laws related to model size hold partially within the same model family but not across families, suggesting a significant influence of pretraining data, methodology, and model architecture; (5) advanced prompting techniques, such as Chain-of-Thought (CoT), consistently degrade performance, raising concerns about CodeLLMs' reasoning abilities on complex, step-by-step tasks; and (6) when benchmarks like HumanEval are converted from open-ended code generation to MCQA format, LLMs perform worse, raising concerns about their real capability to understand and comprehend code. These findings highlight the current shortcomings of CodeLLMs and the intricate relationship between model architecture, training-data quality, and evaluation methods in determining performance on software-related tasks.
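
As an illustration of finding (6), the snippet below sketches how an open-ended generation problem could be recast as a multiple-choice item by pairing a reference solution with distractor implementations. This is a simplified assumption of the idea, not the paper's exact conversion procedure.

```python
import random

# Illustrative sketch: turn an open-ended code-generation task into an MCQ
# by mixing one reference solution with distractor solutions.
def to_mcq(task_prompt, reference_solution, distractors, seed=0):
    options = [reference_solution] + list(distractors)
    random.Random(seed).shuffle(options)
    letters = "ABCD"[: len(options)]
    answer = letters[options.index(reference_solution)]
    choice_block = "\n".join(f"({letter})\n{option}" for letter, option in zip(letters, options))
    question = f"Which option correctly completes the function below?\n\n{task_prompt}"
    return {"question": question, "choices": choice_block, "answer": answer}

item = to_mcq(
    'def add(a, b):\n    """Return the sum of a and b."""',
    "    return a + b",
    ["    return a - b", "    return a * b", "    return abs(a) + abs(b)"],
)
print(item["question"])
print(item["choices"])
print("Gold answer:", item["answer"])
```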

Our key contributions are:

  • We present the first MCQA benchmark for software and coding-related knowledge, addressing the need for diverse evaluation scenarios in the code domain. CodeMMLU enables the evaluation of LLMs' alignment with human inference in the software knowledge domain, similar to advancements in the NLP field.
  • CodeMMLU provides a thorough assessment of LLM capabilities, ensuring a substantial number of samples and diversity across tasks, domains, and languages. This enables a more nuanced understanding of an LLM's strengths and weaknesses, facilitating the development of models better aligned with the complexities and demands of the software domain.
  • Our experiments offer critical insights into LLM performance, highlighting the impact of factors such as model size, model family, and prompting techniques. This provides essential information to the community on effectively utilizing LLMs for specific tasks and domains in software engineering.

Overview of the CodeMMLU data creation pipeline. The blue region describes the process of collecting raw multiple-choice questions (MCQs) from open-source internet sources for the knowledge test set, while the pipeline for real-world problems is shown in the orange region.

Evaluation Results

CodeMMLU revealed significant performance differences across models, as shown in the table below. OpenAI's GPT-4o outperformed all other models on CodeMMLU, demonstrating its quality across diverse tasks. Notably, despite not being the latest model, the instruction-tuned Meta-Llama-3-70B achieved the highest score among open-source models from eight families. While LLMs perform well on knowledge-based tasks, they struggle with real-world problems, particularly defect detection tasks.

| Family | Model name | Size (B) | Syntactic knowledge | Semantic knowledge | Real-world tasks | CodeMMLU |
|---|---|---|---|---|---|---|
| Closed-source models | | | | | | |
| Anthropic | Claude-3-sonnet@20240229 | - | 67.22 | 66.08 | 38.26 | 53.97 |
| OpenAI | GPT-4o-2024-05-13 | - | 60.41 | 57.82 | 77.18 | 67.00 |
| OpenAI | GPT-3.5-turbo-0613 | - | 61.68 | 53.64 | 45.26 | 51.70 |
| Open-source models | | | | | | |
| Meta Llama | CodeLlama-34b-Instruct-hf | 34 | 56.81 | 46.93 | 23.55 | 38.73 |
| Meta Llama | Meta-Llama-3-70B | 70 | 63.38 | 57.64 | 35.29 | 48.98 |
| Meta Llama | Meta-Llama-3-70B-Instruct | 70 | 64.90 | 62.96 | 60.84 | 62.45 |
| Meta Llama | Meta-Llama-3.1-70B | 70 | 64.09 | 59.00 | 8.22 | 37.56 |
| Meta Llama | Meta-Llama-3.1-70B-Instruct | 70 | 64.42 | 62.25 | 56.11 | 60.00 |
| Mistral | Mistral-7B-Instruct-v0.3 | 7 | 54.42 | 51.25 | 31.85 | 43.33 |
| Mistral | Mixtral-8x7B-Instruct-v0.1 | 46.7 | 61.17 | 54.89 | 24.90 | 42.96 |
| Mistral | Codestral-22B-v0.1 | 22 | 60.34 | 52.11 | 37.86 | 47.60 |
| Phi | Phi-3-medium-128k-instruct | 14 | 58.54 | 54.56 | 37.89 | 48.03 |
| Phi | Phi-3-mini-128k-instruct | 3.8 | 53.01 | 48.65 | 22.36 | 37.93 |
| Qwen | Qwen2-57B-A14B-Instruct | 57 | 61.34 | 57.48 | 30.48 | 46.34 |
| Qwen | CodeQwen1.5-7B-Chat | 7 | 49.66 | 46.58 | 56.37 | 49.82 |
| Yi | Yi-1.5-34B-Chat | 34 | 58.32 | 55.59 | 40.27 | 49.39 |
| Yi | Yi-1.5-9B-Chat | 9 | 55.64 | 55.06 | 37.15 | 47.23 |
| DeepSeek | DeepSeek-coder-7b-instruct-v1.5 | 7 | 56.67 | 47.90 | 28.46 | 41.21 |
| DeepSeek | DeepSeek-coder-33b-instruct | 33 | 53.65 | 46.11 | 21.47 | 36.60 |
| DeepSeek | DeepSeek-moe-16b-chat | 16.4 | 31.74 | 35.43 | 27.33 | 31.01 |
| DeepSeek | DeepSeek-Coder-V2-Lite-Instruct | 16 | 59.91 | 54.76 | 33.62 | 46.51 |
| InternLM | InternLM2-5-20b-chat | 20 | 57.85 | 55.51 | 30.44 | 44.89 |
| StarCoder2 | StarCoder2-15b-instruct-v0.1 | 15 | 56.58 | 49.07 | 42.79 | 47.94 |
Summary of LLM family performance on CodeMMLU: evaluation results (accuracy %) of different language models across CodeMMLU tasks.

For more benchmark details, please check 👉 HERE 👈
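
For programmatic access, something like the following may work. Note that the Hugging Face dataset path, subset name, and field names below are assumptions and should be checked against the official release.

```python
from datasets import load_dataset

# Assumed dataset path and subset name; verify against the official CodeMMLU release.
ds = load_dataset("Fsoft-AIC/CodeMMLU", name="programming_syntax", split="test")
print(ds.column_names)  # e.g. question, choices, answer (field names may differ)
print(ds[0])            # inspect one multiple-choice item
```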

CodeMMLU accuracy by task across LLMs. While knowledge tasks follow the scaling law, real-world tasks pose greater challenges to LLMs, highlighting the influence of instruction tuning and data quality on CodeMMLU performance.

BibTeX

@article{dung2024codemmlu,
  title={CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs},
  author={Manh, Dung Nguyen and Chau, Thang Phan and Hai, Nam Le and Doan, Thong T and Nguyen, Nam V and Pham, Quang and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2410.01999v1},
  year={2024}
}