CodeMMLU Leaderboard
A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs
Notes
- Evaluated using CodeMMLU
- Models are ranked by Accuracy under greedy decoding (a minimal evaluation sketch follows these notes).
- "Size" refers to the number of activated model parameters during inference.
More Leaderboards
In addition to the CodeMMLU leaderboard, we recommend assessing LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as:
- RepoExec Leaderboard
- Bigcode-Bench Leaderboard
- EvalPlus Leaderboard
- Big Code Models Leaderboard
- Chatbot Arena Leaderboard
- CrossCodeEval
- ClassEval
- CRUXEval
- Code Lingua
- Evo-Eval
- HumanEval.jl - Julia version of HumanEval with EvalPlus test cases
- InfiCoder-Eval
- LiveCodeBench
- NaturalCodeBench
- RepoBench
- SWE-bench
Acknowledgements
- We thank the EvalPlus and BigCode teams for providing the leaderboard template.