The ability of CodeLLMs to generate executable and functionally correct code at repository scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository level. RepoExec focuses on three main aspects: executability, functional correctness verified through automated test case generation with high coverage, and carefully crafted cross-file contexts that models must use to generate code accurately. Our work explores a controlled scenario in which developers specify the necessary code dependencies, challenging the model to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in functional correctness, instruction-tuned models excel at utilizing the provided dependencies and at debugging. We also introduce a new instruction-tuning dataset focused on code dependencies and show that CodeLLMs fine-tuned on it leverage these dependencies more effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
RepoExec is a pioneering benchmark that places a strong emphasis on the executability and correctness of generated code. Unlike traditional benchmarks, RepoExec ensures that the code not only compiles but also performs as intended in real-world scenarios. This is achieved through an automated system that verifies installation and runtime requirements, and dynamically generates high-coverage test cases.
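To make this concrete, the sketch below shows the core of such an execution check: a candidate completion is written next to its generated tests and run under pytest in an isolated directory. This is a minimal illustration under our own assumptions (the helper name and the toy candidate/tests are hypothetical), not the actual RepoExec harness.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

def run_candidate_against_tests(candidate_code: str, test_code: str, timeout: int = 30) -> bool:
    """Write a candidate solution plus its tests to a temp file and run pytest.

    Returns True only if every test passes; any failure, error, or timeout counts
    as False. Requires pytest to be installed in the current environment.
    """
    with tempfile.TemporaryDirectory() as tmp:
        test_file = Path(tmp) / "test_candidate.py"
        test_file.write_text(candidate_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", str(test_file)],
                capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

# Hypothetical example: a generated function plus an auto-generated test.
candidate = textwrap.dedent("""
    def add(a, b):
        return a + b
""")
tests = textwrap.dedent("""
    def test_add():
        assert add(2, 3) == 5
""")
print(run_candidate_against_tests(candidate, tests))
```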
Key Features of RepoExec:
- Executability: generated code must install and run, not merely parse.
- Functional correctness: verified by automatically generated test cases with high coverage.
- Cross-file context: carefully crafted dependency contexts, so the model must integrate the code elements that developers specify.
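As a rough illustration of how dependency-aware tasks can be presented to a model, the sketch below assembles cross-file dependency code and a target function signature into a single prompt. The field names and the toy sample are hypothetical and do not reflect the exact RepoExec data schema.

```python
# Hypothetical field names and toy content; the real RepoExec schema may differ.
sample = {
    "target_signature": "def total_price(cart: list) -> float:",
    "docstring": "Return the total price of all items, applying each item's discount.",
    "dependencies": [
        # Cross-file context the model is expected to call into.
        "class Item:\n"
        "    def __init__(self, price, discount):\n"
        "        self.price = price\n"
        "        self.discount = discount\n"
        "\n"
        "    def final_price(self):\n"
        "        return self.price * (1 - self.discount)",
    ],
}

def build_prompt(sample: dict) -> str:
    """Assemble cross-file dependencies, then the target signature, into one prompt."""
    context = "\n\n".join(sample["dependencies"])
    header = "# Relevant code from other files in the repository:\n"
    instruction = "\n\n# Complete the following function, using the code above where appropriate.\n"
    body = sample["target_signature"] + '\n    """' + sample["docstring"] + '"""\n'
    return header + context + instruction + body

print(build_prompt(sample))
```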
The experiments conducted with RepoExec provide several insights into the capabilities of CodeLLMs for repository-level code generation:
- Pretrained LLMs outperform instruction-tuned models in functional correctness.
- Instruction-tuned models are better at utilizing the provided dependencies and at debugging their own output.
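For reference, functional correctness in settings like this is commonly reported with the unbiased pass@k estimator of Chen et al. (2021), while dependency usage can be approximated by checking which of the specified dependency names the generated code actually references. The snippet below sketches both; the `invoked_dependencies` helper is an illustrative approximation, not RepoExec's exact metric.

```python
import ast
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def invoked_dependencies(generated_code: str, dependency_names: set) -> set:
    """Rough dependency-invocation check: which expected names does the code reference?

    Walks the AST, collects every Name/Attribute identifier, and intersects the
    result with the set of dependency names specified for the task.
    """
    tree = ast.parse(generated_code)
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            used.add(node.id)
        elif isinstance(node, ast.Attribute):
            used.add(node.attr)
    return used & dependency_names

# Toy usage: 10 samples, 4 correct, pass@1 and the dependencies a snippet invokes.
print(pass_at_k(n=10, c=4, k=1))
print(invoked_dependencies("total = sum(i.final_price() for i in cart)", {"final_price", "Item"}))
```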
Two approaches are investigated to improve generated code in terms of both functional correctness and dependency invocation: letting the model revise its output using test execution feedback, and fine-tuning on the dependency-focused instruction dataset introduced with RepoExec.
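The sketch below shows one generic way a test-feedback (debugging) loop can be wrapped around an arbitrary CodeLLM. The `generate` and `run_tests` callables are assumed interfaces introduced for illustration; this is not the exact procedure used in the paper.

```python
def refine_with_test_feedback(generate, run_tests, prompt: str, max_rounds: int = 3) -> str:
    """Generic test-feedback loop: regenerate the solution until the tests pass.

    `generate(prompt) -> str` wraps an arbitrary CodeLLM (assumed interface), and
    `run_tests(code) -> (passed: bool, log: str)` executes the generated tests.
    """
    code = generate(prompt)
    for _ in range(max_rounds):
        passed, log = run_tests(code)
        if passed:
            break
        # Feed the failing test output back to the model and ask for a fix.
        prompt = (
            prompt
            + "\n\n# Previous attempt:\n" + code
            + "\n\n# The tests failed with the following output; please fix the code:\n# " + log
        )
        code = generate(prompt)
    return code
```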
@article{nam2024repoexec,
  title={RepoExec: Evaluate Code Generation with a Repository-Level Executable Benchmark},
  author={Hai, Nam Le and Manh, Dung Nguyen and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2406.11927},
  year={2024}
}