RepoExec: Evaluate Code Generation with a Repository-Level Executable Benchmark

1FPT Software AI Center, 2Fulbright University, Viet Nam, 3Hanoi University of Science and Technology, Viet Nam

Abstract

The ability of CodeLLMs to generate executable and functionally correct code at repository scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository level. RepoExec focuses on three main aspects: executability, functional correctness through automated test-case generation with high coverage, and carefully crafted cross-file contexts that enable accurate code generation. Our work explores a controlled scenario in which developers specify the necessary code dependencies, challenging the model to integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at utilizing provided dependencies and at debugging. We also introduce a new instruction-tuned dataset that focuses on code dependencies and show that CodeLLMs fine-tuned on it leverage these dependencies more effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.

Overview

RepoExec is a pioneering benchmark that places a strong emphasis on the executability and correctness of generated code. Unlike traditional benchmarks, RepoExec ensures that the code not only compiles but also performs as intended in real-world scenarios. This is achieved through an automated system that verifies installation and runtime requirements, and dynamically generates high-coverage test cases.
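For intuition, the snippet below sketches one way such an executability check could be wired up: install the repository's declared requirements, then run its test suite and treat a non-zero exit code as failure. The function name, the reliance on a requirements.txt file, and the use of pytest are illustrative assumptions, not RepoExec's actual pipeline.

```python
import subprocess
import sys
from pathlib import Path

def verify_repo_executability(repo_dir: str) -> bool:
    """Hypothetical sketch: install a repository's requirements and run its
    test suite, treating a zero exit code as 'installable and passing'."""
    repo = Path(repo_dir)

    # Install declared dependencies (assumes a requirements.txt is present).
    req = repo / "requirements.txt"
    if req.exists():
        install = subprocess.run(
            [sys.executable, "-m", "pip", "install", "-r", str(req)],
            capture_output=True, text=True,
        )
        if install.returncode != 0:
            return False  # installation failed, so the repo is not executable

    # Run the (generated) test cases; pytest exits with 0 only if all tests pass.
    tests = subprocess.run(
        [sys.executable, "-m", "pytest", str(repo), "-q"],
        capture_output=True, text=True,
    )
    return tests.returncode == 0
```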

Key Features of RepoExec:

  • Enhanced Executability: RepoExec goes beyond match-based evaluation to ensure that generated code can actually be installed and run in real-world environments, addressing a critical aspect of practical applicability.
  • Dynamic Test Case Generation: One of the standout features of RepoExec is its sophisticated mechanism for generating test cases. These test cases are designed to thoroughly assess the functionality of the generated code, ensuring that it performs the intended tasks correctly.
  • Dependency Usage Evaluation: RepoExec evaluates how effectively LLMs utilize code dependencies. This involves analyzing whether the models can accurately integrate and manage external libraries and dependencies, which is crucial for creating functional software at a repository level.
  • Dependency Invocation Rate (DIR): A novel metric introduced by RepoExec, DIR measures how often generated code actually invokes the provided dependencies. This metric gives deeper insight into the integration capabilities of LLMs and their potential for building more complex, interconnected software systems (a sketch of one possible computation follows this list).
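As a concrete illustration, the snippet below sketches how a DIR-style score could be computed, assuming the metric is the fraction of provided dependency names that the generated code actually references; RepoExec's exact definition may differ in detail.

```python
import ast

def dependency_invocation_rate(generated_code: str, dependencies: list[str]) -> float:
    """Hypothetical DIR-style metric: fraction of provided dependency names
    that the generated code references (as plain names or attributes)."""
    tree = ast.parse(generated_code)
    used_names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)} | \
                 {n.attr for n in ast.walk(tree) if isinstance(n, ast.Attribute)}
    if not dependencies:
        return 0.0
    invoked = [dep for dep in dependencies if dep in used_names]
    return len(invoked) / len(dependencies)

# Example: two of the three provided dependencies are invoked -> DIR = 2/3
code = "def area(r):\n    validate(r)\n    return 3.14 * square(r)"
print(dependency_invocation_rate(code, ["validate", "square", "normalize"]))
```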

Figure 1: Data Processing Pipeline of RepoExec

Evaluation Results

The experiments conducted using RepoExec have provided several valuable insights into the capabilities of LLMs in code generation:

  • Correctness: Pretrained LLMs achieve higher functional correctness (pass@k): the code they generate passes the benchmark's test cases more often than that of their instruction-tuned counterparts.
  • Dependency Management and Debugging: Instruction-tuned models, on the other hand, excel in managing dependencies and debugging. These models have demonstrated a better ability to handle the complexities of integrating external libraries and resolving issues that arise during the execution of the code.

Table 1: Pass@k (k = 1 and 5) and DIR results of various LLMs on RepoExec
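For reference, pass@k is commonly reported with the standard unbiased estimator of Chen et al. (2021); a minimal implementation, assuming n generations per problem of which c pass the tests, is sketched below.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of them correct) passes all test cases."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations per problem, 3 pass the tests, reported as pass@5
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```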

Enhancing Functional Correctness and Dependency Invocation Abilities

Two approaches are investigated to enhance the performance of generated code in terms of both functional correctness and dependency invocation.

  • Multi-round Debugging: Leveraging test execution outputs and incorporating self-refinement over multiple rounds substantially boosts a model's ability to generate correct code and to use the provided dependencies effectively (a minimal sketch of such a loop is given at the end of this section).

Figure 2: Improvement in the performance of several models on RepoExec after a 3-round debugging process.

  • Instruction tuning: RepoExec also comes with a valuable instruction-tuning training dataset. The results in Table 2 demonstrate the effectiveness of this approach with just a single round of generation (an illustrative shape for one training instance is also sketched at the end of this section).

Table 2: Improvement in the performance of several models on RepoExec after instruction tuning.
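A minimal sketch of the multi-round debugging loop described above: generate a candidate, execute the tests, and feed the failure log back to the model for refinement. The `generate` and `run_tests` callables are placeholder interfaces, and the repair-prompt wording is an assumption rather than RepoExec's actual format.

```python
from typing import Callable, Tuple

def multi_round_debug(
    generate: Callable[[str], str],                 # LLM call: prompt -> code
    run_tests: Callable[[str], Tuple[bool, str]],   # harness: code -> (passed, log)
    problem: str,                                   # target signature + dependency context
    max_rounds: int = 3,
) -> str:
    """Hypothetical debugging loop: generate, run the tests, and feed the
    failure log back for self-refinement until the tests pass or the
    round budget is exhausted."""
    code = generate(problem)
    for _ in range(max_rounds):
        passed, error_log = run_tests(code)
        if passed:
            break
        # Append execution feedback and ask the model to repair its attempt.
        repair_prompt = (
            f"{problem}\n\n# Previous attempt:\n{code}\n\n"
            f"# Test failures:\n{error_log}\n\n# Fix the code above."
        )
        code = generate(repair_prompt)
    return code
```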
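And an illustrative shape for one instruction-tuning instance, pairing a dependency-augmented prompt with the target implementation; the field names and prompt wording are hypothetical, not the released dataset's schema.

```python
# Hypothetical instruction-tuning instance; field names and wording are
# illustrative assumptions, not the released dataset's schema.
example = {
    "instruction": (
        "Complete the target function. You may call the provided dependencies.\n\n"
        "# Dependencies (defined elsewhere in the repository):\n"
        "def validate(radius): ...\n"
        "def square(x): ...\n\n"
        "# Target function:\n"
        "def circle_area(radius):\n"
        '    """Return the area of a circle with the given radius."""\n'
    ),
    "output": (
        "def circle_area(radius):\n"
        "    validate(radius)\n"
        "    return 3.141592653589793 * square(radius)\n"
    ),
}
```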

BibTeX

@article{nam2024repoexec,
  title={RepoExec: Evaluate Code Generation with a Repository-Level Executable Benchmark},
  author={Hai, Nam Le and Manh, Dung Nguyen and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2406.11927v1},
  year={2024}
}