Functional Overlap Reranking for Neural Code Generation

FPT Software AI Center, Vietnam

Abstract

Code Large Language Models (CodeLLMs) have ushered in a new era of code generation. However, selecting the best code solution from all sampled CodeLLM outputs remains a challenge. Previous methods often overlooked the intricate functional similarities and interactions between clusters of solutions.

We introduce SRank, a novel reranking strategy for selecting the best solutions generated by CodeLLMs, focusing on modeling the relationships between clusters of solutions. By quantifying the functional overlap between solution clusters, our approach yields a better ranking of code solutions.

Empirically, our method achieves remarkable pass@1 scores. On the HumanEval benchmark, for instance, we achieve 69.66% pass@1 with Codex002, 75.31% with WizardCoder, 53.99% with StarCoder, and 60.55% with CodeGen, surpassing state-of-the-art code generation reranking methods such as CodeT and Coder-Reviewer on the same CodeLLMs by a significant margin (≈ 6.1% improvement on average). Even with a limited number of sampled solutions and test cases, our approach remains robust and superior, setting a new state of the art in code generation reranking.

How SRank Works

SRank introduces a new metric called "functional overlap" to quantify the similarity between clusters of code solutions based on their execution outputs. This allows us to identify the most representative cluster that exhibits maximum overlap with all other clusters. The intuition is that the cluster interacting most comprehensively with others is likely to be the most consistent and, therefore, the most promising cluster that contains optimal solutions.

The concept of "functional overlap" is illustrated in Figure 1. In essence, we execute the code solutions from each cluster on the same test inputs and compare their outputs: the degree to which the outputs match indicates the extent of functional overlap between the two clusters.
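To make this concrete, the sketch below is a simplification, assuming each cluster is summarized by the tuple of outputs its solutions produce on a shared set of test inputs; the function name and cluster representation are illustrative, not the paper's implementation.

# A minimal sketch of functional overlap: the fraction of shared test inputs
# on which two clusters produce identical outputs. Representing a cluster by
# a single output tuple is a simplifying assumption for illustration.
from typing import Any, Sequence

def functional_overlap(outputs_a: Sequence[Any], outputs_b: Sequence[Any]) -> float:
    """Return the fraction of test inputs on which the two clusters' outputs match."""
    assert len(outputs_a) == len(outputs_b), "clusters must be executed on the same test inputs"
    matches = sum(1 for a, b in zip(outputs_a, outputs_b) if a == b)
    return matches / len(outputs_a)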

The overall method pipeline of SRank is illustrated in Figure 2.
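As a rough sketch of this pipeline under simplifying assumptions (the run_on_tests helper and the size-weighted cluster scoring below are illustrative placeholders, not the paper's exact formulation): solutions are clustered by their execution outputs, each cluster is scored by its overlap with all other clusters, and a solution is picked from the top-scoring cluster.

# Sketch of an SRank-style reranking pipeline under simplifying assumptions.
# run_on_tests(solution, test_inputs) -> tuple of outputs is a hypothetical
# helper that executes a candidate solution on the generated test inputs.
from collections import defaultdict
from typing import Any, Callable, List, Tuple

def rerank(solutions: List[str],
           test_inputs: List[Any],
           run_on_tests: Callable[[str, List[Any]], Tuple[Any, ...]]) -> str:
    # 1. Cluster solutions by their output signature (a hashable tuple of outputs).
    clusters = defaultdict(list)
    for sol in solutions:
        clusters[run_on_tests(sol, test_inputs)].append(sol)
    signatures = list(clusters.keys())

    # 2. Functional overlap between two clusters: fraction of matching outputs.
    def overlap(sig_a, sig_b):
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(test_inputs)

    # 3. Score each cluster by its overlap with every other cluster,
    #    weighting neighbors by how many solutions they contain.
    scores = {
        sig: sum(len(clusters[other]) * overlap(sig, other)
                 for other in signatures if other != sig)
        for sig in signatures
    }

    # 4. Return a representative solution from the highest-scoring cluster.
    best = max(scores, key=scores.get)
    return clusters[best][0]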

Experimental Results

We evaluated SRank on several state-of-the-art CodeLLMs, including Codex, WizardCoder, StarCoder, and CodeGen. Our results show that SRank consistently outperforms existing code generation reranking methods such as CodeT and Coder-Reviewer, achieving significant improvements in functional correctness as measured by pass@1.
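For context, pass@1 for a reranking method is typically measured by checking, for each problem, whether the single top-ranked solution passes the benchmark's hidden test suite; a minimal sketch follows, where rank_solutions and passes_hidden_tests are hypothetical helpers rather than part of SRank's codebase.

# Minimal sketch of pass@1 evaluation for a reranker: a problem counts as
# solved if the top-ranked sampled solution passes the hidden tests.
from typing import Callable, Dict, List

def pass_at_1(problems: Dict[str, List[str]],
              rank_solutions: Callable[[str, List[str]], List[str]],
              passes_hidden_tests: Callable[[str, str], bool]) -> float:
    solved = 0
    for problem_id, sampled in problems.items():
        ranked = rank_solutions(problem_id, sampled)
        if ranked and passes_hidden_tests(problem_id, ranked[0]):
            solved += 1
    return solved / len(problems)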

HumanEval (pass@1, %)

Method            WizardCoder34B   WizardCoder15B   CodeGen2.5-Instruct   StarCoder   Codex002   CodeGen16B
Random            59.88            45.20            26.68                 32.55       37.06      22.78
Greedy            68.90            50.61            28.05                 39.63       47.00      29.70
CodeT             72.36            58.64            56.81                 50.51       65.80      36.70
Coder-Reviewer    -                49.37            45.63                 38.71       66.90      42.60
SRank             75.31            59.99            60.55                 53.99       69.66      43.07

MBPP-S (pass@1, %)

Method            WizardCoder34B   WizardCoder15B   CodeGen2.5-Instruct   StarCoder   Codex002   CodeGen16B
Random            54.37            45.72            34.60                 39.26       47.50      31.54
Greedy            60.42            51.29            42.86                 45.90       58.10      42.40
CodeT             63.39            58.18            55.02                 58.05       67.70      49.50
Coder-Reviewer    -                52.52            52.74                 49.48       64.70      50.30
SRank             64.14            59.01            57.02                 58.38       69.25      51.03

These results demonstrate the effectiveness of SRank in improving the accuracy of code generation across various CodeLLMs.

Conclusion

We propose SRank, a novel reranking strategy designed to extract optimal code generation solutions from CodeLLMs. By modeling the relationships between clusters of code solutions, we can more effectively identify the best solutions and improve the overall accuracy of code generation.

We showcase the state-of-the-art pass@1 performance of SRank across various well-known CodeLLMs, surpassing other reranking methods such as CodeT and Coder-Reviewer in extensive evaluations. Our findings also suggest that SRank can help address the challenges of code generation in real-world applications, offering a practical strategy for selecting superior solutions even when only a limited number of sampled solutions and test cases are available.

BibTeX

@inproceedings{to-etal-2024-functional,
  title      = {Functional Overlap Reranking for Neural Code Generation},
  author     = {To, Hung and Nguyen, Minh and Bui, Nghi},
  booktitle  = {Findings of the Association for Computational Linguistics: ACL 2024},
  year       = {2024},
  pages      = {3686--3704},
  publisher  = {Association for Computational Linguistics},
  url        = {https://aclanthology.org/2024.findings-acl.220},
}