CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Du, Kounianhua; Chen, Jizheng; Rui, Renting; Chai, Huacan; Fu, Lingyue; Xia, Wei; Wang, Yasheng; Tang, Ruiming; Yu, Yong; Zhang, Weinan

Computer Science > Software Engineering

arXiv:2405.02355 (cs)

[Submitted on 3 May 2024 (v1), last revised 2 Oct 2024 (this version, v2)]

Title:CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Authors:Kounianhua Du, Jizheng Chen, Renting Rui, Huacan Chai, Lingyue Fu, Wei Xia, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang

View PDF HTML (experimental)

Abstract:Utilizing large language models to generate codes has shown promising meaning in software development revolution. Despite the intelligence shown by the general large language models, their specificity in code generation can still be improved due to the syntactic gap and mismatched vocabulary existing among natural language and different programming languages. In this paper, we propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs. CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language, which can facilitate natural language based LLMs for better understanding of code syntax and serve as a bridge among different programming languages. To take the extracted structural knowledge into the foundation models, we propose 1) a hard meta-graph prompt template to transform the challenging graphical representation into informative knowledge for tuning-free models and 2) a soft prompting technique that injects the domain knowledge of programming languages into the model parameters via finetuning the models with the help of a pretrained GNN expert model. Various experiments and ablations are done on four datasets including both the C++ and python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for pretrained GNN expert. CodeGRAG improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation. The implementation is available at this https URL.

Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2405.02355 [cs.SE]
	(or arXiv:2405.02355v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2405.02355

Submission history

From: Kounianhua Du [view email]
[v1] Fri, 3 May 2024 02:48:55 UTC (884 KB)
[v2] Wed, 2 Oct 2024 14:37:01 UTC (1,321 KB)

Computer Science > Software Engineering

Title:CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Software Engineering

Title:CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators