The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations?

Gu, Alex; Li, Wen-Ding; Jain, Naman; Olausson, Theo X.; Lee, Celine; Sen, Koushik; Solar-Lezama, Armando

Computer Science > Software Engineering

arXiv:2402.19475 (cs)

[Submitted on 29 Feb 2024]

Title: The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations?

Authors: Alex Gu, Wen-Ding Li, Naman Jain, Theo X. Olausson, Celine Lee, Koushik Sen, Armando Solar-Lezama

Abstract: While language models are increasingly proficient at code generation, they still frequently generate incorrect programs. Many of these programs are obviously wrong, but others are more subtle and pass weaker correctness checks, such as compiling successfully. In this work, we focus on these counterfeit samples: programs sampled from a language model that 1) have a high enough log-probability to be generated at a moderate temperature and 2) pass weak correctness checks. Overall, we discover that most models have a very shallow understanding of counterfeits, shown through three clear failure modes. First, models mistakenly classify them as correct. Second, models are worse at reasoning about the execution behaviour of counterfeits and often predict their execution results as if they were correct. Third, when models are asked to fix counterfeits, the likelihood of successfully repairing one is often even lower than that of sampling a correct program from scratch. Counterfeits also have very unexpected properties. First, counterfeit programs for problems that are easier for a model to solve are not necessarily easier to detect and are only slightly easier to execute and repair. Second, counterfeits from a given model are just as confusing to the model itself as they are to other models. Finally, both strong and weak models are able to generate counterfeit samples that equally challenge all models. In light of our findings, we recommend that care and caution be taken when relying on models to understand their own samples, especially when no external feedback is incorporated.
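
As a rough illustration of the counterfeit definition in the abstract, the following minimal Python sketch flags a sample as a counterfeit when it passes a weak check but fails a stronger one. Everything here is an assumption made for illustration, not the authors' pipeline: the weak check is taken to be "the source compiles", the strong check is a small set of hypothetical input/output tests, and the function names (`passes_weak_check`, `passes_strong_check`, `is_counterfeit`) and the `add_abs` task are invented. The log-probability condition is not modelled; the samples are assumed to have already been drawn from a model at a moderate temperature.

```python
from typing import Any, Sequence, Tuple

# A test case is (positional arguments, expected return value).
TestCase = Tuple[Tuple[Any, ...], Any]


def passes_weak_check(source: str) -> bool:
    """Assumed weak check: the sample compiles as Python source."""
    try:
        compile(source, "<sample>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def passes_strong_check(source: str, func_name: str,
                        tests: Sequence[TestCase]) -> bool:
    """Assumed strong check: the named function passes every test case."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # only run trusted or sandboxed samples
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def is_counterfeit(source: str, func_name: str,
                   tests: Sequence[TestCase]) -> bool:
    """Passes the weak check, yet is actually incorrect."""
    return passes_weak_check(source) and not passes_strong_check(
        source, func_name, tests)


# Hypothetical task: return |a| + |b|.
tests = [((1, 2), 3), ((-1, 2), 3)]
correct = "def add_abs(a, b):\n    return abs(a) + abs(b)\n"
subtle = "def add_abs(a, b):\n    return abs(a + b)\n"   # compiles, wrong on (-1, 2)
broken = "def add_abs(a, b)\n    return a + b\n"         # missing colon, fails to compile

labels = {
    "correct": is_counterfeit(correct, "add_abs", tests),
    "subtle": is_counterfeit(subtle, "add_abs", tests),
    "broken": is_counterfeit(broken, "add_abs", tests),
}
```

Under these assumptions only the `subtle` sample counts as a counterfeit: the `broken` sample is obviously wrong and fails the weak check, while the `correct` sample passes both checks.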

Comments:	54 pages, 25 figures
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2402.19475 [cs.SE]
	(or arXiv:2402.19475v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2402.19475

Submission history

From: Alex Gu
[v1] Thu, 29 Feb 2024 18:59:25 UTC (638 KB)