LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Ullah, Saad; Han, Mingji; Pujar, Saurabh; Pearce, Hammond; Coskun, Ayse; Stringhini, Gianluca

arXiv:2312.12575 (cs)

[Submitted on 19 Dec 2023 (v1), last revised 24 Jul 2024 (this version, v3)]

Abstract: Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models like 'PaLM2' and 'GPT-4': by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general-purpose security assistants.
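As an illustration of the kind of robustness check the abstract describes (renaming functions or variables and seeing whether the verdict changes), here is a minimal sketch. It is not the SecLLMHolmes framework: `rename_identifiers`, `flip_rate`, and the keyword-based `naive_detector` standing in for an LLM are all hypothetical names invented for this example, and the two C snippets are toy inputs rather than scenarios from the paper's benchmark.

```python
import re
from typing import Callable, Dict, List


def rename_identifiers(code: str, mapping: Dict[str, str]) -> str:
    """Rename whole-word identifiers in a source string.

    This is a purely textual substitution (it would also touch matching
    words inside comments or string literals), which is enough for a sketch.
    """
    if not mapping:
        return code
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], code)


def flip_rate(
    snippets: List[str],
    mapping: Dict[str, str],
    detector: Callable[[str], bool],
) -> float:
    """Fraction of snippets whose verdict changes after renaming."""
    if not snippets:
        return 0.0
    flips = sum(
        detector(s) != detector(rename_identifiers(s, mapping)) for s in snippets
    )
    return flips / len(snippets)


def naive_detector(code: str) -> bool:
    """Hypothetical stand-in for an LLM verdict (True = 'vulnerable').

    It is deliberately name-sensitive: any identifier containing 'safe'
    makes it answer 'not vulnerable', regardless of what the code does.
    """
    return "strcpy" in code and "safe" not in code


# Toy inputs (assumptions for illustration, not taken from the paper).
snippets = [
    "void copy_input(char *dst, const char *src) { strcpy(dst, src); }",
    "int add(int a, int b) { return a + b; }",
]
mapping = {"copy_input": "safe_copy"}

print(rename_identifiers(snippets[0], mapping))
print(flip_rate(snippets, mapping, naive_detector))
```

The renaming leaves the program's behavior unchanged, so any change in verdict reflects sensitivity to surface naming rather than to the underlying bug. A real evaluation would replace `naive_detector` with calls to the model under test and would repeat each query, since the abstract also reports non-deterministic responses.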

Comments:	Accepted for publication in IEEE Symposium on Security and Privacy 2024
Subjects:	Cryptography and Security (cs.CR)
Cite as:	arXiv:2312.12575 [cs.CR]
	(or arXiv:2312.12575v3 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2312.12575

Submission history

From: Saad Ullah
[v1] Tue, 19 Dec 2023 20:19:43 UTC (1,194 KB)
[v2] Sat, 13 Apr 2024 20:55:53 UTC (1,200 KB)
[v3] Wed, 24 Jul 2024 07:49:14 UTC (1,085 KB)
