DOI: 10.1145/3564625.3567985

Transformer-Based Language Models for Software Vulnerability Detection

Published: 05 December 2022

Abstract

Large transformer-based language models demonstrate excellent performance in natural language processing. Given that the knowledge these models gain in one domain transfers to related domains, and that natural languages are close to high-level programming languages such as C/C++, this work studies how to leverage (large) transformer-based language models for detecting software vulnerabilities and how well these models perform on vulnerability detection tasks. To this end, we first present a systematic, cohesive framework that covers source code translation, model preparation, and inference. We then perform an empirical analysis on software vulnerability datasets of C/C++ source code containing multiple vulnerability types related to library function calls, pointer usage, array usage, and arithmetic expressions. Our empirical results demonstrate the strong performance of the language models in vulnerability detection; moreover, these language models achieve better performance metrics, such as F1-score, than contemporary models, namely bidirectional long short-term memory (BiLSTM) and bidirectional gated recurrent unit (BiGRU) networks. Experimenting with language models is challenging because of the computing resources, platforms, libraries, and dependencies they require. This paper therefore also analyses popular platforms for efficiently fine-tuning these models and presents recommendations for choosing platforms within our framework.
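To make the described framework concrete, the sketch below fine-tunes a pretrained code-aware encoder for binary vulnerability classification with the Hugging Face Transformers library, then runs inference on an unseen snippet. It is a minimal illustration under stated assumptions, not the authors' exact pipeline: the checkpoint (microsoft/codebert-base), the hyperparameters, and the two toy C snippets are placeholders standing in for the paper's preprocessed C/C++ datasets.

```python
# Minimal sketch: fine-tune a pretrained transformer to flag C/C++ snippets
# as vulnerable (1) or not vulnerable (0). Checkpoint, hyperparameters, and
# toy data are illustrative assumptions, not the paper's exact setup.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/codebert-base"  # any code-aware encoder could be swapped in
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy snippets standing in for preprocessed code slices from a real dataset.
snippets = ["strcpy(buf, user_input);",                    # unbounded copy
            "strncpy(buf, user_input, sizeof(buf) - 1);"]  # bounded copy
labels = [1, 0]

class CodeDataset(Dataset):
    """Tokenized code snippets paired with vulnerability labels."""
    def __init__(self, codes, labels):
        self.enc = tokenizer(codes, truncation=True, padding="max_length",
                             max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="vuln-clf", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=CodeDataset(snippets, labels)).train()

# Inference: classify an unseen snippet (1 = predicted vulnerable).
model.eval()
enc = tokenizer("memcpy(dst, src, n);", return_tensors="pt", truncation=True)
enc = {k: v.to(model.device) for k, v in enc.items()}
with torch.no_grad():
    print(model(**enc).logits.argmax(-1).item())
```

In practice, the toy snippets would be replaced by the code slices produced in the framework's source code translation step, and the same Trainer setup extends to other checkpoints (e.g., BERT or GPT-2 variants) evaluated in the paper.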



    Published In

    ACSAC '22: Proceedings of the 38th Annual Computer Security Applications Conference
    December 2022, 1021 pages
    ISBN: 9781450397599
    DOI: 10.1145/3564625

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. BERT
    2. GPT-2
    3. Software vulnerability detection
    4. transformer-based models

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • US Army International Technology Center Indo-Pacific

    Conference

    ACSAC

    Acceptance Rates

    Overall Acceptance Rate 104 of 497 submissions, 21%
