Learning to Find Usages of Library Functions in Optimized Binaries

Authors:

Toufique Ahmed,

Premkumar Devanbu,

Anand Ashok Sawant

IEEE Transactions on Software Engineering, Volume 48, Issue 10

Pages 3862 - 3876

https://doi.org/10.1109/TSE.2021.3106572

Published: 01 October 2022

Abstract

Much software, whether beneficent or malevolent, is distributed only as binaries, without source code. Absent source code, understanding a binary’s behavior can be quite challenging, especially when it was compiled at higher levels of compiler optimization. These optimizations can transform comprehensible, “natural” source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or of involving intellectual property theft, is an important and time-consuming task. There is a great deal of interest in tools that “decompile” binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central step in creating binaries is optimizing function calls, using steps such as inlining. Recovering these (possibly inlined) function calls from optimized binaries is an essential task that most state-of-the-art decompiler tools attempt but do not perform very well. In this paper, we evaluate a supervised learning approach to the problem of recovering optimized function calls. We leverage open-source software and develop an automated labeling scheme to generate a reasonably large dataset of binaries labeled with actual function usages. We augment this large but limited labeled dataset with a pre-training step, which learns the statistics of decompiled code from a much larger unlabeled dataset. Thus augmented, our learned labeling model can be combined with an existing decompilation tool, Ghidra, to achieve substantially improved performance in function call recovery, especially at higher levels of optimization.
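To make the notion of "function call recovery" concrete, the following minimal Python sketch scores a set of recovered library-function names against the ground truth for one optimized function. It is an illustration only: the function names, the example data, and the rule of combining the decompiler's output with the model's predictions by set union are assumptions, not details taken from the paper.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Score recovered library-function names against the ground truth."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data for one optimized function (not from the paper).
actual_calls = {"memcpy", "strlen", "strcmp", "free"}
decompiler_calls = {"free", "strcmp"}         # calls still visible after optimization
model_calls = {"memcpy", "strlen", "printf"}  # calls a learned model might predict

# Assumed combination rule: union of the two sources.
combined_calls = decompiler_calls | model_calls

decompiler_score = precision_recall(decompiler_calls, actual_calls)
combined_score = precision_recall(combined_calls, actual_calls)

print("decompiler only:", decompiler_score)
print("decompiler + model:", combined_score)
```

In this toy case the combined set recovers the inlined `memcpy` and `strlen` calls that the decompiler alone misses, at the cost of one spurious prediction (`printf`); the trade-off reported for real binaries may differ.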

Index Terms

Learning to Find Usages of Library Functions in Optimized Binaries
1. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
  2. Software notations and tools

Index terms have been assigned to the content through auto-classification.

Published In

IEEE Transactions on Software Engineering, Volume 48, Issue 10, Oct. 2022, 513 pages

ISSN: 0098-5589

0098-5589 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Publisher

IEEE Press
