Learning to Find Usages of Library Functions in Optimized Binaries

Published: 01 October 2022

Abstract

Much software, whether beneficent or malevolent, is distributed only as binaries, sans source code. Absent source code, understanding a binary's behavior can be quite challenging, especially when it was compiled at higher levels of compiler optimization. These optimizations can transform comprehensible, “natural” source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or of involving intellectual property theft, is an important and time-consuming task. There is a great deal of interest in tools that “decompile” binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central step in creating binaries is optimizing function calls, using steps such as inlining. Recovering these (possibly inlined) function calls from optimized binaries is an essential task that most state-of-the-art decompiler tools attempt but do not perform very well. In this paper, we evaluate a supervised learning approach to the problem of recovering optimized function calls. We leverage open-source software and develop an automated labeling scheme to generate a reasonably large dataset of binaries labeled with actual function usages. We augment this large but limited labeled dataset with a pre-training step, which learns the statistics of decompiled code from a much larger unlabeled dataset. Thus augmented, our learned labeling model can be combined with an existing decompilation tool, Ghidra, to achieve substantially improved performance in function call recovery, especially at higher levels of optimization.

Published In

IEEE Transactions on Software Engineering, Volume 48, Issue 10, Oct. 2022, 513 pages

Publisher: IEEE Press
