DOI: 10.1145/2381896.2381910

Tracking concept drift in malware families

Published: 19 October 2012

Abstract

Previous efforts to apply machine learning to malware detection have assumed that the malware population is stationary, i.e., that the probability distribution of the observed characteristics (features) of the population does not change over time. In this paper, we investigate this assumption for malware families as populations. Malware, by design, constantly evolves so as to defeat detection, and this evolution may lead to a nonstationary malware population. In machine learning, the problem of nonstationary populations is called concept drift. Tracking concept drift is critical to the successful application of ML-based methods for malware detection: if evolution causes the malware population to drift rapidly, classifiers may require frequent retraining to prevent degradation in performance; if, on the other hand, the drift is negligible, then ML-based methods remain robust for such populations over long periods of time.
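In practice, drift tracking amounts to comparing the feature distribution of newer samples against older ones. The following Python sketch illustrates that idea under stated assumptions: features arrive as per-time-window count vectors, cosine similarity stands in for the paper's measures, and the 0.9 alert threshold is an arbitrary illustrative choice, not the paper's method.

```python
# A minimal sketch, assuming features arrive as per-window Counter
# vectors: flag candidate drift points where one time window's feature
# distribution diverges from the previous window's. The cosine measure
# and the 0.9 threshold are illustrative assumptions, not the paper's
# "relative temporal similarity".
from collections import Counter
import math

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def drift_points(windows: list[Counter], threshold: float = 0.9) -> list[tuple[int, float]]:
    """Return (window index, similarity) pairs where similarity to the
    preceding window falls below the threshold, i.e., candidate drift."""
    return [
        (i, sim)
        for i in range(1, len(windows))
        if (sim := cosine_similarity(windows[i - 1], windows[i])) < threshold
    ]
```

If similarity between consecutive windows stays high, the population is effectively stationary for this feature type and a trained classifier can be left in place; a sustained drop signals that retraining is due.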
We propose two measures for tracking concept drift in malware families when feature sets are very large: relative temporal similarity and metafeatures. We illustrate the use of the proposed measures with a study of more than 3,500 samples from three families of x86 malware, spanning over five years. The results show negligible drift in mnemonic 2-grams extracted from unpacked versions of the samples. The measures can likewise be applied to track drift in any number of malware families. Tracking drift in this manner also provides a novel method for feature type selection: use the feature type that drifts the least.
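For concreteness, the extraction of mnemonic 2-gram features of the kind studied here can be sketched as below. The sketch assumes a sample has already been unpacked and disassembled into an ordered list of instruction mnemonics (e.g., by a standard disassembler); the toy samples are invented for illustration and nothing here reproduces the paper's actual pipeline.

```python
# Minimal sketch, assuming disassembly has already yielded an ordered
# list of x86 instruction mnemonics for an unpacked sample.
from collections import Counter

def mnemonic_2grams(mnemonics: list[str]) -> Counter:
    """Count adjacent mnemonic pairs, e.g., ('push', 'mov')."""
    return Counter(zip(mnemonics, mnemonics[1:]))

# Toy samples; in a family-level study, per-sample vectors would be
# aggregated by time period before computing temporal similarity.
sample_a = ["push", "mov", "call", "test", "jz", "mov", "call"]
sample_b = ["push", "mov", "call", "test", "jnz", "xor", "ret"]
shared = mnemonic_2grams(sample_a) & mnemonic_2grams(sample_b)  # min counts
print(shared)
```

Count vectors like these, aggregated per family and per time period, are the kind of input over which drift measures such as those above can be computed.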




Published In

AISec '12: Proceedings of the 5th ACM workshop on Security and artificial intelligence
October 2012
116 pages
ISBN: 9781450316644
DOI: 10.1145/2381896

Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. concept drift
  2. malware
  3. metafeatures
  4. temporal similarity

Qualifiers

  • Research-article

Conference

CCS '12: the ACM Conference on Computer and Communications Security
October 19, 2012
Raleigh, North Carolina, USA

Acceptance Rates

AISec '12 paper acceptance rate: 10 of 24 submissions (42%)
Overall acceptance rate: 94 of 231 submissions (41%)


