DOI: 10.1145/2381896.2381910

Tracking concept drift in malware families

Published: 19 October 2012

Abstract

Previous efforts to apply machine learning to malware detection have assumed that the malware population is stationary, i.e., that the probability distribution of the observed characteristics (features) of the population does not change over time. In this paper, we investigate this assumption for malware families as populations. Malware, by design, constantly evolves so as to defeat detection, and this evolution may lead to a nonstationary malware population. In machine learning, the problem of nonstationary populations is called concept drift. Tracking concept drift is critical to the successful application of ML-based methods for malware detection: if evolution causes the malware population to drift rapidly, classifiers may require frequent retraining to prevent degradation in performance; if, on the other hand, the drift is negligible, then ML-based methods remain robust for such populations over long periods of time.
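In practice, drift tracking amounts to comparing the feature distribution of newer samples against older ones. The following Python sketch illustrates that idea under stated assumptions: features arrive as per-time-window count vectors, cosine similarity stands in for the paper's measures, and the 0.9 alert threshold is an arbitrary illustrative choice, not the paper's method.

```python
# A minimal sketch, assuming features arrive as per-window Counter
# vectors: flag candidate drift points where one time window's feature
# distribution diverges from the previous window's. The cosine measure
# and the 0.9 threshold are illustrative assumptions, not the paper's
# "relative temporal similarity".
from collections import Counter
import math

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def drift_points(windows: list[Counter], threshold: float = 0.9) -> list[tuple[int, float]]:
    """Return (window index, similarity) pairs where similarity to the
    preceding window falls below the threshold, i.e., candidate drift."""
    return [
        (i, sim)
        for i in range(1, len(windows))
        if (sim := cosine_similarity(windows[i - 1], windows[i])) < threshold
    ]
```

If similarity between consecutive windows stays high, the population is effectively stationary for this feature type and a trained classifier can be left in place; a sustained drop signals that retraining is due.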
We propose two measures for tracking concept drift in malware families when feature sets are very large: relative temporal similarity and metafeatures. We illustrate the use of the proposed measures with a study of more than 3,500 samples from three families of x86 malware, spanning over five years. The results show negligible drift in mnemonic 2-grams extracted from unpacked versions of the samples. The measures can likewise be applied to track drift in any number of malware families. Tracking drift in this manner also provides a novel method for feature type selection: use the feature type that drifts the least.
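For concreteness, the extraction of mnemonic 2-gram features of the kind studied here can be sketched as below. The sketch assumes a sample has already been unpacked and disassembled into an ordered list of instruction mnemonics (e.g., by a standard disassembler); the toy samples are invented for illustration and nothing here reproduces the paper's actual pipeline.

```python
# Minimal sketch, assuming disassembly has already yielded an ordered
# list of x86 instruction mnemonics for an unpacked sample.
from collections import Counter

def mnemonic_2grams(mnemonics: list[str]) -> Counter:
    """Count adjacent mnemonic pairs, e.g., ('push', 'mov')."""
    return Counter(zip(mnemonics, mnemonics[1:]))

# Toy samples; in a family-level study, per-sample vectors would be
# aggregated by time period before computing temporal similarity.
sample_a = ["push", "mov", "call", "test", "jz", "mov", "call"]
sample_b = ["push", "mov", "call", "test", "jnz", "xor", "ret"]
shared = mnemonic_2grams(sample_a) & mnemonic_2grams(sample_b)  # min counts
print(shared)
```

Count vectors like these, aggregated per family and per time period, are the kind of input over which drift measures such as those above can be computed.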




Published In

AISec '12: Proceedings of the 5th ACM workshop on Security and artificial intelligence
October 2012
116 pages
ISBN: 9781450316644
DOI: 10.1145/2381896

Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. concept drift
  2. malware
  3. metafeatures
  4. temporal similarity

Qualifiers

  • Research-article

Conference

CCS '12: the ACM Conference on Computer and Communications Security
October 19, 2012
Raleigh, North Carolina, USA

Acceptance Rates

AISec '12 paper acceptance rate: 10 of 24 submissions (42%)
Overall acceptance rate: 94 of 231 submissions (41%)


