research-article

Automated Feature Selection for Anomaly Detection in Network Traffic Data

Published: 21 June 2021

Abstract

Variable selection (also known as feature selection) is essential for optimizing learning complexity by prioritizing features, particularly for massive, high-dimensional datasets such as network traffic data. In practice, however, performing feature selection effectively is not easy, despite the availability of existing selection techniques. In our initial experiments, we observed that existing selection techniques produce different feature sets even under the same condition (e.g., a fixed size for the resulting set). In addition, individual selection techniques perform inconsistently, sometimes better and sometimes worse than others, so relying on any single one of them is risky when building models from the selected features. More critically, automating the selection process is demanding, because without automation it requires laborious effort and intensive analysis by a group of experts. In this article, we explore the challenges of automated feature selection as applied to network anomaly detection. We first present an ensemble approach that incorporates existing feature selection techniques in order to benefit from them; one of the proposed ensemble techniques, based on greedy search, performs highly consistently, with results comparable to those of the existing techniques. We also address the problem of when to stop the feature elimination process, and we present a set of methods designed to determine the number of features in the reduced feature set. Our experiments with two recent network datasets show that the feature sets identified by the presented ensemble and stopping methods consistently yield performance comparable to that of conventional selection techniques while using fewer features.
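The observation that different techniques return different feature sets of the same size, and the general idea of combining them, can be illustrated with a minimal sketch. This is not the greedy-search ensemble from the article, whose details the abstract does not give; it uses a simple Borda-style rank aggregation instead. The selector names, feature names, and rankings are all hypothetical, and every ranking is assumed to cover the same features.

```python
from collections import defaultdict


def top_k(ranking, k):
    """Return the set of the k best features from a ranking (best first)."""
    return set(ranking[:k])


def borda_aggregate(rankings):
    """Combine several rankings (best first) by summing rank positions.

    Assumes every ranking contains the same features.
    """
    totals = defaultdict(int)
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            totals[feature] += position
    # A lower total means the feature was ranked higher overall;
    # ties are broken alphabetically to keep the result deterministic.
    return sorted(totals, key=lambda f: (totals[f], f))


# Hypothetical rankings from three selectors over the same five features.
rankings = {
    "selector_a": ["bytes", "duration", "pkts", "ttl", "port"],
    "selector_b": ["duration", "pkts", "bytes", "port", "ttl"],
    "selector_c": ["pkts", "bytes", "ttl", "duration", "port"],
}

k = 2  # fixed size of the resulting feature set
individual = {name: top_k(r, k) for name, r in rankings.items()}
combined = borda_aggregate(list(rankings.values()))
ensemble_top_k = combined[:k]

for name, selected in individual.items():
    print(name, sorted(selected))
print("ensemble", ensemble_top_k)
```

With these made-up rankings, each selector yields a different two-feature set, while the aggregated ranking gives a single set that reflects all three. How many features to keep (the value of `k`) is fixed by hand here; the article treats that choice as a separate stopping problem.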

        Published In

        ACM Transactions on Management Information Systems  Volume 12, Issue 3
        September 2021
        225 pages
        ISSN:2158-656X
        EISSN:2158-6578
        DOI:10.1145/3468067
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 21 June 2021
        Accepted: 01 December 2020
        Revised: 01 December 2020
        Received: 01 June 2020
        Published in TMIS Volume 12, Issue 3

        Author Tags

        1. Feature selection
        2. ensemble approach
        3. network anomaly detection
        4. cybersecurity analytics

        Qualifiers

        • Research-article
        • Refereed

        Funding Sources

        • U.S. Department of Energy (DOE)
        • Office of Science, Office of Advanced Scientific Computing Research
        • Institute for Information & communications Technology Promotion (IITP)
        • Korea government (MSIP)
        • National Energy Research Scientific Computing Center (NERSC)
