SELID: Selective Event Labeling for Intrusion Detection Datasets
Abstract
1. Introduction
- SELID automatically selects a small number of events from the dataset so that they can be labeled efficiently. We present a content-based vectorization and clustering scheme that groups similar events.
- We verify the effectiveness of SELID on two datasets: a private dataset from a real security operations center and, for experimental reproducibility, a public dataset available on the Internet.
- The experimental results demonstrate that SELID can reduce the number of events that must be labeled by orders of magnitude without degrading the performance of the machine learning model.
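The select-then-label idea in the list above can be illustrated with a minimal Python sketch. It is not the method of this paper: byte n-gram feature hashing and a greedy "leader" clustering are stand-ins chosen for brevity, and the names `vectorize`, `leader_cluster`, `select_and_propagate`, `ask_analyst`, the vector size `DIM`, and the similarity threshold are all hypothetical.

```python
import math
import zlib

DIM = 64  # hypothetical vector size, chosen only for illustration


def vectorize(payload, n=3, dim=DIM):
    """Hash the byte n-grams of an event payload into a unit-length vector."""
    vec = [0.0] * dim
    for i in range(max(len(payload) - n + 1, 1)):
        vec[zlib.crc32(payload[i:i + n]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))  # at least 1.0, never zero
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def leader_cluster(vectors, threshold=0.8):
    """Greedy stand-in clustering: join the first similar leader, else start a new cluster."""
    leaders = []      # index of the representative event of each cluster
    assignment = []   # cluster id of every event
    for idx, vec in enumerate(vectors):
        for cid, lead in enumerate(leaders):
            if cosine(vec, vectors[lead]) >= threshold:
                assignment.append(cid)
                break
        else:
            leaders.append(idx)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


def select_and_propagate(payloads, ask_analyst, threshold=0.8):
    """Label only one representative per cluster and copy its label to the members."""
    vectors = [vectorize(p) for p in payloads]
    leaders, assignment = leader_cluster(vectors, threshold)
    leader_labels = [ask_analyst(payloads[i]) for i in leaders]
    return leaders, [leader_labels[cid] for cid in assignment]
```

In this sketch the analyst is consulted once per cluster rather than once per event, which is where the labeling savings would come from. How well propagated labels match the true labels depends on the vectorization and the threshold, both of which are assumptions here.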
2. Related Work
3. SELID: Selective Event Labeling for Intrusion Detection Datasets
3.1. Dataset
3.2. Vectorization
3.3. Clustering
3.4. Sampling and Labeling
4. Experiments and Discussion
4.1. Experimental Setup
4.2. Experimental Datasets
- Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$;
- Precision = $\frac{TP}{TP + FP}$;
- Recall = $\frac{TP}{TP + FN}$;
- F1-score = $\frac{2 \times Precision \times Recall}{Precision + Recall}$.
- True Positive ($TP$): An attack event is classified as an attack event.
- True Negative ($TN$): A non-attack event is classified as a non-attack event.
- False Positive ($FP$): A non-attack event is classified as an attack event.
- False Negative ($FN$): An attack event is classified as a non-attack event.
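The four metrics listed above follow directly from the confusion counts. The sketch below shows one way to compute them in Python. It assumes labels are encoded as 1 for an attack event and 0 for a non-attack event, and the function names are illustrative. Returning 0.0 for an undefined ratio is a convention chosen here, not something the paper specifies.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = attack, 0 = non-attack)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1-score; undefined ratios become 0.0."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Example: three attack events and five non-attack events.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(*confusion_counts(y_true, y_pred)))
```

When attack events are rare, accuracy can stay high even if most attacks are missed, so precision, recall and F1-score are worth reading alongside it.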
4.2.1. Private Dataset
4.2.2. Public Dataset
4.3. Experimental Results
4.4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Site | Train FP | Train TP | Test FP | Test TP |
---|---|---|---|---|
1 | 346,002 | 2016 | 54,448 | 119 |
2 | 202,835 | 4135 | 190,130 | 6187 |
3 | 10,564 | 541 | 6889 | 403 |
4 | 26,883 | 489 | 12,792 | 384 |
5 | 22,884 | 563 | 84,238 | 1397 |
6 | 87,903 | 24 | 10,554 | 24 |
7 | 6599 | 1952 | 1295 | 160 |
8 | 112,939 | 117 | 4415 | 21 |
9 | 17,050 | 2 | 21,263 | 16 |
10 | 21,484 | 43 | 16,447 | 68 |
all | 855,143 | 9882 | 402,471 | 8779 |
Label | Train | Test |
---|---|---|
FP | 28,800 | 20,052 |
TP | 7200 | 5013 |
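The class balance implied by the two tables above can be checked with a few lines of Python. This sketch assumes that the FP and TP counts are numbers of events labeled as false alarms and as true attacks, respectively. The split names and the helper `attack_share` are illustrative only.

```python
def attack_share(fp_events, tp_events):
    """Fraction of the events in a split that are labeled TP."""
    return tp_events / (fp_events + tp_events)


# (FP, TP) counts copied from the "all" row of the first table and from the second table.
splits = {
    "table1_all/train": (855_143, 9_882),
    "table1_all/test": (402_471, 8_779),
    "table2/train": (28_800, 7_200),
    "table2/test": (20_052, 5_013),
}

shares = {name: attack_share(fp, tp) for name, (fp, tp) in splits.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

Under that reading, TP events make up roughly 1.1% of the training split and 2.1% of the test split in the first table, and exactly 20% of both splits in the second table. The first dataset is therefore far more imbalanced.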
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jang, W.; Kim, H.; Seo, H.; Kim, M.; Yoon, M. SELID: Selective Event Labeling for Intrusion Detection Datasets. Sensors 2023, 23, 6105. https://doi.org/10.3390/s23136105