DOI: 10.1145/3589883.3589892
research-article
Open access

Learning the Structure of Commands by Detecting Random Tokens Using Markov Model

Published: 27 June 2023

Abstract

Learning the syntax and structure of command-line commands is important in cyber security for distinguishing valid commands from malicious ones. Learning the syntax and structure of every command is hard for several reasons: commands evolve continuously, their syntax is precise, the volume of available commands is huge, and there is no room for error. In this work, we study two approaches to learning the structure of commands by detecting the random tokens in them, such as temporary files, temporary directories, and numerical values. In the first approach, we write hard-coded regular expressions to identify random tokens in a command; in the second, we train a second-order Markov model that detects random tokens based on their probabilities. To validate the efficiency of these approaches, we cluster the commands using their word embeddings and sentence embeddings. For clustering, we explore KMeans and DBSCAN with word embeddings, and sentence clustering based on sentence embeddings. We evaluate the clustering algorithms against three metrics: the Silhouette Coefficient, the Calinski-Harabasz Index, and the Davies-Bouldin Index. The results show that the regular-expression approach and the Markov model achieve the same scores on all three metrics for KMeans and DBSCAN with word embeddings, whereas with clustering on sentence embeddings the Markov model performs better than the regular expressions. These results support our idea of using the Markov model instead of regular expressions to obtain similar or even better performance with less resource utilization, such as the human effort and time needed to write regular expressions and the cost of maintaining and storing them.
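The two detection approaches can be illustrated with a minimal Python sketch. The abstract does not specify the model's details, so everything below is an assumption made for illustration: the model is taken to be character-level, with add-one smoothing, trained on a small hypothetical list of ordinary tokens (KNOWN_TOKENS). The regular expressions, the <RAND> placeholder, and the threshold parameter are likewise hypothetical and are not the authors' implementation.

```python
import math
import re
from collections import defaultdict

# Hypothetical hand-written patterns (not from the paper) for the regex approach.
RANDOM_PATTERNS = [
    re.compile(r"^\d+$"),              # purely numerical values
    re.compile(r"^/tmp/\S+$"),         # temp files and temp directories
    re.compile(r"^[0-9a-fA-F]{8,}$"),  # long hexadecimal strings
]


def is_random_regex(token):
    """Regex approach: a token is random if any hard-coded pattern matches."""
    return any(p.match(token) for p in RANDOM_PATTERNS)


# Hypothetical training vocabulary of ordinary (non-random) command tokens.
KNOWN_TOKENS = [
    "ls", "cat", "grep", "python", "install", "remove", "status", "config",
    "docker", "commit", "branch", "update", "service", "restart",
    "--verbose", "--help",
]


class SecondOrderMarkov:
    """Character-level model of P(c_i | c_{i-2}, c_{i-1}), add-one smoothed.

    Assumption: tokens do not themselves contain the padding symbols.
    """

    START, END = "^", "$"

    def __init__(self):
        # counts[two-character context][next character] -> frequency
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = {self.END}

    def fit(self, tokens):
        for token in tokens:
            padded = self.START * 2 + token + self.END
            self.alphabet.update(token)
            for i in range(2, len(padded)):
                self.counts[padded[i - 2:i]][padded[i]] += 1
        return self

    def avg_log_prob(self, token):
        """Mean log-probability per transition; lower means more random."""
        padded = self.START * 2 + token + self.END
        size = len(self.alphabet) + 1  # one extra bucket for unseen characters
        total = 0.0
        for i in range(2, len(padded)):
            context = self.counts.get(padded[i - 2:i], {})
            seen = context.get(padded[i], 0)
            total += math.log((seen + 1) / (sum(context.values()) + size))
        return total / (len(padded) - 2)


def mask_random_tokens(command, model, threshold):
    """Replace tokens scoring below the threshold with a placeholder."""
    return " ".join(
        "<RAND>" if model.avg_log_prob(tok) < threshold else tok
        for tok in command.split()
    )
```

In this sketch, a model built with SecondOrderMarkov().fit(KNOWN_TOKENS) gives a token such as "install" a higher average log-probability than a string such as "x9Qz7Kp2". A threshold between the two scores separates them without any hand-written pattern. In practice the threshold would have to be tuned on real command data.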



Published In

ICMLT '23: Proceedings of the 2023 8th International Conference on Machine Learning Technologies
March 2023
293 pages
ISBN:9781450398329
DOI:10.1145/3589883
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Clustering
  2. Command-Line
  3. Commands
  4. Markov model
  5. Random Tokens
  6. Randomness
  7. Regular Expressions
  8. Sentence Embeddings
  9. Word Embeddings

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMLT 2023
