5. Machine Learning in Intrusion Detection Systems (IDSs) to Protect CI
ML is a category of AI focused on enabling computers to learn from previous knowledge drawn from experiences, patterns, and behaviors [28]. Since the field of AI emerged in the 1950s, a considerable amount of research has been conducted in almost every area of investigation, from agriculture to space. In cybersecurity, the ability to identify and learn from patterns is used to detect similar attacks. For instance, signature-based IDSs use ML to detect attacks whose signatures have been previously learned [31]. Although this kind of identification has produced excellent results against previously known attacks, its performance degrades when applied to zero-day attacks. Furthermore, a small modification to an attack changes its signature, making the attack difficult for a signature-based IDS to identify [5]. In the case of anomaly-based IDSs, an ML algorithm models the normal behavior of the network and flags everything outside the learned model as an anomaly. This kind of IDS is better at detecting unknown and zero-day attacks; however, its false positive rate is considerably higher, and abnormal behavior is not always an indication of an attack. For example, a plastic bag can block or alter the readings of a sensor in a hydroelectric system; while this is not a cyber-attack, the bag would be flagged as an anomaly. More recent research has shown the benefits of a hybrid approach, i.e., combining the strengths of both kinds of IDS. While a hybrid approach has some benefits, it results in a complex system that is difficult to implement [7,31,34,69].
ML algorithms have been used both to attack and to defend in cyberspace [5,70]. From a protection point of view, ML classifiers offer several advantages for security systems: (1) decision trees can find an accurate set of “best” rules for classifying network traffic; (2) k-nearest neighbors, an interesting solution for IDSs, can learn patterns from new traffic and classify zero-day attacks as an unseen class; (3) support vector machines separate traffic classes with a maximum-margin decision boundary; and (4) artificial neural networks can adapt to new forms of communication, learn from incidents without retraining the entire model, and adjust their neurons’ weights to identify unseen attacks [28]. All the previous examples share common characteristics: they depend on the quality of the dataset used to learn to identify a cyber-attack, they perform supervised learning, and they need periodic updates (there are different updating techniques depending on the trained model and particular needs). Nevertheless, the need to update the model applies not just to ML classifiers but to any ML model.
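As a sketch of point (2) above, a minimal k-nearest-neighbors classifier votes among the closest training samples; the feature names and values below are hypothetical toy data, not drawn from any real IDS dataset:

```python
import math
from collections import Counter

def knn_classify(train, labels, sample, k=3):
    """Classify a traffic sample by majority vote among its k nearest
    training points (Euclidean distance over numeric features)."""
    dists = sorted(
        (math.dist(row, sample), lbl) for row, lbl in zip(train, labels)
    )
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]

# Toy, hand-made feature vectors: (packets/s, mean payload bytes, distinct ports)
train = [(10, 500, 2), (12, 480, 3), (900, 40, 150), (950, 60, 160)]
labels = ["normal", "normal", "scan", "scan"]

print(knn_classify(train, labels, (920, 50, 155)))  # → scan
```

In practice, the choice of k and the distance metric strongly affects results, which is the dependence on parameters noted for KNN-based hybrids later in this section.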
The incorporation of fourth-industrial-revolution technologies such as the Internet of Things has exponentially increased the amount and diversity of data that CI generates [8]. Additionally, SCADA systems, which are the core of most CI, have implemented TCP/IP communication protocols [32], resulting in a wider attack surface and the possibility of more complex and diverse attacks [5]. There is a need to develop new technologies to cope with changing and novel risks. ML solutions have established strong resistance against security threats [8]. Nonetheless, depending on experts’ labeling is becoming impractical, as attackers constantly change their methods, and the exponential increase in real-time network traffic [28] has made it impossible to keep security rules updated. Additionally, it can be difficult to recognize patterns in unbalanced, noisy, or incomplete data [71], features that are normally present in CI network traffic. Consequently, UL and RL have become the most promising approaches to cope with these problems. UL helps to uncover hidden characteristics, patterns, and structures in datasets to establish indicators of cyber-attacks [31,60] and, through clustering, has enhanced the capacity to identify novel attacks. RL learns from its own experience and is the closest to human learning. RL performs well in real-time adversarial scenarios [8], and its characteristics make it attractive as a cybersecurity solution.
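A minimal illustration of the clustering idea: model “normal” traffic as a single cluster and score new samples by their distance to its centroid. The features and the threshold rule below are illustrative assumptions, not taken from the cited works:

```python
import math

def centroid(points):
    """Component-wise mean of a set of feature vectors."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def anomaly_score(sample, center):
    """Euclidean distance to the centroid of the 'normal' cluster;
    larger distances suggest behaviour unseen during training."""
    return math.dist(sample, center)

# Toy 'normal' traffic: (packets/s, mean payload bytes)
normal = [(10, 500), (12, 480), (11, 510), (9, 495)]
center = centroid(normal)

# Assumed decision rule: flag anything 3x farther out than any training point.
threshold = 3 * max(anomaly_score(p, center) for p in normal)

print(anomaly_score((900, 40), center) > threshold)  # → True (flagged)
```

Real UL detectors use many clusters and more robust scores, but the principle of flagging points far from learned structure is the same.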
Typical security solutions tend not to identify vulnerabilities that emerge from the interaction of IT and physical systems [72,73]. There is a need to develop IDSs with specific characteristics that take CI requirements into consideration: (1) industrial control systems (ICSs) operate continuously and cannot be interrupted for long periods to carry out security management tasks, and the highest service availability is usually mandatory; (2) in industrial networks, jitter and delay are kept at lower levels than in IT networks; (3) the physical process is carried out by sensors, actuators, and programmable logic controllers (PLCs), which are key components of ICS operation, and their security is a priority [1]; (4) a cyber-attack on CI could escalate and generate economic losses, social or political issues, and even impact human lives [19]; and (5) ICS traffic is more stable, payloads depend on system specifications, and ICSs usually rely on their own communication protocols [74].
Details of some of the ML algorithms used in IDSs are explained in Figure 2. The most frequently used ML method is supervised learning, which has shown meaningful results in measures such as accuracy. Nevertheless, comparing results is not simple, since they are calculated using different measures, algorithms, and training datasets. In [60], the authors based their evaluation on the area under the curve (AUC) and obtained the best possible result (1.0). This measure does not allow one type of error to be minimized over another; thus, the AUC is not useful when false positives or false negatives must be optimized separately. More complex metrics are also used to evaluate the performance of IDSs, such as the Matthews correlation coefficient (MCC) and the F1-score. The latter is becoming popular since it is computed as the harmonic mean of precision and recall [75]. There is a limitation in the parametric comparison of ML algorithms used in IDSs, and most of the analyzed works do not evaluate their results with a variety of measures [35,76,77]. The most common measure is accuracy, followed by precision, recall, and F1-score, as shown in Table 5. Calculating metrics such as the MCC, confusion matrices, specificity, sensitivity, and the kappa coefficient helps in understanding the behavior of ML algorithms and the research results in depth, as in the case of [72], in which the authors report results in more than five metrics.
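The metrics discussed above can all be computed from a binary confusion matrix; the sketch below (with made-up counts) shows accuracy, precision, recall, the F1-score as the harmonic mean of precision and recall, and the MCC:

```python
import math

def metrics(tp, fp, fn, tn):
    """Common IDS evaluation metrics from a binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)                   # also called sensitivity / detection rate
    f1 = 2 * prec * rec / (prec + rec)     # harmonic mean of precision and recall
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return acc, prec, rec, f1, mcc

# Invented counts for a highly imbalanced test set (typical of CI traffic).
acc, prec, rec, f1, mcc = metrics(tp=90, fp=10, fn=5, tn=895)
print(round(acc, 3), round(f1, 3))  # → 0.985 0.923
```

Note how accuracy (0.985) looks better than the F1-score (0.923) on imbalanced data, which is why reporting several metrics, as in [72], gives a clearer picture.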
In the case of anomaly-based IDSs, the detection rate and false alarm rate are the most common metrics used to evaluate detectors. Nonetheless, these cannot fully assess a detector designed to work in CI; for instance, detection latency is a key factor [74], since operators of CI need to know about a cyber-attack as soon as possible.
Ensemble models obtained positive results using the F1-score as an evaluation metric; however, the training dataset, dating from 1990, could not represent current threats [31]. Models that used decision trees, neighbor-based models [27,76], and recurrent neural networks [82] obtained accuracy above 0.96 with more up-to-date datasets. The problem an ML model must solve is not always the same: in some cases it is binary classification, while in others it is multiclass classification. The number of classes depends on the security information available in the dataset and the model’s purpose. From the cybersecurity perspective, it is not enough to detect a cyber-attack (binary classification); it is better to know which kind of intrusion was detected in the system (multiclass classification), since this knowledge can determine incident management. There has been a surge in new techniques, such as the clustering-based classification methodology named perceptual pigeon galvanized optimization (PPGO) [72]. Although this technique proposes a binary classification, it achieves good results not only in accuracy but also in other evaluations such as the MCC, confusion matrices, sensitivity, specificity, and F1-score. This kind of technique is a better candidate for implementation in industrial networks than some multiclass solutions with less accurate results. Additionally, PPGO is also a method for choosing optimal features, which is always a challenge when working with ML. An analysis of some previous works that developed IDSs using ML is shown in Table 6.
Feature selection (FS) is a demanding task, not only in the development of ML algorithms for industrial systems but in any solution that implements ML. In classification problems, an adequate FS technique finds the characteristics that best solve the problem, increases classification accuracy, and decreases training and testing time. There are different techniques for FS; some of the most common are wrapper methods, which include forward, backward, and stepwise selection; filter methods, which use measures such as Pearson’s correlation and analysis of variance (ANOVA); and embedded methods, in which FS takes place as part of building models such as decision trees. Additional methods or tools that can be used for FS have also been developed, such as principal component analysis (PCA). Although a deep analysis of FS techniques is out of the scope of this review, it is necessary to highlight their importance. An example of an FS algorithm for IDSs in CI was developed in [73], where the authors present a wrapper method composed of the BAT algorithm and support vector machines (SVMs). The results were positive across different measures; however, the study was carried out with the benchmark KDD Cup dataset from 1999, which might limit its applicability to real-world scenarios, since the dataset cannot represent the characteristics of current attacks and only covers four kinds of attacks: denial-of-service attacks, which prevent users from accessing services; probe attacks, which scan for vulnerabilities; remote-to-local attacks, which obtain access from remote connections; and user-to-root attacks, which obtain root access from a normal user account.
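As a toy illustration of a filter method, the sketch below ranks features by their absolute Pearson correlation with the class label and keeps the top k; the data are invented for the example:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_top_k(rows, labels, k):
    """Filter-method FS: rank feature columns by |correlation| with the label."""
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        scores.append((abs(pearson(col, labels)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Toy data: feature 0 tracks the label, feature 1 is noise.
rows = [(1, 7), (2, 3), (8, 5), (9, 4)]
labels = [0, 0, 1, 1]
print(select_top_k(rows, labels, k=1))  # → [0]
```

Wrapper methods such as the BAT+SVM approach in [73] instead search feature subsets by repeatedly training the classifier, which is more expensive but accounts for feature interactions.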
As shown in [33,71,75], detection time is a factor that should be considered and measured. Although proper identification is mandatory to protect CI, detection time is key to avoiding escalation, mitigating the major effects of a cyber-attack, and continuing to offer the service.
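A minimal way to report detection latency is to time each detector call; the detector below is a hypothetical stand-in, not one of the surveyed models:

```python
import time

def timed_detect(detector, sample):
    """Wrap a detector call and record its per-sample detection latency."""
    start = time.perf_counter()
    verdict = detector(sample)
    latency = time.perf_counter() - start
    return verdict, latency

# Hypothetical stand-in detector: flags unusually high packet rates.
verdict, latency = timed_detect(lambda s: s["pkts_per_s"] > 500,
                                {"pkts_per_s": 900})
print(verdict, latency >= 0.0)  # → True True
```

Reporting this latency alongside accuracy-style metrics would make the surveyed models easier to compare for CI use.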
Currently, to overcome the identified setbacks in applying ML algorithms to IDSs, there has been a tendency to use hierarchical, layered [33], hybrid, or meta-learning algorithms. These algorithms improve the capacity to detect unseen and infrequent attacks while conserving accuracy in detecting well-known attacks. In general, one model is used as the input for the next, and multiple combinations of models have produced positive results in measures such as accuracy, as shown in Table 5. The results are generally well accepted and much better than those of a classical approach. However, some of them were validated with datasets that, for the most part, do not represent current threats, diminishing the capacity to generalize the results and raising doubts about their behavior in the real world. Furthermore, there is a concern about the technical requirements needed to develop and support the models. Additionally, they have not succeeded in identifying all types of intrusions [34]. Most models lack proper adaptivity [83], as attackers’ changing patterns are usually not identified. In some cases, they require human intervention to introduce new vulnerabilities; however, the number of new vulnerabilities could surpass the technique’s capacity.
In [84], the authors present a hybrid approach focused on dealing with highly imbalanced data in SCADA. This proposal combines a customized content-level detector, a Bloom filter, with an instance-based learner, k-nearest neighbors (KNN). The detector is signature-based and therefore cannot detect attacks that were not previously identified; to overcome this issue, the authors used KNN, although its performance is highly dependent on the number of neighbors considered for classification. Implementing hybrid algorithms with unsupervised learning is also an option, as presented in [85], where a mutated self-organizing map algorithm (MUSOM) deployed an agent that classified node behavior as malicious or normal. MUSOM aims to reduce the learning rate, a positive characteristic when developing security systems for SCADA because it decreases training time without increasing memory needs.
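The content-level detector in [84] is a Bloom filter, which can be sketched in a few lines; the sizing, hash scheme, and example payloads below are illustrative assumptions, not the authors' implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash functions.
    Membership tests can yield false positives but never false negatives,
    which suits a fast first-pass check on known payload contents."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("WRITE_COIL addr=40001 value=1")       # hypothetical known payload
print("WRITE_COIL addr=40001 value=1" in bf)  # → True
print("WRITE_COIL addr=99999 value=0" in bf)  # almost certainly False
```

Because lookups touch only a fixed number of bits, the filter stays fast and memory-bounded regardless of how many signatures it holds, which is why [84] pairs it with the slower KNN stage.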
In [60], meta-learning approaches (bagging, boosting, stacking, cascading, delegating, voting, and arbitrating) combined with unsupervised learning were tested on 21 datasets, and the authors concluded that no algorithm outperformed the others. Despite this, they recognized factors that would improve the results, such as accurate parameter tuning or a better feature extractor.
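Of the meta-learning schemes listed, voting is the simplest to sketch: each base model predicts a label and the most common verdict wins. The base detectors, feature names, and thresholds below are hypothetical:

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Meta-learning 'voting': the most common prediction among the base
    models wins (ties resolved by first-seen order via Counter)."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical base detectors over a toy feature dict.
classifiers = [
    lambda s: "attack" if s["pkts_per_s"] > 500 else "normal",
    lambda s: "attack" if s["distinct_ports"] > 50 else "normal",
    lambda s: "attack" if s["payload_entropy"] > 7.0 else "normal",
]
sample = {"pkts_per_s": 900, "distinct_ports": 120, "payload_entropy": 4.2}
print(majority_vote(classifiers, sample))  # → attack
```

Stacking and cascading differ in that one model's output becomes another's input rather than being tallied directly.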
Another approach is to develop models that detect specific attacks, as shown in Table 7. This kind of approach focuses mainly on the most frequent and highest-impact attacks on CI, such as distributed denial-of-service (DDoS) attacks, which affect a service’s availability. In detecting DDoS attacks, classification accuracy above 0.97 has been obtained [32]. The interruption of CI availability tends to have the most severe impact on people’s daily lives, as it interferes with access to daily commodities such as energy, communications, and water. Although the other security properties, integrity and confidentiality, are also vital, CI operators always prioritize availability over all other considerations [22,24].
In previous research, as illustrated in Table 5, there are positive results for IDSs that implement ML techniques, some of which obtain accuracy above 0.99. Nonetheless, the training datasets do not contain logs from cyber–physical components such as sensors or actuators. These components are essential for the operation of CI and have specific characteristics [92]. Therefore, the results can be imprecise due to the inaccuracy and outdatedness of the datasets used to train the models. Additionally, the kinds of cyber-attacks that target CI differ from typical attacks on other infrastructure, mainly because of (1) the physical components involved, (2) the real-time data transmission, (3) the geographically distributed components [93], (4) the kind of attacker, and (5) the attack motivation. When these characteristics are taken into consideration, a different set of threats must be analyzed, as shown in Table 7. These types of attacks include elements such as the alteration or disruption of the information issued by specific sensors [87,88].
Finally, from the cybersecurity point of view, the design of new ML-based IDSs should consider robustness against adversarial attacks, which exploit the vulnerabilities of ML systems to bypass IDSs [94]. Adversarial attacks use different attack vectors, for instance, altering the classifier to change its output, modifying the input data, or deploying an adversarial honeypot. Some of the techniques used to craft an adversarial attack are the fast gradient sign method (FGSM) and projected gradient descent (PGD), which add noise to the original data [95]. These attacks are particularly challenging, as some authors argue that the maximum mean discrepancy (MMD) might not be effective in distinguishing legitimate from malicious traffic; however, previous research has found that, with modifications to the original implementation, MMD can help identify adversarial attacks [96]. Defense techniques to improve the security of ML-based IDSs were implemented in [94,97], where the authors proposed three categories: modifying the input data, i.e., augmenting the original dataset to improve generalization (Gaussian data augmentation); modifying the classifier, i.e., changing the loss function or adding more layers (gradient masking); and adding an external model, i.e., adding one or more models during testing while keeping the original (generative adversarial networks (GANs)).
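The FGSM idea can be sketched on a toy logistic "detector": each input feature is pushed by a small step eps in the sign of the loss gradient, which lowers the detector's attack score. The weights and the sample below are invented for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fgsm(x, y, w, eps):
    """FGSM sketch: perturb each feature by eps in the direction that
    increases the log-loss of a logistic detector with weights w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    grad = [(p - y) * wi for wi in w]        # d(log-loss)/d(x) for logistic models
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

w = [2.0, -1.0]           # hypothetical trained detector weights
x, y = [1.5, 0.5], 1      # a sample the detector scores as an attack (y=1)
x_adv = fgsm(x, y, w, eps=0.8)

before = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
after = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)))
print(after < before)  # → True: the perturbed sample looks less like an attack
```

In a real attack the gradient comes from the deployed model (or a surrogate), and defenses such as adversarial training add exactly these perturbed samples back into the training set.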
6. Conclusions and Future Direction
In this paper, we have presented a survey of IDSs developed for the protection of CI, based on work from the last five years. These IDSs use ML techniques as a principal component to detect cyber-attacks. Although there have been meaningful advances in detection tools for the accurate identification of known attacks, challenges remain, such as the detection of zero-day attacks, model updating, and the high rate of false positives. Future research could focus on these identified challenges. This work highlights the weaknesses and strengths of (1) the ML used to improve the cybersecurity level of CI, (2) the cybersecurity datasets, and (3) the CI security requirements. Finally, it serves as a starting point for forthcoming studies.
The protection of CI is a national security concern [1], and its cybersecurity models depend on traditional approaches that typically utilize standalone security solutions [98]. Systems such as IDSs incorporate ML to improve their prediction capacity, and different kinds of learning methods have been implemented, yet the results do not cover all the protection levels required to secure CI. On the one hand, supervised learning has produced positive results in identifying well-known attacks, but it struggles to detect zero-day attacks. On the other hand, unsupervised learning, which is better at detecting unknown attacks, does not obtain the same results on known attack vectors. Additionally, reinforcement learning has been incorporated to resolve high-dimensional cyber defense problems [8]. More complex approaches are being developed, and meta-learners and artificial neural networks have been tested.
Although the results seem promising in the anomaly detection field, most of the testing was carried out with datasets that represent neither CI network traffic nor past or present cyber threats, which calls into question the algorithms’ capacity to generalize to real-world scenarios. There is a need for accurate characterization of data extracted from CI networks, not only to train network-based IDSs but also to help in the development of host-based IDSs. Developing a more accurate dataset is an open area of research that would contribute greatly to closing the gap between academic findings and real-world applications.
Comparing results with previous works is challenging. Table 5 and Table 6 show some works developed to detect cyber-attacks using ML techniques; however, the comparison is not an easy task, since they used different datasets and techniques and, in some cases, calculated different metrics or reported only the model’s accuracy [30,72], and accuracy alone is not enough to analyze an ML model. Particularly in ICSs, the detection time is a factor that must be calculated. Additionally, the works might not provide enough information to replicate the model. Thus, advances in how to compare ML models are an encouraging research area. Additionally, there is a need to close the gap between cybersecurity systems and incident management, so that organizations can undertake appropriate control measures to mitigate risk proactively [18].