1 Background and Study Motivations
One of the most prevalent illegal online practices is the phishing website [3], which uses a fake version of a benign website to win the trust of online users and steal their login credentials, social security numbers, bank account details, and credit card details. It is also used to deliver harmful software such as ransomware via Facebook, Twitter, SMS, and other channels. As a result of successful phishing website attacks, vulnerable sectors such as banking, e-commerce, education, healthcare, broadcast media, and agriculture commonly experience loss of productivity, reduced competitiveness, risks to their survival, and compromises to national security [8].
It is impossible to prevent cyber criminals from devising phishing websites, but the threat can be mitigated by detecting that a particular website is fraudulent and warning Internet users to take the necessary precautions before they fall for phishing assaults [22]. A URL can take users anywhere on the Internet, yet users rarely check the trustworthiness of its structure. Inspecting URL structure alone cannot guarantee safe browsing unless hidden malicious web contents are also verified, and most Internet-dependent users have neither access to nor knowledge of website source code [11]. Investigating such cybercrime and classifying its subtypes therefore requires technical interventions assisted by emerging technologies such as artificial intelligence [12].
According to the 2022 survey on trends in phishing activity by the Anti-Phishing Working Group (APWG), the number of distinct phishing websites is growing at an alarming rate. For example, the APWG detected 1,025,968 distinct phishing websites in the 1st quarter of 2022, 1,097,811 in the 2nd quarter, 1,270,883 in the 3rd quarter, and 1,350,037 in the 4th quarter [34–37]. Figure 1 exhibits the month-wise details of the aforementioned reports.
Most modern browsers, including Firefox, Chrome, and Internet Explorer, as well as some anti-virus programs, make use of blacklisting and whitelisting approaches to detect and block phishing websites [30, 31], although these approaches are unable to detect newborn phishing websites [8, 30, 31]. To overcome this limitation, the scientific community has proposed numerous phishing website detection approaches. For example, the heuristic (rule-based) approach detects phishing websites by extracting features from web contents and is able to recognize fresh phishing website attacks [8, 30, 31], although an attacker can bypass the heuristic parameters once he or she knows the heuristic algorithms [30]. Visual similarity is another approach: it detects phishing websites by comparing screenshots or images of benign websites with those of phishing websites to find a certain similarity ratio [30], and it can also detect fresh attacks [8]. This approach has the following limitations, however: (i) comparing entire images of benign websites against phishing websites requires more computational time and more storage space to save the images [8, 30]; (ii) comparing animated web pages against a phishing website can result in a high false negative rate due to a low similarity score; and (iii) it fails to detect phishing websites when the website background is slightly modified [30].
Because of their promising performance in preventing and detecting new attacks by automatically discovering hidden patterns in huge datasets, machine learning (ML) and deep learning (DL) approaches have gained wide acceptance in the domain of cyber security, particularly in malware detection and classification, intrusion detection, spam detection, and phishing detection, although the success of these approaches relies on the nature and characteristics of the datasets and features used [13–15].
Compared to DL, ML algorithms can detect phishing websites much faster and do not need specific hardware such as Graphics Processing Units (GPUs) for implementation [3]. Many domains already make use of ML, which is a key technology for both existing and future information systems. The use of ML in cyber security, however, is still in its infancy, demonstrating a substantial gap between research and practice [22]. Over 90% of companies already use some form of AI/ML in their defensive measures, but most of these solutions currently employ "unsupervised" techniques and are primarily used for anomaly detection; this observation demonstrates a stark gap between theory and application, especially when compared to other fields where ML has already established itself as a valuable asset [22].
ML [44] has been widely used in a range of areas. One of its most notable applications is classification, which has been applied to a variety of tasks, including e-mail spam filtering, remote sensing, crop classification, sports result prediction, seismic event classification, biomedical informatics, machine fault detection, and power monitoring. Despite their multifaceted benefits, classification algorithms have a number of practical constraints, especially when dealing with complicated data such as unstructured data and streaming data [1]. To begin with, the majority of real-world data have a variety of properties; the data may also contain noise, pointless features, redundant features, and abnormalities [44]. Model over-fitting and under-fitting are the core challenges that traditional or individual ML algorithms encounter as a result of noise, variance, and bias in datasets [1, 2]. Ensemble learning methods are found to be state-of-the-art solutions for numerous classification tasks and can overcome the aforementioned concerns.
The ensemble learning method combines the ideal solutions or collective skills of multiple separate models to provide decisions that are more accurate than those yielded by individual ML algorithms [1, 9, 10, 16, 18]. The reasons for using an ensemble classifier are its enhanced predictive performance, its ability to build a stable model, and its capacity to cope with class imbalance [2, 20]. Ensemble learning methods are mainly categorized into three classes, namely bagging, boosting, and stacking, briefly described as follows.
Bagging (bootstrap aggregation) ensemble method: in this method, the dataset is randomly partitioned, the base estimators are built in parallel, and sampling with replacement is carried out concurrently. The final decision aggregates each classifier's prediction, typically by averaging. This approach allows the individual training sets to share identical instances and is used to reduce bias and variance in the predicted results [2]. It is advised when the available data size is insufficient [20]. Random Forest is a well-known bagging-type ensemble learner [46].
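The bagging idea can be sketched with scikit-learn's RandomForestClassifier; the synthetic dataset and hyper-parameter values below are illustrative stand-ins, not the study's datasets or tuned values:

```python
# Illustrative bagging sketch on a synthetic binary dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for a phishing dataset: 2,000 labeled samples, 30 features.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Each tree is fit on a bootstrap sample (sampling with replacement);
# the forest aggregates the trees' votes into the final prediction.
rf = RandomForestClassifier(n_estimators=45, random_state=0)
rf.fit(X_tr, y_tr)
print(f"RF accuracy: {rf.score(X_te, y_te):.3f}")
```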
Boosting (iterative and sequential) ensemble learning method: in this method, the instances of the training dataset are reweighted progressively and base learners are generated sequentially [2, 20]. Data incorrectly classified by prior base learners are given more weight [20]. This method iteratively adds ensemble members to fix previous models' inaccurate predictions, and the final output is a weighted average of the predictions. The initial weight is set to W_ti = 1/N, each subsequent model attempts to rectify the errors produced by the prior model, and the process continues until the model accurately forecasts the entire dataset or reaches its maximum prediction capability [2, 20]. This ensemble method is capable of reducing variance [2]. The gradient boosting and CatBoost classifiers are among the boosting-type ensemble learners.
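The sequential boosting idea can be sketched with scikit-learn's GradientBoostingClassifier (CatBoost exposes a similar fit/predict interface via catboost.CatBoostClassifier; the scikit-learn estimator is used here to keep the sketch self-contained). Dataset and hyper-parameter values are illustrative:

```python
# Illustrative boosting sketch on a synthetic binary dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Trees are added one at a time; each new tree fits the errors
# (gradients) left by the ensemble built so far.
gb = GradientBoostingClassifier(n_estimators=73, learning_rate=0.1,
                                random_state=0)
gb.fit(X_tr, y_tr)
print(f"GB accuracy: {gb.score(X_te, y_te):.3f}")
```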
Stacking (Meta ensemble learning method): in this method, a number of diverse base learners are combined at the first layer, and a Meta learner is then trained to combine the predictions of the base learners at the next layer [7, 20]. The purpose of the Meta learner is to determine the extent to which the prediction outcomes of the base learners may be combined, to learn any misclassification patterns discovered by the base classifiers [7], and to rectify the dataset's misclassifications [20].
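The two-layer stacking scheme can be sketched with scikit-learn's StackingClassifier; the base learners, Meta learner, dataset, and hyper-parameter values below are illustrative choices, not the study's tuned configuration:

```python
# Illustrative stacking sketch: two base learners plus a Meta learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# First layer: diverse base learners. Second layer: the Meta learner
# (here RF) is trained on the base learners' cross-validated predictions.
base = [("rf", RandomForestClassifier(n_estimators=45, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=73, random_state=0))]
stack = StackingClassifier(
    estimators=base,
    final_estimator=RandomForestClassifier(random_state=0),
    cv=5)
stack.fit(X_tr, y_tr)
print(f"Stacking accuracy: {stack.score(X_te, y_te):.3f}")
```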
Taking into account the multifaceted benefits of ensemble learning approaches, such as enhanced predictive performance, the ability to build a stable model, and the capacity to cope with class imbalance [2, 20], we implemented single and hybrid ensemble learning algorithms to detect phishing websites. Single ensemble learners are classifiers that are ensembles on their own; they combine the prediction results of multiple decision trees to yield the final output. Random forest, gradient boost, and cat boost are a few examples. Hybrid ensemble learners are classifiers that integrate numerous single ensemble learners. Stacking and majority voting are two examples, and both are used here to combine the results of the random forest, gradient boost, and cat boost.
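Majority (hard) voting can be sketched with scikit-learn's VotingClassifier; for brevity this sketch combines only two single ensemble learners on a synthetic dataset, whereas the study combines random forest, gradient boost, and cat boost:

```python
# Illustrative hard-voting sketch on a synthetic binary dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Hard voting: each single ensemble learner casts one vote per sample
# and the most frequent class label becomes the final prediction.
vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=45, random_state=0)),
                ("gb", GradientBoostingClassifier(n_estimators=73,
                                                  random_state=0))],
    voting="hard")
vote.fit(X_tr, y_tr)
print(f"Hard-vote accuracy: {vote.score(X_te, y_te):.3f}")
```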
This article introduces the cat-boost classifier for two reasons. The first is that, despite being the most recent boosting ML algorithm, it was not considered in 30 recently reviewed ML-based phishing website detection research works [15] or in the phishing website detection research works using ensemble learning methods [9, 10, 16–18, 23]. The second is that, despite being a good option that has demonstrated superior performance in many academic fields [5], including traffic engineering, finance, meteorology, medicine, electrical utilities, astronomy, marketing, biology, psychology, and biochemistry, the cat-boost classifier is not widely used in the field of cyber security [5]. Hence, the proposed study is one attempt to address the aforementioned gaps.
The main contributions of the proposed study are presented as follows:
•
Phishing website attack is a dynamic problem. Ensemble learning methods are among the cutting-edge solutions for detecting newly devised phishing websites. The effectiveness of these methods, however, is significantly impacted by the nature and characteristics of datasets. To address these concerns, we used three different trustworthy public datasets to train and test each ensemble learner and then determined which one performed best across these datasets in terms of accuracy and train-test computational time.
•
Dealing with the curse of dimensionality is one of the major challenges of devising an accurate and quick predictive ML model. These issues can be addressed by eliminating noisy, unnecessary, and duplicated features and applying only informative features [38]. To this end, we applied the uni-variate feature selection (UFS) technique to each ensemble learning classifier and found encouraging results.
•
To the best of our knowledge, the combined use of the CATB, GB, and RF for phishing website detection is one of the first attempts.
•
It is difficult to determine, without experiments, which classifier among CATB, GB, and RF is optimal as the Meta learner in the stacking ensemble learning approach. Meta-RF was found to be suitable for DS-1, while Meta-CATB was found suitable for both DS-2 and DS-3 in terms of scoring better accuracy.
•
Our proposed approach attained better phishing website detection accuracy compared to other studies that used the same datasets.
The remainder of this article is organized as follows: literature review; materials and methods; results and discussion; conclusions and future remarks; acknowledgments; and references.
4 Results and Discussion
A higher accuracy score and a faster model computing time assure the trustworthiness of any phishing website detection system. Furthermore, scalability and performance consistency across numerous datasets are critical to combating the various variants of phishing website attacks. That is why the proposed study used three distinct phishing website datasets to train and test each ensemble learning method (CATB, GB, RF, stacking, and hard vote). DS-1 has 87 features and 11,430 records, DS-2 has 31 features and 11,054 records, and DS-3 has 48 features and 10,000 records.
Indeed, the success of any ML algorithm depends on using cleaned, representative datasets and a choice of informative features [13–15, 21, 39]. To achieve this aim, appropriate model hyper-parameter values, cross-validation, and the top informative website features were used to optimize model performance in terms of accuracy and computational time.
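The hyper-parameter tuning step described above can be sketched with scikit-learn's GridSearchCV; the grid values below are illustrative stand-ins, while the values actually selected in the study appear in Tables 4–6:

```python
# Illustrative hyper-parameter search with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Each candidate combination is scored by cross-validated accuracy;
# the best-scoring combination is refit on the full data.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [32, 45], "max_depth": [10, None]},
    cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, f"best CV accuracy: {grid.best_score_:.3f}")
```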
Ensemble learning methods are known to require a lot of computational time because they combine multiple ML algorithms [7, 9, 10]. To address this concern, each ensemble learning method was trained and tested both before and after applying the UFS technique, and promising results were obtained following its application.
Using an uneven dataset can lead to biased model results because the model will favor the majority classes. To address this concern, we employed the widely used dataset balancing strategy SMOTE to change the uneven class ratio of DS-2 from 56%:44% to 50%:50%.
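The study used the standard SMOTE implementation; purely to illustrate the idea behind it, the following minimal numpy sketch interpolates synthetic minority samples between an instance and one of its nearest minority neighbours (the data, sizes, and helper name here are made up for illustration):

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Minimal SMOTE-style oversampling sketch: each synthetic sample lies
    on the segment between a minority instance and one of its k nearest
    minority neighbours."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]   # k nearest neighbours, excluding self
        j = rng.choice(nbrs)
        gap = rng.random()              # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

# Rebalance a toy 56%:44% split (560 majority, 440 minority) to 50%:50%.
X_maj = np.random.rand(560, 5)
X_min = np.random.rand(440, 5)
X_new = smote_like(X_min, n_new=len(X_maj) - len(X_min))
print(len(X_min) + len(X_new), len(X_maj))  # 560 560
```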
It is challenging to determine, without experiments, which classifier among CATB, GB, and RF is best as the Meta learner in the stacking ensemble learning method. Due to scoring the highest accuracy, Meta-RF was found better for DS-1, while Meta-CATB was found better for both DS-2 and DS-3, as mentioned in the methodology section.
In principle, lowering the number of less informative features can reduce the model computing time while increasing model accuracy [44]. Our experimental results demonstrate that the model hyper-parameter values, in addition to the use of the UFS technique, are crucial in determining how accurately and quickly a model can be computed. The overall experiments are presented as follows.
4.1 Each Ensemble Learner's Performance Comparisons Before and After the UFS Applied to DS-1
As per the UFS technique, 69 of the 87 features in DS-1 were judged to be statistically significant for phishing website detection. Moreover, the results of RF and Meta-RF suggest that using the UFS technique can boost accuracy while maintaining the same computing time, while the results of CATB, GB, and hard vote suggest that it can boost accuracy while reducing the computing time.
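The UFS step can be sketched with scikit-learn's SelectKBest; the ANOVA F-test scoring function below is an assumed choice (the study does not restate its scoring function here), and the data are a synthetic stand-in for DS-1's 87 features:

```python
# Illustrative uni-variate feature selection: keep the top 69 of 87 features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=1000, n_features=87, n_informative=20,
                           random_state=0)

# Each feature is scored independently against the label; only the
# k highest-scoring features are retained.
ufs = SelectKBest(score_func=f_classif, k=69)
X_top = ufs.fit_transform(X, y)
print(X.shape, "->", X_top.shape)  # (1000, 87) -> (1000, 69)
```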
Table 4 shows the suitable model hyper-parameter values, accuracy, and computing time for each ensemble learner both before and after the UFS technique was applied to DS-1. Results in bold green indicate a performance improvement following the application of the UFS technique, results in bold blue indicate the same model performance before and after the UFS technique, and results in bold black indicate which model hyper-parameter values changed after applying the UFS technique.
The experimental findings in Table 4 demonstrate that the UFS technique marginally increased the accuracy of each ensemble learner without adversely affecting the model's computation time. Moreover, when used on DS-1, the CATB classifier slightly outperformed the other ensemble learners (Meta-RF, hard vote, GB, and RF, respectively) in terms of scoring the highest accuracy. For instance, after the UFS technique was applied to DS-1, the CATB's accuracy increased by 0.17% (from 97.73% to 97.9%), while its computational time was reduced by 2 seconds (from 12 seconds to 10 seconds). These performances were attained after employing the top 69 features and adjusting the CATB iteration value from 346 to 350. The aforementioned CATB accuracy was 0.48% higher than the Meta-RF accuracy, 0.52% higher than the hard vote accuracy, 0.74% higher than the GB accuracy, and 1.36% higher than the RF accuracy.
Following the application of the UFS technique to DS-1:
•
Changing the CATB iteration value from 346 to 350 increased the CATB accuracy by 0.17% while reducing the computational time by 2 seconds.
•
Changing the GB n_estimators value from 73 to 69 increased the GB accuracy by 0.09% while reducing the computational time by 2 seconds.
•
Changing the RF n_estimators value from 45 to 32 and the random_state value from 0 to 12 increased the RF accuracy by 0.04%, with the same computational time (12 seconds).
•
Changing the CATB iteration value from 346 to 350 and learning_rate value from 0.1 to 0.3; the GB n_estimators value from 73 to 69 and random_state value from 12 to 3; and the RF n_estimators value from 45 to 32 and random_state value from 0 to 12 increased the Meta-RF accuracy by 0.18%, with the same computational time (47 seconds).
•
Changing the CATB iteration value from 346 to 350; the GB n_estimators value from 73 to 69 and random_state value from 12 to 3; and the RF n_estimators value from 45 to 30 and random_state value from 0 to 12 increased the hard vote accuracy by 0.18% while reducing the computational time by 1 second.
Despite demonstrating the slowest computational time (47 seconds) of all the ensemble learners, Meta-RF attained the second-highest accuracy (97.42%) when used on DS-1. Despite scoring the lowest accuracy (96.54%) on DS-1, the RF classifier had a quicker computational time (4 seconds), both before and after the application of the UFS technique, than all the remaining ensemble learners. This RF computational time (4 seconds) was nearly 12 times faster than the Meta-RF computational time, more than 4 times faster than the hard vote and GB computational times, and more than 2 times faster than the CATB computational time. As stated in [45], the RF incurs a low computational cost: its cost at test time is O(T·D), where T is the number of trees in the forest and D is the maximum depth (without the root), and the cost can be even lower if the trees are unbalanced [45]. This could be the main reason that the RF exhibited the fastest computational time.
Figure 3 shows the accuracy of each ensemble learner before and after the UFS technique was used on DS-1, while Figure 4 shows the accuracy, f1-score, precision, and recall of each ensemble learner after the UFS technique was used on DS-1.
4.2 Each Ensemble Learner's Performance Comparisons Before and After the UFS Applied to DS-2
As per the UFS technique, 27 of the 31 features in DS-2 were judged to be statistically significant for phishing website detection. Moreover, the results of RF suggest that using the UFS technique can boost accuracy while maintaining the same computing time, while the results of CATB, GB, hard vote, and Meta-CATB suggest that it can boost accuracy while reducing the computing time.
Table 5 shows the suitable model hyper-parameter values, accuracy, and computing time for each ensemble learner both before and after the UFS technique was applied to DS-2. Results in bold green indicate a performance improvement following the application of the UFS technique, results in bold blue indicate the same model performance before and after the UFS technique, and results in bold black indicate which model hyper-parameter values changed after applying the UFS technique.
The experimental findings in Table 5 demonstrate that the UFS technique marginally improved the accuracy of each ensemble learner while reducing the computation time. Moreover, RF, CATB, and hard vote competed strongly with each other to yield superior accuracy, with RF slightly beating the others. For instance, after the UFS technique was applied to DS-2, the RF's accuracy increased by 0.19% (from 97.18% to 97.37%). This performance was attained after employing the top 27 features and adjusting the RF max_depth value from 19 to 18, the n_estimators value from 40 to 46, and the random_state value from 0 to 6. The aforementioned RF accuracy (97.37%) was 0.01% higher than the CATB accuracy, 0.03% higher than the hard vote accuracy, 0.1% higher than the GB accuracy, and 0.19% higher than the Meta-RF accuracy.
Following the application of the UFS technique to DS-2:
•
Changing the CATB n_estimators value from 48 to 66 and the random_state value from 12 to 42 increased the CATB accuracy by 0.19% while reducing the computational time by 4 seconds.
•
Changing the GB n_estimators value from 56 to 50 and the random_state value from 2 to 3 increased the GB accuracy by 0.31% while reducing the computational time by 30 seconds.
•
Changing the CATB max_depth value from 10 to 9, iteration value from 48 to 50, and random_state value from 12 to 42; the GB n_estimators value from 56 to 50 and random_state value from 2 to 3; and the RF max_depth value from 20 to 18, n_estimators value from 40 to 46, and random_state value from 12 to 6 increased the Meta-RF accuracy by 0.28% while reducing the computational time by 149 seconds.
•
Changing the CATB iteration value from 48 to 62 and random_state value from 12 to 42; the GB n_estimators value from 56 to 50 and random_state value from 2 to 3; and the RF max_depth value from 19 to 18, n_estimators value from 36 to 42, and random_state value from 0 to 6 increased the hard vote accuracy by 0.12% while reducing the computational time by 32 seconds.
As on DS-1, the RF classifier attained the fastest computational time when used on DS-2. The RF computational time (12 seconds) on DS-2 was nearly 42 times faster than the Meta-RF computational time, more than 10 times faster than the hard vote computational time, more than 6 times faster than the GB computational time, and nearly 3 times faster than the CATB computational time. As on DS-1, the CATB classifier attained the second quickest computing time (34 seconds) when used on DS-2.
Figure 5 shows the accuracy of each ensemble learner before and after the UFS technique was used on DS-2, while Figure 6 shows the accuracy, f1-score, precision, and recall of each ensemble learner after the UFS technique was used on DS-2.
4.3 Each Ensemble Learner's Performance Comparisons When Used on DS-3
DS-3 contains only URL and site-content feature categories, whereas DS-1 and DS-2 both include four feature categories (URL, web content/source code, domain, and page rank features), and DS-3 has fewer instances (10,000) than DS-1 (11,430) and DS-2 (11,055). For these reasons, the study judged each ensemble learner's performance on DS-3 without the UFS technique.
Our experimental findings reveal that each ensemble learner attained more than 98% accuracy when used on DS-3, with CATB slightly beating the others by scoring the highest accuracy (98.59%). Table 6 shows the suitable model hyper-parameter values, accuracy, and computing time for each ensemble learner without the UFS technique on DS-3.
The experimental findings in Table 6 show that CATB, hard vote, GB, and Meta-CATB competed strongly with each other to yield the highest accuracy, with CATB slightly beating the others. The CATB classifier came in first with the highest accuracy of 98.59%, which was 0.04% higher than the accuracy of the hard vote, 0.07% higher than that of the GB, 0.08% higher than that of the Meta-CATB, and 0.34% higher than that of the RF.
While the CATB classifier came in first with 98.59% accuracy, the hard vote ensemble attained the second-best accuracy (98.55%), the GB the third-best (98.52%), the Meta-CATB the fourth-best (98.51%), and the RF the fifth-best (98.25%).
As on DS-1 and DS-2, the RF classifier came in first by attaining the fastest computational time (13 seconds) when used on DS-3, and the CATB classifier attained the second quickest computing time (28 seconds). However, as on DS-2, Meta-CATB attained the slowest computational time (622 seconds) of all the ensemble learners. Figure 7 shows the accuracy, f1-score, precision, and recall of each ensemble learner when used on DS-3.
4.4 Summary of Key Experimental Findings
The suitability of CATB for big data and its superiority in many academic fields, including traffic engineering, finance, meteorology, medicine, electrical utilities, astronomy, marketing, biology, psychology, and biochemistry, were exhibited in [5]. Our study introduced the cat-boost classifier for phishing website detection, and, as per our experimental findings, CATB demonstrated scalable, consistent, and superior accuracy across a variety of phishing website datasets compared to the remaining ensemble learners.
As can be seen in Table 7, compared to the remaining ensemble learners, CATB exhibited superior accuracy when used on DS-1 and DS-3, attaining 97.9% and 98.59% accuracy, respectively. CATB attained the second-best accuracy (97.36%) when used on DS-2, just 0.01% below the best accuracy (97.37%) attained by RF on DS-2.
The CATB accuracy (97.9%) on DS-1 was 1.07% higher than the best accuracy (96.83%) attained by RF in the study [14], which also used DS-1. The CATB accuracy (98.59%) on DS-3 was 3.99% higher than the best accuracy (94.6%) attained by RF in the study [26] and 0.59% higher than the best accuracy (98%) attained by RF in the study [48], both of which also used DS-3. The aforementioned CATB accuracy on DS-2 was 0.36% higher than the best accuracy (97%) attained by RF in the study [19], which also used DS-2. Our experimental results are consistent with those of [47], which show that, when used on various datasets, CATB outperforms other publicly available boosting algorithms in terms of accuracy.
RF has been the most widely used classifier for phishing website detection and exhibited the top phishing website detection accuracy in 17 of 30 systematically reviewed studies [11]. In our study, however, RF demonstrated the lowest accuracy on DS-1 and DS-3 compared to the remaining ensemble learners (CATB, Meta-RF, hard vote, and GB), although it exhibited superior performance on DS-2 by attaining 97.37% accuracy. The RF accuracy on DS-2 (97.37%) was 0.37% higher than the best accuracy (97%) attained by RF in the study [19], which also used DS-2. The study [14] conducted experiments using DS-1 and stated that classifiers such as RF, SVM, and decision trees are quite sensitive to the order of attributes in the datasets. This may be the main reason that RF demonstrated lower accuracy on DS-1 compared to CATB, Meta-RF, hard vote, and GB.
When applied to DS-1, DS-2, and DS-3, RF was the fastest classifier in our study compared to all the remaining ensemble learners, with computing times of 4, 12, and 10 seconds, respectively. As stated in [45], the RF incurs a low computational cost: its cost at test time is O(T·D), where T is the number of trees in the forest and D is the maximum depth (without the root), and the cost can be even lower if the trees are unbalanced [45]. These may be the main reasons that RF exhibited a faster computational time than all the remaining ensemble learners.
While RF came in first, CATB was found to be the second fastest classifier in our study when applied to DS-1, DS-2, and DS-3, with computational times of 10, 34, and 28 seconds, respectively. As stated in [5, 47], the base predictors in CATB are oblivious decision trees, also known as decision tables; the term "oblivious" refers to the use of the same splitting criterion throughout a level of the tree. These balanced trees allow for a significant speedup in execution during testing and are less prone to over-fitting [5, 47]. As stated in [47], the computing time of CATB depends on the number of target statistics (TS) to be computed in each iteration and on the set of candidate tree splits considered at each iteration. These may be the main reasons that CATB exhibited a faster computational time than GB, Meta-RF, and Meta-CATB.
In this study, Meta-RF and Meta-CATB demonstrated the slowest computational times of all the ensemble learners when applied to DS-1, DS-2, and DS-3 (47, 501, and 622 seconds, respectively). This result was expected because hybrid ensemble learning methods are known to require a lot of computational time, as they involve multiple ML algorithms [7, 9, 10].
Moreover, the application of the UFS technique increased the accuracy of each ensemble learner on DS-1 and DS-2 without adversely affecting the models' computation times. Our experimental results demonstrate that the model hyper-parameter values, in addition to the use of the UFS technique, are crucial in determining how accurately and quickly a model can be computed. Table 7 exhibits the summary of each ensemble learner's performance when used on DS-1, DS-2, and DS-3.
5 Conclusion and Future Work
One of the most prevalent, dynamic, and unlawful online practices is website phishing, which uses fake versions of benign websites to build trust in online users, steal sensitive information, and deliver malicious software such as ransomware. As a result of successful website phishing, vulnerable institutions such as banks, e-commerce, education, healthcare, broadcast media, and agriculture commonly encounter loss of productivity, reduced competitiveness, risks to their survival, and compromises to national security. To address these concerns, the intervention of an intelligent phishing website detection model powered by ML is needed. However, model over-fitting and under-fitting are the core challenges that traditional or individual ML algorithms encounter as a result of noise, variance, and bias in datasets, and ensemble learning methods are found to be state-of-the-art solutions for numerous classification tasks that overcome these concerns. That is why our proposed study conducted rigorous experiments on both single and hybrid ensemble learning methods, namely CATB, GB, RF, Meta-RF, Meta-CATB, and hard vote.
Furthermore, the scalability and performance consistency of a phishing website detection system across numerous datasets are critical to combating the various variants of phishing website attacks. That is why the proposed study used three distinct phishing website datasets to train and test each ensemble learning method and identify the best-performing one in terms of accuracy and model computational time. Because they combine multiple ML algorithms, ensemble learning methods are known to require a lot of computational time. To address this concern, we applied the UFS technique and obtained promising results. Our experimental findings demonstrate two core benefits of applying the UFS technique: (i) it can boost accuracy while maintaining the same computing time, and (ii) it can boost accuracy while reducing the computing time.
Our study introduced the cat-boost classifier for phishing website detection, and our experimental findings show that CATB demonstrated scalable, consistent, and superior accuracy across a variety of phishing website datasets (DS-1, DS-2, and DS-3). Despite scoring the lowest accuracy on DS-1 and DS-3, the RF classifier exhibited the highest accuracy when used on DS-2. Meta-RF attained the second-best accuracy when used on DS-1 despite having the slowest computational time.
In terms of model computational time, the RF classifier was found to be the fastest when applied to all datasets (DS-1, DS-2, and DS-3), while the CATB classifier was the second quickest across all datasets. Meta-RF and Meta-CATB, on the other hand, had the slowest computational times across all datasets, even though the UFS technique helped to reduce their computing times. Our experimental results demonstrate that the model hyper-parameter values, in addition to the use of the UFS technique, are crucial in determining how accurately and quickly a model can be computed.
To address the limitations of the current study and undertake a comparative performance analysis, we advocate including appropriate DL algorithms, mobile-based phishing, larger datasets, and other feature selection techniques in future work.