
Single and Hybrid-Ensemble Learning-Based Phishing Website Detection: Examining Impacts of Varied Nature Datasets and Informative Feature Selection Technique

Published: 06 October 2023

Abstract

To tackle issues associated with phishing website attacks, this study conducted rigorous experiments on the RF, GB, and CATB classifiers. Since each of these classifiers is an ensemble learner in its own right, we also integrated them into stacking and majority vote architectures to create hybrid-ensemble learners. Because ensemble learning methods are known for their high computational time costs, the study applied the UFS technique to address this concern and obtained promising results. Since the scalability and performance consistency of a phishing website detection system across numerous datasets is critical to combating the many variants of phishing website attacks, we used three distinct phishing website datasets (DS-1, DS-2, and DS-3) to train and test each ensemble learning method and identify the best-performing one in terms of accuracy and model computational time. Our experimental findings reveal that the CATB classifier demonstrated scalable, consistent, and superior accuracy across the three datasets (97.9% accuracy on DS-1, 97.36% on DS-2, and 98.59% on DS-3). In terms of model computational time, the RF classifier was the fastest on all datasets, while the CATB classifier was the second fastest.

1 Background and Study Motivations

One of the most prevalent illegal online practices is the phishing website [3], which uses fake versions of benign websites to win the trust of online users and steal their login credentials, social security numbers, bank account details, and credit card details. It is also used to distribute harmful software such as ransomware via Facebook, Twitter, SMS, and other channels. As a result of successful phishing website attacks, vulnerable sectors such as banking, e-commerce, education, healthcare, broadcast media, and agriculture commonly experience loss of productivity, reduced competitiveness, risks to their survival, and compromises to national security [8].
It is impossible to prevent cyber criminals from devising phishing websites, but the threat can be mitigated by detecting a particular website as fraudulent and warning Internet users to take the necessary precautions before they fall for phishing assaults [22]. A URL can take users anywhere on the Internet, yet checking the trustworthiness of its structure usually goes unnoticed. Inspecting URL structure alone cannot guarantee safe web security unless hidden malicious web contents are also verified. The big issue here is whether Internet-dependent users have access to, and know-how of, website source code [11]. For the investigation of such cybercrime and the classification of its subtypes, technical interventions assisted by emerging technology paradigms such as artificial intelligence are needed [12].
According to the 2022 phishing activity trends surveys by the Anti-Phishing Working Group (APWG), the number of distinct phishing websites is growing at an alarming rate. For example, the APWG detected 1,025,968 distinct phishing websites in the 1st quarter of 2022, 1,097,811 in the 2nd quarter, 1,270,883 in the 3rd quarter, and 1,350,037 in the 4th quarter [34-37]. Figure 1 exhibits the month-wise details of the aforementioned reports.
Fig. 1. APWG 2022 Quarter 1-Quarter 4 unique phishing website attack report [34-37].
Fig. 2. Proposed phishing website detection architecture.
Most modern browsers, including Firefox, Chrome, and Internet Explorer, as well as some anti-virus programs, make use of blacklisting and whitelisting approaches to detect and block phishing websites [30, 31], even though these approaches are unable to detect newly created phishing websites [8, 30, 31]. To overcome these concerns, numerous phishing website detection approaches have been proposed by the scientific community. For example, the heuristic (rule-based) approach detects phishing websites by extracting features from web contents and can recognize fresh phishing website attacks [8, 30, 31], although an attacker can bypass the heuristic parameters once he/she has knowledge of the heuristic algorithms [30]. Visual similarity is another approach that detects phishing websites by comparing screenshots or images of benign websites with phishing websites to find a certain similarity ratio [30], and it can detect fresh website attacks [8]. However, this approach has the following limitations: (i) comparing the entire images of benign websites against phishing websites requires more computational time and more storage space to save images [8, 30], (ii) comparing animated web pages against a phishing website can result in a high false negative rate due to a low similarity score, and (iii) it fails to detect phishing websites when the website background is slightly modified [30].
Because of their promising performance in preventing and detecting new attacks by automatically discovering hidden patterns in huge datasets, machine learning (ML) and deep learning (DL) approaches have now received wider acceptance in the domain of cyber security, particularly in malware detection and classification, intrusion detection, spam detection, and phishing detection, although the success of these approaches relies on the nature and characteristics of the datasets and features used [13-15].
Compared to DL, ML algorithms can detect phishing websites much faster and do not need specific hardware like Graphics Processing Units (GPUs) for implementation [3]. Many domains already make use of ML, which is a key technology for both existing and future information systems. The use of ML in cyber security, however, is still in its infancy, demonstrating a substantial gap between research and practice [22]. Over 90% of companies already use some form of AI/ML in their defensive measures, yet most of these solutions currently employ "unsupervised" techniques and are primarily used for anomaly detection; such an observation demonstrates a stark gap between theory and application, especially when compared to other fields where ML has already established itself as a valuable asset [22].
ML [44] has been widely used in a range of areas, and one of its most notable applications is classification. It has been applied to a variety of tasks, including e-mail spam filtering, remote sensing, crop classification, sports result prediction, seismic event classification, biomedical informatics, machine fault detection, and power monitoring. Despite their multifaceted benefits, classification algorithms have a number of practical constraints, especially when dealing with complicated data (such as unstructured data and streaming data) [1]. To begin with, the majority of real-world data have a variety of properties, and the data may contain noise, pointless features, redundant features, and abnormalities [44]. Model over-fitting and under-fitting, caused by noise, variance, and bias in datasets, are the core challenges that traditional or individual ML algorithms encounter [1, 2]. To overcome these concerns, ensemble learning methods have emerged as state-of-the-art solutions for numerous classification tasks.
The ensemble learning method combines the ideal solutions or collective skills of multiple separate models to provide decisions that are more accurate than those yielded by individual ML algorithms [1, 9, 10, 16, 18]. The reasons for using an ensemble classifier are its enhanced predictive performance, its ability to build a stable model, and its capability of removing class imbalance [2, 20]. Ensemble learning methods are mainly categorized into three classes, namely bagging, boosting, and stacking. Brief details of these methods are presented as follows.
Bagging (bootstrap aggregation) ensemble method: in this method, the dataset is randomly partitioned, the base estimators are built in parallel, and sampling with replacement is carried out concurrently. The final decision is the aggregate of each classifier's average prediction result. This approach allows individual training sets to share identical instances and is used to reduce bias and variance in predicted results [2]. This method is advised when the data size is insufficient [20]. Random Forest is a well-known bagging-type ensemble learner [46].
Boosting (iterative and sequential) ensemble learning method: in this method, the instances of the training dataset are reweighted progressively and base learners are generated sequentially [2, 20]. Data incorrectly classified by prior base learners are given more weight [20]. This method follows an iterative strategy: ensemble members are added to fix previous models' inaccurate predictions, and the final output is a weighted average of the predictions. The initial weight of each instance is set to w_i = 1/N, where N is the number of training instances; the next model attempts to rectify errors produced by the prior model, and the process continues until the model can accurately forecast the entire dataset or reaches its maximum prediction capability [2, 20]. This ensemble method is capable of reducing variance [2]. Gradient boosting and CatBoost classifiers are among the boosting-type ensemble learners.
Stacking (Meta ensemble learning method): in this method, a number of diverse base learners are combined at the first layer, and then a Meta learner is trained to combine the predictions of base learners at the next layer [7, 20]. The purpose of the Meta learner is to determine the extent to which the prediction outcomes of the base learners may be combined, to learn any misclassification patterns discovered by the base classifiers [7], and to rectify the dataset's misclassification [20].
Taking into account the multifaceted benefits of ensemble learning approaches, such as enhanced predictive performance, building a stable model, and the capability of removing class imbalance [2, 20], we implemented single and hybrid ensemble learning algorithms to detect phishing websites. Single ensemble learners are classifiers that are ensembles on their own; they combine the prediction results of multiple decision trees to yield the final output. Random forest, gradient boost, and cat boost are a few examples. Hybrid ensemble learners are classifiers that integrate numerous single ensemble learners. Stacking and majority voting are two examples, and they are used here to combine the results of the random forest, gradient boost, and cat boost classifiers.
This article introduces the cat-boost classifier for two reasons. The first is that, despite being the most recent version of the boosting ML algorithm, it was not considered by 30 recently reviewed ML-based phishing website detection research works [15] or by the phishing website detection research works using ensemble learning methods [9, 10, 16-18, 23]. The second is that, despite being a good option that has demonstrated superior performance in many academic fields [5], including traffic engineering, finance, meteorology, medicine, electrical utilities, astronomy, marketing, biology, psychology, and biochemistry, the cat-boost classifier is not widely used in the field of cyber security [5]. Hence, the proposed study is one attempt to address the aforementioned gaps.
The main contributions of the proposed study are presented as follows:
Phishing website attack is a dynamic problem. Ensemble learning methods are among the cutting-edge solutions for detecting newly devised phishing websites. The effectiveness of these methods, however, is significantly impacted by the nature and characteristics of datasets. To address these concerns, we used three different trustworthy public datasets to train and test each ensemble learner and then determined which one performed best across these datasets in terms of accuracy and train-test computational time.
Dealing with the curse of dimensionality is one of the major challenging aspects of devising an accurate and quick predictive ML model. These issues could be solved by eliminating noisy, unnecessary, and duplicated features and only applying informative features [38]. To address these concerns, we applied the Uni-variate feature selection (UFS) technique on each ensemble learning classifier and found encouraging results.
To the best of our knowledge, the combined use of the CATB, GB, and RF for phishing website detection is one of the first attempts.
It is difficult to determine, without experiments, which classifier among CATB, GB, and RF is optimal as the Meta learner in the stacking ensemble learning approach. In terms of scoring better accuracy, Meta-RF was found suitable for DS-1, while Meta-CATB was found suitable for both DS-2 and DS-3.
Our proposed approach attained better phishing website detection accuracy compared to other studies that used the same datasets.
The remainder of this article is organized as follows: literature review; materials and methods; results and discussions; conclusions and future remarks; acknowledgments; and references.

2 Literature Review

2.1 Related Works

In this study, recent, pertinent, and reliable related research works are rigorously reviewed to look for open issues and to exhibit the study's unique contributions.
Zemir et al. [9] employed two types of Meta ensemble learning methods to detect phishing websites. The first combines KNN and RF as base learners, with XGBoost as the Meta learner. The second combines ANN and RF as base learners, with XGBoost as the Meta learner. Each ensemble learner was trained and tested on a single dataset that contains 11,055 records and 32 features. The study used 10-fold cross-validation and PCA for dimension reduction. The experimental findings reveal that the Meta learner combining RF and ANN attained the highest accuracy of 97.4%, 96% precision, 98.1% recall, and a 97% f1-score, with a computational time of 105.32 seconds [9].
Subasi and Kremic [10] employed ensemble learning methods. The study used 10-fold cross-validation and a single dataset that contains 11,055 records and 30 features to train and test each ensemble learner. The experimental findings reveal that the ensemble learner combining SVM and AdaBoost attained the highest accuracy of 97.61%, with a computational time of 8193.72 seconds [10]. Pandey et al. [32] employed an ensemble learning method that combines RF and SVM. The experimental findings reveal that their proposed approach attained the highest accuracy of 94%.
Othman and Hassan [16] employed stacking ensemble methods that combine Naive Bayes, Decision Trees, KNN, RF, Quadratic Discriminant Analysis, and Linear Discriminant Analysis as base learners, with Logistic Regression as the Meta learner. The study used 10-fold cross-validation and two different datasets to train and test each ensemble learner. The experimental findings reveal that their proposed approach attained 97.49% accuracy on the first dataset and 98.69% accuracy on the second. Similarly, Shikalgar et al. [18] employed a stacking ensemble method that combines SVM and XGBoost as base learners and RF as the Meta learner. The study used a single dataset that contains 2,905 records and 9 features. The experimental findings reveal that their proposed approach attained 85.6% accuracy.
Karim et al. [17] applied a majority vote ensemble method that combines Linear Regression, SVM, and DT. The study used a single dataset that contains 11,054 records and 33 features, with a 70%/30% train-test split ratio. The experimental findings reveal that their proposed approach attained 98.12% accuracy following the application of the canopy feature selection technique. Similarly, Ubing et al. [20] employed a majority vote ensemble method that combines Gaussian Naive Bayes, SVM, KNN, Logistic Regression, MLP, ANN, RF, and GB. The study used a single dataset that contains 5,126 records (55% phishing and 45% benign) and 30 features, with an 80%/20% train-test split ratio. The experimental findings reveal that their proposed approach attained the highest accuracy of 95.4%. Maini et al. [23] applied a majority vote ensemble method that combines RF, AdaBoost, XGBoost, Decision Tree, SVM, KNN, Logistic Regression, and Naïve Bayes. The experimental findings reveal that their proposed approach attained the highest accuracy of 93.6%.
Tabassum et al. [33] employed ensemble learning approaches that combine the SVM, RF, DT, and XGBoost. The study used a single dataset that contains 11,055 records (4,898 benign and 6,157 phishing) and 30 features and used 75%/25% dataset train-test split ratios. The experimental findings reveal that their proposed approach attained the highest accuracy of 98.28%.
While many efforts have been made to identify phishing websites using various ML algorithms, researchers have neglected to apply the CATB classifier, despite the fact that it is the most recent advancement in ML. A single dataset was used in the majority of the studies [9, 10, 17, 18, 20, 23, 32] to train ensemble learners, and only a few studies [9, 10] carried out a run-time analysis of the proposed model. Hence, addressing the aforementioned gaps is vital to building an efficient and effective phishing website detection model.

3 Materials and Methods

3.1 Proposed Phishing Website Detection Architecture

Figure 2 exhibits our proposed architecture.

3.2 Dataset Details

In this study, three distinct public datasets were used to train and test each ensemble learning method in order to empirically validate their scalability and performance consistency across datasets. The first dataset (DS-1) has 87 features and 11,430 records. The second dataset (DS-2) has 31 features and 11,054 records. The third dataset (DS-3) has 48 features and 10,000 records. DS-1 and DS-2 both include four feature categories (URL, web content/source code, domain, and page rank features), while DS-3 includes only the URL and web content/source code features. Brief details of each dataset are presented in Table 1.
| Datasets | # of instances | # of features | Category of website features included | Label | Dataset sources |
| --- | --- | --- | --- | --- | --- |
| DS-1 | 11,430 | 87 | URL, web content/source code, domain, and page rank | 1 for phish and 0 for benign | Mendeley, constructed by Hannousse & Yahiouche [14]: https://data.mendeley.com/datasets/c2gw7fy2j4 |
| DS-2 | 11,054 | 31 | URL, web content/source code, domain, and page rank | 1 for phish and -1 for benign | Kaggle: https://www.kaggle.com/datasets/eswarchandt/phishing-website-detector |
| DS-3 | 10,000 | 48 | URL and web content/source code | 1 for phish and 0 for benign | Mendeley, constructed by Chiew et al. [26]: https://www.sciencedirect.com/science/article/abs/pii/S0020025519300763?via%3Dihub |

Table 1. Dataset Details
| Feature rank | Feature identifier | Feature description |
| --- | --- | --- |
| F1 | length_url | Website is phishing if it contains a long URL (>75 characters), suspicious if the URL length is >54, otherwise benign [14, 26-29]. |
| F2 | length_hostname | Website is phishing if it contains a large number of characters in the hostname [14, 26]. |
| F3 | ip | Website is phishing if an IP address exists in the parts of the hostname/domain name [14, 26-29]. |
| F4 | nb_dots | Website is phishing if the number of dots in the domain name is >2 [14, 26, 27, 29]. |
| F5-F16 | nb_hyphens, nb_at, nb_qm, nb_and, nb_eq, nb_underscore, nb_percent, nb_slash, nb_star, nb_colon, nb_semicolon, nb_dollar | Website is phishing if any of the following special characters exist in the parts of the domain and subdomain: '-', '@', '?', '&', '=', '_', '%', '/', '*', ':', ';', and '$' [14]. |
| F17-F18 | www, com | Website is phishing if common terms like 'www' and 'com' occur more than once in the URL [14]. |
| F19 | nb_dslash | Website is phishing if a double slash ('//') is located beyond the 7th position counting characters from 'HTTP/HTTPS' [14, 28, 29] or if it exists in other parts of the URL such as the domain, subdomain, or path [14]. |
| F20 | http_in_path | Website is phishing if 'HTTP' exists in the parts of the URL path [14]. |
| F21 | https_token | Website is phishing if an 'https' token is present in the domain name [14, 26-29]. |
| F22-F23 | ratio_digits_url, ratio_digits_host | Website is phishing if a large number of digits exist in the URL and hostname [14]. |
| F24-F25 | tld_in_path, tld_in_subdomain | Website is phishing if the top-level domain (TLD) exists in the URL path or subdomains [14]. |
| F26 | abnormal_subdomain | Website is phishing if the URL contains patterns like 'w[w]?[0-9]*' instead of 'www' [14]. |
| F27 | nb_subdomains | Website is phishing if the number of subdomains is >1 [14, 27]. |
| F28 | prefix_suffix | Website is phishing if a hyphen (-) exists as a prefix/suffix in any part of the domain names [14, 26, 28, 29]. |
| F29 | shortening_service | Website is phishing if it uses shortening services (tiny URLs) like goo.gl and bit.ly [14, 26-29]. |
| F30 | nb_redirection | Website is benign if the number of page redirections is <=1, suspicious if >=2 and <4, otherwise phishing [14, 26-29]. |
| F31 | nb_external_redirection | Website is phishing if the number of external page redirections is >4 [14, 26-29]. |
| F32-F41 | length_words_raw, shortest_words_raw, shortest_word_host, shortest_word_path, longest_words_raw, longest_word_host, longest_word_path, avg_words_raw, avg_word_host, avg_word_path | NLP and raw word features can help detect phishing websites [14]. Hence, considering the overall length of words in the raw URL, the shortest words in the raw URL, host, and path, the longest word in the raw URL, host, and path, and the average words in the raw URL, host, and path helps to detect phishing websites [14]. |
| F42 | phish_hints | Phishing URLs may contain sensitive words to build trust in website visitors. The total number of occurrences of the following words in the URL is considered a phishing indicator: 'login', 'wp', 'includes', 'content', 'site', 'admin', 'images', 'CSS', 'js', 'Alibaba', 'my account', 'dropbox', 'signin', 'themes', 'view', and 'plugins' [14]. |
| F43-F45 | domain_in_brand, brand_in_subdomain, brand_in_path | Website is phishing if the domain is found in the brand, or the brand is found in the subdomain or path [14]. |
| F46 | suspecious_tld | Website is phishing if suspicious top-level domains (TLDs) are found in the URL. Suspicious TLDs can be collected from https://www.broadcom.com/ and https://www.spamhaus.org [14]. |
| F47 | statistical_report | Website is benign if the hostname does not exist in the lists of top phishing domains or phishing IP ranks [26-29]. |
| F48 | nb_hyperlinks | Website is phishing if no hyperlinks are found in the web content [14, 26, 27]. |
| F49-F50 | ratio_intHyperlinks, ratio_extHyperlinks | Website is phishing if the ratio of external hyperlinks pointing to the target website is higher than the ratio of internal hyperlinks pointing to the same base domain of the website [14]. |
| F51 | nb_extCSS | Website is phishing if it incorporates an external CSS file (foreign link) pointed to the target website [14]. |
| F52 | ratio_extRedirection | Website is phishing if the ratio of external redirections is higher than the ratio of internal redirections [14]. |
| F53 | ratio_extErrors | Phishing web pages usually contain fake hyperlinks; therefore, checking all hyperlinks of web pages and counting external hyperlink connection errors is useful for detecting phishing websites [14]. |
| F54 | external_favicon | Website is phishing if its graphical image (favicon) is loaded from an external domain [14, 26-29]. |
| F55 | links_in_tags | It is typical for <Link> tags on benign websites to contain links leading to web pages inside the same domain as the URL. To detect phishing, the percentage of internal links in <Link> tags is taken into account [14]. |
| F56 | ratio_intMedia | Benign websites mostly use media (audio, video, images) stored and retrieved from the same domain [14]. Website is phishing if a large share (>61%) of image/audio/video links are loaded from an external source (domain) [27, 28]. |
| F57 | ratio_extMedia | Website is benign if <22% of image/audio/video links are loaded from an external source (domain), suspicious if >22% and <=61%, otherwise phishing [27-29]. |
| F58 | popup_window | Website is benign if the popup window does not contain text fields to collect sensitive data from online users [14, 26-29]. |
| F59 | safe_anchor | Website is phishing if its <a> tags contain one of these links: 'mailto', 'javascript', or '#' [14]. |
| F60 | empty_title | Website is phishing if the website title is not found in the <title> tag of the HTML source code [14]. |
| F61 | domain_in_title | Website is phishing if the domain name of the website is found in the <title> tag instead of the parts of the webpage title [14]. |
| F62 | domain_with_copyright | Website is benign if the domain name is found within the copyright logo [14]. |
| F63 | whois_registered_domain | Website is phishing if the URL with domains does not match any record in the WHOIS database [14]. |
| F64 | domain_registration_length | Website is benign if the domain renewal amount is regularly paid for many years (it has a long expiry date) [14, 26]; it is a phishing indicator if the domain expires in <=1 year [27-29]. |
| F65 | domain_age | Website is phishing if the domain age is short-lived or purchased for <=6 months [14, 26, 29]. |
| F66 | web_traffic | If the website has the fewest visitors according to the Alexa web traffic report [14, 27, 28], it may be a phishing website. A website is considered benign if its ranking is among the top 100,000 [26, 28, 29]. |
| F67 | dns_record | Website is phishing if the DNS record of the URL domain is found empty in the WHOIS database [14, 26-29]. |
| F68 | google_index | Website is phishing if the website is not indexed and retrieved by the Google search engine [14, 26-29]. |
| F69 | page_rank | Phishing websites have the lowest page rank scores as per Alexa [26] and Open PageRank [14]. The weight of a page rank ranges from 0 to 1 [26]. Website is phishing if its page rank weight is less than 0.2 [28, 29]. |

Table 2. Descriptions of the Top Phishing Website Predictive Features in DS-1

3.3 Data Preprocessing

Since any ML algorithm's success depends on using cleaned, representative datasets and a choice of informative features [13-15, 21, 39], handling missing values and outliers and eliminating duplicate data are among the core tasks carried out at the data preprocessing stage. Each feature in DS-1, DS-2, and DS-3 was scaled and centered by subtracting the feature's mean and dividing by its standard deviation. To accomplish this task, the StandardScaler class was imported from "sklearn.preprocessing" and applied to the data.
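To make the scaling step concrete, the following is a minimal sketch of it in Python, assuming the cleaned dataset is loaded into a pandas DataFrame whose label column is named status (the file and column names are illustrative, not the study's exact code):

```python
# A sketch of the feature scaling step; file and column names are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("dataset_phishing.csv")   # e.g., DS-1
X = df.drop(columns=["status"])            # predictive features
y = df["status"]                           # 1 = phish, 0 = benign

scaler = StandardScaler()                  # subtracts each feature's mean and divides by its std
X_scaled = scaler.fit_transform(X)
```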
Using an imbalanced dataset can lead to biased model results because the model will favor the majority classes. To address this concern, the study applied a popular dataset balancing strategy called the Synthetic Minority Over-sampling Technique (SMOTE) to change the imbalanced class ratio of DS-2 from 56%:44% to 50%:50%. SMOTE generates new synthetic (augmented) samples that are most comparable to the existing samples [25], avoiding the data duplication and important data loss that result from naively over-sampling or under-sampling the actual datasets. To achieve this aim, the SMOTE package was first installed using the "pip install imblearn" command; next, the SMOTE class was imported using "from imblearn.over_sampling import SMOTE"; and finally, the SMOTE function was fit to the training dataset.
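A minimal sketch of this balancing step is shown below, assuming the data has already been split into training and testing sets (the variable names are illustrative); resampling is applied to the training split only:

```python
# A sketch of balancing DS-2 with SMOTE; variable names are assumptions.
from imblearn.over_sampling import SMOTE   # requires: pip install imblearn

smote = SMOTE(random_state=0)
# Fit on the training split only so synthetic samples never leak into the test set.
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
```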

3.3.1 Informative Feature Selection Technique.

As certain data may have redundant and misleading effects while other data may contribute nothing to classifier performance [44], data preparation is a crucial phase in any regression or classification activity. Moreover, selecting and using an optimal subset of features is important to improve model accuracy, reduce train-test computation time, and fight model over-fitting [14, 44].
There are many effective methods for feature selection; among them, filter, wrapper, and embedded-based feature selection are well known. Unlike wrapper and embedded-based feature selection, filter-based feature selection chooses informative features at the data preprocessing stage, independently of any ML algorithm [14]. Because of its scalability and faster computational time, the filter-based technique has received the greatest attention compared to the wrapper and embedded feature selection techniques [38, 44].
Filter-based feature selection in the literature can be broadly grouped into two categories: (i) Uni-variate and (ii) multivariate filter methods [44]. These methods differ from one another in the number of filters used and the selection method [38, 44]. There are two main reasons why Uni-variate filters are more frequently chosen over multivariate filters in a range of scenarios [38, 44]. The first is that some circumstances simply need the observation of individual performance indicators, such as assessing the importance of individual features based on their statistical significance level. The second is that Uni-variate filters compute more quickly than multivariate filters; this might be because Uni-variate strategies evaluate and rank features independently using performance measures, unlike multivariate methods, which examine entire feature subsets using a particular search strategy [38, 44]. These are the main reasons for using the UFS technique in our proposed approach.
Ensemble learning methods are known for requiring a lot of computational time because they involve multiple ML algorithms [7, 910]. To address these concerns, the UFS technique that makes use of the ANOVA F-test was implemented and the top phishing website predictive features were described as shown in Table 2. Since DS-3 contains only feature categories from the URL and site contents and has fewer (10,000) instances compared to DS-1 and DS-2, we examined each ensemble learner's performance in DS-3 without using the UFS technique.
The UFS technique selects the top informative features based on each feature's statistical significance level, i.e., the features within the highest K score percentile. The study followed a four-step process to implement the UFS technique. In the first step, the f_classif, SelectKBest, and SelectPercentile libraries were imported. In the second step, the "SelectKBest(f_classif, k = 69)" and "SelectKBest(f_classif, k = 27)" methods were employed to select the top 69 informative features in DS-1 and the top 27 informative features in DS-2, respectively. In the third step, the features selected by the UFS technique were transformed into the training (X_train) and testing (X_test) datasets. In the fourth step, the top selected features were fit to the model to be trained (the X_train dataset). The "X_train.columns[ufs.get_support()]" method was used to display the top phishing website predictive features, and these features are briefly described in Sections 3.3.1.1, 3.3.1.2, and 3.3.1.3.
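The four steps above can be sketched as follows, with k = 69 for DS-1 (k = 27 for DS-2); the variable names are illustrative:

```python
# A sketch of the UFS (ANOVA F-test) feature selection procedure.
from sklearn.feature_selection import SelectKBest, f_classif

ufs = SelectKBest(score_func=f_classif, k=69)       # top 69 features for DS-1
X_train_sel = ufs.fit_transform(X_train, y_train)   # fit on the training set only
X_test_sel = ufs.transform(X_test)                  # reuse the same feature mask
print(list(X_train.columns[ufs.get_support()]))     # names of the selected features
```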
3.3.1.1 The Top Phishing Websites Prediction Features in DS-1.
As per the UFS technique, 69 out of the 87 features in DS-1 were judged to be statistically significant for phishing website detection. Table 2 presents the descriptions of the top 69 features.
3.3.1.2 The Top Phishing Websites Prediction Features in DS-2.
As per the UFS technique, 27 out of the 31 features in DS-2 were judged to be statistically significant for phishing website detection. Since each DS-2 feature description is found in [19], we did not describe them in this article to avoid redundancy and save space. A list of the top 27 features is presented as follows.
3.3.1.3 The Top Phishing Websites Prediction Features in DS-3.
All 48 features of DS-3 were used to examine each ensemble learner's performance. Since each DS-3 feature description is found in [26], we did not describe them in this article to avoid redundancy and save space. A list of the 48 features is presented as follows.

3.4 Cross Validation

A model is considered effective when it can adapt to fresh or unforeseen data and make appropriate predictions. K-fold cross-validation is one of the commonly used techniques to validate ML model performance on unseen data and to ensure the trained model does not over-fit [38, 39, 41]. In this study, 10-fold cross-validation was employed to test each ensemble learner's skill on unobserved data. Accordingly, the training dataset is partitioned into 10 equal pieces; the model is trained on K-1 folds (nine parts) and validated on the remaining fold (one part). This process is repeated ten times until each partition has been used as the testing set. The final result is the average accuracy of the model across all testing sets [38, 39, 41].
As stated in [41], 10-fold cross-validation and 10% hold-out validation are comparable; however, in 10-fold cross-validation, the mean accuracy is taken after the model has been trained and evaluated 10 times. Likewise, similar to 20% hold-out validation, 5-fold cross-validation trains on 80% of the data and tests on the remaining 20%: the data is divided into five equal folds of 20% each, four folds (4 x 20% = 80% of the data) are used for training, and the remaining fold is used for testing. The obvious distinction between 5-fold CV and an 80%-20% train-test split is that in 5-fold CV the model is trained and tested five times, and the mean accuracy is then taken when determining the model's accuracy [41].
It is important to note that applying 10-fold cross-validation to large datasets is not advised because the number of folds strongly correlates with the model processing time [41]. Since DS-1 contains a larger number of features and instances than DS-2 and DS-3, we noticed that applying 10-fold cross-validation to DS-1 required a lot of computational effort. To address this concern, we used 10-fold cross-validation on DS-2 and DS-3, while an 80%-20% hold-out train-test split was used on DS-1. The 80%-20% hold-out validation [15], like 10-fold cross-validation, is commonly utilized by ML researchers [9, 10, 16, 38, 39, 41] to test ML model performance on unseen data.
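The two validation schemes can be sketched as follows; the model and variable names are illustrative:

```python
# A sketch of 10-fold cross-validation (DS-2, DS-3) and the 80%-20% hold-out split (DS-1).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

model = RandomForestClassifier(random_state=0)

# DS-2 / DS-3: 10-fold cross-validation; the reported score is the mean accuracy.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean())

# DS-1: 80%-20% hold-out train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(model.fit(X_train, y_train).score(X_test, y_test))
```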

3.5 Ensemble Learning Methods

To detect phishing websites, single and hybrid ensemble learning methods were employed in this study. Single ensemble learners are classifiers that are ensemble on their own or they combine the prediction results of multiple decision trees to yield the final output. Random forest, gradient boost, and cat boost are a few examples. Hybrid ensemble learners are classifiers that integrate numerous single ensemble learners. Stacking and majority voting are two examples and are used to combine the results of the random forest, gradient boost, and cat boost. The brief details of these ensemble learners are presented as follows.

3.5.1 Gradient-boost (GB).

GB is a boosting-type ensemble learner that trains multiple decision trees in a sequential and iterative manner and then combines each decision tree's prediction score to produce the final result [40]. GB operates based on three key components: (i) a weak learner, (ii) a loss function, and (iii) an additive model. The weak learner component (usually a decision tree) minimizes the errors yielded by the preceding trees, the loss function component identifies residuals, and the additive component adds trees over time and updates existing tree values [4]. GB uses an optimizer such as gradient descent to minimize the error between assigned values, adjusting weights only after computing the errors at each level. The results of the newly generated tree are then combined with the results of the previous trees, and this process is repeated until a predetermined number of trees have been boosted or the loss has been reduced to an acceptable level. Because of its high flexibility, the GB classifier can be tailored to any data-driven task [4].
As stated in [40], the computational time cost of the GB classifier is O(N x n), where N is the number of dataset samples and n is the number of features. When used with massive data, the GB classifier takes a very long time to train [40]. We used the following model hyper-parameters to optimize the performance of the GB classifier.
max_depth: specifies the maximum depth to which each tree can grow [39] and is used to control model over-fitting. The default value of max_depth is 3, and its value can be any positive integer. We searched the interval between 5 and 50 to obtain the optimal parameter value.
n_estimators: specifies the number of boosting steps to be carried out. The default value of n_estimators is 100, and its value can be any positive integer. We searched the interval between 5 and 500 to obtain the optimal value.
learning_rate: used to shrink the gradient step. There is a tradeoff between n_estimators and learning_rate: a larger n_estimators value is required when a smaller learning_rate is used [41]. We searched the interval between 0.1 and 10 to obtain the optimal parameter value.
random_state: when building trees, this parameter controls the randomness of the training sample splits for the generated trees. Like the random.seed() function, it is used to obtain reproducible results when repeatedly running the ML model. The default value of random_state is none, and its value can be any non-negative integer. We searched the interval between 0 and 50 to obtain the optimal parameter value.
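A minimal sketch of this hyper-parameter search is given below; the grid values are illustrative samples from the intervals above, not the exact grid used in the study:

```python
# A sketch of tuning the GB classifier over the intervals described above.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [5, 10, 25, 50],
    "n_estimators": [50, 100, 250, 500],
    "learning_rate": [0.1, 0.5, 1.0],
    "random_state": [0, 12, 42],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```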

3.5.2 Cat-boost (CATB).

CATB is a member of the boosting family of ensemble learning approaches and is an open-source, enhanced version of the GB classifier [5]. This classifier introduced new advancements to boosting algorithms, including novel techniques such as Ordered Target Statistics (OTS) and Ordered Boosting, which serve as base predictors for the automatic encoding of categorical variables when building a decision tree. It uses a permutation-driven random sample selection strategy, fights the prediction shift triggered by target leakage, and is balanced, fast, and robust against over-fitting [5, 6, 46, 47]. According to [47], CATB outperforms other publicly available boosting algorithms in terms of quality when applied to a variety of datasets.
Base predictors in CATB are oblivious decision trees, often known as decision tables. The term "oblivious" refers to the usage of the same splitting criterion throughout a level of the tree. These balanced trees allow for a significant speedup in execution during testing and are less prone to over-fitting [5, 47].
CATB is capable of handling categorical variables in an automated manner. Target statistics (TS) in CATB is used to estimate the predicted target value in each category. The computing time of the CATB depends on the number of target statistics (T) to be computed in each iteration and the set of candidate tree splits (C) that are taken into consideration at each iteration [47]. The following model hyper-parameters were used to optimize the performance of the CATB classifier.
max_depth: specifies the maximum depth to which each tree can grow [39]. Its value can be any integer up to 32, although values between 1 and 10 work well [40]. We searched the interval between 1 and 15 to obtain the optimal parameter value.
iterations: defines the upper limit on the number of trees that can be generated; a high value can result in model over-fitting [40]. We searched the interval between 30 and 500 to obtain the optimal parameter value.
learning_rate: used to shrink the gradient step. There is a tradeoff between learning_rate and iterations: a larger iterations value is needed when a smaller learning_rate is used [41]. Its value can be up to 16, although values between 1 and 10 are recommended [40]. We searched the interval between 0.1 and 10 to obtain the optimal parameter value.
random_state: when building trees, this parameter controls the randomness of the training sample splits for the generated trees. Like the random.seed() function, it is used to obtain reproducible results when repeatedly running the ML model. The default value of random_state is none, and we searched the interval between 0 and 50 to obtain the optimal parameter value.
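As an illustration, a CATB classifier with the DS-1 values reported in Table 4 (after UFS) could be instantiated as follows; this is a sketch, assuming the catboost package is installed, not the study's exact code:

```python
# A sketch of the CATB classifier with the tuned DS-1 parameter values.
from catboost import CatBoostClassifier

catb = CatBoostClassifier(
    max_depth=5,
    iterations=350,
    learning_rate=0.1,
    random_state=0,
    verbose=False,   # silence per-iteration training logs
)
catb.fit(X_train_sel, y_train)
print(catb.score(X_test_sel, y_test))
```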

3.5.3 Random Forest (RF).

RF is a popular ensemble learner consisting of a collection of diverse, unpruned decision trees; it uses a random approach to generate independent trees and subsets of data, and the final prediction is the aggregated average of the independent trees' results. For huge datasets, this approach yields better results than a single decision tree because it creates far less variance [3]. RF is a good candidate in numerous fields for carrying out classification and regression activities with high accuracy and a low computational time cost [45], and it is highly robust against over-fitting [3, 46]. RF exhibited the top phishing website detection accuracy in 17 out of 30 systematically reviewed studies [11].
The computational time cost of RF at test time is O(T x D), where T is the number of trees in the forest and D is the maximum depth (without the root) [45]. The computational time cost of RF can be lower if the trees are unbalanced [45]. The following model hyper-parameters were used to optimize the performance of the RF classifier.
max_depth: specifies the maximum depth to which each tree can grow [39] and is used to control model over-fitting. The default value of max_depth is none. We searched the interval between 5 and 50 to obtain the optimal parameter value.
n_estimators: specifies the number of trees in the forest [39]. The default value of n_estimators is 100, and we searched the interval between 5 and 500 to obtain the optimal parameter value.
random_state: when building trees, this parameter controls the randomness of the training sample splits for the generated trees. Like the random.seed() function, it is used to obtain reproducible results when repeatedly running the ML model. The default value of random_state is none, and we searched the interval between 0 and 50 to obtain the optimal parameter value.
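A sketch of the RF classifier with the DS-1 values reported in Table 4 (after UFS); the variable names are illustrative:

```python
# A sketch of the RF classifier with the tuned DS-1 parameter values.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=10, n_estimators=32, random_state=12)
rf.fit(X_train_sel, y_train)
print(rf.score(X_test_sel, y_test))
```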

3.5.4 Stacking (Meta Ensemble Method).

Stacking ensemble methods use a number of independent classifiers as base learners at the first layer, and a Meta learner at the next layer to combine the predictions of the base learners [7, 20]. The stacking ensemble method was used in [9, 16, 18] and exhibited better accuracy than individual ensemble and non-ensemble learners.
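As a sketch, the DS-1 stacking configuration (GB and CATB as base learners, RF as the Meta learner; see Table 3) could be built with scikit-learn's StackingClassifier; hyper-parameter values are omitted here for brevity:

```python
# A sketch of the two-layer stacking architecture; parameter values omitted.
from catboost import CatBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier

stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("catb", CatBoostClassifier(verbose=False, random_state=0)),
    ],
    final_estimator=RandomForestClassifier(random_state=0),  # the Meta learner
    cv=5,  # internal folds used to build the Meta learner's training data
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```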
We conducted experiments on three different Meta learners, namely Meta-CATB, Meta-GB, and Meta-RF, to find the appropriate one for DS-1, DS-2, and DS-3. Based on the highest accuracy scores, Meta-RF was found better for DS-1, while Meta-CATB was found better for both DS-2 and DS-3. Table 3 exhibits how the Meta learners were selected.
| Dataset | Base learners | Meta learner | Accuracy | Best performed Meta learner |
| --- | --- | --- | --- | --- |
| DS-1 | RF and GB | CATB | 96.59% | RF |
| DS-1 | RF and CATB | GB | 96.89% | |
| DS-1 | GB and CATB | RF | 97.24% | |
| DS-2 | RF and GB | CATB | 96.83% | CATB |
| DS-2 | RF and CATB | GB | 96.52% | |
| DS-2 | GB and CATB | RF | 96.7% | |
| DS-3 | RF and GB | CATB | 98.51% | CATB |
| DS-3 | RF and CATB | GB | 97.96% | |
| DS-3 | GB and CATB | RF | 98.43% | |

Table 3. Meta Learner Selection Method

3.5.5 Majority Vote Ensemble Method.

The hard voting ensemble method predicts the class that receives the most votes from the constituent models. This method was used in [17, 20, 23] and exhibited better accuracy than individual ensemble and non-ensemble learners. In our study, it was implemented to predict the class with the most votes from the CATB, GB, and RF models.
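A sketch of this hard vote ensemble over CATB, GB, and RF, using scikit-learn's VotingClassifier (parameter values omitted for brevity):

```python
# A sketch of the majority (hard) vote ensemble; parameter values omitted.
from catboost import CatBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier

vote = VotingClassifier(
    estimators=[
        ("catb", CatBoostClassifier(verbose=False, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="hard",  # each model casts one vote; the majority class wins
)
vote.fit(X_train, y_train)
print(vote.score(X_test, y_test))
```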

3.6 Implementation Tools

The experiments were conducted in the Google Colab cloud environment because it offers high-performance computing for training and testing our proposed models. The Python programming language was used to implement each ensemble learner because it is one of the most prominent languages for data analysis and is commonly utilized in data science projects [39]. The pandas, numpy, and matplotlib libraries were used for data handling and analysis [39].

3.7 Model Evaluation Metrics

In this study, accuracy, f1-score, precision, recall, and train-test computational time cost were used as the core model evaluation metrics. The outcomes of the classification task can be summarized by the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The confusion matrix can be presented as follows:

| | Predicted phishing | Predicted legitimate |
| --- | --- | --- |
| Actual phishing | TP | FN |
| Actual legitimate | FP | TN |
The accuracy metric is the sum of correct phishing and benign website predictions divided by the sum of all correct and incorrect predictions, and it is recommended for use with balanced datasets. In short, the accuracy formula can be written as

Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP represents the number of phishing websites correctly labeled as phishing; TN represents the number of legitimate websites correctly labeled as legitimate; FP represents the number of legitimate websites incorrectly labeled as phishing, in which case internet users are denied access to authentic websites; and FN represents the number of phishing websites wrongly labeled as legitimate, the dangerous case, since internet users are allowed to visit phishing websites.
Recall (sensitivity) is used to measure the completeness of the phishing website detection model. In short, the recall formula can be written as

Recall = TP / (TP + FN)
Precision is used to measure the exactness of the phishing website detection model. In short, the precision formula can be written as

Precision = TP / (TP + FP)
The F1-Score metric is the harmonic mean of recall and precision. In short, the F1-Score formula can be written as

F1-Score = 2 x (Precision x Recall) / (Precision + Recall)
The computational time metric measures the amount of time each classifier takes to complete the training and testing tasks. This metric is used because prediction algorithms are expected to provide a swift prediction along with the highest level of accuracy before internet visitors hand over their confidential data to fraudulent websites.
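These metrics, together with the train-test wall-clock time, can be computed as in the following sketch (the model and variable names are illustrative):

```python
# A sketch of computing the evaluation metrics and train-test computational time.
import time
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

start = time.time()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
elapsed = time.time() - start   # train + test computational time in seconds

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
print("time (s) :", round(elapsed, 2))
```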

4 Results and Discussions

A higher accuracy score and faster model computing time can assure the trustworthiness of any phishing website detection system. Furthermore, the scalability and performance consistency of a phishing website detection system across numerous datasets is critical to combating the various variants of phishing website attacks. That is why the proposed study used three distinct phishing website datasets to train and test each ensemble learning method (CATB, GB, RF, stacking, and hard vote). DS-1 has 87 features and 11,430 records, DS-2 has 31 features and 11,054 records, and DS-3 has 48 features and 10,000 records.
Indeed, the success of any ML algorithm depends on using cleaned, representative datasets and a choice of informative features [13-15, 21, 39]. To achieve this aim, appropriate model hyper-parameter values, cross-validation, and the top informative website features were used to optimize model performance in terms of accuracy and computational time.
Ensemble learning methods are known for requiring a lot of computational time because they combine multiple ML algorithms [7, 9, 10]. To address this concern, each ensemble learning method was trained and tested both before and after applying the UFS technique, and promising results were obtained following its application.
Using an uneven dataset could lead to biased model results because the model will favor the classes that are in the majority. To address these concerns, we employed the widely used dataset balancing strategy called SMOTE to change the uneven dataset ratio of DS-2 from 56%:44% to 50%:50%.
It is challenging to determine which classifier, out of CATB, GB, and RF, is best for the Meta learner in the stacking ensemble learning method without doing experiments. Due to scoring the highest accuracy, the Meta-RF was found better for DS-1, while the Meta-CATB was found better for both DS-2 and DS-3 as was mentioned in the methodology section.
In principle, removing less informative features can reduce model computing time while increasing model accuracy [44]. Our experimental results demonstrate that the model hyper-parameter values, in addition to the use of the UFS technique, are crucial in determining how accurately and quickly a model computes. The overall experiments are presented as follows.

4.1 Each Ensemble Learner's Performance Comparisons Before and After the UFS Applied to DS-1

As per the UFS technique, 69 out of the 87 features in DS-1 were judged to be statistically significant for phishing website detection. Moreover, the results of RF and Meta-RF suggest that using the UFS technique can boost accuracy while maintaining the same computing time, while the results of CATB, GB, and hard vote suggest that using the UFS technique can boost accuracy while reducing computing time.
Table 4 shows the suitable model hyper-parameter values, accuracy, and computing time for each ensemble learner both before and after the UFS technique was applied to DS-1. Results in bold green indicate performance improvements following the application of the UFS technique, results in bold blue indicate identical model performance before and after the UFS technique, and results in bold black indicate which model hyper-parameter values changed after applying the UFS technique.
| Ensemble learners | Adjusted model parameter values before UFS | Adjusted model parameter values after UFS | Accuracy (%) before UFS | Accuracy (%) after UFS | Compute time in seconds before UFS | Compute time in seconds after UFS |
| --- | --- | --- | --- | --- | --- | --- |
| CATB | max_depth = 5, iterations = 346, learning_rate = 0.1, random_state = 0 | max_depth = 5, iterations = 350, learning_rate = 0.1, random_state = 0 | 97.73 | 97.9 | 12 | 10 |
| GB | max_depth = 5, n_estimators = 73, learning_rate = 0.1, random_state = 12 | max_depth = 5, n_estimators = 69, learning_rate = 0.1, random_state = 3 | 97.07 | 97.16 | 19 | 17 |
| RF | max_depth = 10, n_estimators = 45, random_state = 0 | max_depth = 10, n_estimators = 32, random_state = 12 | 96.5 | 96.54 | 4 | 4 |
| Meta-RF | CATB = (max_depth = 5, iterations = 346, learning_rate = 0.1, random_state = 0); GB = (max_depth = 5, n_estimators = 73, learning_rate = 0.1, random_state = 12); RF = (max_depth = 10, n_estimators = 45, random_state = 0) | CATB = (max_depth = 5, iterations = 350, learning_rate = 0.3, random_state = 0); GB = (max_depth = 5, n_estimators = 69, learning_rate = 0.1, random_state = 3); RF = (max_depth = 10, n_estimators = 32, random_state = 12) | 97.24 | 97.42 | 47 | 47 |
| Hard vote | CATB = (max_depth = 5, iterations = 346, learning_rate = 0.1, random_state = 0); GB = (max_depth = 5, n_estimators = 73, learning_rate = 0.1, random_state = 12); RF = (max_depth = 10, n_estimators = 45, random_state = 0) | CATB = (max_depth = 5, iterations = 350, learning_rate = 0.1, random_state = 0); GB = (max_depth = 5, n_estimators = 69, learning_rate = 0.1, random_state = 3); RF = (max_depth = 10, n_estimators = 30, random_state = 12) | 97.2 | 97.38 | 19 | 18 |
Table 4. Each Ensemble Learner's Performance Comparisons Before and After the UFS Applied to DS-1
The experimental findings in Table 4 demonstrate that the UFS technique marginally increased the accuracy of each ensemble learner without adversely affecting the model computation time. Moreover, when used on DS-1, the CATB classifier slightly outperformed the other ensemble learners (Meta-RF, hard vote, GB, and RF, respectively) in terms of scoring the highest accuracy. For instance, after the UFS technique was applied to DS-1, the CATB's accuracy increased by 0.17% (from 97.73% to 97.9%), while its computational time was reduced by 2 seconds (from 12 seconds to 10 seconds). These CATB results were attained after employing the top 69 features and adjusting the CATB iterations value from 346 to 350. The resulting CATB accuracy was 0.48% higher than the Meta-RF accuracy, 0.52% higher than the hard vote accuracy, 0.74% higher than the GB accuracy, and 1.36% higher than the RF accuracy.
Following the application of the UFS technique to DS-1, changing the iterations value from 346 to 350 increased the CATB accuracy by 0.17% while reducing the computational time by 2 seconds. Changing the n_estimators value from 73 to 69 increased the GB accuracy by 0.09% while reducing the computational time by 2 seconds. Changing the n_estimators value from 45 to 32 and the random_state value from 0 to 12 increased the RF accuracy by 0.04%, with the same computational time (4 seconds). Changing the CATB iterations value from 346 to 350 and learning_rate value from 0.1 to 0.3, the GB n_estimators value from 73 to 69 and random_state value from 12 to 3, and the RF n_estimators value from 45 to 32 and random_state value from 0 to 12 increased the Meta-RF accuracy by 0.18%, with the same computational time (47 seconds). Changing the CATB iterations value from 346 to 350, the GB n_estimators value from 73 to 69 and random_state value from 12 to 3, and the RF n_estimators value from 45 to 30 and random_state value from 0 to 12 increased the hard vote accuracy by 0.18% while reducing the computational time by 1 second.
Despite demonstrating the slowest computational time (47 seconds) of all the ensemble learners, Meta-RF attained the second-highest accuracy (97.42%) on DS-1. Despite scoring the lowest accuracy (96.54%) on DS-1, the RF classifier had the quickest computational time (4 seconds), both before and after the application of the UFS technique, compared to all the remaining ensemble learners. This RF computational time (4 seconds) was nearly 12 times faster than the Meta-RF computational time, more than 4 times faster than the hard vote and GB computational times, and more than 2 times faster than the CATB computational time. As stated in [45], RF requires a low computational time cost: its time cost at test time is O(T x D), where T is the number of trees in the forest and D is the maximum depth (without the root), and it can be lower still if the trees are unbalanced [45]. This could be the main reason that RF exhibited the fastest computational time.
Figure 3 shows the accuracy of each ensemble learner before and after the UFS technique was used on the DS-1, while Figure 4 shows the accuracy, f1-score, precision, and recall of each ensemble learner after the UFS technique was used on the DS-1.
Fig. 3. Ensemble learners' performance comparisons before and after the UFS applied to DS-1.
Fig. 4. Ensemble learners' performance comparisons after the UFS applied to DS-1.

4.2 Each Ensemble Learner's Performance Comparisons Before and After the UFS Applied to DS-2

As per the UFS technique, 27 out of the 31 features in DS-2 were judged to be statistically significant for phishing website detection. Moreover, the results of RF suggest that using the UFS technique can boost accuracy while maintaining the same computing time, while the results of CATB, GB, hard vote, and Meta-CATB suggest that using the UFS technique can boost accuracy while reducing computing time.
Table 5 shows the suitable model hyper-parameter values, accuracy, and computing time for each ensemble learner both before and after the UFS technique was applied to DS-2. Results in bold green indicate performance improvements following the application of the UFS technique, results in bold blue indicate identical model performance before and after the UFS technique, and results in bold black indicate which model hyper-parameter values changed after applying the UFS technique.
Classifiers | Adjusted model parameters before UFS | Adjusted model parameters after UFS | Accuracy (%) before UFS | Accuracy (%) after UFS | Compute time (s) before UFS | Compute time (s) after UFS
CATB | max_depth = 11, iterations = 48, learning_rate = 0.4, random_state = 12 | max_depth = 11, iterations = 66, learning_rate = 0.4, random_state = 42 | 97.17 | 97.36 | 38 | 34
GB | max_depth = 9, n_estimators = 56, learning_rate = 0.4, random_state = 2 | max_depth = 9, n_estimators = 50, learning_rate = 0.4, random_state = 3 | 96.96 | 97.27 | 106 | 76
RF | max_depth = 19, n_estimators = 40, random_state = 0 | max_depth = 18, n_estimators = 46, random_state = 6 | 97.18 | 97.37 | 12 | 12
Meta-CATB | CATB = (max_depth = 11, iterations = 48, learning_rate = 0.4, random_state = 12); GB = (max_depth = 9, n_estimators = 56, learning_rate = 0.4, random_state = 2); RF = (max_depth = 20, n_estimators = 40, random_state = 12) | CATB = (max_depth = 9, iterations = 50, learning_rate = 0.4, random_state = 42); GB = (max_depth = 9, n_estimators = 50, learning_rate = 0.4, random_state = 3); RF = (max_depth = 18, n_estimators = 46, random_state = 6) | 96.90 | 97.18 | 650 | 501
Hard vote | CATB = (max_depth = 11, iterations = 48, learning_rate = 0.4, random_state = 12); GB = (max_depth = 9, n_estimators = 56, learning_rate = 0.4, random_state = 2); RF = (max_depth = 19, n_estimators = 36, random_state = 0) | CATB = (max_depth = 11, iterations = 62, learning_rate = 0.4, random_state = 42); GB = (max_depth = 9, n_estimators = 53, learning_rate = 0.4, random_state = 3); RF = (max_depth = 18, n_estimators = 42, random_state = 6) | 97.22 | 97.34 | 156 | 124
Table 5. Each Ensemble Learner's Performance Comparisons Before and After the UFS Applied to DS-2
The experimental findings in Table 5 demonstrate that the UFS technique marginally improved the accuracy of each ensemble learner while reducing, or at worst maintaining, the computation time. Moreover, the RF, CATB, and hard vote strongly competed with each other for superior accuracy, with the RF slightly ahead. For instance, after the UFS technique was applied to DS-2, the RF's accuracy increased by 0.19% (from 97.18% to 97.37%). This performance was attained after employing the top 27 features and adjusting the RF max_depth value from 19 to 18, the n_estimators value from 40 to 46, and the random_state value from 0 to 6. The resulting RF accuracy (97.37%) was 0.01% higher than the CATB accuracy, 0.03% higher than the hard vote accuracy, 0.1% higher than the GB accuracy, and 0.19% higher than the Meta-CATB accuracy.
Following the application of the UFS technique to DS-2, changing the CATB iterations value from 48 to 66 and its random_state value from 12 to 42 increased the CATB accuracy by 0.19% while reducing the computational time by 4 seconds. Changing the GB n_estimators value from 56 to 50 and its random_state value from 2 to 3 increased the GB accuracy by 0.31% while reducing the computational time by 30 seconds. For the Meta-CATB, changing the CATB max_depth value from 11 to 9, iterations value from 48 to 50, and random_state value from 12 to 42; the GB n_estimators value from 56 to 50 and random_state value from 2 to 3; and the RF max_depth value from 20 to 18, n_estimators value from 40 to 46, and random_state value from 12 to 6 increased the Meta-CATB accuracy by 0.28% while reducing the computational time by 149 seconds. For the hard vote, changing the CATB iterations value from 48 to 62 and random_state value from 12 to 42; the GB n_estimators value from 56 to 53 and random_state value from 2 to 3; and the RF max_depth value from 19 to 18, n_estimators value from 36 to 42, and random_state value from 0 to 6 increased the hard vote accuracy by 0.12% while reducing the computational time by 32 seconds.
As in DS-1, the RF classifier attained the fastest computational time when used on DS-2. The RF computational time (12 seconds) in DS-2 was nearly 42 times faster than the Meta-CATB computational time, more than 10 times faster than the hard vote computational time, more than 6 times faster than the GB computational time, and nearly 3 times faster than the CATB computational time. As in DS-1, the CATB classifier attained the second quickest computing time (34 seconds) when used on DS-2.
Figure 5 shows the accuracy of each ensemble learner before and after the UFS technique was applied to DS-2, while Figure 6 shows the accuracy, f1-score, precision, and recall of each ensemble learner after the UFS technique was applied to DS-2.
Fig. 5. Each ensemble learner's performance comparisons both before and after the UFS applied to DS-2.
Fig. 6. Each ensemble learner's performance comparisons after the UFS applied to DS-2.

4.3 Each Ensemble Learner's Performance Comparisons When Used on DS-3

DS-3 contains only URL and site-content feature categories, whereas DS-1 and DS-2 each included four feature categories (URL, web content/source code, domain, and page rank features), and DS-3 has fewer instances (10,000) than DS-1 (11,430) and DS-2 (11,055). For these reasons, the study assessed each ensemble learner's performance on DS-3 without applying the UFS technique.
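Since DS-3 was evaluated on all 48 features under 10-fold cross-validation (Table 6), a minimal sketch of that protocol, assuming X and y hold the DS-3 feature matrix and labels and reusing the CATB settings from Table 6, might look as follows.

```python
# Minimal sketch: 10-fold cross-validation on DS-3 with the four metrics
# reported in Tables 6 and 7. The CATB settings follow Table 6; X and y
# are assumed to be the 48-feature DS-3 matrix and its labels.
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_validate

catb = CatBoostClassifier(max_depth=7, iterations=70,
                          learning_rate=0.4, random_state=12, verbose=0)
scores = cross_validate(catb, X, y, cv=10,
                        scoring=("accuracy", "f1", "precision", "recall"))
for metric in ("accuracy", "f1", "precision", "recall"):
    print(metric, round(scores[f"test_{metric}"].mean() * 100, 2))
```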
Our experimental findings reveal that every ensemble learner attained more than 98% accuracy when used on DS-3, with the CATB slightly ahead at the highest accuracy (98.59%). Table 6 shows the suitable model hyper-parameter values, accuracy, and computing time for each ensemble learner on DS-3 in the absence of the UFS technique.
Classifiers | Adjusted model parameters | Accuracy (%) | Compute time (s) | # of website features
CATB | max_depth = 7, iterations = 70, learning_rate = 0.4, random_state = 12 | 98.59 | 28 | 48/All
GB | max_depth = 9, n_estimators = 54, learning_rate = 0.4, random_state = 2 | 98.52 | 104 | 48/All
RF | max_depth = 19, n_estimators = 36, random_state = 5 | 98.25 | 13 | 48/All
Meta-CATB | CATB = (max_depth = 11, iterations = 48, learning_rate = 0.4, random_state = 12); GB = (max_depth = 9, n_estimators = 56, learning_rate = 0.4, random_state = 2); RF = (max_depth = 20, n_estimators = 40, random_state = 12) | 98.51 | 622 | 48/All
Hard vote | CATB = (max_depth = 11, iterations = 48, learning_rate = 0.4, random_state = 12); GB = (max_depth = 9, n_estimators = 56, learning_rate = 0.4, random_state = 2); RF = (max_depth = 19, n_estimators = 36, random_state = 0) | 98.55 | 167 | 48/All
Table 6. Each Ensemble Learner's Performance Comparisons When Applied to DS-3
The experimental findings in Table 6 exhibit that the CATB, hard vote, GB, and Meta-CATB strongly competed with each other for the highest accuracy, with the CATB slightly ahead. The CATB classifier came in first by attaining the highest accuracy of 98.59%, which was 0.04% higher than the accuracy of the hard vote, 0.07% higher than the accuracy of the GB, 0.08% higher than the accuracy of the Meta-CATB, and 0.34% higher than the accuracy of the RF.
While the CATB classifier came in first with 98.59% accuracy, the hard vote ensemble attained the second-best accuracy (98.55%), the GB the third-best (98.52%), the Meta-CATB the fourth-best (98.51%), and the RF the fifth-best (98.25%).
As in DS-1 and DS-2, the RF classifier came in first by attaining the fastest computational time (13 seconds) when used on DS-3, and the CATB classifier attained the second quickest computing time (28 seconds). However, as in DS-2, the Meta-CATB attained the slowest computational time (622 seconds) of all the ensemble learners. Figure 7 shows the accuracy, f1-score, precision, and recall of each ensemble learner when used on DS-3.
Fig. 7. Each ensemble learner's performance comparisons when applied to DS-3.

4.4 Summary of Key Experimental Findings

The suitability of the CATB for big data and its superiority in many academic fields, including traffic engineering, finance, meteorology, medicine, electrical utilities, astronomy, marketing, biology, psychology, and biochemistry, were exhibited in [5]. Our study introduced the cat-boost (CATB) classifier for phishing website detection, and per our experimental findings it demonstrated scalable, consistent, and superior accuracy across a variety of phishing website datasets compared to the remaining ensemble learners.
As can be seen in Table 7, compared to the remaining ensemble learners, the CATB exhibited superior accuracy when used on DS-1 and DS-3, attaining 97.9% and 98.59% accuracy, respectively. The CATB attained the second-best accuracy (97.36%) when used on DS-2, falling just 0.01% short of the best accuracy (97.37%) attained by the RF on that dataset.
The CATB accuracy in DS-1 (97.9%) was 1.07% higher than the best accuracy (96.83%) attained by RF in the study [14], which also used DS-1. The CATB accuracy in DS-3 (98.59%) was 3.99% higher than the best accuracy (94.6%) attained by RF in the study [26] and 0.59% higher than the best accuracy (98%) attained by RF in the study [48], both of which also used DS-3. The CATB accuracy in DS-2 (97.36%) was 0.36% higher than the best accuracy (97%) attained by RF in the study [19], which also used DS-2. Our experimental results are consistent with those of [47], which show that the CATB outperforms other publicly available boosting algorithms in terms of accuracy when used on various datasets.
The RF was the most widely used classifier for phishing website detection and exhibited the top detection accuracy in 17 out of 30 systematically reviewed studies [11]. In our study, however, the RF demonstrated the lowest accuracy on DS-1 and DS-3 compared to the remaining ensemble learners (CATB, Meta-RF/Meta-CATB, hard vote, and GB), despite exhibiting superior performance on DS-2 by attaining 97.37% accuracy. The RF accuracy in DS-2 (97.37%) was 0.37% higher than the best accuracy (97%) attained by RF in the study [19], which also used DS-2. The study [14] conducted experiments using DS-1 and stated that classifiers such as RF, SVM, and decision trees are quite sensitive to the order of attributes in the datasets. This may be the main reason the RF demonstrated lower accuracy on DS-1 than the CATB, Meta-RF, hard vote, and GB.
When applied to DS-1, DS-2, and DS-3, the RF was the fastest classifier in our study compared to all the remaining ensemble learners, with computing times of 4, 12, and 10 seconds, respectively. As stated in [45], the RF incurs a low computational cost: its cost at test time is O(T·D), where T is the number of trees in the forest and D is the maximum depth (without the root), and this cost can be even lower when the trees are unbalanced [45]. These may be the main reasons that the RF exhibited the fastest computational time of all the ensemble learners.
While the RF came in first, the CATB was found to be the second fastest classifier in our study when applied to DS-1, DS-2, and DS-3, with computational times of 10, 34, and 28 seconds, respectively. As stated in [5, 47], the base predictors in CATB are oblivious decision trees, also known as decision tables; the term "oblivious" refers to the use of the same splitting criterion across an entire level of the tree. These balanced trees allow for a significant speedup in execution during testing and are less prone to over-fitting [5, 47]. As stated in [47], the computing time of the CATB depends on the number of target statistics (TS) to be computed in each iteration and on the set of candidate tree splits considered at each iteration. These may be the main reasons that the CATB exhibited faster computational time than the GB, Meta-RF, and Meta-CATB.
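CatBoost exposes this tree-growing behavior directly. The sketch below makes the default symmetric (oblivious) policy explicit via the grow_policy parameter; the surrounding data names are assumptions carried over from the earlier sketches.

```python
# Minimal sketch: CatBoost's base learners are oblivious (symmetric)
# decision trees. grow_policy="SymmetricTree" is CatBoost's default and
# is spelled out here only to make the behavior explicit: every node at
# a given depth shares one splitting criterion, enabling the fast,
# table-lookup-style inference discussed above.
from catboost import CatBoostClassifier

catb = CatBoostClassifier(grow_policy="SymmetricTree", depth=7,
                          iterations=70, learning_rate=0.4,
                          random_state=12, verbose=0)
catb.fit(X_train, y_train)        # X_train, y_train assumed as before
print(catb.predict(X_test)[:10])  # fast test-time predictions
```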
In this study, the Meta-RF and Meta-CATB demonstrated the slowest computational times of all the ensemble learners when applied to DS-1, DS-2, and DS-3, at 47, 501, and 622 seconds, respectively. This result was expected because hybrid-ensemble learning methods are known to require a lot of computational time, as they combine multiple ML algorithms [7, 9, 10].
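To make the hybrid architectures concrete, here is one plausible scikit-learn wiring of the two hybrids, a stacked meta learner (Meta-RF) and a hard majority vote, over the CATB/GB/RF base learners. The study's own implementation is not specified, and the base-learner values simply follow Table 5's before-UFS settings.

```python
# Minimal sketch of the two hybrid-ensemble architectures: stacking
# (an RF meta learner over CATB/GB/RF, i.e., "Meta-RF") and hard
# majority voting. One plausible wiring, not the authors' exact code;
# base-learner values follow Table 5 (before UFS).
from catboost import CatBoostClassifier
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier,
                              StackingClassifier, VotingClassifier)

base = [
    ("catb", CatBoostClassifier(max_depth=11, iterations=48,
                                learning_rate=0.4, random_state=12,
                                verbose=0)),
    ("gb", GradientBoostingClassifier(max_depth=9, n_estimators=56,
                                      learning_rate=0.4, random_state=2)),
    ("rf", RandomForestClassifier(max_depth=20, n_estimators=40,
                                  random_state=12)),
]

meta_rf = StackingClassifier(estimators=base,            # "Meta-RF"
                             final_estimator=RandomForestClassifier())
hard_vote = VotingClassifier(estimators=base, voting="hard")

for name, clf in (("Meta-RF", meta_rf), ("hard vote", hard_vote)):
    clf.fit(X_train, y_train)   # refits every base learner
    print(name, clf.score(X_test, y_test))
```

Note that StackingClassifier refits each base learner several times for its internal cross-fitting, which is consistent with the large compute times observed for the Meta learners above.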
Moreover, the application of the UFS technique increased the accuracy of each ensemble learner in DS-1 and DS-2 without adversely affecting the models' computation time. Our experimental results demonstrate that the model hyper-parameter values, in addition to the use of the UFS technique, are crucial in determining how accurately and quickly a model can compute. Table 7 summarizes each ensemble learner's performance when used on DS-1, DS-2, and DS-3.
Dataset | UFS | Ensemble learner | Accuracy (%) | F1-score (%) | Precision (%) | Recall (%) | Compute time (s) | Total # of features | Cross-validation method
DS-1 | Before | CATB | 97.73 | 97.71 | 97.53 | 97.88 | 12 | 87/All | Holdout (80% train & 20% test)
DS-1 | Before | GB | 97.07 | 97.04 | 97 | 97.08 | 19 | 87/All | Holdout (80% train & 20% test)
DS-1 | Before | RF | 96.5 | 96.45 | 96.71 | 96.2 | 4 | 87/All | Holdout (80% train & 20% test)
DS-1 | Before | Meta-RF | 97.24 | 97.22 | 97.17 | 97.26 | 47 | 87/All | Holdout (80% train & 20% test)
DS-1 | Before | Hard vote | 97.2 | 97.16 | 97.42 | 96.91 | 19 | 87/All | Holdout (80% train & 20% test)
DS-1 | After | CATB | 97.9 | 97.88 | 97.63 | 98.14 | 10 | 69 | Holdout (80% train & 20% test)
DS-1 | After | GB | 97.16 | 97.13 | 97.08 | 97.17 | 17 | 69 | Holdout (80% train & 20% test)
DS-1 | After | RF | 96.54 | 96.49 | 96.96 | 96.02 | 4 | 69 | Holdout (80% train & 20% test)
DS-1 | After | Meta-RF | 97.42 | 97.38 | 97.86 | 96.91 | 47 | 69 | Holdout (80% train & 20% test)
DS-1 | After | Hard vote | 97.38 | 97.35 | 97.43 | 97.26 | 18 | 69 | Holdout (80% train & 20% test)
DS-2 | Before | CATB | 97.17 | 97.18 | 96.82 | 97.57 | 38 | 31/All | 10-Fold
DS-2 | Before | GB | 96.96 | 96.99 | 96.47 | 97.52 | 106 | 31/All | 10-Fold
DS-2 | Before | RF | 97.18 | 97.19 | 96.86 | 97.55 | 12 | 31/All | 10-Fold
DS-2 | Before | Meta-CATB | 96.9 | 96.92 | 96.7 | 97.16 | 650 | 31/All | 10-Fold
DS-2 | Before | Hard vote | 97.22 | 97.23 | 96.96 | 97.52 | 156 | 31/All | 10-Fold
DS-2 | After | CATB | 97.36 | 97.38 | 96.82 | 97.97 | 34 | 27 | 10-Fold
DS-2 | After | GB | 97.27 | 97.3 | 96.51 | 98.12 | 76 | 27 | 10-Fold
DS-2 | After | RF | 97.37 | 97.4 | 96.65 | 98.18 | 12 | 27 | 10-Fold
DS-2 | After | Meta-CATB | 97.18 | 97.2 | 96.81 | 97.61 | 501 | 27 | 10-Fold
DS-2 | After | Hard vote | 97.34 | 97.34 | 97.37 | 97.34 | 124 | 27 | 10-Fold
DS-3 | Before | CATB | 98.59 | 98.58 | 98.45 | 98.71 | 28 | 48/All | 10-Fold
DS-3 | Before | GB | 98.52 | 98.51 | 98.29 | 98.73 | 104 | 48/All | 10-Fold
DS-3 | Before | RF | 98.25 | 98.23 | 98.26 | 98.2 | 10 | 48/All | 10-Fold
DS-3 | Before | Meta-CATB | 98.51 | 98.49 | 98.5 | 98.5 | 622 | 48/All | 10-Fold
DS-3 | Before | Hard vote | 98.55 | 98.53 | 98.47 | 98.6 | 167 | 48/All | 10-Fold
Table 7. Summary of Each Ensemble Learner's Performance When Used on DS-1, DS-2, and DS-3

5 Conclusion and Future Work

One of the most prevalent, dynamic, and unlawful online practices is website phishing, which makes use of fake versions of benign websites to build trust in online users, steal sensitive information, and transfer malicious software like ransomware. As a result of successful website phishing, vulnerable institutions like banks, e-commerce, education, healthcare, broadcast media, and agriculture commonly encounter loss of productivity, reductions in competitiveness, risks to their survival, and compromises to national security. To address these concerns, the intervention of an intelligent phishing website detection model powered by ML is needed. However, model over-fitting and under-fitting are the core challenges that traditional or individual ML algorithms encounter as a result of noise, variation, and bias in datasets, and ensemble learning methods have proven to be state-of-the-art solutions for numerous classification tasks in overcoming these concerns. That is why our study conducted rigorous experiments on both single and hybrid ensemble learning methods: CATB, GB, RF, Meta-RF, Meta-CATB, and hard vote.
Furthermore, the scalability and performance consistency of a phishing website detection system across numerous datasets is critical to combating the various variants of phishing website attacks. That is why the study used three distinct phishing website datasets to train and test each ensemble learning method and identify the best-performing one in terms of accuracy and model computational time. Because they combine multiple ML algorithms, ensemble learning methods are known for requiring a lot of computational time. To address this concern, we applied the UFS technique and obtained promising results. Our experimental findings demonstrate two core benefits of the UFS technique: (i) it can boost accuracy while maintaining the same computing time, and (ii) it can boost accuracy while reducing the computing time.
Our study introduced the cat-boost (CATB) classifier for phishing website detection, and our experimental findings exhibited that the CATB demonstrated scalable, consistent, and superior accuracy across a variety of phishing website datasets (DS-1, DS-2, and DS-3). Despite scoring the lowest accuracy in DS-1 and DS-3, the RF classifier exhibited the highest accuracy when used on DS-2. The Meta-RF attained the second-best accuracy when used on DS-1 despite having the slowest computational time.
When it comes to model computational time, the RF classifier was found to be the fastest when applied to all datasets (DS-1, DS-2, and DS-3), while the CATB classifier was found to be the second quickest. The Meta-RF and Meta-CATB, on the other hand, had the slowest computational times across all datasets, despite the fact that the UFS technique helped reduce their computing time. Our experimental results demonstrate that the model hyper-parameter values, in addition to the use of the UFS technique, are crucial in determining how accurately and quickly a model can compute.
To address the limitations of the current study and undertake comparative performance analysis, future work should incorporate appropriate DL algorithms, mobile-based phishing, larger datasets, and other feature selection techniques.

Acknowledgements

The authors appreciate all of the collaborating editors and anonymous reviewers for their thoughtful criticism and recommendations.

References

[1]
A. Taha. 2021. Intelligent ensemble learning approach for phishing website detection based on weighted soft voting. Mathematics 9, 21 (2021), 1--13. DOI:
[2]
S. Mohanty and A. A. Acharya. 2023. MFBFST: Building a stable ensemble learning model using multivariate filter-based feature selection technique for detection of suspicious URL. Procedia Comput. Sci. 218 (2023), 1668–1681. DOI:
[3]
M. M. Raj and J. A. Arul Jothi. 2022. Hybrid approach for phishing website detection using classification algorithms. ParadigmPlus 3, 3 (2022), 16–29. DOI:
[4]
C. M. Igwilo and V. T. Odumuyiwa. 2022. Comparative analysis of ensemble learning and non-ensemble machine learning algorithms for phishing URL detection. FUOYE J. Eng. Technol. 7, 3 (2022), 305–312. DOI:
[5]
J. T. Hancock and T. M. Khoshgoftaar. 2020. CatBoost for big data: an interdisciplinary review. J. Big Data 7, 1 (2020), 1--45. DOI:
[6]
B. Ibrahim, A. Ewusi, and I. Ahenkorah. 2022. Assessing the suitability of boosting machine-learning algorithms for classifying arsenic-contaminated waters: A novel model-explainable approach using shapley additive explanations. Water 14, 21 (2022), 3509. DOI:
[7]
S. O. Folorunso, F. E. Ayo, K. K. A. Abdullah, and P. I. Ogunyinka. 2020. Hybrid vs. ensemble classification models for phishing websites. Iraqi J. Sci. 61, 12 (2020), 3387–3396. DOI:
[8]
Y. Mourtaji, M. Bouhorma, D. Alghazzawi, G. Aldabbagh, and A. Alghamdi. 2021. Hybrid rule-based solution for phishing URL detection using convolutional neural network. Wirel. Commun. Mob. Comput. 2021 (2021), 1--24. DOI:
[9]
A. Zamir, H. U. Khan, T. Iqbal, N. Yousaf, F. Aslam, A. Anjum, and M. Hamdani. 2020. Phishing web site detection using diverse machine learning algorithms. Electron. Libr. 38, 1 (2020), 65–80. DOI:
[10]
A. Subasi and E. Kremic. 2020. Comparison of adaboost with multiboosting for phishing website detection. Procedia Comput. Sci. 168 2019 (2020), 272–278. DOI:
[11]
K. Adane and B. Beyene. 2023. Phishing website detection with and without proper feature selection techniques: Machine learning approach. In Proceedings of the Advances in Intelligent Systems, Computer Science and Digital Economics IV. CSDEIS 2022. Z. Hu, Y. Wang, M. He, (Eds), Lecture Notes on Data Engineering and Communications Technologies, Vol 158. Springer, Cham, 2023. DOI:
[12]
H. Pandey, R. Goyal, D. Virmani, and C. Gupta. 2022. Ensem_SLDR: Classification of cybercrime using ensemble learning technique. Int. J. Comput. Netw. Inf. Secur. 14, 1 (2022), 81–90. DOI:
[13]
C. Catal, G. Giray, B. Tekinerdogan, S. Kumar, and S. Shukla. 2022. Applications of deep learning for phishing detection: A systematic literature review. Knowledge and Information Systems 64, 6 (2022), 1457–1500. DOI:
[14]
A. Hannousse and S. Yahiouche. 2021. Towards benchmark datasets for machine learning based website phishing detection: An experimental study. Eng. Appl. Artif. Intell. 104 (2021), 1–21. DOI:
[15]
K. Adane and B. Beyene. 2022. Machine learning and deep learning based phishing websites detection: The current gaps and next directions. Rev. Comput. Eng. Res. 9, 1 (2022), 13–29. DOI:
[16]
M. Othman and H. Hassan. 2022. An empirical study towards an automatic phishing attack detection using ensemble stacking model. Futur. Comput. Informatics J. 7, 1 (2022), 1–12. DOI:
[17]
A. Karim, M. Shahroz, K. Mustofa, S. B. Belhaouari, and S. R. K. Joga. 2023. Phishing detection system through hybrid machine learning based on URL. IEEE Access 11, (2023), 36805–36822. DOI:
[18]
S. Shikalgar, S. Sawarkar, and S. Narwane. 2019. Detection of URL based phishing attacks using machine learning. Int. J. Eng. Res. Technol. 8, 11 (2019), 537–544. Retrieved from http://dspace.dbit.in/jspui/bitstream/123456789/5646/1/FinalBlackbook.pdf
[19]
N. F. Abedin, R. Bawm, T. Sarwar, M. Saifuddin, M. A. Rahman, and S. Hossain. 2020. Phishing attack detection using machine learning classification techniques. In Proceedings of the 2020 3rd International Conference on Intelligent Sustainable Systems (ICISS). 1125–1130 DOI:
[20]
A. A. Ubing, S. K. B. Jasmi, A. Abdullah, N. Z. Jhanjhi, and M. Supramaniam. 2019. Phishing website detection: An improved accuracy through feature selection and ensemble learning. Int. J. Adv. Comput. Sci. Appl. 10, 1 (2019), 252–257. DOI:
[21]
A. Althnian, D. AlSaeed, H. Al-Baity, A. Samha, A. B. Dris, N. Alzakari, A. A. Elwafa, and H. Kurdi. 2021. Impact of dataset size on classification performance: An empirical evaluation in the medical domain. Appl. Sci. 11, 2 (2021), 1–18. DOI:
[22]
G. Apruzzese, P. Laskov, E. M. D. Oca, W. Mallouli, L. B. Rapa, A. V. Grammatopoulos, and F. D. Franco. 2023. The role of machine learning in cybersecurity. Digit. Threat. Res. Pract. 4, 1 (2023), 1–38. DOI:
[23]
A. Maini, N. Kakwani, B. Ranjitha, M. Shreya, and R. Bharathi. 2021. Improving the performance of semantic-based phishing detection system through ensemble learning method. In Proceedings of the 2021 IEEE Mysore Sub Section International Conference (MysuruCon). 463–469, IEEE. DOI:
[24]
P. L. Indrasiri, M. N. Halgamuge, and A. Mohammad. 2021. Robust ensemble machine learning model for filtering phishing URLs: Expandable random gradient stacked voting classifier (ERG-SVC). IEEE Access 9 (2021), 150142–150161. DOI:
[25]
M. Ahsan, R. Gomes, and A. Denton. 2018. SMOTE implementation on phishing data to enhance cybersecurity. In Proceedings of the 2018 IEEE International Conference on Electro/Information Technology (EIT). 531–536. DOI:
[26]
K. L. Chiew, C. L. Tan, K. Wong, K. S. Yong, and W. K. Tiong. 2019. A new hybrid ensemble feature selection framework for machine learning-based phishing detection system. Information Sciences 484 (2019), 153–166. DOI:
[27]
A. Suryan, C. Kumar, M. Mehta, R. Juneja, and A. Sinha. 2020. Learning model for phishing website detection. ICST Trans. Scalable Inf. Syst. 7, 27 (2020), 1--9. DOI:
[28]
A. G. Mary. 2019. Evaluation of features to identify a phishing website using data analysis. In Proceedings of the 5th International Conference on Information Systems Design and Intelligent Applications. Springer, Singapore. DOI:
[29]
Y. Sönmez, T. Tuncer, H. Gökal, and E. Avci. 2018. Phishing web sites features classification based on extreme learning machine. In Proceedings of the 2018 6th International Symposium on Digital Forensic and Security. 1–5. DOI:
[30]
R. S. Rao and A. R. Pais. 2019. Detection of phishing websites using an efficient feature-based machine learning framework. Neural. Comput. Appl. 31, 8 (2019), 3851–73. DOI:
[31]
S. Patil and S. Dhage. 2019. A methodical overview on phishing detection along with an organized way to construct an anti-phishing framework. In Proceedings of the 2019 5th International Conference on Advanced Computing and Communication Systems (ICACCS). 588–593.
[32]
A. Pandey, N. Gill, K. Sai Prasad Nadendla, and I. S. Thaseen. 2018. Identification of phishing attack in websites using random forest-svm hybrid model. In Proceedings of the International Conference on Intelligent Systems Design and Applications. 120–128. DOI:
[33]
N. Tabassum, F. F. Neha, M. S. Hossain, and H. S. Narman. 2021. A hybrid machine learning based phishing website detection technique through dimensionality reduction. In Proceedings of the 2021 IEEE International Black Sea Conference on Communications and Networking (BlackSeaCom). 1–6. DOI:
[34]
APWG. 2022. Phishing attack trends reports. First Quarter (Q1) Activity (January-March 2022), 1--13. https://docs.apwg.org/reports/apwg_trends_report_q1_2022.pdf. Published June 7, 2022.
[35]
APWG. 2022. Phishing attack trends reports. Second Quarter (Q2) Activity (April-June 2022), 1--13. https://docs.apwg.org/reports/apwg_trends_report_q2_2022.pdf. Published September 20, 2022.
[36]
APWG. 2022. Phishing attack trends reports. Third Quarter (Q3) Activity (July-September 2022), 1--11. https://docs.apwg.org/reports/apwg_trends_report_q3_2022.pdf. Published December 12, 2022.
[37]
APWG. 2022. Phishing attack trends reports. Fourth Quarter (Q4) Activity (October-December 2022), 1--9. https://docs.apwg.org/reports/apwg_trends_report_q4_2022.pdf. Published May 9, 2023.
[38]
N. Pudjihartono, T. Fadason, A. W. Kempa-Liehr, and J. M. O'Sullivan. 2022. A review of feature selection methods for machine learning-based disease risk prediction. Front. Bioinforma. 2 (2022), 1–17. DOI:
[39]
D. Eliane Birba. 2020. A Comparative study of data splitting algorithms for machine learning model selection. Degree Proj. Comput. Sci. Eng. (2020), 1–23. Retrieved from https://www.diva-portal.org/smash/get/diva2:1506870/FULLTEXT01.pdf
[40]
A. Haithm Haithm, S. Abdulrazak Yahya, and O. Alper. 2021. Comparison of gradient boosting decision tree algorithms for CPU performance. J. Institute Sci. Technol. 37, 1 (2021), 157–168. Retrieved from https://dergipark.org.tr/tr/pub/erciyesfen/issue/62093/880315
[41]
EffectiveML. “Using Grid Search to Optimise CatBoost Parameters”. Retrieved June 3, 2023 from https://effectiveml.com/using-grid-search-to-optimise-catboost-parameters
[42]
S. Yadav and S. Shukla. 2016. Analysis of k-Fold Cross-Validation over Hold-Out validation on colossal datasets for quality classification. In Proceedings of the 2016 IEEE 6th International Conference on Advanced Computing (IACC). 78–83. DOI:
[43]
D. P. M. Abellana and D. M. Lao. 2023. A new univariate feature selection algorithm based on the best–worst multi-attribute decision-making method. Decis. Anal. J. 7 (2023), 1--12. DOI:
[44]
Y. Masoudi-Sobhanzadeh, H. Motieghader, and A. Masoudi-Nejad. 2019. FeatureSelect: A software for feature selection based on machine learning approaches. BMC Bioinformatics 20, 1 (2019), 1–17. DOI:
[45]
X. Solé, A. Ramisa, and C. Torras. 2014. Evaluation of random forests on large-scale classification problems using a bag-of-visual-words representation. Front. Artif. Intell. Appl. 269 (2014), 273–276. DOI:
[46]
A. A. Ibrahim, R. L. Ridwan, M. M. Muhammed, R. O. Abdulaziz, and G. A. Saheed. 2020. Comparison of the CatBoost classifier with other machine learning methods. Int. J. Adv. Comput. Sci. Appl. 11, 11 (2020), 738–748. DOI:
[47]
L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. 2018. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. (2018), 6638–6648.
[48]
D. Sarma, T. Mittra, R. M. Bawm, and T. Sarwar. 2021. Comparative analysis of machine learning algorithms for phishing website detection. In Proceedings of the Inventive Computation and Information Technologies. S. Smys, V. E. Balas, K. A. Kamel, P. Lafata (Eds), Lecture Notes in Networks and Systems, Vol 173, Springer, Singapore. DOI:
