A Modified Decision Tree and its Application to Assess Variable Importance
DOI: https://doi.org/10.1145/3478905.3479245
DSIT 2021: 2021 4th International Conference on Data Science and Information Technology, Shanghai, China, July 2021
This paper presents an approach to further improve the data reduction abilities of the traditional C4.5 algorithm by integrating the information gain ratio and forward stepwise regression algorithms. The work is motivated by the fact that the traditional C4.5 algorithm utilizes a full set of antecedent attributes without screening out irrelevant attributes, a precursor to spurious predictive model estimates. This study aims to overcome this drawback by developing and evaluating the performance of an importance-based attribute selection algorithm, the C4.5-Forward Stepwise (C4.5-FS) algorithm, for improving the data reduction abilities of traditional C4.5 classifiers. Five datasets with dimensionality ranging from 6 to 10,000 attributes were employed to evaluate model performance. The goodness of fit of the modified and traditional C4.5 classifiers was assessed using k-fold cross-validation based on a confusion matrix. Experimental results revealed that the C4.5-FS algorithm, trained on fewer antecedent attributes, improved on the traditional C4.5 algorithm trained on a full set of antecedent attributes by achieving higher accuracy.
ACM Reference Format:
Francis Fuller Bbosa*, Ronald Wesonga, Peter Nabende and Josephine Nabukenya. 2021. A Modified Decision Tree and its Application to Assess Variable Importance. In 2021 4th International Conference on Data Science and Information Technology (DSIT 2021), July 23-25, 2021, Shanghai, China. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3478905.3479245
1 INTRODUCTION
Attribute selection is a crucial step in predictive modeling as it affects the performance of the model [1; 2]. The advent of big data has nullified the adage that “the more the attributes, the better the performance” in predictive modeling [3], and it has become paramount to identify the antecedent attributes that reasonably influence the consequent output in a fitted predictive model [4]. This is because such datasets contain numerous variables with different structures characterized by their Volume, Velocity, Variety, Veracity, and Value, making them arduous to process and analyze with traditional statistical procedures that are accustomed exclusively to the investigation of structured and homogeneous data [5; 6], which negatively affects the accuracy of the prediction model.
Moreover, [4] assert that this is a challenge because most traditional statistical techniques currently employed rely on intrinsic assumptions and lack a natural and simple approach to assessing the comparative importance of each antecedent attribute for modeling the consequent attribute. This is underscored by several scholars [7; 8] who revealed that traditional statistical methods are weak at ascertaining the significance of antecedent attributes when dealing with complex and large data, as a result of noise accumulation and the obscurity of the actual distributions of the numerous antecedent and consequent attributes.
Several scholars [9; 10] argue that traditional statistical techniques have been predominantly employed in predictive modeling, but identifying an appropriate combination of relevant antecedent attributes for the consequent attribute remains a key challenge. This is mainly attributed to the fact that statistical techniques carry model assumptions and pre-specified underlying interactions between antecedent and consequent attributes which, when violated, lead to inaccurate and unreliable estimates [9]. Furthermore, ascertaining interactions among attributes using traditional statistical methods entails the pre-specification of interactions between antecedent and consequent attributes [10]. Hence, as the number of antecedent attributes in the model increases, the number of possible interactions that can be investigated also increases, leading to a complicated model that can be difficult to comprehend and fit [9].
According to [11], superfluous antecedent attributes included in a model may degrade its performance as well as escalate the cost of resources for data processing and analysis. As a result, [3] suggest that fitting a small number of highly predictive attributes, by eliminating inapt and redundant attributes to avoid overfitting, results in a more efficient and understandable model. On the other hand, [12] note that machine-learning predictive algorithms, in contrast to traditional statistical techniques, thrive on big data and make fewer assumptions about the data, permitting them to make use of non-normally distributed attributes when partitioning antecedent attributes into groups with similar consequent attributes.
To deal with the issues associated with ascertaining key antecedent attributes in a predictive model based on conventional statistical techniques, various machine learning techniques have been proposed in recent years [3; 4]. This is because some machine learning methods, particularly data mining, are non-parametric in nature and thus make fewer assumptions about the data, allowing them to make use of non-normally distributed attributes [12]. This is emphasized by [13; 14], who revealed that machine learning techniques have been successfully employed in predictive modeling studies [15]. Nonetheless, identifying a combination of key antecedent attributes that optimally determine the consequent attribute remains challenging [8], depending on the specific technique as well as the dimensionality of the dataset [16]: a significant attribute for a linear technique could be non-significant for a non-linear technique [17].
Therefore, there is no unanimous agreement on a single technique that performs best under all conditions [3]. Nevertheless, the application of machine learning to ascertain significant antecedent attributes in predictive modeling remains limited [18].
Whereas various machine learning techniques, for instance Artificial Neural Networks (ANN), Support Vector Machines (SVM), K-Nearest Neighbour (KNN) as well as hybrid models, have been labeled as ‘black box’ [19], C4.5 algorithms have become popular for their simplicity and natural way of identifying relevant attributes in predictive models; they can deal with both nonlinear and interaction effects [20] in addition to offering respectable predictive performance [21]. Hence decision tree algorithms provide a simple framework devoid of pre-specification of the underlying interactions among model attributes, thus providing a conducive platform for researchers to evaluate attribute interactions after trees are grown [10].
Several scholars [22; 23; 24] have employed the traditional C4.5 algorithm as a data reduction technique. This study aims to further improve the data reduction abilities of the traditional C4.5 algorithm so that it generalizes well to other scenarios that require data reduction.
Considerable research aimed at eliminating irrelevant antecedent attributes while maintaining or improving the performance of predictive models has been undertaken in the past decade. In their study comparing attribute selection techniques, Gowda, Manjunath, and Jayaram [25] employed correlation-based feature selection, gain ratio, radial basis function network and backpropagation neural network techniques. The experimental output indicated that the attribute subsets selected by the correlation-based feature selection technique registered marginal improvements in classification accuracy, for both the backpropagation neural network and the radial basis function network, when compared to attribute subsets generated by information gain.
In 2018, Hoque, Singh, and Bhattacharyya [26] proposed “EFS-MI: an ensemble feature selection method for classification”. In a bid to develop an optimal attribute subset, the researchers employed decision trees, SVM, random forests and KNN to generate an ensemble method for attribute selection known as EFS-MI. Findings from their study revealed that the EFS-MI technique outperformed the single traditional classifiers by attaining an optimal subset of antecedent attributes. Gao, Wen, and Zhang [27] proposed “An Improved Random Forest Algorithm for Predicting Employee Turnover”. The authors modified the random forest technique to enhance its ability to predict employee turnover, developing a weighted quadratic random forest method that was employed as the prediction model. The modified random forest outperformed the traditional random forest method, and hence the authors concluded that the weighted quadratic random forest method offers a novel analytic approach that could facilitate a more reliable and accurate prediction of employee turnover.
In 2020, Manonmani and Balakrishnan [28] developed a relevant attribute identification algorithm known as “Improved Teacher Learner Based Optimization (ITLBO)” and compared its performance with that of traditional Support Vector Machines (SVM) and Convolutional Neural Networks. Results revealed that the ITLBO registered a 36% reduction in attributes and hence an overall enhancement in predictive accuracy compared to the traditional algorithms. Thakur and Han [29] proposed “A Study of Fall Detection in Assisted Living: Identifying and Improving the Optimal Machine Learning Method”. The researchers employed 19 machine learning techniques to build the fall detection system. Among these, a novel technique that integrated the KNN classifier, the AdaBoost algorithm and k-fold cross-validation was developed. Their findings revealed that the novel technique outperformed all the other 18 techniques. They concluded that integrating several techniques and algorithms can enhance the performance of a traditional machine learning technique.
Overall, this research underscores the importance of confirming the relationships between antecedent attributes and the occurrence of a consequent attribute in a predictive model by ascertaining the key/relevant determinants of the target attribute. This would simplify the computation process and enhance the efficiency of the entire decision-making process.
The motivation is that the traditional C4.5 algorithm utilizes a full set of antecedent attributes without taking into consideration irrelevant attributes, which is a precursor to spurious predictive model estimates. In this paper, we explore how a C4.5 decision tree machine learning approach, integrated with the forward stepwise regression process, can be used to identify important attributes as well as enhance the tree's overall predictive capacity from a data mining perspective.
1.1 Problem statement
The advent of “big data” has made it difficult to identify antecedent attributes that reasonably influence the consequent output in a fitted predictive model. This is underscored by the fact that the currently employed traditional statistical techniques rely on intrinsic assumptions and lack a natural and simple approach to assessing the comparative importance of each antecedent attribute for modeling the consequent attribute. Hence traditional statistical methods are weak at ascertaining the significance of antecedent attributes when dealing with complex and large data, as a result of noise accumulation and the obscurity of the actual distributions of the numerous antecedent and consequent attributes.
Machine learning techniques have been proposed to deal with the limitations of traditional statistical techniques with respect to ascertaining key antecedent attributes in predictive models. Unfortunately, the relevant literature reveals that the application of machine learning, particularly decision trees, to ascertain significant antecedent attributes in predictive modeling remains limited, in addition to the absence of unanimous agreement on a single machine learning technique that performs best under all conditions.
2 MATERIALS AND METHODS
2.1 Data sources
The data utilized in this experimental study were obtained from the machine learning repository of the University of California at Irvine (UCI)1 (Dua & Graff, 2017), a data warehouse of databases used by several researchers (Chicco & Jurman, 2020; El-Bialy, Salamay, Karam, & Khalifa, 2015; Tougui, Jilbab, & El Mhamdi, 2020; Zriqat, Altamimi, & Azzeh, 2016) for experimental investigation of machine learning techniques. Four datasets with dimensionality ranging from 14 to 10,000 attributes were drawn from this repository to evaluate the performance of the proposed novel algorithm. These four datasets were selected because they were intended for classification-associated tasks and had been cited by other scholars. In addition, a domesticated malaria incidence dataset was used, comprising monthly data for Kampala for the period January 2012 to December 2019: malaria incidence data were obtained from the Ministry of Health through the District Health Information Software (DHIS2), meteorological data from the Uganda National Meteorological Authority (UNMA)2, and demographic data from the Uganda Bureau of Statistics (UBoS)3. Table 1 shows the details of the datasets employed in this study.
Dataset name | Description | Number of attributes | Number of instances | Data Source |
---|---|---|---|---|
Malaria | Data set with attributes to predict malaria incidence thresholds | 6 | 96 | Compiled by the researchers from various sources, i.e. www.health.go.ug, www.ubos.org, www.unma.go.ug |
Heart_UCI | Data set with attributes to detect the presence of heart disease in the patient | 14 | 303 | https://archive.ics.uci.edu/ml/datasets/Heart+Disease |
Breast Cancer Wisconsin (Diagnostic) | Data set with attributes to detect breast cancer diagnosis as benign or malignant | 32 | 569 | https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) |
APS Failure | Data set with attributes to predict determinants of Air Pressure System (APS) failure of Scania trucks | 171 | 60000 | https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks |
Arcene | Data set with attributes to distinguish cancer versus normal patterns from mass-spectrometric data | 10000 | 900 | https://archive.ics.uci.edu/ml/datasets/Arcene |
2.2 Software
R software version 3.6.3 [31], an open-source dialect of the S language and environment developed at Bell Laboratories [32], was used. The “funModeling” version 1.9.3 [33], “FSelector” version 0.31 [34] and “rpart” version 4.1-15 [35] packages in R were employed for data processing and computation of significant antecedent attributes.
2.3 Data preprocessing
2.3.1 Data transformation. The researchers inspected the dataset to determine whether any data transformations were necessary. The following transformations were undertaken during data preprocessing.
2.3.1.1 Data Normalization. Given that the data were collected from different sources with dissimilar units of measurement, standardization was necessary for comprehensive comparisons. Therefore, the researchers employed z-score standardization, which rescales the attribute values around the mean of the attribute, as given by equation (1) [36].
\begin{equation}A = \frac{{{x_i} - \mu }}{\sigma }\end{equation}
(1)
- Where $A$ is the normalized attribute value, ${x_i}$ is the original attribute value, $\mu $ is the mean of the attribute and $\sigma $ is the attribute standard deviation
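As a minimal illustration in R (the software used in this study, Section 2.2), the following sketch applies equation (1) to a numeric column; the data frame `df` and the attribute name `rainfall` are hypothetical placeholders, not names from the study's data.

```r
# Illustrative z-score standardization (equation 1).
# 'df' and the column name 'rainfall' are hypothetical placeholders.
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
df$rainfall <- z_score(df$rainfall)
# Base R's scale() performs the same rescaling: scale(df$rainfall)
```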
2.3.1.2 Data Discretization. The researchers undertook data discretization to create a consistent group of antecedent attributes, since the dataset contained both continuous and categorical attributes, and to generate more interpretable categories in the data that can enhance the comprehensibility of predictive models [37]. The researchers employed the entropy discretization procedures adapted from [38]:
- Let $Z$ be a continuous attribute with $[ {x,y} ]$ as its domain values
- We establish cut-off thresholds $( {{b_1},{b_2}, \ldots ,{b_{m - 1}}} )$, where $x < {b_1} < {b_2} < \ldots < {b_{m - 1}} < y$
- Hence separating $[ {x,y} ]$ into $m$ disjoint intervals, i.e. $[x,{b_1}]$, $({b_1},{b_2}]$, …, $({b_{m - 1}},y]$
- The continuous values of $Z$ were then transformed into $m$ different discrete values $( {{f_1},{f_2}, \ldots ,{f_m}} )$, taking ${b_0} = x$ and ${b_m} = y$, as illustrated below:
\begin{equation}f\left( z \right) = {f_j}\quad if\quad {b_{j - 1}} < z \le {b_j},\quad j = 1, \ldots ,m\end{equation}
(2)
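For illustration, the sketch below discretizes a continuous attribute into $m = 3$ intervals with base R's `cut()`; the cut-off thresholds here are hypothetical, whereas the study derived its cut-offs via entropy-based discretization [38].

```r
# Illustrative discretization of a continuous attribute Z (equation 2).
# The thresholds b_1 = 1 and b_2 = 3 are hypothetical; the study chose
# cut-offs via entropy-based discretization [38].
z <- c(0.3, 1.7, 2.9, 4.2)
cut(z,
    breaks = c(-Inf, 1, 3, Inf),   # x, b_1, b_2, y
    labels = c("f1", "f2", "f3"),  # discrete values f_1, ..., f_m
    right  = TRUE)                 # intervals closed on the right
```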
2.4 Training and validation data
The researchers split the final merged dataset into training and validation datasets: 70% of the data was allocated to the training set for the development of the classifiers, and the remaining 30% was assigned to the validation set for the assessment of model performance [39].
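A minimal sketch of this split in R is shown below; the data frame `df` and the random seed are assumptions.

```r
# Illustrative 70/30 train-validation split; 'df' and the seed are assumptions.
set.seed(123)
idx   <- sample(seq_len(nrow(df)), size = floor(0.70 * nrow(df)))
train <- df[idx, ]   # 70% for classifier development
valid <- df[-idx, ]  # 30% for assessing model performance
```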
2.5 Assessing the significance of antecedent attributes
There are three key algorithms for building decision tree learners: Classification and Regression Trees (CART) [40], Iterative Dichotomiser 3 (ID3) [41] and C4.5 [42]. The CART is only applicable for continuous class attributes (regression), whereas the ID3 and C4.5 are applicable for categorical class attributes [9].
According to Asha et al. [43], ID3, which uses information gain, is biased towards attributes with large numbers of distinct values over attributes with fewer values, although the latter may be more informative. The gain ratio employed by the C4.5 algorithm overcomes this weakness of ID3, in addition to undertaking pruning during and after constructing the tree and handling continuous attributes [44], thereby attaining high accuracy in various tree learning scenarios [45]. The researchers employed the information gain ratio based on C4.5 to evaluate the significance of each antecedent attribute by measuring the gain in information [46].
2.5.1 Decision trees. To build decision trees with discrete attributes, the researchers undertook a splitting process based on a training dataset $D$ and an attribute list $A$ in the following way [46]:
- Given training dataset D, with m possible class values.
- Generate an attribute list $A$ with $V$ possible discrete values
- Given a probability distribution $P = ({p_1},{p_2}, \ldots ,{p_m})$ representing the proportion of cases and a sample size $S$ containing the consequent attribute, the information carried by this distribution for the root node, known as the entropy of D, is given by:
\begin{equation*}{\rm{Entropy}}\left( {\rm{D}} \right) = - \mathop \sum \limits_{{\rm{i}} = 1}^{\rm{m}} \left( {{{\rm{p}}_{\rm{i}}}} \right){\rm{lo}}{{\rm{g}}_2}\left( {{{\rm{p}}_{\rm{i}}}} \right)\end{equation*}
The logarithm is base 2 because entropy is a measure of the expected encoding length measured in bits [47]. ${p_i}$ denotes the probability that an instance belongs to class $i$, and this can be estimated by dividing the number of instances in class $i$ by the total number of instances in the dataset, i.e. ${p_i} = \frac{{\left| {{C_i}} \right|}}{{\left| D \right|}}$.
- For each attribute $A$ with $V$ possible values, compute the entropy of the attribute with respect to the root node $D$:
\begin{equation}{\rm{Entrop}}{{\rm{y}}_{\rm{A}}}\left( {\rm{D}} \right) = \mathop \sum \limits_{{\rm{j}} = 1}^{\rm{v}} \frac{{\left| {{{\rm{D}}_{\rm{j}}}} \right|}}{{\left| {\rm{D}} \right|}}{\rm{*Entropy}}\left( {{{\rm{D}}_{\rm{j}}}} \right)\end{equation}
(3)
- Compute the information gain for each attribute
\begin{equation*}{\rm{Info}}\_{\rm{Gain}}\left( {\rm{A}} \right) = {\rm{Entropy}}\left( {\rm{D}} \right) - {\rm{Entrop}}{{\rm{y}}_A}\left( D \right)\end{equation*}
- Normalise the information gain by computing the split information using equation (4):
\begin{equation}{\rm{Split}}\,{\rm{info}}\,A\left( D \right) = - \mathop \sum \limits_{{\rm{j}} = 1}^{\rm{V}} \frac{{\left| {{{\rm{D}}_{\rm{j}}}} \right|}}{{\left| {\rm{D}} \right|}}{\rm{*}}{\log _2}\left[ {\frac{{\left| {{{\rm{D}}_{\rm{j}}}} \right|}}{{\left| {\rm{D}} \right|}}} \right]\end{equation}
(4)
- Compute the gain ratio for attribute list $A$ with respect to the root node $D$:
\begin{equation}{\rm{Gainratio}}\left( {\rm{A}} \right) = \left( {\frac{{{\rm{Info}}\_{\rm{Gain}}\left( {\rm{A}} \right)}}{{{\rm{Splitinfo}}\left( {\rm{A}} \right)}}} \right)\end{equation}
(5)
- Select the attribute which maximizes the information gain ratio.
- Remove the attribute that offers the highest information gain ratio from the set of attributes.
- Repeat the above steps until all attributes have been tested or all nodes of the decision tree are leaf nodes.
Hence the higher the gain ratio, the more impact the antecedent attribute had on determining the consequent attribute.
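For concreteness, the sketch below computes the gain ratio of a single discrete antecedent attribute from scratch, following equations (3)-(5); the data frame `train` and the column names (`cp`, `Class`) are assumptions, and in practice the study relied on the FSelector package [34] for this computation.

```r
# Illustrative from-scratch gain-ratio computation (equations (3)-(5)).
# 'train', 'cp' and 'Class' are hypothetical names; the study used the
# FSelector package [34].
entropy <- function(cls) {
  p <- prop.table(table(cls))
  p <- p[p > 0]                                # drop empty classes
  -sum(p * log2(p))                            # Entropy(D)
}

gain_ratio <- function(attr, cls) {
  w <- prop.table(table(attr))                 # |D_j| / |D|
  w <- w[w > 0]
  ent_A <- sum(sapply(names(w), function(v)    # Entropy_A(D), equation (3)
    w[[v]] * entropy(cls[attr == v])))
  info_gain  <- entropy(cls) - ent_A           # Info_Gain(A)
  split_info <- -sum(w * log2(w))              # Split info, equation (4)
  info_gain / split_info                       # Gain ratio, equation (5)
}

gain_ratio(train$cp, train$Class)              # e.g. chest pain type vs target
```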
2.5.2 Proposed modified gain ratio based learning algorithm. The researchers integrated the information gain ratio with the inherited properties of the forward stepwise regression process [48] in order to assess the importance of each antecedent attribute [49]. Hence, after computing the gain ratios via the C4.5 algorithm, the researchers commenced with an empty classifier and kept adding the antecedent attributes with the highest gain ratio values (the attributes that improve the model most) to the trained model, one at a time in descending order.
For every addition, the trained classifier was fit onto test data until the stopping criterion was met: achieving accuracy greater than or equal to that returned by the traditional C4.5 technique. This process was repeated, adding attributes to the classifier, until it generated an accuracy metric greater than or equal to that of the C4.5 model trained on all attributes, thus achieving similar performance metric scores, particularly accuracy. The algorithm entailed in this process of enhanced attribute selection based on the C4.5-FS is indicated in Figure 1, and a sketch of the loop is given below.
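The following is a minimal sketch of this two-phase loop, assuming the FSelector [34] and rpart [35] packages named in Section 2.2; the variable names (`train`, `valid`, `Class`) and the exact stopping logic are assumptions based on the description above, not the authors' released code.

```r
# Minimal sketch of the C4.5-FS forward-selection loop described above.
# Assumes FSelector [34] and rpart [35]; 'train', 'valid' and 'Class'
# are hypothetical names, not the authors' code.
library(FSelector)
library(rpart)

acc <- function(fit, data) {
  mean(predict(fit, newdata = data, type = "class") == data$Class)
}

# Phase 1: rank antecedent attributes by gain ratio, descending
gr     <- gain.ratio(Class ~ ., data = train)
ranked <- rownames(gr)[order(-gr$attr_importance)]

# Baseline: traditional tree trained on the full attribute set
target <- acc(rpart(Class ~ ., data = train, method = "class"), valid)

# Phase 2: start from an empty model and add attributes one at a time
# until accuracy matches or exceeds the full-attribute baseline
selected <- character(0)
for (a in ranked) {
  selected <- c(selected, a)
  fit <- rpart(reformulate(selected, response = "Class"),
               data = train, method = "class")
  if (acc(fit, valid) >= target) break  # stopping criterion met
}
selected  # the C4.5-FS attribute subset
```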
2.6 Analysis
The researchers used the C4.5 algorithm, a descendant of the Iterative Dichotomiser 3 (ID3) algorithm developed by [50], to generate the decision tree. The C4.5 algorithm uses entropy and information gain to measure the homogeneity of the consequent attribute that the antecedent attributes match. C4.5 builds a decision tree in a top-down approach: at every node of the tree, one attribute is tested, depending on minimizing entropy or maximizing information gain, and the outcomes are employed in the splitting criteria [51].
2.7 Goodness of Fit
The 10-fold cross-validation method was employed to validate the accuracy of the traditional C4.5 and C4.5-FS classifiers. Here, the researchers split the dataset into 10 sub-groups; in each fold, nine (9) sub-groups served as training data and one (1) sub-group as testing data [52]. The researchers used accuracy, with reference to the confusion matrix, to evaluate the performance of the decision tree [53], as indicated in equation (6):
\begin{equation}{\rm{Accuracy}} = \frac{{TP + TN}}{{TP + TN + FP + FN}}\end{equation}
(6)
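A minimal sketch of this validation step is given below; the fold construction, seed, and names (`df`, `Class`) are assumptions.

```r
# Illustrative 10-fold cross-validation of classifier accuracy (equation 6);
# 'df', 'Class' and the fold assignment are assumptions.
library(rpart)
set.seed(123)
folds <- sample(rep(1:10, length.out = nrow(df)))  # assign each row a fold
cv_acc <- sapply(1:10, function(k) {
  fit  <- rpart(Class ~ ., data = df[folds != k, ], method = "class")
  pred <- predict(fit, newdata = df[folds == k, ], type = "class")
  mean(pred == df$Class[folds == k])               # accuracy on held-out fold
})
mean(cv_acc)  # average accuracy across the ten folds
```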
3 EXPERIMENTAL RESULTS
The researchers subjected five experimental datasets to the traditional C4.5 algorithm in order to derive the gain ratio values as illustrated in equation (5) for each antecedent attribute.
3.1 Significant antecedent attributes in determining the consequent attribute
The researchers employed a C4.5-FS algorithm that inherited characteristics of the forward-selection stepwise regression process by integrating classifier performance evaluation metrics based on a confusion matrix with the information gain ratio to select attributes that significantly influence the target class. The researchers ran experiments to concisely demonstrate this modified importance-based attribute selection, and the results are shown in Table 2.
Name of Dataset | Significant attributes (attribute names denoted in “” where attribute description was anonymized) |
---|---|
Malaria | Under 5 population |
Heart_UCI | Chest pain type (cp) and depression induced by exercise relative to rest (oldpeak) |
Breast Cancer Wisconsin (Diagnostic) | “X10”, “X30”, “X9”, “X23” |
APS Failure | "al_000", "ag_003", "cj_000" ,"ay_009" |
Arcene | "V4381", "V7147", "V4545", "V6038", "V167", "V218" "V6180", "V8610", "V6974", "V9801", "V6440", "V6700" "V5915", "V8093", "V824" |
Table 2 reveals that when the C4.5-FS algorithm is trained on the experimental datasets, the number of antecedent attributes is reduced from 6, 13, 31, 170 and 9,999 (as indicated in Table 1) to 1, 2, 4, 4 and 15 for the “Malaria”, “Heart_UCI”, “Breast Cancer Wisconsin (Diagnostic)”, “APS Failure” and “Arcene” datasets respectively. This implies that the modified C4.5 algorithm was able to achieve the accuracy of the traditional C4.5 algorithm trained on a full set of attributes while discarding irrelevant attributes, hence generating more robust models for predicting the target attributes in the experimental datasets. For instance, a more robust model for determining malaria incidence thresholds based on the sample meteorological and demographic determinants would be fitted on the under-five population attribute alone (Table 2).
3.2 Goodness of fit results
The researchers compared the classification accuracy of the traditional C4.5 and C4.5-FS algorithms, using the selected-attribute datasets against the full-attribute (original) datasets. The goodness of fit results were based on the average results across the ten validation sub-groups. Table 3 shows the accuracy obtained using the selected attribute subset and the full attribute set.
Dataset | Ratio of dataset attributes (P) to instances (N) (P/N) | Traditional C4.5 accuracy (%) | C4.5-FS accuracy (%) | Improvement of C4.5-FS over traditional C4.5 (%) | Ratio of C4.5-FS attributes ${{\boldsymbol{p}}_1}$ to traditional C4.5 attributes ${{\boldsymbol{p}}_0}$ (${{\boldsymbol{p}}_1}/{{\boldsymbol{p}}_0}$) |
---|---|---|---|---|---|
Malaria | 0.0625 | 72.4 | 75.9 | 3.5 | 1/1 |
Heart_UCI | 0.0462 | 76.9 | 76.9 | 0 | 0.286 |
Breast Cancer Wisconsin (Diagnostic) | 0.0562 | 87.1 | 92.4 | 5.3 | 0.286 |
APS Failure | 0.00285 | 90.4 | 91.0 | 0.6 | 0.25 |
Arcene | 11.1111 | 76.7 | 86.7 | 10 | 0.016 |
Results from Table 3 indicate that the C4.5-FS algorithm, with fewer attributes, attained equal or higher performance when both models were fit on test data. Furthermore, it is evident that the modified C4.5 classifier matches or outperforms the traditional C4.5 classifier on all the experimental datasets. Hence the resultant attribute subset from the C4.5-FS algorithm is the most significant set of antecedents for improving the predictive accuracy of the consequent attribute. In terms of performance improvements, Table 3 also reveals that the data reduction ability of the modified C4.5 increases with the number of attributes. Thus, more than half of the attributes that were deemed significant by the traditional C4.5 classifier were eliminated by the proposed C4.5-FS classifier, except for the malaria dataset. In other words, more than 50% of the original attributes retained by the traditional C4.5 classifier can be ignored while still achieving similar or higher accuracy.
4 DISCUSSION
The key objective of this study was to further improve the data reduction abilities of the traditional C4.5 algorithm so that it works well in other scenarios that require data reduction, as well as to enhance its overall predictive capacity from a data mining perspective. The proposed C4.5-FS algorithm, which inherits properties of the forward stepwise regression algorithm, outperforms the traditional C4.5 algorithm on each of the experimental datasets. This implies that integrating the forward stepwise regression process enhances the performance of the traditional C4.5 algorithm, as evidenced by its better performance when fitted on previously unseen data samples. This may be due to the two-phased approach of the C4.5-FS algorithm, which employs the information gain ratio values computed during the first phase in the second phase, which starts with an empty model and inputs antecedent attributes in descending order of their information gain ratio values.
Nonetheless, the results herein, which reveal that the data reduction capabilities of the traditional C4.5 algorithm can be improved by integrating it with other machine learning techniques, are in agreement with those of previous studies [26; 27].
The study had some limitations. Firstly, the researchers did not examine the effect of interactions of antecedent attributes on the consequent attribute, given that C4.5 techniques are based on a greedy search algorithm, which does not deal well with investigations involving attribute interactions [54]. Additionally, the majority of publicly available experimental data is imbalanced [55; 56], which is a limitation to building reliable prediction models [56; 57] due to bias and variance in predictive estimates [58]. Hence the proposed modified decision tree learner could benefit from a comparison with other data mining techniques, which would make it more easily adaptable to scenarios of imbalanced data.
Besides the above, the proposed C4.5-FS adopts the assumptions that the target class attribute is binary, all numeric antecedent attributes are standardized, and the dataset is complete without any missing data.
5 CONCLUSION
In this paper, a C4.5-FS algorithm is proposed to further improve the data reduction abilities of the traditional C4.5 algorithm. Performance comparisons were made between the modified gain ratio algorithm and the conventional gain ratio algorithm. Experimental results revealed that the modified C4.5 algorithm outperformed the traditional C4.5 algorithm when fit on previously unseen sample data. This implies that the proposed C4.5-FS algorithm can be employed to identify significant antecedent attributes in a predictive model without negatively affecting the accuracy of the model.
This research could benefit future works, particularly comparisons of the performance of the C4.5 with various machine learning techniques on similar imbalanced data. Above all, this paper is part of a research-in-progress study intended to address the drawbacks of using traditional statistical techniques to undertake predictions based on big data.
ACKNOWLEDGMENTS
The authors extend their appreciation to Mr. Douglas Candia and Mr. Frank Namugera who contributed to improving this research. This research was partly funded by Makerere University through the Staff Development, Welfare and Retirement Benefits Committee (SDWRBC).
REFERENCES
- Araujo, M., & Guisan, A. (2006). Five (or so) challenges for species distribution modelling. Journal of Biogeography, 33, 1677–1688.
- Bedia, J., Busqué, J., & Gutiérrez, J. (2011). Predicting plant species distribution across an alpine rangeland in northern Spain:A comparison of probabilistic methods. Applied Vegetation Science, 14(3), 415–432.
- Khiabani, F. B., Ramezankhani, A., Azizi, F., Hadaegh, F., Steyerberg, E. W., & Khalili, D. (2015). A tutorial on variable selection for clinical prediction models: Feature selection methods in data-mining could improve the results. In Journal of Clinical Epidemiology. Elsevier Ltd. https://doi.org/10.1016/j.jclinepi.2015.10.002
- Greenwell, B. M., Boehmke, B. C., & Mccarthy, A. J. (2018). A Simple and Effective Model-Based Variable Importance Measure (pp. 1–27).
- Basco, A., & Senthilkumar, N. C. (2017). Real-time analysis of healthcare using big data analytics. IOP Conference Series: Materials Science and Engineering, 263(4). https://doi.org/10.1088/1757-899X/263/4/042056
- Hu, C. H., Lee, H. S., Lara, E., & Gan, S. (2018). The Ensemble and Model Comparison Approaches for Big Data Analytics in Social Sciences. Practical Assessment, Research & Evaluation, 23(17).
- Garg, A., & Tai, K. (2013). Comparison of statistical and machine learning methods in modelling of data with multicollinearity. Int. J. Modelling, Identification and Control, 18(4).
- Loucoubar, C., Paul, R., Bar-Hen, A., Huret, A., Tall, A., Sokhna, C., Trape, J.-F., Ly Badara, A., Faye, J., Diop, A., & Sakuntabhai, A. (2011). An Exhaustive, Non-Euclidean, Non-Parametric Data Mining Tool for Unraveling the Complexity of Biological Systems – Novel Insights into Malaria. PLoS ONE, 6(9), 1–16. https://doi.org/10.1371/journal.pone.0024085
- Chang, L., & Chien, J. (2013). Analysis of driver injury severity in truck-involved accidents using a non-parametric classification tree model. Safety Science, 51(1), 17–22. https://doi.org/10.1016/j.ssci.2012.06.017
- Ramezankhani, A., Kabir, A., Pournik, O., Azizi, F., & Hadaegh, F. (2016). Classification-based data mining for identification of risk patterns associated with hypertension in Middle Eastern population. Medicine, 95(35).
- Krakovska, O., Christie, G., Sixsmith, A., Ester, M., & Moreno, S. (2019). Performance comparison of linear and non- linear feature selection methods for the analysis of large survey datasets. PLoS ONE, 14(3), 1–17.
- Heide, E. M. M. Van Der, Veerkamp, R. F., Pelt, M. L. Van, Kamphuis, C., Athanasiadis, I., & Ducro, B. J. (2019). Comparing regression, naive Bayes, and random forest methods in the prediction of individual survival to second lactation in Holstein cattle. Journal of Dairy Science, 102(10). https://doi.org/10.3168/jds.2019-16295
- Groenhof, T. K. J., Koers, L. R., Blasse, E., Groot, M. De, Grobbee, D. E., Bots, M. L., Asselbergs, F. W., & Lely, A. T. (2019). Data mining information from electronic health records produced high yield and accuracy for current smoking status. Journal of Clinical Epidemiology, 118(2020), 100–106.
- McArdle, J. J., & Ritschard, G. (2014). Contemporary issues in exploratory data mining in the behavioral sciences.
- Sow, B., Mukhtar, H., Ahmad, H. F., & Suguri, H. (2019). Assessing the relative importance of social determinants of health in malaria and anemia classification based on machine learning techniques. In Informatics for Health and Social Care (pp. 1–13). Taylor & Francis. https://doi.org/10.1080/17538157.2019.1582056
- Wang, Y., Jia, Z., & Yang, J. (2019). An Variable Selection Method of the Significance Multivariate Correlation Competitive Population Analysis for Near-Infrared Spectroscopy in Chemical Modeling. In IEEE Access (Vol. 7). IEEE. https://doi.org/10.1109/ACCESS.2019.2954115
- Fisher, A., Rudin, C., & Dominici, F. (2019). All Models are Wrong, but Many are Useful. Journal of Machine Learning Research, 20(1), 1–81.
- Bolat, E., Yildirim, H., Altin, S., & Yurtseven, E. (2020). A COMPREHENSIVE COMPARISON OF MACHINE LEARNING ALGORITHMS ON DIAGNOSING ASTHMA DISEASE AND COPD. International Journal of Sciences and Research, 76(3). https://doi.org/10.21506/j.ponte.2020.3.17
- Truong, V. N., Li, Z., Alain, C., Loong, Y., Boying, L., & Xiaodie, P. (2019). Predicting customer demand for remanufactured products: A Data Mining Approach. In European Journal of Operational Research. Elsevier B.V. https://doi.org/10.1016/j.ejor.2019.08.015
- Gottard, A., Vannucci, G., & Marchetti, G. M. (2020). A note on the interpretation of tree-based regression models. Biometrical Journal, 1–10. https://doi.org/10.1002/bimj.201900195
- Masci, C., Johnes, G., & Agasisti, T. (2018). Student and school performance across countries: A Machine learning approach. European Journal of Operational Research, 269(3), 1072–1085.
- Aldhyani, T. H., & Joshi, M. R. (2014). Analysis of Dimensionality Reduction in Intrusion Detection. International Journal of Computational Intelligence and Informatics, 4(3), 199–206.
- Al Janabi, K. B., & Kadhim, R. (2018). Data Reduction Techniques: A Comparative Study for Attribute Selection Methods. International Journal of Advanced Computer Science and Technology, 8(1), 1–13. http://www.ripublication.com
- Es-sabery, F., & Hair, A. (2020). A MapReduce C4.5 Decision Tree Algorithm Based on Fuzzy Rule-Based System. Fuzzy Information and Engineering, 1–28. https://doi.org/10.1080/16168658.2020.1756099
- Gowda, K. A., Manjunath, A. S., & Jayaram, M. A. (2010). Comparative study of Attribute Selection Using Gain Ratio. International Journal of Information Technology and Knowledge and Knowledge Management, 2(2), 271–277.
- Hoque, N., Singh, M., & Bhattacharyya, D. K. (2018). EFS-MI: an ensemble feature selection method for classification. Complex & Intelligent Systems, 4(2), 105–118.
- Gao, X., Wen, J., & Zhang, C. (2019). An Improved Random Forest Algorithm for Predicting Employee Turnover. In Mathematical Problems in Engineering (pp. 1–13). https://doi.org/10.1155/2019/4140707
- Thakur, N., & Han, C. (2021). A Study of Fall Detection in Assisted Living: Identifying and Improving the Optimal Machine Learning Method. Journal of Sensor and Actuator Networks, 10(3), 39.
- Dua, D., & Graff, C. (2017). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
- R Core Team. (2020). R: A language and environment for statistical computing. https://www.r-project.org/
- Peng, R. D. (2015). R Programming for Data Science. https://www.cs.upc.edu/∼robert/teaching/estadistica/rprogramming.pdf
- Casas, P. (2019). funModeling: Exploratory Data Analysis and Data Preparation Tool-Box (1.9.3). https://cran.r-project.org/package=funModeling
- Romanski, P., & Kotthoff, L. (2018). FSelector: Selecting Attributes (0.31). https://cran.r-project.org/package=FSelector
- Therneau, T., & Atkinson, B. (2019). rpart: Recursive Partitioning and Regression Trees (4.1-15). https://cran.r-project.org/package=rpart
- Aggarwal, C. (2015). Data Mining: The Textbook. Springer. https://doi.org/10.1007/978-3-319-14142-8
- Maslove, D. M., Podchiyska, T., & Lowe, H. J. (2013). Discretization of continuous features in clinical datasets. 544–553. https://doi.org/10.1136/amiajnl-2012-000929
- Li, R., & Wang, Z. (2002). An entropy-based discretization method for classification rules with inconsistency checking. First International Conference on Machine Learning and Cybernetics, November, 4–5.
- Li, G., Zhou, X., Liu, J., Chen, Y., Zhang, H., Chen, Y., Liu, J., Jiang, H., Yang, J., & Nie, S. (2018). Comparison of three data mining models for prediction of advanced schistosomiasis prognosis in the Hubei province. PLoS Neglected Tropical Diseases, 12(2), 1–19. https://doi.org/10.1371/journal.pntd.0006262
- Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Chapman & Hall.
- Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1(1), 81–106.
- Quinlan, J. R. (1993). C4.5 Programs for Machine Learning. Morgan Kaufmann.
- Asha, G. K., Manjunath, A. S., & Jayaram, M. A. (2012). A comparative study of attribute selection using gain ratio and correlation based feature selection. Int. J. Inf. Technol. Knowl. Manag, 2, 271–277.
- Mienye, I. D., Sun, Y., & Wang, Z. (2019). Prediction performance of improved decision tree-based algorithms: a review. Procedia Manufacturing, 35, 698–703.
- Nithya, N. S., & Duraiswamy, K. (2014). Gain ratio based fuzzy weighted association rule mining classifier for medical diagnostic interface. Indian Academy of Sciences, 39, 39–52.
- Prasad, N., & Naidu, M. M. (2013). Gain Ratio as Attribute Selection Measure in Elegant Decision Tree to Predict Precipitation. EUROSIM Congress on Modelling and Simulation, 141–150. https://doi.org/10.1109/EUROSIM.2013.35
- Bhatt, H., Mehta, S., & D'mello, L. (2015). Use of ID3 Decision Tree Algorithm for Placement Prediction. International Journal of Computer Science and Information Technologies, 6(5), 4785–4789.
- Smith, G. (2018). Step away from stepwise. Journal of Big Data, 5(32). https://doi.org/10.1186/s40537-018-0143-6
- Hwang, J., & Hu, T. (2014). A stepwise regression algorithm for high-dimensional variable selection. Journal of Statistical Computation and Simulation, 85(9), 1793–1806. https://doi.org/10.1080/00949655.2014.902460
- Singh, S., & Gupta, P. (2014). Comparative study of ID3, CART and C4.5 decision tree algorithm: A survey. International Journal of Advanced Information Science and Technology, 27(27), 97–103.
- Ali, M. F. M., Asklany, S. A., El-wahab, M. A., & Hassan, M. A. (2019). Data Mining Algorithms for Weather Forecast Phenomena: Comparative Study. International Journal of Computer Science and Network Security, 19(9), 76–81.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer.
- Wiharto, W., Kusnanto, H., & Herianto, H. (2016). Interpretation of Clinical Data Based on C4.5 Algorithm for the Diagnosis of Coronary Heart Disease. Healthcare Informatics Research, 22(3), 186–195.
- Freitas, A. A. (2001). Understanding the Crucial Role of Attribute Interaction in Data Mining. Artificial Intelligence Review, 16, 177–199.
- Branco, P., Torgo, L., & Ribeiro, R. P. (2015). A Survey of Predictive Modelling under Imbalanced Distributions.
- Jain, S., Kotsampasakou, E., & Ecker, G. F. (2018). Comparing the performance of meta-classifiers — a case study on selected imbalanced data sets relevant for prediction of liver toxicity. Journal of Computer-Aided Molecular Design, 32, 583–590. https://doi.org/10.1007/s10822-018-0116-z
FOOTNOTES
* Corresponding Author
1https://archive.ics.uci.edu/ml/datasets.php
2www.unma.go.ug
3www.ubos.org
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
DSIT 2021, July 23–25, 2021, Shanghai, China
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-9024-8/21/07…$15.00.
DOI: https://doi.org/10.1145/3478905.3479245