DOI: 10.1145/3478905.3479245
Research article · Open access

A Modified Decision Tree and its Application to Assess Variable Importance

Published: 28 September 2021

Abstract

This paper presents an approach to further improve the data reduction abilities of the traditional C4.5 algorithm by integrating the information gain ratio with forward stepwise regression. The work is motivated by the fact that the traditional C4.5 algorithm uses the full set of antecedent attributes without discarding irrelevant ones, which can lead to spurious predictive model estimates. To overcome this drawback, the study develops and evaluates an importance-based attribute selection algorithm, C4.5-Forward Stepwise (C4.5-FS), designed to improve the data reduction abilities of the traditional C4.5 classifier. Five datasets with dimensionality ranging from 6 to 10,000 attributes were used to evaluate model performance, and goodness of fit for the modified and traditional C4.5 classifiers was assessed with k-fold cross-validation based on a confusion matrix. Experimental results show that the C4.5-FS algorithm, trained on fewer antecedent attributes, achieved higher accuracy than the traditional C4.5 algorithm trained on the full attribute set, thereby improving its data reduction capability.
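
The abstract describes the core procedure: rank candidate attributes by information gain ratio, then add them in a forward stepwise fashion, keeping only those that improve the cross-validated accuracy of the decision tree. The R sketch below illustrates that idea under stated assumptions and is not the authors' implementation: it assumes the FSelector, rpart, and caret packages, a data frame whose target column is named Class, a simplified forward pass over the gain-ratio ranking (the paper's exact stepwise criterion may differ), and rpart's CART-style tree as a stand-in for a C4.5 classifier.

library(FSelector)   # gain.ratio(): information gain ratio per attribute
library(rpart)       # recursive partitioning trees (CART-style stand-in for C4.5)
library(caret)       # createFolds(): indices for k-fold cross-validation

# Mean k-fold cross-validated accuracy of a decision tree on the given data.
cv_accuracy <- function(data, target, k = 10) {
  folds <- caret::createFolds(data[[target]], k = k)
  accs <- sapply(folds, function(test_idx) {
    fit  <- rpart(as.formula(paste(target, "~ .")),
                  data = data[-test_idx, ], method = "class")
    pred <- predict(fit, data[test_idx, ], type = "class")
    mean(pred == data[test_idx, ][[target]])   # diagonal of the confusion matrix
  })
  mean(accs)
}

# Importance-based forward stepwise attribute selection: rank attributes by
# gain ratio, then keep adding the next-ranked attribute as long as
# cross-validated accuracy improves.
c45_fs <- function(data, target = "Class", k = 10) {
  ranking <- FSelector::gain.ratio(as.formula(paste(target, "~ .")), data)
  ordered <- rownames(ranking)[order(-ranking$attr_importance)]

  selected <- character(0)
  best_acc <- 0
  for (attr in ordered) {
    candidate <- c(selected, attr)
    acc <- cv_accuracy(data[, c(candidate, target), drop = FALSE], target, k)
    if (acc > best_acc) {
      selected <- candidate
      best_acc <- acc
    }
  }
  list(attributes = selected, cv_accuracy = best_acc)
}

# Hypothetical usage on a built-in dataset:
# reduced <- c45_fs(iris, target = "Species"); reduced$attributes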

Acknowledgments

The authors extend their appreciation to Mr. Douglas Candia and Mr. Frank Namugera who contributed to improving this research. This research was partly funded by Makerere University through the Staff Development, Welfare and Retirement Benefits Committee (SDWRBC).


Published In

DSIT 2021: 2021 4th International Conference on Data Science and Information Technology
July 2021
481 pages
ISBN:9781450390248
DOI:10.1145/3478905

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. C4.5
  2. Machine learning
  3. big data
  4. data reduction
  5. modified
  6. significant


Conference

DSIT 2021

Acceptance Rates

Overall Acceptance Rate 114 of 277 submissions, 41%
