DOI: 10.1145/3478905.3479245
Research article · Open access

A Modified Decision Tree and its Application to Assess Variable Importance

Published: 28 September 2021

Abstract

This paper presents an approach to further improve the data reduction abilities of the traditional C4.5 algorithm by integrating the information gain ratio with forward stepwise regression. The work is motivated by the fact that the traditional C4.5 algorithm uses the full set of antecedent attributes without discarding irrelevant ones, which can lead to spurious predictive model estimates. To overcome this drawback, the study develops and evaluates an importance-based attribute selection algorithm, C4.5-Forward Stepwise (C4.5-FS), designed to improve the data reduction abilities of the traditional C4.5 classifier. Five datasets with dimensionality ranging from 6 to 10,000 attributes were used to evaluate model performance, and goodness of fit for the modified and traditional C4.5 classifiers was assessed with k-fold cross-validation based on a confusion matrix. Experimental results show that the C4.5-FS algorithm, trained on fewer antecedent attributes, achieved higher accuracy than the traditional C4.5 algorithm trained on the full attribute set, thereby improving its data reduction capability.
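
The abstract describes the core procedure: rank candidate attributes by information gain ratio, then add them in a forward stepwise fashion, keeping only those that improve the cross-validated accuracy of the decision tree. The R sketch below illustrates that idea under stated assumptions and is not the authors' implementation: it assumes the FSelector, rpart, and caret packages, a data frame whose target column is named Class, a simplified forward pass over the gain-ratio ranking (the paper's exact stepwise criterion may differ), and rpart's CART-style tree as a stand-in for a C4.5 classifier.

library(FSelector)   # gain.ratio(): information gain ratio per attribute
library(rpart)       # recursive partitioning trees (CART-style stand-in for C4.5)
library(caret)       # createFolds(): indices for k-fold cross-validation

# Mean k-fold cross-validated accuracy of a decision tree on the given data.
cv_accuracy <- function(data, target, k = 10) {
  folds <- caret::createFolds(data[[target]], k = k)
  accs <- sapply(folds, function(test_idx) {
    fit  <- rpart(as.formula(paste(target, "~ .")),
                  data = data[-test_idx, ], method = "class")
    pred <- predict(fit, data[test_idx, ], type = "class")
    mean(pred == data[test_idx, ][[target]])   # diagonal of the confusion matrix
  })
  mean(accs)
}

# Importance-based forward stepwise attribute selection: rank attributes by
# gain ratio, then keep adding the next-ranked attribute as long as
# cross-validated accuracy improves.
c45_fs <- function(data, target = "Class", k = 10) {
  ranking <- FSelector::gain.ratio(as.formula(paste(target, "~ .")), data)
  ordered <- rownames(ranking)[order(-ranking$attr_importance)]

  selected <- character(0)
  best_acc <- 0
  for (attr in ordered) {
    candidate <- c(selected, attr)
    acc <- cv_accuracy(data[, c(candidate, target), drop = FALSE], target, k)
    if (acc > best_acc) {
      selected <- candidate
      best_acc <- acc
    }
  }
  list(attributes = selected, cv_accuracy = best_acc)
}

# Hypothetical usage on a built-in dataset:
# reduced <- c45_fs(iris, target = "Species"); reduced$attributes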

Acknowledgments

The authors extend their appreciation to Mr. Douglas Candia and Mr. Frank Namugera who contributed to improving this research. This research was partly funded by Makerere University through the Staff Development, Welfare and Retirement Benefits Committee (SDWRBC).


Published In

DSIT 2021: 2021 4th International Conference on Data Science and Information Technology
July 2021
481 pages
ISBN:9781450390248
DOI:10.1145/3478905

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. C4.5
  2. Machine learning
  3. big data
  4. data reduction
  5. modified
  6. significant


Conference

DSIT 2021

Acceptance Rates

Overall Acceptance Rate 114 of 277 submissions, 41%
