Research article · DOI: 10.1145/3412841.3441932 · SAC Conference Proceedings

RaFIO: a random forest I/O-aware algorithm

Published: 22 April 2021

Abstract

Random Forest classification is a widely used machine learning algorithm. Training a random forest consists of building several decision trees that classify elements of the input dataset according to their features. This process is memory intensive: when the dataset is larger than the available memory, the number of I/O operations grows significantly, causing a dramatic performance drop. Our experiments showed that, for a dataset 8 times larger than the available memory workspace, training a random forest is 25 times slower than when the dataset fits in memory. In this paper, we revisit the tree-building algorithm to optimize performance for datasets larger than the memory workspace. The proposed strategy reduces the number of I/O operations by exploiting the temporal locality exhibited by the random forest building algorithm. Experiments showed that, for datasets larger than the main memory workspace, our method reduced tree-building execution time by up to 90%, and by 60% on average, compared to a state-of-the-art method.
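To make the memory-access pattern concrete, the following is a minimal, self-contained sketch of the recursive tree-building loop the abstract describes (pure Python, illustrative only; the function names and structure are assumptions, not the RaFIO implementation). Note that `best_split` rescans the node's entire data for every candidate split; it is this repeated re-reading that inflates I/O cost once the dataset no longer fits in memory.

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_split(rows, labels):
    """Scan every (feature, threshold) pair for the best Gini gain.
    Each call re-reads all of the node's rows, which is the access
    pattern whose I/O cost dominates for out-of-memory datasets."""
    best = None  # (feature, threshold, gain)
    parent, n = gini(labels), len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            gain = (parent
                    - (len(left) / n) * gini(left)
                    - (len(right) / n) * gini(right))
            if best is None or gain > best[2]:
                best = (f, t, gain)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split until pure or max depth; leaves hold a class."""
    if depth == max_depth or gini(labels) == 0.0:
        return max(set(labels), key=labels.count)
    split = best_split(rows, labels)
    if split is None:
        return max(set(labels), key=labels.count)
    f, t, _ = split
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in li], [labels[i] for i in li],
                       depth + 1, max_depth),
            build_tree([rows[i] for i in ri], [labels[i] for i in ri],
                       depth + 1, max_depth))

def predict(node, row):
    """Walk from the root to a leaf."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

# Toy dataset: one feature cleanly separates the two classes.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
tree = build_tree(X, y)
print(predict(tree, [0.15]), predict(tree, [0.85]))  # -> 0 1
```

A random forest repeats this construction over many bootstrap samples of the same dataset, so consecutive node splits touch overlapping rows; that is the temporal locality the paper's strategy exploits to cut I/O.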



Published In

SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing
March 2021
2075 pages
ISBN:9781450381048
DOI:10.1145/3412841


Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing
March 22 - 26, 2021
Virtual Event, Republic of Korea

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%


