Research article · DOI: 10.1145/3412841.3441932 · SAC Conference Proceedings

RaFIO: a random forest I/O-aware algorithm

Published: 22 April 2021

Abstract

Random Forest classification is a widely used machine learning algorithm. Training a random forest consists of building several decision trees that classify elements of the input dataset according to their features. This process is memory intensive: when the dataset is larger than the available memory, the number of I/O operations grows significantly, causing a dramatic performance drop. Our experiments showed that, for a dataset 8 times larger than the available memory workspace, training a random forest is 25 times slower than when the dataset fits in memory. In this paper, we revisit the tree-building algorithm to optimize performance for datasets larger than the memory workspace. The proposed strategy reduces the number of I/O operations by exploiting the temporal locality exhibited by the random forest building algorithm. Experiments showed that, for datasets larger than the main memory workspace, our method reduced tree-building execution time by up to 90%, and by 60% on average, compared to a state-of-the-art method.
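To make the memory-access pattern concrete, the following is a minimal, self-contained sketch of the recursive tree-building loop the abstract describes (pure Python, illustrative only; the function names and structure are assumptions, not the RaFIO implementation). Note that `best_split` rescans the node's entire data for every candidate split; it is this repeated re-reading that inflates I/O cost once the dataset no longer fits in memory.

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_split(rows, labels):
    """Scan every (feature, threshold) pair for the best Gini gain.
    Each call re-reads all of the node's rows, which is the access
    pattern whose I/O cost dominates for out-of-memory datasets."""
    best = None  # (feature, threshold, gain)
    parent, n = gini(labels), len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            gain = (parent
                    - (len(left) / n) * gini(left)
                    - (len(right) / n) * gini(right))
            if best is None or gain > best[2]:
                best = (f, t, gain)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split until pure or max depth; leaves hold a class."""
    if depth == max_depth or gini(labels) == 0.0:
        return max(set(labels), key=labels.count)
    split = best_split(rows, labels)
    if split is None:
        return max(set(labels), key=labels.count)
    f, t, _ = split
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in li], [labels[i] for i in li],
                       depth + 1, max_depth),
            build_tree([rows[i] for i in ri], [labels[i] for i in ri],
                       depth + 1, max_depth))

def predict(node, row):
    """Walk from the root to a leaf."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

# Toy dataset: one feature cleanly separates the two classes.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
tree = build_tree(X, y)
print(predict(tree, [0.15]), predict(tree, [0.85]))  # -> 0 1
```

A random forest repeats this construction over many bootstrap samples of the same dataset, so consecutive node splits touch overlapping rows; that is the temporal locality the paper's strategy exploits to cut I/O.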



Published In

SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing
March 2021
2075 pages
ISBN:9781450381048
DOI:10.1145/3412841


Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing
March 22 - 26, 2021
Virtual Event, Republic of Korea

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%


