DOI: 10.1145/3447786.3456251
Research article · Open access
Tahoe: tree structure-aware high performance inference engine for decision tree ensemble on GPU

Published: 21 April 2021

Abstract

Decision trees are widely used and often assembled into forests to boost prediction accuracy. However, using decision trees for inference on GPUs is challenging because of irregular memory access patterns and imbalanced workloads across threads. This paper proposes Tahoe, a tree structure-aware, high-performance inference engine for decision tree ensembles. Tahoe rearranges tree nodes to enable efficient, coalesced memory accesses; it also rearranges trees so that trees with similar structures are grouped together in memory and assigned to threads in a balanced way. Beyond memory access efficiency, we introduce a set of inference strategies, each of which uses shared memory differently and has different implications for reduction overhead. We introduce performance models to guide the selection of an inference strategy for arbitrary forests and data sets. Tahoe consistently outperforms the state-of-the-art industry-quality library FIL by 3.82x, 2.59x, and 2.75x on three generations of NVIDIA GPUs (Kepler, Pascal, and Volta), respectively.
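The node-rearrangement idea from the abstract can be illustrated with a minimal sketch: storing a decision tree's nodes in a flat, breadth-first array so that nodes at the same traversal depth sit in contiguous memory, which is what makes coalesced GPU accesses possible. This is an illustrative CPU-side sketch, not Tahoe's actual layout or API; all names (`Node`, `flatten`, `predict`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    feature: int = -1            # -1 marks a leaf
    threshold: float = 0.0
    value: float = 0.0           # prediction stored at leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def flatten(root: Node) -> List[Tuple[int, float, float, int]]:
    """Lay nodes out in breadth-first order as (feature, threshold,
    leaf_value, left_child_index) records. Siblings are adjacent, so a
    node's right child is always at left_child_index + 1, and nodes at
    the same depth are contiguous in memory."""
    order = [root]
    i = 0
    while i < len(order):
        n = order[i]
        if n.feature >= 0:       # internal node: enqueue both children
            order.append(n.left)
            order.append(n.right)
        i += 1
    index = {id(n): j for j, n in enumerate(order)}
    flat = []
    for n in order:
        if n.feature < 0:
            flat.append((-1, 0.0, n.value, -1))
        else:
            flat.append((n.feature, n.threshold, 0.0, index[id(n.left)]))
    return flat

def predict(flat: List[Tuple[int, float, float, int]], x: List[float]) -> float:
    """Traverse the flat array: index arithmetic replaces pointer chasing."""
    i = 0
    while True:
        feat, thr, val, left = flat[i]
        if feat < 0:
            return val
        i = left if x[feat] <= thr else left + 1
```

In a flat layout like this, many GPU threads evaluating different inputs at the same depth read neighbouring array slots instead of chasing scattered pointers; Tahoe's actual scheme additionally groups structurally similar trees so whole warps stay balanced.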




      Published In

      EuroSys '21: Proceedings of the Sixteenth European Conference on Computer Systems
      April 2021, 631 pages
      ISBN: 9781450383349
      DOI: 10.1145/3447786
      This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. decision tree ensemble
      2. decision tree inference
      3. performance model
      4. tree structure


      Funding Sources

      • CNS
      • CCF

      Conference

      EuroSys '21: Sixteenth European Conference on Computer Systems
      April 26-28, 2021
      Online Event, United Kingdom

      Acceptance Rates

      EuroSys '21 paper acceptance rate: 38 of 181 submissions, 21%
      Overall acceptance rate: 241 of 1,308 submissions, 18%
