MatRox: modular approach for improving data locality in hierarchical (Mat)rix App(Rox)imation

Authors:

Michelle Mills Strout,

Maryam Mehri Dehnavi

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 389 - 402

https://doi.org/10.1145/3332466.3374548

Published: 19 February 2020

Abstract

Hierarchical matrix approximations have gained significant traction in the machine learning and scientific communities because they exploit low-rank structure in kernel methods to compress the kernel matrix. The resulting compressed matrix, the HMatrix, reduces the computational complexity of operations such as HMatrix-matrix multiplication, with tunable accuracy, in an evaluation phase. Existing implementations of HMatrix evaluation do not preserve locality and often lead to unbalanced parallel execution with high synchronization overhead. In addition, current solutions must re-execute the compression phase whenever the kernel method or the required accuracy changes. MatRox is a framework that combines novel structure analysis strategies with code specialization and a storage format to improve locality and create load-balanced parallel tasks for HMatrix-matrix multiplication. Modularizing the matrix compression phase enables computations to be reused when the input accuracy or the kernel function changes. The MatRox-generated code for matrix-matrix multiplication is 2.98X, 1.60X, and 5.98X faster than the library implementations available in GOFMM, SMASH, and STRUMPACK, respectively. Additionally, reusing portions of the compression computation across accuracy changes yields up to a 2.64X improvement for MatRox over GOFMM across five changes to the accuracy.
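
To make the complexity argument in the abstract concrete, the following minimal sketch shows why a low-rank block is cheaper to multiply. It is not MatRox code and does not reflect its storage format or generated code; the helper names (`matmul`, `dense_cost`, `low_rank_cost`) and the small example matrices are assumptions chosen only for illustration. A block B that factors as U Vᵀ with rank r can be applied to a matrix W as U (Vᵀ W), which avoids ever forming B.

```python
# Illustrative sketch only: not MatRox code. All names here are hypothetical.

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    inner = len(B)
    cols = len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]

def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def dense_cost(n, m, q):
    """Scalar multiplications for an (n x m) block times an (m x q) matrix."""
    return n * m * q

def low_rank_cost(n, m, q, r):
    """Scalar multiplications for U (V^T W) with U: n x r, V: m x r, W: m x q."""
    return r * m * q + n * r * q

# An off-diagonal block that is exactly rank 2 by construction: B = U V^T.
n, m, q, r = 6, 5, 3, 2
U = [[float(i + 1), float((i % 3) - 1)] for i in range(n)]   # n x r
V = [[1.0 / (j + 1), float(j)] for j in range(m)]            # m x r
W = [[float(i + j) for j in range(q)] for i in range(m)]     # m x q

B = matmul(U, transpose(V))                            # explicit n x m block
dense_result = matmul(B, W)                            # B W using the full block
factored_result = matmul(U, matmul(transpose(V), W))   # U (V^T W), B never formed

# Largest entrywise difference between the two evaluation orders.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(dense_result, factored_result)
               for a, b in zip(row_a, row_b))

print("max difference:", max_diff)
print("dense multiplications:", dense_cost(1000, 1000, 10))
print("low-rank multiplications:", low_rank_cost(1000, 1000, 10, 20))
```

In this toy setting the two evaluation orders agree up to floating-point rounding, while the multiplication count for a 1000 x 1000 block of rank 20 applied to 10 right-hand sides drops from n·m·q to r·q·(n + m). In practice the factors come from an approximate compression step, so the accuracy depends on the chosen rank.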

Cited By

Xiao G, Yin C, Chen Y, Duan M, Li K (2024). Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction. IEEE Transactions on Parallel and Distributed Systems 35:6 (1044-1055). Online publication date: Jun-2024.
https://doi.org/10.1109/TPDS.2024.3391254
Chen Y, Xiao G, Ozsu M, Tang Z, Zomaya A, Li K (2022). Exploiting Hierarchical Parallelism and Reusability in Tensor Kernel Processing on Heterogeneous HPC Systems. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (2522-2535). Online publication date: May-2022.
https://doi.org/10.1109/ICDE53745.2022.00234
Xiao G, Yin C, Chen Y, Duan M, Li K (2022). GSpTC: High-Performance Sparse Tensor Contraction on CPU-GPU Heterogeneous Systems. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys) (380-387). Online publication date: Dec-2022.
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00080

Index Terms

MatRox: modular approach for improving data locality in hierarchical (Mat)rix App(Rox)imation
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2020

454 pages

ISBN: 9781450368186

DOI: 10.1145/3332466

General Chair: Rajiv Gupta (UC Riverside)

Program Chair: Xipeng Shen (NCSU)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Qualifiers

Research-article

Funding Sources

Canada Research Chairs program
NSERC
U.S. National Science Foundation (NSF)

Conference

PPoPP '20

PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 22 - 26, 2020

San Diego, California

Acceptance Rates

PPoPP '20 Paper Acceptance Rate 28 of 121 submissions, 23%;

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics

Total Citations: 3
Total Downloads: 403
Downloads (Last 12 months): 30
Downloads (Last 6 weeks): 4

Reflects downloads up to 06 Nov 2024