
Sparso: Context-driven Optimizations of Sparse Linear Algebra

Published: 11 September 2016

Abstract

The sparse matrix is a key data structure in various domains such as high-performance computing, machine learning, and graph analytics. To maximize the performance of sparse matrix operations, it is especially important to optimize across the operations and not just within individual operations. While a straightforward per-operation mapping to library routines misses optimization opportunities, manually optimizing across the boundaries of library routines is time-consuming and error-prone, sacrificing productivity.
This paper introduces Sparso, a framework that automates such optimizations, enabling both high performance and high productivity. In Sparso, a compiler and sparse linear algebra libraries collaboratively discover and exploit context, which we define as the invariant properties of matrices and relationships between them in a program. We present compiler analyses, namely collective reordering analysis and matrix property discovery, to discover the context. The context discovered from these analyses drives key optimizations across library routines and matrices.
We have implemented Sparso with the Julia language and the Intel MKL and SpMP libraries. We evaluate our context-driven optimizations on 6 representative sparse linear algebra algorithms. Compared with a baseline that invokes high-performance libraries without context optimizations, Sparso achieves 1.2-17x (average 5.7x) speedups. Our approach of compiler-library collaboration and context-driven optimizations should also be applicable to other productivity languages such as Matlab, Python, and R.
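The cross-operation optimization the abstract describes can be illustrated with a small, self-contained sketch (plain Python, not the actual Sparso/Julia API; the helper names here are hypothetical): an inspection step, here a symmetric reordering of the matrix, is performed once, and the resulting "context" is then reused across every sparse matrix-vector product (SpMV) in an iterative loop, instead of being re-derived inside each library call.

```python
# Illustrative sketch of context reuse across sparse operations (hypothetical
# helpers; NOT the Sparso API). Key idea: pay the inspection cost (here, a
# symmetric permutation) once, then amortize it over many SpMVs.

def csr_spmv(n, rowptr, colidx, vals, x):
    """Compute y = A*x for an n-by-n matrix A stored in CSR form."""
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colidx[k]]
    return y

def permute_csr(n, rowptr, colidx, vals, perm):
    """Symmetric permutation B = P*A*P^T: row `new` of B is row perm[new] of A."""
    inv = [0] * n                      # inv[old] = new
    for new, old in enumerate(perm):
        inv[old] = new
    nrowptr, ncolidx, nvals = [0] * (n + 1), [], []
    for new in range(n):
        old = perm[new]
        for k in range(rowptr[old], rowptr[old + 1]):
            ncolidx.append(inv[colidx[k]])
            nvals.append(vals[k])
        nrowptr[new + 1] = len(ncolidx)
    return nrowptr, ncolidx, nvals

# A 4x4 tridiagonal matrix (2 on the diagonal, -1 off-diagonal) in CSR form.
n = 4
rowptr = [0, 2, 5, 8, 10]
colidx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
vals   = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
x = [1.0, 2.0, 3.0, 4.0]

# "Context": a reordering discovered once (a real system would use something
# like BFS/RCM; a fixed reversal stands in for it here), applied once, reused.
perm = [3, 2, 1, 0]
browptr, bcolidx, bvals = permute_csr(n, rowptr, colidx, vals, perm)

y  = csr_spmv(n, rowptr, colidx, vals, x)       # result in the original order
xb = [x[p] for p in perm]                        # permute the vector once
yb = csr_spmv(n, browptr, bcolidx, bvals, xb)    # SpMV in the reordered world
# The reordered result agrees with the original up to the permutation:
# yb[i] == y[perm[i]] for every i, so the permuted matrix can safely be
# reused across all iterations of a solver loop.
```

In Sparso itself, analogous context (a collective reordering, matrix constancy, symmetry, and so on) is discovered by the compiler analyses and consumed by the libraries, so inspection costs like the one above are hoisted out of the solver loop automatically rather than by hand.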



Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016
474 pages
ISBN:9781450341219
DOI:10.1145/2967938
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compiler
  2. data-flow analysis
  3. high-performance computing
  4. inspector-executor
  5. reordering
  6. sparse linear algebra

Qualifiers

  • Research-article

Conference

PACT '16
Sponsor:
  • IFIP WG 10.3
  • IEEE TCCA
  • SIGARCH
  • IEEE CS TCPP

Acceptance Rates

PACT '16 Paper Acceptance Rate: 31 of 119 submissions, 26%.
Overall Acceptance Rate: 121 of 471 submissions, 26%.
