research-article

swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures

Authors:

Li WuAuthors Info & Claims

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 338 - 353

https://doi.org/10.1145/3178487.3178513

Published: 10 February 2018 Publication History

Abstract

Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the out-of-control data reuse and high cost for global memory or shared cache access in inter-level synchronization have been largely neglected in existing work.

In this paper, we propose a novel data layout called Sparse Level Tile to make all data reuse under control, and design a Producer-Consumer pairing method to make any inter-level synchronization only happen in very fast register communication. We implement our data layout and algorithms on an SW26010 many-core processor, which is the main building-block of the current world fastest supercomputer Sunway Taihulight. The experimental results of testing all 2057 square matrices from the Florida Matrix Collection show that our method achieves an average speedup of 6.9 and the best speedup of 38.5 over parallel level-set method. Our method also outperforms the latest methods on a KNC many-core processor in 1856 matrices and the latest methods on a K80 GPU in 1672 matrices, respectively.

Supplementary Material

Artifacts Available (swsptrsv-swsptrsv_1.0.zip)

Scripts, Source Files

Download
2.67 MB

References

[1]

2017. https://www.top500.org/. (2017).

[2]

Edward Anderson and Youcef Saad. 1989. Solving sparse triangular linear systems on parallel computers. International Journal of High Speed Computing 1, 01 (1989), 73--95.

Digital Library

[3]

Yulong Ao, Chao Yang, Xinliang Wang, Wei Xue, Haohuan Fu, Fang-fang Liu, Lin Gan, Ping Xu, and Wenjing Ma. 2017. 26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 535--544.

[4]

Martin Bättig and Thomas R. Gross. 2017. Synchronized-by-Default Concurrency for Shared-Memory Systems. In Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). 299--312.

Digital Library

[5]

Andrew M. Bradley. 2016. A Hybrid Multithreaded Direct Sparse Triangular Solver. In 2016 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing. 13--22.

[6]

Nathan G. Bronson, Jared Casper, Hassan Chafi, and Kunle Olukotun. 2010. A Practical Concurrent Binary Search Tree. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10). 257--268.

Digital Library

[7]

R. Das, M. Uysal, J. Saltz, and Y.S. Hwang. 1994. Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures. J. Parallel and Distrib. Comput. 22, 3 (1994), 462 -- 478.

Digital Library

[8]

Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1 (2011), 1:1--1:25.

Digital Library

[9]

Jack Dongarra. 2016. Report on the sunway taihulight system. www.netlib.org. Retrieved June 20 (2016).

[10]

Alexandre X. Duchateau, David Padua, and Denis Barthou. 2013. Hydra: Automatic Algorithm Exploration from Linear Algebra Equations. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (CGO '13). 1--10.

Digital Library

[11]

Iain S. Duff, Albert M. Erisman, and John K. Reid. 2017. Direct Methods for Sparse Matrices (2nd ed.). Oxford University Press, Inc.

[12]

Jiarui Fang, Haohuan Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, and Guangwen Yang. 2017. swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 615--624.

[13]

Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, et al. 2016. The Sunway TaihuLight supercomputer: system and applications. Science China Information Sciences 59, 7 (2016), 072001.

[14]

Elad Gidron, Idit Keidar, Dmitri Perelman, and Yonathan Perez. 2012. SALSA: Scalable and Low Synchronization NUMA-aware Algorithm for Producer-consumer Pools. In Proceedings of the Twenty-fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '12). 151--160.

Digital Library

[15]

Hwansoo Han and Chau-Wen Tseng. 2006. Exploiting locality for irregular scientific codes. IEEE Transactions on Parallel and Distributed Systems 17, 7 (2006), 606--618.

Digital Library

[16]

Kaixi Hou, Weifeng Liu, Hao Wang, and Wu-chun Feng. 2017. Fast Segmented Sort on GPUs. In Proceedings of the International Conference on Supercomputing (ICS '17). Article 12, 12:1--12:10 pages.

Digital Library

[17]

Eun-Jin Im and Katherine Yelick. 1998. Model-based memory hierarchy optimizations for sparse matrices. In Workshop on Profile and Feedback-Directed Compilation, Vol. 139.

[18]

Lijuang Jiang, Chao Yang, Yulong Ao, Wanwang Yin, Wenjing Ma, Qiao Sun, Fangfang Liu, Rongfen Lin, and Peng Zhang. 2017. Towards highly efficient DGEMM on the emerging SW26010 many-core processor. In International Conference on Parallel Processing (ICPP), 2017 IEEE International. IEEE.

[19]

Humayun Kabir, Joshua Dennis Booth, Guillaume Aupy, Anne Benoit, Yves Robert, and Padma Raghavan. 2015. STS-k: a multilevel sparse triangular solution scheme for NUMA multicores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, 55.

Digital Library

[20]

R. Kaleem, A. Venkat, S. Pai, M. Hall, and K. Pingali. 2016. Synchronization Trade-Offs in GPU Implementations of Graph Algorithms. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 514--523.

[21]

Saurabh Kalikar and Rupesh Nasre. 2016. DomLock: A New Multi-granularity Locking Technique for Hierarchies. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16). 23:1--23:12.

Digital Library

[22]

Alex Kogan and Maurice Herlihy. 2014. The Future(s) of Shared Data Structures. In Proceedings of the 2014 ACM Symposium on Principles of Distributed Computing (PODC '14). 30--39.

Digital Library

[23]

Ang Li, Weifeng Liu, Mads R. B. Kristensen, Brian Vinter, Hao Wang, Kaixi Hou, Andres Marquez, and Shuaiwen Leon Song. 2017. Exploring and Analyzing the Real Impact of Modern On-package Memory on HPC Scientific Kernels. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). 26:1--26:14.

Digital Library

[24]

Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. 2013. SMAT: An Input Adaptive Auto-tuner for Sparse Matrix-vector Multiplication. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). 117--126.

Digital Library

[25]

Ruipeng Li and Yousef Saad. 2013. GPU-accelerated preconditioned iterative linear solvers. The Journal of Supercomputing 63, 2 (2013), 443--466.

Digital Library

[26]

Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. 2017. Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 635--645.

[27]

James Lin, Zhigeng Xu, Akira Nukada, Naoya Maruyama, and Satoshi Matsuoka. 2017. Optimizations of Two Compute-bound Scientific Kernels on the SW26010 Many-core Processor. In International Conference on Parallel Processing (ICPP), 2017 IEEE International. IEEE.

[28]

Weifeng Liu. 2015. Parallel and Scalable Sparse Basic Linear Algebra Subprograms. Ph.D. Dissertation. University of Copenhagen.

[29]

Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, and Brian Vinter. 2016. A Synchronization-Free Algorithm for Parallel Sparse Triangular Solves. In European Conference on Parallel Processing. 617--630.

Digital Library

[30]

Weifeng Liu, Ang Li, Jonathan D. Hogg, Iain S. Duff, and Brian Vinter. 2017. Fast Synchronization-Free Algorithms for Parallel Sparse Triangular Solves with Multiple Right-Hand Sides. Concurrency and Computation: Practice and Experience 29, 21 (2017), e4244-n/a.

[31]

Weifeng Liu and Brian Vinter. 2015. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). 339--350.

Digital Library

[32]

Weifeng Liu and Brian Vinter. 2015. A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors. J. Parallel and Distrib. Comput. 85, C (Nov. 2015), 47--61.

Digital Library

[33]

Weifeng Liu and Brian Vinter. 2015. Speculative Segmented Sum for Sparse Matrix-vector Multiplication on Heterogeneous Processors. Parallel Comput. 49, C (Nov. 2015), 179--193.

Digital Library

[34]

Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu. 2017. Concurrent Data Structures for Near-Memory Computing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '17). 235--245.

Digital Library

[35]

Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010. GraphLab: A New Parallel Framework for Machine Learning. In Conference on Uncertainty in Artificial Intelligence (UAI).

Digital Library

[36]

Zoltan Majo and Thomas R. Gross. 2017. A Library for Portable and Composable Data Locality Optimizations for NUMA Systems. ACM Trans. Parallel Comput. 3, 4 (March 2017), 20:1--20:32.

Digital Library

[37]

Jan Mayer. 2009. Parallel algorithms for solving linear systems with sparse triangular matrices. Computing 86, 4 (2009), 291--312.

Digital Library

[38]

Adam Morrison. 2016. Scaling Synchronization in Multicore Programs. Commun. ACM 59, 11 (2016), 44--51.

Digital Library

[39]

Maxim Naumov. 2011. Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. NVIDIA Corp., Westford, MA, USA, Tech. Rep. NVR-2011 1 (2011).

[40]

Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey. 2014. Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In International Supercomputing Conference. 124--140.

Digital Library

[41]

A. Picciau, G. E. Inggs, J. Wickerson, E. C. Kerrigan, and G. A. Constantinides. 2016. Balancing Locality and Concurrency: Solving Sparse Triangular Systems on GPUs. In 2016 IEEE 23rd International Conference on High Performance Computing (HiPC). 183--192.

[42]

I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. J. Kelly, and D. Radford. 2016. Acceleration of a Full-Scale Industrial CFD Application with OP2. IEEE Transactions on Parallel and Distributed Systems 27, 5 (May 2016), 1265--1278.

Digital Library

[43]

B. Ren, G. Agrawal, J. R. Larus, T. Mytkowicz, T. Poutanen, and W. Schulte. 2013. SIMD parallelization of applications that traverse irregular data structures. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 1--10.

Digital Library

[44]

Yousef Saad. 2003. Iterative methods for sparse linear systems. Siam.

Digital Library

[45]

Joel H Saltz. 1990. Aggregation methods for solving sparse triangular systems on multiprocessors. SIAM journal on scientific and statistical computing 11, 1 (1990), 123--144.

[46]

Robert Schreiber and Wei-Pei Tang. 1982. Vectorizing the conjugate gradient method. Unpublished manuscript, Department of Computer Science, Stanford University (1982).

[47]

Michelle Mills Strout, Larry Carter, and Jeanne Ferrante. 2003. Compile-time Composition of Run-time Data and Iteration Reorderings. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI '03). 91--102.

Digital Library

[48]

Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Jonathan Freeman, and Barbara Kreaseck. 2002. Combining Performance Aspects of Irregular Gauss-Seidel Via Sparse Tiling. In Languages and Compilers for Parallel Computing: 15th Workshop, LCPC 2002, College Park, MD, USA, July 25-27, 2002. Revised Papers, Bill Pugh and Chau-Wen Tseng (Eds.). 90--110.

Digital Library

[49]

Michelle Mills Strout, Larry Carter, Jeanne Ferrante, and Barbara Kreaseck. 2004. Sparse Tiling for Stationary Iterative Methods. The International Journal of High Performance Computing Applications 18, 1 (2004), 95--113.

Digital Library

[50]

Michelle Mills Strout, Alan LaMielle, Larry Carter, Jeanne Ferrante, Barbara Kreaseck, and Catherine Olschanowsky. 2016. An Approach for Code Generation in the Sparse Polyhedral Framework. Parallel Comput. 53, C (April 2016), 32--57.

Digital Library

[51]

Brad Suchoski, Caleb Severn, Manu Shantharam, and Padma Raghavan. 2012. Adapting sparse triangular solution to GPUs. In 2012 41st International Conference on Parallel Processing Workshops. IEEE, 140--148.

Digital Library

[52]

Guangming Tan, Junhong Liu, and Jiajia Li. 2018. Design and Implementation of Adaptive SpMV Library for Multicore and Manycore Architecture. ACM Trans. Math. Softw. (2018).

[53]

D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, R. Cledat, H. C. Edwards, H. Finkel, K. Fuerlinger, F. Hannig, E. Jeannot, A. Kamil, J. Keasler, P. H. J. Kelly, V. Leung, H. Ltaief, N. Maruyama, C. J. Newburn, and M. Pericas. 2017. Trends in Data Locality Abstractions for HPC Systems. IEEE Transactions on Parallel and Distributed Systems PP, 99 (2017), 1--1.

[54]

Anand Venkat, Mary Hall, and Michelle Strout. 2015. Loop and Data Transformations for Sparse Matrix Code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15). 521--532.

Digital Library

[55]

Anand Venkat, Mahdi Soltan Mohammadi, Jongsoo Park, Hongbo Rong, Rajkishore Barik, Michelle Mills Strout, and Mary Hall. 2016. Automating Wavefront Parallelization for Sparse Matrix Computations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). 41:1--41:12.

Digital Library

[56]

Richard Vuduc. 2003. Automatic Performance Tuning of Sparse Matrix Kernels. Ph.D. Dissertation. University of California, Berkeley.

Digital Library

[57]

Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. 2010. On the Limits of GPU Acceleration. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Parallelism (HotPar'10). 13--13.

Digital Library

[58]

Richard Vuduc, James W. Demmel, Katherine A. Yelick, Shoaib Kamil, Rajesh Nishtala, and Benjamin Lee. 2002. Performance Optimizations and Bounds for Sparse Matrix-vector Multiply. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (SC '02). 1--35.

Digital Library

[59]

Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala, James W Demmel, and Katherine A Yelick. 2002. Automatic Performance Tuning and Analysis of Sparse Triangular Solve. In ICS 2002: Workshop on Performance Optimization via High-Level Languages and Libraries.

[60]

Hao Wang, Weifeng Liu, Kaixi Hou, and Wu-chun Feng. 2016. Parallel Transposition of Sparse Data Structures. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). 33:1--33:13.

Digital Library

[61]

Xin Wang, Weihua Zhang, Zhaoguo Wang, Ziyun Wei, Haibo Chen, and Wenyun Zhao. 2017. Eunomia: Scaling Concurrent Search Trees under Contention Using HTM. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 385--399.

Digital Library

[62]

Michael M Wolf, Michael A Heroux, and Erik G Boman. 2010. Factors impacting performance of multithreaded sparse triangular solve. In International Conference on High Performance Computing for Computational Science. Springer, 32--44.

Digital Library

[63]

Zhigeng Xu, James Lin, and Satoshi Matsuoka. 2017. Benchmarking SW26010 Many-Core Processor. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, 743--752.

[64]

Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. 2014. yaSpMV: Yet Another SpMV Framework on GPUs. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '14). 107--118.

Digital Library

[65]

Shengen Yan, Guoping Long, and Yunquan Zhang. 2013. StreamScan: Fast Scan Algorithms for GPUs Without Global Barrier Synchronization. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13). 229--238.

Digital Library

[66]

Chao Yang, Wei Xue, Haohuan Fu, Hongtao You, Xinliang Wang, Yu-long Ao, Fangfang Liu, Lin Gan, Ping Xu, Lanning Wang, et al. 2016. 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '16). 6:1--6:12.

Digital Library

[67]

Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. 2011. On-the-fly Elimination of Dynamic Irregularities for GPU Computing. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVI). 369--380.

Digital Library

[68]

Weihua Zhang, Xin Wang, Shiyu Ji, Ziyun Wei, Zhaoguo Wang, and Haibo Chen. 2017. Scaling Concurrent Index Structures under Contention Using HTM. IEEE Transactions on Parallel and Distributed Systems (2017).

[69]

Yuanrui Zhang, Wei Ding, Jun Liu, and Mahmut Kandemir. 2011. Optimizing Data Layouts for Parallel Computation on Multicores. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT '11). 143--154.

Digital Library

[70]

Yue Zhao, Jiajia Li, Chunhua Liao, and Xipeng Shen. 2018. Bridging the Gap between Deep Learning and Sparse Matrix Format Selection. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '18).

Digital Library

Cited By

Hu ZSun JLi ZSun G(2024)AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUsACM Transactions on Architecture and Code Optimization10.1145/367491121:4(1-25)Online publication date: 25-Jun-2024
https://dl.acm.org/doi/10.1145/3674911
He ZSong LLucas RCong JZhang ZPutnam A(2024)LevelST: Stream-based Accelerator for Sparse Triangular SolverProceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays10.1145/3626202.3637568(67-77)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.1145/3626202.3637568
Freire MDufrechou EEzzatti P(2024)A new level-set analysis and sparse storage format for the SPTRSV in GPUs2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD63648.2024.00014(59-69)Online publication date: 13-Nov-2024
https://doi.org/10.1109/SBAC-PAD63648.2024.00014
Show More Cited By

Index Terms

swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures
1. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms

Recommendations

Efficient Block Algorithms for Parallel Sparse Triangular Solve
ICPP '20: Proceedings of the 49th International Conference on Parallel Processing

The sparse triangular solve (SpTRSV) kernel is an important building block for a number of linear algebra routines such as sparse direct and iterative solvers. The major challenge of accelerating SpTRSV lies in the difficulties of finding higher ...
swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures
PPoPP '18

Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the ...
Towards Efficient SpMV on Sunway Manycore Architectures
ICS '18: Proceedings of the 2018 International Conference on Supercomputing

Sparse Matrix-Vector Multiplication (SpMV) is an essential computation kernel for many data-analytic workloads running in both supercomputers and data centers. The intrinsic irregularity in SpMV is challenging to achieve high performance, especially ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2018

442 pages

ISBN:9781450349826

DOI:10.1145/3178487

General Chair:
Andreas Krall
Vienna University of Technology, Austria
,
Program Chair:
Thomas R. Gross
ETH Zürich, Switzerland

ACM SIGPLAN Notices Volume 53, Issue 1
PPoPP '18
January 2018
426 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3200691
Editor:
Matthew Fluet
Rodchester Institude of Technology
Issue’s Table of Contents

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Publication History

Published: 10 February 2018

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Badges

Author Tags

Qualifiers

Research-article

Funding Sources

National Key R&D Program of China
National Natural Science Foundation of China
European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie project

Conference

PPoPP '18

Sponsor:

PPoPP '18: 23nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

61
Total Citations
View Citations
665
Total Downloads

Downloads (Last 12 months)61
Downloads (Last 6 weeks)3

Reflects downloads up to 23 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Hu ZSun JLi ZSun G(2024)AG-SpTRSV: An Automatic Framework to Optimize Sparse Triangular Solve on GPUsACM Transactions on Architecture and Code Optimization10.1145/367491121:4(1-25)Online publication date: 25-Jun-2024
https://dl.acm.org/doi/10.1145/3674911
He ZSong LLucas RCong JZhang ZPutnam A(2024)LevelST: Stream-based Accelerator for Sparse Triangular SolverProceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays10.1145/3626202.3637568(67-77)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.1145/3626202.3637568
Freire MDufrechou EEzzatti P(2024)A new level-set analysis and sparse storage format for the SPTRSV in GPUs2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD63648.2024.00014(59-69)Online publication date: 13-Nov-2024
https://doi.org/10.1109/SBAC-PAD63648.2024.00014
Li JLi LWang QXue WLiang JShi J(2024)Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architectureParallel Computing10.1016/j.parco.2024.103080120(103080)Online publication date: Jun-2024
https://doi.org/10.1016/j.parco.2024.103080
Li JLiang JXue WHu ZLi LShi J(2024)Toward efficient structured-grid triangular solver on sunway many-core processorsThe Journal of Supercomputing10.1007/s11227-023-05802-280:8(10610-10636)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.1007/s11227-023-05802-2
Liu TChen RLiu ZTian MGuo YPan J(2023)A Set of New Optimization Methods Based on Sunway Many-core ProcessorProceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications10.1145/3606043.3606056(92-100)Online publication date: 17-Jun-2023
https://dl.acm.org/doi/10.1145/3606043.3606056
Liu YDing NSao PWilliams SLi XMohror KArnold DBadia R(2023)Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU ClustersProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607092(1-15)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607092
Ren YXu Y(2023)Parallel Optimization of BLAS on a New-Generation Sunway SupercomputerJournal of Circuits, Systems and Computers10.1142/S021812662350290032:17Online publication date: 23-May-2023
https://doi.org/10.1142/S0218126623502900
Shi RZuo YZhang KYan H(2023)Hardware Acceleration Linear Matrix Solvor Based on FPGA2023 IEEE 15th International Conference on ASIC (ASICON)10.1109/ASICON58565.2023.10396603(1-4)Online publication date: 24-Oct-2023
https://doi.org/10.1109/ASICON58565.2023.10396603
Li MLiu CLiao JZheng XYang HSun RXu JGan LYang GLuan ZQian D(2023)Towards optimized tensor code generation for deep learning on sunway many-core processorFrontiers of Computer Science10.1007/s11704-022-2440-718:2Online publication date: 13-Sep-2023
https://doi.org/10.1007/s11704-022-2440-7
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents