
IATF: An Input-Aware Tuning Framework for Compact BLAS Based on ARMv8 CPUs

Published: 13 January 2023

Abstract

In recent years, mainstream basic linear algebra libraries have delivered high performance for large-scale General Matrix Multiplication (GEMM) and Triangular System Solve (TRSM). However, these libraries still cannot provide sustained performance for batch operations on large groups of fixed-size small matrices on specific architectures, even though such operations are used extensively in scientific computing applications. In this paper, we propose IATF, an input-aware tuning framework that optimizes large groups of fixed-size small GEMM and TRSM to achieve near-optimal performance on the ARMv8 architecture. IATF consists of two stages: an install-time stage and a run-time stage. In the install-time stage, based on a SIMD-friendly data layout, we propose computing-kernel templates for high-performance GEMM and TRSM, analyze optimal kernel sizes to increase the computational instruction ratio, and design kernel optimization strategies to improve kernel execution efficiency. An optimized data-packing strategy is also presented for the computing kernels to minimize memory access overhead. In the run-time stage, we present an input-aware tuning method that generates an efficient execution plan for a large group of fixed-size small GEMM and TRSM according to the properties of the input matrices. Experimental results show that IATF achieves significant performance improvements in GEMM and TRSM compared with other mainstream BLAS libraries.
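The abstract only summarizes the two stages, so the C sketch below illustrates the general shape of the setting it describes: a run-time step inspects the input shape and batch size, selects one execution plan, and a batched driver reuses that plan across the whole group of fixed-size small matrices. All names here (gemm_plan_t, select_plan, batch_small_gemm), the plain-C micro-kernel, and the threshold heuristics are hypothetical illustrations, not the actual IATF implementation or API; in the paper's design, the optimized kernels and packing routines are generated from templates at install time, and the run-time step only maps the input properties onto them.

/*
 * Illustrative sketch only: the plan structure, heuristic thresholds, and
 * function names below are hypothetical, not the IATF API. It shows the
 * general shape of a batched, fixed-size small GEMM
 * (C = alpha*A*B + beta*C, column-major; every matrix in the batch shares
 * the same M, N, K) driven by a run-time, input-aware plan selection.
 */
#include <stddef.h>

typedef struct {
    int mr, nr;     /* register-blocking sizes fixed by the kernel template */
    int pack_a;     /* whether A is repacked into a SIMD-friendly layout    */
    int threads;    /* how many threads the batch is spread over            */
} gemm_plan_t;

/* Reference micro-kernel: one small column-major GEMM. An optimized build
 * would replace this loop nest with a NEON kernel generated at install time;
 * the plain C version keeps the sketch self-contained. */
static void small_gemm(int m, int n, int k,
                       double alpha, const double *A, int lda,
                       const double *B, int ldb,
                       double beta, double *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
}

/* Run-time, input-aware plan selection (hypothetical heuristic): tiny
 * problems skip packing, larger ones pick a wider register block. */
static gemm_plan_t select_plan(int m, int n, int k, size_t batch)
{
    gemm_plan_t plan;
    plan.mr      = (m >= 8) ? 8 : 4;   /* e.g. 8x4 vs. 4x4 kernel template   */
    plan.nr      = 4;
    plan.pack_a  = (k >= 32);          /* packing only pays off for larger k */
    plan.threads = (batch >= 64) ? 4 : 1;
    (void)n;
    return plan;
}

/* Batched driver: the plan is chosen once per input shape and reused for
 * every matrix in the group. */
void batch_small_gemm(int m, int n, int k, double alpha,
                      const double *const *A, int lda,
                      const double *const *B, int ldb,
                      double beta, double *const *C, int ldc,
                      size_t batch)
{
    gemm_plan_t plan = select_plan(m, n, k, batch);
    (void)plan;  /* a full driver would dispatch on plan.mr/nr, pack, thread */
    for (size_t b = 0; b < batch; ++b)
        small_gemm(m, n, k, alpha, A[b], lda, B[b], ldb, beta, C[b], ldc);
}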

Supplementary Material

Appendix (a66-appendix.pdf)

    Published In

    ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
    August 2022
    976 pages
    ISBN:9781450397339
    DOI:10.1145/3545008
This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Auto-tune
    2. Code Generation
    3. Compact Batched BLAS

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

ICPP '22: 51st International Conference on Parallel Processing
    August 29 - September 1, 2022
    Bordeaux, France

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%
