Automatic SMT threading for OpenMP applications on the Intel Xeon Phi co-processor

Authors:

Lieven Eeckhout

ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers

Article No.: 7, Pages 1 - 7

https://doi.org/10.1145/2612262.2612268

Published: 10 June 2014

Abstract

Simultaneous multithreading is a technique that can improve performance when running parallel applications on the Intel Xeon Phi co-processor. Selecting the most efficient thread count is, however, non-trivial: the potential increase in efficiency has to be balanced against negative factors such as inter-thread competition for cache capacity and increased synchronization overhead.

In this paper, we extend CRUST (ClusteR-aware Undersubscribed Scheduling of Threads), a technique for finding the optimum thread count of OpenMP applications running on clustered cache architectures, to take the behavior of simultaneous multithreading on the Xeon Phi into account. CRUST can automatically find the optimum thread count at sub-application granularity by exploiting application phase behavior at OpenMP parallel section boundaries, and uses hardware performance counter information to gain insight into the application's behavior. We implement a CRUST prototype inside the Intel OpenMP runtime library and show its efficiency running on real Xeon Phi hardware.
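The idea of choosing a thread count separately for each OpenMP parallel section can be illustrated with a small sketch. This is not the CRUST algorithm: it is a hypothetical explore-then-exploit tuner that tries one to four threads per core (an assumed candidate set) and uses measured time per invocation in place of the hardware performance counters the abstract mentions. The names `RegionTuner`, `run_region`, `work`, and `cores` are illustrative and do not come from the paper or from the Intel OpenMP runtime.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RegionTuner:
    """Hypothetical per-region tuner; not the CRUST implementation."""
    candidates: List[int]                                    # thread counts to try
    samples: Dict[int, float] = field(default_factory=dict)  # thread count -> time
    best: Optional[int] = None

    def next_thread_count(self) -> int:
        # Exploration: run each candidate once, in the order given.
        for n in self.candidates:
            if n not in self.samples:
                return n
        # Exploitation: every candidate was measured, reuse the fastest.
        return self.best

    def report(self, threads: int, elapsed: float) -> None:
        self.samples[threads] = elapsed
        if len(self.samples) == len(self.candidates):
            self.best = min(self.samples, key=self.samples.get)


# One tuner per parallel section, keyed by an assumed region identifier.
tuners: Dict[str, RegionTuner] = {}


def run_region(region_id: str, work: Callable[[int], float], cores: int) -> int:
    """Run one invocation of a region; `work(n)` returns its time with n threads."""
    if region_id not in tuners:
        # Assumption: candidates are 1, 2, 3 and 4 threads per core.
        tuners[region_id] = RegionTuner([cores * k for k in (1, 2, 3, 4)])
    tuner = tuners[region_id]
    n = tuner.next_thread_count()
    tuner.report(n, work(n))
    return n


if __name__ == "__main__":
    # Synthetic cost model (an assumption): parallel speedup plus a
    # per-thread overhead term standing in for contention and synchronization.
    cost = lambda n: 100.0 / n + n
    print([run_region("loop_a", cost, cores=4) for _ in range(6)])
```

Because each region keeps its own tuner, two sections of the same application can settle on different thread counts, which is the sub-application granularity the abstract describes. A real runtime would also need to re-explore when a region's behavior changes.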

Published In

ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers

June 2014

76 pages

ISBN:9781450329507

DOI:10.1145/2612262

Conference Chairs: Kamil Iskra (Argonne National Laboratory), Torsten Hoefler (ETH Zurich, Switzerland)

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

European Research Council

Conference

ROSS '14

Sponsor:

SPCL

ROSS '14: Runtime and Operating Systems for Supercomputers

June 10, 2014

Munich, Germany

Acceptance Rates

ROSS '14 Paper Acceptance Rate 9 of 16 submissions, 56%;

Overall Acceptance Rate 58 of 169 submissions, 34%
