DOI: 10.1145/2612262.2612268

Automatic SMT threading for OpenMP applications on the Intel Xeon Phi co-processor

Published: 10 June 2014

Abstract

Simultaneous multithreading is a technique that can improve performance when running parallel applications on the Intel Xeon Phi co-processor. Selecting the most efficient thread count is, however, non-trivial, as the potential increase in efficiency has to be balanced against other, potentially negative factors such as inter-thread competition for cache capacity and increased synchronization overheads.
In this paper, we extend CRUST (ClusteR-aware Undersubscribed Scheduling of Threads), a technique for finding the optimum thread count of OpenMP applications running on clustered cache architectures, to take the behavior of simultaneous multithreading on the Xeon Phi into account. CRUST can automatically find the optimum thread count at sub-application granularity by exploiting application phase behavior at OpenMP parallel section boundaries, and uses hardware performance counter information to gain insight into the application's behavior. We implement a CRUST prototype inside the Intel OpenMP runtime library and show its efficiency running on real Xeon Phi hardware.
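
The idea of choosing a thread count per parallel section, rather than once per application, can be illustrated with a minimal sketch. This is not the CRUST algorithm: the abstract states that CRUST relies on hardware performance counters, whereas the sketch below simply tries each candidate once per section and then reuses the cheapest one. The class `RegionTuner`, the helper `run_region`, the candidate counts (assuming 60 cores with 1 to 4 hardware threads each), and the two toy cost models are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List


class RegionTuner:
    """Hypothetical per-section tuner: try each candidate once, then keep the cheapest."""

    def __init__(self, candidates: List[int]):
        if not candidates:
            raise ValueError("need at least one candidate thread count")
        self.candidates = list(candidates)
        # section id -> {thread count -> cost measured for one execution}
        self.samples: Dict[str, Dict[int, float]] = {}

    def next_thread_count(self, region: str) -> int:
        """Return an untried candidate if one remains, else the cheapest seen so far."""
        seen = self.samples.setdefault(region, {})
        for n in self.candidates:
            if n not in seen:
                return n
        return min(seen, key=seen.get)

    def record(self, region: str, threads: int, cost: float) -> None:
        """Store the cost observed when the section ran with `threads` threads."""
        self.samples.setdefault(region, {})[threads] = cost


def run_region(tuner: RegionTuner, region: str, measure: Callable[[int], float]) -> int:
    """Execute one instance of a parallel section and feed its cost back to the tuner."""
    n = tuner.next_thread_count(region)
    tuner.record(region, n, measure(n))
    return n


# Toy cost models (assumptions, not measurements).
def compute_bound(n: int) -> float:
    # Scales with every added thread.
    return 1000.0 / n


def cache_bound(n: int) -> float:
    # Extra threads compete for cache, so cost rises again past a point.
    return 1000.0 / n + 0.0005 * n * n


# Assumed candidates: 60 cores x 1..4 hardware threads per core.
tuner = RegionTuner([60, 120, 180, 240])
for _ in range(6):
    run_region(tuner, "compute", compute_bound)
    run_region(tuner, "cache", cache_bound)

print(tuner.next_thread_count("compute"), tuner.next_thread_count("cache"))
# prints: 240 120
```

With these toy costs the two sections settle on different thread counts, which is the situation that makes tuning at sub-application granularity worthwhile. A real runtime would replace the toy cost functions with timing or counter measurements taken at section boundaries.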

Published In

ROSS '14: Proceedings of the 4th International Workshop on Runtime and Operating Systems for Supercomputers
June 2014
76 pages
ISBN:9781450329507
DOI:10.1145/2612262

Sponsors

  • SPCL: Scalable Parallel Computing Laboratory

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. OpenMP
  2. auto-tuning
  3. simultaneous multithreading

Qualifiers

  • Research-article

Conference

ROSS '14
Sponsor:
  • SPCL

Acceptance Rates

ROSS '14 Paper Acceptance Rate 9 of 16 submissions, 56%;
Overall Acceptance Rate 58 of 169 submissions, 34%
