affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system

Authors:

Sverker Holmgren

ICS '05: Proceedings of the 19th annual international conference on Supercomputing

Pages 387 - 392

https://doi.org/10.1145/1088149.1088201

Published: 20 June 2005

Abstract

The non-uniform memory access times of modern cc-NUMA systems often impair the performance of shared memory applications. This is especially true for applications exhibiting complex access patterns. To improve performance, a mechanism for co-locating threads and data during execution is needed. In this paper, we study how an affinity-on-next-touch procedure can be used to attain this goal. Such a procedure can increase thread-data affinity by migrating data across nodes to better match the access pattern. The migration is triggered by a directive, and it can often be implemented as a re-invocation of a standard first-touch page placement procedure. We study an industrial-class scientific application where the thread-data affinity is low due to serial initialization of data structures that are accessed indirectly. After adding a single affinity-on-next-touch directive, we observed a performance improvement of 69% for 22 threads. We also perform experiments to study the scalability of the affinity-on-next-touch procedure. Our results indicate that the overhead associated with the procedure is highly dependent on the efficiency of the mechanism used to keep TLBs consistent. Using larger but fewer memory pages in the virtual memory subsystem, we measured a total performance improvement of 166% compared to the original code.
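The placement idea in the abstract can be illustrated with a small toy model. The sketch below is not the paper's implementation and calls no operating system interface. It only simulates which node holds each page after a serial initialization and after a next-touch step, modeled here as re-running first-touch placement. The node count, page count, access pattern, and all function names are assumptions chosen for illustration.

```python
# Toy model only: NUM_NODES, NUM_PAGES and the access pattern are
# illustrative assumptions, not values from the paper.
NUM_NODES = 4
NUM_PAGES = 16


def build_accesses():
    """Return (node, page) accesses for one sweep of a hypothetical parallel loop."""
    accesses = []
    for page in range(NUM_PAGES):
        owner = page % NUM_NODES             # node whose thread mostly uses the page
        neighbour = (owner + 1) % NUM_NODES  # node that reads it occasionally
        accesses += [(owner, page), (owner, page), (neighbour, page)]
    return accesses


def serial_first_touch():
    """Serial initialization: a thread on node 0 touches every page first."""
    return [0] * NUM_PAGES


def next_touch(placement, accesses):
    """Re-run first-touch: each marked page moves to the node that touches it next."""
    new_placement = list(placement)
    pending = set(range(NUM_PAGES))          # pages marked by the directive
    for node, page in accesses:
        if page in pending:
            new_placement[page] = node
            pending.remove(page)
    return new_placement


def local_fraction(placement, accesses):
    """Fraction of accesses served from the accessing node's own memory."""
    local = sum(1 for node, page in accesses if placement[page] == node)
    return local / len(accesses)


accesses = build_accesses()
before = serial_first_touch()
after = next_touch(before, accesses)
print(f"local accesses after serial init: {local_fraction(before, accesses):.2f}")
print(f"local accesses after next-touch:  {local_fraction(after, accesses):.2f}")
```

In this toy setup the share of node-local accesses rises from 0.25 to about 0.67. The model ignores the cost of the migration itself, which the abstract identifies as depending on how efficiently TLBs are kept consistent.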

Published In

ICS '05: Proceedings of the 19th annual international conference on Supercomputing

June 2005, 414 pages

ISBN: 1595931678

DOI: 10.1145/1088149

General Chair: Arvind (MIT)

Program Chair: Larry Rudolph (MIT)

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS05: International Conference on Supercomputing 2005

Sponsor: SIGARCH

June 20-22, 2005

Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
