DOI: 10.1145/1088149.1088201
Article

Affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system

Published: 20 June 2005

Abstract

The non-uniform memory access times of modern cc-NUMA systems often impair performance for shared memory applications. This is especially true for applications exhibiting complex access patterns. To improve performance, a mechanism for co-locating threads and data during the execution is needed. In this paper, we study how an affinity-on-next-touch procedure can be used to attain this goal. Such a procedure can increase thread-data affinity by migrating data across nodes to better match the access pattern. The migration is triggered by a directive and it can often be implemented as a re-invocation of a standard first-touch page placement procedure. We study an industrial-class scientific application where the thread-data affinity is small due to serial initializations of data structures accessed indirectly. Adding a single affinity-on-next-touch directive, we observed a performance improvement of 69% for 22 threads. We also perform experiments to study the scalability of the affinity-on-next-touch procedure. Our results indicate that the overhead associated with the procedure is highly dependent on the efficiency of the mechanism used to keep TLBs consistent. Using larger but fewer memory pages in the virtual memory sub-system we measured a total performance improvement of 166% compared to the original code.
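
The idea of re-invoking first-touch placement can be illustrated with a toy model. The sketch below is not the paper's implementation: the function names, the one-thread-per-node layout, the page count, and the strided access pattern are all assumptions chosen for illustration, and the model ignores migration cost and TLB consistency overhead entirely. It only shows how serial initialization leaves all pages on one node and how a next-touch pass moves each page to the node that accesses it next.

```python
def first_touch(num_pages, toucher_of):
    """Place each page on the node of the thread that first touches it."""
    return [toucher_of(page) for page in range(num_pages)]


def next_touch(placement, accesses):
    """Re-run first-touch: each page moves to the node of its next toucher."""
    new_placement = list(placement)
    pending = set(range(len(placement)))  # pages marked by the directive
    migrated = 0
    for node, page in accesses:
        if page in pending:
            pending.remove(page)  # only the first touch after the directive counts
            if new_placement[page] != node:
                new_placement[page] = node
                migrated += 1
    return new_placement, migrated


def local_fraction(placement, accesses):
    """Fraction of accesses served from the accessing thread's own node."""
    local = sum(1 for node, page in accesses if placement[page] == node)
    return local / len(accesses)


NODES, PAGES = 4, 16  # hypothetical sizes, one thread per node
# Serial initialization: the thread on node 0 touches every page first.
serial = first_touch(PAGES, lambda page: 0)
# Hypothetical compute phase: node n repeatedly reads pages p with p % NODES == n.
sweep = [(page % NODES, page) for page in range(PAGES)]
accesses = sweep * 2

before = local_fraction(serial, accesses)
migrated_placement, migrated = next_touch(serial, accesses)
after = local_fraction(migrated_placement, accesses)
print(before, after, migrated)
```

In this model the share of local accesses rises from one quarter to all of them after a single next-touch pass, at the price of migrating every page that was not already on the right node. On a real system that price includes the TLB consistency work the abstract identifies as the dominant overhead.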


Published In

ICS '05: Proceedings of the 19th annual international conference on Supercomputing
June 2005
414 pages
ISBN:1595931678
DOI:10.1145/1088149

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. OpenMP
  2. TLB shoot-down
  3. cc-NUMA
  4. computational electro-magnetics
  5. conjugate gradients
  6. large pages
  7. page migration
  8. sparse matrices

Qualifiers

  • Article

Conference

ICS05: International Conference on Supercomputing 2005
June 20 - 22, 2005
Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
