DOI: 10.1145/1996130.1996135

Cache injection for parallel applications

Published: 08 June 2011

Abstract

For two decades, the memory wall has affected many applications in their ability to benefit from improvements in processor speed. Cache injection addresses this disparity for I/O by writing data into a processor's cache directly from the I/O bus. This technique reduces data latency and, unlike data prefetching, improves memory bandwidth utilization. These improvements are significant for data-intensive applications whose performance is dominated by compulsory cache misses.
We present an empirical evaluation of three injection policies and their effect on the performance of two parallel applications and several collective micro-benchmarks. We demonstrate that the effectiveness of cache injection on performance is a function of the communication characteristics of applications, the injection policy, the target cache, and the severity of the memory wall. For example, we show that injecting message payloads to the L3 cache can improve the performance of network-bandwidth limited applications. In addition, we show that cache injection improves the performance of several collective operations, but not all-to-all operations (implementation dependent). Our study shows negligible pollution to the target caches.
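To make the latency argument above concrete, here is a toy cost model (not from the paper; the function name and all latency figures are illustrative assumptions) showing why injecting a message payload into the L3 cache can beat servicing compulsory misses from DRAM: every cache line of a freshly arrived payload is either a DRAM-latency miss or, with injection, an L3 hit.

```python
# Toy cost model of message-receive latency with and without cache
# injection. All numbers are illustrative assumptions, not measurements
# from the paper.

def receive_cost_ns(payload_bytes, line_bytes=128,
                    miss_penalty_ns=300, l3_hit_ns=40,
                    injected=False):
    """Estimate the CPU's cost to read a freshly arrived payload.

    Without injection, every cache line touched is a compulsory miss
    served from DRAM; with L3 injection, each line is an L3 hit.
    """
    lines = -(-payload_bytes // line_bytes)  # ceiling division
    per_line_ns = l3_hit_ns if injected else miss_penalty_ns
    return lines * per_line_ns

# Example: a 4 KiB payload spans 32 cache lines of 128 bytes each.
baseline = receive_cost_ns(4096, injected=False)  # 32 lines * 300 ns
injected = receive_cost_ns(4096, injected=True)   # 32 lines * 40 ns
```

Under these assumed latencies the injected read is several times cheaper, which is the intuition behind the paper's result for network-bandwidth-limited applications; the real benefit depends on the injection policy, the target cache, and how soon the processor consumes the data.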



Published In

HPDC '11: Proceedings of the 20th International Symposium on High Performance Distributed Computing
June 2011, 296 pages
ISBN: 9781450305525
DOI: 10.1145/1996130

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. cache injection
  2. memory wall

Qualifiers

  • Research-article

Conference

HPDC '11

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%


Cited By
  • (2021) "Virtual-Link: A Scalable Multi-Producer Multi-Consumer Message Queue Architecture for Cross-Core Communication." 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 182-191, May 2021. DOI: 10.1109/IPDPS49936.2021.00027
  • (2015) "Adapting Memory Hierarchies for Emerging Datacenter Interconnects." Journal of Computer Science and Technology, 30(1):97-109, Jan. 2015. DOI: 10.1007/s11390-015-1507-4
  • (2013) "Characterizing the Impact of End-System Affinities on the End-to-End Performance of High-Speed Flows." Proceedings of the Third International Workshop on Network-Aware Data Management, pp. 1-10, Nov. 2013. DOI: 10.1145/2534695.2534697
  • (2012) "Affinity-Aware DMA Buffer Management for Reducing Off-Chip Memory Access." Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 1588-1593, Mar. 2012. DOI: 10.1145/2245276.2232031
