DOI: 10.1145/1996130.1996135

Cache injection for parallel applications

Published: 08 June 2011

Abstract

For two decades, the memory wall has affected many applications in their ability to benefit from improvements in processor speed. Cache injection addresses this disparity for I/O by writing data into a processor's cache directly from the I/O bus. This technique reduces data latency and, unlike data prefetching, improves memory bandwidth utilization. These improvements are significant for data-intensive applications whose performance is dominated by compulsory cache misses.
We present an empirical evaluation of three injection policies and their effect on the performance of two parallel applications and several collective micro-benchmarks. We demonstrate that the effectiveness of cache injection on performance is a function of the communication characteristics of applications, the injection policy, the target cache, and the severity of the memory wall. For example, we show that injecting message payloads to the L3 cache can improve the performance of network-bandwidth limited applications. In addition, we show that cache injection improves the performance of several collective operations, but not all-to-all operations (implementation dependent). Our study shows negligible pollution to the target caches.
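To make the latency argument above concrete, here is a toy cost model (not from the paper; the function name and all latency figures are illustrative assumptions) showing why injecting a message payload into the L3 cache can beat servicing compulsory misses from DRAM: every cache line of a freshly arrived payload is either a DRAM-latency miss or, with injection, an L3 hit.

```python
# Toy cost model of message-receive latency with and without cache
# injection. All numbers are illustrative assumptions, not measurements
# from the paper.

def receive_cost_ns(payload_bytes, line_bytes=128,
                    miss_penalty_ns=300, l3_hit_ns=40,
                    injected=False):
    """Estimate the CPU's cost to read a freshly arrived payload.

    Without injection, every cache line touched is a compulsory miss
    served from DRAM; with L3 injection, each line is an L3 hit.
    """
    lines = -(-payload_bytes // line_bytes)  # ceiling division
    per_line_ns = l3_hit_ns if injected else miss_penalty_ns
    return lines * per_line_ns

# Example: a 4 KiB payload spans 32 cache lines of 128 bytes each.
baseline = receive_cost_ns(4096, injected=False)  # 32 lines * 300 ns
injected = receive_cost_ns(4096, injected=True)   # 32 lines * 40 ns
```

Under these assumed latencies the injected read is several times cheaper, which is the intuition behind the paper's result for network-bandwidth-limited applications; the real benefit depends on the injection policy, the target cache, and how soon the processor consumes the data.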



Published In

HPDC '11: Proceedings of the 20th International Symposium on High Performance Distributed Computing
June 2011, 296 pages
ISBN: 9781450305525
DOI: 10.1145/1996130

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. cache injection
  2. memory wall

Qualifiers

  • Research-article

Conference

HPDC '11

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%


Cited By
  • (2021) "Virtual-Link: A Scalable Multi-Producer Multi-Consumer Message Queue Architecture for Cross-Core Communication." 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 182-191, May 2021. DOI: 10.1109/IPDPS49936.2021.00027
  • (2015) "Adapting Memory Hierarchies for Emerging Datacenter Interconnects." Journal of Computer Science and Technology, 30(1):97-109, Jan. 2015. DOI: 10.1007/s11390-015-1507-4
  • (2013) "Characterizing the Impact of End-System Affinities on the End-to-End Performance of High-Speed Flows." Proceedings of the Third International Workshop on Network-Aware Data Management, pp. 1-10, Nov. 2013. DOI: 10.1145/2534695.2534697
  • (2012) "Affinity-Aware DMA Buffer Management for Reducing Off-Chip Memory Access." Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 1588-1593, Mar. 2012. DOI: 10.1145/2245276.2232031
