The interaction of software prefetching with ILP processors in shared-memory systems

Authors:

Parthasarathy Ranganathan,

Hazim Abdel-Shafi,

Sarita V. Adve

ISCA '97: Proceedings of the 24th annual international symposium on Computer architecture

Pages 144 - 156

https://doi.org/10.1145/264107.264158

Published: 01 May 1997

Abstract

Current microprocessors aggressively exploit instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent work has shown that memory latency remains a significant performance bottleneck for shared-memory multiprocessor systems built of such processors.

This paper provides the first study of the effectiveness of software-controlled non-binding prefetching in shared-memory multiprocessors built of state-of-the-art ILP-based processors. We find that software prefetching results in significant reductions in execution time (12% to 31%) for three out of five applications on an ILP system. However, compared to a previous-generation system, software prefetching is significantly less effective in reducing the memory stall component of execution time on an ILP system. Consequently, even after adding software prefetching, memory stall time accounts for over 30% of the total execution time in four out of five applications on our ILP system.

This paper also investigates the interaction of software prefetching with memory consistency models on ILP-based multiprocessors. In particular, we seek to determine whether software prefetching can equalize the performance of sequential consistency (SC) and release consistency (RC). We find that even with software prefetching, for three out of five applications, RC provides a significant reduction in execution time (15% to 40%) compared to SC.
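As an illustration of what software-controlled non-binding prefetching can look like in source code, the following is a minimal C sketch, not the instrumentation used in the paper. It assumes a GCC- or Clang-compatible compiler that provides the `__builtin_prefetch` builtin. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical and chosen only for this example. The prefetch is a hint that asks the memory system to bring a line closer to the processor; the value is still read by the later ordinary load, so the result is the same with or without the hint.

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in array elements. In practice this would
   be tuned to the miss latency and the amount of work per iteration. */
#define PREFETCH_DISTANCE 16

/* Sum an array while issuing prefetch hints a fixed distance ahead of use. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* GCC/Clang builtin: address, rw (0 = read), locality (0..3).
               Only a hint: it never changes the values the program computes. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i]; /* the actual (binding) read happens here */
    }
    return sum;
}
```

Calling `sum_with_prefetch(a, n)` returns the same sum as a plain loop over `a`; only the timing of cache misses may differ, and whether it helps depends on the access pattern and the machine.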


Published In

ISCA '97: Proceedings of the 24th annual international symposium on Computer architecture

June 1997

350 pages

ISBN:0897919017

DOI:10.1145/264107

Chairmen: Andrew R. Pleszkun (Univ. of Colorado-Boulder, CO), Trevor Mudge (Univ. of Michigan)

ACM SIGARCH Computer Architecture News Volume 25, Issue 2
Special Issue: Proceedings of the 24th annual international symposium on Computer architecture (ISCA '97)
May 1997
349 pages
ISSN:0163-5964
DOI:10.1145/384286
Editors: Andrew R. Pleszkun (Univ. of Colorado-Boulder, CO), Trevor Mudge (Univ. of Michigan)

Copyright © 1997 Authors.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

ISCA97: International Conference on Computer Architecture

Sponsor: SIGARCH

June 1 - 4, 1997

Denver, Colorado, USA

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
