DOI: 10.5555/645989.674321
Article

The Use of Prediction for Accelerating Upgrade Misses in cc-NUMA Multiprocessors

Published: 22 September 2002

Abstract

This work focuses on accelerating upgrade misses in cc-NUMA multiprocessors. These misses are caused by store instructions for which a read-only copy of the line is found in the L2 cache. An upgrade miss requires a message from the missing node to the directory, a directory lookup to find the set of sharers, invalidation messages sent to the sharers, and responses to those invalidations sent back. The penalty paid by these misses is therefore not negligible, especially since they account for a high percentage of the total miss rate. We propose using prediction to give cc-NUMA multiprocessors more efficient support for upgrade misses, by having the missing node invalidate the sharers directly. Our proposal comprises an effective prediction scheme that achieves high hit rates and a coherence protocol extended to support the use of prediction. Our work is motivated by two key observations: first, upgrade misses exhibit repetitive behavior and, second, the number of sharers that must be invalidated is small (one, in some cases). Using execution-driven simulations, we show that prediction can significantly accelerate upgrade misses (latency reductions of more than 40% in some cases). These improvements translate into application speed-ups of up to 14%. Finally, these results can be obtained with a predictor whose total size is less than 48 KB per node.
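The mechanism the abstract describes can be sketched as follows. The node taking the upgrade miss invalidates the sharers it predicts itself, and the directory only handles whatever the prediction missed. This is a minimal illustration under stated assumptions, not the authors' design. The predictor organization is hypothetical: a direct-mapped, tagged table indexed by line address that remembers the sharers seen at the last upgrade miss. The names `SharerPredictor` and `upgrade_miss`, the table size, and the line size are also hypothetical. Races between concurrent requests are ignored.

```python
class SharerPredictor:
    """Hypothetical per-node predictor of the sharers an upgrade miss must invalidate.

    Assumption (not the paper's exact design): a direct-mapped, tagged table
    indexed by cache-line address; each entry remembers the sharers found at
    the last upgrade miss to that line.
    """

    def __init__(self, num_entries=1024, line_bytes=64):
        self.num_entries = num_entries
        self.line_bytes = line_bytes
        self.tags = [None] * num_entries
        self.sharers = [frozenset()] * num_entries

    def _slot(self, addr):
        line = addr // self.line_bytes
        return line % self.num_entries, line

    def predict(self, addr):
        idx, line = self._slot(addr)
        # Empty slot or tag mismatch: no prediction, rely on the directory.
        return self.sharers[idx] if self.tags[idx] == line else frozenset()

    def update(self, addr, observed_sharers):
        idx, line = self._slot(addr)
        self.tags[idx] = line
        self.sharers[idx] = frozenset(observed_sharers)


def upgrade_miss(requester, addr, directory, predictor):
    """Resolve one upgrade miss.

    Returns (sharers invalidated directly by the requester thanks to the
    prediction, sharers left for the directory to invalidate).
    Simplification: the request is assumed to carry the predicted set, so the
    directory only invalidates the sharers the prediction missed.
    """
    predicted = predictor.predict(addr) - {requester}
    actual = directory[addr] - {requester}
    # Predicted nodes that really hold a copy are invalidated without waiting
    # for the directory lookup; a wrong guess only costs a useless message.
    hit = predicted & actual
    # Sharers the predictor missed still go through the normal directory path.
    missed = actual - predicted
    directory[addr] = {requester}        # requester becomes exclusive owner
    predictor.update(addr, actual)       # learn for the next upgrade miss
    return hit, missed


# Usage: a repetitive producer/consumer pattern on one line.
directory = {0x1000: {0, 1, 2}}          # line read-shared by nodes 0, 1 and 2
pred0 = SharerPredictor()                # node 0's predictor
print(upgrade_miss(0, 0x1000, directory, pred0))  # cold: (frozenset(), {1, 2})
directory[0x1000] |= {1, 2}              # nodes 1 and 2 read the line again
print(upgrade_miss(0, 0x1000, directory, pred0))  # predicted: (frozenset({1, 2}), set())
```

Remembering the last observed sharer set is about the simplest way to exploit the repetitive behavior the abstract reports. Because the sharer sets are small, a wrong guess costs at most a few useless invalidations. In this sketch, sharers the predictor misses are still caught by the directory, so correctness never depends on the prediction.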

Information

Published In

ACM Conferences
PACT '02: Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
September 2002
168 pages
ISBN: 0769516203

Publisher

IEEE Computer Society

United States

Conference

PACT '02
Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
