research-article

Open access

Using in-flight chains to build a scalable cache coherence protocol

Authors: Samantika Subramaniam, Simon C. Steely, Will Hasenplaugh, Tryggve Fossum, Joel Emer

ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Issue 4

Article No.: 28, Pages 1 - 24

https://doi.org/10.1145/2541228.2541235

Published: 01 December 2013

Abstract

As microprocessor designs integrate more cores, scalability of cache coherence protocols becomes a challenging problem. Most directory-based protocols avoid races by using blocking tag directories that can impact the performance of parallel applications. In this article, we first quantitatively demonstrate that state-of-the-art blocking protocols significantly constrain throughput at large core counts for several parallel applications. Nonblocking protocols address this throughput concern at the expense of scalability in the interconnection network or in the required resource overheads. To address this concern, we enhance nonblocking directory protocols by migrating the point of service of responses. Our approach uses in-flight chains of cores making parallel memory requests to incorporate scalability while maintaining high throughput. The proposed cache coherence protocol, called chained cache coherence, can outperform blocking protocols by up to 20% on scientific and 12% on commercial applications. It also has low resource overheads and simple address-ordering requirements, making it both a high-performance and a scalable protocol. Furthermore, in-flight chains provide a scalable solution to building hierarchical and nonblocking tag directories, as well as optimizing communication latencies.
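The idea of migrating the point of service can be illustrated with a toy model. The sketch below is not the protocol from the article. It assumes, purely for illustration, that the directory remembers only the most recent requester of each line (the tail of an in-flight chain) and tells each new requester to obtain its response from the previous requester, so the directory never blocks. The names `ToyChainedDirectory` and `service_order` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyChainedDirectory:
    """Hypothetical nonblocking directory that never stalls a request.

    Assumption (not taken from the article): for each address the directory
    remembers only the most recent requester, i.e. the tail of the chain.
    """
    tail: Dict[int, int] = field(default_factory=dict)

    def request(self, addr: int, core: int) -> Optional[int]:
        """Make `core` the new tail and return the core that should serve it.

        None means no chain is in flight, so memory supplies the response.
        """
        server = self.tail.get(addr)
        self.tail[addr] = core
        return server


def service_order(requests: List[Tuple[int, int]]) -> List[Tuple[int, str]]:
    """Replay (addr, core) requests and report who serves each requester."""
    directory = ToyChainedDirectory()
    result = []
    for addr, core in requests:
        server = directory.request(addr, core)
        result.append((core, "memory" if server is None else f"core {server}"))
    return result


if __name__ == "__main__":
    # Three cores race for the same line; each is chained behind the last.
    for core, server in service_order([(0x40, 0), (0x40, 1), (0x40, 2)]):
        print(f"core {core} is served by {server}")
```

In this toy model the directory does constant work per request and holds one entry per line, which is the intuition behind pairing nonblocking behavior with low resource overhead. Real protocols must also handle evictions, invalidations, and message ordering, none of which is modeled here.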

Index Terms

Using in-flight chains to build a scalable cache coherence protocol
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published In

ACM Transactions on Architecture and Code Optimization Volume 10, Issue 4

December 2013

1046 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2541228

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2013

Revised: 01 September 2013

Accepted: 01 August 2013

Received: 01 June 2013

Published in TACO Volume 10, Issue 4

Qualifiers

Research-article
Research
Refereed
