research-article
Open access

Using in-flight chains to build a scalable cache coherence protocol

Published: 01 December 2013

Abstract

As microprocessor designs integrate more cores, the scalability of cache coherence protocols becomes a challenging problem. Most directory-based protocols avoid races by using blocking tag directories, which can degrade the performance of parallel applications. In this article, we first quantitatively demonstrate that state-of-the-art blocking protocols significantly constrain throughput at large core counts for several parallel applications. Nonblocking protocols address this throughput concern at the expense of scalability in the interconnection network or in the required resource overheads. To address this concern, we enhance nonblocking directory protocols by migrating the point of service of responses. Our approach uses in-flight chains of cores making parallel memory requests to achieve scalability while maintaining high throughput. The proposed cache coherence protocol, called chained cache coherence, can outperform blocking protocols by up to 20% on scientific and 12% on commercial applications. It also has low resource overheads and simple address ordering requirements, making it both a high-performance and a scalable protocol. Furthermore, in-flight chains provide a scalable solution to building hierarchical and nonblocking tag directories as well as to optimizing communication latencies.
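The abstract does not spell out the protocol's mechanics, so the following is only a toy sketch of the contrast it draws, not the authors' design. It assumes that a blocking directory holds later requests to a busy address until the current one completes, and that a chained directory instead remembers only the most recent requester per address and hands each new request to that core, so the point of service moves along the chain. The class names `BlockingDirectory` and `ChainedDirectory`, their methods, and the return values are all hypothetical.

```python
from collections import defaultdict


class BlockingDirectory:
    """Toy model (assumption): one transaction per address at a time."""

    def __init__(self):
        self.busy = set()                 # addresses with a transaction in progress
        self.waiting = defaultdict(list)  # cores stalled behind a busy address

    def request(self, core, addr):
        if addr in self.busy:
            # The directory entry is blocked, so the request has to wait.
            self.waiting[addr].append(core)
            return "queued"
        self.busy.add(addr)
        return "serviced"

    def complete(self, addr):
        """Finish the current transaction; return the next core served, if any."""
        if self.waiting[addr]:
            return self.waiting[addr].pop(0)
        self.busy.discard(addr)
        return None


class ChainedDirectory:
    """Toy model (assumption): never blocks, tracks only the chain tail."""

    def __init__(self):
        self.tail = {}  # addr -> most recent requester

    def request(self, core, addr):
        # The previous tail becomes the point of service for this request.
        # None means no chain exists yet, so the directory itself responds.
        previous = self.tail.get(addr)
        self.tail[addr] = core
        return previous


chained = ChainedDirectory()
# Cores 0, 1, 2 request the same line back to back; each is chained
# behind its predecessor instead of waiting at the directory.
links = [chained.request(core, 0x40) for core in (0, 1, 2)]
```

In this sketch the chained directory does a constant amount of work per request and keeps one entry per address regardless of how many requests are in flight, which is one way to read the abstract's claim of low resource overheads. How the real protocol handles data forwarding, invalidations, and ordering is not described in the abstract and is not modeled here.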


Published In

ACM Transactions on Architecture and Code Optimization  Volume 10, Issue 4
December 2013
1046 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2541228
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2013
Revised: 01 September 2013
Accepted: 01 August 2013
Received: 01 June 2013
Published in TACO Volume 10, Issue 4

Author Tags

  1. Cache coherence
  2. nonblocking
  3. synchronization
  4. tag directories

Qualifiers

  • Research-article
  • Research
  • Refereed
