DOI: 10.1145/2749469.2750411
research-article

Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures

Published: 13 June 2015

Abstract

The increasing number of cores in manycore architectures causes serious power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and do not generate coherence traffic, but they suffer from poor programmability. A good way to hide these programmability difficulties from the programmer is to make the compiler responsible for generating the code that manages the scratchpad memories. Unfortunately, compilers cannot generate this code in the presence of random memory accesses with unknown aliasing hazards.
This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may access stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption. In exchange, it enables the use of the hybrid memory system, which, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.
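
The idea of diverting a possibly stale access to the valid copy can be illustrated with a small software model. This is only a sketch under stated assumptions, not the protocol described in the paper: the abstract does not specify the mechanism, so the class `HybridMemory`, the mapping table, and the names `map_to_spm`, `unmap`, `guarded_load` and `guarded_store` are all hypothetical. The model assumes the valid copy of a chunk lives in the scratchpad for as long as the chunk is mapped there, and that the compiler marks accesses with unknown aliasing as "guarded".

```python
# Toy software model of one core's view of a hybrid memory system.
# All names here are illustrative assumptions, not taken from the paper.


class HybridMemory:
    def __init__(self, size):
        self.global_mem = [0] * size  # stands in for the cache-backed address space
        self.spm = {}                 # scratchpad contents, keyed by scratchpad offset
        self.mappings = []            # live copies: (global_base, length, spm_base)

    def map_to_spm(self, global_base, length, spm_base):
        """Copy a chunk into the scratchpad and record where it came from."""
        for i in range(length):
            self.spm[spm_base + i] = self.global_mem[global_base + i]
        self.mappings.append((global_base, length, spm_base))

    def unmap(self, global_base):
        """Write the chunk back to global memory and drop its mapping."""
        for entry in list(self.mappings):
            base, length, spm_base = entry
            if base == global_base:
                for i in range(length):
                    self.global_mem[base + i] = self.spm.pop(spm_base + i)
                self.mappings.remove(entry)

    def _divert(self, addr):
        """Return the scratchpad offset holding addr, or None if it is not mapped."""
        for base, length, spm_base in self.mappings:
            if base <= addr < base + length:
                return spm_base + (addr - base)
        return None

    def guarded_load(self, addr):
        """Load for an access whose aliasing with scratchpad data is unknown."""
        spm_addr = self._divert(addr)
        if spm_addr is not None:
            return self.spm[spm_addr]  # valid copy is in the scratchpad
        return self.global_mem[addr]   # no live copy: normal cache path

    def guarded_store(self, addr, value):
        """Store counterpart of guarded_load."""
        spm_addr = self._divert(addr)
        if spm_addr is not None:
            self.spm[spm_addr] = value
        else:
            self.global_mem[addr] = value


mem = HybridMemory(16)
mem.global_mem[4] = 7
mem.map_to_spm(global_base=4, length=4, spm_base=0)
mem.spm[0] = 99                 # compiler-managed access updates the scratchpad copy
diverted = mem.guarded_load(4)  # random access to the same data is diverted
```

In this model the cache-backed location still holds the old value while the chunk is mapped, so an unguarded load of address 4 would read stale data; the guarded access avoids that by consulting the mapping table first.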

      Published In

      ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
      June 2015
      768 pages
      ISBN:9781450334020
      DOI:10.1145/2749469
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate 543 of 3,203 submissions, 17%
