Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures

Authors:

Lluís Vilanova,

Marc Gonzàlez,

Xavier Martorell,

Eduard Ayguadé,

Mateo Valero

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 720 - 732

https://doi.org/10.1145/2749469.2750411

Published: 13 June 2015

Abstract

The increasing number of cores in manycore architectures causes significant power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and do not generate coherence traffic, but they suffer from poor programmability. A good way to hide these programmability difficulties from the programmer is to make the compiler responsible for generating the code that manages the scratchpad memories. Unfortunately, compilers cannot generate this code in the presence of random memory accesses with unknown aliasing hazards.

This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may access stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption. These overheads enable the use of the hybrid memory system, which, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.
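
The idea of diverting a potentially stale access to the valid copy can be illustrated with a toy model. The sketch below is not the protocol from the paper: the `HybridMemory` class, its method names, and the single-chunk mapping are all hypothetical, chosen only to show why an access with unknown aliasing needs a check before it reads data that may also live in a scratchpad.

```python
class HybridMemory:
    """Toy model (hypothetical): one global memory plus one scratchpad chunk."""

    def __init__(self, size):
        self.global_mem = [0] * size   # stands in for the cache hierarchy / main memory
        self.spm = []                  # scratchpad contents
        self.mapped = None             # (base, length) of the chunk held in the scratchpad

    def map_chunk(self, base, length):
        """Copy a chunk into the scratchpad; the global copy may now go stale."""
        self.spm = self.global_mem[base:base + length]
        self.mapped = (base, length)

    def unmap_chunk(self):
        """Write the chunk back so the global copy is valid again."""
        base, length = self.mapped
        self.global_mem[base:base + length] = self.spm
        self.spm, self.mapped = [], None

    def spm_store(self, offset, value):
        """Compiler-managed access that is known to target the scratchpad."""
        self.spm[offset] = value

    def guarded_load(self, addr):
        """Access with unknown aliasing: check the mapping, divert if needed."""
        if self.mapped is not None:
            base, length = self.mapped
            if base <= addr < base + length:
                return self.spm[addr - base]   # valid copy lives in the scratchpad
        return self.global_mem[addr]           # otherwise the global copy is valid


mem = HybridMemory(16)
mem.map_chunk(4, 4)           # addresses 4..7 now live in the scratchpad
mem.spm_store(1, 99)          # updates address 5 in the scratchpad only
stale = mem.global_mem[5]     # an unguarded read sees the stale global value
fresh = mem.guarded_load(5)   # a guarded read is diverted to the scratchpad
```

In this model the check is a simple range comparison in software; how the actual protocol identifies and diverts such accesses, and at what cost, is described in the paper itself.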

Index Terms

Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures
1. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

June 2015

768 pages

ISBN:9781450334020

DOI:10.1145/2749469

General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
Editor: Doug DeGroot

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '15

Sponsor:

IEEE TCCA
SIGARCH

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Portland, Oregon

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Article Metrics

Total Citations: 27
Total Downloads: 687
Downloads (last 12 months): 25
Downloads (last 6 weeks): 0

Reflects downloads up to 06 Jan 2025
