Exploring multi-banked shared-L1 program cache on ultra-low power, tightly coupled processor clusters

Authors:

Germain Haugou,

Michael Gautschi,

Luca Benini

CF '15: Proceedings of the 12th ACM International Conference on Computing Frontiers

Article No.: 64, Pages 1 - 8

https://doi.org/10.1145/2742854.2747288

Published: 06 May 2015

Abstract

L1 instruction caches in many-core systems represent a sizable fraction of the total power consumption. Although large instruction caches can significantly improve performance, they can also increase power consumption. Private caches are usually able to achieve higher speed, due to their simpler design, but the smaller L1 memory space seen by each core induces a high miss ratio. A shared instruction cache is an attractive way to improve performance and energy efficiency while reducing area. In this paper we propose a multi-banked, shared instruction cache architecture suitable for ultra-low power multicore systems, where parallelism and near-threshold operation are used to achieve minimum energy. We implemented the cluster architecture with different configurations of cache sharing, using 28 nm UTBB FD-SOI from STMicroelectronics as the reference technology. Experimental results, based on several real-life applications, demonstrate that the sharing mechanisms have no impact on the system operating frequency. Compared with a cluster of processing elements with private program caches, they either reduce the energy consumption of the cache subsystem by up to 10% at the same area footprint, or halve the overall shared cache area at the same performance and energy efficiency.
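The miss-ratio argument in the abstract can be illustrated with a toy model. The sketch below is not the architecture evaluated in the paper: it assumes direct-mapped caches, line-interleaved banking, cores that fetch the same loop in lockstep, and hypothetical sizes (4 cores, 8 lines per core, a 24-line loop). It only counts misses and ignores bank conflicts, timing, and energy.

```python
"""Toy model: miss ratio of private vs. shared multi-banked instruction caches."""

LINE_BYTES = 16  # assumed cache-line size (hypothetical)


class DirectMappedCache:
    """Direct-mapped cache that tracks tags only and counts misses."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.misses = 0

    def fetch(self, addr):
        line = addr // LINE_BYTES
        index = line % self.num_lines
        if self.tags[index] != line:  # cold or conflict miss
            self.tags[index] = line
            self.misses += 1


class BankedSharedCache:
    """Shared cache split into banks; consecutive lines go to consecutive banks."""

    def __init__(self, num_banks, lines_per_bank):
        self.banks = [DirectMappedCache(lines_per_bank) for _ in range(num_banks)]

    def fetch(self, addr):
        line = addr // LINE_BYTES
        bank = self.banks[line % len(self.banks)]
        # Drop the bank-select part of the line number before indexing the bank.
        bank.fetch((line // len(self.banks)) * LINE_BYTES)


def run(cores=4, lines_per_core=8, footprint_lines=24, iterations=10):
    """Return (private_miss_ratio, shared_miss_ratio) at equal total capacity."""
    private = [DirectMappedCache(lines_per_core) for _ in range(cores)]
    shared = BankedSharedCache(num_banks=cores, lines_per_bank=lines_per_core)
    for _ in range(iterations):
        for line in range(footprint_lines):
            addr = line * LINE_BYTES
            for core in range(cores):  # all cores execute the same loop
                private[core].fetch(addr)
                shared.fetch(addr)
    accesses = cores * footprint_lines * iterations
    private_misses = sum(c.misses for c in private)
    shared_misses = sum(b.misses for b in shared.banks)
    return private_misses / accesses, shared_misses / accesses


if __name__ == "__main__":
    p, s = run()
    print(f"private miss ratio: {p:.3f}, shared miss ratio: {s:.3f}")
```

With these assumed parameters the loop does not fit in any single private cache, so every private fetch misses, whereas the shared cache of the same total capacity holds the whole loop and only takes the initial cold misses. Real workloads and set-associative caches would show a smaller gap.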

Published In

CF '15: Proceedings of the 12th ACM International Conference on Computing Frontiers

May 2015

413 pages

ISBN: 9781450333580

DOI: 10.1145/2742854

General Chairs:
Claudia Di Napoli (Istituto di Calcolo e Reti ad Alte Prestazioni, CNR, Italy),
Valentina Salapura (IBM T. J. Watson Research Center)

Program Chairs:
Hubertus Franke (IBM T. J. Watson Research Center),
Rui Hou (Institute for Computing Technology, Chinese Academy of Sciences, PRC)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CF'15

Sponsor:

SIGMICRO

CF'15: Computing Frontiers Conference

May 18 - 21, 2015

Ischia, Italy

Acceptance Rates

CF '15 Paper Acceptance Rate 33 of 96 submissions, 34%;

Overall Acceptance Rate 273 of 785 submissions, 35%
