DOI: 10.1145/2742854.2747288

Exploring multi-banked shared-L1 program cache on ultra-low power, tightly coupled processor clusters

Published: 06 May 2015

Abstract

L1 instruction caches in many-core systems account for a sizable fraction of total power consumption. Although large instruction caches can significantly improve performance, they can also increase power consumption. Private caches can usually run at higher speed thanks to their simpler design, but the smaller L1 memory space seen by each core induces a high miss ratio. A shared instruction cache is therefore an attractive way to improve performance and energy efficiency while reducing area. In this paper we propose a multi-banked, shared instruction cache architecture suitable for ultra-low power multicore systems, where parallelism and near-threshold operation are used to achieve minimum energy. We implemented the cluster architecture with different cache-sharing configurations, using 28nm UTBB FD-SOI from STMicroelectronics as the reference technology. Experimental results, based on several real-life applications, demonstrate that the sharing mechanisms have no impact on the system operating frequency. Compared with a cluster of processing elements with private program caches, they either reduce the energy consumption of the cache subsystem by up to 10% while keeping the same area footprint, or reduce the overall shared cache area by 2× while keeping the same performance and energy efficiency.
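
The capacity argument in the abstract (each core sees less L1 space with private caches, so the miss ratio rises) can be illustrated with a toy model. The sketch below is not the paper's architecture or evaluation method: it assumes a direct-mapped cache, cores that all execute the same loop in lockstep, and hypothetical sizes (4 cores, 64 cache lines in total, a 48-line loop). It ignores banking, bank conflicts, the interconnect, timing, and energy.

```python
def count_misses(trace, num_lines):
    """Count misses of a direct-mapped cache with num_lines lines.

    trace is a sequence of cache-line addresses (integers).
    """
    tags = [None] * num_lines
    misses = 0
    for line_addr in trace:
        index = line_addr % num_lines
        if tags[index] != line_addr:
            misses += 1
            tags[index] = line_addr
    return misses


def loop_trace(loop_lines, iterations):
    """Instruction-line addresses fetched by one core running a loop."""
    return [addr for _ in range(iterations) for addr in range(loop_lines)]


def compare_miss_ratios(num_cores=4, total_lines=64, loop_lines=48, iterations=10):
    """Return (private_miss_ratio, shared_miss_ratio) for equal total capacity.

    All parameter values are hypothetical and chosen only for illustration.
    """
    per_core = loop_trace(loop_lines, iterations)
    accesses = num_cores * len(per_core)

    # Private: the total capacity is split evenly, one small cache per core.
    private_lines = total_lines // num_cores
    private_misses = num_cores * count_misses(per_core, private_lines)

    # Shared: one cache with the full capacity; cores fetch the same line
    # one after another (lockstep execution is an assumption of this sketch).
    shared_trace = [addr for addr in per_core for _ in range(num_cores)]
    shared_misses = count_misses(shared_trace, total_lines)

    return private_misses / accesses, shared_misses / accesses


if __name__ == "__main__":
    private_ratio, shared_ratio = compare_miss_ratios()
    print(f"private miss ratio: {private_ratio:.3f}")
    print(f"shared miss ratio:  {shared_ratio:.3f}")
```

With these assumed sizes the 48-line loop does not fit in a 16-line private cache, so every fetch misses, while the 64-line shared cache holds the whole loop and only the first fetch of each line misses. Real workloads, associativity, and bank contention would change the numbers.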

Published In

CF '15: Proceedings of the 12th ACM International Conference on Computing Frontiers
May 2015
413 pages
ISBN:9781450333580
DOI:10.1145/2742854
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. FDSOI
  2. near threshold computing
  3. shared instruction cache

Qualifiers

  • Research-article

Conference

CF'15
CF'15: Computing Frontiers Conference
May 18 - 21, 2015
Ischia, Italy

Acceptance Rates

CF '15 Paper Acceptance Rate 33 of 96 submissions, 34%;
Overall Acceptance Rate 273 of 785 submissions, 35%

Article Metrics

  • Downloads (last 12 months): 7
  • Downloads (last 6 weeks): 2
Download counts reflect downloads up to 16 Oct 2024.
