research-article

Scalable Hierarchical Instruction Cache for Ultralow-Power Processors Clusters

Authors:

Giuseppe Tagliavini,

Davide Rossi

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 31, Issue 4

Pages 456 - 469

https://doi.org/10.1109/TVLSI.2022.3228336

Published: 01 April 2023

Abstract

High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultralow-power (ULP) tightly coupled processor clusters where a relatively large cache (L1.5) is shared by L1 private (PR) caches through a two-cycle latency interconnect. To address the performance loss caused by the L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures’ performance and energy efficiency for parallel ULP (PULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.
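As a rough illustration of why filtering next-line prefetches can matter, the toy model below counts how many prefetch requests a private L1 would send toward the shared L1.5 with and without a tag probe. It is only a sketch under stated assumptions: the abstract does not describe how CPF works internally, so the code assumes it means probing the L1 tags for the next line and suppressing the prefetch when that line is already present. The class `TinyL1`, the function `run`, the fully associative LRU organization, and the trace are all hypothetical, and the two-cycle interconnect and the L1.5 itself are not modeled.

```python
from collections import OrderedDict


class TinyL1:
    """Toy private L1: fully associative, LRU, stores line addresses only (hypothetical)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def probe(self, line):
        # Tag lookup only; does not change the LRU order.
        return line in self.lines

    def touch(self, line):
        # Mark a resident line as most recently used.
        self.lines.move_to_end(line)

    def fill(self, line):
        # Insert (or refresh) a line, evicting the least recently used one if full.
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)


def run(trace, num_lines, probe_filter):
    """Replay a trace of line addresses and count requests sent toward the L1.5.

    probe_filter=True models the ASSUMED cache-probe filtering: the next line
    is requested only if a tag probe shows it is not already in the L1.
    """
    l1 = TinyL1(num_lines)
    stats = {"demand_misses": 0, "prefetch_requests": 0}
    for line in trace:
        if l1.probe(line):
            l1.touch(line)
        else:
            stats["demand_misses"] += 1
            l1.fill(line)
        nxt = line + 1  # next-line prefetch candidate
        if probe_filter and l1.probe(nxt):
            continue  # filtered: the line is already resident
        stats["prefetch_requests"] += 1
        l1.fill(nxt)
    return stats


if __name__ == "__main__":
    loop_trace = [0, 1, 2, 3] * 3  # a small loop body executed three times
    print("filtered:  ", run(loop_trace, num_lines=8, probe_filter=True))
    print("unfiltered:", run(loop_trace, num_lines=8, probe_filter=False))
```

In this toy trace the loop fits in the L1, so after the first iteration every next-line candidate is already resident. The filtered variant then stops sending prefetch requests, while the unfiltered one issues a request on every fetch.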

Cited By

M. Rogenmoser, Y. Tortorella, D. Rossi, F. Conti, and L. Benini, “Hybrid Modular Redundancy: Exploring Modular Redundancy Approaches in RISC-V Multi-Core Computing Clusters for Reliable Processing in Space,” ACM Transactions on Cyber-Physical Systems. 10.1145/3635161.
https://dl.acm.org/doi/10.1145/3635161

Published In

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Volume 31, Issue 4

April 2023

200 pages

ISSN: 1063-8210

1063-8210 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Publisher

IEEE Educational Activities Department

United States
