research-article

Scalable Hierarchical Instruction Cache for Ultralow-Power Processors Clusters

Published: 01 April 2023

Abstract

High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultralow-power (ULP) tightly coupled processor clusters where a relatively large cache (L1.5) is shared by L1 private (PR) caches through a two-cycle latency interconnect. To address the performance loss caused by the L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures’ performance and energy efficiency for parallel ULP (PULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.
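The next-line prefetcher with cache probe filtering mentioned in the abstract can be illustrated with a small behavioral sketch. It is not the authors' implementation: the fully associative LRU organization, the 16-byte line size, the immediate prefetch fill, and every name (`L1Cache`, `run`, the statistics keys) are assumptions made for illustration, and the L1.5 latency is not modeled. On each fetch the sketch probes the private L1 for the following line and sends a prefetch request toward the L1.5 only if that line is absent, so redundant requests are filtered out.

```python
from collections import OrderedDict


class L1Cache:
    """Hypothetical fully associative LRU cache holding line numbers only."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def probe(self, line):
        # Tag lookup only; LRU state is left untouched.
        return line in self.lines

    def access(self, line):
        # Demand access: returns True on a hit and refreshes LRU order.
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        return False

    def fill(self, line):
        # Install a line, evicting the least recently used one if full.
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)
        self.lines[line] = True


def run(trace, num_lines, line_bytes=16, prefetch=True):
    """Replay a list of fetch addresses and count misses and prefetches."""
    l1 = L1Cache(num_lines)
    stats = {"demand_misses": 0, "prefetches_issued": 0, "prefetches_filtered": 0}
    for addr in trace:
        line = addr // line_bytes
        if not l1.access(line):
            stats["demand_misses"] += 1
            l1.fill(line)  # refill from the shared L1.5 (latency not modeled)
        if prefetch:
            nxt = line + 1
            if l1.probe(nxt):
                # Probe hit: the request to the L1.5 is filtered out.
                stats["prefetches_filtered"] += 1
            else:
                stats["prefetches_issued"] += 1
                l1.fill(nxt)  # assumed to arrive before the line is needed
    return stats


# Straight-line code: 64 four-byte instructions spanning 16 cache lines.
trace = list(range(0, 256, 4))
with_pf = run(trace, num_lines=4)
without_pf = run(trace, num_lines=4, prefetch=False)
```

In this idealized trace the prefetcher hides every demand miss except the first, while the probe keeps it from issuing one L1.5 request per fetch. Real code with branches and a nonzero L1.5 latency would behave less favorably.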

Cited By

  • Hybrid Modular Redundancy: Exploring Modular Redundancy Approaches in RISC-V Multi-Core Computing Clusters for Reliable Processing in Space. ACM Transactions on Cyber-Physical Systems. 10.1145/3635161

Published In

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 31, Issue 4
April 2023
200 pages

Publisher

IEEE Educational Activities Department

United States
