Revisiting LP-NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping

Authors:

Darío Suárez Gracia,

Alexandra Ferrerón,

Luis Montesano Del Campo,

Teresa Monreal Arnal,

Víctor Viñals Yúfera

ACM Transactions on Architecture and Code Optimization (TACO), Volume 11, Issue 2

Article No.: 19, Pages 1 - 26

https://doi.org/10.1145/2632217

Published: 01 June 2014

Abstract

Cache working-set adaptation is key as embedded systems move to multiprocessor and Simultaneous Multithreaded (SMT) architectures, because interthread pollution harms system performance and battery life. Light-Power NUCA (LP-NUCA) is a working-set adaptive cache that depends on temporal locality to save energy. This work identifies two sources of energy waste in LP-NUCAs: parallel access to the tag and data arrays of the tiles, and low-locality phases with useless block migration. To counteract both issues, we prove that switching to serial access reduces energy without harming performance, and we propose a machine-learning Adaptive Drop Rate (ADR) controller that minimizes the amount of replacement and migration when locality is low.
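The energy argument for serial access can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the paper's evaluation methodology: the names `TileEnergy`, `parallel_access_pj`, `serial_access_pj`, and `average_saving` are hypothetical, the per-way energies are placeholder inputs, and the model covers a generic tag/data lookup rather than the actual LP-NUCA tile design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileEnergy:
    """Hypothetical per-way read energies for one tile, in picojoules."""
    tag_read_pj: float
    data_read_pj: float


def parallel_access_pj(e: TileEnergy, ways: int) -> float:
    """Parallel policy: tags and data of every way are read together,
    so data energy is spent whether or not the lookup hits."""
    return ways * (e.tag_read_pj + e.data_read_pj)


def serial_access_pj(e: TileEnergy, ways: int, hit: bool) -> float:
    """Serial policy: read all tags first, then read data only for the
    matching way, and only if there is a hit."""
    return ways * e.tag_read_pj + (e.data_read_pj if hit else 0.0)


def average_saving(e: TileEnergy, ways: int, hit_rate: float) -> float:
    """Fraction of lookup energy saved by serial access at a given hit rate."""
    parallel = parallel_access_pj(e, ways)
    serial = (hit_rate * serial_access_pj(e, ways, True)
              + (1.0 - hit_rate) * serial_access_pj(e, ways, False))
    return 1.0 - serial / parallel


if __name__ == "__main__":
    # Placeholder numbers chosen only to exercise the model.
    energies = TileEnergy(tag_read_pj=1.0, data_read_pj=4.0)
    print(average_saving(energies, ways=2, hit_rate=0.5))
```

The model only accounts for energy. Serial access also lengthens the lookup, which is why the abstract's claim that performance is not harmed is a separate result. Setting `ways=1` covers a direct-mapped lookup, where the saving comes entirely from skipping the data read on misses.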

This work demonstrates that these techniques efficiently adapt the cache drop and access policies to save energy. They reduce LP-NUCA energy consumption by 22.7% for 1SMT. With interthread cache contention in 2SMT, the savings rise to 29%. Versus a conventional organization, energy-delay improves by 20.8% and 25% for 1SMT and 2SMT benchmarks, respectively, and in 65% of the 2SMT mixes the gains are larger than 20%.
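The abstract does not describe how the ADR controller works internally. As a rough illustration of what adapting a drop rate could look like, the following sketch uses a plain hill-climbing loop. This is an assumption chosen for simplicity and is not necessarily the authors' algorithm. The class `HillClimbingDropRate`, the per-epoch `cost` signal, and the step size are all hypothetical. The drop rate is taken here to mean the fraction of evicted blocks that are discarded rather than migrated.

```python
class HillClimbingDropRate:
    """Hypothetical controller: nudges a drop rate in [0, 1] each epoch and
    reverses direction whenever the observed cost gets worse."""

    def __init__(self, rate: float = 0.0, step: float = 0.125) -> None:
        self.rate = rate          # current drop rate (assumed meaning: see lead-in)
        self.step = step          # how far the rate moves per epoch
        self.direction = 1        # +1 raises the drop rate, -1 lowers it
        self.prev_cost = None     # cost seen in the previous epoch, if any

    def update(self, cost: float) -> float:
        """Feed the cost measured over the last epoch (for example an
        energy estimate) and return the drop rate for the next epoch."""
        if self.prev_cost is not None and cost > self.prev_cost:
            # The last move made things worse, so back off.
            self.direction = -self.direction
        self.prev_cost = cost
        moved = self.rate + self.direction * self.step
        self.rate = min(1.0, max(0.0, moved))  # clamp to a valid rate
        return self.rate


if __name__ == "__main__":
    controller = HillClimbingDropRate(rate=0.5, step=0.25)
    for epoch_cost in (10.0, 8.0, 9.0):  # placeholder cost samples
        print(controller.update(epoch_cost))
```

A controller of this shape keeps raising the drop rate while the cost keeps falling, which matches the abstract's intent of limiting replacement and migration when locality is low, and lowers it again once dropping starts to hurt.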

Cited By

J. Díaz Álvarez, J. Risco-Martín, and J. Colmenar. 2016. Multi-objective optimization of energy consumption and execution time in a single level cache memory for embedded systems. Journal of Systems and Software 111, C (2016), 200-212. Online publication date: 1-Jan-2016.
https://dl.acm.org/doi/10.1016/j.jss.2015.10.012

Index Terms

Revisiting LP-NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published In

ACM Transactions on Architecture and Code Optimization Volume 11, Issue 2

June 2014

210 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2639036

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2014

Accepted: 01 March 2014

Revised: 01 February 2014

Received: 01 May 2013

Qualifiers

Research-article
Research
Refereed

Funding Sources

European Social Fund
(Spanish Gov. and European ERDF)
Consolider CSD2007-00050 (Spanish Gov.)
gaZ: T48 research group (Aragón Gov. and European ESF)
