research-article
Open access

Revisiting LP-NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping

Published: 01 June 2014

Abstract

Cache working-set adaptation is key as embedded systems move to multiprocessor and simultaneous multithreaded (SMT) architectures, because interthread pollution harms system performance and battery life. Light-Power NUCA (LP-NUCA) is a working-set adaptive cache that depends on temporal locality to save energy. This work identifies two sources of energy waste in LP-NUCAs: parallel access to the tag and data arrays of the tiles, and low-locality phases with useless block migration. To counteract both issues, we prove that switching to serial access reduces energy without harming performance, and we propose a machine-learning Adaptive Drop Rate (ADR) controller that minimizes the amount of replacement and migration when locality is low.
This work demonstrates that these techniques efficiently adapt the cache drop and access policies to save energy. They reduce LP-NUCA energy consumption by 22.7% for 1SMT. With interthread cache contention in 2SMT, the savings rise to 29%. Versus a conventional organization, energy-delay improves by 20.8% and 25% for 1SMT and 2SMT benchmarks, respectively, and in 65% of the 2SMT mixes the gains are larger than 20%.
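
The difference between parallel and serial access to the tag and data arrays can be illustrated with a toy energy model. The sketch below is not taken from the article: the function names, the 2-way associativity, the per-array energies, and the hit rate are all hypothetical values chosen only to show why reading the data array after the tag check saves energy, especially when most lookups in a tile miss. It ignores latency entirely.

```python
def access_energy(ways, e_tag, e_data, hit, serial):
    """Energy of one lookup in a `ways`-way set-associative tile (toy model).

    Parallel access: every tag way and every data way is read at once.
    Serial access: every tag way is read first; a single data way is read
    afterwards, and only if the tag comparison reports a hit.
    """
    tag_energy = ways * e_tag
    if serial:
        data_energy = e_data if hit else 0.0
    else:
        data_energy = ways * e_data
    return tag_energy + data_energy


def average_energy(ways, e_tag, e_data, hit_rate, serial):
    """Expected energy per lookup for a given hit rate."""
    on_hit = access_energy(ways, e_tag, e_data, True, serial)
    on_miss = access_energy(ways, e_tag, e_data, False, serial)
    return hit_rate * on_hit + (1.0 - hit_rate) * on_miss


# Hypothetical parameters (arbitrary energy units), not measured values.
WAYS, E_TAG, E_DATA, HIT_RATE = 2, 1.0, 4.0, 0.25

parallel = average_energy(WAYS, E_TAG, E_DATA, HIT_RATE, serial=False)
serial = average_energy(WAYS, E_TAG, E_DATA, HIT_RATE, serial=True)
print(f"parallel: {parallel}, serial: {serial}")
```

Under these assumed numbers the serial policy spends data-array energy only on hits, so the gap widens as the hit rate of a tile drops. Whether that saving costs performance depends on latency, which this sketch does not model.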


    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 2
    June 2014
    210 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2639036
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 2014
    Accepted: 01 March 2014
    Revised: 01 February 2014
    Received: 01 May 2013
    Published in TACO Volume 11, Issue 2

    Author Tags

    1. NUCA
    2. embedded processors
    3. energy
    4. hill climbing
    5. locality of reference

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • European Social Fund
    • (Spanish Gov. and European ERDF)
    • Consolider CSD2007-00050 (Spanish Gov.)
    • gaZ: T48 research group (Aragón Gov. and European ESF)
