research-article

The auction: optimizing banks usage in Non-Uniform Cache Architectures

Authors:

Antonio González

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

Pages 37 - 47

https://doi.org/10.1145/1810085.1810095

Published: 02 June 2010

Abstract

The growing influence of wire delay in cache design means that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to off-chip memory, because of the significant speed gap between processor and memory and the limited memory bandwidth. A bank replacement policy that manages the NUCA cache efficiently is therefore desirable. However, the decentralized nature of NUCA has prevented previously proposed replacement policies from being effective in this kind of cache: because banks operate independently of each other, their replacement decisions are restricted to a single NUCA bank. We propose The Auction, a novel mechanism based on the bank replacement policy for NUCA caches on CMPs. This mechanism enables the replacement decisions taken in a single bank to be spread to the whole NUCA cache. Thus, global replacement policies that rely on the current state of the NUCA cache, such as evicting the least frequently accessed data in the whole NUCA cache, become feasible. Moreover, The Auction adapts to current program behaviour in order to relocate a line that is being evicted from a bank to the most suitable position in the whole cache. We propose, implement and evaluate three approaches to The Auction mechanism. We also show that The Auction manages the cache efficiently and significantly reduces requests to off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8% and a 4% reduction in the energy consumed by the memory system.
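As a rough illustration of the idea in the abstract, the sketch below models a line evicted from one bank being offered to the other banks, which "bid" for it. The abstract does not specify the bidding rule, so everything here is an assumption made for illustration: the `Bank` class, the `bid` and `auction` functions, the use of per-line access counts as the bid, and the treatment of a free slot as the best possible bid. It approximates the "evict the least frequently accessed data in the whole NUCA cache" example and is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Bank:
    """Hypothetical NUCA bank: maps line address -> access count."""
    capacity: int
    lines: Dict[int, int] = field(default_factory=dict)

    def bid(self) -> Tuple[int, Optional[int]]:
        """Return (score, victim); a lower score means a stronger bid.

        Assumed rule: a bank with a free slot bids -1 and needs no victim;
        a full bank bids the access count of its least accessed line.
        """
        if len(self.lines) < self.capacity:
            return -1, None
        victim = min(self.lines, key=self.lines.get)
        return self.lines[victim], victim


def auction(banks: List[Bank], source: int, addr: int, count: int) -> Optional[int]:
    """Try to relocate a line evicted from banks[source].

    Returns the address that leaves the cache, or None if the evicted
    line was placed in a free slot and nothing went off-chip.
    """
    best_idx, best_score, best_victim = None, count, None
    for i, bank in enumerate(banks):
        if i == source:
            continue  # the evicting bank does not bid on its own line
        score, victim = bank.bid()
        if score < best_score:
            best_idx, best_score, best_victim = i, score, victim
    if best_idx is None:
        return addr  # no bank holds colder data, so the line itself leaves
    winner = banks[best_idx]
    if best_victim is not None:
        del winner.lines[best_victim]  # the colder line is evicted instead
    winner.lines[addr] = count
    return best_victim
```

In this toy model the replacement decision made in one bank is resolved against the state of every other bank, which is the property the abstract attributes to The Auction; latency, coherence and the three evaluated approaches are deliberately left out.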


Index Terms

The auction: optimizing banks usage in Non-Uniform Cache Architectures
1. Computer systems organization

Published In

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing

June 2010

365 pages

ISBN:9781450300186

DOI:10.1145/1810085

General Chair: Taisuke Boku (University of Tsukuba)

Program Chairs: Hiroshi Nakashima (Kyoto University), Avi Mendelson (Microsoft)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICS'10

Sponsor:

SIGARCH

ICS'10: International Conference on Supercomputing

June 2 - 4, 2010

Tsukuba, Ibaraki, Japan

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Citations

Cited By

Zhao X, Adileh A, Yu Z, Wang Z, Jaleel A, Eeckhout L, Manne S, Hunter H, Altman E (2019). Adaptive memory-side last-level GPU caching. Proceedings of the 46th International Symposium on Computer Architecture, pp. 411-423. DOI: 10.1145/3307650.3322235. Online publication date: 22-Jun-2019.
https://dl.acm.org/doi/10.1145/3307650.3322235
Qiu K, Ni Y, Zhang W, Wang J, Wu X, Xue C, Li T (2016). An adaptive Non-Uniform Loop Tiling for DMA-based bulk data transfers on many-core processor. 2016 IEEE 34th International Conference on Computer Design (ICCD), pp. 9-16. DOI: 10.1109/ICCD.2016.7753255. Online publication date: Oct-2016.
https://doi.org/10.1109/ICCD.2016.7753255
Bathen L, Dutt N (2014). SPMCloud. ACM Transactions on Design Automation of Electronic Systems, 19:3, pp. 1-45. DOI: 10.1145/2611755. Online publication date: 23-Jun-2014.
https://dl.acm.org/doi/10.1145/2611755
Gracia D, Dimitrakopoulos G, Arnal T, Katevenis M, Yúfera V (2012). LP-NUCA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20:8, pp. 1510-1523. DOI: 10.1109/TVLSI.2011.2158249. Online publication date: 1-Aug-2012.
https://dl.acm.org/doi/10.1109/TVLSI.2011.2158249
Herrero E, Gonzalez J, Canal R (2012). Distributed Cooperative Caching. IEEE Transactions on Parallel and Distributed Systems, 23:5, pp. 853-861. DOI: 10.1109/TPDS.2011.200. Online publication date: 1-May-2012.
https://dl.acm.org/doi/10.1109/TPDS.2011.200
Lira J, Molina C, González A (2011). HK-NUCA. Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, pp. 419-430. DOI: 10.1109/IPDPS.2011.48. Online publication date: 16-May-2011.
https://dl.acm.org/doi/10.1109/IPDPS.2011.48
Bardine A, Foglia P, Panicucci F, Solinas M, Sahuquillo J (2011). Energy Behaviour of NUCA Caches in CMPs. Proceedings of the 2011 14th Euromicro Conference on Digital System Design, pp. 746-753. DOI: 10.1109/DSD.2011.99. Online publication date: 31-Aug-2011.
https://dl.acm.org/doi/10.1109/DSD.2011.99
