DOI: 10.1145/1810085.1810095
research-article

The auction: optimizing banks usage in Non-Uniform Cache Architectures

Published: 02 June 2010

Abstract

The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to off-chip memory, because of the significant speed gap between processor and memory and the limited memory bandwidth. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has prevented previously proposed replacement policies from being effective in this kind of cache. As banks operate independently of each other, their replacement decisions are restricted to a single NUCA bank. We propose a novel mechanism based on the bank replacement policy for NUCA caches on CMPs, called The Auction. This mechanism enables the replacement decisions taken in a single bank to be spread to the whole NUCA cache. Thus, global replacement policies that rely on the current state of the NUCA cache, such as evicting the least frequently accessed data in the whole NUCA cache, are now feasible. Moreover, The Auction adapts to current program behaviour in order to relocate a line that is being evicted from a bank in the NUCA cache to the most suitable position in the whole cache. We propose, implement and evaluate three approaches to The Auction mechanism. We also show that The Auction manages the cache efficiently and significantly reduces the requests to the off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8% and reduces the energy consumed by the memory system by 4%.
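
The idea of spreading one bank's replacement decision to the whole cache can be illustrated with a small sketch. The abstract does not specify how banks bid or how a winner is chosen, so everything below is an assumption made for illustration: the `Bank` class, the `auction` function, the use of per-line access counts as bids, and the rule that a free slot always wins are hypothetical and are not taken from the paper. The sketch approximates the "evict the least frequently accessed data in the whole NUCA cache" policy mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Bank:
    """One NUCA bank, modelled as a map from line address to access count."""
    capacity: int
    lines: Dict[int, int] = field(default_factory=dict)

    def bid(self) -> int:
        """Assumed bidding rule: -1 for a free slot, otherwise the access
        count of the bank's least frequently accessed line. A lower bid
        means the bank is a better home for the evicted line."""
        if len(self.lines) < self.capacity:
            return -1
        return min(self.lines.values())

    def accept(self, addr: int, count: int) -> Optional[int]:
        """Store the line, displacing the local least frequently accessed
        line if the bank is full. Returns the displaced address or None."""
        displaced = None
        if len(self.lines) >= self.capacity:
            displaced = min(self.lines, key=self.lines.get)
            del self.lines[displaced]
        self.lines[addr] = count
        return displaced


def auction(banks: List[Bank], owner: int, addr: int) -> Optional[int]:
    """Evict `addr` from bank `owner` and try to relocate it elsewhere.
    Returns the address that finally leaves the cache, or None if the
    evicted line found a free slot."""
    count = banks[owner].lines.pop(addr)
    # Every other bank submits a bid; the lowest bid wins.
    bids = [(b.bid(), i) for i, b in enumerate(banks) if i != owner]
    if not bids:
        return addr
    best_bid, winner = min(bids)
    if best_bid >= count:
        # The evicted line is itself the least valuable: it leaves the cache.
        return addr
    return banks[winner].accept(addr, count)


banks = [
    Bank(2, {0x10: 5, 0x11: 9}),
    Bank(2, {0x20: 1, 0x21: 7}),
    Bank(2, {0x30: 3, 0x31: 4}),
]
leaving = auction(banks, owner=0, addr=0x10)
```

In this example the line at `0x10` (5 accesses) is evicted from bank 0 but stays in the cache, because bank 1 holds a line with only 1 access; that colder line at `0x20` is the one that leaves. A purely local policy would have discarded `0x10` instead.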

    Published In

    ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing
    June 2010
    365 pages
    ISBN:9781450300186
    DOI:10.1145/1810085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bank replacement policy
    2. chip multiprocessors (CMP)
    3. non-uniform cache architecture (NUCA)

    Qualifiers

    • Research-article

    Conference

    ICS'10
    ICS'10: International Conference on Supercomputing
    June 2 - 4, 2010
    Tsukuba, Ibaraki, Japan

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%
