DOI: 10.1145/2872362.2872363
Research Article · Public Access

Whirlpool: Improving Dynamic Cache Management with Static Data Classification

Published: 25 March 2016

Abstract

Cache hierarchies are increasingly non-uniform and difficult to manage. Several techniques, such as scratchpads or reuse hints, use static information about how programs access data to manage the memory hierarchy. Static techniques are effective on regular programs, but because they set fixed policies, they are vulnerable to changes in program behavior or available cache space. Instead, most systems rely on dynamic caching policies that adapt to observed program behavior. Unfortunately, dynamic policies spend significant resources trying to learn how programs use memory, and yet they often perform worse than a static policy. We present Whirlpool, a novel approach that combines static information with dynamic policies to reap the benefits of each. Whirlpool statically classifies data into pools based on how the program uses memory. Whirlpool then uses dynamic policies to tune the cache to each pool. Hence, rather than setting policies statically, Whirlpool uses static analysis to guide dynamic policies. We present both an API that lets programmers specify pools manually and a profiling tool that discovers pools automatically in unmodified binaries.
We evaluate Whirlpool on a state-of-the-art NUCA cache. Whirlpool significantly outperforms prior approaches: on sequential programs, Whirlpool improves performance by up to 38% and reduces data movement energy by up to 53%; on parallel programs, Whirlpool improves performance by up to 67% and reduces data movement energy by up to 2.6x.


Published In

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
March 2016, 824 pages
ISBN: 9781450340915
DOI: 10.1145/2872362
General Chair: Tom Conte
Program Chair: Yuanyuan Zhou
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. cache modeling
2. data movement
3. non-uniform cache access (NUCA)
4. static analysis


Acceptance Rates

ASPLOS '16: 53 of 232 submissions accepted, 23%
Overall: 535 of 2,713 submissions accepted, 20%
