DOI: 10.1145/2872362.2872363
Research Article · Public Access

Whirlpool: Improving Dynamic Cache Management with Static Data Classification

Published: 25 March 2016

Abstract

Cache hierarchies are increasingly non-uniform and difficult to manage. Several techniques, such as scratchpads or reuse hints, use static information about how programs access data to manage the memory hierarchy. Static techniques are effective on regular programs, but because they set fixed policies, they are vulnerable to changes in program behavior or available cache space. Instead, most systems rely on dynamic caching policies that adapt to observed program behavior. Unfortunately, dynamic policies spend significant resources trying to learn how programs use memory, and yet they often perform worse than a static policy. We present Whirlpool, a novel approach that combines static information with dynamic policies to reap the benefits of each. Whirlpool statically classifies data into pools based on how the program uses memory. Whirlpool then uses dynamic policies to tune the cache to each pool. Hence, rather than setting policies statically, Whirlpool uses static analysis to guide dynamic policies. We present both an API that lets programmers specify pools manually and a profiling tool that discovers pools automatically in unmodified binaries.
We evaluate Whirlpool on a state-of-the-art NUCA cache. Whirlpool significantly outperforms prior approaches: on sequential programs, Whirlpool improves performance by up to 38% and reduces data movement energy by up to 53%; on parallel programs, Whirlpool improves performance by up to 67% and reduces data movement energy by up to 2.6x.


Published In

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
March 2016, 824 pages
ISBN: 9781450340915
DOI: 10.1145/2872362
General Chair: Tom Conte
Program Chair: Yuanyuan Zhou
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. cache modeling
2. data movement
3. non-uniform cache access (NUCA)
4. static analysis


Acceptance Rates

ASPLOS '16: 53 of 232 submissions accepted, 23%
Overall: 535 of 2,713 submissions accepted, 20%
