BATMAN: techniques for maximizing system bandwidth of memory systems with stacked-DRAM

Authors:

Moinuddin Qureshi

MEMSYS '17: Proceedings of the International Symposium on Memory Systems

Pages 268 - 280

https://doi.org/10.1145/3132402.3132404

Published: 02 October 2017

Abstract

Tiered-memory systems consist of high-bandwidth 3D-DRAM and high-capacity commodity-DRAM. Conventional designs attempt to improve system performance by maximizing the number of memory accesses serviced by 3D-DRAM. However, when the commodity-DRAM bandwidth is a significant fraction of the overall system bandwidth, these techniques use the total bandwidth of the tiered-memory system inefficiently and yield sub-optimal performance. In such situations, performance can be improved by distributing memory accesses in proportion to the bandwidth of each memory. Ideally, we want a simple and effective runtime mechanism that achieves the desired access distribution without requiring significant hardware or software support.
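To make the proportional split concrete, the following minimal sketch uses a simplified bandwidth-bound model. The model is an assumption of this example, not the paper's evaluation setup: both memories serve requests in parallel, and the total time is set by whichever memory finishes last. The function names and the 4:1 bandwidth ratio used in the note below are hypothetical.

```python
def target_access_fraction(bw_fast: float, bw_slow: float) -> float:
    """Share of accesses the fast memory should serve when accesses
    are split in proportion to each memory's bandwidth."""
    if bw_fast <= 0 or bw_slow <= 0:
        raise ValueError("bandwidths must be positive")
    return bw_fast / (bw_fast + bw_slow)


def service_time(accesses: float, frac_fast: float,
                 bw_fast: float, bw_slow: float) -> float:
    """Time to serve `accesses` units of traffic when both memories
    work in parallel (assumed model: the slower-finishing side
    determines the total time)."""
    if not 0.0 <= frac_fast <= 1.0:
        raise ValueError("frac_fast must be in [0, 1]")
    time_fast = accesses * frac_fast / bw_fast
    time_slow = accesses * (1.0 - frac_fast) / bw_slow
    return max(time_fast, time_slow)
```

Under this model, with an assumed fast-to-slow bandwidth ratio of 4:1, sending all 100 units of traffic to the fast memory takes 25 time units, while the proportional 80/20 split takes about 20.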

This paper proposes Bandwidth-Aware Tiered-Memory Management (BATMAN), a runtime mechanism that manages the distribution of memory accesses in a tiered-memory system by explicitly controlling data movement. BATMAN monitors the number of accesses to both memories, and when the number of 3D-DRAM accesses exceeds the desired threshold, BATMAN disallows data movement from the commodity-DRAM to 3D-DRAM and proactively moves data from 3D-DRAM to commodity-DRAM. We demonstrate BATMAN on systems that architect the 3D-DRAM as either a hardware-managed cache (cache mode) or a part of the OS-visible memory space (flat mode). Our evaluations on a system with 4GB 3D-DRAM and 32GB commodity-DRAM show that BATMAN improves performance by an average of 11% and 10% and energy-delay product by 13% and 11% for systems in the cache and flat modes, respectively. BATMAN incurs only an eight-byte hardware overhead and requires negligible software modification.
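The control decision described above can be sketched as follows. This is an illustrative software model only: the class name `AccessMonitor`, its methods, and the `target_fraction` parameter are hypothetical, and the abstract does not specify the actual hardware counters or threshold policy.

```python
from dataclasses import dataclass


@dataclass
class AccessMonitor:
    # Desired share of accesses served by 3D-DRAM (assumed parameter).
    target_fraction: float
    fast_accesses: int = 0   # accesses served by 3D-DRAM
    slow_accesses: int = 0   # accesses served by commodity-DRAM

    def record(self, served_by_fast: bool) -> None:
        """Count one memory access against the memory that served it."""
        if served_by_fast:
            self.fast_accesses += 1
        else:
            self.slow_accesses += 1

    def fast_fraction(self) -> float:
        """Observed share of accesses served by 3D-DRAM."""
        total = self.fast_accesses + self.slow_accesses
        return self.fast_accesses / total if total else 0.0

    def over_target(self) -> bool:
        return self.fast_fraction() > self.target_fraction

    def allow_move_to_fast(self) -> bool:
        """Movement from commodity-DRAM to 3D-DRAM is disallowed
        while 3D-DRAM serves more than its target share."""
        return not self.over_target()

    def should_move_out_of_fast(self) -> bool:
        """Data is proactively moved from 3D-DRAM to commodity-DRAM
        while 3D-DRAM serves more than its target share."""
        return self.over_target()
```

In this sketch the same comparison drives both decisions, so the two actions switch on and off together as the observed share crosses the target.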

Published In

MEMSYS '17: Proceedings of the International Symposium on Memory Systems

October 2017

409 pages

ISBN: 9781450353359

DOI: 10.1145/3132402

General Chair:
Bruce Jacob
University of Maryland

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

MEMSYS 2017

MEMSYS 2017: The International Symposium on Memory Systems, 2017

October 2 - 5, 2017

Alexandria, Virginia
