BATMAN: techniques for maximizing system bandwidth of memory systems with stacked-DRAM

Published: 02 October 2017

Abstract

Tiered-memory systems consist of high-bandwidth 3D-DRAM and high-capacity commodity-DRAM. Conventional designs attempt to improve system performance by maximizing the number of memory accesses serviced by 3D-DRAM. However, when the commodity-DRAM bandwidth is a significant fraction of overall system bandwidth, these techniques use the total bandwidth offered by the tiered-memory system inefficiently and yield sub-optimal performance. In such situations, performance can be improved by distributing memory accesses in proportion to the bandwidth of each memory. Ideally, we want a simple and effective runtime mechanism that achieves the desired access distribution without requiring significant hardware or software support.
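As a rough illustration of what "proportional to the bandwidth of each memory" could mean, the target share of accesses for each memory can be written as its share of the aggregate bandwidth. The symbols $B_{3D}$, $B_{C}$, and $f_{3D}$ are introduced here for illustration only, and the 4:1 bandwidth ratio is an assumed example; the abstract states no bandwidth figures.

```latex
% Assumed notation: B_{3D} = 3D-DRAM bandwidth, B_{C} = commodity-DRAM bandwidth,
% f_{3D} = target fraction of memory accesses serviced by 3D-DRAM.
\[
  f_{3D} = \frac{B_{3D}}{B_{3D} + B_{C}}, \qquad
  f_{C} = 1 - f_{3D} = \frac{B_{C}}{B_{3D} + B_{C}}
\]
% Hypothetical example with B_{3D} = 4 B_{C}:
\[
  f_{3D} = \frac{4 B_{C}}{4 B_{C} + B_{C}} = 0.8, \qquad f_{C} = 0.2
\]
```

Under the same assumed ratio, servicing every access from 3D-DRAM would use at most $B_{3D}$ out of the available $B_{3D} + B_{C}$, that is, 80% of the aggregate bandwidth.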
This paper proposes Bandwidth-Aware Tiered-Memory Management (BATMAN), a runtime mechanism that manages the distribution of memory accesses in a tiered-memory system by explicitly controlling data movement. BATMAN monitors the number of accesses to both memories, and when the number of 3D-DRAM accesses exceeds the desired threshold, BATMAN disallows data movement from the commodity-DRAM to 3D-DRAM and proactively moves data from 3D-DRAM to commodity-DRAM. We demonstrate BATMAN on systems that architect the 3D-DRAM as either a hardware-managed cache (cache mode) or a part of the OS-visible memory space (flat mode). Our evaluations on a system with 4GB 3D-DRAM and 32GB commodity-DRAM show that BATMAN improves performance by an average of 11% and 10% and energy-delay product by 13% and 11% for systems in the cache and flat modes, respectively. BATMAN incurs only an eight-byte hardware overhead and requires negligible software modification.
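The monitoring-and-threshold behavior described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `target_fraction`, `AccessMonitor`, and the `"restrict"`/`"allow"` policy labels are hypothetical, the threshold is assumed to be the bandwidth-proportional share, and the 4:1 bandwidth ratio in the usage lines is an example value only.

```python
from dataclasses import dataclass


def target_fraction(bw_3d: float, bw_commodity: float) -> float:
    """Assumed threshold: the 3D-DRAM share of the aggregate bandwidth."""
    return bw_3d / (bw_3d + bw_commodity)


@dataclass
class AccessMonitor:
    """Hypothetical model of two access counters plus a policy decision."""
    target: float                 # desired fraction of accesses to 3D-DRAM
    accesses_3d: int = 0          # accesses serviced by 3D-DRAM
    accesses_commodity: int = 0   # accesses serviced by commodity-DRAM

    def record(self, to_3d: bool) -> None:
        """Count one memory access against the memory that serviced it."""
        if to_3d:
            self.accesses_3d += 1
        else:
            self.accesses_commodity += 1

    def fraction_3d(self) -> float:
        """Observed fraction of accesses serviced by 3D-DRAM."""
        total = self.accesses_3d + self.accesses_commodity
        return self.accesses_3d / total if total else 0.0

    def policy(self) -> str:
        """Pick a data-movement policy from the observed access split.

        "restrict": 3D-DRAM is over the threshold, so block movement from
                    commodity-DRAM to 3D-DRAM and move data the other way.
        "allow":    at or under the threshold, so movement into 3D-DRAM
                    proceeds as usual.
        """
        return "restrict" if self.fraction_3d() > self.target else "allow"


# Usage with an assumed 4:1 bandwidth ratio (threshold of 0.8).
monitor = AccessMonitor(target=target_fraction(4.0, 1.0))
for _ in range(9):
    monitor.record(to_3d=True)
monitor.record(to_3d=False)
print(monitor.policy())  # 9 of 10 accesses hit 3D-DRAM, above 0.8
```

The sketch keeps only two counters and one comparison, which is in the spirit of the small hardware footprint the abstract reports, though how the actual design stores and resets its counters is not specified here.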

Published In

MEMSYS '17: Proceedings of the International Symposium on Memory Systems
October 2017
409 pages
ISBN: 9781450353359
DOI: 10.1145/3132402
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

MEMSYS 2017
