To hardware prefetch or not to prefetch?: a virtualized environment study and core binding approach

Authors:

Jennifer L. Wong

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems

Pages 357 - 368

https://doi.org/10.1145/2451116.2451155

Published: 16 March 2013

Abstract

Most hardware and software vendors suggest disabling hardware prefetching in virtualized environments. They claim that prefetching is detrimental to application performance due to inaccurate prediction caused by workload diversity and VM interference in the shared cache. However, no comprehensive or quantitative measurements have been performed to support this belief.

This paper is the first to systematically measure the influence of hardware prefetching in virtualized environments. We examine a wide variety of benchmarks on three types of chip-multiprocessors (CMPs) to analyze the hardware prefetching performance. We conduct extensive experiments by taking into account a number of important virtualization factors. We find that hardware prefetching has minimal destructive influence under most configurations. Only with certain application combinations does prefetching influence the overall performance.

To leverage these findings and make hardware prefetching effective across a diversity of virtualized environments, we propose a dynamic prefetching-aware VCPU-core binding approach (PAVCB), which includes two phases: classifying and binding. The workload of each VM is classified into one of several cache sharing constraint categories based upon its cache access characteristics, considering both prefetch requests and demand requests. Then, following heuristic rules, the VCPUs of each VM are scheduled onto appropriate cores subject to the cache sharing constraints. We show that the proposed approach can improve performance by 12% on average over the default scheduler and 46% over manual system administrator bindings across different workload combinations in the presence of hardware prefetching.
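The two phases described above (classifying, then binding) can be illustrated with a minimal sketch. The abstract does not give the actual categories, metrics, thresholds, or heuristic rules, so everything named here is an assumption made for illustration: the two categories in `Constraint`, the "requests per thousand instructions" metric, the `threshold` value, and the greedy rule in `bind` are all hypothetical and should not be read as the paper's PAVCB algorithm.

```python
from dataclasses import dataclass
from enum import Enum


class Constraint(Enum):
    """Hypothetical cache sharing constraint categories (not taken from the paper)."""
    AVOID_SHARING = "avoid_sharing"  # heavy cache traffic: prefers few heavy co-runners
    CAN_SHARE = "can_share"          # light cache traffic: tolerates co-runners


@dataclass
class VMStats:
    name: str
    vcpus: int
    demand_per_kinst: float    # assumed metric: demand requests per 1000 instructions
    prefetch_per_kinst: float  # assumed metric: prefetch requests per 1000 instructions


def classify(vm, threshold=20.0):
    """Phase 1: classify a VM from both demand and prefetch traffic (assumed rule)."""
    pressure = vm.demand_per_kinst + vm.prefetch_per_kinst
    return Constraint.AVOID_SHARING if pressure >= threshold else Constraint.CAN_SHARE


def bind(vms, cache_domains, threshold=20.0):
    """Phase 2: greedily map each VCPU to a core.

    cache_domains is a list of core-id lists; cores in one list share a cache.
    Assumed heuristic: place heavy VMs first, and always pick the domain that
    currently hosts the fewest heavy VCPUs (ties go to the domain with more
    free cores, then to the lower domain index).
    """
    free = [list(cores) for cores in cache_domains]
    heavy_load = [0] * len(cache_domains)
    placement = {}
    # sorted() is stable, so VMs keep their input order within each category.
    ordered = sorted(
        vms, key=lambda vm: classify(vm, threshold) is not Constraint.AVOID_SHARING
    )
    for vm in ordered:
        heavy = classify(vm, threshold) is Constraint.AVOID_SHARING
        for vcpu in range(vm.vcpus):
            candidates = [d for d in range(len(free)) if free[d]]
            if not candidates:
                raise ValueError("more VCPUs than physical cores")
            d = min(candidates, key=lambda i: (heavy_load[i], -len(free[i])))
            placement[(vm.name, vcpu)] = free[d].pop(0)
            heavy_load[d] += int(heavy)
    return placement


if __name__ == "__main__":
    vms = [
        VMStats("A", 1, 30.0, 10.0),
        VMStats("B", 1, 15.0, 10.0),
        VMStats("C", 1, 2.0, 1.0),
        VMStats("D", 1, 1.0, 0.0),
    ]
    # Two cache domains: cores 0-1 share one cache, cores 2-3 share another.
    print(bind(vms, [[0, 1], [2, 3]]))
```

In this toy run the two cache-intensive VMs (A and B) land in different cache domains and the light VMs fill the remaining cores. A real deployment would feed the classifier from hardware performance counters and apply the placement through the hypervisor's VCPU pinning mechanism; both steps are outside this sketch.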

Index Terms

To hardware prefetch or not to prefetch?: a virtualized environment study and core binding approach
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Scheduling

Published In

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems

March 2013

574 pages

ISBN:9781450318709

DOI:10.1145/2451116

General Chair: Vivek Sarkar, Rice University, USA
Program Chair: Rastislav Bodik, University of California, Berkeley, USA

ACM SIGARCH Computer Architecture News Volume 41, Issue 1
ASPLOS '13
March 2013
540 pages
ISSN:0163-5964
DOI:10.1145/2490301
ACM SIGPLAN Notices Volume 48, Issue 4
ASPLOS '13
April 2013
540 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2499368

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASPLOS '13: Architectural Support for Programming Languages and Operating Systems

March 16 - 20, 2013

Houston, Texas, USA

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Article Metrics

Total Citations: 14
Total Downloads: 1,132
Downloads (Last 12 months): 16
Downloads (Last 6 weeks): 2

Reflects downloads up to 06 Jan 2025

Cited By

Alcorta E, Madhav M, Afoakwa R, Tetrick S, Yadwadkar N, Gerstlauer A (2024). Characterizing Machine Learning-Based Runtime Prefetcher Selection. IEEE Computer Architecture Letters 23:2 (146-149). DOI: 10.1109/LCA.2024.3404887. Online publication date: Jul-2024
https://doi.org/10.1109/LCA.2024.3404887
Hiebel J, Brown L, Wang Z (2019). Machine Learning for Fine-Grained Hardware Prefetcher Control. Proceedings of the 48th International Conference on Parallel Processing (1-9). DOI: 10.1145/3337821.3337854. Online publication date: 5-Aug-2019
https://dl.acm.org/doi/10.1145/3337821.3337854
Farshin A, Roozbeh A, Maguire G, Kostić D (2019). Make the Most out of Last Level Cache in Intel Processors. Proceedings of the Fourteenth EuroSys Conference 2019 (1-17). DOI: 10.1145/3302424.3303977. Online publication date: 25-Mar-2019
https://dl.acm.org/doi/10.1145/3302424.3303977
Bakhshalipour M, Shakerinava M, Lotfi-Kamran P, Sarbazi-Azad H (2019). Bingo Spatial Data Prefetcher. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (399-411). DOI: 10.1109/HPCA.2019.00053. Online publication date: Feb-2019
https://doi.org/10.1109/HPCA.2019.00053
Guo Y, Rao J, Jiang C, Zhou X (2017). Moving Hadoop into the Cloud with Flexible Slot Management and Speculative Execution. IEEE Transactions on Parallel and Distributed Systems 28:3 (798-812). DOI: 10.1109/TPDS.2016.2587641. Online publication date: 1-Mar-2017
https://dl.acm.org/doi/10.1109/TPDS.2016.2587641
Mittal S (2016). A Survey of Recent Prefetching Techniques for Processor Caches. ACM Computing Surveys 49:2 (1-35). DOI: 10.1145/2907071. Online publication date: 2-Aug-2016
https://dl.acm.org/doi/10.1145/2907071
Lee C, Strazdins P (2015). How Small Can it Be? Proceedings of the 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conf on Embedded Software and Systems (455-462). DOI: 10.1109/HPCC-CSS-ICESS.2015.302. Online publication date: 24-Aug-2015
https://dl.acm.org/doi/10.1109/HPCC-CSS-ICESS.2015.302
Rahman S, Burtscher M, Zong Z, Qasem A (2015). Maximizing Hardware Prefetch Effectiveness with Machine Learning. Proceedings of the 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conf on Embedded Software and Systems (383-389). DOI: 10.1109/HPCC-CSS-ICESS.2015.175. Online publication date: 24-Aug-2015
https://dl.acm.org/doi/10.1109/HPCC-CSS-ICESS.2015.175
Guo Y, Rao J, Jiang C, Zhou X, Damkroger T, Dongarra J (2014). FlexSlot. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (959-969). DOI: 10.1109/SC.2014.83. Online publication date: 16-Nov-2014
https://dl.acm.org/doi/10.1109/SC.2014.83
Jung S, Lee H, Jo H (2023). CluMP: Clustered Markov Chain for Storage I/O Prefetch. Electronics 12:15 (3293). DOI: 10.3390/electronics12153293. Online publication date: 31-Jul-2023
https://doi.org/10.3390/electronics12153293
