DOI: 10.1145/3030207.3030223
research-article

Detecting Memory-Boundedness with Hardware Performance Counters

Published: 17 April 2017

Abstract

Modern processors incorporate several performance monitoring units, which can be used to count events that occur within different components of the processor. They provide access to information on hardware resource usage and can therefore be used to detect performance bottlenecks. Many performance measurement tools can thus record these events alongside information about application behavior. However, the exact meaning of the supported hardware events is often unclear because of system complexity and documentation that is incomplete or even inaccurate. For most events, it is also undocumented whether a given rate indicates that a resource is saturated. It is therefore usually difficult to draw conclusions about the performance impact from the observed event rates. In this paper, we evaluate whether hardware performance counters can be used to measure the capacity utilization within the memory hierarchy and to estimate the impact of memory accesses on the achieved performance. The presented approach is based on a small selection of micro-benchmarks that constantly stress individual components of the memory subsystem, ranging from caches to main memory. These workloads are used to identify hardware performance counters that provide good estimates of the utilization of individual components in the memory hierarchy. However, since access latencies can be overlapped with computing instructions, a high utilization of the memory hierarchy does not necessarily result in low performance. We therefore also investigate which stall counters provide good estimates of the number of cycles that are actually spent waiting for the memory hierarchy.
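
The distinction the abstract draws between utilization and stall cycles can be illustrated with a small calculation. The following Python sketch is not taken from the paper: the `CounterSample` fields, the 64-byte cache line size, and both thresholds are assumptions chosen for illustration, and the peak bandwidth is assumed to come from a micro-benchmark that saturates the component in question.

```python
from dataclasses import dataclass

CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines, common on x86


@dataclass
class CounterSample:
    """Hypothetical counter readings for one measurement interval."""
    cycles: int                # unhalted core cycles
    memory_stall_cycles: int   # cycles stalled on the memory hierarchy
    lines_transferred: int     # cache lines moved by the component
    seconds: float             # wall-clock length of the interval


def utilization(sample: CounterSample, peak_bytes_per_second: float) -> float:
    """Achieved bandwidth divided by the peak seen in a saturating benchmark."""
    achieved = sample.lines_transferred * CACHE_LINE_BYTES / sample.seconds
    return achieved / peak_bytes_per_second


def stall_fraction(sample: CounterSample) -> float:
    """Share of cycles spent waiting for the memory hierarchy."""
    return sample.memory_stall_cycles / sample.cycles


def classify(sample, peak_bytes_per_second, busy=0.8, stalled=0.5):
    """Label an interval; both thresholds are illustrative assumptions."""
    if stall_fraction(sample) >= stalled:
        return "memory-bound"
    if utilization(sample, peak_bytes_per_second) >= busy:
        return "busy, latency hidden"
    return "not memory-bound"
```

The stall fraction is checked first because, as the abstract notes, a heavily utilized component does not by itself imply lost performance: an interval with high utilization but few stall cycles is reported as "busy, latency hidden" rather than "memory-bound".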


Published In

ICPE '17: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering
April 2017
450 pages
ISBN: 9781450344043
DOI: 10.1145/3030207

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. benchmarking
2. hardware performance counters
3. performance analysis

Acceptance Rates

ICPE '17 Paper Acceptance Rate: 27 of 83 submissions, 33%
Overall Acceptance Rate: 252 of 851 submissions, 30%
