DOI: 10.1109/MICRO56248.2022.00070

Page Size Aware Cache Prefetching

Published: 18 December 2023

Abstract

The increase in working set sizes of contemporary applications outpaces the growth in cache sizes, resulting in frequent main memory accesses that deteriorate system performance due to the disparity between processor and memory speeds. Prefetching data blocks into the cache hierarchy ahead of demand accesses has proven successful at attenuating this bottleneck. However, spatial cache prefetchers operating in the physical address space leave significant performance on the table by confining their pattern detection to 4KB physical page boundaries, even though modern systems use page sizes larger than 4KB to mitigate address translation overheads.
This paper exploits the high usage of large pages in modern systems to increase the effectiveness of spatial cache prefetching. We design and propose the Page-size Propagation Module (PPM), a μarchitectural scheme that propagates the page size information to the lower-level cache prefetchers, enabling safe prefetching beyond 4KB physical page boundaries when the accessed blocks reside in large pages, at the cost of augmenting the first-level caches' Miss Status Holding Register (MSHR) entries with one additional bit. PPM is compatible with any cache prefetcher and requires no modifications to its design. We capitalize on PPM's benefits by designing a module that consists of two page size aware prefetchers that inherently use different page sizes to drive prefetching. The composite module uses adaptive logic to dynamically enable the most appropriate page size aware prefetcher. Finally, we show that the proposed designs work regardless of which cache prefetcher is used.
We apply the proposed page size exploitation techniques to four state-of-the-art spatial cache prefetchers. Our evaluation shows that our proposals improve single-core geomean performance by up to 8.1% (2.1% at minimum) over the original implementation of the considered prefetchers, across 80 memory-intensive workloads. In multi-core contexts, we report geomean speedups of up to 7.7% across different cache prefetchers and core configurations.
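
The boundary check that page size propagation makes possible can be illustrated with a minimal sketch. This is not the paper's implementation: the 64-byte block size, the 2MB large-page size, the next-line candidate generator, and all function names are assumptions chosen for illustration. The single `is_large_page` flag stands in for the one additional bit the abstract describes.

```python
BLOCK_SIZE = 64                 # bytes per cache block (assumed)
SMALL_PAGE = 4 * 1024           # 4KB base page
LARGE_PAGE = 2 * 1024 * 1024    # 2MB large page (assumed example size)


def page_size(is_large_page: bool) -> int:
    """Map the propagated one-bit page size flag to a size in bytes."""
    return LARGE_PAGE if is_large_page else SMALL_PAGE


def may_prefetch(trigger_paddr: int, candidate_paddr: int,
                 is_large_page: bool) -> bool:
    """A candidate is safe only if it lies in the same physical page
    as the access that triggered the prefetch."""
    size = page_size(is_large_page)
    return trigger_paddr // size == candidate_paddr // size


def next_line_candidates(trigger_paddr: int, degree: int,
                         is_large_page: bool) -> list[int]:
    """Hypothetical next-line prefetcher: propose the following `degree`
    blocks and drop those that would cross the page boundary."""
    base = trigger_paddr - trigger_paddr % BLOCK_SIZE
    candidates = [base + i * BLOCK_SIZE for i in range(1, degree + 1)]
    return [c for c in candidates
            if may_prefetch(trigger_paddr, c, is_large_page)]
```

For a trigger access to the last block of a 4KB region, such as `0x1FC0`, the sketch discards every next-line candidate when the flag says the block sits in a 4KB page, and keeps them when the flag says the block sits in a large page. A candidate that would cross the large-page boundary is still rejected.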

Published In

MICRO '22: Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture
October 2022
1498 pages
ISBN: 9781665462723

Publisher

IEEE Press
Author Tags

1. cache hierarchy
2. prefetching
3. spatial correlation
4. microarchitecture
5. hardware
6. virtual memory
7. address translation
8. large pages
9. memory management
10. memory wall

Qualifiers

• Research-article

Conference

MICRO '22

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%