DOI: 10.1145/3613424.3614255 (MICRO conference proceedings)
research-article

Decoupled Vector Runahead

Published: 08 December 2023

Abstract

We present Decoupled Vector Runahead (DVR), an in-core prefetching technique that executes separately from the main application thread and exploits massive amounts of memory-level parallelism to improve the performance of applications featuring indirect memory accesses. DVR dynamically infers loop bounds at run time, recognizes striding loads, and vectorizes the subsequent instructions that are part of an indirect chain. It proactively issues memory accesses for the resulting loads far into the future, even when the out-of-order core has not yet stalled, bringing their data into the L1 cache and thus providing timely prefetches for the main thread. DVR can adjust the degree of vectorization at run time, vectorize the same chain of indirect memory accesses across multiple invocations of an inner loop, and efficiently handle branch divergence along the vectorized chain. DVR runs as an on-demand, speculative, in-order, lightweight hardware subthread alongside the main thread within the core and incurs a hardware overhead of only 1139 bytes. Relative to a large superscalar 5-wide out-of-order baseline and to Vector Runahead (a recent microarchitectural technique to accelerate indirect memory accesses on out-of-order processors), DVR delivers 2.4× and 2× higher performance, respectively, for a set of graph analytics, database, and HPC workloads.
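To make the access pattern concrete, the following is a minimal software sketch of the idea for a loop that reads `values[index[i]]`. It is not the hardware design from the paper. The function names `detect_stride` and `runahead_prefetches` and the parameters `lanes` and `loop_bound` are hypothetical and chosen only for illustration. In particular, the loop bound is passed in explicitly here, whereas the abstract states that DVR infers it dynamically.

```python
from typing import List, Optional


def detect_stride(addresses: List[int]) -> Optional[int]:
    """Return the constant, non-zero stride of the last three addresses, else None.

    Assumption: three equally spaced addresses are enough to call a load
    "striding"; the confidence mechanism of real hardware is not modeled.
    """
    if len(addresses) < 3:
        return None
    a, b, c = addresses[-3:]
    stride = b - a
    if stride != 0 and c - b == stride:
        return stride
    return None


def runahead_prefetches(index: List[int], i: int, lanes: int, loop_bound: int) -> List[int]:
    """Toy vectorized runahead for the indirect pattern values[index[i]].

    Starting after iteration i, take up to `lanes` future iterations of the
    striding load index[j], clipped at `loop_bound`, and return the positions
    in `values` that the dependent (indirect) loads would touch.
    """
    last = min(i + 1 + lanes, loop_bound)
    return [index[j] for j in range(i + 1, last)]


if __name__ == "__main__":
    index = [7, 2, 9, 4, 0, 5, 3, 8]

    # Addresses of three consecutive 4-byte loads from `index` (illustrative values).
    observed = [0x1000, 0x1004, 0x1008]
    if detect_stride(observed) is not None:
        # The main thread is at iteration 2; look 4 iterations ahead.
        print(runahead_prefetches(index, i=2, lanes=4, loop_bound=len(index)))
```

The sketch issues all future indirect accesses as one batch rather than one iteration at a time, which is the source of the memory-level parallelism the abstract describes. Adjusting `lanes` loosely corresponds to changing the degree of vectorization.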

[96]
Hao Wu, Krishnendra Nathella, Dam Sunwoo, Akanksha Jain, and Calvin Lin. 2019. Efficient Metadata Management for Irregular Data Prefetching. In Proceedings of the 46th International Symposium on Computer Architecture (Phoenix, Arizona) (ISCA ’19). Association for Computing Machinery, New York, NY, USA, 449–461. https://doi.org/10.1145/3307650.3322225
[97]
Chia-Lin Yang and Alvin R. Lebeck. 2002. A Programmable Memory Hierarchy for Prefetching Linked Data Structures. In Proceedings of the 4th International Symposium on High Performance Computing(ISHPC ’02). Springer-Verlag, Berlin, Heidelberg, 160–174. https://doi.org/10.1007/3-540-47847-7_15
[98]
Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, and Srinivas Devadas. 2015. IMP: Indirect Memory Prefetcher. In Proceedings of the 48th International Symposium on Microarchitecture (Waikiki, Hawaii) (MICRO-48). Association for Computing Machinery, New York, NY, USA, 178–190. https://doi.org/10.1145/2830772.2830807
[99]
Chao Zhang, Yuan Zeng, John Shalf, and Xiaochen Guo. 2020. RnR: A Software-Assisted Record-and-Replay Hardware Prefetcher. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Computer Society, Los Alamitos, CA, USA, 609–621. https://doi.org/10.1109/MICRO50266.2020.00057
[100]
Dan Zhang, Xiaoyu Ma, Michael Thomson, and Derek Chiou. 2018. Minnow: Lightweight Offload Engines for Worklist Management and Worklist-Directed Prefetching. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (Williamsburg, VA, USA) (ASPLOS ’18). Association for Computing Machinery, New York, NY, USA, 593–607. https://doi.org/10.1145/3173162.3173197

Published In

MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
October 2023
1528 pages
ISBN:9798400703294
DOI:10.1145/3613424

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 December 2023

Author Tags

  1. CPU microarchitecture
  2. graph processing
  3. prefetching
  4. runahead
  5. speculative vectorization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MICRO '23

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
