article

Open access

Tolerating memory latency through push prefetching for pointer-intensive applications

Authors:

Alvin R. Lebeck,

Hung-Wei Tseng,

Chien-Hao LeeAuthors Info & Claims

ACM Transactions on Architecture and Code Optimization (TACO), Volume 1, Issue 4

Pages 445 - 475

https://doi.org/10.1145/1044823.1044827

Published: 01 December 2004 Publication History

Abstract

Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of the irregular memory access pattern and pointer-chasing problem. In this paper, we proposed a cooperative hardware/software prefetching framework, the push architecture, which is designed specifically for linked data structures. The push architecture exploits program structure for future address generation instead of relying on past address history. It identifies the load instructions that traverse a LDS and uses a prefetch engine to execute them ahead of the CPU execution. This allows the prefetch engine to successfully generate future addresses. To overcome the serial nature of LDS address generation, the push architecture employs a novel data movement model. It attaches the prefetch engine to each level of the memory hierarchy and pushes, rather than pulls, data to the CPU. This push model decouples the pointer dereference from the transfer of the current node up to the processor. Thus a series of pointer dereferences becomes a pipelined process rather than a serial process. Simulation results show that the push architecture can reduce up to 100% of memory stall time on a suite of pointer-intensive applications, reducing overall execution time by an average 15%.

References

[1]

Alexander, T. and Kedem, G. 1996. Distributed predictive cache design for high performance memory system. In Proceedings of the 2nd International Symposium on High-Performance Computer Architecture. 254--263.

[2]

Annavaram, M. M., Patel, J. M., and Davidson, E. S. 2001. Data prefetching by dependence graph precomputation. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 52--61.

[3]

Baer, J.-L. and Chen, T.-F. 1991. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of the 1991 Conference on Supercomputing. 176--186.

[4]

Burger, D. C., Austin, T. M., and Bennett, S. 1996. Evaluating Future Microprocessors the Simplescalar Tool Set. Tech. Rep. 1308, Computer Sciences Department, University of Wisconsin--Madison. July.

[5]

Callahan, D., Kennedy, K., and Porterfield, A. 1991. Software prefetching. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). 40--52.

[6]

Carter, J., Hsieh, W., Stoller, L., Swanson, M., and Zhang, L. 1999. Impulse: Building a smarter memory controller. In Proceedings of 5th Symposium High-Performance Computer Architecture. 70--79.

[7]

Chilimbi, T., Larus, J., and Hill, M. 1998. Improving Pointer-Based Codes through Cache-Concious Data Placement. Tech. Rep. CSL-TR-98-1365, University of Wisconsin, Madison. March.

[8]

Chilimbi, T. M., Davidson, B., and Larus, J. R. 1999a. Cache-conscious structure definition. In Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation. 13--24.

[9]

Chilimbi, T. M., Hill, M. D., and Larus, J. R. 1999b. Cache-conscious structure layout. In Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation. 1--12.

[10]

Collins, J. D., Wang, H., Tullsen, D. M., Christopher, H. J., Lee, Y. F., Lavery, D., and Shen, J. P. 2001. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 14--25.

[11]

Dundas, J. and Mudge, T. 1997. Improving data cache performance by pre-executing instructions under a cache miss. In Proceedings of the ACM International Conference on Supercomputing. 176--186.

[12]

Fu, J. W. C. and Patel, J. H. 1992. Stride directed prefetching in scalar processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture. 102--110.

Digital Library

[13]

Gwennap, L. 1998. Alpha 21364 to ease memory bottleneck. Microprocessor Report.

[14]

Hughes, C. J. 2002. Prefetching linked data structures in systems with merged dram-logic. M.S. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign.

[15]

Joseph, D. and Grunwald, D. 1997. Prefetching using markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 252--263.

[16]

Jouppi, N. P. 1990. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture. 364--373.

[17]

Kang, Y., Huang, W., Yoo, S.-M., Keen, D., Ge, Z., Lam, V., Pattnaik, P., and Torrellas, J. 1999. Flexram: Toward an advanced intelligent memory system. In Proceedings of the 1999 International Conference on Computer Design. 192--201.

[18]

Karlsson, M., Dahlgren, F., and Stenstrom, P. 1999. A prefetching technique for irregular accesses to linked data structures. In Proceedings of 6th Symposium High-Performance Computer Architecture. 206--217.

[19]

Kessler, R. E. 1999. The alpha 21264 microprocessor. IEEE Micro. 34--36.

[20]

Klaiber, A. C. and Levy, H. M. 1991. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture. 43--53.

[21]

Kolb, C. The rayshade user's guide. In http://graphics.stanford.edu/- cek/-rayshade.

[22]

Kroft, D. 1981. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th Annual International Symposium on Computer Architecture. 81--87.

[23]

Lebeck, A. and Wood, D. 1994. Cache profiling and the spec benchmarks: A case study. IEEE Computer. 15--26.

[24]

Leibson, S. 2000. Xscale (strongarm-2) muscles. Microprocessor Report.

[25]

Lipasti, M. H., Schmidt, W. J., Kunkel, S. R., and Roediger, R. R. 1995. Spaid: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture. 252--263.

[26]

Luk, C.-K. 2001. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 40--51.

[27]

Luk, C.-K. and Mowry, T. C. 1996. Compiler based prefetching for recursive data structure. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII). 222--233.

[28]

Mehrotra, S. and Harrison, L. 1996. Examination of a memory access classification scheme for pointer-intensive and numeric program. In Proceedings of the 10th International Conference on Supercomputing. 133--139.

[29]

Mowry, T. C., Lam, M. S., and Gupta, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating System. 62--73.

[30]

Oskin, M., Chong, F. T., and Sherwood, T. 1998. Active pages: a computation model for intelligent memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture. 192--203.

[31]

Patterson, D., Andreson, T., Cardwell, N., Fromm, R., Keaton, K., Kazyrakis, C., Thomas, R., and Yellick, K. 1997. A case for intelligent ram. IEEE Micro. 34--44.

[32]

Pleszkun, A. R. and Davidson, E. S. 1983. Structured memory access architecture. In Proceedings of International Conference on Parallel Processing. 461--471.

[33]

Porterfield, A. K. 1989. Software methods for improvement of cache performance on supercomputer applications. Ph.D. Thesis, Department of Computer Science, Rice University.

[34]

Roger, A., Carlisle, M., Reppy, J., and Hendren, L. 1995. Supporting dynamic data structures on distributed memory machines. ACM Trans. Program. Lang. Syst. 17, 2 (March).

[35]

Roth, A., Moshovos, A., and Sohi, G. 1998. Dependence based prefetching for linked data structures. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII). 115--126.

[36]

Roth, A. and Sohi, G. 1999. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture. 111--121.

[37]

Roth, A. and Sohi, G. 2001. Speculative data-driven multithreading. In Proceedings of 7th Symposium High-Performance Computer Architecture. 134--143.

[38]

Smith, B. 1982a. Architecture and applications of the hep multiprocessor computer system. In Proceedings of the International Society for Optical Engineers. 241--248.

[39]

Smith, J. E. 1982b. Decoupled access/execute computer architectures. In Proceedings of the 9th Annual International Symposium on Computer Architecture. 112--119.

[40]

Sohi, G. 1990. Instruction issue logic for high performance, interruptable, multiple functional unit, pipelined computers. IEEE Trans. Comput. 39, 3 (March), 349--359.

Digital Library

[41]

Solihin, Y., Lee, J., and Torrellas, J. 2002. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th Annual International Symposium on Computer Architecture. 171--182.

[42]

Sundaramoorthy, K., Purser, Z., and Rotenberg, E. 2000. Slipstream processors: Improving both performance and fault tolerance. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX). 257--268.

[43]

VanderWiel, S. and Lilja, D. J. 1999. A Compiler-Assisted Data Prefetch Controller. Tech. Rep. ARCTiC-99-05, Department of Electrical and Computer Engineering, University of Minnesota. May.

[44]

Yang, C. and Lebeck, A. R. 2000. Push vs. pull: Data movement for linked data structures. In Proceedings of the ACM International Conference on Supercomputing. 176--186.

[45]

Zhang, Z. and Torrellas, J. 1995. Speeding up irregular applications in shared-memory multiprocessors: Memory binding and group prefetching. In Proceedings of the 22nd Annual International Symposium on Computer Architecture. 188--200.

[46]

Zilles, C. B. and Sohi, G. 2001. Execution-base prediction using speculative slices. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 2--13.

Cited By

Lee JKim HVuduc R(2012)When Prefetching Works, When It Doesn’t, and WhyACM Transactions on Architecture and Code Optimization10.1145/2133382.21333849:1(1-29)Online publication date: 1-Mar-2012
https://dl.acm.org/doi/10.1145/2133382.2133384
Niu JQiu MWang XLi JWu GChen T(2012)Cost Minimization with HPDFG and Data Mining for Heterogeneous DSPJournal of Signal Processing Systems10.1007/s11265-010-0546-x67:3(213-228)Online publication date: 1-Jun-2012
https://dl.acm.org/doi/10.1007/s11265-010-0546-x
Chu RXiao NLu X(2011)Push-based Prefetching in Remote Memory Sharing SystemCloud, Grid and High Performance Computing10.4018/978-1-60960-603-9.ch017(269-283)Online publication date: 2011
https://doi.org/10.4018/978-1-60960-603-9.ch017
Show More Cited By

Index Terms

Tolerating memory latency through push prefetching for pointer-intensive applications
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Recommendations

Tolerating latency through software-controlled data prefetching
Increasing hardware data prefetching performance using the second-level cache

Techniques to reduce or tolerate large memory latencies are critical for achieving high processor performance. Hardware data prefetching is one of the most heavily studied solutions, but it is essentially applied to first-level caches where it can ...
Going the distance for TLB prefetching: an application-driven study
Special Issue: Proceedings of the 29th annual international symposium on Computer architecture (ISCA '02)

The importance of the Translation Lookaside Buffer (TLB) on system performance is well known. There have been numerous prior efforts addressing TLB design issues for cutting down access times and lowering miss rates. However, it was only recently that ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization

ACM Transactions on Architecture and Code Optimization Volume 1, Issue 4

December 2004

107 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/1044823

Issue’s Table of Contents

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2004

Published in TACO Volume 1, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

14
Total Citations
View Citations
1,176
Total Downloads

Downloads (Last 12 months)119
Downloads (Last 6 weeks)16

Reflects downloads up to 24 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Lee JKim HVuduc R(2012)When Prefetching Works, When It Doesn’t, and WhyACM Transactions on Architecture and Code Optimization10.1145/2133382.21333849:1(1-29)Online publication date: 1-Mar-2012
https://dl.acm.org/doi/10.1145/2133382.2133384
Niu JQiu MWang XLi JWu GChen T(2012)Cost Minimization with HPDFG and Data Mining for Heterogeneous DSPJournal of Signal Processing Systems10.1007/s11265-010-0546-x67:3(213-228)Online publication date: 1-Jun-2012
https://dl.acm.org/doi/10.1007/s11265-010-0546-x
Chu RXiao NLu X(2011)Push-based Prefetching in Remote Memory Sharing SystemCloud, Grid and High Performance Computing10.4018/978-1-60960-603-9.ch017(269-283)Online publication date: 2011
https://doi.org/10.4018/978-1-60960-603-9.ch017
Askitis NZobel J(2011)Redesigning the string hash table, burst trie, and BST to exploit cacheACM Journal of Experimental Algorithmics10.1145/1671970.192170415(1.1-1.61)Online publication date: 7-Feb-2011
https://dl.acm.org/doi/10.1145/1671970.1921704
Zhang JMei KZhao J(2010)An Adaptive and Selective Instruction Active Push Mechanism for Multi-core ArchitectureProceedings of the 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage10.1109/NAS.2010.15(359-366)Online publication date: 15-Jul-2010
https://dl.acm.org/doi/10.1109/NAS.2010.15
Zou JXiao JHou RWang Y(2010)Frequent Instruction Sequential Pattern Mining in Hardware Sample DataProceedings of the 2010 IEEE International Conference on Data Mining10.1109/ICDM.2010.123(1205-1210)Online publication date: 13-Dec-2010
https://dl.acm.org/doi/10.1109/ICDM.2010.123
Drews FLichtenberg JWelch L(2010)Scalable parallel word search in multicore/multiprocessor systemsThe Journal of Supercomputing10.1007/s11227-009-0308-351:1(58-75)Online publication date: 1-Jan-2010
https://dl.acm.org/doi/10.1007/s11227-009-0308-3
Askitis NSinha R(2010)Engineering scalable, cache and space efficient tries for stringsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-010-0183-919:5(633-660)Online publication date: 1-Oct-2010
https://dl.acm.org/doi/10.1007/s00778-010-0183-9
Shi LFan XChen JHuang XTian H(2009)Tolerating Memory Latency Using a Hardware-Based Active-Pushing TechniqueProceedings of the 2009 International Conference on Embedded Software and Systems10.1109/ICESS.2009.65(407-411)Online publication date: 25-May-2009
https://dl.acm.org/doi/10.1109/ICESS.2009.65
Byna SChen YSun X(2009)Taxonomy of Data Prefetching for Multicore ProcessorsJournal of Computer Science and Technology10.1007/s11390-009-9233-424:3(405-417)Online publication date: 26-May-2009
https://doi.org/10.1007/s11390-009-9233-4
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents