skip to main content
article
Open access

Tolerating memory latency through push prefetching for pointer-intensive applications

Published: 01 December 2004 Publication History

Abstract

Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of the irregular memory access pattern and pointer-chasing problem. In this paper, we proposed a cooperative hardware/software prefetching framework, the push architecture, which is designed specifically for linked data structures. The push architecture exploits program structure for future address generation instead of relying on past address history. It identifies the load instructions that traverse a LDS and uses a prefetch engine to execute them ahead of the CPU execution. This allows the prefetch engine to successfully generate future addresses. To overcome the serial nature of LDS address generation, the push architecture employs a novel data movement model. It attaches the prefetch engine to each level of the memory hierarchy and pushes, rather than pulls, data to the CPU. This push model decouples the pointer dereference from the transfer of the current node up to the processor. Thus a series of pointer dereferences becomes a pipelined process rather than a serial process. Simulation results show that the push architecture can reduce up to 100% of memory stall time on a suite of pointer-intensive applications, reducing overall execution time by an average 15%.

References

[1]
Alexander, T. and Kedem, G. 1996. Distributed predictive cache design for high performance memory system. In Proceedings of the 2nd International Symposium on High-Performance Computer Architecture. 254--263.
[2]
Annavaram, M. M., Patel, J. M., and Davidson, E. S. 2001. Data prefetching by dependence graph precomputation. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 52--61.
[3]
Baer, J.-L. and Chen, T.-F. 1991. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of the 1991 Conference on Supercomputing. 176--186.
[4]
Burger, D. C., Austin, T. M., and Bennett, S. 1996. Evaluating Future Microprocessors the Simplescalar Tool Set. Tech. Rep. 1308, Computer Sciences Department, University of Wisconsin--Madison. July.
[5]
Callahan, D., Kennedy, K., and Porterfield, A. 1991. Software prefetching. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). 40--52.
[6]
Carter, J., Hsieh, W., Stoller, L., Swanson, M., and Zhang, L. 1999. Impulse: Building a smarter memory controller. In Proceedings of 5th Symposium High-Performance Computer Architecture. 70--79.
[7]
Chilimbi, T., Larus, J., and Hill, M. 1998. Improving Pointer-Based Codes through Cache-Concious Data Placement. Tech. Rep. CSL-TR-98-1365, University of Wisconsin, Madison. March.
[8]
Chilimbi, T. M., Davidson, B., and Larus, J. R. 1999a. Cache-conscious structure definition. In Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation. 13--24.
[9]
Chilimbi, T. M., Hill, M. D., and Larus, J. R. 1999b. Cache-conscious structure layout. In Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation. 1--12.
[10]
Collins, J. D., Wang, H., Tullsen, D. M., Christopher, H. J., Lee, Y. F., Lavery, D., and Shen, J. P. 2001. Speculative precomputation: Long-range prefetching of delinquent loads. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 14--25.
[11]
Dundas, J. and Mudge, T. 1997. Improving data cache performance by pre-executing instructions under a cache miss. In Proceedings of the ACM International Conference on Supercomputing. 176--186.
[12]
Fu, J. W. C. and Patel, J. H. 1992. Stride directed prefetching in scalar processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture. 102--110.
[13]
Gwennap, L. 1998. Alpha 21364 to ease memory bottleneck. Microprocessor Report.
[14]
Hughes, C. J. 2002. Prefetching linked data structures in systems with merged dram-logic. M.S. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign.
[15]
Joseph, D. and Grunwald, D. 1997. Prefetching using markov predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 252--263.
[16]
Jouppi, N. P. 1990. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture. 364--373.
[17]
Kang, Y., Huang, W., Yoo, S.-M., Keen, D., Ge, Z., Lam, V., Pattnaik, P., and Torrellas, J. 1999. Flexram: Toward an advanced intelligent memory system. In Proceedings of the 1999 International Conference on Computer Design. 192--201.
[18]
Karlsson, M., Dahlgren, F., and Stenstrom, P. 1999. A prefetching technique for irregular accesses to linked data structures. In Proceedings of 6th Symposium High-Performance Computer Architecture. 206--217.
[19]
Kessler, R. E. 1999. The alpha 21264 microprocessor. IEEE Micro. 34--36.
[20]
Klaiber, A. C. and Levy, H. M. 1991. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture. 43--53.
[21]
Kolb, C. The rayshade user's guide. In http://graphics.stanford.edu/- cek/-rayshade.
[22]
Kroft, D. 1981. Lockup-free instruction fetch/prefetch cache organization. In Proceedings of the 8th Annual International Symposium on Computer Architecture. 81--87.
[23]
Lebeck, A. and Wood, D. 1994. Cache profiling and the spec benchmarks: A case study. IEEE Computer. 15--26.
[24]
Leibson, S. 2000. Xscale (strongarm-2) muscles. Microprocessor Report.
[25]
Lipasti, M. H., Schmidt, W. J., Kunkel, S. R., and Roediger, R. R. 1995. Spaid: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture. 252--263.
[26]
Luk, C.-K. 2001. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 40--51.
[27]
Luk, C.-K. and Mowry, T. C. 1996. Compiler based prefetching for recursive data structure. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VII). 222--233.
[28]
Mehrotra, S. and Harrison, L. 1996. Examination of a memory access classification scheme for pointer-intensive and numeric program. In Proceedings of the 10th International Conference on Supercomputing. 133--139.
[29]
Mowry, T. C., Lam, M. S., and Gupta, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating System. 62--73.
[30]
Oskin, M., Chong, F. T., and Sherwood, T. 1998. Active pages: a computation model for intelligent memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture. 192--203.
[31]
Patterson, D., Andreson, T., Cardwell, N., Fromm, R., Keaton, K., Kazyrakis, C., Thomas, R., and Yellick, K. 1997. A case for intelligent ram. IEEE Micro. 34--44.
[32]
Pleszkun, A. R. and Davidson, E. S. 1983. Structured memory access architecture. In Proceedings of International Conference on Parallel Processing. 461--471.
[33]
Porterfield, A. K. 1989. Software methods for improvement of cache performance on supercomputer applications. Ph.D. Thesis, Department of Computer Science, Rice University.
[34]
Roger, A., Carlisle, M., Reppy, J., and Hendren, L. 1995. Supporting dynamic data structures on distributed memory machines. ACM Trans. Program. Lang. Syst. 17, 2 (March).
[35]
Roth, A., Moshovos, A., and Sohi, G. 1998. Dependence based prefetching for linked data structures. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII). 115--126.
[36]
Roth, A. and Sohi, G. 1999. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture. 111--121.
[37]
Roth, A. and Sohi, G. 2001. Speculative data-driven multithreading. In Proceedings of 7th Symposium High-Performance Computer Architecture. 134--143.
[38]
Smith, B. 1982a. Architecture and applications of the hep multiprocessor computer system. In Proceedings of the International Society for Optical Engineers. 241--248.
[39]
Smith, J. E. 1982b. Decoupled access/execute computer architectures. In Proceedings of the 9th Annual International Symposium on Computer Architecture. 112--119.
[40]
Sohi, G. 1990. Instruction issue logic for high performance, interruptable, multiple functional unit, pipelined computers. IEEE Trans. Comput. 39, 3 (March), 349--359.
[41]
Solihin, Y., Lee, J., and Torrellas, J. 2002. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th Annual International Symposium on Computer Architecture. 171--182.
[42]
Sundaramoorthy, K., Purser, Z., and Rotenberg, E. 2000. Slipstream processors: Improving both performance and fault tolerance. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX). 257--268.
[43]
VanderWiel, S. and Lilja, D. J. 1999. A Compiler-Assisted Data Prefetch Controller. Tech. Rep. ARCTiC-99-05, Department of Electrical and Computer Engineering, University of Minnesota. May.
[44]
Yang, C. and Lebeck, A. R. 2000. Push vs. pull: Data movement for linked data structures. In Proceedings of the ACM International Conference on Supercomputing. 176--186.
[45]
Zhang, Z. and Torrellas, J. 1995. Speeding up irregular applications in shared-memory multiprocessors: Memory binding and group prefetching. In Proceedings of the 22nd Annual International Symposium on Computer Architecture. 188--200.
[46]
Zilles, C. B. and Sohi, G. 2001. Execution-base prediction using speculative slices. In Proceedings of the 28th Annual International Symposium on Computer Architecture. 2--13.

Cited By

View all

Index Terms

  1. Tolerating memory latency through push prefetching for pointer-intensive applications

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Architecture and Code Optimization
    ACM Transactions on Architecture and Code Optimization  Volume 1, Issue 4
    December 2004
    107 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/1044823
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 December 2004
    Published in TACO Volume 1, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Prefetch
    2. linked data structures
    3. memory hierarchy
    4. pointer-chasing

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)119
    • Downloads (Last 6 weeks)16
    Reflects downloads up to 24 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media