Memory-side prefetching for linked data structures for processor-in-memory systems

Published: 01 April 2005

Abstract

This paper studies a memory-side prefetching technique that hides the latency incurred by inherently serial accesses to linked data structures (LDS). A programmable engine sits close to memory and traverses LDS independently of the processor. Because of its low-latency path to memory, the engine can run ahead of the processor, initiating data transfers earlier than the processor would and pipelining multiple transfers over the network. We evaluate the proposed memory-side prefetching scheme using the Olden benchmarks on a processor-in-memory system. For the six benchmarks in which LDS memory stall time is significant, the memory-side scheme reduces execution time by an average of 27% compared to a system without any prefetching. Compared to a state-of-the-art processor-side software prefetching scheme, the memory-side scheme reduces execution time by 20-50% for three of the six applications, performs about the same for two, and is 18% worse for one. We conclude that our memory-side scheme is effective but that a combination of processor- and memory-side prefetching is best, and we provide a qualitative framework for determining when either scheme should be used.
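
The run-ahead argument above can be made concrete with a toy cycle-count model. This is a minimal illustrative sketch, not the evaluated system: the `Latencies` type, both functions, and all latency values are hypothetical. The model also assumes that the engine visits list nodes in order, is never throttled, and pushes into a cache with unlimited space and no contention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latencies:
    """Hypothetical cycle costs; none of these values come from the paper."""
    mem: int   # one DRAM access made by logic sitting next to memory
    net: int   # one-way trip across the processor-memory network
    work: int  # processor computation per visited node


def no_prefetch_time(n_nodes: int, lat: Latencies) -> int:
    """Processor chases pointers itself: each node costs a full round trip."""
    t = 0
    for _ in range(n_nodes):
        # The next address is known only once this node arrives, so misses serialize.
        t += 2 * lat.net + lat.mem + lat.work
    return t


def memory_side_time(n_nodes: int, lat: Latencies) -> int:
    """Hypothetical memory-side engine walks the list and pushes each node to the processor."""
    start = lat.net  # the processor's "start traversal" message reaches the engine
    t = 0
    for i in range(n_nodes):
        # The engine pays only lat.mem per dereference, and each push needs one
        # network crossing, so several pushes can be in flight at once.
        arrival = start + (i + 1) * lat.mem + lat.net
        # The processor stalls only if the pushed node has not arrived yet.
        t = max(t, arrival) + lat.work
    return t


if __name__ == "__main__":
    lat = Latencies(mem=20, net=30, work=10)
    print(no_prefetch_time(4, lat), memory_side_time(4, lat))  # → 360 150
```

In this model, pushed nodes arrive every `mem` cycles once the pipeline fills. The per-node stall therefore drops from `2*net + mem` to `max(0, mem - work)`. When `work` exceeds `mem`, the processor's own computation becomes the bottleneck.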

Published In

Journal of Parallel and Distributed Computing, Volume 65, Issue 4
April 2005
157 pages

Publisher

Academic Press, Inc.

United States

Author Tags

  1. Linked data structures
  2. Prefetching
  3. Processor-in-memory

Qualifiers

  • Article
