DOI: 10.1145/2749469.2750386

A scalable processing-in-memory accelerator for parallel graph processing

Published: 13 June 2015

Abstract

The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability, from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, a cost-effective and scalable graph processing system would deliver performance that increases proportionally with the size of the graph that can be stored in the system; achieving this is extremely challenging in conventional systems due to severe memory bandwidth limitations.
In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve this objective. The key modern enabler for PIM is the recent advancement of 3D integration technology, which facilitates stacking logic and memory dies in a single package and was not available when the PIM concept was originally examined. To take advantage of this new technology and enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for the memory access patterns of graph processing, which operate based on hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and reduces average energy consumption by 87% compared to conventional systems.
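To make the communication scheme concrete, below is a minimal software sketch, in plain C, of the pattern the abstract alludes to: each memory partition processes its own vertices and forwards updates destined for vertices in other partitions as queued, non-blocking remote updates that are applied at a barrier, illustrated here with a toy PageRank-style computation. All names (put_update, barrier_apply, NPART, and so on) are illustrative assumptions; this is a software emulation of the communication pattern, not Tesseract's actual hardware interface or API.

/* Sketch (not the paper's API): each partition scans its local vertices and
 * pushes rank contributions to neighbors, possibly in other partitions, as
 * queued non-blocking updates; a barrier then applies all queued updates. */
#include <stdio.h>

#define NPART 4                  /* number of memory partitions (assumption) */
#define NVERT 16                 /* total vertices, split evenly across partitions */
#define VPP   (NVERT / NPART)    /* vertices per partition */

typedef struct { int dst; double delta; } Msg;   /* one queued remote update */

static double rank_val[NVERT], next_rank[NVERT];
static int    edges[NVERT][NVERT];               /* adjacency matrix */
static int    out_deg[NVERT];

/* Per-partition message queues: queue[p] holds updates destined for partition p. */
static Msg  queue[NPART][NVERT * NVERT];
static int  qlen[NPART];

static int owner(int v) { return v / VPP; }      /* which partition owns vertex v */

/* Non-blocking "put": enqueue an update for the partition that owns dst. */
static void put_update(int dst, double delta) {
    int p = owner(dst);
    queue[p][qlen[p]++] = (Msg){ dst, delta };
}

/* "Barrier": every partition applies the updates queued for it, then clears. */
static void barrier_apply(void) {
    for (int p = 0; p < NPART; p++) {
        for (int i = 0; i < qlen[p]; i++)
            next_rank[queue[p][i].dst] += queue[p][i].delta;
        qlen[p] = 0;
    }
}

int main(void) {
    /* Build a small ring graph and initialize ranks uniformly. */
    for (int v = 0; v < NVERT; v++) {
        edges[v][(v + 1) % NVERT] = 1;
        out_deg[v] = 1;
        rank_val[v] = 1.0 / NVERT;
    }

    for (int iter = 0; iter < 10; iter++) {
        for (int v = 0; v < NVERT; v++) next_rank[v] = 0.15 / NVERT;

        /* Local computation per partition; remote writes become queued messages. */
        for (int p = 0; p < NPART; p++)
            for (int v = p * VPP; v < (p + 1) * VPP; v++)
                for (int w = 0; w < NVERT; w++)
                    if (edges[v][w])
                        put_update(w, 0.85 * rank_val[v] / out_deg[v]);

        barrier_apply();
        for (int v = 0; v < NVERT; v++) rank_val[v] = next_rank[v];
    }

    printf("rank[0] = %f\n", rank_val[0]);   /* ring graph stays uniform: 1/16 */
    return 0;
}

Compiled with any standard C compiler, the sketch converges to the uniform PageRank of a ring graph. The point is only to show how per-partition queues and a barrier decouple local computation from remote updates, which is the style of inter-partition communication the abstract credits for Tesseract's memory-capacity-proportional scaling.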


Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 978-1-4503-3402-0
DOI: 10.1145/2749469

Publisher

Association for Computing Machinery, New York, NY, United States
