DOI: 10.1145/2600212.2600213
research-article

TOP-PIM: throughput-oriented programmable processing in memory

Published: 23 June 2014

Abstract

As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical to continued performance scaling. Moving computation closer to memory presents an opportunity to reduce both energy and data movement overheads. We explore the use of 3D die stacking to move memory-intensive computations closer to memory. This approach to processing in memory addresses some drawbacks of prior research on in-memory computing and is commercially viable in the foreseeable future.
Because 3D stacking provides increased bandwidth, we study throughput-oriented computing using programmable GPU compute units across a broad range of benchmarks, including graph and HPC applications. We also introduce a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware. Our results show that, on average, viable PIM configurations show moderate performance losses (27%) in return for significant energy efficiency improvements (76% reduction in EDP) relative to a representative mainstream GPU at 22nm technology. At 16nm technology, on average, viable PIM configurations are performance competitive with a representative mainstream GPU (7% speedup) and provide even greater energy efficiency improvements (85% reduction in EDP).
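
The EDP (energy-delay product) figures above combine energy and run time into one metric, so it can help to see how an EDP reduction and a performance change relate. The following is a rough sketch, not the paper's model: it assumes performance is the inverse of delay and treats the reported averages as if they described a single run, which averages across benchmarks need not satisfy. The function names are hypothetical and chosen only for illustration.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: energy multiplied by run time (lower is better)."""
    return energy_joules * delay_seconds


def implied_energy_ratio(edp_reduction: float, perf_change: float) -> float:
    """Energy ratio (PIM / baseline) implied by an EDP reduction and a
    relative performance change, under the single-run assumption.

    edp_reduction: fractional EDP reduction, e.g. 0.76 for 76%.
    perf_change:   fractional performance change, e.g. -0.27 for a 27% loss
                   or +0.07 for a 7% speedup. Performance is taken as 1/delay
                   (an assumption, not stated in the abstract).
    """
    edp_ratio = 1.0 - edp_reduction          # EDP_pim / EDP_baseline
    delay_ratio = 1.0 / (1.0 + perf_change)  # delay_pim / delay_baseline
    return edp_ratio / delay_ratio           # energy_pim / energy_baseline


if __name__ == "__main__":
    # Figures quoted in the abstract; the composition itself is illustrative.
    for node, reduction, perf in (("22nm", 0.76, -0.27), ("16nm", 0.85, 0.07)):
        ratio = implied_energy_ratio(reduction, perf)
        print(f"{node}: implied energy ratio ~ {ratio:.2f}")
```

Under these assumptions, a configuration can run slower than the baseline and still reduce EDP substantially, provided its energy drops by a larger factor than its delay grows.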



    Published In

    HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing
    June 2014
    334 pages
    ISBN:9781450327497
    DOI:10.1145/2600212

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. energy efficiency
    2. gpgpu
    3. performance modeling and analysis
    4. processing in memory

    Qualifiers

    • Research-article

    Conference

    HPDC'14

    Acceptance Rates

    HPDC '14 Paper Acceptance Rate 21 of 130 submissions, 16%;
    Overall Acceptance Rate 166 of 966 submissions, 17%

    Cited By

    • (2024) PIM-Potential: Broadening the Acceleration Reach of PIM Architectures. Proceedings of the International Symposium on Memory Systems, pages 1-12. DOI: 10.1145/3695794.3695795. Online publication date: 30-Sep-2024
    • (2024) vPIM: Processing-in-Memory Virtualization. Proceedings of the 25th International Middleware Conference, pages 417-430. DOI: 10.1145/3652892.3700782. Online publication date: 2-Dec-2024
    • (2024) SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 950-965. DOI: 10.1145/3620666.3651352. Online publication date: 27-Apr-2024
    • (2024) PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 879-896. DOI: 10.1145/3620665.3640376. Online publication date: 27-Apr-2024
    • (2024) AutoGMap: Learning to Map Large-Scale Sparse Graphs on Memristive Crossbars. IEEE Transactions on Neural Networks and Learning Systems, 35:9, pages 12888-12898. DOI: 10.1109/TNNLS.2023.3265383. Online publication date: Sep-2024
    • (2024) Designing Efficient Bit-Level Sparsity-Tolerant Memristive Networks. IEEE Transactions on Neural Networks and Learning Systems, 35:9, pages 11979-11988. DOI: 10.1109/TNNLS.2023.3250437. Online publication date: Sep-2024
    • (2024) Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1429-1443. DOI: 10.1109/MICRO61859.2024.00105. Online publication date: 2-Nov-2024
    • (2024) PointCIM: A Computing-in-Memory Architecture for Accelerating Deep Point Cloud Analytics. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1309-1322. DOI: 10.1109/MICRO61859.2024.00097. Online publication date: 2-Nov-2024
    • (2024) Leviathan: A Unified System for General-Purpose Near-Data Computing. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1278-1294. DOI: 10.1109/MICRO61859.2024.00095. Online publication date: 2-Nov-2024
    • (2024) Low-Overhead General-Purpose Near-Data Processing in CXL Memory Expanders. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 594-611. DOI: 10.1109/MICRO61859.2024.00051. Online publication date: 2-Nov-2024
