TOP-PIM: throughput-oriented programmable processing in memory

Authors: Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Michael Ignatowski

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing

Pages 85 - 98

https://doi.org/10.1145/2600212.2600213

Published: 23 June 2014

Abstract

As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical to continued performance scaling. Moving computation closer to memory presents an opportunity to reduce both energy and data movement overheads. We explore the use of 3D die stacking to move memory-intensive computations closer to memory. This approach to processing in memory addresses some drawbacks of prior research on in-memory computing and is commercially viable in the foreseeable future.

Because 3D stacking provides increased bandwidth, we study throughput-oriented computing using programmable GPU compute units across a broad range of benchmarks, including graph and HPC applications. We also introduce a methodology for rapid design space exploration by analytically predicting performance and energy of in-memory processors based on metrics obtained from execution on today's GPU hardware. Our results show that, on average, viable PIM configurations show moderate performance losses (27%) in return for significant energy efficiency improvements (76% reduction in EDP) relative to a representative mainstream GPU at 22nm technology. At 16nm technology, on average, viable PIM configurations are performance competitive with a representative mainstream GPU (7% speedup) and provide even greater energy efficiency improvements (85% reduction in EDP).
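
To make the trade-off concrete, the reported averages can be related through the definition of energy-delay product, EDP = energy × delay. The sketch below is illustrative only. It assumes that a 27% performance loss means throughput of 0.73× the baseline (so delay scales by 1/0.73), and it treats the benchmark averages as if they composed like single-run ratios, which they need not. The function name `implied_energy_ratio` is hypothetical and does not come from the paper.

```python
def implied_energy_ratio(perf_change: float, edp_reduction: float) -> float:
    """Energy ratio (PIM / baseline) implied by a performance change and an EDP reduction.

    perf_change:   relative performance change, e.g. -0.27 for a 27% loss
                   or +0.07 for a 7% speedup (assumed to mean throughput).
    edp_reduction: fractional reduction in energy-delay product, e.g. 0.76.
    """
    perf_ratio = 1.0 + perf_change     # throughput relative to the baseline
    delay_ratio = 1.0 / perf_ratio     # execution time relative to the baseline
    edp_ratio = 1.0 - edp_reduction    # EDP relative to the baseline
    # EDP = energy * delay, so energy = EDP / delay.
    return edp_ratio / delay_ratio


if __name__ == "__main__":
    # Averages quoted in the abstract; the pairing below is an assumption.
    for node, perf, edp in [("22nm", -0.27, 0.76), ("16nm", 0.07, 0.85)]:
        ratio = implied_energy_ratio(perf, edp)
        print(f"{node}: implied energy is about {ratio:.2f}x the baseline")
```

Under these assumptions the implied energy is roughly 0.18× the baseline at 22nm and 0.16× at 16nm. These are back-of-the-envelope figures derived from the quoted averages, not results reported by the authors.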

Cited By

Alsop J, Aga S, Ibrahim M, Islam M, Jayasena N, McCrabb A (2024). PIM-Potential: Broadening the Acceleration Reach of PIM Architectures. Proceedings of the International Symposium on Memory Systems, pages 1-12. DOI: 10.1145/3695794.3695795. Online publication date: 30-Sep-2024.
https://dl.acm.org/doi/10.1145/3695794.3695795
Teguia D, Chen J, Bitchebe S, Balmau O, Tchana A, Schiavoni V, Edinger J, Cao J, Jin Z (2024). vPIM: Processing-in-Memory Virtualization. Proceedings of the 25th International Middleware Conference, pages 417-430. DOI: 10.1145/3652892.3700782. Online publication date: 2-Dec-2024.
https://dl.acm.org/doi/10.1145/3652892.3700782
Li C, Zhou Z, Zheng S, Zhang J, Liang Y, Sun G, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 950-965. DOI: 10.1145/3620666.3651352. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3620666.3651352

Index Terms

TOP-PIM: throughput-oriented programmable processing in memory
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems


Published In

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing

June 2014

334 pages

ISBN:9781450327497

DOI:10.1145/2600212

General Chairs: Beth Plale (Indiana University, USA), Matei Ripeanu (University of British Columbia, CA)

Program Chairs: Franck Cappello (Argonne National Lab and INRIA, USA), Dongyan Xu (Purdue University, USA)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

HPDC'14

HPDC'14: The 23rd International Symposium on High-Performance Parallel and Distributed Computing

June 23 - 27, 2014

Vancouver, BC, Canada

Acceptance Rates

HPDC '14 Paper Acceptance Rate 21 of 130 submissions, 16%;

Overall Acceptance Rate 166 of 966 submissions, 17%

Bibliometrics

Total Citations: 267
Total Downloads: 2,222
Downloads (Last 12 months): 205
Downloads (Last 6 weeks): 24

Reflects downloads up to 27 Dec 2024
