DOI: 10.1145/3037697.3037749

Moonwalk: NRE Optimization in ASIC Clouds

Published: 04 April 2017

Abstract

Cloud services are becoming increasingly globalized, and datacenter workloads are expanding exponentially. GPU- and FPGA-based clouds have demonstrated improvements in power and performance by accelerating compute-intensive workloads. ASIC-based clouds are a promising way to optimize the Total Cost of Ownership (TCO) of a given datacenter computation (e.g., YouTube transcoding) by reducing both energy consumption and marginal computation cost.
The feasibility of an ASIC Cloud for a particular application is directly gated by the ability to manage the Non-Recurring Engineering (NRE) costs of designing and fabricating the ASIC, so that they are significantly lower (e.g., 2X) than the TCO of the best available alternative.
In this paper, we show that technology node selection is a major tool for managing ASIC Cloud NRE, and allows the designer to trade off an accelerator's excess energy efficiency and cost performance for lower total cost.
We explore NRE and cross-technology optimization of ASIC Clouds for four different applications: Bitcoin mining, YouTube-style video transcoding, Litecoin, and Deep Learning. We address the NRE challenges of these applications and show large reductions in NRE, potentially enabling ASIC Clouds to address a wider variety of datacenter workloads. Our results suggest that advanced nodes like 16nm will lead to sub-optimal TCO for many workloads, and that use of older nodes like 65nm can enable a greater diversity of ASIC Clouds.
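
The trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's cost model: it assumes total cost is simply a one-time NRE plus a TCO term that scales linearly with the required throughput, and it reads the "2X" feasibility rule as "NRE times a margin must not exceed the TCO of the best alternative". The class `NodeOption`, the functions `total_cost`, `best_node` and `is_feasible`, and every number below are hypothetical and chosen only to show the crossover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeOption:
    """One candidate technology node (all values are made-up, unitless)."""
    name: str
    nre: float           # one-time design and fabrication setup cost
    tco_per_unit: float  # TCO per unit of required throughput


def total_cost(node: NodeOption, demand: float) -> float:
    # Assumed model: fixed NRE plus TCO that grows linearly with demand.
    return node.nre + node.tco_per_unit * demand


def best_node(nodes: list[NodeOption], demand: float) -> NodeOption:
    # Pick the node with the lowest NRE + TCO for this workload size.
    return min(nodes, key=lambda n: total_cost(n, demand))


def is_feasible(node: NodeOption, baseline_tco: float, margin: float = 2.0) -> bool:
    # Assumed reading of the abstract: NRE should be at least `margin`
    # times smaller than the TCO of the best available alternative.
    return node.nre * margin <= baseline_tco


# Hypothetical nodes: the older node has low NRE but worse per-unit TCO.
NODES = [
    NodeOption("65nm", nre=1.0, tco_per_unit=4.0),
    NodeOption("16nm", nre=10.0, tco_per_unit=1.0),
]

if __name__ == "__main__":
    for demand in (1.0, 10.0):
        choice = best_node(NODES, demand)
        print(f"demand={demand}: {choice.name} "
              f"(total cost {total_cost(choice, demand)})")
```

With these made-up numbers the older node wins for the small workload and the advanced node wins only once demand is large enough to amortize its NRE, which is the qualitative behavior the abstract describes.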

Published In

ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
April 2017
856 pages
ISBN:9781450344654
DOI:10.1145/3037697
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. accelerator
  2. asic cloud
  3. datacenter
  4. nre
  5. tco

Qualifiers

  • Research-article

Funding Sources

  • Center for Future Architectures Research (C-FAR)
  • AMD Gift
  • NSF Award

Conference

ASPLOS '17

Acceptance Rates

ASPLOS '17 Paper Acceptance Rate 53 of 320 submissions, 17%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%
