Research-article

DaDianNao: A Neural Network Supercomputer

Published: 01 January 2017

Abstract

Many companies are deploying services largely based on machine-learning algorithms for the sophisticated processing of large amounts of data, for consumers or industry alike. The state-of-the-art and most popular such algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have recently been proposed that offer a high ratio of computational capacity to area, but they remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the algorithmic characteristics of CNNs and DNNs, can lead to high internal bandwidth and low external communication, which in turn enables a high degree of parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines, and evaluate its performance with electrical and optical inter-chip interconnects separately. We show that, on a subset of the largest known neural network layers, a 64-chip system can achieve a speedup of 656.63× over a GPU and reduce energy by 184.05× on average. We implement the node, which combines custom storage and computational units and uses electrical inter-chip interconnects, down to place-and-route at 28 nm.
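The abstract's pivotal claim is that the weight footprint of even very large networks fits within the aggregate on-chip storage of a multi-chip system. A minimal Python sketch of that back-of-envelope arithmetic follows; the 36 MB per-node eDRAM capacity and the 16-bit weight width are assumptions (figures commonly associated with the DaDianNao node design, not values stated in this abstract), and the billion-synapse network size is purely hypothetical.

```python
# Back-of-envelope check of the abstract's claim: a large DNN's weight
# footprint, while big, can fit in the aggregate on-chip storage of a
# 64-node system. All capacities below are illustrative assumptions.

NUM_NODES = 64                  # system size evaluated in the article
EDRAM_PER_NODE_MB = 36          # assumed per-node eDRAM capacity
BYTES_PER_WEIGHT = 2            # assumed 16-bit fixed-point weights
NUM_WEIGHTS = 1_000_000_000     # hypothetical billion-synapse network

footprint_mb = NUM_WEIGHTS * BYTES_PER_WEIGHT / 2**20
aggregate_mb = NUM_NODES * EDRAM_PER_NODE_MB

print(f"model weight footprint : {footprint_mb:7.1f} MB")   # ~1907.3 MB
print(f"aggregate on-chip eDRAM: {aggregate_mb:7.1f} MB")   # ~2304.0 MB
print(f"fits entirely on-chip  : {footprint_mb <= aggregate_mb}")
```

Under these assumptions the weights stay resident in each node's on-chip storage, so only neuron activations need to cross the inter-chip interconnect; that partitioning is what underlies the high internal bandwidth and low external communication described in the abstract.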

Published In

IEEE Transactions on Computers, Volume 66, Issue 1, January 2017, 177 pages

Publisher

IEEE Computer Society, United States
