Research-article

DaDianNao: A Neural Network Supercomputer

Published: 01 January 2017

Abstract

Many companies are deploying services largely based on machine-learning algorithms for the sophisticated processing of large amounts of data, for consumers or industry alike. The state-of-the-art and most popular such algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have recently been proposed that offer a high ratio of computational capacity to area, but they remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the algorithmic characteristics of CNNs and DNNs, can lead to high internal bandwidth and low external communication, which in turn enables a high degree of parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines, and evaluate its performance with electrical and optical inter-chip interconnects separately. We show that, on a subset of the largest known neural network layers, a 64-chip system can achieve a speedup of 656.63× over a GPU and reduce energy by 184.05× on average. We implement the node, which combines custom storage and computational units and uses electrical inter-chip interconnects, down to place-and-route at 28 nm.
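The abstract's pivotal claim is that the weight footprint of even very large networks fits within the aggregate on-chip storage of a multi-chip system. A minimal Python sketch of that back-of-envelope arithmetic follows; the 36 MB per-node eDRAM capacity and the 16-bit weight width are assumptions (figures commonly associated with the DaDianNao node design, not values stated in this abstract), and the billion-synapse network size is purely hypothetical.

```python
# Back-of-envelope check of the abstract's claim: a large DNN's weight
# footprint, while big, can fit in the aggregate on-chip storage of a
# 64-node system. All capacities below are illustrative assumptions.

NUM_NODES = 64                  # system size evaluated in the article
EDRAM_PER_NODE_MB = 36          # assumed per-node eDRAM capacity
BYTES_PER_WEIGHT = 2            # assumed 16-bit fixed-point weights
NUM_WEIGHTS = 1_000_000_000     # hypothetical billion-synapse network

footprint_mb = NUM_WEIGHTS * BYTES_PER_WEIGHT / 2**20
aggregate_mb = NUM_NODES * EDRAM_PER_NODE_MB

print(f"model weight footprint : {footprint_mb:7.1f} MB")   # ~1907.3 MB
print(f"aggregate on-chip eDRAM: {aggregate_mb:7.1f} MB")   # ~2304.0 MB
print(f"fits entirely on-chip  : {footprint_mb <= aggregate_mb}")
```

Under these assumptions the weights stay resident in each node's on-chip storage, so only neuron activations need to cross the inter-chip interconnect; that partitioning is what underlies the high internal bandwidth and low external communication described in the abstract.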

Published In

IEEE Transactions on Computers, Volume 66, Issue 1, January 2017, 177 pages

Publisher

IEEE Computer Society, United States
