DOI: 10.1145/3461648.3463846

CHaNAS: coordinated search for network architecture and scheduling policy

Published: 22 June 2021

Abstract

Automatically designing an efficient DNN solution for a given deep learning task on target hardware is mainly determined by two factors, the neural network architecture and the schedule mapping strategy, which are closely coupled and must be tuned jointly to fully exploit the advantages of the underlying hardware. Prior hardware-aware Neural Architecture Search (NAS) methods mostly ignore the impact of different scheduling policies (e.g., graph-level optimization, loop transformations, and parallelization) on the network candidates evaluated during the search. Thus, they may miss the truly optimal architecture, which can only be discovered by trying out different scheduling policies. This work proposes a NAS framework, CHaNAS, that searches not only for the network architecture but also for the dedicated scheduling policy, producing a co-design solution that fully exploits the advantages of the target hardware. We propose a block-based pre-scheduling methodology to reduce the co-design search space and enable the automatic generation of the optimal co-design, including the network architecture and the tensor programs that implement the scheduling policy. We evaluate CHaNAS on ImageNet on different hardware back-ends against the state-of-the-art hardware-aware search baseline MobileNet-v3. Experimental results show that the co-design solutions obtained by CHaNAS deliver up to 1.6x, 1.9x, and 1.7x performance improvements on an NVIDIA P100 GPU, an Intel Xeon 8163 CPU, and a Samsung Note 10 mobile device, respectively, over baselines at the same accuracy level.

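The coordinated search idea can be summarized in a minimal sketch (illustrative only, not the authors' implementation): each architecture candidate is scored by the latency of its best available schedule rather than a single default schedule, so the ranking reflects what the target hardware can actually deliver. The search spaces, cost model (measure_latency), and accuracy predictor (estimate_accuracy) below are hypothetical stand-ins.

```python
import itertools
import random

# Hypothetical, tiny search spaces for illustration only.
ARCH_SPACE = {
    "kernel_size": [3, 5, 7],
    "expand_ratio": [3, 4, 6],
    "depth": [2, 3, 4],
}
SCHEDULE_SPACE = {
    "tile": [8, 16, 32],
    "unroll": [1, 4],
    "parallelize": [False, True],
}

def enumerate_space(space):
    """Yield every combination of the options in a search space."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def measure_latency(arch, schedule):
    """Stand-in for compiling the tensor program and timing it on the device."""
    work = arch["kernel_size"] ** 2 * arch["expand_ratio"] * arch["depth"]
    speedup = (schedule["tile"] / 8.0) * schedule["unroll"] * (2.0 if schedule["parallelize"] else 1.0)
    return work / speedup + random.uniform(0.0, 0.1)

def estimate_accuracy(arch):
    """Stand-in for a supernet- or predictor-based accuracy estimate."""
    return 70.0 + arch["depth"] + 0.5 * arch["expand_ratio"] + 0.1 * arch["kernel_size"]

def coordinated_search(latency_budget):
    """Pick the most accurate architecture whose *best* schedule meets the budget."""
    best = None
    for arch in enumerate_space(ARCH_SPACE):
        # Evaluate the candidate under every scheduling policy and keep the best one;
        # a NAS that fixed a single schedule here could mis-rank the architectures.
        schedule, latency = min(
            ((s, measure_latency(arch, s)) for s in enumerate_space(SCHEDULE_SPACE)),
            key=lambda pair: pair[1],
        )
        if latency > latency_budget:
            continue
        acc = estimate_accuracy(arch)
        if best is None or acc > best[2]:
            best = (arch, schedule, acc, latency)
    return best  # (architecture, schedule, accuracy, latency) or None

if __name__ == "__main__":
    result = coordinated_search(latency_budget=20.0)
    if result is not None:
        arch, schedule, acc, lat = result
        print(f"arch={arch}\nschedule={schedule}\nacc={acc:.1f} lat={lat:.2f}")
```

In practice, the paper's block-based pre-scheduling avoids exhaustively crossing every architecture with every schedule; the nested loop above is only meant to show why ignoring schedule choice during the search can mis-rank network candidates.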
Published In

LCTES 2021: Proceedings of the 22nd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems. June 2021, 162 pages. ISBN: 9781450384728. DOI: 10.1145/3461648. General Chair: Jörg Henkel; Program Chair: Xu Liu.

Publisher

Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. Compiler optimization
      2. Hardware-aware Neural Architecture Search
      3. NN-Scheduling Co-design
