skip to main content
research-article
Public Access

SimNet: Accurate and High-Performance Computer Architecture Simulation using Deep Learning

Published: 06 June 2022 Publication History

Abstract

While cycle-accurate simulators are essential tools for architecture research, design, and development, their practicality is limited by an extremely long time-to-solution for realistic applications under investigation. This work describes a concerted effort, where machine learning (ML) is used to accelerate microarchitecture simulation. First, an ML-based instruction latency prediction framework that accounts for both static instruction properties and dynamic processor states is constructed. Then, a GPU-accelerated parallel simulator is implemented based on the proposed instruction latency predictor, and its simulation accuracy and throughput are validated and evaluated against a state-of-the-art simulator. Leveraging modern GPUs, the ML-based simulator outperforms traditional CPU-based simulators significantly.

References

[1]
2020. DGX A100: Universal System for AI Infrastructure. https://www.nvidia.com/en-us/data-center/dgx-a100/
[2]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16). 265--283.
[3]
Mohammad Agbarya, Idan Yaniv, Jayneel Gandhi, and Dan Tsafrir. 2020. Predicting execution times with partial simulations in virtual memory research: why and how. In IEEE/ACM International Symposium on Microarchitecture (MICRO). Global Online Event. (to appear).
[4]
Newsha Ardalani, Clint Lestourgeon, Karthikeyan Sankaralingam, and Xiaojin Zhu. 2015. Cross-Architecture Performance Prediction (XAPP) Using CPU Code to Predict GPU Performance. In Proceedings of the 48th International Symposium on Microarchitecture (Waikiki, Hawaii) (MICRO-48). Association for Computing Machinery, New York, NY, USA, 725--737. https://doi.org/10.1145/2830772.2830780
[5]
I. Baldini, S. J. Fink, and E. Altman. 2014. Predicting GPU Performance from CPU Runs Using Machine Learning. In 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing. 254--261. https://doi.org/10.1109/SBAC-PAD.2014.30 Proc. ACM Meas. Anal. Comput. Syst., Vol. 6, No. 2, Article 25. Publication date: June 2022. 25:22 Lingda Li, Santosh Pandey, Thomas Flynn, Hang Liu, Noel Wheeler, and Adolfy Hoisie
[6]
Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator. In USENIX annual technical conference, FREENIX Track, Vol. 41. Califor-nia, USA, 46.
[7]
Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, and et al. 2011. The Gem5 Simulator. SIGARCH Comput. Archit. News 39, 2 (Aug. 2011), 1--7. https://doi.org/10.1145/2024716.2024718
[8]
James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. 2018. SPEC CPU2017: Next-generation compute benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering. 41--42.
[9]
T. E. Carlson, W. Heirman, K. Van Craeynest, and L. Eeckhout. 2014. BarrierPoint: Sampled simulation of multi-threaded applications. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 2--12. https://doi.org/10.1109/ISPASS.2014.6844456
[10]
T. M. Conte, M. A. Hirsch, and K. N. Menezes. 1996. Reducing state loss for effective trace sampling of superscalar processors. In Proceedings International Conference on Computer Design. VLSI in Computers and Processors. 468--477. https://doi.org/10.1109/ICCD.1996.563595
[11]
Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural architecture search: A survey. The Journal of Machine Learning Research 20, 1 (2019), 1997--2017.
[12]
Stijn Eyerman, Kenneth Hoste, and Lieven Eeckhout. 2011. Mechanistic-empirical processor performance modeling for constructing CPI stacks on real hardware. In (IEEE ISPASS) IEEE International Symposium on Performance Analysis of Systems and Software. 216--226. https://doi.org/10.1109/ISPASS.2011.5762738
[13]
A. Gutierrez, J. Pusdesris, R. G. Dreslinski, T. Mudge, C. Sudanthi, C. D. Emmons, M. Hayenga, and N. Paver. 2014. Sources of error in full-system simulation. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 13--22. https://doi.org/10.1109/ISPASS.2014.6844457
[14]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770--778.
[15]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780.
[16]
Engin Ïpek, Sally A. McKee, Rich Caruana, Bronis R. de Supinski, and Martin Schulz. 2006. Efficiently Exploring Architectural Design Spaces via Predictive Modeling. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, California, USA) (ASPLOS XII). Association for Computing Machinery, New York, NY, USA, 195--206. https://doi.org/10.1145/1168857.1168882
[17]
Aamer Jaleel, Robert S Cohn, Chi-Keung Luk, and Bruce Jacob. 2008. CMPim: A Pin-based on-the-fly multi-core cache simulator. In Proceedings of the Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA. 28--36.
[18]
Curtis L Janssen, Helgi Adalsteinsson, Scott Cranford, Joseph P Kenny, Ali Pinar, David A Evensky, and Jackson Mayo. 2010. A simulator for large-scale parallel computer architectures. International Journal of Distributed Systems and Technologies (IJDST) 1, 2 (2010), 57--73.
[19]
Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Lin Lin, Roberto Car, Weinan E, and Linfeng Zhang. 2020. Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Atlanta, Georgia) (SC '20). IEEE Press, Article 5, 14 pages.
[20]
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA '17). Association for Computing Machinery, New York, NY, USA, 1--12. https://doi.org/10.1145/3079856.3080246
[21]
Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, and Krste Asanovi?. 2018. Firesim: FPGA-Accelerated Cycle-Exact Scale-out System Simulation in the Public Cloud. In Proceedings of the 45th Annual International Symposium on Computer Architecture (Los Angeles, California) (ISCA '18). IEEE Press, 29--42. https://doi.org/10.1109/ISCA.2018.00014 Proc. ACM Meas. Anal. Comput. Syst., Vol. 6, No. 2, Article 25. Publication date: June 2022. SimNet: Accurate and High-Performance Computer Architecture Simulation using Deep Learning 25:23
[22]
P. Diederik Kingma and Lei Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. international conference on learning representations (2015).
[23]
Yuetsu Kodama, Tetsuya Odajima, Akira Asato, and Mitsuhisa Sato. 2019. Evaluation of the riken post-k processor simulator. arXiv preprint arXiv:1904.06451 (2019).
[24]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012), 1097--1105.
[25]
B. C. Lee and D. M. Brooks. 2007. Illustrative Design Space Studies with Microarchitectural Regression Models. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture. 340--351. https://doi.org/10.1109/ HPCA.2007.346211
[26]
Benjamin C. Lee, David M. Brooks, Bronis R. de Supinski, Martin Schulz, Karan Singh, and Sally A. McKee. 2007. Methods of Inference and Learning for Performance Modeling of Parallel Applications. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (San Jose, California, USA) (PPoPP '07). Association for Computing Machinery, New York, NY, USA, 249--258. https://doi.org/10.1145/1229428.1229479
[27]
Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765--4774.
[28]
Charith Mendis, Alex Renda, Saman Amarasinghe, and Michael Carbin. 2019. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks. In International Conference on Machine Learning. PMLR, 4505--4515.
[29]
Daniel Nemirovsky, Tugberk Arkose, Nikola Markovic, Mario Nemirovsky, Osman Unsal, and Adrian Cristal. 2017. A Machine Learning Approach for Performance Prediction and Scheduling on Heterogeneous CPUs. In 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 121--128. https: //doi.org/10.1109/SBAC-PAD.2017.23
[30]
Nikos Nikoleris, Lieven Eeckhout, Erik Hagersten, and Trevor E. Carlson. 2019. Directed Statistical Warming through Time Traveling. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (Columbus, OH, USA) (MICRO '52). Association for Computing Machinery, New York, NY, USA, 1037--1049. https://doi.org/10. 1145/3352460.3358264
[31]
Kenneth O'neal, Philip Brisk, Ahmed Abousamra, Zack Waters, and Emily Shriver. 2017. GPU Performance Estimation Using Software Rasterization and Machine Learning. ACM Trans. Embed. Comput. Syst. 16, 5s, Article 148 (Sept. 2017), 21 pages. https://doi.org/10.1145/3126557
[32]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc., 8026--8037. https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
[33]
A. Patel, F. Afram, S. Chen, and K. Ghose. 2011. MARSS: A full system simulator for multicore x86 CPUs. In 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC). 1050--1055.
[34]
H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi. 2004. Pinpointing Representative Portions of Large Intel ® Itanium ® Programs with Dynamic Instrumentation. In 37th International Symposium on Microarchitecture (MICRO-37'04). 81--92. https://doi.org/10.1109/MICRO.2004.28
[35]
Drew D Penney and Lizhong Chen. 2019. A survey of machine learning applied to computer architecture design. arXiv preprint arXiv:1909.12373 (2019).
[36]
Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder. 2003. Using SimPoint for Accurate and Efficient Simulation. SIGMETRICS Perform. Eval. Rev. 31, 1 (June 2003), 318--319.
[37]
Alex Renda, Yishen Chen, Charith Mendis, and Michael Carbin. 2020. DiffTune: Optimizing CPU Simulator Parameters with Learned Differentiable Surrogates. In IEEE/ACM International Symposium on Microarchitecture.
[38]
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. nature 323, 6088 (1986), 533--536.
[39]
Daniel Sanchez and Christos Kozyrakis. 2013. ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems. SIGARCH Comput. Archit. News 41, 3 (June 2013), 475--486. https://doi.org/10.1145/2508148.2485963
[40]
A. Sandberg, N. Nikoleris, T. E. Carlson, E. Hagersten, S. Kaxiras, and D. Black-Schaffer. 2015. Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed. In 2015 IEEE International Symposium on Workload Characterization. 183--192. https://doi.org/10.1109/IISWC.2015.29
[41]
M. Sato, Y. Ishikawa, H. Tomita, Y. Kodama, T. Odajima, M. Tsuji, H. Yashiro, M. Aoki, N. Shida, I. Miyoshi, K. Hirai, A. Furuya, A. Asato, K. Morita, and T. Shimizu. 2020. Co-Design for A64FX Manycore Processor and "Fugaku". In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. Proc. ACM Meas. Anal. Comput. Syst., Vol. 6, No. 2, Article 25. Publication date: June 2022. 25:24 Lingda Li, Santosh Pandey, Thomas Flynn, Hang Liu, Noel Wheeler, and Adolfy Hoisie
[42]
Andrew W Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin ídek, Alexander WR Nelson, Alex Bridgland, et al. 2020. Improved protein structure prediction using potentials from deep learning. Nature 577, 7792 (2020), 706--710.
[43]
André Seznec. 2016. TAGE-SC-L Branch Predictors Again. In 5th JILP Workshop on Computer Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5). Seoul, South Korea. https://hal.inria.fr/hal-01354253
[44]
Lloyd S Shapley. 1953. A value for n-person games. Contributions to the Theory of Games 2, 28 (1953), 307--317.
[45]
Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. 2002. Automatically Characterizing Large Scale Program Behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, California) (ASPLOS X). Association for Computing Machinery, New York, NY, USA, 45--57. https://doi.org/10.1145/605397.605403
[46]
Julius O. Smith. 2011. Spectral Audio Signal Processing. https://ccrma.stanford.edu/~jos/sasp online book, 2011 edition.
[47]
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. 2019. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48]
Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 6105--6114. http://proceedings.mlr.press/v97/tan19a.html
[49]
Tran Van Dung, Ittetsu Taniguchi, and Hiroyuki Tomiyama. 2014. Cache Simulation for Instruction Set Simulator QEMU. In 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing. 441--446. https: //doi.org/10.1109/DASC.2014.85
[50]
Han Vanholder. 2016. Efficient inference with tensorrt.
[51]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998--6008.
[52]
T. F. Wenisch, R. E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J. C. Hoe. 2006. SimFlex: Statistical Sampling of Computer System Simulation. IEEE Micro 26, 4 (2006), 18--31. https://doi.org/10.1109/MM.2006.79
[53]
Gene Wu, Joseph L. Greathouse, Alexander Lyashevsky, Nuwan Jayasena, and Derek Chiou. 2015. GPGPU performance and power estimation using machine learning. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 564--576. https://doi.org/10.1109/HPCA.2015.7056063
[54]
Nan Wu and Yuan Xie. 2021. A Survey of Machine Learning for Computer Architecture and Systems. arXiv preprint arXiv:2102.07952 (2021).
[55]
Roland E. Wunderlich, Thomas F. Wenisch, Babak Falsafi, and James C. Hoe. 2003. SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling. In Proceedings of the 30th Annual International Symposium on Computer Architecture (San Diego, California) (ISCA '03). Association for Computing Machinery, New York, NY, USA, 84--97. https://doi.org/10.1145/859618.859629
[56]
Toshio Yoshida. 2018. Fujitsu high performance CPU for the Post-K Computer. In Hot Chips, Vol. 30.
[57]
Xinnian Zheng, Lizy K. John, and Andreas Gerstlauer. 2016. Accurate Phase-Level Cross-Platform Power and Performance Estimation. In Proceedings of the 53rd Annual Design Automation Conference (Austin, Texas) (DAC '16). Association for Computing Machinery, New York, NY, USA, Article 4, 6 pages. https://doi.org/10.1145/2897937.2897977
[58]
Xinnian Zheng, Pradeep Ravikumar, Lizy K John, and Andreas Gerstlauer. 2015. Learning-based analytical crossplatform performance prediction. In 2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS). IEEE, 52--59

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Measurement and Analysis of Computing Systems
Proceedings of the ACM on Measurement and Analysis of Computing Systems  Volume 6, Issue 2
POMACS
June 2022
499 pages
EISSN:2476-1249
DOI:10.1145/3543145
Issue’s Table of Contents
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 June 2022
Published in POMACS Volume 6, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. computer architecture simulation
  2. deep learning
  3. gpu

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)508
  • Downloads (Last 6 weeks)64
Reflects downloads up to 06 Nov 2024

Other Metrics

Citations

Cited By

View all

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media