DOI: 10.1145/3366428.3380770
Research article · Public Access

The Minos Computing Library: efficient parallel programming for extremely heterogeneous systems

Published: 23 February 2020

Abstract

Hardware specialization has become the silver bullet for achieving efficient high performance, from Systems-on-Chip, where specialization can be "extreme", to large-scale HPC systems. As the complexity of these systems increases, so does the complexity of programming such architectures in a portable way.
This work introduces the Minos Computing Library (MCL): system software, a programming model, and a runtime that together facilitate programming extremely heterogeneous systems. MCL supports the execution of several multi-threaded applications within the same compute node, performs asynchronous execution of application tasks, efficiently balances computation across hardware resources, and provides performance portability.
We show that code developed on a personal desktop automatically scales up to fully utilize powerful workstations with 8 GPUs and down to power-efficient embedded systems. MCL provides up to 17.5x speedup over OpenCL on NVIDIA DGX-1 systems and up to 1.88x speedup on single-GPU systems. In multi-application workloads, MCL's dynamic resource allocation provides up to 2.43x performance improvement over manual, static resource allocation.
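The abstract's key mechanism is a runtime that accepts tasks asynchronously and dynamically balances them across available compute resources rather than relying on a manual, static device assignment. The sketch below illustrates that general idea only; the `Device`/`Scheduler` names and `submit` interface are hypothetical illustrations and are not MCL's actual API.

```python
# Hedged sketch of dynamic, least-loaded asynchronous task scheduling, the
# general technique the abstract describes. All names here are illustrative
# assumptions, NOT the MCL API.
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class Device:
    """Stand-in for a compute resource (e.g., a GPU) with its own work queue."""

    def __init__(self, name: str):
        self.name = name
        self.pending = 0  # number of tasks queued or running on this device
        self._pool = ThreadPoolExecutor(max_workers=1)  # one queue per device


class Scheduler:
    """Submits tasks asynchronously to the currently least-loaded device."""

    def __init__(self, devices):
        self.devices = devices
        self._lock = threading.Lock()

    def submit(self, fn, *args) -> Future:
        # Dynamic balancing: pick whichever device has the fewest pending tasks.
        with self._lock:
            dev = min(self.devices, key=lambda d: d.pending)
            dev.pending += 1
        fut = dev._pool.submit(fn, *args)
        # Decrement the load counter once the task finishes.
        fut.add_done_callback(lambda _f, d=dev: self._task_done(d))
        return fut  # caller continues immediately; result is awaited later

    def _task_done(self, dev: Device):
        with self._lock:
            dev.pending -= 1


if __name__ == "__main__":
    sched = Scheduler([Device("gpu0"), Device("gpu1")])
    futures = [sched.submit(lambda x: x * x, i) for i in range(8)]
    print(sorted(f.result() for f in futures))
```

A static allocation would instead pin each application to a fixed device up front; the dynamic policy above adapts when one queue drains faster than another, which is the effect behind the multi-application improvements the abstract reports.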



      Published In

      GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit
      February 2020
      77 pages
      ISBN:9781450370257
      DOI:10.1145/3366428
      © 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. GPU
      2. asynchronous runtime
      3. heterogeneous systems
      4. system software
      5. task-based runtime


      Conference

      PPoPP '20

      Acceptance Rates

      GPGPU '20 Paper Acceptance Rate 7 of 12 submissions, 58%;
      Overall Acceptance Rate 57 of 129 submissions, 44%
