DOI: 10.1145/3432261.3432272
Research article · Open access

CSPACER: A Reduced API Set Runtime for the Space Consistency Model

Published: 20 January 2021

Abstract

We present the design and implementation of a runtime for the Space Consistency model. The Space Consistency model generalizes full-empty bit synchronization to distributed memory programming: a memory region is associated with a counter that determines its consistency and readiness for consumption. The model allows efficient implementation of both point-to-point data transfers and collective communication primitives. We present the interface design, implementation details, and performance results on Cray XC systems. Our runtime adopts a reduced API design that lowers the overhead of initiating and processing communication primitives, enables threaded execution of runtime functions, and supports efficient pipelining, thereby improving computation–communication overlap. We demonstrate the performance benefits of the runtime both at the microbenchmark level and in application settings.
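
To make the counter-based consistency idea concrete, here is a minimal sketch in C of a memory region ("space") paired with a counter that publishes when its contents are ready for consumption. All names here (space_t, space_signal, space_wait) are illustrative assumptions for exposition, not the actual CSPACER API; a real runtime would advance the counter from network completion events and poll the network while waiting, rather than using local atomics alone.

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* A "space": a memory region paired with a counter that signals when
 * its contents are consistent and ready to consume. Hypothetical
 * types and functions for illustration only. */
typedef struct {
    void                 *base;    /* the memory region itself       */
    size_t                size;    /* region size in bytes           */
    atomic_uint_fast64_t  counter; /* advances as contributions land */
} space_t;

/* Producer side: after data has been deposited into the region
 * (e.g., the target buffer of a remote put), advance the counter
 * with release semantics to publish the writes. */
static inline void space_signal(space_t *s, uint64_t contributions)
{
    atomic_fetch_add_explicit(&s->counter, contributions,
                              memory_order_release);
}

/* Consumer side: wait until the expected number of contributions has
 * arrived. With expected == 1 this degenerates to a full-empty bit;
 * larger values count multiple producers, e.g. the pieces of a
 * collective, enabling pipelined consumption of partial data. */
static inline void space_wait(space_t *s, uint64_t expected)
{
    while (atomic_load_explicit(&s->counter,
                                memory_order_acquire) < expected)
        ; /* a real runtime would progress the network here */
}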


Cited By

  • (2022) Enhancing scalability of a matrix-free eigensolver for studying many-body localization. International Journal of High Performance Computing Applications 36, 3 (May 2022), 307–319. https://doi.org/10.1177/10943420211060365
  • (2021) Achieving performance portability in Gaussian basis set density functional theory on accelerator based architectures in NWChemEx. Parallel Computing 108 (Dec 2021). https://doi.org/10.1016/j.parco.2021.102829


Information

Published In

HPCAsia '21: The International Conference on High Performance Computing in Asia-Pacific Region
January 2021
143 pages
ISBN:9781450388429
DOI:10.1145/3432261
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 January 2021


Author Tags

  1. Parallel Programming Models
  2. Runtime Systems

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

HPC Asia 2021

Acceptance Rates

Overall acceptance rate: 69 of 143 submissions (48%)
