DOI: 10.1145/3432261.3432272
Research article · Open access

CSPACER: A Reduced API Set Runtime for the Space Consistency Model

Published: 20 January 2021

Abstract

We present the design and implementation of a runtime for the Space Consistency model. The Space Consistency model generalizes full-empty bit synchronization to distributed memory programming: a memory region is associated with a counter that determines its consistency and readiness for consumption. The model allows efficient implementation of both point-to-point data transfers and collective communication primitives. We present the interface design, implementation details, and performance results on Cray XC systems. Our runtime adopts a reduced API design that lowers the overhead of initiating and processing communication primitives, enables threaded execution of runtime functions, and supports efficient pipelining, thereby improving computation–communication overlap. We demonstrate the performance benefits of the runtime both at the microbenchmark level and in application settings.
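
To make the counter-based consistency idea concrete, here is a minimal sketch in C of a memory region ("space") paired with a counter that publishes when its contents are ready for consumption. All names here (space_t, space_signal, space_wait) are illustrative assumptions for exposition, not the actual CSPACER API; a real runtime would advance the counter from network completion events and poll the network while waiting, rather than using local atomics alone.

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* A "space": a memory region paired with a counter that signals when
 * its contents are consistent and ready to consume. Hypothetical
 * types and functions for illustration only. */
typedef struct {
    void                 *base;    /* the memory region itself       */
    size_t                size;    /* region size in bytes           */
    atomic_uint_fast64_t  counter; /* advances as contributions land */
} space_t;

/* Producer side: after data has been deposited into the region
 * (e.g., the target buffer of a remote put), advance the counter
 * with release semantics to publish the writes. */
static inline void space_signal(space_t *s, uint64_t contributions)
{
    atomic_fetch_add_explicit(&s->counter, contributions,
                              memory_order_release);
}

/* Consumer side: wait until the expected number of contributions has
 * arrived. With expected == 1 this degenerates to a full-empty bit;
 * larger values count multiple producers, e.g. the pieces of a
 * collective, enabling pipelined consumption of partial data. */
static inline void space_wait(space_t *s, uint64_t expected)
{
    while (atomic_load_explicit(&s->counter,
                                memory_order_acquire) < expected)
        ; /* a real runtime would progress the network here */
}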


Cited By

  • (2022) Enhancing scalability of a matrix-free eigensolver for studying many-body localization. International Journal of High Performance Computing Applications 36, 3 (May 2022), 307–319. https://doi.org/10.1177/10943420211060365
  • (2021) Achieving performance portability in Gaussian basis set density functional theory on accelerator based architectures in NWChemEx. Parallel Computing 108 (Dec 2021). https://doi.org/10.1016/j.parco.2021.102829


Information

Published In

HPCAsia '21: The International Conference on High Performance Computing in Asia-Pacific Region
January 2021
143 pages
ISBN:9781450388429
DOI:10.1145/3432261
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 January 2021


Author Tags

  1. Parallel Programming Models
  2. Runtime Systems

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

HPC Asia 2021

Acceptance Rates

Overall acceptance rate: 69 of 143 submissions (48%)
