skip to main content
10.1145/2464996.2467282acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
poster

Inspector/executor load balancing algorithms for block-sparse tensor contractions

Published: 10 June 2013 Publication History

Abstract

Developing effective yet scalable load-balancing methods for irregular computations is critical to the successful application of simulations in a variety of disciplines at petascale and beyond. This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for different degrees of sparsity (and therefore load imbalance). In this particular application, a relatively large amount of task information can be obtained at minimal cost, which enables the use of static partitioning techniques that take the entire task list as input. However, fully static partitioning is incapable of dealing with dynamic variation of task costs, such as from transient network contention or operating system noise, so we also consider hybrid schemes that utilize dynamic scheduling within subgroups. These two schemes, which have not been previously implemented in NWChem or its proxies (i.e. quantum chemistry mini-apps) are compared to the original centralized dynamic load-balancing algorithm as well as improved centralized scheme. In all cases, we separate the scheduling of tasks from the execution of tasks into an inspector phase and an executor phase. The impact of these methods upon the application is substantial on a large InfiniBand cluster: execution time is reduced by as much as 50% at scale. The technique is applicable to any scientific application requiring load balance where performance models or estimations of kernel execution times are available.

References

[1]
Gagan Agrawal, Alan Sussman, and Joel Saltz. An integrated runtime and compile-time approach for parallelizing structured and block structured applications. IEEE Transactions on Parallel and Distributed Systems, 6:747--754, 1995.
[2]
Yuri Alexeev, Ashutosh Mahajan, Sven Leyffer, Graham Fletcher, and Dmitri Fedorov. Heuristic static load-balancing algorithm applied to the fragment molecular orbital method. Supercomputing, 2012.
[3]
Humayun Arafat, P. Sadayappan, James Dinan, Sriram Krishnamoorthy, and Theresa L. Windus. Load balancing of dynamical nucleation theory Monte Carlo simulations through resource sharing barriers. In IPDPS, pages 285--295, 2012.
[4]
Alexander A. Auer, Gerald Baumgartner, David E. Bernholdt, Alina Bibireata, Venkatesh Choppella, Daniel Cociorva, Xiaoyang Gao, Robert Harrison, Sriram Krishnamoorthy, Sandhya Krishnan, Chi-Chung Lam, Qingda Lu, Marcel Nooijen, Russell Pitzer, J. Ramanujam, P. Sadayappan, and Alexander Sibiryakov. Automatic code generation for many-body electronic structure methods: the tensor contraction engine. Molecular Physics, 104(2):211--228, 2006.
[5]
R.J. Bartlett. Coupled-cluster approach to molecular structure and spectra: a step toward predictive quantum chemistry. Journal of Physical Chemistry, 93(5):1697--1708, 1989.
[6]
R.J. Bartlett and M. Musiał. Coupled-cluster theory in quantum chemistry. Reviews of Modern Physics, 79(1):291--352, 2007.
[7]
S. H. Bokhari. On the mapping problem. IEEE Trans. Comput., 30(3):207--214, March 1981.
[8]
Yannick J. Bomble, John F. Stanton, Mihály Kállay, and Jürgen Gauss. Coupled-cluster methods including noniterative corrections for quadruple excitations. Journal of Chemical Physics, 123(5):054101, 2005.
[9]
E. J. Bylaska et. al. NWChem, a computational chemistry package for parallel computers, version 6.1.1, 2012.
[10]
F.A. Cotton. Chemical Applications of Group Theory. John Wiley & Sons, 2008.
[11]
T. D. Crawford and H. F. Schaefer III. An introduction to coupled cluster theory for computational chemists. In K. B. Lipkowitz and D. B. Boyd, editors, Reviews in Computational Chemistry, volume 14, chapter 2, pages 33--136. VCH Publishers, New York, 2000.
[12]
Karen Devine, Erik Boman, Robert Heaphy, Bruce Hendrickson, and Courtenay Vaughan. Zoltan data management services for parallel dynamic applications. Computing in Science and Engineering, 4(2):90--97, 2002.
[13]
James Dinan, D. Brian Larkins, P. Sadayappan, Sriram Krishnamoorthy, and Jarek Nieplocha. Scalable work stealing. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 53:1--53:11, New York, 2009. ACM.
[14]
Stephen Fink and Scott Baden. Runtime support for multi-tier programming of block-structured applications on SMP clusters. In Yutaka Ishikawa, Rodney Oldehoeft, John Reynders, and Marydell Tholburn, editors, Scientific Computing in Object-Oriented Parallel Environments, volume 1343 of Lecture Notes in Computer Science, pages 1--8. Springer Berlin / Heidelberg, 1997.
[15]
Stephen J. Fink, Scott B. Baden, and Scott R. Kohn. Efficient run-time support for irregular block-structured applications. Journal of Parallel and Distributed Computing, 50(1--2):61--82, 1998.
[16]
J. Fuchs and C. Schweigert. Symmetries, Lie Algebras and Representations: A Graduate Course for Physicists. Cambridge University Press, 2003.
[17]
Jürgen Gauss, John F. Stanton, and Rodney J. Bartlett. Coupled-cluster open-shell analytic gradients: Implementation of the direct product decomposition approach in energy gradient calculations. Journal of Chemical Physics, 95(4):2623--2638, 1991.
[18]
R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416--429, 1969.
[19]
Jeff R. Hammond, Sriram Krishnamoorthy, Sameer Shende, Nichols A. Romero, and Allen D. Malony. Performance characterization of global address space applications: a case study with NWChem. Concurrency and Computation: Practice and Experience, 24(2):135--154, 2012.
[20]
Robert J. Harrison. Portable tools and applications for parallel computers. International Journal of Quantum Chemistry, 40(6):847--863, 1991.
[21]
So Hirata. Tensor Contraction Engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. Journal of Physical Chemistry A, 107:9887--9897, 2003.
[22]
Mihály Kállay and Jürgen Gauss. Approximate treatment of higher excitations in coupled-cluster theory. Journal of Chemical Physics, 123(21):214105, 2005.
[23]
Jon Kleinberg and Éva Tardos. Algorithm Design. Addison Wesley, Massachusetts, USA, 2005.
[24]
Karol Kowalski, Jeff R. Hammond, Wibe A. de Jong, Peng-Dong Fan, Marat Valiev, Dunyou Wang, and Niranjan Govind. Coupled cluster calculations for large molecular and extended systems. In Jeffrey R. Reimers, editor, Computational Methods for Large Systems: Electronic Structure Approaches for Biotechnology and Nanotechnology. Wiley, 2011.
[25]
Sriram Krishnamoorthy, Umit Catalyurek, Jarek Nieplocha, and Atanas Rountev. Hypergraph partitioning for automatic memory hierarchy management. In Supercomputing (SC06), 2006.
[26]
Stanislaw A. Kucharski and Rodney J. Bartlett. Coupled-cluster methods that include connected quadruple excitations, $T_4$: CCSDTQ-1 and Q(CCSDT). Chemical Physics Letters, 158(6):550--555, 1989.
[27]
Stanislaw A. Kucharski and Rodney J. Bartlett. Recursive intermediate factorization and complete computational linearization of the coupled-cluster single, double, triple, and quadruple excitation equations. Theoretical Chemistry Accounts: Theory, Computation, and Modeling (Theoretica Chimica Acta), 80:387--405, 1991.
[28]
Stanislaw A. Kucharski and Rodney J. Bartlett. The coupled-cluster single, double, triple, and quadruple excitation method. Journal of Chemical Physics, 97(6):4282--4288, 1992.
[29]
Stanislaw A. Kucharski and Rodney J. Bartlett. An efficient way to include connected quadruple contributions into the coupled cluster method. Journal of Chemical Physics, 108(22):9221--9226, 1998.
[30]
Jaroslaw Nieplocha, Robert J. Harrison, and Richard J. Littlefield. Global arrays: a portable "shared-memory" programming model for distributed memory computers. In Supercomputing '94: Proceedings of the 1994 ACM/IEEE conference on Supercomputing, pages 340--349, New York, 1994. ACM.
[31]
Jaroslaw Nieplocha, Robert J. Harrison, and Richard J. Littlefield. Global arrays: A non-uniform-memory-access programming model for high-performance computers. The Journal of Supercomputing, 10:10--197, 1996.
[32]
J. Noga and R.J. Bartlett. The full CCSDT model for molecular electronic structure. Journal of Chemical Physics, 86(12):7041--7050, 1987.
[33]
Nevin Oliphant and Ludwik Adamowicz. Coupled-cluster method truncated at quadruples. The Journal of Chemical Physics, 95(9):6645--6651, 1991.
[34]
Elmar Peise and Paolo Bientinesi. Performance modeling for dense linear algebra. In Proceedings of the 3rd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS12), November 2012.
[35]
G.D. Purvis III and R.J. Bartlett. A full coupled-cluster singles and doubles model: the inclusion of disconnected triples. Journal of Chemical Physics, 76(4):1910--1918, 1982.
[36]
Krishnan Raghavachari, John A. Pople, Eric S. Replogle, and Martin Head-Gordon. Fifth-order Møller-Plesset perturbation theory: comparison of existing correlation methods and implementation of new methods correct to fifth order. Journal of Physical Chemistry, 94:5579--5586, 1990.
[37]
Krishnan Raghavachari, Gary W. Trucks, John A. Pople, and Martin Head-Gordon. A fifth-order perturbation comparison of electron correlation theories. Chemical Physics Letters, 157:479--483, May 1989.
[38]
Sameer S. Shende and Allen D. Malony. The tau parallel performance system. Int. J. High Perform. Comput. Appl., 20(2):287--311, May 2006.
[39]
Edgar Solomonik. Cyclops Tensor Framework. http://www.eecs.berkeley.edu/ solomon/cyclopstf/index.html.
[40]
Edgar Solomonik, Devin Matthews, Jeff Hammond, and James Demmel. Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), may 2013.
[41]
John Stanton. This remark is attributed to Devin Matthews.
[42]
John F. Stanton. Why CCSD(T) works: a different perspective. Chemical Physics Letters, 281:130--134, 1997.
[43]
John F. Stanton, Jürgen Gauss, John D. Watts, and Rodney J. Bartlett. A direct product decomposition approach for symmetry exploitation in many-body methods, I: Energy calculations. Journal of Chemical Physics, 94(6):4334--4345, 1991.
[44]
JivríuCívzek. On the correlation problem in atomic and molecular systems. calculation of wavefunction components in Ursell-Type expansion using quantum-field theoretical methods. Journal of Chemical Physics, 45(11):4256--4266, December 1966.
[45]
Miroslav Urban, Jozef Noga, Samuel J. Cole, and R.J. Bartlett. Towards a full CCSDT model for electron correlation. Journal of Chemical Physics, 83(8):4041--4046, 1985.
[46]
R. A. Van De Geijn and J. Watts. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience, 9(4):255--274, 1997.
[47]
John D. Watts and R.J. Bartlett. The coupled-cluster single, double, and triple excitation model for open-shell single reference functions. Journal of Chemical Physics, 93(8):6104--6105, 1990.

Cited By

View all

Index Terms

  1. Inspector/executor load balancing algorithms for block-sparse tensor contractions

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
    June 2013
    512 pages
    ISBN:9781450321303
    DOI:10.1145/2464996
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 June 2013

    Check for updates

    Author Tags

    1. dynamic load balancing
    2. global arrays
    3. quantum chemistry
    4. static partitioning
    5. tensor contractions

    Qualifiers

    • Poster

    Conference

    ICS'13
    Sponsor:
    ICS'13: International Conference on Supercomputing
    June 10 - 14, 2013
    Oregon, Eugene, USA

    Acceptance Rates

    ICS '13 Paper Acceptance Rate 43 of 202 submissions, 21%;
    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)2
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 30 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media