skip to main content
research-article

Spin Summations: A High-Performance Perspective

Published: 14 March 2019 Publication History

Abstract

In addition to tensor contractions, one of the most pronounced computational bottlenecks in the nonorthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms—including CCSD(T) and CCSDT(Q)—are spin summations. At a first sight, spin summations are operations similar to tensor transpositions, but a closer look reveals additional challenges to high-performance calculations, including temporal locality and scattered memory accesses. This article explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g., caches, vectorization) or the problem itself (e.g., factorizability). The final algorithm combines the advantages of all the solutions while avoiding their drawbacks; this algorithm achieves high performance through parallelization and vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4× and 5.5× over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2× over an out-of-place variant.

References

[1]
B. G. Adams and J. Paldus. 1979. Orthogonally spin-adapted coupled-cluster theory for closed-shell systems including triexcited clusters. Physical Review A 20, 1 (1979), 1.
[2]
Roberto Ansaloni, Gian Luigi Bendazzoli, Stefano Evangelisti, and Elda Rossi. 2000. A parallel full-CI algorithm. Computer Physics Communications 128, 1 (2000), 496--515.
[3]
Edoardo Apra, Michael Klemm, and Karol Kowalski. 2014. Efficient implementation of many-body quantum chemical methods on the Intel<sup>®</sup> Xeon Phi coprocessor. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE, Los Alamitos, CA, 674--684.
[4]
Rodney J. Bartlett and Monika Musiał. 2007. Coupled-cluster theory in quantum chemistry. Reviews in Modern Physics 79, 1 (2007), 291--352.
[5]
G. Baumgartner, A. Auer, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, et al. 2005. Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proceedings of the IEEE 93, 2 (2005), 276--292.
[6]
Yannick J. Bomble, John F. Stanton, Mihály Kállay, and Jürgen Gauss. 2005. Coupled-cluster methods including noniterative corrections for quadruple excitations. Journal of Chemical Physics 123 (Aug. 2005), 4101.
[7]
Randal Bryant and David Richard O’Hallaron. 2003. Computer Systems: A Programmer’s Perspective. Vol. 2. Prentice Hall, Upper Saddle River, NJ.
[8]
R. J. Buenker and S. Krebs. 1999. The configuration-driven approach for multireference configuration interaction calculations. In Recent Advances in Multireference Methods, K. Hirao (Ed.). World Scientific, River Edge, NJ, 1--29.
[9]
Bryan Catanzaro, Alexander Keller, and Michael Garland. 2014. A decomposition for in-place matrix transposition. ACM SIGPLAN Notices 49, 8 (2014), 193--206.
[10]
S. Chatterjee and S. Sen. 2000. Cache-efficient matrix transposition. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA-6). 195--205.
[11]
J. Čížek. 1966. On the correlation problem in atomic and molecular systems. Calculation of wavefunction components in Ursell-type expansion using quantum-field theoretical methods. Journal of Chemical Physics 45, 11 (1966), 4256--4266.
[12]
Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, et al. 2008. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC’08). IEEE, Los Alamitos, CA, 4:1--4:12. http://dl.acm.org/citation.cfm?id&equals;1413370.1413375
[13]
Edoardo Di Napoli, Diego Fabregat-Traver, Gregorio Quintana-Orti, and Paolo Bientinesi. 2014. Towards an efficient use of the BLAS library for multilinear tensor contractions. Applied Mathematics and Computation 235 (May 2014), 454--468.
[14]
C. H. Q. Ding. 2001. An optimal index reshuffle algorithm for multidimensional arrays and its applications for parallel architectures. IEEE Transactions on Parallel and Distributed Systems 12, 3 (2001), 306--315.
[15]
Zhengting Gan, Yuri Alexeev, Mark S. Gordon, and Ricky A. Kendall. 2003. The parallel implementation of a full configuration interaction program. Journal of Chemical Physics 119, 1 (2003), 47--59.
[16]
Geoffrey C. Goldbogen. 1981. PRIM: A fast matrix transpose method. IEEE Transactions on Software Engineering 7, 2 (1981), 255--257.
[17]
Kazushige Goto and Robert A. Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software 34, 3 (2008), 12.
[18]
John A. Gunnels, Greg M. Henry, and Robert A. Van De Geijn. 2001. A family of high-performance matrix multiplication algorithms. In Proceedings of the International Conference on Computational Sciences (ICCS’01). 51--60.
[19]
M. Hanrath and A. Engels-Putzka. 2010. An efficient matrix-matrix multiplication based antisymmetric tensor contraction engine for general order coupled cluster. Journal of Chemical Physics 133, 6 (2010), 064108.
[20]
A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baumgartner, et al. 2009. Performance optimization of tensor contraction expressions for many-body methods in quantum chemistry. Journal of Physical Chemistry A 113, 45 (2009), 12715--12723.
[21]
Y. He and C. H. Q. Ding. 2002. MPI and OpenMP paradigms on cluster of SMP architectures: The vacancy tracking algorithm for multi-dimensional array transposition. In Proceedings of the ACM/IEEE 2002 Conference on Supercomputing. 6.
[22]
T. Helgaker, P. Jørgensen, and J. Olsen. 2013. Molecular Electronic-Structure Theory. Wiley, New York, NY.
[23]
So Hirata. 2003. Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. Journal of Physical Chemistry A 107 (2003), 9887--9897.
[24]
Intel Corporation. 2015. Intel 64 and IA-32 Architectures Optimization Reference Manual. Available at http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.
[25]
Wanyi Jiang, Yuriy G. Khait, and Mark R. Hoffmann. 2009. Configuration-driven unitary group approach for generalized van Vleck variant multireference perturbation theory. Journal of Physical Chemistry A 113, 16 (2009), 4374--4380.
[26]
Jose L. Jodra, Ibai Gurrutxaga, and Javier Muguerza. 2015. Efficient 3D transpositions in graphics processing units. International Journal of Parallel Programming 43, 5 (Oct. 2015), 876--891.
[27]
Michael Klene and Michael A. Robb. 2000. Parallel implementation of the CI-vector evaluation in full CI/CAS-SCF. Journal of Chemical Physics 113, 14 (2000), 5653--5665.
[28]
Stanislaw A. Kucharski and Rodney J. Bartlett. 1992. The coupled-cluster single, double, triple, and quadruple excitation method. Journal of Chemical Physics 97, 6 (1992), 4282--4288.
[29]
P.-W. Lai, H. Zhang, S. Rajbhandari, E. Valeev, K. Kowalski, and P. Sadayappan. 2012. Effective utilization of tensor symmetry in operation optimization of tensor contraction expressions. Procedia Computer Science 9 (2012), 412--421.
[30]
Q. Lu, S. Krishnamoorthy, and P. Sadayappan. 2006. Combining analytical and empirical approaches in tuning matrix transposition. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques. ACM, New York, NY, 233--242.
[31]
Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84--91.
[32]
Wenjing Ma, Sriram Krishnamoorthy, Oreste Villa, and Karol Kowalski. 2011. GPU-based implementations of the noniterative regularized-CCSD(T) corrections: Applications to strongly correlated systems. Journal of Chemical Theory and Computation 7, 5 (2011), 1316--1327. 26610126.
[33]
Gabriel Mateescu, Gregory H. Bauer, and Robert A. Fiedler. 2012. Optimizing matrix transposes using a POWER7 cache model and explicit prefetching. ACM SIGMETRICS Performance Evaluation Review 40, 2 (2012), 68--73.
[34]
Devin A. Matthews. 2018. High-performance tensor contraction without transposition. SIAM Journal on Scientific Computing 40, 1 (2018), C1--24.
[35]
Devin A. Matthews, Jürgen Gauss, and John F. Stanton. 2013. Revisitation of nonorthogonal spin adaptation in coupled cluster theory. Journal of Chemical Theory and Computation 9, 6 (2013), 2567--2572.
[36]
Devin A. Matthews and John F. Stanton. 2015. Non-orthogonal spin-adaptation of coupled cluster methods: A new implementation of methods including quadruple excitations. Journal of Chemical Physics 142, 6 (Feb. 2015), 064108.
[37]
John McCalpin and Mark Smotherman. 1995. Automatic benchmark generation for cache optimization of matrix operations. In Proceedings of the 33rd Annual Southeast Regional Conference. ACM, New York, NY, 195--204.
[38]
John D. McCalpin. 1995. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter 2 (Dec. 1995), 19--25.
[39]
A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 2010. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis. 1--13.
[40]
Jozef Noga and Rodney J. Bartlett. 1987. The full CCSDT model for molecular electronic structure. Journal of Chemical Physics 86, 12 (June 1987), 7041--7050.
[41]
David A. Patterson and John L. Hennessy. 2007. Large and fast: Exploiting memory hierarchy. In Computer Organization and Design. Morgan Kaufmann, Burlington, MA, 474--476.
[42]
Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association.
[43]
V. R. Saunders and J. H. van Lenthe. 1983. The direct CI method. Molecular Physics 48, 5 (1983), 923--954.
[44]
Paul Springer and Paolo Bientinesi. 2018. Design of a high-performance GEMM-like tensor-tensor multiplication. ACM Transactions on Mathematical Software 44, 3 (Jan. 2018), Article 28, 29 pages.
[45]
Paul Springer, Jeff R. Hammond, and Paolo Bientinesi. 2017a. TTC: A high-performance compiler for tensor transpositions. ACM Transactions on Mathematical Software 44, 2 (Aug. 2017), Article 15, 21 pages.
[46]
Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A tensor transposition compiler for multiple architectures. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY’16). ACM, New York, NY, 41--46.
[47]
Paul Springer, Tong Su, and Paolo Bientinesi. 2017b. HPTT: A high-performance tensor transposition C++ library. In Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY’17). ACM, New York, NY, 56--62.
[48]
J. F. Stanton, J. Gauss, M. E. Harding, P. G. Szalay, A. A. Auer, R. J. Bartlett, et al. 2014. CFOUR, Coupled-Cluster techniques for Computational Chemistry, a quantum-chemical program package with the integral packages MOLECULE (J. Almlöf and P. R. Taylor), PROPS (P. R. Taylor), ABACUS (T. Helgaker, H. J. Jensen, P. Jørgensen, and J. Olsen), and ECP routines by A. V. Mitin and C. van Wüllen. For the current version, see http://www.cfour.de.
[49]
Marin van Heel. 1991. A fast algorithm for transposing large multidimensional image data sets. Ultramicroscopy 38, 1 (1991), 75--83.
[50]
Field G. Van Zee and Robert A. van de Geijn. 2015. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software 41, 3 (2015), 14:1--14:33.
[51]
Andrey Vladimirov. 2013. Multithreaded Transposition of Square Matrices With Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors. Available at https://hgpu.org/?p&equals;11002.
[52]
Lai Wei and John Mellor-Crummey. 2014. Autotuning tensor transposition. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’14). IEEE, Los Alamitos, CA, 342--351.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Mathematical Software
ACM Transactions on Mathematical Software  Volume 45, Issue 1
March 2019
278 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/3314951
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 March 2019
Accepted: 01 November 2018
Revised: 01 July 2018
Received: 01 May 2017
Published in TOMS Volume 45, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. High-performance computing
  2. in-place
  3. tensor transposition

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Intel Corporation through Parallel Computing Center grants to RWTH Aachen and UT Austin
  • Arnold and Mabel Beckman Foundation
  • Deutsche Forschungsgemeinschaft (DFG)

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)25
  • Downloads (Last 6 weeks)1
Reflects downloads up to 08 Feb 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media