research-article

Spin Summations: A High-Performance Perspective

Authors:

Devin Matthews,

Paolo BientinesiAuthors Info & Claims

ACM Transactions on Mathematical Software (TOMS), Volume 45, Issue 1

Article No.: 10, Pages 1 - 22

https://doi.org/10.1145/3301319

Published: 14 March 2019 Publication History

Abstract

In addition to tensor contractions, one of the most pronounced computational bottlenecks in the nonorthogonally spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate forms—including CCSD(T) and CCSDT(Q)—are spin summations. At a first sight, spin summations are operations similar to tensor transpositions, but a closer look reveals additional challenges to high-performance calculations, including temporal locality and scattered memory accesses. This article explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of either the underlying hardware (e.g., caches, vectorization) or the problem itself (e.g., factorizability). The final algorithm combines the advantages of all the solutions while avoiding their drawbacks; this algorithm achieves high performance through parallelization and vectorization, and by exploiting the temporal locality inherent to spin summations. Combined, these optimizations result in speedups between 2.4× and 5.5× over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm can perform the spin summations in-place, thus reducing the memory footprint by 2× over an out-of-place variant.

References

[1]

B. G. Adams and J. Paldus. 1979. Orthogonally spin-adapted coupled-cluster theory for closed-shell systems including triexcited clusters. Physical Review A 20, 1 (1979), 1.

[2]

Roberto Ansaloni, Gian Luigi Bendazzoli, Stefano Evangelisti, and Elda Rossi. 2000. A parallel full-CI algorithm. Computer Physics Communications 128, 1 (2000), 496--515.

[3]

Edoardo Apra, Michael Klemm, and Karol Kowalski. 2014. Efficient implementation of many-body quantum chemical methods on the Intel<sup>®</sup> Xeon Phi coprocessor. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. IEEE, Los Alamitos, CA, 674--684.

Digital Library

[4]

Rodney J. Bartlett and Monika Musiał. 2007. Coupled-cluster theory in quantum chemistry. Reviews in Modern Physics 79, 1 (2007), 291--352.

[5]

G. Baumgartner, A. Auer, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, et al. 2005. Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proceedings of the IEEE 93, 2 (2005), 276--292.

[6]

Yannick J. Bomble, John F. Stanton, Mihály Kállay, and Jürgen Gauss. 2005. Coupled-cluster methods including noniterative corrections for quadruple excitations. Journal of Chemical Physics 123 (Aug. 2005), 4101.

[7]

Randal Bryant and David Richard O’Hallaron. 2003. Computer Systems: A Programmer’s Perspective. Vol. 2. Prentice Hall, Upper Saddle River, NJ.

[8]

R. J. Buenker and S. Krebs. 1999. The configuration-driven approach for multireference configuration interaction calculations. In Recent Advances in Multireference Methods, K. Hirao (Ed.). World Scientific, River Edge, NJ, 1--29.

[9]

Bryan Catanzaro, Alexander Keller, and Michael Garland. 2014. A decomposition for in-place matrix transposition. ACM SIGPLAN Notices 49, 8 (2014), 193--206.

Digital Library

[10]

S. Chatterjee and S. Sen. 2000. Cache-efficient matrix transposition. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture (HPCA-6). 195--205.

[11]

J. Čížek. 1966. On the correlation problem in atomic and molecular systems. Calculation of wavefunction components in Ursell-type expansion using quantum-field theoretical methods. Journal of Chemical Physics 45, 11 (1966), 4256--4266.

[12]

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, et al. 2008. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC’08). IEEE, Los Alamitos, CA, 4:1--4:12. http://dl.acm.org/citation.cfm?id=1413370.1413375

Digital Library

[13]

Edoardo Di Napoli, Diego Fabregat-Traver, Gregorio Quintana-Orti, and Paolo Bientinesi. 2014. Towards an efficient use of the BLAS library for multilinear tensor contractions. Applied Mathematics and Computation 235 (May 2014), 454--468.

[14]

C. H. Q. Ding. 2001. An optimal index reshuffle algorithm for multidimensional arrays and its applications for parallel architectures. IEEE Transactions on Parallel and Distributed Systems 12, 3 (2001), 306--315.

Digital Library

[15]

Zhengting Gan, Yuri Alexeev, Mark S. Gordon, and Ricky A. Kendall. 2003. The parallel implementation of a full configuration interaction program. Journal of Chemical Physics 119, 1 (2003), 47--59.

[16]

Geoffrey C. Goldbogen. 1981. PRIM: A fast matrix transpose method. IEEE Transactions on Software Engineering 7, 2 (1981), 255--257.

Digital Library

[17]

Kazushige Goto and Robert A. Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software 34, 3 (2008), 12.

Digital Library

[18]

John A. Gunnels, Greg M. Henry, and Robert A. Van De Geijn. 2001. A family of high-performance matrix multiplication algorithms. In Proceedings of the International Conference on Computational Sciences (ICCS’01). 51--60.

Digital Library

[19]

M. Hanrath and A. Engels-Putzka. 2010. An efficient matrix-matrix multiplication based antisymmetric tensor contraction engine for general order coupled cluster. Journal of Chemical Physics 133, 6 (2010), 064108.

[20]

A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baumgartner, et al. 2009. Performance optimization of tensor contraction expressions for many-body methods in quantum chemistry. Journal of Physical Chemistry A 113, 45 (2009), 12715--12723.

[21]

Y. He and C. H. Q. Ding. 2002. MPI and OpenMP paradigms on cluster of SMP architectures: The vacancy tracking algorithm for multi-dimensional array transposition. In Proceedings of the ACM/IEEE 2002 Conference on Supercomputing. 6.

Digital Library

[22]

T. Helgaker, P. Jørgensen, and J. Olsen. 2013. Molecular Electronic-Structure Theory. Wiley, New York, NY.

[23]

So Hirata. 2003. Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. Journal of Physical Chemistry A 107 (2003), 9887--9897.

[24]

Intel Corporation. 2015. Intel 64 and IA-32 Architectures Optimization Reference Manual. Available at http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.

[25]

Wanyi Jiang, Yuriy G. Khait, and Mark R. Hoffmann. 2009. Configuration-driven unitary group approach for generalized van Vleck variant multireference perturbation theory. Journal of Physical Chemistry A 113, 16 (2009), 4374--4380.

[26]

Jose L. Jodra, Ibai Gurrutxaga, and Javier Muguerza. 2015. Efficient 3D transpositions in graphics processing units. International Journal of Parallel Programming 43, 5 (Oct. 2015), 876--891.

Digital Library

[27]

Michael Klene and Michael A. Robb. 2000. Parallel implementation of the CI-vector evaluation in full CI/CAS-SCF. Journal of Chemical Physics 113, 14 (2000), 5653--5665.

[28]

Stanislaw A. Kucharski and Rodney J. Bartlett. 1992. The coupled-cluster single, double, triple, and quadruple excitation method. Journal of Chemical Physics 97, 6 (1992), 4282--4288.

[29]

P.-W. Lai, H. Zhang, S. Rajbhandari, E. Valeev, K. Kowalski, and P. Sadayappan. 2012. Effective utilization of tensor symmetry in operation optimization of tensor contraction expressions. Procedia Computer Science 9 (2012), 412--421.

[30]

Q. Lu, S. Krishnamoorthy, and P. Sadayappan. 2006. Combining analytical and empirical approaches in tuning matrix transposition. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques. ACM, New York, NY, 233--242.

Digital Library

[31]

Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84--91.

[32]

Wenjing Ma, Sriram Krishnamoorthy, Oreste Villa, and Karol Kowalski. 2011. GPU-based implementations of the noniterative regularized-CCSD(T) corrections: Applications to strongly correlated systems. Journal of Chemical Theory and Computation 7, 5 (2011), 1316--1327. 26610126.

[33]

Gabriel Mateescu, Gregory H. Bauer, and Robert A. Fiedler. 2012. Optimizing matrix transposes using a POWER7 cache model and explicit prefetching. ACM SIGMETRICS Performance Evaluation Review 40, 2 (2012), 68--73.

Digital Library

[34]

Devin A. Matthews. 2018. High-performance tensor contraction without transposition. SIAM Journal on Scientific Computing 40, 1 (2018), C1--24.

Digital Library

[35]

Devin A. Matthews, Jürgen Gauss, and John F. Stanton. 2013. Revisitation of nonorthogonal spin adaptation in coupled cluster theory. Journal of Chemical Theory and Computation 9, 6 (2013), 2567--2572.

[36]

Devin A. Matthews and John F. Stanton. 2015. Non-orthogonal spin-adaptation of coupled cluster methods: A new implementation of methods including quadruple excitations. Journal of Chemical Physics 142, 6 (Feb. 2015), 064108.

[37]

John McCalpin and Mark Smotherman. 1995. Automatic benchmark generation for cache optimization of matrix operations. In Proceedings of the 33rd Annual Southeast Regional Conference. ACM, New York, NY, 195--204.

Digital Library

[38]

John D. McCalpin. 1995. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter 2 (Dec. 1995), 19--25.

[39]

A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 2010. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis. 1--13.

Digital Library

[40]

Jozef Noga and Rodney J. Bartlett. 1987. The full CCSDT model for molecular electronic structure. Journal of Chemical Physics 86, 12 (June 1987), 7041--7050.

[41]

David A. Patterson and John L. Hennessy. 2007. Large and fast: Exploiting memory hierarchy. In Computer Organization and Design. Morgan Kaufmann, Burlington, MA, 474--476.

[42]

Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association.

[43]

V. R. Saunders and J. H. van Lenthe. 1983. The direct CI method. Molecular Physics 48, 5 (1983), 923--954.

[44]

Paul Springer and Paolo Bientinesi. 2018. Design of a high-performance GEMM-like tensor-tensor multiplication. ACM Transactions on Mathematical Software 44, 3 (Jan. 2018), Article 28, 29 pages.

Digital Library

[45]

Paul Springer, Jeff R. Hammond, and Paolo Bientinesi. 2017a. TTC: A high-performance compiler for tensor transpositions. ACM Transactions on Mathematical Software 44, 2 (Aug. 2017), Article 15, 21 pages.

Digital Library

[46]

Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A tensor transposition compiler for multiple architectures. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY’16). ACM, New York, NY, 41--46.

Digital Library

[47]

Paul Springer, Tong Su, and Paolo Bientinesi. 2017b. HPTT: A high-performance tensor transposition C++ library. In Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY’17). ACM, New York, NY, 56--62.

Digital Library

[48]

J. F. Stanton, J. Gauss, M. E. Harding, P. G. Szalay, A. A. Auer, R. J. Bartlett, et al. 2014. CFOUR, Coupled-Cluster techniques for Computational Chemistry, a quantum-chemical program package with the integral packages MOLECULE (J. Almlöf and P. R. Taylor), PROPS (P. R. Taylor), ABACUS (T. Helgaker, H. J. Jensen, P. Jørgensen, and J. Olsen), and ECP routines by A. V. Mitin and C. van Wüllen. For the current version, see http://www.cfour.de.

[49]

Marin van Heel. 1991. A fast algorithm for transposing large multidimensional image data sets. Ultramicroscopy 38, 1 (1991), 75--83.

[50]

Field G. Van Zee and Robert A. van de Geijn. 2015. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software 41, 3 (2015), 14:1--14:33.

Digital Library

[51]

Andrey Vladimirov. 2013. Multithreaded Transposition of Square Matrices With Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors. Available at https://hgpu.org/?p=11002.

[52]

Lai Wei and John Mellor-Crummey. 2014. Autotuning tensor transposition. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’14). IEEE, Los Alamitos, CA, 342--351.

Digital Library

Cited By

Liu LHuang YHuang S(2023)Prediction of battery capacity based on improved model of support vector regressionJournal of Physics: Conference Series10.1088/1742-6596/2427/1/0120122427:1(012012)Online publication date: 1-Feb-2023
https://doi.org/10.1088/1742-6596/2427/1/012012
Esselman BZdanovskaia MOwen AStanton JWoods RMcMahon R(2023)Precise Equilibrium Structure of BenzeneJournal of the American Chemical Society10.1021/jacs.3c03109145:40(21785-21797)Online publication date: 29-Sep-2023
https://doi.org/10.1021/jacs.3c03109
Herrmann NHanrath M(2021)Analysis of different sets of spin-adapted substitution operators in open-shell coupled cluster theoryMolecular Physics10.1080/00268976.2021.2005836120:5Online publication date: 23-Nov-2021
https://doi.org/10.1080/00268976.2021.2005836

Index Terms

Spin Summations: A High-Performance Perspective

Recommendations

Spin---orbit hybrid entangled channel for spin state quantum teleportation using genetic algorithms

We present a physical model of spin quantum teleportation protocol (QTP) in a triple quantum dot array using a genetic algorithm approach. The information to teleport is spin-coded in one electron confined in a single quantum dot (SQD). The remaining ...
Single-spin entanglement

We show that the operators and the quadrupole and Zeeman Hamiltonians for a spin $$\frac{3}{2}$$32 can be represented in terms for a system of two coupling fictitious spins $$\frac{1}{2}$$12 using the Kronecker product of Pauli matrices. Particularly, ...
Controlling Spin Qubits in Quantum Dots

We review progress on the spintronics proposal for quantum computing where the quantum bits (qubits) are implemented with electron spins. We calculate the exchange interaction of coupled quantum dots and present experiments, where the exchange coupling ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Mathematical Software

ACM Transactions on Mathematical Software Volume 45, Issue 1

March 2019

278 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/3314951

Editors:
Zhaojun Bai
University of California at Davis, USA
,
Wolfgang Bangerth
Colorado State University, USA

Issue’s Table of Contents

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 March 2019

Accepted: 01 November 2018

Revised: 01 July 2018

Received: 01 May 2017

Published in TOMS Volume 45, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Intel Corporation through Parallel Computing Center grants to RWTH Aachen and UT Austin
Arnold and Mabel Beckman Foundation
Deutsche Forschungsgemeinschaft (DFG)

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

3
Total Citations
View Citations
484
Total Downloads

Downloads (Last 12 months)25
Downloads (Last 6 weeks)1

Reflects downloads up to 08 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Liu LHuang YHuang S(2023)Prediction of battery capacity based on improved model of support vector regressionJournal of Physics: Conference Series10.1088/1742-6596/2427/1/0120122427:1(012012)Online publication date: 1-Feb-2023
https://doi.org/10.1088/1742-6596/2427/1/012012
Esselman BZdanovskaia MOwen AStanton JWoods RMcMahon R(2023)Precise Equilibrium Structure of BenzeneJournal of the American Chemical Society10.1021/jacs.3c03109145:40(21785-21797)Online publication date: 29-Sep-2023
https://doi.org/10.1021/jacs.3c03109
Herrmann NHanrath M(2021)Analysis of different sets of spin-adapted substitution operators in open-shell coupled cluster theoryMolecular Physics10.1080/00268976.2021.2005836120:5Online publication date: 23-Nov-2021
https://doi.org/10.1080/00268976.2021.2005836

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Figures

Tables

Media

View Issue’s Table of Contents