DOI: 10.1145/3404397.3404470

Vector Forward Mode Automatic Differentiation on SIMD/SIMT architectures

Published: 17 August 2020

Abstract

Automatic differentiation, back-propagation, differentiable programming, and related methods have received widespread attention because they compute accurate gradients of numerical programs for optimization, uncertainty quantification, and machine learning. Two strategies are commonly used: the forward mode, which is easy to implement but incurs an overhead over the original program that grows linearly with the number of inputs, and the reverse mode, which can compute gradients for an arbitrary number of program inputs at a constant-factor overhead, although that constant can be large, more memory is required, and the implementation is often challenging. Previous literature has shown that the forward mode can be parallelized and vectorized more easily than the reverse mode, but case studies investigating when either mode is the better choice are lacking, especially for modern CPUs and GPUs. In this paper, we demonstrate that the forward mode can outperform the reverse mode for programs with tens or hundreds of directional derivatives, a number that may yet increase if current hardware trends continue.
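The vector forward mode referred to in the title propagates several directional derivatives through the program simultaneously, amortizing the cost of the primal computation and mapping naturally onto SIMD/SIMT lanes. The paper's own implementation uses Julia (see the Author Tags); the following is a minimal Python sketch of the underlying idea only, with all names (`VecDual`, `_lift`) hypothetical and not taken from the paper:

```python
import numpy as np

class VecDual:
    """Dual number carrying a vector of p derivative components, so that
    p directional derivatives are propagated in a single forward sweep."""
    def __init__(self, val, dot):
        self.val = val                            # primal value
        self.dot = np.asarray(dot, dtype=float)   # p derivative components

    def __add__(self, other):
        other = _lift(other, len(self.dot))
        # sum rule, applied componentwise across all p directions at once
        return VecDual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other, len(self.dot))
        # product rule; the array arithmetic is what a compiler can vectorize
        return VecDual(self.val * other.val,
                       self.val * other.dot + other.val * self.dot)
    __rmul__ = __mul__

def _lift(x, p):
    """Promote a plain constant to a VecDual with zero derivatives."""
    return x if isinstance(x, VecDual) else VecDual(x, np.zeros(p))

def sin(x):
    # chain rule: d/dt sin(u(t)) = cos(u) * u'
    return VecDual(np.sin(x.val), np.cos(x.val) * x.dot)

# Example: f(x, y) = x*y + sin(x), differentiated at (2, 3).
def f(x, y):
    return x * y + sin(x)

# Seed each input with a unit direction; one evaluation yields the full gradient.
x = VecDual(2.0, [1.0, 0.0])
y = VecDual(3.0, [0.0, 1.0])
out = f(x, y)
# out.val == 6 + sin(2); out.dot == [3 + cos(2), 2]
```

With p seed directions, one evaluation produces p directional derivatives; the componentwise operations on `dot` are exactly the kind of short, fixed-length vector arithmetic that SIMD units and GPU warps execute efficiently, which is the effect the paper measures.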


Cited By

  • Optimization of Ported CFD Kernels on Intel Data Center GPU Max 1550 using oneAPI ESIMD. In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (2023), 1705–1712. DOI: 10.1145/3624062.3624251
  • Efficient GPU Implementation of Automatic Differentiation for Computational Fluid Dynamics. In 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC) (2023), 377–386. DOI: 10.1109/HiPC58850.2023.00055
  • Automatic Differentiation of C++ Codes on Emerging Manycore Architectures with Sacado. ACM Transactions on Mathematical Software 48, 4 (2022), 1–29. DOI: 10.1145/3560262


Published In

ICPP '20: Proceedings of the 49th International Conference on Parallel Processing
August 2020
844 pages
ISBN:9781450388160
DOI:10.1145/3404397

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Automatic Differentiation
  2. GPU
  3. Julia Language
  4. Reduced Precision
  5. SIMD
  6. Vector Forward Mode

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '20

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
