Search | arXiv e-print repository

arXiv:2407.00026 [pdf, other]

Preparing for HPC on RISC-V: Examining Vectorization and Distributed Performance of an Astrophysics Application with HPX and Kokkos

Authors: Patrick Diehl, Panagiotis Syskakis, Gregor Daiß, Steven R. Brandt, Alireza Kheirkhahan, Srinivas Yadav Singanaboina, Dominic Marcello, Chris Taylor, John Leidel, Hartmut Kaiser

Abstract: In recent years, interest in RISC-V computing architectures has moved from academic to mainstream, especially in the field of High Performance Computing where energy limitations are increasingly a concern. As of this year, the first single board RISC-V CPUs implementing the finalized ratified vector specification are being released. The RISC-V vector specification follows in the tradition of vector processors found in the CDC STAR-100, the Cray-1, the Convex C-Series, and the NEC SX machines and accelerators. The family of vector processors offers support for variable-length array processing as opposed to the fixed-length processing functionality offered by SIMD. Vector processors offer opportunities to perform vector-chaining which allows temporary results to be used without the need to resolve memory references. In this work, we use the Octo-Tiger multi-physics, multi-scale, 3D adaptive mesh refinement astrophysics application to study these early RISC-V chips with vector machine support. We report on our experience in porting this modern C++ code (which is built upon several open-source libraries such as HPX and Kokkos) to RISC-V. In addition, we show the impact of the RISC-V Vector extension on a RISC-V single board computer by implementing the std::experimental::simd interface and integrating it with our code. We also compare the application's performance, scalability, and power consumption on a desktop-grade RISC-V computer to an A64FX system.

Submitted 15 August, 2024; v1 submitted 10 May, 2024; originally announced July 2024.

arXiv:2405.13101 [pdf, other]

Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust

Authors: Patrick Diehl, Noujoud Nader, Steve Brandt, Hartmut Kaiser

Abstract: This study evaluates the capabilities of ChatGPT versions 3.5 and 4 in generating code across a diverse range of programming languages. Our objective is to assess the effectiveness of these AI models for generating scientific programs. To this end, we asked ChatGPT to generate three distinct codes: a simple numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver. The focus of our analysis was on the compilation, runtime performance, and accuracy of the codes. While both versions of ChatGPT successfully created codes that compiled and ran (with some help), some languages were easier for the AI to use than others (possibly because of the size of the training sets used). Parallel codes -- even the simple example we chose to study here -- were also difficult for the AI to generate correctly.

Submitted 5 July, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

Comments: 9 pages, 3 figures

arXiv:2405.00016 [pdf, ps, other]

doi 10.1007/978-3-031-61763-8_17

HPX with Spack and Singularity Containers: Evaluating Overheads for HPX/Kokkos using an astrophysics application

Authors: Patrick Diehl, Steven R. Brandt, Gregor Daiß, Hartmut Kaiser

Abstract: Cloud computing for high performance computing resources is an emerging topic. This service is of interest to researchers who care about reproducible computing, for software packages with complex installations, and for companies or researchers who need the compute resources only occasionally or do not want to run and maintain a supercomputer on their own. The connection between HPC and containers is exemplified by the fact that Microsoft Azure's Eagle cloud service machine is number three on the November 2023 Top 500 list. For cloud services, the HPC application and dependencies are installed in containers, e.g. Docker, Singularity, or something else, and these containers are executed on the physical hardware. Although containerization leverages the existing Linux kernel and should not impose overheads on the computation, there is the possibility that machine-specific optimizations might be lost, particularly machine-specific installs of commonly used packages. In this paper, we will use an astrophysics application using HPX-Kokkos and measure overheads on homogeneous resources, e.g. Supercomputer Fugaku, using CPUs only and on heterogeneous resources, e.g. LSU's hybrid CPU and GPU system. We will report on challenges in compiling, running, and using the containers as well as performance differences.

Submitted 7 May, 2024; v1 submitted 11 February, 2024; originally announced May 2024.

arXiv:2405.00015 [pdf, other]

doi 10.1007/978-3-031-61763-8_11

Experiences Porting Distributed Applications to Asynchronous Tasks: A Multidimensional FFT Case-study

Authors: Alexander Strack, Christopher Taylor, Patrick Diehl, Dirk Pflüger

Abstract: Parallel algorithms relying on synchronous parallelization libraries often experience adverse performance due to global synchronization barriers. Asynchronous many-task runtimes offer task futurization capabilities that minimize or remove the need for global synchronization barriers. This paper conducts a case study of the multidimensional Fast Fourier Transform to identify which applications will benefit from the asynchronous many-task model. Our basis is the popular FFTW library. We use the asynchronous many-task model HPX and a one-dimensional FFTW backend to implement multiple versions using different HPX features and highlight overheads and pitfalls during migration. Furthermore, we add an HPX threading backend to FFTW. The case study analyzes shared memory scaling properties between our HPX-based parallelization and FFTW with its pthreads, OpenMP, and HPX backends. The case study also compares FFTW's MPI+X backend to a purely HPX-based distributed implementation. The FFT application does not profit from asynchronous task execution. In contrast, enforcing task synchronization results in better cache performance and thus better runtime. Nonetheless, the HPX backend for FFTW is competitive with existing backends. Our distributed HPX implementation based on HPX collectives using MPI parcelport performs similarly to FFTW's MPI+OpenMP. However, the LCI parcelport of HPX accelerated communication up to a factor of 5.

Submitted 2 May, 2024; v1 submitted 9 February, 2024; originally announced May 2024.
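
The abstract above builds a multidimensional FFT on top of a one-dimensional backend. As a rough illustration of that decomposition only, the sketch below computes a 2D transform by applying a 1D transform to every row and then to every column. It is not the authors' code: a naive O(n^2) DFT stands in for FFTW, and the names dft1d and dft2d are hypothetical.

```python
import cmath


def dft1d(x):
    """Naive 1D discrete Fourier transform (stand-in for a 1D FFT backend)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft2d(a):
    """2D transform of a rectangular list of lists via two passes of dft1d."""
    # Pass 1: transform each row independently.
    rows = [dft1d(row) for row in a]
    # Pass 2: transpose, transform each former column, transpose back.
    cols = [dft1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

The row transforms are independent of each other, as are the column transforms, which is what makes each pass easy to parallelize. The transpose between the two passes is the step where every row's result is needed before any column can proceed.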

arXiv:2405.00011 [pdf, other]

doi 10.1016/j.mechrescom.2024.104275

A Multiscale Fracture Model using Peridynamic Enrichment of Finite Elements within an Adaptive Partition of Unity: Experimental Validation

Authors: Matthias Birner, Patrick Diehl, Robert Lipton, Marc Alexander Schweitzer

Abstract: Partition of unity methods (PUM) are of domain decomposition type and provide the opportunity for multiscale and multiphysics numerical modeling. Within the PUM global-local enrichment scheme [1, 2] different physical models can exist to capture multiscale behavior. For instance, we consider classical linear elasticity globally and local zones where fractures occur. The elastic fields of the undamaged media provide appropriate boundary data for local peridynamic (PD) simulations on a subdomain containing the crack tip to grow the crack path. Once the updated crack path is found, the elastic field in the body and surrounding the crack is updated using PUM basis with appropriate enrichment near the crack. The subdomain for the PD simulation is chosen to include the current crack tip as well as nearby features that will influence crack growth. This paper is part II of this series and validates the combined PD/PUM simulator against the experimental results presented in [3]. The presented results show that we can attain good agreement between experimental and simulation data with a local PD subdomain that moves with the crack tip and has an adaptively chosen size.

Submitted 1 February, 2024; originally announced May 2024.

arXiv:2404.15388 [pdf, other]

doi 10.1615/JMachLearnModelComput.2024053706

ML-based identification of the interface regions for coupling local and nonlocal models

Authors: Noujoud Nader, Patrick Diehl, Marta D'Elia, Christian Glusa, Serge Prudhomme

Abstract: Local-nonlocal coupling approaches combine the computational efficiency of local models and the accuracy of nonlocal models. However, the coupling process is challenging, requiring expertise to identify the interface between local and nonlocal regions. This study introduces a machine learning-based approach to automatically detect the regions in which the local and nonlocal models should be used in a coupling approach. This identification process uses the loading functions and provides as output the selected model at the grid points. Training is based on datasets of loading functions for which reference coupling configurations are computed using accurate coupled solutions, where accuracy is measured in terms of the relative error between the solution to the coupling approach and the solution to the nonlocal model. We study two approaches that differ from one another in terms of the data structure. The first approach, referred to as the full-domain input data approach, inputs the full load vector and outputs a full label vector. In this case, the classification process is carried out globally. The second approach consists of a window-based approach, where loads are preprocessed and partitioned into windows and the problem is formulated as a node-wise classification approach in which the central point of each window is treated individually. The classification problems are solved via deep learning algorithms based on convolutional neural networks. The performance of these approaches is studied on one-dimensional numerical examples using F1-scores and accuracy metrics. In particular, it is shown that the windowing approach provides promising results, achieving an accuracy of 0.96 and an F1-score of 0.97. These results underscore the potential of the approach to automate coupling processes, leading to more accurate and computationally efficient solutions for material science applications.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 23 pages, 14 figures, research paper
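
The abstract above reports results as accuracy and F1-score for a node-wise classification. For readers unfamiliar with the two metrics, here is a minimal sketch of how they are computed from per-node labels. The function name and the choice of 1 as the "positive" class (for example, a node assigned to the nonlocal model) are assumptions for illustration and do not come from the paper.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for two equal-length label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 is the harmonic mean of precision and recall, i.e. 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

Accuracy counts every correctly labeled node, while F1 ignores true negatives, so the two can diverge when one class dominates the grid.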

arXiv:2401.03353 [pdf, other]

HPX -- An open source C++ Standard Library for Parallelism and Concurrency

Authors: Thomas Heller, Patrick Diehl, Zachary Byerly, John Biddiscombe, Hartmut Kaiser

Abstract: To achieve scalability with today's heterogeneous HPC resources, we need a dramatic shift in our thinking; MPI+X is not enough. Asynchronous Many Task (AMT) runtime systems break down the global barriers imposed by the Bulk Synchronous Programming model. HPX is an open-source, C++ Standards compliant AMT runtime system that is developed by a diverse international community of collaborators called The Ste||ar Group. HPX provides features which allow application developers to naturally use key design patterns, such as overlapping communication and computation, decentralizing of control flow, oversubscribing execution resources and sending work to data instead of data to work. The Ste||ar Group comprises physicists, engineers, and computer scientists; men and women from many different institutions and affiliations, and over a dozen different countries. We are committed to advancing the development of scalable parallel applications by providing a platform for collaborating and exchanging ideas. In this paper, we give a detailed description of the features HPX provides and how they help achieve scalability and programmability, a list of applications of HPX including two large NSF funded collaborations (STORM, for storm surge forecasting; and STAR (OctoTiger) an astro-physics project which runs at 96.8% parallel efficiency on 643,280 cores), and we end with a description of how HPX and the Ste||ar Group fit into the open source community.

Submitted 11 August, 2023; originally announced January 2024.

Journal ref: Proceedings of OpenSuCo 2017, Denver, Colorado USA, November 2017 (OpenSuCo 17)

arXiv:2309.06530 [pdf, other]

doi 10.1145/3624062.3624230

Evaluating HPX and Kokkos on RISC-V using an Astrophysics Application Octo-Tiger

Authors: Patrick Diehl, Gregor Daiss, Steven R. Brandt, Alireza Kheirkhahan, Hartmut Kaiser, Christopher Taylor, John Leidel

Abstract: In recent years, computers based on the RISC-V architecture have raised broad interest in the high-performance computing (HPC) community. As the RISC-V community develops the core instruction set architecture (ISA) along with ISA extensions, the HPC community has been actively ensuring HPC applications and environments are supported. In this context, assessing the performance of asynchronous many-task runtime systems (AMT) is essential. In this paper, we describe our experience with porting a full 3D adaptive mesh-refinement, multi-scale, multi-model, and multi-physics application, Octo-Tiger, that is based on the HPX AMT, and we explore its performance characteristics on different RISC-V systems. Considering the (limited) capabilities of the RISC-V test systems we used, Octo-Tiger already shows promising results and good scaling. We, however, expect that exceptional hardware support based on dedicated ISA extensions (such as single-cycle context switches, extended atomic operations, and direct support for HPX's global address space) would allow for even better performance results.

Submitted 17 August, 2023; originally announced September 2023.

arXiv:2308.11456 [pdf]

Deep learning-based denoising streamed from mobile phones improves speech-in-noise understanding for hearing aid users

Authors: Peter Udo Diehl, Hannes Zilly, Felix Sattler, Yosef Singer, Kevin Kepp, Mark Berry, Henning Hasemann, Marlene Zippel, Müge Kaya, Paul Meyer-Rachner, Annett Pudszuhn, Veit M. Hofmann, Matthias Vormann, Elias Sprengel

Abstract: The hearing loss of almost half a billion people is commonly treated with hearing aids. However, current hearing aids often do not work well in real-world noisy environments. We present a deep learning based denoising system that runs in real time on iPhone 7 and Samsung Galaxy S10 (25ms algorithmic latency). The denoised audio is streamed to the hearing aid, resulting in a total delay of around 75ms. In tests with hearing aid users having moderate to severe hearing loss, our denoising system improves audio across three tests: 1) listening for subjective audio ratings, 2) listening for objective speech intelligibility, and 3) live conversations in a noisy environment for subjective ratings. Subjective ratings increase by more than 40%, for both the listening test and the live conversation compared to a fitted hearing aid as a baseline. Speech reception thresholds, measuring speech understanding in noise, improve by 1.6 dB SRT. Ours is the first denoising system that is implemented on a mobile device, streamed directly to users' hearing aids using only a single channel as audio input while improving user satisfaction on all tested aspects, including speech intelligibility. This includes overall preference of the denoised and streamed signal over the hearing aid, thereby accepting the higher latency for the significant improvement in speech understanding.

Submitted 22 August, 2023; originally announced August 2023.

arXiv:2307.01117 [pdf, other]

doi 10.1007/978-3-031-48803-0_11

Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HPX, Go, Julia, Python, Rust, Swift, and Java

Authors: Patrick Diehl, Steven R. Brandt, Max Morris, Nikunj Gupta, Hartmut Kaiser

Abstract: Many scientific high performance codes that simulate e.g. black holes, coastal waves, climate and weather, etc. rely on block-structured meshes and use finite differencing methods to iteratively solve the appropriate systems of differential equations. In this paper we investigate implementations of an extremely simple simulation of this type using various programming systems and languages. We focus on a shared memory, parallelized algorithm that simulates a 1D heat diffusion using asynchronous queues for the ghost zone exchange. We discuss the advantages of the various platforms and explore the performance of this model code on different computing architectures: Intel, AMD, and ARM A64FX. In our results, Python was the slowest of the set we compared. Java, Go, Swift, and Julia were the intermediate performers. The higher performing platforms were C++, Rust, Chapel, Charm++, and HPX.

Submitted 10 July, 2023; v1 submitted 18 May, 2023; originally announced July 2023.
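
The abstract above describes a 1D heat diffusion solver whose subdomains exchange ghost zones. The sketch below illustrates that idea in serial form only; it is not the benchmark code from the paper. It assumes an explicit finite-difference update with periodic boundaries, and the names step_serial, step_chunked, and the parameter r (standing for alpha * dt / dx^2) are hypothetical. Each chunk is padded with one ghost cell copied from each neighbour, after which it can be updated without reading any other chunk.

```python
def step_serial(u, r):
    """One explicit update over the whole periodic domain."""
    n = len(u)
    return [u[i] + r * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]) for i in range(n)]


def step_chunked(u, r, nchunks):
    """Same update, performed chunk by chunk using ghost cells."""
    n = len(u)
    out = []
    for k in range(nchunks):
        lo, hi = k * n // nchunks, (k + 1) * n // nchunks
        # "Ghost zone exchange": copy one value from each neighbouring chunk.
        # u[lo - 1] wraps to the last cell when lo == 0 (periodic assumption).
        padded = [u[lo - 1]] + u[lo:hi] + [u[hi % n]]
        # Update only the cells this chunk owns, never the ghost cells.
        out.extend(
            padded[j] + r * (padded[j - 1] - 2.0 * padded[j] + padded[j + 1])
            for j in range(1, len(padded) - 1)
        )
    return out
```

In a parallel version each chunk would be owned by a separate worker, and the two ghost values would arrive through a queue or message rather than a direct list read. The explicit scheme is only stable for r up to 0.5.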

arXiv:2304.11002 [pdf, other]

doi 10.1109/IPDPSW59300.2023.00116

Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku

Authors: Patrick Diehl, Gregor Daiß, Kevin Huck, Dominic Marcello, Sagiv Shiber, Hartmut Kaiser, Dirk Pflüger

Abstract: The increasing availability of machines relying on non-GPU architectures, such as ARM A64FX in high-performance computing, provides a set of interesting challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute kernel sizes to accommodate different execution characteristics for various heterogeneous workloads. In this paper, we demonstrate an approach to code and performance portability that is based entirely on established standards in the industry. In addition to applying Kokkos as an abstraction over the execution of compute kernels on different heterogeneous execution environments, we show that the use of standard C++ constructs as exposed by the HPX runtime system enables superb portability in terms of code and performance based on the real-world Octo-Tiger astrophysics application. We report our experience with porting Octo-Tiger to the ARM A64FX architecture provided by Stony Brook's Ookami and Riken's Supercomputer Fugaku and compare the resulting performance with that achieved on well established GPU-oriented HPC machines such as ORNL's Summit, NERSC's Perlmutter and CSCS's Piz Daint systems. Octo-Tiger scaled well on Supercomputer Fugaku without any major code changes due to the abstraction levels provided by HPX and Kokkos. Adding vectorization support for ARM's SVE to Octo-Tiger was trivial thanks to using standard C++

Submitted 15 March, 2023; originally announced April 2023.

arXiv:2303.08058 [pdf, other]

doi 10.1145/3585341.3585354

Stellar Mergers with HPX-Kokkos and SYCL: Methods of using an Asynchronous Many-Task Runtime System with SYCL

Authors: Gregor Daiß, Patrick Diehl, Hartmut Kaiser, Dirk Pflüger

Abstract: Ranging from NVIDIA GPUs to AMD GPUs and Intel GPUs: Given the heterogeneity of available accelerator cards within current supercomputers, portability is a key aspect for modern HPC applications. In Octo-Tiger, we rely on Kokkos and its various execution spaces for portable compute kernels. In turn, we use HPX to coordinate kernel launches, CPU tasks, and communication. This combination allows us to have a fine interleaving between portable CPU/GPU computations and communication, enabling scalability on various supercomputers. However, for HPX and Kokkos to work together optimally, we need to be able to treat Kokkos kernels as HPX tasks. Otherwise, instead of integrating asynchronous Kokkos kernel launches into HPX's task graph, we would have to actively wait for them with fence commands, which wastes CPU time better spent otherwise. Using an integration layer called HPX-Kokkos, treating Kokkos kernels as tasks already works for some Kokkos execution spaces (like the CUDA one), but not for others (like the SYCL one). In this work, we started making Octo-Tiger and HPX itself compatible with SYCL. To do so, we introduce numerous software changes, most notably an HPX-SYCL integration. This integration allows us to treat SYCL events as HPX tasks, which in turn allows us to better integrate Kokkos by extending the support of HPX-Kokkos to also fully support Kokkos' SYCL execution space. We show two ways to implement this HPX-SYCL integration and test them using Octo-Tiger and its Kokkos kernels, on both an NVIDIA A100 and an AMD MI100. We find modest, yet noticeable, speedups by enabling this integration, even when just running simple single-node scenarios with Octo-Tiger where communication and CPU utilization are not yet an issue.

Submitted 8 May, 2023; v1 submitted 4 March, 2023; originally announced March 2023.

arXiv:2302.07191 [pdf, ps, other]

doi 10.1007/978-3-031-32316-4_3

Shared memory parallelism in Modern C++ and HPX

Authors: Patrick Diehl, Steven R. Brandt, Hartmut Kaiser

Abstract: Parallel programming remains a daunting challenge, from the struggle to express a parallel algorithm without cluttering the underlying synchronous logic, to describing which devices to employ in a calculation, to correctness. Over the years, numerous solutions have arisen, many of them requiring new programming languages, extensions to programming languages, or the addition of pragmas. Support for these various tools and extensions is available to a varying degree. In recent years, the C++ standards committee has worked to refine the language features and libraries needed to support parallel programming on a single computational node. Eventually, all major vendors and compilers will provide robust and performant implementations of these standards. Until then, the HPX library and runtime provides cutting edge implementations of the standards, as well as proposed standards and extensions. Because of these advances, it is now possible to write high performance parallel code without custom extensions to C++. We provide an overview of modern parallel programming in C++, describing the language and library features, and providing brief examples of how to use them.

Submitted 9 August, 2023; v1 submitted 16 January, 2023; originally announced February 2023.

Comments: Extended paper for the special issue

arXiv:2210.06439 [pdf, other]

doi 10.1109/ESPM256814.2022.00007

From Merging Frameworks to Merging Stars: Experiences using HPX, Kokkos and SIMD Types

Authors: Gregor Daiß, Srinivas Yadav Singanaboina, Patrick Diehl, Hartmut Kaiser, Dirk Pflüger

Abstract: Octo-Tiger, a large-scale 3D AMR code for the merger of stars, uses a combination of HPX, Kokkos and explicit SIMD types, aiming to achieve performance-portability for a broad range of heterogeneous hardware. However, on A64FX CPUs, we encountered several missing pieces, hindering performance by causing problems with the SIMD vectorization. Therefore, we add std::experimental::simd as an option to use in Octo-Tiger's Kokkos kernels alongside Kokkos SIMD, and further add a new SVE (Scalable Vector Extensions) SIMD backend. Additionally, we amend missing SIMD implementations in the Kokkos kernels within Octo-Tiger's hydro solver. We test our changes by running Octo-Tiger on three different CPUs: An A64FX, an Intel Icelake and an AMD EPYC CPU, evaluating SIMD speedup and node-level performance. We get a good SIMD speedup on the A64FX CPU, as well as noticeable speedups on the other two CPU platforms. However, we also experience a scaling issue on the EPYC CPU.

Submitted 8 May, 2023; v1 submitted 26 September, 2022; originally announced October 2022.

arXiv:2210.06438 [pdf, other]

doi 10.1109/P3HPC56579.2022.00014

From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels

Authors: Gregor Daiß, Patrick Diehl, Dominic Marcello, Alireza Kheirkhahan, Hartmut Kaiser, Dirk Pflüger

Abstract: Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the computations themselves, we use Kokkos to turn these tasks into compute kernels capable of running on hardware ranging from a few CPU cores to powerful accelerators. There is a missing link, however: while the fine-grained parallelism exposed by HPX is useful for scalability, it can hinder GPU performance when the tasks become too small to saturate the device, causing low resource utilization. To bridge this gap, we investigate multiple different GPU work aggregation strategies within Octo-Tiger, adding one new strategy, and evaluate the node-level performance impact on recent AMD and NVIDIA GPUs, achieving noticeable speedups.

Submitted 4 March, 2023; v1 submitted 26 September, 2022; originally announced October 2022.

arXiv:2210.06437 [pdf, other]

Distributed, combined CPU and GPU profiling within HPX using APEX

Authors: Patrick Diehl, Gregor Daiss, Kevin Huck, Dominic Marcello, Sagiv Shiber, Hartmut Kaiser, Juhan Frank, Geoffrey C. Clayton, Dirk Pflueger

Abstract: Benchmarking and comparing performance of a scientific simulation across hardware platforms is a complex task. When the simulation in question is constructed with an asynchronous, many-task (AMT) runtime offloading work to GPUs, the task becomes even more complex. In this paper, we discuss the use of a uniquely suited performance measurement library, APEX, to capture the performance behavior of a simulation built on HPX, a highly scalable, distributed AMT runtime. We examine the performance of the astrophysics simulation carried out by Octo-Tiger on two different supercomputing architectures. We analyze the results of scaling and measurement overheads. In addition, we look in-depth at two similarly configured executions on the two systems to study how architectural differences affect performance and identify opportunities for optimization. As one such opportunity, we optimize the communication for the hydro solver and investigate its performance impact.

Submitted 21 September, 2022; originally announced October 2022.

arXiv:2207.12127 [pdf, other]

doi 10.1007/978-3-031-31209-0_1

Quantifying Overheads in Charm++ and HPX using Task Bench

Authors: Nanmiao Wu, Ioannis Gonidelis, Simeng Liu, Zane Fink, Nikunj Gupta, Karame Mohammadiporshokooh, Patrick Diehl, Hartmut Kaiser, Laxmikant V. Kale

Abstract: Asynchronous Many-Task (AMT) runtime systems take advantage of multi-core architectures with light-weight threads, asynchronous executions, and smart scheduling. In this paper, we present the comparison of the AMT systems Charm++ and HPX with the mainstream MPI, OpenMP, and MPI+OpenMP libraries using the Task Bench benchmarks. Charm++ is a parallel programming language based on C++, supporting stackless tasks as well as light-weight threads asynchronously along with an adaptive runtime system. HPX is a C++ library for concurrency and parallelism, exposing a C++ standards-conforming API. First, we analyze the commonalities, differences, and advantageous scenarios of Charm++ and HPX in detail. Further, to investigate the potential overheads introduced by the tasking systems of Charm++ and HPX, we utilize an existing parameterized benchmark, Task Bench, wherein 15 different programming systems were implemented, e.g., MPI, OpenMP, MPI + OpenMP, and extend Task Bench by adding HPX implementations. We quantify the overheads of Charm++, HPX, and the mainstream libraries in different scenarios where a single task and multi-task are assigned to each core, respectively. We also investigate each system's scalability and the ability to hide the communication latency.

Submitted 21 July, 2022; originally announced July 2022.

arXiv:2206.11567 [pdf]

Restoring speech intelligibility for hearing aid users with deep learning

Authors: Peter Udo Diehl, Yosef Singer, Hannes Zilly, Uwe Schönfeld, Paul Meyer-Rachner, Mark Berry, Henning Sprekeler, Elias Sprengel, Annett Pudszuhn, Veit M. Hofmann

Abstract: Almost half a billion people world-wide suffer from disabling hearing loss. While hearing aids can partially compensate for this, a large proportion of users struggle to understand speech in situations with background noise. Here, we present a deep learning-based algorithm that selectively suppresses noise while maintaining speech signals. The algorithm restores speech intelligibility for hearing aid users to the level of control subjects with normal hearing. It consists of a deep network that is trained on a large custom database of noisy speech signals and is further optimized by a neural architecture search, using a novel deep learning-based metric for speech intelligibility. The network achieves state-of-the-art denoising on a range of human-graded assessments, generalizes across different noise categories and - in contrast to classic beamforming approaches - operates on a single microphone. The system runs in real time on a laptop, suggesting that large-scale deployment on hearing aid chips could be achieved within a few years. Deep learning-based denoising therefore holds the potential to improve the quality of life of millions of hearing impaired people soon.

Submitted 23 June, 2022; originally announced June 2022.

arXiv:2206.06302 [pdf, other]

doi 10.1007/978-3-319-46079-6_2

Closing the Performance Gap with Modern C++

Authors: Thomas Heller, Hartmut Kaiser, Patrick Diehl, Dietmar Fey, Marc Alexander Schweitzer

Abstract: On the way to Exascale, programmers face the increasing challenge of having to support multiple hardware architectures from the same code base. At the same time, portability of code and performance are increasingly difficult to achieve as hardware architectures are becoming more and more diverse. Today's heterogeneous systems often include two or more completely distinct and incompatible hardware execution models, such as GPGPUs, SIMD vector units, and general purpose cores which conventionally have to be programmed using separate tool chains representing non-overlapping programming models. The recent revival of interest in the industry and the wider community for the C++ language has spurred a remarkable amount of standardization proposals and technical specifications in the arena of concurrency and parallelism. This recently includes an increasing amount of discussion around the need for a uniform, higher-level abstraction and programming model for parallelism in the C++ standard targeting heterogeneous and distributed computing. Such an abstraction should perfectly blend with existing, already standardized language and library features, but should also be generic enough to support future hardware developments. In this paper, we present the results from developing such a higher-level programming abstraction for parallelism in C++ which aims at enabling code and performance portability over a wide range of architectures and for various types of parallelism.
We present and compare performance data obtained from running the well-known STREAM benchmark ported to our higher level C++ abstraction with the corresponding results from running it natively. We show that our abstractions enable performance at least as good as the comparable base-line benchmarks while providing a uniform programming API on all compared target architectures.

Submitted 30 May, 2022; originally announced June 2022.

arXiv:2203.09934 [pdf, other]

doi 10.1007/s42102-022-00083-4

Coupling approaches for classical linear elasticity and bond-based peridynamic models

Authors: Patrick Diehl, Serge Prudhomme

Abstract: Local-nonlocal coupling approaches provide a means to combine the computational efficiency of local models and the accuracy of nonlocal models. This paper studies the continuous and discrete formulations of three existing approaches for the coupling of classical linear elasticity and bond-based peridynamic models, namely 1) a method that enforces matching displacements in an overlap region, 2) a variant that enforces a constraint on the stresses instead, and 3) a method that considers a variable horizon in the vicinity of the interfaces. The performance of the three coupling approaches is compared on a series of one-dimensional numerical examples that involve cubic and quartic manufactured solutions. Accuracy of the proposed methods is measured in terms of the difference between the solution to the coupling approach and the solution to the classical linear elasticity model, which can be viewed as a modeling error. The objective of the paper is to assess the quality and performance of the discrete formulation for this class of force-based coupling methods.

Submitted 4 March, 2022; originally announced March 2022.

arXiv:2108.02336 [pdf, other]

doi 10.1016/j.advengsoft.2022.103360

A Fracture Multiscale Model for Peridynamic enrichment within the Partition of Unity Method

Authors: Matthias Birner, Patrick Diehl, Robert Lipton, Marc Alexander Schweitzer

Abstract: Partition of unity methods (PUM) are of domain decomposition type and provide the opportunity for multiscale and multiphysics numerical modeling. Different physical models can exist within a PUM scheme for handling problems with zones of linear elasticity and zones where fractures occur. Here, the peridynamic (PD) model is used in regions of fracture and smooth PUM is used in the surrounding linear elastic media. The method is a so-called global-local enrichment strategy. The elastic fields of the undamaged media provide appropriate boundary data for the localized PD simulations. The first steps for a combined PD/PUM simulator are presented. In part I of this series, we show that the local PD approximation can be utilized to enrich the global PUM approximation to capture the true material response with high accuracy efficiently. Test problems are provided demonstrating the validity and potential of this numerical approach.

Submitted 2 February, 2023; v1 submitted 4 August, 2021; originally announced August 2021.

arXiv:2107.10987 [pdf, other]

doi 10.1109/Cluster48925.2021.00059

Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit

Authors: Patrick Diehl, Gregor Daiß, Dominic Marcello, Kevin Huck, Sagiv Shiber, Hartmut Kaiser, Juhan Frank, Dirk Pflüger

Abstract: Octo-Tiger is a code for modeling three-dimensional self-gravitating astrophysical fluids. It was particularly designed for the study of dynamical mass transfer between interacting binary stars. Octo-Tiger is parallelized for distributed systems using the asynchronous many-task runtime system, the C++ standard library for parallelism and concurrency (HPX) and utilizes CUDA for its gravity solver. Recently, we have remodeled Octo-Tiger's hydro solver to use a three-dimensional reconstruction scheme. In addition, we have ported the hydro solver to GPU using CUDA kernels. We present scaling results for the new hydro kernels on ORNL's Summit machine using a Sedov-Taylor blast wave problem. We also compare Octo-Tiger's new hydro scheme with its old hydro scheme, using a rotating star as a test problem.

Submitted 26 July, 2021; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: Accepted to IEEE Cluster

arXiv:2102.03819 [pdf, other]

doi 10.1109/IPDPSW52791.2021.00103

Load balancing for distributed nonlocal models within asynchronous many-task systems

Authors: Pranav Gadikar, Patrick Diehl, Prashant K. Jha

Abstract: In this work, we consider the challenges of developing a distributed solver for models based on nonlocal interactions. In nonlocal models, in contrast to local models, such as the wave and heat partial differential equations, the material interacts with neighboring points on a larger length scale compared to the mesh discretization. In developing a fully distributed solver, the interaction over a length scale greater than the mesh size introduces additional data dependencies among the compute nodes and a communication bottleneck. In this work, we carefully look at these challenges in the context of nonlocal models; to keep the presentation specific to the computational issues, we consider a nonlocal heat equation in a 2d setting. In particular, the distributed framework we propose pays greater attention to the bottleneck of data communication and the dynamic balancing of loads among nodes with varying compute capacity. For load balancing, we propose a novel framework that assesses the compute capacity of nodes and dynamically balances the load so that the idle time among nodes is minimal. Our framework relies heavily on the HPX library, an asynchronous many-task runtime system. We present several results demonstrating the effectiveness of the proposed framework.

Submitted 9 April, 2021; v1 submitted 7 February, 2021; originally announced February 2021.

arXiv:2102.00223 [pdf, other]

doi 10.1109/MCSE.2021.3073626

Performance Measurements within Asynchronous Task-based Runtime Systems: A Double White Dwarf Merger as an Application

Authors: Patrick Diehl, Dominic Marcello, Parsa Amini, Hartmut Kaiser, Sagiv Shiber, Geoffrey C. Clayton, Juhan Frank, Gregor Daiß, Dirk Pflüger, David Eder, Alice Koniges, Kevin Huck

Abstract: Analyzing performance within asynchronous many-task-based runtime systems is challenging because millions of tasks are launched concurrently. Especially for long-term runs the amount of data collected becomes overwhelming. We study HPX and its performance-counter framework and APEX to collect performance data and energy consumption. We added HPX application-specific performance counters to the Octo-Tiger full 3D AMR astrophysics application. This enables the combined visualization of physical and performance data to highlight bottlenecks with respect to different solvers. We examine the overhead introduced by these measurements, which is around 1%, with respect to the overall application runtime. We perform a convergence study for four different levels of refinement and analyze the application's performance with respect to adaptive grid refinement. The measurements' overheads are small, enabling the combined use of performance data and physical properties with the goal of improving the code's performance. All of these measurements were obtained on NERSC's Cori, Louisiana Optical Network Infrastructure's QueenBee2, and Indiana University's Big Red 3.

Submitted 9 June, 2021; v1 submitted 30 January, 2021; originally announced February 2021.

arXiv:2010.04106 [pdf, other]

doi 10.1109/ESPM251964.2020.00007

Deploying a Task-based Runtime System on Raspberry Pi Clusters

Authors: Nikunj Gupta, Steve R. Brandt, Bibek Wagle, Nanmiao, Alireza Kheirkhahan, Patrick Diehl, Hartmut Kaiser, Felix W. Baumann

Abstract: Arm technology is becoming increasingly important in HPC. Recently, Fugaku, an Arm-based system, was awarded the number one place in the Top500 list. Raspberry Pis provide an inexpensive platform to become familiar with this architecture. However, Pis can also be useful on their own. Here we describe our efforts to configure and benchmark the use of a Raspberry Pi cluster with the HPX/Phylanx platform (normally intended for use with HPC applications) and document the lessons we learned. First, we highlight the required changes in the configuration of the Pi to gain performance. Second, we explore how limited memory bandwidth limits the use of all cores in our shared memory benchmarks. Third, we evaluate whether low network bandwidth affects distributed performance. Fourth, we discuss the power consumption and the resulting trade-off in cost of operation and performance.

Submitted 9 April, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

arXiv:2010.03012 [pdf, other]

doi 10.1109/DLS51937.2020.00008

Towards a Scalable and Distributed Infrastructure for Deep Learning Applications

Authors: Bita Hasheminezhad, Shahrzad Shirzad, Nanmiao Wu, Patrick Diehl, Hannes Schulz, Hartmut Kaiser

Abstract: Although recent scaling up approaches to training deep neural networks have proven to be effective, the computational intensity of large and complex models, as well as the availability of large-scale datasets, require deep learning frameworks to utilize scaling out techniques. Parallelization approaches and distribution requirements are not considered in the preliminary designs of most available distributed deep learning frameworks, and most of them still are not able to perform effective and efficient fine-grained inter-node communication. We present Phylanx, which has the potential to alleviate these shortcomings. Phylanx offers a productivity-oriented frontend where user Python code is translated to a futurized execution tree that can be executed efficiently on multiple nodes using the C++ standard library for parallelism and concurrency (HPX), leveraging fine-grained threading and an active messaging task-based runtime system.

Submitted 19 April, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

arXiv:2008.09725 [pdf, other]

doi 10.1016/j.cma.2020.113391

On the treatment of boundary conditions for bond-based peridynamic models

Authors: Serge Prudhomme, Patrick Diehl

Abstract: In this paper, we propose two approaches to apply boundary conditions for bond-based peridynamic models. There has been in recent years a renewed interest in the class of so-called non-local models, which include peridynamic models, for the simulation of structural mechanics problems as an alternative approach to classical local continuum models. However, a major issue, which is often disregarded when dealing with this class of models, is concerned with the manner by which boundary conditions should be prescribed. Our point of view here is that classical boundary conditions, since applied on surfaces of solid bodies, are naturally associated with local models. The paper describes two methods to incorporate classical Dirichlet and Neumann boundary conditions into bond-based peridynamics. The first method consists in artificially extending the domain with a thin boundary layer over which the displacement field is required to behave as an odd function with respect to the boundary points. The second method resorts to the idea that peridynamic models and local models should be compatible in the limit that the so-called horizon vanishes. The approach consists then in decreasing the horizon from a constant value in the interior of the domain to zero at the boundary so that one can directly apply the classical boundary conditions. We present the continuous and discrete formulations of the two methods and assess their performance on several numerical experiments dealing with the simulation of a one-dimensional bar.

Submitted 21 August, 2020; originally announced August 2020.
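
Illustration for arXiv:2008.09725: the first method in the abstract above extends the domain by a thin layer in which the displacement behaves as an odd function about the boundary point. The sketch below shows one possible reading of that idea for a Dirichlet value at the left end of a one-dimensional bar. The function name, the uniform node spacing, and the mirroring formula are assumptions made for illustration; they are not taken from the paper.

```python
def odd_extension_ghost_values(u_interior, boundary_value, num_ghost):
    """Ghost displacements for nodes mirrored across a left boundary node.

    Hypothetical helper (not from the paper). u_interior[k] is the
    displacement at distance (k + 1) * h inside the domain. The ghost value
    at the same distance outside is chosen so that (u - boundary_value) is
    an odd function about the boundary point.
    """
    if num_ghost > len(u_interior):
        raise ValueError("need at least num_ghost interior values to mirror")
    # Oddness about the boundary: u_ghost - g = -(u_mirror - g)
    return [2.0 * boundary_value - u_interior[k] for k in range(num_ghost)]
```

With interior values 1.5 and 2.0 next to a prescribed boundary displacement of 1.0, the two mirrored ghost nodes receive 0.5 and 0.0, so the extended field passes through the boundary value with a continuous slope.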

arXiv:2002.07970 [pdf, other]

Supporting OpenMP 5.0 Tasks in hpxMP -- A study of an OpenMP implementation within Task Based Runtime Systems

Authors: Tianyi Zhang, Shahrzad Shirzad, Bibek Wagle, Adrian S. Lemoine, Patrick Diehl, Hartmut Kaiser

Abstract: OpenMP has been the de facto standard for single node parallelism for more than a decade. Recently, asynchronous many-task runtime (AMT) systems have increased in popularity as a new programming paradigm for high performance computing applications. One of the major challenges of this new paradigm is the incompatibility of the OpenMP thread model and other AMTs. Highly optimized OpenMP-based libraries do not perform well when coupled with AMTs because the threading of both libraries will compete for resources. This paper is a follow-up paper on the fundamental implementation of hpxMP, an implementation of the OpenMP standard which utilizes the C++ standard library for Parallelism and Concurrency (HPX) to schedule and manage tasks. In this paper, we present the implementation of task features, e.g. taskgroup, task depend, and task_reduction, of the OpenMP 5.0 standard and optimization of the #pragma omp parallel for pragma. We use the daxpy benchmark, the Barcelona OpenMP Tasks Suite, Parallel research kernels, and OpenBLAS benchmarks to compare the different OpenMP implementations: hpxMP, llvm-OpenMP, and GOMP.

Submitted 18 February, 2020; originally announced February 2020.

arXiv:1909.03947 [pdf, other]

doi 10.1109/MLHPC49564.2019.00009

Scheduling optimization of parallel linear algebra algorithms using Supervised Learning

Authors: G. Laberge, S. Shirzad, P. Diehl, H. Kaiser, S. Prudhomme, A. Lemoine

Abstract: Linear algebra algorithms are used widely in a variety of domains, e.g., machine learning, numerical physics and video game graphics. For all these applications, loop-level parallelism is required to achieve high performance. However, finding the optimal way to schedule the workload between threads is a non-trivial problem because it depends on the structure of the algorithm being parallelized and the hardware the executable is run on. In the realm of Asynchronous Many Task runtime systems, a key aspect of the scheduling problem is predicting the proper chunk-size, where the chunk-size is defined as the number of iterations of a for-loop assigned to a thread as one task. In this paper, we study the applications of supervised learning models to predict the chunk-size which yields maximum performance on multiple parallel linear algebra operations using the HPX backend of Blaze's linear algebra library. More precisely, we generate our training and test sets by measuring performance of the application with different chunk-sizes for multiple linear algebra operations: vector-addition, matrix-vector-multiplication, matrix-matrix addition and matrix-matrix-multiplication. We compare the use of logistic regression, neural networks and decision trees with a newly developed decision tree based model in order to predict the optimal value for chunk-size. Our results show that classical decision trees and our custom decision tree model are able to forecast a chunk-size which results in good performance for the linear algebra operations.

Submitted 25 September, 2019; v1 submitted 9 September, 2019; originally announced September 2019.

Comments: Accepted at HPCML19
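
Illustration for arXiv:1909.03947: the abstract above defines the chunk-size as the number of for-loop iterations assigned to a thread as one task. A minimal sequential sketch of that partitioning follows. The function names are hypothetical, and the code does not use HPX or Blaze; each chunk only stands in for what a task-based runtime would schedule as one task.

```python
def split_into_chunks(num_iterations, chunk_size):
    """Return (start, stop) index pairs, each covering at most chunk_size iterations."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, num_iterations))
            for start in range(0, num_iterations, chunk_size)]


def vector_add(a, b, chunk_size):
    """Add two equal-length lists chunk by chunk (run sequentially here)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result = [0] * len(a)
    for start, stop in split_into_chunks(len(a), chunk_size):
        # One pass over this index range corresponds to one task.
        for i in range(start, stop):
            result[i] = a[i] + b[i]
    return result
```

A small chunk-size produces many short tasks and a large one produces few long tasks, which is the trade-off the chunk-size prediction in the abstract is meant to resolve.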

arXiv:1908.03121 [pdf, other]

doi 10.1145/3295500.3356221

From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions

Authors: Gregor Daiß, Parsa Amini, John Biddiscombe, Patrick Diehl, Juhan Frank, Kevin Huck, Hartmut Kaiser, Dominic Marcello, David Pfander, Dirk Pflüger

Abstract: We study the simulation of stellar mergers, which requires complex simulations with high computational demands. We have developed Octo-Tiger, a finite volume grid-based hydrodynamics simulation code with Adaptive Mesh Refinement which is unique in conserving both linear and angular momentum to machine precision. To face the challenge of increasingly complex, diverse, and heterogeneous HPC systems, Octo-Tiger relies on high-level programming abstractions. We use HPX with its futurization capabilities to ensure scalability both between nodes and within, and present first results replacing MPI with libfabric achieving up to a 2.8x speedup. We extend Octo-Tiger to heterogeneous GPU-accelerated supercomputers, demonstrating node-level performance and portability. We show scalability up to full system runs on Piz Daint. For the scenario's maximum resolution, the compute-critical parts (hydrodynamics and gravity) achieve 68.1% parallel efficiency at 2048 nodes.

Submitted 9 August, 2019; v1 submitted 8 August, 2019; originally announced August 2019.

Comments: Accepted at SC19

arXiv:1903.03023 [pdf, other]

doi 10.1145/3318170.3318191

An Introduction to hpxMP: A Modern OpenMP Implementation Leveraging HPX, An Asynchronous Many-Task System

Authors: Tianyi Zhang, Shahrzad Shirzad, Patrick Diehl, R. Tohid, Weile Wei, Hartmut Kaiser

Abstract: Asynchronous Many-task (AMT) runtime systems have gained increasing acceptance in the HPC community due to the performance improvements offered by fine-grained tasking runtime systems. At the same time, C++ standardization efforts are focused on creating higher-level interfaces able to replace OpenMP or OpenACC in modern C++ codes. These higher level functions have been adopted in standards conforming runtime systems such as HPX, giving users the ability to simply utilize fork-join parallelism in their own codes. Despite innovations in runtime systems and standardization efforts, users face enormous challenges porting legacy applications. Not only must users port their own codes, but often users rely on highly optimized libraries such as BLAS and LAPACK which use OpenMP for parallelization. Current efforts to create smooth migration paths have struggled with these challenges, especially as the threading systems of AMT libraries often compete with the threading system of OpenMP. To overcome these issues, our team has developed hpxMP, an implementation of the OpenMP standard, which utilizes the underlying AMT system to schedule and manage tasks. This approach leverages the C++ interfaces exposed by HPX and allows users to execute their applications on an AMT system without changing their code. In this work, we compare hpxMP with Clang's OpenMP library with four linear algebra benchmarks of the Blaze C++ library.
While hpxMP is often not able to reach the same performance, we demonstrate its viability for providing a smooth migration for applications, but these have to be extended to benefit from a more general task-based programming model.

Submitted 5 July, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

arXiv:1810.11482 [pdf, other]

doi 10.1109/ESPM2.2018.00006

Integration of CUDA Processing within the C++ library for parallelism and concurrency (HPX)

Authors: Patrick Diehl, Madhavan Seshadri, Thomas Heller, Hartmut Kaiser

Abstract: Experience shows that on today's high performance systems the utilization of different acceleration cards in conjunction with a high utilization of all other parts of the system is difficult. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores is expected to increase and memory hierarchies are expected to become deeper. One big aspect for distributed applications is to guarantee high utilization of all available resources, including local or remote acceleration cards on a cluster while fully using all the available CPU resources and the integration of the GPU work into the overall programming model. For the integration of CUDA code we extended HPX, a general purpose C++ runtime system for parallel and distributed applications of any scale, and enabled asynchronous data transfers from and to the GPU device and the asynchronous invocation of CUDA kernels on this data. Both operations are well integrated into the general programming model of HPX, which allows any GPU operation to be seamlessly overlapped with work on the main cores. Any user defined CUDA kernel can be launched on any (local or remote) GPU device available to the distributed application. We present asynchronous implementations for the data transfers and kernel launches for CUDA code as part of an HPX asynchronous execution graph. Using this approach we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities.
Overhead measurements show that the integration of the asynchronous operations (data transfer + launches of the kernels) as part of the HPX execution graph imposes no additional computational overhead and significantly eases orchestrating coordinated and concurrent work on the main cores and the used GPU devices.

Submitted 26 October, 2018; originally announced October 2018.

arXiv:1810.07591 [pdf, other]

doi 10.1109/ESPM2.2018.00009

Asynchronous Execution of Python Code on Task Based Runtime Systems

Authors: R. Tohid, Bibek Wagle, Shahrzad Shirzad, Patrick Diehl, Adrian Serio, Alireza Kheirkhahan, Parsa Amini, Katy Williams, Kate Isaacs, Kevin Huck, Steven Brandt, Hartmut Kaiser

Abstract: Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and artificial intelligence (AI), from utilizing performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenience of programming in low-level languages and the costs of acquiring the skills required for programming at this level. In recent years, Python, with the support of linear algebra libraries like NumPy, has gained popularity despite facing limitations which prevent such code from running in a distributed fashion. Here we present a solution which maintains both high level programming abstractions and parallel and distributed efficiency. Phylanx is an asynchronous array processing toolkit which transforms Python and NumPy operations into code which can be executed in parallel on HPC resources by mapping Python and NumPy functions and variables into a dependency tree executed by HPX, a general purpose, parallel, task-based runtime system written in C++. Phylanx additionally provides introspection and visualization capabilities for debugging and performance analysis. We have tested the foundations of our approach by comparing our implementation of widely used machine learning algorithms to accepted NumPy standards.

Submitted 22 October, 2018; v1 submitted 17 October, 2018; originally announced October 2018.

arXiv:1806.06917 [pdf, other]

doi 10.1007/s42452-020-03784-x

An asynchronous and task-based implementation of Peridynamics utilizing HPX -- the C++ standard library for parallelism and concurrency

Authors: Patrick Diehl, Prashant K. Jha, Hartmut Kaiser, Robert Lipton, Martin Levesque

Abstract: On modern supercomputers, asynchronous many task systems are emerging to address the new architecture of computational nodes. With this shift towards more cores per node, a new programming model focused on handling the fine-grain parallelism of this increasing number of cores per computational node is needed. Asynchronous Many Task (AMT) run time systems represent an emerging paradigm for addressing fine-grain parallelism since they handle the increasing number of threads per node and concurrency. HPX, an open source C++ standard library for parallelism and concurrency, is one AMT which conforms to the C++ standard. This means that HPX's Application Programming Interface (API) conforms to its definition by the C++ standard committee. For example, for the concept of futurization the hpx::future can be replaced by std::future without breaking the API. Peridynamics is a non-local generalization of continuum mechanics tailored to address discontinuous displacement fields arising in fracture mechanics. Like many non-local approaches, peridynamics requires considerable computing resources to solve practical problems. This paper investigates the implementation of a peridynamics EMU nodal discretization in an asynchronous task-based fashion. The scalability of the asynchronous task-based implementation is shown to be in agreement with theoretical estimations. In addition to the scalability, the code is convergent for implicit time integration and recovers theoretical solutions. For explicit time integration, convergence results are presented to showcase the agreement of results with theoretical claims in previous works.

Submitted 28 October, 2020; v1 submitted 18 June, 2018; originally announced June 2018.

arXiv:1803.07622 [pdf, other]

doi 10.1016/j.engfracmech.2018.04.030

Long term availability of raw experimental data in experimental fracture mechanics

Authors: Patrick Diehl, Ilyass Tabiai, Felix W. Baumann, Martin Levesque

Abstract: Experimental data availability is a cornerstone for reproducibility in experimental fracture mechanics, which is crucial to the scientific method. This short communication focuses on the accessibility and long term availability of raw experimental data. The corresponding authors of the eleven most cited papers, related to experimental fracture mechanics, for every year from 2000 up to 2016, were kindly asked about the availability of the raw experimental data associated with each publication. For the 187 e-mails sent: 22.46% resulted in outdated contact information, 57.75% of the authors received our request but did not reply, and 19.79% replied to our request. The availability of data is generally low with only 11 available data sets (5.9%). The authors identified two main issues for the lacking availability of raw experimental data. First, the ability to retrieve data is strongly attached to the possibility of contacting the corresponding author. This study suggests that institutional e-mail addresses are insufficient means for obtaining experimental data sets. Second, the lack of experimental data is also due to the fact that submission and publication do not require making the raw experimental data available. The following solutions are proposed: (1) requiring unique identifiers, like ORCID or ResearcherID, to detach the author(s) from their institutional e-mail address, (2) providing DOIs, e.g. via Zenodo or Dataverse, to make raw experimental data citable, and (3) having grant providing organizations ensure that experimental data from publicly funded projects is available to the public.

Submitted 20 March, 2018; originally announced March 2018.

arXiv:1703.06290 [pdf, other]

A wake-sleep algorithm for recurrent, spiking neural networks

Authors: Johannes Thiele, Peter Diehl, Matthew Cook

Abstract: We investigate a recently proposed model for cortical computation which performs relational inference. It consists of several interconnected, structurally equivalent populations of leaky integrate-and-fire (LIF) neurons, which are trained in a self-organized fashion with spike-timing dependent plasticity (STDP). Despite its robust learning dynamics, the model is susceptible to a problem typical for recurrent networks which use a correlation based (Hebbian) learning rule: if trained with high learning rates, the recurrent connections can cause strong feedback loops in the network dynamics, which lead to the emergence of attractor states. This causes a strong reduction in the number of representable patterns and a decay in the inference ability of the network. As a solution, we introduce a conceptually very simple "wake-sleep" algorithm: during the wake phase, training is executed normally, while during the sleep phase, the network "dreams" samples from its generative model, which are induced by random input. This process allows us to activate the attractor states in the network, which can then be unlearned effectively by an anti-Hebbian mechanism. The algorithm allows us to increase learning rates up to a factor of ten while avoiding clustering, which allows the network to learn several times faster. Also for low learning rates, where clustering is not an issue, it improves convergence speed and reduces the final inference error.

Submitted 18 March, 2017; originally announced March 2017.

Comments: Presented at the NIPS 2016 workshop "Computing with Spikes"

arXiv:1608.08267 [pdf, other]

Learning and Inferring Relations in Cortical Networks

Authors: Peter U. Diehl, Matthew Cook

Abstract: A pressing scientific challenge is to understand how brains work. Of particular interest is the neocortex, the part of the brain that is especially large in humans, capable of handling a wide variety of tasks including visual, auditory, language, motor, and abstract processing. These functionalities are processed in different self-organized regions of the neocortical sheet, and yet the anatomical structure carrying out the processing is relatively uniform across the sheet. We are at a loss to explain, simulate, or understand such a multi-functional homogeneous sheet-like computational structure - we do not have computational models which work in this way. Here we present an important step towards developing such models: we show how uniform modules of excitatory and inhibitory neurons can be connected bidirectionally in a network that, when exposed to input in the form of population codes, learns the input encodings as well as the relationships between the inputs. STDP learning rules lead the modules to self-organize into a relational network, which is able to infer missing inputs, restore noisy signals, decide between conflicting inputs, and combine cues to improve estimates. These networks show that it is possible for a homogeneous network of spiking units to self-organize so as to provide meaningful processing of its inputs. If such networks can be scaled up, they could provide an initial computational model relevant to the large scale anatomy of the neocortex.

Submitted 29 August, 2016; originally announced August 2016.

arXiv:1601.04187 [pdf, other]

Conversion of Artificial Recurrent Neural Networks to Spiking Neural Networks for Low-power Neuromorphic Hardware

Authors: Peter U. Diehl, Guido Zarrella, Andrew Cassidy, Bruno U. Pedroni, Emre Neftci

Abstract: In recent years the field of neuromorphic low-power systems that consume orders of magnitude less power has gained significant momentum. However, their wider use is still hindered by the lack of algorithms that can harness the strengths of such architectures. While neuromorphic adaptations of representation learning algorithms are now emerging, efficient processing of temporal sequences or variable-length inputs remains difficult. Recurrent neural networks (RNN) are widely used in machine learning to solve a variety of sequence learning tasks. In this work we present a train-and-constrain methodology that enables the mapping of machine learned (Elman) RNNs on a substrate of spiking neurons, while being compatible with the capabilities of current and near-future neuromorphic systems. This "train-and-constrain" method consists of first training RNNs using backpropagation through time, then discretizing the weights and finally converting them to spiking RNNs by matching the responses of artificial neurons with those of the spiking neurons. We demonstrate our approach on a natural language processing task (question classification), where we show the entire mapping process of the recurrent layer of the network on IBM's Neurosynaptic System "TrueNorth", a spike-based digital neuromorphic hardware architecture. TrueNorth imposes specific constraints on connectivity, neural and synaptic parameters. To satisfy these constraints, it was necessary to discretize the synaptic weights and neural activities to 16 levels, and to limit fan-in to 64 inputs. We find that short synaptic delays are sufficient to implement the dynamical (temporal) aspect of the RNN in the question classification task. The hardware-constrained model achieved 74% accuracy in question classification while using less than 0.025% of the cores on one TrueNorth chip, resulting in an estimated power consumption of ~17 uW.

Submitted 16 January, 2016; originally announced January 2016.

arXiv:1601.04183 [pdf, other]

TrueHappiness: Neuromorphic Emotion Recognition on TrueNorth

Authors: Peter U. Diehl, Bruno U. Pedroni, Andrew Cassidy, Paul Merolla, Emre Neftci, Guido Zarrella

Abstract: We present an approach to constructing a neuromorphic device that responds to language input by producing neuron spikes in proportion to the strength of the appropriate positive or negative emotional response. Specifically, we perform a fine-grained sentiment analysis task with implementations on two different systems: one using conventional spiking neural network (SNN) simulators and the other one using IBM's Neurosynaptic System TrueNorth. Input words are projected into a high-dimensional semantic space and processed through a fully-connected neural network (FCNN) containing rectified linear units trained via backpropagation. After training, this FCNN is converted to an SNN by substituting the ReLUs with integrate-and-fire neurons. We show that there is practically no performance loss due to conversion to a spiking network on a sentiment analysis test set, i.e. correlations between predictions and human annotations differ by less than 0.02 comparing the original DNN and its spiking equivalent. Additionally, we show that the SNN generated with this technique can be mapped to existing neuromorphic hardware -- in our case, the TrueNorth chip. Mapping to the chip involves 4-bit synaptic weight discretization and adjustment of the neuron thresholds. The resulting end-to-end system can take a user input, i.e. a word in a vocabulary of over 300,000 words, and estimate its sentiment on TrueNorth with a power consumption of approximately 50 uW.

Submitted 16 January, 2016; originally announced January 2016.

Showing 1–39 of 39 results for author: Diehl, P