WBSP: Addressing stragglers in distributed machine learning with worker-busy synchronous parallel
The parameter server architecture is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers’ computing capabilities leads to stragglers, which makes parameter synchronization challenging. To ...
Multi-GPU 3D k-nearest neighbors computation with application to ICP, point cloud smoothing and normals computation
The k-Nearest Neighbors algorithm is a fundamental algorithm that finds applications in many fields like Machine Learning, Computer Graphics, Computer Vision, and others. The algorithm determines the closest points (d-dimensional) of a reference ...
NxtSPR: A deadlock-free shortest path routing dedicated to relaying for Triplet-Based many-core Architecture
Deadlock-free routing is a significant challenge in Network-on-Chip (NoC) design as it affects the network’s latency, power consumption, and load balance, impacting the performance of multi-processor systems-on-chip. However, achieving deadlock-...
Highlights
- The topology-related characteristics of Triplet-Based many-core Architecture are defined systematically using graph and group theory, and its correctness is verified through formal verification (proof-based) methods.
- A novel and high-...
Mobilizing underutilized storage nodes via job path: A job-aware file striping approach
Users’ limited understanding of the storage system architecture prevents them from fully utilizing the parallel I/O capability of the storage system, leading to a negative impact on the overall performance of supercomputers. Therefore, exploring ...
Abstractions for C++ code optimizations in parallel high-performance applications
In many computational problems, memory throughput is a performance bottleneck, especially in the domain of parallel computing. Software needs to be attuned to hardware features like cache architectures or concurrent memory banks to reach a ...
Highlights
- Proposes a novel abstraction for flexible traversals of regular data structures.
- Designed for traversal-agnostic algorithms in HPC parallel computing.
- Reduces traversal code complexity, improving separation of concerns and ...
An automated OpenMP mutation testing framework for performance optimization
Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, thus ...