This year's symposium continues and reinforces the PPoPP tradition of publishing leading work on all aspects of parallel programming, including foundational and theoretical aspects, techniques, languages, compilers, runtime systems, tools, and practical experiences. Given the pervasiveness of parallel architectures in the general consumer market, PPoPP, with its interest in new parallel workloads, techniques and productivity tools for parallel programming, is becoming more relevant than ever to the computer science community.
Beyond human-level accuracy: computational challenges in deep learning
Deep learning (DL) research yields accuracy and product improvements from both model architecture changes and scale: larger data sets and models, and more computation. For hardware design, it is difficult to predict DL model changes. However, recent ...
S-EnKF: co-designing for scalable ensemble Kalman filter
Ensemble Kalman filter (EnKF) is one of the most important methods for data assimilation, which is widely applied to the reconstruction of observed historical data for providing initial conditions of numerical atmospheric and oceanic models. With the ...
Throughput-oriented GPU memory allocation
Throughput-oriented architectures, such as GPUs, can sustain three orders of magnitude more concurrent threads than multicore architectures. This level of concurrency pushes typical synchronization primitives (e.g., mutexes) over their scalability ...
SEP-graph: finding shortest execution paths for graph processing under a hybrid framework on GPU
In general, the performance of parallel graph processing is determined by three pairs of critical parameters, namely synchronous or asynchronous execution mode (Sync or Async), Push or Pull communication mechanism (Push or Pull), and Data-driven or ...
Incremental flattening for nested data parallelism
Compilation techniques for nested-parallel applications that can adapt to hardware and dataset characteristics are vital for unlocking the power of modern hardware. This paper proposes such a technique, which builds on flattening and is applied in the ...
Adaptive sparse matrix-matrix multiplication on the GPU
In the ongoing efforts targeting the vectorization of linear algebra primitives, sparse matrix-matrix multiplication (SpGEMM) has received considerably less attention than sparse Matrix-Vector multiplication (SpMV). While both are equally important, ...
Modular transactions: bounding mixed races in space and time
We define local transactional race freedom (LTRF), which provides a programmer model for software transactional memory. LTRF programs satisfy the SC-LTRF property, thus allowing the programmer to focus on sequential executions in which transactions ...
Leveraging hardware TM in Haskell
Transactional memory (TM) is heavily used for synchronization in the Haskell programming language, but its performance has historically been poor. We set out to improve this performance using hardware TM (HTM) on Intel processors. This task is ...
Stretching the capacity of hardware transactional memory in IBM POWER architectures
The hardware transactional memory (HTM) implementations in commercially available processors are significantly hindered by their tight capacity constraints. In practice, this renders current HTMs unsuitable to many real-world workloads of in-memory ...
Processing transactions in a predefined order
In this paper we provide a high performance solution to the problem of committing transactions while enforcing a pre-defined order. We provide the design and implementation of three algorithms, which deploy a specialized cooperative transaction ...
Harmonia: a high throughput B+tree for GPUs
B+tree is one of the most important data structures and has been widely used in different fields. With the increase of concurrent queries and data-scale in storage, designing an efficient B+tree structure has become critical. Due to abundant computation ...
Engineering a high-performance GPU B-Tree
We engineer a GPU implementation of a B-Tree that supports concurrent queries (point, range, and successor) and updates (insertions and deletions). Our B-tree outperforms the state of the art, a GPU log-structured merge tree (LSM) and a GPU sorted ...
QTLS: high-performance TLS asynchronous offload framework with Intel® QuickAssist technology
Hardware accelerators are a promising solution to optimize the Total Cost of Ownership (TCO) of cloud datacenters. This paper targets the costly Transport Layer Security (TLS) and investigates the TLS acceleration for the widely-deployed event-driven ...
Data-flow/dependence profiling for structured transformations
- Fabian Gruber,
- Manuel Selva,
- Diogo Sampaio,
- Christophe Guillon,
- Antoine Moynault,
- Louis-Noël Pouchet,
- Fabrice Rastello
Profiling feedback is an important technique used by developers for performance debugging, where it is usually used to pinpoint performance bottlenecks and also to find optimization opportunities. Assessing the validity and potential benefit of a ...
Lightweight hardware transactional memory profiling
Programs that use hardware transactional memory (HTM) demand sophisticated performance analysis tools when they suffer from performance losses. We have developed TxSampler---a lightweight profiler for programs that use HTM. TxSampler measures ...
A pattern based algorithmic autotuner for graph processing on GPUs
This paper proposes Gswitch, a pattern-based algorithmic auto-tuning system that dynamically switches between optimization variants with negligible overhead. Its novelty lies in a small set of algorithmic patterns that allow for the configurable ...
Provably and practically efficient granularity control
Over the past decade, many programming languages and systems for parallel-computing have been developed, e.g., Fork/Join and Habanero Java, Parallel Haskell, Parallel ML, and X10. Although these systems raise the level of abstraction for writing ...
A coordinated tiling and batching framework for efficient GEMM on GPUs
General matrix multiplication (GEMM) plays a paramount role in a broad range of domains such as deep learning, scientific computing, and image processing. The primary optimization method is to partition the matrix into many tiles and exploit the ...
Semantics-aware scheduling policies for synchronization determinism
A common task for all deterministic multithreading (DMT) systems is to enforce synchronization determinism. However, synchronization determinism has not been the focus of existing DMT research. Instead, most DMT systems focused on how to order data ...
Proactive work stealing for futures
The use of futures provides a flexible way to express parallelism and can generate arbitrary dependences among parallel subcomputations. The additional flexibility that futures provide comes with a cost, however. When scheduled using classic work ...
A round-efficient distributed betweenness centrality algorithm
- Loc Hoang,
- Matteo Pontecorvi,
- Roshan Dathathri,
- Gurbinder Gill,
- Bozhi You,
- Keshav Pingali,
- Vijaya Ramachandran
We present Min-Rounds BC (MRBC), a distributed-memory algorithm in the CONGEST model that computes the betweenness centrality (BC) of every vertex in a directed unweighted n-node graph in O(n) rounds. Min-Rounds BC also computes all-pairs-shortest-paths ...
Corrected trees for reliable group communication
- Martin Küttler,
- Maksym Planeta,
- Jan Bierbaum,
- Carsten Weinhold,
- Hermann Härtig,
- Amnon Barak,
- Torsten Hoefler
Driven by ever increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth is a significant burden for fast group communication primitives and also makes those systems more ...
Adaptive sparse tiling for sparse matrix multiplication
Tiling is a key technique for data locality optimization and is widely used in high-performance implementations of dense matrix-matrix multiplication for multicore/manycore CPUs and GPUs. However, the irregular and matrix-dependent data access pattern ...
Encapsulated open nesting for STM: fine-grained higher-level conflict detection
Open nesting allows replacing the automatic detection of conflicting memory accesses used in transactional memory (TM) with programmer-specified higher-level conflict detection. Higher-level conflict detection allows removing conflicts of commuting ...
A specialized B-tree for concurrent datalog evaluation
Modern Datalog engines are employed in industrial applications such as graph-databases, networks, and static program analysis. To cope with vast amount of data, Datalog engines must employ parallel execution strategies, for which specialized concurrent ...
Efficient race detection with futures
This paper addresses the problem of provably efficient and practically good on-the-fly determinacy race detection in task parallel programs that use futures. Prior works on determinacy race detection have mostly focused on either task parallel programs ...
Verifying C11 programs operationally
This paper develops an operational semantics for a release-acquire fragment of the C11 memory model with relaxed accesses. We show that the semantics is both sound and complete with respect to the axiomatic model of Batty et al. The semantics relies on ...
Checking linearizability using hitting families
Linearizability is a key correctness property for concurrent data types. Linearizability requires that the behavior of concurrently invoked operations of the data type be equivalent to the behavior in an execution where each operation takes effect at an ...
Transitive joins: a sound and efficient online deadlock-avoidance policy
We introduce a new online deadlock-avoidance policy, Transitive Joins (TJ), that targets programs with dynamic task parallelism and arbitrary join operations. In this model, a computation task can asynchronously spawn new tasks and selectively join (...
VEBO: a vertex- and edge-balanced ordering heuristic to load balance parallel graph processing
This work proposes Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO balances edges and vertices for graphs with a power-law degree distribution, and ensures an equal degree ...
Index Terms
- Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming