Most of the initial work on graph partitioning involved sequential algorithms. These algorithms have been extended to work in distributed-memory environments, in particular for balancing processor workloads in parallel applications. In this use case, a distributed-memory application already has a distribution of the graph; for memory scalability, the entire graph is not stored in every processor. Thus, distributed-memory partitioning algorithms do not typically have a global view of the entire graph; they often make partitioning decisions based on partial views of local graph data. As a result, they can have lower solution quality than their sequential counterparts. Still, distributed-memory algorithms are crucial for graphs that are too large to fit in a single memory space, and for applications wishing to partition their data dynamically to adjust for changing computational workloads.
In recent years, progress has been made on shared-memory algorithms, as shared-memory architectures offer greater flexibility than distributed-memory architectures. For example, random memory accesses and atomic updates can be performed orders of magnitude faster than on distributed-memory machines. Because shared-memory algorithms have a global view of the graph, they can achieve the same solution quality as their sequential predecessors. They are not feasible, however, for extremely large graphs that do not fit in a single memory space.
6.2 Distributed Memory
Distributed (hyper)graph partitioning is one way to handle large inputs that do not fit into the main memory of a single machine. In the distributed-memory model, several processors (PEs) are interconnected via a communication network, and each has its private memory inaccessible to others. Computational tasks on each PE usually operate independently only on local data representing a small subset of the input. Intermediate computational results must be exchanged via dedicated network communication primitives.
Distributed (hyper)graph processing algorithms require that the vertices and edges of the input are partitioned among the processors. Since many applications use balanced (hyper)graph partitioning precisely to obtain a good initial assignment, distributed partitioners rely on much simpler techniques for this initial distribution. There exist range-based [
128] and hash-based partitioning techniques [
166,
167]. The former splits the vertex IDs into equidistant ranges, which are then assigned to the PEs; the latter assigns each vertex to a PE based on a hash of its ID. Since neither technique considers the (hyper)graph structure, they can lead to load imbalances or high communication overheads. However, one could also migrate vertices as more information about the structure of the (hyper)graph becomes available, e.g., when recursing on a subgraph obtained via recursive bipartitioning [
22,
45]. If geometric information is available, then one can also use space-filling curves [
16,
177].
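For illustration, the two simple initial-distribution schemes mentioned above can be sketched as follows, assuming vertices are identified by consecutive integer IDs; the function names are illustrative and not taken from any of the cited systems.

```python
def range_based_owner(vertex_id: int, num_vertices: int, num_pes: int) -> int:
    """Range-based distribution: split the ID space into equidistant ranges."""
    chunk = (num_vertices + num_pes - 1) // num_pes  # ceiling division
    return vertex_id // chunk

def hash_based_owner(vertex_id: int, num_pes: int) -> int:
    """Hash-based distribution: assign a vertex to a PE via a hash of its ID."""
    # Python's built-in hash is used purely for illustration; real systems use a
    # fixed hash function so that every PE computes the same owner independently.
    return hash(vertex_id) % num_pes

# Example: distribute 10 vertices across 3 PEs.
owners_range = [range_based_owner(v, 10, 3) for v in range(10)]
owners_hash = [hash_based_owner(v, 3) for v in range(10)]
```

Neither scheme inspects the (hyper)graph structure, which is exactly why it may produce load imbalance or high communication volume.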
Each PE then stores the vertices assigned to it and the edges incident to them. The edges stored on a PE can be incident to local vertices or vertices on other PEs (also called ghost or halo vertices). We say that a PE is adjacent to another PE if they share a common edge. Processors must be able to identify adjacent PEs to propagate updates, e.g., if we move a vertex to a different block, then we have to communicate that change to other PEs in the network such that local search algorithms can work on accurate partition information. However, each communication operation introduces overheads that can limit the scalability of the system. Thus, the main challenge in distributed (hyper)graph partitioning is keeping the global partition information on each PE in some sense up to date while simultaneously minimizing the required communication.
The remainder of this section describes the algorithmic core ideas of recent publications in that field and abstracts from the physical placement of the vertices and the actual representation of the distributed (hyper)graph data structure. However, we assume that each vertex knows on which PE its neighbors are stored.
Local Search. The label propagation heuristic is the most widely used local search algorithm in distributed systems [
46,
95,
97,
123,
128,
166,
170,
180]. Other approaches schedule sequential two-way FM [
55] on adjacent block pairs in parallel [
40,
92,
106]. However, this limits the available parallelism to at most the number of blocks
\(k\).
Parallel label propagation implementations mostly follow the
bulk synchronous parallel model. In a computation phase, each PE computes the desired target block of each of its local vertices. In the communication phase, updates are made visible to other PEs via personalized all-to-all communication [
128,
166]. Meyerhenke et al. [
128] use an asynchronous communication model: once the computation phase of a PE ends, it sends and receives updates to and from other PEs and immediately continues with the next round.
In the parallel setting, the move gains of two adjacent vertices may each suggest an improvement when the vertices are moved individually, but moving both simultaneously may worsen the solution quality. Therefore, some partitioners use a vertex coloring [
97] or a two-phase protocol where in the first phase, vertices can only move from a block
\(V_i\) to
\(V_j\) if
\(i \lt j\) and vice versa in the second phase [
46,
113,
170]. Many systems do not use any techniques to protect against move conflicts. This can be seen as an optimistic strategy assuming that conflicts rarely happen in practice.
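The following sketch illustrates the two-phase protocol described above for a simple edge-cut label propagation round; the graph representation and gain computation are simplified placeholders rather than code from any of the cited partitioners.

```python
from collections import Counter

def best_move(graph, partition, v):
    """Return the adjacent block with the largest connectivity to v (a simple gain proxy)."""
    counts = Counter(partition[u] for u in graph[v])
    target, _ = counts.most_common(1)[0]
    return target

def two_phase_round(graph, partition):
    """One label propagation round split into two phases to avoid conflicting moves.

    Phase 0 only allows moves from a block V_i to V_j with i < j,
    phase 1 only allows moves with i > j."""
    for phase in (0, 1):
        for v in graph:
            i, j = partition[v], best_move(graph, partition, v)
            if (phase == 0 and i < j) or (phase == 1 and i > j):
                # In a distributed setting this move would be communicated
                # to adjacent PEs after the phase.
                partition[v] = j
    return partition

# Toy example: a path graph 0-1-2-3 initially split in the middle.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
partition = {0: 0, 1: 0, 2: 1, 3: 1}
two_phase_round(graph, partition)
```

Note that this sketch ignores the balance constraint, which is discussed separately below.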
The
Social Hash Partitioner [
95] (Facebook’s internal hypergraph partitioner) also uses the label propagation heuristic to optimize
\(\text{fanout}(\Pi) := \frac{1}{|E|} \sum _{e \in E} \lambda (e)\) where
\(\Pi = \lbrace V_1,\ldots ,V_k\rbrace\) is a
\(k\)-way partition. The authors note that the label propagation algorithm can easily get stuck in local optima for fanout optimization and suggest a probabilistic version of the fanout metric, called
\(\text{p-fanout}(\Pi) := \frac{1}{|E|} \sum _{e \in E} \sum _{V_i \in \Pi } \left(1 - (1 - p)^{\mathrm{\Phi }(e, V_i)}\right)\) for some probability
\(p \in (0,1)\). The probabilistic fanout function samples the pins of each net with probability
\(p\) and represents the expected fanout for a family of similar hypergraphs. Thus, it should be more robust and reduce the impact of local minima.
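For illustration, the two objective functions translate directly into the following sketch, assuming the hypergraph is given as a list of nets (each a list of pins) together with a mapping from vertices to blocks, that \(\lambda(e)\) denotes the number of blocks connected by net \(e\), and that \(\Phi(e, V_i)\) denotes the number of pins of \(e\) in block \(V_i\).

```python
from collections import Counter

def fanout(nets, block_of):
    """fanout(Pi) = (1/|E|) * sum over nets of lambda(e),
    where lambda(e) is the number of blocks connected by net e."""
    return sum(len({block_of[pin] for pin in net}) for net in nets) / len(nets)

def p_fanout(nets, block_of, p):
    """Probabilistic fanout: each pin is sampled with probability p, so a block
    containing Phi(e, V_i) pins of net e is hit with probability 1 - (1-p)**Phi(e, V_i)."""
    total = 0.0
    for net in nets:
        pins_per_block = Counter(block_of[pin] for pin in net)
        total += sum(1.0 - (1.0 - p) ** phi for phi in pins_per_block.values())
    return total / len(nets)

# Toy hypergraph with two nets and a 2-way partition.
nets = [[0, 1, 2], [2, 3]]
block_of = {0: 0, 1: 0, 2: 1, 3: 1}
print(fanout(nets, block_of), p_fanout(nets, block_of, p=0.5))
```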
Other recently published distributed local search techniques are based on vertex swapping techniques that preserve the balance of the partition. Rahimian et al. [
143] present JA-BE-JA, which uses such an approach. The algorithm iterates over the local vertices of each PE and, for each vertex, considers all adjacent vertices as swap candidates. If no partner is found, it selects a random vertex from a sample as a candidate. If the selected vertex is assigned to a different PE, the initiating PE sends a request with all the required information such that the receiving PE can verify whether or not the swap operation would improve the edge cut. On success, both vertices change their blocks. Additionally, simulated annealing is used to avoid local minima.
Aydin et al. [
16] implement a distributed partitioner that computes a linear ordering of the vertices, which is then split into
\(k\) equally sized ranges to obtain an initial
\(k\)-way partition. The idea is similar to space-filling curves [
19,
138], but does not require geometric information. The initial ordering is computed by assigning labels to a tree constructed via agglomerative hierarchical clustering. Afterward, it sorts the labels of the leaves to obtain an initial ordering. To further improve the ordering, it solves the
minimum linear arrangement problem that tries to optimize
\(\sum _{(u,v) \in E} |\pi (u) - \pi (v)|\omega (u,v)\) where
\(\pi (u)\) denotes the position of
\(u \in V\) in the current ordering. To do so, it uses a two-stage MapReduce algorithm that is repeated until convergence: First, each vertex computes its desired new position as the weighted median of its neighbor’s positions. Second, the final positions are assigned to the vertices by resolving duplicates with simple ID-based ordering. The second local search algorithm performs vertex swaps. First, it pairs adjacent blocks of the partition. Then, it splits the vertices of each block into
\(r\) disjoint intervals and randomly pairs intervals between paired blocks. The paired sets are then mapped to the processors, which perform the following algorithm: each processor sorts the vertices in both sets by the cut reduction achieved if they were moved to the opposite block and swaps the vertices with the highest combined cut reduction.
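A sequential sketch of the weighted-median step of the linear arrangement refinement described above is given below, under the assumption that the graph fits in memory; duplicate positions are resolved by vertex ID, as in the text, and the weight lookup is a simplifying assumption.

```python
def weighted_median(values_and_weights):
    """Return the weighted median of (value, weight) pairs."""
    pairs = sorted(values_and_weights)
    half = sum(w for _, w in pairs) / 2.0
    acc = 0.0
    for value, weight in pairs:
        acc += weight
        if acc >= half:
            return value

def relaxation_round(adjacency, weights, position):
    """One round of the linear arrangement heuristic: every vertex moves to the
    weighted median of its neighbors' positions, then duplicates are broken by ID."""
    desired = {
        v: weighted_median([(position[u], weights.get((v, u), 1)) for u in adjacency[v]])
        for v in adjacency
    }
    # Assign final (unique) positions: sort by desired position, break ties by vertex ID.
    order = sorted(adjacency, key=lambda v: (desired[v], v))
    return {v: i for i, v in enumerate(order)}
```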
Balance Constraint. The label propagation algorithm only knows the exact block weights at the beginning of each computation phase. In the computation phase, block weights are only maintained locally. In the communication phase, the combination of all moves may result in a partition that violates the balance constraint. Thus, partitioners based on this scheme have to employ techniques to ensure balance.
The distributed multilevel graph partitioner ParHIP [
128] divides a label propagation round into subrounds and restores the exact block weights with an All-Reduce operation after each subround. Note that this does not guarantee balance but gives a good approximation of the block weights when the number of moved vertices in a subround is small.
Slota et al. [
166] implemented a distributed graph partitioner that alternates between a balance and refinement phase, both utilizing the label propagation algorithm. In the refinement phase, each PE maintains approximate block weights
\(a(V_i) := c(V_i) + \gamma \Delta (V_i)\) where
\(c(V_i)\) is the weight of block
\(V_i\) at the beginning of the computation phase,
\(\Delta (V_i)\) is the net weight of the vertices that locally moved out of or into block
\(V_i\), and
\(\gamma\) is a tuning parameter that depends on the number of PEs. Each PE then ensures locally that
\(a(V_i) \le L_{\max }\) for all
\(i \in \lbrace 1, \ldots , k\rbrace\). In the balancing phase, the gain of moving a vertex to block
\(V_i\) is multiplied by
\(\frac{L_{\max }}{a(V_i)}\). As a consequence, moves to underloaded blocks become more attractive. In a subsequent publication [
167], the approach is generalized to the multi-constraint partitioning problem, where each vertex is associated with multiple weights.
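A small sketch of how a PE might maintain the approximate block weights \(a(V_i)\) and bias gains towards underloaded blocks in the balancing phase, following the formulas above; the data structures are illustrative.

```python
class LocalBlockWeights:
    def __init__(self, global_weights, gamma, l_max):
        self.c = list(global_weights)             # block weights at the start of the phase
        self.delta = [0.0] * len(global_weights)  # local net weight change per block
        self.gamma = gamma                        # tuning parameter (depends on #PEs)
        self.l_max = l_max                        # maximum allowed block weight

    def approx_weight(self, i):
        # a(V_i) := c(V_i) + gamma * Delta(V_i)
        return self.c[i] + self.gamma * self.delta[i]

    def can_move(self, v_weight, dst):
        # Locally enforce a(V_dst) <= L_max after the move.
        return self.approx_weight(dst) + v_weight <= self.l_max

    def apply_move(self, v_weight, src, dst):
        self.delta[src] -= v_weight
        self.delta[dst] += v_weight

    def balancing_gain(self, raw_gain, dst):
        # In the balancing phase, gains are scaled by L_max / a(V_dst),
        # which makes moves to underloaded blocks more attractive.
        return raw_gain * self.l_max / self.approx_weight(dst)
```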
Recently, probabilistic methods were proposed that preserve the balance in expectation [
95,
123]. The Social Hash Partitioner [
95] aggregates the number of vertices
\(S_{i,j}\) that want to move from block
\(V_i\) to
\(V_j\) after each computation phase at a dedicated master process. Then, a vertex in block
\(V_i\) is moved to its desired target block
\(V_j\) with probability
\(\frac{\min (S_{i,j}, S_{j,i})}{S_{i,j}}\). This ensures that the expected number of vertices that move from block
\(V_i\) to
\(V_j\) and vice versa is the same and, thus, preserves the balance of the partition in expectation. However, each PE moves its highest ranked vertices with probability one and all remaining probabilistically. Martella et al. [
123] move a vertex
\(u\) to its desired target block
\(V_j\) with probability
\(\frac{L_{\max } - c(V_j)}{M_j}\) where
\(M_j\) is the number of vertices that want to move to block
\(V_j\). The advantage of the probabilistic method is that only the number of vertices preferring a different block needs to be communicated instead of all individual moves.
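The two probabilistic acceptance rules can be summarized as follows, viewed from the coordinating step after a computation phase; this is a sketch, not code from the cited systems.

```python
import random

def social_hash_accept(src, dst, S):
    """Social Hash rule: a vertex that wants to move from block src to dst is moved
    with probability min(S[src][dst], S[dst][src]) / S[src][dst], so the expected
    flow in both directions is equal and balance is preserved in expectation."""
    if S[src][dst] == 0:
        return False
    prob = min(S[src][dst], S[dst][src]) / S[src][dst]
    return random.random() < prob

def weight_capped_accept(dst, block_weight, l_max, M):
    """Martella et al.: a vertex is moved to its desired block dst with probability
    (L_max - c(V_dst)) / M_dst, where M_dst is the number of vertices that want
    to move to dst."""
    if M[dst] == 0:
        return False
    prob = max(0.0, (l_max - block_weight[dst]) / M[dst])
    return random.random() < prob
```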
Multilevel Algorithms. Although it is widely known that multilevel algorithms produce better partitions than flat partitioning schemes, the systems used in industry, e.g., at Google [
16] or Facebook [
95,
123], are primarily non-multilevel algorithms. The main reason for this is that the scalability of multilevel algorithms is often limited to a few hundred processors [
92,
106]. Furthermore, most parallel multilevel systems implement matching-based coarsening algorithms [
40,
46,
92,
106,
170,
181] that are not capable of efficiently reducing the size of today's complex networks (which exhibit power-law node degree distributions). The most prominent distributed multilevel algorithms are Jostle [
181], ParMetis [
97], PT-Scotch [
40], KaPPa [
92], ParHIP [
128] and ScalaPart [
106] for graph, and Parkway [
170] and Zoltan [
46] for hypergraph partitioning.
Meyerhenke et al. [
128] build the parallel multilevel partitioner ParHIP that uses a parallel version of the size-constrained label propagation algorithm [
126]. The algorithm is used to compute a clustering in the coarsening phase and as a local search algorithm in the refinement phase. To obtain an initial partition of the coarsest graph, it uses the distributed evolutionary graph partitioner KaFFPaE [
149]. On complex networks, ParHIP computes edge cuts
\(38\%\) smaller than those of ParMetis [
97] on average, while it is also more than a factor of two faster.
Wang et al. [
182] use a similar approach that also utilizes the label propagation algorithm to compute a clustering in the coarsening phase. The algorithm is implemented on top of Microsoft’s Trinity graph engine [
161]. The partitioner additionally uses external memory techniques to partition large graphs on a small number of machines. However, it does not perform multilevel refinement (the initial partition is projected to the input graph).
Geometric Partitioners. Many graphs are derived from geometric applications and are enriched with coordinate information (e.g., each vertex is associated with a
\(d\)-dimensional point). A mesh with coordinate information and a partition based on these coordinates is shown in Figure
7. Geometric partitioning techniques use this information to partition the corresponding point set into
\(k\) equally sized clusters while minimizing an objective function defined on the clusters. The objective function should be chosen such that it implicitly optimizes the desired graph partitioning metric (e.g., the sum of the lengths of all bounding boxes approximates the total communication volume [
45]). Since geometric methods ignore the underlying structure of the (hyper)graph, the quality of the partitions is often inferior to that of traditional multilevel algorithms. However, geometric algorithms are often simpler, which leads to faster and more scalable implementations. Prominent techniques use space-filling curves [
19,
138] that map a set of
\(d\)-dimensional points to a one-dimensional line. A fundamental property of this curve is that points that are close on the line are also close in the original space. Other approaches recursively divide the space via cutting planes such as Octree-based partitioning [
129], recursive coordinate bisection [
22,
164], and recursive inertial bisection [
169,
184]. The MultiJagged algorithm of Deveci et al. [
45] uses multisection rather than bisection to reduce the depth of the recursion and speed up computation relative to recursive coordinate bisection; its hybrid implementation uses MPI and Kokkos [
171] to support both distributed-memory message passing between PEs and multithreading or GPU computation within PEs.
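As a concrete example of the space-filling curve approach, the following sketch partitions 2D points into \(k\) blocks by sorting them along a Morton (Z-order) curve and splitting the order into equally sized ranges; the bit width and helper names are illustrative assumptions.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) code: interleave the bits of the two coordinates."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def sfc_partition(points, k, bits: int = 16):
    """Map points in [0,1)^2 to an integer grid, sort them by their Morton code,
    and split the resulting order into k equally sized blocks."""
    scale = (1 << bits) - 1
    order = sorted(
        range(len(points)),
        key=lambda i: interleave_bits(int(points[i][0] * scale),
                                      int(points[i][1] * scale), bits),
    )
    block_size = (len(points) + k - 1) // k
    partition = [0] * len(points)
    for rank, idx in enumerate(order):
        partition[idx] = rank // block_size
    return partition
```

Points that are close along the curve tend to be close in space, so each contiguous range forms a geometrically compact block.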
Recently, Von Looz et al. [
177] presented a scalable balanced
\(k\)-means algorithm to partition geometric graphs. The
\(k\)-means problem asks for a partition of a point set
\(P\) into
\(k\) roughly equally sized clusters such that the sum of the squared distances of each point to the mean of its cluster (in the following also referred to as the center of the cluster) is minimized. Clusters obtained with this problem definition tend to have better shapes than those computed with previous methods and also produce better partitions when measured with graph metrics [
125]. They present a parallel implementation of Lloyd’s greedy algorithm [
118] that repeats the following steps until convergence. First, each point
\(p \in P\) is assigned to the cluster that minimizes the distance of
\(p\) to its center. Afterwards, the center of each cluster is updated by calculating the arithmetic mean of all points assigned to it. To achieve balanced cluster sizes, an influence factor
\(\gamma _c\) is introduced individually for each cluster
\(c\) and
\(\text{eff}\_\text{dist}(p, \text{center}(c)) := \text{dist}(p, \text{center}(c)) / \gamma _c\) is used as the distance of point
\(p\) to the center of a cluster
\(c\). If a cluster
\(c\) becomes overloaded, then the influence factor
\(\gamma _c\) is decreased; otherwise, it is increased. Thus, underloaded clusters become more attractive. The implementation replicates the cluster centers and influence factors globally and after each computation phase, it updates the values via a parallel sum operation. To obtain an initial solution, it sorts the points according to the index on a space-filling curve and splits the order into
\(k\) equally sized clusters. Furthermore, it establishes a lower bound for the distance of each point to its second-closest cluster, which allows the algorithm to skip expensive distance computations for most of the points. Additionally, each processor sorts the cluster centers according to their distances to a bounding box around the process-local points. Evaluating the target clusters in order of increasing distance allows the algorithm to abort early once the minimum distance of the remaining clusters exceeds the distance of the best candidate found so far.
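A compact sequential sketch of the balanced k-means idea described above, using the effective distance \(\text{dist}/\gamma_c\) and a simple multiplicative update of the influence factors; the update constants are illustrative assumptions, not the exact scheme of the cited implementation.

```python
import math

def balanced_kmeans_round(points, centers, gamma, capacity):
    """Assign each point to the cluster minimizing dist/gamma, then update
    centers (arithmetic mean) and influence factors based on cluster load."""
    k = len(centers)
    assignment = []
    for p in points:
        eff = [math.dist(p, centers[c]) / gamma[c] for c in range(k)]
        assignment.append(min(range(k), key=lambda c: eff[c]))

    # Recompute each center as the arithmetic mean of its assigned points.
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if members:
            centers[c] = tuple(sum(x) / len(members) for x in zip(*members))

    # Overloaded clusters become less attractive, underloaded ones more attractive.
    for c in range(k):
        load = sum(1 for a in assignment if a == c)
        gamma[c] *= 0.95 if load > capacity else 1.05
    return assignment, centers, gamma
```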
Scalable Edge Partitioning. Schlag et al. [
156] present a distributed algorithm to solve the edge partitioning problem. The edge partitioning problem asks for a partition
\(\Pi = \lbrace E_1, \ldots , E_k\rbrace\) of the edge set into
\(k\) blocks each containing roughly the same number of edges, while minimizing the vertex cut
\(\sum _{v \in V} \rho (v) - 1\) where
\(\rho (v) = |\lbrace E_i \in \Pi : I(v) \cap E_i \ne \emptyset \rbrace |\). They evaluated two methods to solve the problem. The first transforms the graph into its dual hypergraph representation (edges of the graph become vertices of the hypergraph and each vertex of the graph induces a net spanning its incident edges). Using a hypergraph partitioner that optimizes the connectivity metric to partition the vertex set directly optimizes the vertex cut of the underlying edge partitioning problem. The second method uses a distributed construction algorithm of the so-called
split-and-connect (SPAC) graph. For each vertex
\(u\), it inserts
\(d(u)\) auxiliary vertices into the SPAC graph and connects them to a cycle using auxiliary edges each with weight one. Each auxiliary vertex is a representative for exactly one incident edge of
\(u\). For each edge
\((u,v) \in E\), it adds an infinite-weight edge between the two auxiliary vertices representing \((u,v)\) at \(u\) and at \(v\). Thus, a partition of the vertex set of the SPAC graph cannot cut an edge connecting two representatives. Therefore, such a partition can be transformed into an edge partition by assigning each edge to the block of its representatives. In the evaluation, they compare both representations while using different graph and hypergraph partitioners. The results showed that parallel graph partitioners outperform distributed hypergraph partitioners. However, the sequential hypergraph partitioner KaHyPar [
154] produces significantly better vertex cuts than all other approaches (more than
\(20\%\) better than the best graph-based approach), but is an order of magnitude slower than the evaluated distributed algorithms.
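The SPAC construction can be sketched as follows for an undirected graph given as an edge list; auxiliary vertices are numbered consecutively per vertex, and INF stands in for the infinite edge weight (a sequential sketch of the construction, not the distributed algorithm itself).

```python
from collections import defaultdict

INF = float("inf")

def build_spac(n, edges):
    """Return the SPAC graph as (number of auxiliary vertices, weighted edge list).

    For every vertex u with degree d(u), d(u) auxiliary vertices are created and
    connected in a cycle with weight-1 edges; each auxiliary vertex represents one
    incident edge of u. The two representatives of an original edge (u, v) are
    connected by an infinite-weight edge."""
    incident = defaultdict(list)
    for e, (u, v) in enumerate(edges):
        incident[u].append(e)
        incident[v].append(e)

    aux_id = {}   # (vertex, incident edge) -> auxiliary vertex ID
    next_id = 0
    spac_edges = []
    for u in range(n):
        ids = []
        for e in incident[u]:
            aux_id[(u, e)] = next_id
            ids.append(next_id)
            next_id += 1
        # Connect the auxiliary vertices of u in a cycle with unit-weight edges.
        if len(ids) == 2:
            spac_edges.append((ids[0], ids[1], 1))
        elif len(ids) > 2:
            for i in range(len(ids)):
                spac_edges.append((ids[i], ids[(i + 1) % len(ids)], 1))
    for e, (u, v) in enumerate(edges):
        spac_edges.append((aux_id[(u, e)], aux_id[(v, e)], INF))
    return next_id, spac_edges
```

A vertex partition of this graph maps back to an edge partition by assigning each original edge to the block of its (uncut) pair of representatives.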
6.3 GPU
Due to their high computational power, modern GPUs have become an important tool for accelerating data-parallel applications. However, due to the highly irregular structure of graphs, it remains challenging to design graph algorithms that efficiently utilize the SIMD architecture of modern GPUs.
Multilevel Graph Partitioning. Goodarzi et al. [
67,
68] present two algorithms for GPU-based multilevel graph partitioning. Their earlier approach [
67] uses heavy-edge matching for coarsening and transfers the coarsest graph to the CPU for initial partitioning (using Mt-Metis [
96]). During refinement, vertices are distributed among threads and each thread finds the blocks maximizing the gain values of its assigned vertices. To prevent conflicting moves that worsen the edge cut in combination, refinement alternates between rounds in which only moves to blocks with increasing (respectively, decreasing) block IDs are considered. For each block, potential moves to the block are collected in a global buffer, which is then sorted, and the highest rated moves are executed.
Their later approach [
68] brings several improvements. First, the authors use Warp Segmentation [
103] to improve the efficiency of the heavy-edge matching computation during coarsening. Initial partitioning is then performed on the GPU using a greedy growing technique. During refinement, vertices are once more divided among threads, and each thread finds the blocks maximizing the gain of its assigned boundary vertices and collects the potential moves in a global buffer. Then, the algorithm finds the highest rated
\(\ell\) moves in the global buffer, for some small input constant
\(\ell\). Since moves might conflict with each other, their algorithm distributes all
\(2^{\ell }\) move combinations across thread groups and finds the best combination, which is then applied to the graph partition. This process is repeated until the global buffer is empty. On average, their GPU-based approach is approximately 1.9 times faster than Mt-Metis while computing slightly worse edge cuts across their benchmark set of 16 graphs.
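The conflict-resolution step can be illustrated as follows: given the \(\ell\) highest rated candidate moves, evaluate the edge-cut change of every one of the \(2^{\ell}\) subsets and apply the best one. The sketch below is sequential; on the GPU the subsets are distributed across thread groups, and the data structures here are illustrative.

```python
from itertools import combinations

def cut_delta(graph, weights, partition, moves):
    """Edge-cut change if the given set of (vertex -> target block) moves is applied jointly."""
    new_part = dict(partition)
    new_part.update(moves)
    touched = set(moves)
    delta = 0
    for v in touched:
        for u in graph[v]:
            if u in touched and u < v:
                continue  # count each edge between two moved vertices only once
            w = weights.get((v, u), 1)
            old_cut = partition[v] != partition[u]
            new_cut = new_part[v] != new_part[u]
            delta += (new_cut - old_cut) * w
    return delta

def best_combination(graph, weights, partition, candidates):
    """Try all 2^l subsets of the l candidate moves and return the best one."""
    best, best_delta = {}, 0
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates.items(), r):
            moves = dict(subset)
            d = cut_delta(graph, weights, partition, moves)
            if d < best_delta:
                best, best_delta = moves, d
    return best, best_delta
```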
In his PhD thesis, Fagginger Auer [
15] develops two multilevel algorithms running on a GPU: one that uses spectral refinement and one that uses greedy refinement. Later, Fagginger Auer and Bisseling [
53] present a fine-grained shared-memory parallel algorithm for graph coarsening and apply it in the context of graph clustering to obtain a fast greedy heuristic for maximizing modularity in weighted undirected graphs. The algorithm is suitable for both multi-core CPUs and GPUs. Later, Gilbert et al. [
63] present performance-portable graph coarsening algorithms. In particular, the authors study a GPU parallelization of the heavy-edge coarsening method. They evaluate their coarsening method using a multilevel spectral graph partitioning algorithm as the primary use case.
Spectral Graph Partitioning. The availability of efficient eigensolvers on GPUs has led to a recent re-emergence of spectral techniques for graph partitioning on GPU systems [
1,
2,
133]. These techniques were first developed by Donath and Hoffman [
50,
51] and Fiedler [
56] to compute graph bisections in the 1970s. Subsequently, these techniques have been improved [
20,
27,
83,
140,
164] and extended to partition a graph into more than two blocks using multiple eigenvectors [
8,
83].
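A minimal example of the classical spectral bisection idea underlying these partitioners is sketched below, using SciPy's sparse eigensolver on the graph Laplacian; the cited GPU partitioners instead use preconditioned solvers such as LOBPCG, and this sketch is only intended for small graphs.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_bisect(adjacency: sp.csr_matrix) -> np.ndarray:
    """Bisect a graph by thresholding the Fiedler vector (eigenvector of the
    second-smallest Laplacian eigenvalue) at its median, yielding two equal halves."""
    degrees = np.asarray(adjacency.sum(axis=1)).ravel()
    laplacian = sp.diags(degrees) - adjacency
    # Compute the two smallest eigenpairs; the second eigenvector is the Fiedler vector.
    # (For large graphs this naive "SM" mode may converge slowly.)
    _, vecs = eigsh(laplacian.asfptype(), k=2, which="SM")
    fiedler = vecs[:, 1]
    return (fiedler > np.median(fiedler)).astype(int)
```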
Naumov and Moon [
133] present an implementation of spectral graph partitioning for single GPU systems as part of the nvGRAPH library, whereas Acer et al. [
1,
2] propose the multi-GPU implementation Sphynx. Both partitioners precondition the matrix and use the LOBPCG [
107] eigenvalue solver. The eigenvectors are then used to embed the graph into a multidimensional coordinate space, which is then used to derive a partition of the graph. In nvGRAPH, this is done using a
\(k\)-means clustering algorithm on the embedded graph, whereas Sphynx uses the geometric graph partitioner Multi-Jagged [
45], which supports multi-GPU systems. Since the approach by Acer et al. outperforms nvGRAPH in terms of partition balance, cut size, and running time (when run on a single GPU system), we focus on their experimental evaluation. When Sphynx is compared against ParMETIS [
100], ParMETIS generally obtains significantly better cuts than Sphynx (approximately 20% lower cuts on regular and 70% lower cuts on irregular graph instances). On irregular graphs, the authors report a significant speedup of approximately 19 for Sphynx on 24 GPUs compared to ParMETIS with 168 MPI processes across four compute nodes, although ParMETIS is approximately three times faster than Sphynx on regular graphs even when using only a single CPU core for each GPU used by Sphynx. Additionally, Acer et al. report the influence of several matrix preconditioners on different classes of graphs.