Most of the initial work on graph partitioning involved sequential algorithms. These algorithms have been extended to work in distributed-memory environments, in particular for balancing processor workloads in parallel applications. In this use case, a distributed-memory application already has a distribution of the graph; for memory scalability, the entire graph is not stored in every processor. Thus, distributed-memory partitioning algorithms do not typically have a global view of the entire graph; they often make partitioning decisions based on partial views of local graph data. As a result, they can have lower solution quality than their sequential counterparts. Still, distributed-memory algorithms are crucial for graphs that are too large to fit in a single memory space, and for applications wishing to partition their data dynamically to adjust for changing computational workloads.
In recent years, progress has been made on shared-memory algorithms, as shared-memory architectures offer greater flexibility than distributed-memory architectures. For example, random memory accesses and atomic updates can be performed orders of magnitude faster than on distributed-memory machines. Because shared-memory algorithms have a global view of the graph, they can achieve the same solution quality as their sequential predecessors. They are not feasible, however, for extremely large graphs that do not fit in a single memory space.
6.2 Distributed Memory
Distributed (hyper)graph partitioning is one way to handle large inputs that do not fit into the main memory of a single machine. In the distributed-memory model, several processors (PEs) are interconnected via a communication network, and each has its private memory inaccessible to others. Computational tasks on each PE usually operate independently only on local data representing a small subset of the input. Intermediate computational results must be exchanged via dedicated network communication primitives.
Distributed (hyper)graph processing algorithms require that the vertices and edges of the input are partitioned among the processors. Since many applications use balanced (hyper)graph partitioning precisely to obtain a good initial assignment, distributed partitioners rely on much simpler techniques for this initial distribution. There exist range-based [
128] and hash-based partitioning techniques [
166,
167]. The former splits the vertex IDs into equidistant ranges, which are then assigned to the PEs; the latter assigns each vertex to a PE based on a hash of its ID. Since neither technique considers the (hyper)graph structure, they can lead to load imbalances or high communication overheads. However, one could also migrate vertices as more information about the structure of the (hyper)graph becomes available, e.g., when recursing on a subgraph obtained via recursive bipartitioning [
22,
45]. If geometric information is available, then one can also use space-filling curves [
16,
177].
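For illustration, the two simple initial-distribution schemes mentioned above can be sketched as follows, assuming vertices are identified by consecutive integer IDs; the function names are illustrative and not taken from any of the cited systems.

```python
def range_based_owner(vertex_id: int, num_vertices: int, num_pes: int) -> int:
    """Range-based distribution: split the ID space into equidistant ranges."""
    chunk = (num_vertices + num_pes - 1) // num_pes  # ceiling division
    return vertex_id // chunk

def hash_based_owner(vertex_id: int, num_pes: int) -> int:
    """Hash-based distribution: assign a vertex to a PE via a hash of its ID."""
    # Python's built-in hash is used purely for illustration; real systems use a
    # fixed hash function so that every PE computes the same owner independently.
    return hash(vertex_id) % num_pes

# Example: distribute 10 vertices across 3 PEs.
owners_range = [range_based_owner(v, 10, 3) for v in range(10)]
owners_hash = [hash_based_owner(v, 3) for v in range(10)]
```

Neither scheme inspects the (hyper)graph structure, which is exactly why it may produce load imbalance or high communication volume.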
Each PE then stores the vertices assigned to it and the edges incident to them. The edges stored on a PE can be incident to local vertices or vertices on other PEs (also called ghost or halo vertices). We say that a PE is adjacent to another PE if they share a common edge. Processors must be able to identify adjacent PEs to propagate updates, e.g., if we move a vertex to a different block, then we have to communicate that change to other PEs in the network such that local search algorithms can work on accurate partition information. However, each communication operation introduces overheads that can limit the scalability of the system. Thus, the main challenge in distributed (hyper)graph partitioning is keeping the global partition information on each PE in some sense up to date while simultaneously minimizing the required communication.
The remainder of this section describes the algorithmic core ideas of recent publications in that field and abstracts from the physical placement of the vertices and the actual representation of the distributed (hyper)graph data structure. However, we assume that each vertex knows on which PE its neighbors are stored.
Local Search. The label propagation heuristic is the most widely used local search algorithm in distributed systems [
46,
95,
97,
123,
128,
166,
170,
180]. Other approaches schedule sequential two-way FM [
55] on adjacent block pairs in parallel [
40,
92,
106]. However, this limits the available parallelism to at most the number of blocks
\(k\).
Parallel label propagation implementations mostly follow the
bulk synchronous parallel model. In a computation phase, each PE computes the desired target block of each of its local vertices. In the communication phase, updates are made visible to other PEs via personalized all-to-all communication [
128,
166]. Meyerhenke et al. [
128] use an asynchronous communication model: once the computation phase of a PE ends, it sends and receives updates to and from other PEs and immediately continues with the next round.
In the parallel setting, the move gains of two adjacent vertices may each suggest an improvement when the vertices are moved individually, but moving both simultaneously may worsen the solution quality. Therefore, some partitioners use a vertex coloring [
97] or a two-phase protocol where in the first phase, vertices can only move from a block
\(V_i\) to
\(V_j\) if
\(i \lt j\) and vice versa in the second phase [
46,
113,
170]. Many systems do not use any techniques to protect against move conflicts. This can be seen as an optimistic strategy assuming that conflicts rarely happen in practice.
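The following sketch illustrates the two-phase protocol described above for a simple edge-cut label propagation round; the graph representation and gain computation are simplified placeholders rather than code from any of the cited partitioners.

```python
from collections import Counter

def best_move(graph, partition, v):
    """Return the adjacent block with the largest connectivity to v (a simple gain proxy)."""
    counts = Counter(partition[u] for u in graph[v])
    target, _ = counts.most_common(1)[0]
    return target

def two_phase_round(graph, partition):
    """One label propagation round split into two phases to avoid conflicting moves.

    Phase 0 only allows moves from a block V_i to V_j with i < j,
    phase 1 only allows moves with i > j."""
    for phase in (0, 1):
        for v in graph:
            i, j = partition[v], best_move(graph, partition, v)
            if (phase == 0 and i < j) or (phase == 1 and i > j):
                # In a distributed setting this move would be communicated
                # to adjacent PEs after the phase.
                partition[v] = j
    return partition

# Toy example: a path graph 0-1-2-3 initially split in the middle.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
partition = {0: 0, 1: 0, 2: 1, 3: 1}
two_phase_round(graph, partition)
```

Note that this sketch ignores the balance constraint, which is discussed separately below.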
The
Social Hash Partitioner [
95] (Facebook’s internal hypergraph partitioner) also uses the label propagation heuristic to optimize
\(\text{fanout}(\Pi) := \frac{1}{|E|} \sum _{e \in E} \lambda (e)\) where
\(\Pi = \lbrace V_1,\ldots ,V_k\rbrace\) is a
\(k\)-way partition. The authors note that the label propagation algorithm can easily get stuck in local optima for fanout optimization and suggest a probabilistic version of the fanout metric, called
\(\text{p-fanout}(\Pi) := \frac{1}{|E|} \sum _{e \in E} \sum _{V_i \in \Pi } \left(1 - (1 - p)^{\mathrm{\Phi }(e, V_i)}\right)\) for some probability
\(p \in (0,1)\). The probabilistic fanout function samples the pins of each net with probability
\(p\) and represents the expected fanout for a family of similar hypergraphs. Thus, it should be more robust and reduce the impact of local minima.
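For illustration, the two objective functions translate directly into the following sketch, assuming the hypergraph is given as a list of nets (each a list of pins) together with a mapping from vertices to blocks, that \(\lambda(e)\) denotes the number of blocks connected by net \(e\), and that \(\Phi(e, V_i)\) denotes the number of pins of \(e\) in block \(V_i\).

```python
from collections import Counter

def fanout(nets, block_of):
    """fanout(Pi) = (1/|E|) * sum over nets of lambda(e),
    where lambda(e) is the number of blocks connected by net e."""
    return sum(len({block_of[pin] for pin in net}) for net in nets) / len(nets)

def p_fanout(nets, block_of, p):
    """Probabilistic fanout: each pin is sampled with probability p, so a block
    containing Phi(e, V_i) pins of net e is hit with probability 1 - (1-p)**Phi(e, V_i)."""
    total = 0.0
    for net in nets:
        pins_per_block = Counter(block_of[pin] for pin in net)
        total += sum(1.0 - (1.0 - p) ** phi for phi in pins_per_block.values())
    return total / len(nets)

# Toy hypergraph with two nets and a 2-way partition.
nets = [[0, 1, 2], [2, 3]]
block_of = {0: 0, 1: 0, 2: 1, 3: 1}
print(fanout(nets, block_of), p_fanout(nets, block_of, p=0.5))
```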
Other recently published distributed local search techniques are based on vertex swapping techniques that preserve the balance of the partition. Rahimian et al. [
143] present JA-BE-JA, which uses such an approach. The algorithm iterates over the local vertices of each PE and, for each vertex, considers all adjacent vertices as swap candidates. If no partner is found, it selects a random vertex from a sample as a candidate. If the selected vertex is assigned to a different PE, the initiating PE sends a request with all the required information such that the receiving PE can verify whether or not the swap operation would improve the edge cut. On success, both vertices change their blocks. Additionally, simulated annealing is used to avoid local minima.
Aydin et al. [
16] implement a distributed partitioner that computes a linear ordering of the vertices, which is then split into
\(k\) equally sized ranges to obtain an initial
\(k\)-way partition. The idea is similar to space-filling curves [
19,
138], but does not require geometric information. The initial ordering is computed by assigning labels to a tree constructed via agglomerative hierarchical clustering. Afterward, it sorts the labels of the leaves to obtain an initial ordering. To further improve the ordering, it solves the
minimum linear arrangement problem that tries to optimize
\(\sum _{(u,v) \in E} |\pi (u) - \pi (v)|\omega (u,v)\) where
\(\pi (u)\) denotes the position of
\(u \in V\) in the current ordering. To do so, it uses a two-stage MapReduce algorithm that is repeated until convergence: First, each vertex computes its desired new position as the weighted median of its neighbor’s positions. Second, the final positions are assigned to the vertices by resolving duplicates with simple ID-based ordering. The second local search algorithm performs vertex swaps. First, it pairs adjacent blocks of the partition. Then, it splits the vertices of each block into
\(r\) disjoint intervals and randomly pairs intervals between paired blocks. The paired sets are then mapped to the processors, which perform the following algorithm: each processor sorts the vertices in both sets by the cut reduction achieved if they were moved to the opposite block and swaps the vertices with the highest combined cut reduction.
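A sequential sketch of the weighted-median step of the linear arrangement refinement described above is given below, under the assumption that the graph fits in memory; duplicate positions are resolved by vertex ID, as in the text, and the weight lookup is a simplifying assumption.

```python
def weighted_median(values_and_weights):
    """Return the weighted median of (value, weight) pairs."""
    pairs = sorted(values_and_weights)
    half = sum(w for _, w in pairs) / 2.0
    acc = 0.0
    for value, weight in pairs:
        acc += weight
        if acc >= half:
            return value

def relaxation_round(adjacency, weights, position):
    """One round of the linear arrangement heuristic: every vertex moves to the
    weighted median of its neighbors' positions, then duplicates are broken by ID."""
    desired = {
        v: weighted_median([(position[u], weights.get((v, u), 1)) for u in adjacency[v]])
        for v in adjacency
    }
    # Assign final (unique) positions: sort by desired position, break ties by vertex ID.
    order = sorted(adjacency, key=lambda v: (desired[v], v))
    return {v: i for i, v in enumerate(order)}
```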
Balance Constraint. The label propagation algorithm only knows the exact block weights at the beginning of each computation phase. In the computation phase, block weights are only maintained locally. In the communication phase, the combination of all moves may result in a partition that violates the balance constraint. Thus, partitioners based on this scheme have to employ techniques to ensure balance.
The distributed multilevel graph partitioner ParHIP [
128] divides a label propagation round into subrounds and restores the exact block weights with an All-Reduce operation after each subround. Note that this does not guarantee balance but gives a good approximation of the block weights when the number of moved vertices in a subround is small.
Slota et al. [
166] implemented a distributed graph partitioner that alternates between a balance and refinement phase, both utilizing the label propagation algorithm. In the refinement phase, each PE maintains approximate block weights
\(a(V_i) := c(V_i) + \gamma \Delta (V_i)\) where
\(c(V_i)\) is the weight of block
\(V_i\) at the beginning of the computation phase,
\(\Delta (V_i)\) is the net weight of the vertices that locally moved out of or into block
\(V_i\), and
\(\gamma\) is a tuning parameter that depends on the number of PEs. Each PE then ensures locally that
\(a(V_i) \le L_{\max }\) for all
\(i \in \lbrace 1, \ldots , k\rbrace\). In the balancing phase, the gain of moving a vertex to block
\(V_i\) is multiplied by
\(\frac{L_{\max }}{a(V_i)}\). As a consequence, moves to underloaded blocks become more attractive. In a subsequent publication [
167], the approach is generalized to the multi-constraint partitioning problem, where each vertex is associated with multiple weights.
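A small sketch of how a PE might maintain the approximate block weights \(a(V_i)\) and bias gains towards underloaded blocks in the balancing phase, following the formulas above; the data structures are illustrative.

```python
class LocalBlockWeights:
    def __init__(self, global_weights, gamma, l_max):
        self.c = list(global_weights)             # block weights at the start of the phase
        self.delta = [0.0] * len(global_weights)  # local net weight change per block
        self.gamma = gamma                        # tuning parameter (depends on #PEs)
        self.l_max = l_max                        # maximum allowed block weight

    def approx_weight(self, i):
        # a(V_i) := c(V_i) + gamma * Delta(V_i)
        return self.c[i] + self.gamma * self.delta[i]

    def can_move(self, v_weight, dst):
        # Locally enforce a(V_dst) <= L_max after the move.
        return self.approx_weight(dst) + v_weight <= self.l_max

    def apply_move(self, v_weight, src, dst):
        self.delta[src] -= v_weight
        self.delta[dst] += v_weight

    def balancing_gain(self, raw_gain, dst):
        # In the balancing phase, gains are scaled by L_max / a(V_dst),
        # which makes moves to underloaded blocks more attractive.
        return raw_gain * self.l_max / self.approx_weight(dst)
```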
Recently, probabilistic methods were proposed that preserve the balance in expectation [
95,
123]. The Social Hash Partitioner [
95] aggregates the number of vertices
\(S_{i,j}\) that want to move from block
\(V_i\) to
\(V_j\) after each computation phase at a dedicated master process. Then, a vertex in block
\(V_i\) is moved to its desired target block
\(V_j\) with probability
\(\frac{\min (S_{i,j}, S_{j,i})}{S_{i,j}}\). This ensures that the expected number of vertices that move from block
\(V_i\) to
\(V_j\) and vice versa is the same and, thus, preserves the balance of the partition in expectation. However, each PE moves its highest ranked vertices with probability one and all remaining probabilistically. Martella et al. [
123] move a vertex
\(u\) to its desired target block
\(V_j\) with probability
\(\frac{L_{\max } - c(V_j)}{M_j}\) where
\(M_j\) is the number of vertices that want to move to block
\(V_j\). The advantage of the probabilistic method is that only the number of vertices preferring a different block needs to be communicated instead of all individual moves.
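The two probabilistic acceptance rules can be summarized as follows, viewed from the coordinating step after a computation phase; this is a sketch, not code from the cited systems.

```python
import random

def social_hash_accept(src, dst, S):
    """Social Hash rule: a vertex that wants to move from block src to dst is moved
    with probability min(S[src][dst], S[dst][src]) / S[src][dst], so the expected
    flow in both directions is equal and balance is preserved in expectation."""
    if S[src][dst] == 0:
        return False
    prob = min(S[src][dst], S[dst][src]) / S[src][dst]
    return random.random() < prob

def weight_capped_accept(dst, block_weight, l_max, M):
    """Martella et al.: a vertex is moved to its desired block dst with probability
    (L_max - c(V_dst)) / M_dst, where M_dst is the number of vertices that want
    to move to dst."""
    if M[dst] == 0:
        return False
    prob = max(0.0, (l_max - block_weight[dst]) / M[dst])
    return random.random() < prob
```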
Multilevel Algorithms. Although it is widely known that multilevel algorithms produce better partitions than flat partitioning schemes, the systems used in industry, e.g., at Google [
16] or Facebook [
95,
123], are primarily non-multilevel algorithms. The main reason for this is that the scalability of multilevel algorithms is often limited to a few hundred processors [
92,
106]. Furthermore, most parallel multilevel systems implement matching-based coarsening algorithms [
40,
46,
92,
106,
170,
181] that are not capable of efficiently reducing the size of today's complex networks (which exhibit power-law node degree distributions). The most prominent distributed multilevel algorithms are Jostle [
181], ParMetis [
97], PT-Scotch [
40], KaPPa [
92], ParHIP [
128] and ScalaPart [
106] for graph, and Parkway [
170] and Zoltan [
46] for hypergraph partitioning.
Meyerhenke et al. [
128] build the parallel multilevel partitioner ParHIP that uses a parallel version of the size-constrained label propagation algorithm [
126]. The algorithm is used to compute a clustering in the coarsening phase and as a local search algorithm in the refinement phase. To obtain an initial partition of the coarsest graph, it uses the distributed evolutionary graph partitioner KaFFPaE [
149]. On complex networks, ParHIP computes edge cuts
\(38\%\) smaller than those of ParMetis [
97] on average, while it is also more than a factor of two faster.
Wang et al. [
182] use a similar approach that also utilizes the label propagation algorithm to compute a clustering in the coarsening phase. The algorithm is implemented on top of Microsoft’s Trinity graph engine [
161]. The partitioner additionally uses external memory techniques to partition large graphs on a small number of machines. However, it does not perform multilevel refinement (the initial partition is projected to the input graph).
Geometric Partitioners. Many graphs are derived from geometric applications and are enriched with coordinate information (e.g., each vertex is associated with a
\(d\)-dimensional point). A mesh with coordinate information and a partition based on these coordinates is shown in Figure
7. Geometric partitioning techniques use this information to partition the corresponding point set into
\(k\) equally sized clusters while minimizing an objective function defined on the clusters. The objective function should be chosen such that it implicitly optimizes the desired graph partitioning metric (e.g., the sum of the lengths of all bounding boxes approximates the total communication volume [
45]). Since geometric methods ignore the underlying structure of the (hyper)graph, the quality of the partitions is often inferior to that of traditional multilevel algorithms. However, geometric algorithms are often simpler, which leads to faster and more scalable implementations. Prominent techniques use space-filling curves [
19,
138] that map a set of
\(d\)-dimensional points to a one-dimensional line. A fundamental property of this curve is that points that are close on the line are also close in the original space. Other approaches recursively divide the space via cutting planes such as Octree-based partitioning [
129], recursive coordinate bisection [
22,
164], and recursive inertial bisection [
169,
184]. The MultiJagged algorithm of Deveci et al. [
45] uses multisection rather than bisection to reduce the depth of the recursion and speed up computation relative to recursive coordinate bisection; its hybrid implementation uses MPI and Kokkos [
171] to support both distributed-memory message passing between PEs and multithreading or GPU computation within PEs.
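As a concrete example of the space-filling curve approach, the following sketch partitions 2D points into \(k\) blocks by sorting them along a Morton (Z-order) curve and splitting the order into equally sized ranges; the bit width and helper names are illustrative assumptions.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) code: interleave the bits of the two coordinates."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def sfc_partition(points, k, bits: int = 16):
    """Map points in [0,1)^2 to an integer grid, sort them by their Morton code,
    and split the resulting order into k equally sized blocks."""
    scale = (1 << bits) - 1
    order = sorted(
        range(len(points)),
        key=lambda i: interleave_bits(int(points[i][0] * scale),
                                      int(points[i][1] * scale), bits),
    )
    block_size = (len(points) + k - 1) // k
    partition = [0] * len(points)
    for rank, idx in enumerate(order):
        partition[idx] = rank // block_size
    return partition
```

Points that are close along the curve tend to be close in space, so each contiguous range forms a geometrically compact block.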
Recently, Von Looz et al. [
177] presented a scalable balanced
\(k\)-means algorithm to partition geometric graphs. The
\(k\)-means problem asks for a partition of a point set
\(P\) into
\(k\) roughly equally sized clusters such that the sum of the squared distances of each point to the mean of its cluster (in the following also referred to as the center of the cluster) is minimized. Clusters obtained with this problem definition tend to have better shapes than those computed with previous methods and also produce better partitions when measured with graph metrics [
125]. They present a parallel implementation of Lloyd’s greedy algorithm [
118] that repeats the following steps until convergence. First, each point
\(p \in P\) is assigned to the cluster that minimizes the distance of
\(p\) to its center. Afterwards, the center of each cluster is updated by calculating the arithmetic mean of all points assigned to it. To achieve balanced cluster sizes, an influence factor
\(\gamma _c\) is introduced individually for each cluster
\(c\) and
\(\text{eff}\_\text{dist}(p, \text{center}(c)) := \text{dist}(p, \text{center}(c)) / \gamma _c\) is used as the distance of point
\(p\) to the center of a cluster
\(c\). If a cluster
\(c\) becomes overloaded, then the influence factor
\(\gamma _c\) is decreased; otherwise, it is increased. Thus, underloaded clusters become more attractive. The implementation replicates the cluster centers and influence factors globally and after each computation phase, it updates the values via a parallel sum operation. To obtain an initial solution, it sorts the points according to the index on a space-filling curve and splits the order into
\(k\) equally sized clusters. Furthermore, it establishes a lower bound for the distance of each point to its second-closest cluster, which allows the algorithm to skip expensive distance computations for most of the points. Additionally, each processor sorts the cluster centers according to their distances to a bounding box around the process-local points. Evaluating the target clusters in order of increasing distance allows the algorithm to abort early once the minimum distance of the remaining clusters exceeds the distance of the best candidate found so far.
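A compact sequential sketch of the balanced k-means idea described above, using the effective distance \(\text{dist}/\gamma_c\) and a simple multiplicative update of the influence factors; the update constants are illustrative assumptions, not the exact scheme of the cited implementation.

```python
import math

def balanced_kmeans_round(points, centers, gamma, capacity):
    """Assign each point to the cluster minimizing dist/gamma, then update
    centers (arithmetic mean) and influence factors based on cluster load."""
    k = len(centers)
    assignment = []
    for p in points:
        eff = [math.dist(p, centers[c]) / gamma[c] for c in range(k)]
        assignment.append(min(range(k), key=lambda c: eff[c]))

    # Recompute each center as the arithmetic mean of its assigned points.
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if members:
            centers[c] = tuple(sum(x) / len(members) for x in zip(*members))

    # Overloaded clusters become less attractive, underloaded ones more attractive.
    for c in range(k):
        load = sum(1 for a in assignment if a == c)
        gamma[c] *= 0.95 if load > capacity else 1.05
    return assignment, centers, gamma
```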
Scalable Edge Partitioning. Schlag et al. [
156] present a distributed algorithm to solve the edge partitioning problem. The edge partitioning problem asks for a partition
\(\Pi = \lbrace E_1, \ldots , E_k\rbrace\) of the edge set into
\(k\) blocks each containing roughly the same number of edges, while minimizing the vertex cut
\(\sum _{v \in V} \rho (v) - 1\) where
\(\rho (v) = |\lbrace E_i \in \Pi : I(v) \cap E_i \ne \emptyset \rbrace |\). They evaluated two methods to solve the problem. The first transforms the graph into its dual hypergraph representation (edges of the graph become vertices of the hypergraph and each vertex of the graph induces a net spanning its incident edges). Using a hypergraph partitioner that optimizes the connectivity metric to partition the vertex set directly optimizes the vertex cut of the underlying edge partitioning problem. The second method uses a distributed construction algorithm of the so-called
split-and-connect (SPAC) graph. For each vertex
\(u\), it inserts
\(d(u)\) auxiliary vertices into the SPAC graph and connects them to a cycle using auxiliary edges each with weight one. Each auxiliary vertex is a representative for exactly one incident edge of
\(u\). For each edge
\((u,v) \in E\), it adds an infinite-weight edge between the two auxiliary vertices representing \((u,v)\) at \(u\) and at \(v\). Thus, a partition of the vertex set of the SPAC graph cannot cut an edge connecting two representatives. Therefore, such a partition can be transformed into an edge partition by assigning each edge to the block of its representatives. In the evaluation, they compare both representations while using different graph and hypergraph partitioners. The results showed that parallel graph partitioners outperform distributed hypergraph partitioners. However, the sequential hypergraph partitioner KaHyPar [
154] produces significantly better vertex cuts than all other approaches (more than
\(20\%\) better than the best graph-based approach), but is an order of magnitude slower than the evaluated distributed algorithms.
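The SPAC construction can be sketched as follows for an undirected graph given as an edge list; auxiliary vertices are numbered consecutively per vertex, and INF stands in for the infinite edge weight (a sequential sketch of the construction, not the distributed algorithm itself).

```python
from collections import defaultdict

INF = float("inf")

def build_spac(n, edges):
    """Return the SPAC graph as (number of auxiliary vertices, weighted edge list).

    For every vertex u with degree d(u), d(u) auxiliary vertices are created and
    connected in a cycle with weight-1 edges; each auxiliary vertex represents one
    incident edge of u. The two representatives of an original edge (u, v) are
    connected by an infinite-weight edge."""
    incident = defaultdict(list)
    for e, (u, v) in enumerate(edges):
        incident[u].append(e)
        incident[v].append(e)

    aux_id = {}   # (vertex, incident edge) -> auxiliary vertex ID
    next_id = 0
    spac_edges = []
    for u in range(n):
        ids = []
        for e in incident[u]:
            aux_id[(u, e)] = next_id
            ids.append(next_id)
            next_id += 1
        # Connect the auxiliary vertices of u in a cycle with unit-weight edges.
        if len(ids) == 2:
            spac_edges.append((ids[0], ids[1], 1))
        elif len(ids) > 2:
            for i in range(len(ids)):
                spac_edges.append((ids[i], ids[(i + 1) % len(ids)], 1))
    for e, (u, v) in enumerate(edges):
        spac_edges.append((aux_id[(u, e)], aux_id[(v, e)], INF))
    return next_id, spac_edges
```

A vertex partition of this graph maps back to an edge partition by assigning each original edge to the block of its (uncut) pair of representatives.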
6.3 GPU
Due to their high computational power, modern GPUs have become an important tool for accelerating data-parallel applications. However, due to the highly irregular structure of graphs, it remains challenging to design graph algorithms that efficiently utilize the SIMD architecture of modern GPUs.
Multilevel Graph Partitioning. Goodarzi et al. [
67,
68] present two algorithms for GPU-based multilevel graph partitioning. Their earlier approach [
67] uses heavy-edge matching for coarsening and transfers the coarsest graph to the CPU for initial partitioning (using Mt-Metis [
96]). During refinement, vertices are distributed among threads and each thread finds the blocks maximizing the gain values of its assigned vertices. To prevent conflicting moves that worsen the edge cut in combination, refinement alternates between rounds in which only moves to blocks with increasing (respectively, decreasing) block IDs are considered. For each block, potential moves to the block are collected in a global buffer, which is then sorted, and the highest rated moves are executed.
Their later approach [
68] brings several improvements. First, the authors use Warp Segmentation [
103] to improve the efficiency of the heavy-edge matching computation during coarsening. Initial partitioning is then performed on the GPU using a greedy growing technique. During refinement, vertices are once more divided among threads, and each thread finds the blocks maximizing the gain of its assigned boundary vertices and collects the potential moves in a global buffer. Then, the algorithm finds the highest rated
\(\ell\) moves in the global buffer, for some small input constant
\(\ell\). Since moves might conflict with each other, their algorithm distributes all
\(2^{\ell }\) move combinations across thread groups and finds the best combination, which is then applied to the graph partition. This process is repeated until the global buffer is empty. On average, their GPU-based approach is approximately 1.9 times faster than Mt-Metis while computing slightly worse edge cuts across their benchmark set of 16 graphs.
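The conflict-resolution step can be illustrated as follows: given the \(\ell\) highest rated candidate moves, evaluate the edge-cut change of every one of the \(2^{\ell}\) subsets and apply the best one. The sketch below is sequential; on the GPU the subsets are distributed across thread groups, and the data structures here are illustrative.

```python
from itertools import combinations

def cut_delta(graph, weights, partition, moves):
    """Edge-cut change if the given set of (vertex -> target block) moves is applied jointly."""
    new_part = dict(partition)
    new_part.update(moves)
    touched = set(moves)
    delta = 0
    for v in touched:
        for u in graph[v]:
            if u in touched and u < v:
                continue  # count each edge between two moved vertices only once
            w = weights.get((v, u), 1)
            old_cut = partition[v] != partition[u]
            new_cut = new_part[v] != new_part[u]
            delta += (new_cut - old_cut) * w
    return delta

def best_combination(graph, weights, partition, candidates):
    """Try all 2^l subsets of the l candidate moves and return the best one."""
    best, best_delta = {}, 0
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates.items(), r):
            moves = dict(subset)
            d = cut_delta(graph, weights, partition, moves)
            if d < best_delta:
                best, best_delta = moves, d
    return best, best_delta
```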
In his PhD thesis, Fagginger Auer [
15] develops two multilevel algorithms running on a GPU: one that uses spectral refinement and one that uses greedy refinement. Later, Fagginger Auer and Bisseling [
53] present a fine-grained shared-memory parallel algorithm for graph coarsening and apply it in the context of graph clustering to obtain a fast greedy heuristic for maximizing modularity in weighted undirected graphs. The algorithm is suitable for both multi-core CPUs and GPUs. Later, Gilbert et al. [
63] present performance-portable graph coarsening algorithms. In particular, the authors study a GPU parallelization of the heavy-edge coarsening method. They evaluate their coarsening method using a multilevel spectral graph partitioning algorithm as the primary use case.
Spectral Graph Partitioning. The availability of efficient eigensolvers on GPUs has led to a recent re-emergence of spectral techniques for graph partitioning on GPU systems [
1,
2,
133]. These techniques were first developed by Donath and Hoffman [
50,
51] and Fiedler [
56] to compute graph bisections in the 1970s. Subsequently, these techniques have been improved [
20,
27,
83,
140,
164] and extended to partition a graph into more than two blocks using multiple eigenvectors [
8,
83].
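A minimal example of the classical spectral bisection idea underlying these partitioners is sketched below, using SciPy's sparse eigensolver on the graph Laplacian; the cited GPU partitioners instead use preconditioned solvers such as LOBPCG, and this sketch is only intended for small graphs.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_bisect(adjacency: sp.csr_matrix) -> np.ndarray:
    """Bisect a graph by thresholding the Fiedler vector (eigenvector of the
    second-smallest Laplacian eigenvalue) at its median, yielding two equal halves."""
    degrees = np.asarray(adjacency.sum(axis=1)).ravel()
    laplacian = sp.diags(degrees) - adjacency
    # Compute the two smallest eigenpairs; the second eigenvector is the Fiedler vector.
    # (For large graphs this naive "SM" mode may converge slowly.)
    _, vecs = eigsh(laplacian.asfptype(), k=2, which="SM")
    fiedler = vecs[:, 1]
    return (fiedler > np.median(fiedler)).astype(int)
```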
Naumov and Moon [
133] present an implementation of spectral graph partitioning for single GPU systems as part of the nvGRAPH library, whereas Acer et al. [
1,
2] propose the multi-GPU implementation Sphynx. Both partitioners precondition the matrix and use the LOBPCG [
107] eigenvalue solver. The eigenvectors are then used to embed the graph into a multidimensional coordinate space, which is then used to derive a partition of the graph. In nvGRAPH, this is done using a
\(k\)-means clustering algorithm on the embedded graph, whereas Sphynx uses the geometric graph partitioner Multi-Jagged [
45], which supports multi-GPU systems. Since the approach by Acer et al. outperforms nvGRAPH in terms of partition balance, cut size, and running time (when run on a single GPU system), we focus on their experimental evaluation. When Sphynx is compared against ParMETIS [
100], ParMETIS generally obtains significantly better cuts than Sphynx (approximately 20% lower cuts on regular and 70% lower cuts on irregular graph instances). On irregular graphs, the authors report a significant speedup of approximately 19 for Sphynx on 24 GPUs compared to ParMETIS with 168 MPI processes across four compute nodes, although ParMETIS is approximately three times faster than Sphynx on regular graphs even when using only a single CPU core for each GPU used by Sphynx. Additionally, Acer et al. report the influence of several matrix preconditioners on different classes of graphs.