Correction
1 May 2019: The PLOS ONE Staff (2019) Correction: Supervised and extended restart in random walks for ranking and link prediction in networks. PLOS ONE 14(5): e0216572. https://doi.org/10.1371/journal.pone.0216572
Abstract
Given a real-world graph, how can we measure relevance scores for ranking and link prediction? Random walk with restart (RWR) provides an excellent measure for this and has been applied to various applications such as friend recommendation, community detection, anomaly detection, etc. However, RWR suffers from two problems: 1) using the same restart probability for all the nodes limits the expressiveness of random walk, and 2) the restart probability needs to be manually chosen for each application without theoretical justification. We have two main contributions in this paper. First, we propose Random Walk with Extended Restart (RWER), a random walk based measure which improves the expressiveness of random walks by using a distinct restart probability for each node. The improved expressiveness leads to superior accuracy for ranking and link prediction. Second, we propose SuRe (Supervised Restart for RWER), an algorithm for learning the restart probabilities of RWER from a given graph. SuRe eliminates the need to heuristically and manually select the restart parameter for RWER. Extensive experiments show that our proposed method provides the best performance for ranking and link prediction tasks.
Citation: Jin W, Jung J, Kang U (2019) Supervised and extended restart in random walks for ranking and link prediction in networks. PLoS ONE 14(3): e0213857. https://doi.org/10.1371/journal.pone.0213857
Editor: Vince Grolmusz, Mathematical Institute, HUNGARY
Received: November 15, 2018; Accepted: March 2, 2019; Published: March 20, 2019
Copyright: © 2019 Jin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data and code are available from GitHub (https://github.com/datalabsnu/sure).
Funding: This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [2013-0-00179 to UK, Development of Core Technology for Context-aware Deep-Symbolic Hybrid Learning and Construction of Language Resources]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
How can we measure effective node-to-node proximities for graph mining applications such as ranking and link prediction? Measuring relevance (i.e., proximity or similarity) scores between nodes is a fundamental tool for many graph mining applications [1, 2, 3, 4, 5]. Among various relevance measures, Random Walk with Restart (RWR) [6] provides useful node-to-node relevance scores by considering global network structure [7] and intricate edge relationships [8]. RWR has been successfully exploited in a wide range of data mining applications such as ranking [9, 10, 11], link prediction [1, 2, 3, 12, 5], community detection [13, 14], anomaly detection [15], etc.
However, RWR faces two challenges in providing effective relevance scores. First, RWR assumes a fixed restart probability for all nodes, i.e., a random surfer jumps back to the query node with the same probability regardless of where the surfer is located. This assumption prevents the surfer from considering the query node’s preferences for other nodes, thereby limiting the expressiveness of random walk for measuring good relevance scores. Second, RWR requires users to heuristically select the restart probability without strong theoretical justification for the chosen value.
In this paper, we propose a novel relevance measure Random Walk with Extended Restart (RWER), an extended version of RWR, which reflects a query node’s preferences on relevance scores by allowing a distinct restart probability for each node. We also propose a supervised learning method SuRe (Supervised Restart for RWER) that automatically finds optimal restart probabilities of RWER from a given graph. Extensive experiments show that our method provides the best link prediction accuracy. Table 1 summarizes the strengths of SuRe compared to existing methods. Our main contributions are summarized as follows:
- Model. We propose Random Walk with Extended Restart (RWER), a new random walk model to improve the expressiveness of RWR. RWER allows each node to have a distinct restart probability so that the random surfer has a finer control on preferences for each node.
- Learning. We propose SuRe, an algorithm for learning the restart probabilities in RWER from data. SuRe automatically determines the best restart probabilities.
- Experiment. We empirically demonstrate that our proposed method improves accuracy in all datasets.
SuRe outperforms all learning methods in terms of accuracy, speed, scalability, and memory usage.
The source code of our method and datasets used in this paper are available at https://github.com/datalabsnu/sure. The rest of this paper is organized as follows. Section 2 presents a preliminary on RWR and defines the problem. Our proposed methods are described in Section 3. After presenting experimental results in Section 4, we provide a review on related works in Section 5. Lastly, we conclude in Section 6.
2 Preliminaries
In this section, we describe the preliminaries on Random Walk with Restart. Then, we formally define the problem handled in this paper. We use Aij or A(i, j) to denote the entry at the intersection of the i-th row and j-th column of matrix A. The i-th element of the vector x is denoted by xi. Table 2 lists the symbols used in this paper.
2.1 Random walk with restart
Random walk with restart (also known as Personalized PageRank, PPR with a single seed node) [9] measures each node’s proximity (relevance) w.r.t. a given query node s in a graph. RWR assumes a random surfer who starts at node s. The surfer moves to one of its neighboring nodes with probability 1 − c or restarts at node s with probability c. When the surfer moves from u to one of its neighbors, each neighbor v is selected with a probability proportional to the weight in the edge (u, v). The relevance score between seed node s and node u is the stationary probability that the surfer is at node u. If the score is large, we consider that nodes s and u are highly related.
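As a concrete illustration (not the implementation used in this paper), the following sketch computes RWR scores by power iteration on a small graph; the function name rwr and the toy adjacency matrix are assumptions made only for the example.

```python
import numpy as np

def rwr(A, seed, c=0.15, eps=1e-9, max_iter=1000):
    """RWR scores w.r.t. a single seed node; c is the same restart probability for every node."""
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                              # guard against division by zero for dangling nodes
    A_tilde = A / deg                                # row-normalized transition matrix
    q = np.zeros(n); q[seed] = 1.0                   # restart distribution concentrated on the seed
    r = q.copy()
    for _ in range(max_iter):
        r_new = (1 - c) * (A_tilde.T @ r) + c * q    # move with prob. 1 - c, restart with prob. c
        if np.abs(r_new - r).sum() < eps:
            break
        r = r_new
    return r_new

# toy usage: relevance of every node w.r.t. node 0
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(rwr(A, seed=0))
```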
Limitations.
RWR cannot consider a query node’s preferences for estimating relevance scores between the query node and other nodes. For example, suppose we compute relevance scores from the query node A to other nodes in a political blog network in Fig 1 where blue colored nodes (A, B, and C) are liberal blogs, red colored ones (G and H) are conservative, black ones (D, E, F, and I) are not labeled, and an edge between nodes indicates a hyperlink between the corresponding blogs. Based on the topology of the graph, we consider that nodes E and F tend to be moderate, node D is likely to be liberal, and node I is likely to be conservative. Since the query node A is a liberal blog, node A will prefer other liberal nodes to conservative nodes. However, a conservative node G is ranked higher than nodes related to liberal blogs such as nodes C and D in the ranking result of RWR as shown in the left table of Fig 1. The reason is that preferences are not considered in RWR, and the random surfer jumps back to the query node A with a fixed restart probability c wherever the surfer is. On the other hand, RWER reflects the query node’s preferences on relevance scores by allowing a distinct restart probability for each node as shown in the right table of Fig 1.
RWR uses a fixed restart probability (0.15 or 0.5) for all nodes, while our proposed RWER uses a distinct restart probability for each node.
Another practical problem is that it is non-trivial to set an appropriate value of the restart probability c for different applications since we need to manually choose c so that the restart probability provides optimal relevance scores for each application. RWR scores are highly affected by the restart probability; the ranking results of each restart probability (c = 0.15 and c = 0.5) are quite different as seen in the left table of Fig 1. In contrast, our learning method SuRe automatically determines the optimal restart probabilities for all nodes based on the query node’s preferences as well as relationships between nodes. The ranking result by our proposed RWER and SuRe is more desirable for the query node A (liberal) than those of RWR because many liberal blog nodes are ranked high in the ranking result as shown in the right table of Fig 1. The detailed descriptions of our proposed approaches RWER and SuRe are presented in Section 3.
2.2 Problem definition
We are given a graph with n nodes and m edges, a query node s, and side information from the query node. The side information contains a set of positive nodes P = {x1, …, xk} that s prefers, and a set of negative nodes N = {y1, …, yl} that s dislikes. Our task is to learn restart probabilities for all nodes such that relevance scores of the positive nodes are greater than those of the negative ones.
3 Proposed method
In this section, we first describe Random Walk with Extended Restart (RWER), our proposed model for extended restart probabilities. We then propose SuRe, an efficient algorithm for learning the restart probabilities.
3.1 Overview of Random Walk with Extended Restart
RWER is a novel relevance model reflecting a query node’s preferences on relevance scores. The main idea of RWER is that we introduce a restart probability vector each of whose entry corresponds to a restart probability at a node, so that the restart probabilities are related to the preferences for the nodes.
In RWER, the restart probability of each node is interpreted as the degree of boredom of the node w.r.t. the query node. That is, if the restart probability on a node is large, the surfer runs away from the current node to the query node (i.e., the surfer becomes bored at the node). On the other hand, if the restart probability of the node is small, the surfer desires to move around the node’s neighbors (i.e., the surfer has an interest in the node and its neighbors).
As depicted in Fig 1, each node has its own restart probability in our model RWER. The restart probabilities are determined by our supervised learning method SuRe (Section 3.5) from the query (liberal) node A, the positive (liberal) nodes B and C, and the negative (conservative) nodes G and H. Note that a ranking list where many liberal nodes are ranked high is desirable for the query node A since A is liberal. As shown in Fig 1, using distinct restart probabilities for each node by RWER provides more satisfactory rankings for the query node than using a single restart probability for all nodes by RWR. The restart probabilities of liberal nodes are smaller than those of conservative nodes, which implies that the random surfer prefers searching around the liberal nodes such as B and C while the surfer is likely to run away from the conservative nodes such as G and H.
One might think that it is enough to simply assign small restart probabilities to positive nodes and large restart probabilities to negative nodes for a desirable ranking. However, the restart probabilities should be determined also for unlabeled nodes, and the probabilities should reflect intricate relationships between nodes as well as the query node’s preferences. For example, the restart probability of node F in Fig 1 is relatively moderate because node F is located between a liberal node A and a conservative node G. Also, the restart probability of node D is small since node D is close to other liberal nodes B and C. Similarly, the restart probability of node I is large since node I is closely related to other conservative nodes G and H.
3.2 Formulation of Random Walk with Extended Restart
We formulate RWER in this section. We first explain the formulation using the example shown in Fig 2, and then present the general equations. In the example, the surfer goes to one of its neighbors or jumps back to the query node. To obtain the RWER probability ru at time t + 1, we should take into account the scores of the three in-neighbors i, j, and k at time t. Suppose the surfer is at node i at time t. The surfer can go to an out-neighbor through one of the two outgoing edges with probability 1 − ci. Note that every node has a distinct restart probability, and node i has restart probability ci in this case. Without the restart action, ru at time t + 1 in Fig 2 is defined as follows: ru(t+1) = ((1 − ci)/2) ri(t) + ((1 − cj)/|OUT_j|) rj(t) + ((1 − ck)/|OUT_k|) rk(t), where OUT_v denotes the set of out-neighbors of node v.
Each node has its own restart probability.
Also, the surfer on any node v jumps back to the query node with probability cv. Considering the restart action of the random surfer, the above equation is rewritten as follows: ru(t+1) = ((1 − ci)/2) ri(t) + ((1 − cj)/|OUT_j|) rj(t) + ((1 − ck)/|OUT_k|) rk(t) + 1(u = s) ∑_{v} cv rv(t), where 1(u = s) is 1 if u is the query node s; otherwise, it is 0. Note that the restart term is different from that of the traditional random walk with restart.
Based on the aforementioned example, the recursive equation of our model is defined as follows: ru = ∑_{i ∈ IN_u} ((1 − ci)/|OUT_i|) ri + 1(u = s) ∑_{v=1..n} cv rv (1) where IN_u is the set of in-neighbors of node u, and OUT_i is the set of out-neighbors of node i.
Eq (1) is expressed in the form of a matrix equation as follows: r = Ã⊤(I − diag(c))r + (c⊤r)q (2) where Ã = D⁻¹A is the row-normalized matrix of the adjacency matrix A, c is a restart vector whose i-th entry is ci, diag(c) is a diagonal matrix whose diag(c)ii = ci and other entries are 0, and q is a vector whose s-th element is 1 and all other elements are 0. The random surfer can also return to a set of nodes S = {s1, s2, …, s|S|}; in this case, q(si) = 1/|S| for i = 1, …, |S|. Notice that if c is a vector all of whose elements are the same, then RWER is equal to RWR (or PPR). The following lemma shows that Eq (2) can be represented as a closed form equation.
Lemma 1 The closed form w.r.t. r in Eq (2) is represented as follows: r = (I − B)⁻¹q (3) where B is Ã⊤(I − diag(c)) − q(1 − c)⊤, Ã is the row-normalized matrix, and 1 is an all-ones vector.
Proof 1 Note that c⊤r is a scalar; thus, (c⊤r)q = q(c⊤r) = (qc⊤)r. Hence, Eq (2) is represented as follows: r = Ã⊤(I − diag(c))r + (qc⊤)r. Since r is a probability vector and 1⊤r = 1, we have c⊤r = 1 − (1 − c)⊤r, and the above equation is written in the following equation: r = (Ã⊤(I − diag(c)) − q(1 − c)⊤)r + q = Br + q where B is Ã⊤(I − diag(c)) − q(1 − c)⊤. Finally, r is represented in the following closed form: r = (I − B)⁻¹q. Note that I − B is invertible (see the following Lemma 2).
Lemma 2 Suppose B is Ã⊤(I − diag(c)) − q(1 − c)⊤. Then, M = I − B is invertible.
Proof 2 Let C be I − diag(c), and X be I − Ã⊤C. Then, M = I − B is represented as follows: M = I − Ã⊤C + q(1 − c)⊤ = X + q(1 − c)⊤. Note that 0 ≤ (Ã⊤C)ij < 1 for 1 ≤ i, j ≤ n, and the largest eigenvalue of Ã⊤C is smaller than 1 since Ã is row-stochastic and C is sub-stochastic. In other words, X = I − Ã⊤C is an M-matrix [16]; thus, X⁻¹ exists and all entries of X⁻¹ are nonnegative. By the Sherman-Morrison lemma [17], (X + q(1 − c)⊤)⁻¹ is represented as follows: (X + q(1 − c)⊤)⁻¹ = X⁻¹ − X⁻¹q(1 − c)⊤X⁻¹ / (1 + (1 − c)⊤X⁻¹q). Since all entries of X⁻¹, 1 − c, and q are nonnegative, (1 − c)⊤X⁻¹q ≥ 0; thus, 1 + (1 − c)⊤X⁻¹q ≥ 1. Hence, the right side of the above equation exists, i.e., (X + q(1 − c)⊤)⁻¹ = M⁻¹ exists. I − B⊤ is also invertible, which is proved similarly to the proof for I − B.
Note that if c is given, the RWER vector r can be calculated using the closed form in Lemma 1. However, the computation using the closed form requires O(n3) time and O(n2) memory space due to the matrix inversion where n is the number of nodes; thus, this approach is impractical when we need to compute RWER scores in large-scale graphs. In order to avoid the heavy computational cost, we exploit an efficient iterative algorithm described in Section 3.3.
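For a small graph, the closed form in Lemma 1 can be checked directly as in the following sketch (illustrative only, assuming a dense NumPy matrix; the O(n³) cost is exactly why the iterative algorithm of Section 3.3 is preferred for large graphs).

```python
import numpy as np

def rwer_closed_form(A_tilde, c, s):
    """r = (I - B)^{-1} q with B = A_tilde^T (I - diag(c)) - q (1 - c)^T; O(n^3), small graphs only."""
    n = A_tilde.shape[0]
    q = np.zeros(n); q[s] = 1.0
    B = A_tilde.T @ np.diag(1.0 - c) - np.outer(q, 1.0 - c)
    return np.linalg.solve(np.eye(n) - B, q)
```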
3.3 Algorithm for Random Walk with Extended Restart
We present an iterative algorithm for computing RWER scores efficiently. Our algorithm is based on power iteration and comprises two phases: a normalization phase (Algorithm 1) and an iteration phase (Algorithm 2).
Normalization phase (Algorithm 1).
Our proposed algorithm first computes the out-degree diagonal matrix D of A (line 1). Then, the algorithm computes the row-normalized matrix Ã = D⁻¹A using D (line 2).
Iteration phase (Algorithm 2).
Our algorithm computes the RWER score vector r for the seed node s in the iteration phase. As described in Section 3.2, the vector q denotes a length-n starting vector whose entry at the index of the seed node is 1 and otherwise 0 (line 1). Our algorithm iteratively computes Eq (2) (line 3). We then compute the error δ between r′, the result in the previous iteration, and r (line 4). Next, we update r′ into r for the next iteration (line 5). The iteration stops when the error δ is smaller than a threshold ϵ (line 6).
Theoretical analysis.
We analyze the convergence of the iterative algorithm in Theorem 1 and the time complexity in Theorem 2. We assume that all the matrices considered are saved in a sparse format, such as the compressed column storage [18], which stores only non-zero entries, and that all the matrix operations exploit such sparsity by only considering non-zero entries.
Algorithm 1 Normalization phase of RWER
Input: adjacency matrix A
Output: row-normalized matrix Ã
1: compute a degree diagonal matrix D of A (i.e., Dii = ∑j Aij)
2: compute the row-normalized matrix Ã = D⁻¹A
3: return Ã
Algorithm 2 Iteration phase of RWER
Input: row-normalized matrix Ã, query node s, restart probability vector c, and error tolerance ϵ
Output: RWER score vector r
1: set the starting vector q from the seed node s
2: repeat
3: compute r′ ← Ã⊤(I − diag(c))r + (c⊤r)q
4: compute error, δ = ‖r′ − r‖
5: update r ← r′
6: until δ < ϵ
7: return r
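The two phases above can be sketched as follows with SciPy sparse matrices; the helper names normalize_rows and rwer are ours, and the code is an illustration of Algorithms 1 and 2 rather than the released implementation.

```python
import numpy as np
import scipy.sparse as sp

def normalize_rows(A):
    """Algorithm 1: row-normalize the adjacency matrix, i.e., A_tilde = D^{-1} A."""
    deg = np.asarray(A.sum(axis=1)).ravel().astype(float)
    inv_deg = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
    return sp.diags(inv_deg) @ sp.csr_matrix(A)

def rwer(A_tilde, s, c, eps=1e-9, max_iter=1000):
    """Algorithm 2: iterate r' = A_tilde^T (I - diag(c)) r + (c^T r) q until the error is below eps."""
    n = A_tilde.shape[0]
    q = np.zeros(n); q[s] = 1.0              # starting vector for the seed node s
    r = q.copy()
    W = A_tilde.T @ sp.diags(1.0 - c)        # fixed linear part of the update
    for _ in range(max_iter):
        r_new = W @ r + (c @ r) * q          # one power-iteration step of Eq (2)
        if np.linalg.norm(r_new - r, 1) < eps:
            break
        r = r_new
    return r_new
```

Calling rwer with a constant vector c recovers plain RWR, matching the remark after Eq (2).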
Theorem 1 (Convergence) Suppose the graph represented by Ã is irreducible and aperiodic. Then, the power iteration algorithm (Algorithm 2) for RWER converges.
Proof 3 Eq (2) is represented as follows: r = (Ã⊤(I − diag(c)) + qc⊤)r = Gr. Note that G is a column stochastic matrix. Moreover, G is irreducible since Ã is irreducible, and aperiodic due to the self-loop at node s caused by restart. Hence, r is the eigenvector corresponding to the principal eigenvalue of G, and the power iteration for r converges [19].
Theorem 2 (Time Complexity) The time complexity of Algorithm 2 is O(Tm) where T is the number of iterations, and m is the number of edges.
Proof 4 We assume that the number of edges is greater than that of nodes for simplicity. The iterative algorithm for computing RWER scores takes O(Tm) time since each iteration requires the sparse matrix-vector multiplication which takes O(m) time.
Theorem 2 indicates that our method in Algorithm 2 exhibits linear scalability w.r.t. the number of edges.
3.4 Cost function
Although our relevance measure RWER improves the expressiveness of RWR by introducing a distinct restart probability for each node, it is difficult to manually investigate the optimal restart probabilities for all nodes in large graphs. In this section, we define the cost function for finding the optimal restart probabilities.
As mentioned in Section 2.2, our goal is to set the optimal restart probabilities so that the relevance scores of positive nodes outweigh those of negative nodes. We define the following cost function: F(c) = (λ/2)‖c − o‖² + ∑_{x ∈ P} ∑_{y ∈ N} h(ry − rx) (4) where λ is a regularization parameter that controls the importance of the regularization term, o is a given origin vector, h is a loss function, and rx and ry are RWER scores of nodes x and y, respectively. The cost function is obtained from the pairwise differences between the RWER scores of positive and negative nodes. Given an increasing loss function h, F(c) is minimized as the scores of positive nodes are maximized and those of negative nodes are minimized. The origin vector o keeps the c vector from becoming too small, and serves as a model regularizer which helps avoid overfitting and thus improves accuracy, as we will see in Section 4.4. We set o to a constant vector all of whose elements are set to the same constant. We use the loss function h(x) = (1 + exp(−x/b))⁻¹ since this loss maximizes AUC of binary classification when b is small enough [20, 5]; in that case the loss function approaches a step function.
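As an illustration of Eq (4) (not the released code), the cost can be evaluated as below; we assume index lists P and N, an RWER score vector r, and the regularizer written as (λ/2)‖c − o‖², consistent with the gradient λ(c − o) in Eq (5).

```python
import numpy as np

def h(x, b=1e-2):
    """Loss h(x) = (1 + exp(-x/b))^{-1}; approaches a step function as b -> 0."""
    return 1.0 / (1.0 + np.exp(-x / b))

def cost(c, o, r, P, N, lam=1.0, b=1e-2):
    """F(c) = (lambda/2) ||c - o||^2 + sum_{x in P, y in N} h(r_y - r_x)."""
    reg = 0.5 * lam * np.sum((c - o) ** 2)
    diffs = r[np.asarray(N)][:, None] - r[np.asarray(P)][None, :]   # delta_yx = r_y - r_x for all pairs
    return reg + h(diffs, b).sum()
```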
3.5 SuRe—Optimizing the cost function
Our goal is to minimize Eq (4) with respect to c. Note that the objective function F(c) is not convex. Thus, we exploit the gradient descent method to find a local minimum of F(c). For this purpose, we first need to obtain the derivative of F(c) w.r.t. c: ∂F(c)/∂c = λ(c − o) + ∑_{x ∈ P} ∑_{y ∈ N} (∂h(δyx)/∂δyx)(∂ry/∂c − ∂rx/∂c) (5) where δyx is ry − rx. The derivative of the loss function is ∂h(x)/∂x = (1/b) exp(−x/b)/(1 + exp(−x/b))².
In order to obtain the derivative ∂rx/∂c, we have to calculate the derivative of the relevance score rx w.r.t. ci, the i-th element of c. Let M be (I − B)⁻¹; then, r = (I − B)⁻¹q = Mq, M(:, s) = r, and M(x, s) = rx, from Lemma 1.
Since M is the inverse of I − B, according to [21], ∂M/∂ci becomes: ∂M/∂ci = M(∂B/∂ci)M = M(q − Ã⊤ei)ei⊤M, where ∂B/∂ci = −Ã⊤Jii + qei⊤ = (q − Ã⊤ei)ei⊤, and Jij is a single-entry matrix whose (i, j)-th entry is 1 and all other elements are 0. Based on the above equation, ∂rx/∂ci is represented as follows: ∂rx/∂ci = ex⊤(∂M/∂ci)es = M(x,:)(q − Ã⊤ei)ri, where es is a length-n unit vector whose s-th entry is 1. Note that ∂rx/∂ci is calculated for 1 ≤ i ≤ n; then, ∂rx/∂c is written in the following equation: ∂rx/∂c = r ∘ ((M(x,:)q)1 − ÃM(x,:)⊤) (6) where ∘ denotes Hadamard product, and 1 is an all-ones vector of length n. Similarly, ∂ry/∂c is calculated by switching x to y. Using Eq (6), the summation ∑_{x ∈ P} ∑_{y ∈ N} (∂h(δyx)/∂δyx)(∂ry/∂c − ∂rx/∂c) in Eq (5) is represented as follows: ∑_{x ∈ P} ∑_{y ∈ N} (∂h(δyx)/∂δyx)(∂ry/∂c − ∂rx/∂c) = r ∘ ((w⊤r)1 − ÃM⊤w) (7) where r = M(:, s) is an RWER score vector, and w = ∑_{x ∈ P} ∑_{y ∈ N} (∂h(δyx)/∂δyx)(ey − ex). Then, ∂F(c)/∂c in Eq (5) is represented as follows: ∂F(c)/∂c = λ(c − o) + r ∘ ((w⊤r)1 − ÃM⊤w) (8)
Algorithm 3 SuRe—Learning a restart vector c
Input: adjacency matrix A, query node s, positive set P, negative set N, origin vector o, parameter b of loss function h, and the learning rate η
Output: the learned restart vector c
1: randomly initialize c
2: while c does not converge do
3: compute r = M(:, s) based on Eq (2) (Algorithm 2)
4: compute M⊤w by a linear system solver (Lemma 3)
5: compute ∂F(c)/∂c by Eq (8)
6: update c ← c − η · ∂F(c)/∂c
7: end while
8: return the learned restart vector c
Notice that we do not obtain M explicitly to compute M(:, s) in Eq (8) since M is the inverse of I − B and inverting a large matrix is infeasible as mentioned in Section 3.2. Instead, we use the iterative method described in Algorithm 2 to get r = M(:, s). However, the problem is that we also require rows of M (i.e., M(x,:) in ∂rx/∂c), and Algorithm 2 only computes a column of M for a given seed node. How can we calculate M⊤w without inverting M? M⊤w is computed iteratively by the following lemma:
Lemma 3 From the result of Eq (7), we need M⊤w, which is represented as M⊤w = ∑_{x ∈ P} ∑_{y ∈ N} (∂h(δyx)/∂δyx) M⊤(ey − ex), where w = ∑_{x ∈ P} ∑_{y ∈ N} (∂h(δyx)/∂δyx)(ey − ex), ex is an n × 1 vector whose x-th element is 1 and the others are 0, and δyx = ry − rx. Then, M⊤w is the solution of the linear system (I − B⊤)(M⊤w) = w, which is solved by an iterative method for linear systems.
Proof 5 M(x,:)⊤ is a column vector which is the transpose of the x-th row of the matrix M. In other words, M(x,:)⊤ = M⊤ex where ex is an n × 1 vector whose x-th element is 1 and the others are 0. Then, M⊤w is represented as follows: M⊤w = ∑_{x ∈ P} ∑_{y ∈ N} (∂h(δyx)/∂δyx)(M(y,:)⊤ − M(x,:)⊤) where w = ∑_{x ∈ P} ∑_{y ∈ N} (∂h(δyx)/∂δyx)(ey − ex). Also, M⊤w is represented as the following linear system: (I − B⊤)(M⊤w) = w, where M⁻⊤ = I − B⊤.
Note that M−⊤ = I − B⊤ is non-symmetric and invertible (Lemma 2); thus, any iterative method for a non-symmetric matrix can be used to solve for . We use GMRES [22], an iterative method for solving linear systems since it is the state-of-the-art method in terms of efficiency and accuracy.
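A hedged sketch of this step using SciPy's GMRES is shown below; it forms (I − B⊤) implicitly as a LinearOperator from Ã, c, and q, and solves for M⊤w. The helper name solve_Mt_w is an assumption made for illustration, not part of the released code.

```python
import numpy as np
from scipy.sparse.linalg import gmres, LinearOperator

def solve_Mt_w(A_tilde, c, q, w):
    """Solve (I - B^T) p = w for p = M^T w, with B = A_tilde^T (I - diag(c)) - q (1 - c)^T."""
    n = A_tilde.shape[0]
    omc = 1.0 - c                                   # the vector 1 - c
    def matvec(p):
        # (I - B^T) p = p - (I - diag(c)) (A_tilde p) + (1 - c) (q^T p)
        return p - omc * (A_tilde @ p) + omc * (q @ p)
    p, info = gmres(LinearOperator((n, n), matvec=matvec), w)   # default GMRES tolerances
    if info != 0:
        raise RuntimeError("GMRES did not converge")
    return p
```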
Optimization phase (Algorithm 3).
The SuRe algorithm for solving the optimization problem is summarized in Algorithm 3 and Fig 3. In the algorithm, we use a gradient-based method to update the restart probability vector c based on Eq (8).
SuRe learns restart probability vector c, and RWER computes our node relevance score vector r for a given seed node s.
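Putting the pieces together, a minimal gradient-descent loop in the spirit of Algorithm 3 could look as follows; it reuses the hypothetical helpers rwer and solve_Mt_w sketched earlier, and the clipping of c to (0, 1) is our simplification for the example rather than part of the paper.

```python
import numpy as np

def dh(x, b=1e-2):
    """Derivative of the loss h(x) = (1 + exp(-x/b))^{-1}."""
    s = 1.0 / (1.0 + np.exp(-x / b))
    return s * (1.0 - s) / b

def sure(A_tilde, s, P, N, o, lam=1.0, b=1e-2, eta=0.1, n_steps=100):
    n = A_tilde.shape[0]
    P, N = np.asarray(P), np.asarray(N)
    q = np.zeros(n); q[s] = 1.0
    c = np.clip(np.random.rand(n), 0.05, 0.95)             # line 1: random initialization
    for _ in range(n_steps):                                # line 2: a convergence test on c would normally stop this loop
        r = rwer(A_tilde, s, c)                             # line 3: RWER scores via Algorithm 2
        gh = dh(r[N][:, None] - r[P][None, :], b)           # h'(delta_yx) for every (x, y) pair
        w = np.zeros(n)
        np.add.at(w, N, gh.sum(axis=1))                     # w = sum_{x,y} h'(delta_yx) (e_y - e_x)
        np.add.at(w, P, -gh.sum(axis=0))
        Mt_w = solve_Mt_w(A_tilde, c, q, w)                 # line 4: M^T w by GMRES (Lemma 3)
        grad = lam * (c - o) + r * ((w @ r) - A_tilde @ Mt_w)   # line 5: gradient of Eq (8)
        c = np.clip(c - eta * grad, 1e-3, 1.0 - 1e-3)       # line 6: gradient step, kept inside (0, 1)
    return c
```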
3.6 Theoretical analysis
We analyze the time complexity of SuRe (Algorithm 3).
Lemma 4 Let |P| and |N| denote the number of positive and negative nodes, respectively. The computation of M⊤w takes O(Tm + |P||N|) time where T is the number of iterations until convergence, and m is the number of edges.
Proof 6 Note that solving a sparse linear system Ax = b with an iterative method such as GMRES [22] requires O(T|A|) time where T is the number of iterations, and |A| is the number of non-zeros of A. Hence, it takes O(T|I − B⊤|) = O(Tm) time to solve the linear system by the iterative method, where the number of non-zeros of I − B⊤ is bounded by O(m). In addition, setting w takes O(|P||N|) time. Thus, the overall time complexity for computing M⊤w is O(Tm + |P||N|).
Based on Lemma 4, the time complexity of Algorithm 3 is presented in Theorem 3.
Theorem 3 (Time complexity of Algorithm 3) For a given graph with m non-zero elements, the learning algorithm SuRe takes O(T1(T2 m + |P||N|)) time where T1 is the number of times c is updated with the gradient, and T2 is the number of inner iterations for computing r and .
Proof 7 The computation of r takes O(T′m) time according to Theorem 2. The computation of M⊤w takes O(T″m + |P||N|) time according to Lemma 4. Besides, the computation of ∂F(c)/∂c in Eq (8) takes additional O(m) time due to the sparse matrix-vector multiplication ÃM⊤w. Hence, SuRe takes O(T1(T2m + |P||N|)) time where T1 is the number of iterations of the gradient descent procedure required for the restart vector to converge, and T2 = T′ + T″.
Theorem 3 implies that our learning method SuRe in Algorithm 3 provides linear time and space scalability w.r.t. the number m of edges. Notice that |P| and |N| are constants much smaller than m.
3.7 Discussion on restart probabilities
The restart vector c can be considered as a function of attributes, e.g., ci = H(wi⊤ai), where ai and wi are an attribute vector and a weight vector of node i, respectively. We can incorporate rich attributes into the attribute vector from a given graph; for instance, node attributes such as each node’s degree and the number of common neighbors can be utilized. H(x) can be an exponential function exp(x) or a logistic function 1/(1 + exp(−x/d)). In this setting, finding good attributes is a key factor for achieving the best performance. However, learning restart probabilities that are independent of attributes, as we do, shows better performance due to the lack of informative attributes in our problem setting. Future work includes learning the restart probabilities as a function of attributes.
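As a toy sketch of this attribute-based parameterization (hypothetical feature matrix feat and a single weight vector wgt shared across nodes for simplicity; this is not the model evaluated in this paper):

```python
import numpy as np

def restart_from_attributes(feat, wgt, d=1.0):
    """c_i = H(w . a_i) with a logistic H, mapping node attributes to restart probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(feat @ wgt) / d))
```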
4 Experiment
We evaluate our proposed method with various baseline approaches. Since there is no ground-truth of node-to-node relevance scores in real-world graphs, we instead evaluate the performance of two representative applications based on relevance scores: ranking and link prediction. Based on these settings, we aim to answer the following questions:
- Q1. Ranking performance (Section 4.2). Does our proposed method SuRe provide the best relevance scores for ranking compared to other methods?
- Q2. Link prediction performance (Section 4.3). How effective is SuRe for link prediction tasks?
- Q3. Parameter sensitivity (Section 4.4). How do parameters used in SuRe affect the accuracy of link prediction?
- Q4. Effect of the number of labeled nodes (Section 4.5). How does the number of labeled nodes affect the performance of link prediction?
- Q5. Scalability (Section 4.6). How well does SuRe scale up with the number of edges?
4.1 Experimental settings
Datasets.
We experiment on various real-world network datasets. Datasets used in our experiments are summarized in Table 3. We use Polblogs for the ranking task (Section 4.2), HepPh and HepTh for the link prediction task (Section 4.3), and Wikipedia for the scalability experiment (Section 4.6). The Wikipedia dataset is a hyperlink network. The HepPh and HepTh datasets are collaboration networks where nodes are authors, and edges are collaboration relationships time-stamped from May 15, 1992 to August 14, 1996 and from October 1, 1993 to December 10, 1999, respectively. The Polblogs dataset is a political network made up of liberal and conservative blogs. Since only HepPh and HepTh have time information, we use them in the link prediction task. All experiments are performed on a Linux machine with Intel(R) Xeon E5-2630 v4 CPU @ 2.2GHz and 256GB memory.
The query nodes are used for the ranking and the link prediction tasks.
Methods. We compare our proposed method SuRe with the following methods:
- Simple RWER method (S-RWER): sets the restart probabilities such that positively labeled nodes are given 0.1 and negatively labeled nodes are given 0.7.
- Common Neighbor (CN) [23]: |Nset(x) ∩ Nset(y)| where Nset(x) denotes the set of neighbors of node x.
- Adamic-Adar (AA) [1]: ∑_{z ∈ Nset(x) ∩ Nset(y)} 1/log|Nset(z)|.
- Jaccard’s Coefficient (JC) [24]: |Nset(x) ∩ Nset(y)| / |Nset(x) ∪ Nset(y)| (these three neighborhood-based measures are illustrated in the sketch following this list).
- Random Walk with Restart (RWR) [6]: performs RWR on the network with no side information used.
- Supervised Random Walks (SRW) [5]: learns a function that assigns weights to edges. We use each node’s degree and the number of common neighbors as features.
- QUINT [12]: learns an adjacency matrix which represents edge weights and the presence of edges.
- Prioritization algorithm [25]: prioritizes proteins in protein-protein interaction graphs.
- SimRank [26]: measures similarity of nodes based on their relationships with other nodes.
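A minimal sketch of the neighborhood-based baselines (CN, AA, JC), assuming an undirected graph stored as a dict of neighbor sets; this illustrates the standard definitions and is not the evaluation code of the paper.

```python
import math

def cn(nbrs, x, y):
    """Common Neighbor: |N(x) ∩ N(y)|."""
    return len(nbrs[x] & nbrs[y])

def aa(nbrs, x, y):
    """Adamic-Adar: sum over common neighbors z of 1 / log|N(z)| (nodes of degree 1 skipped)."""
    return sum(1.0 / math.log(len(nbrs[z])) for z in nbrs[x] & nbrs[y] if len(nbrs[z]) > 1)

def jc(nbrs, x, y):
    """Jaccard's Coefficient: |N(x) ∩ N(y)| / |N(x) ∪ N(y)|."""
    union = nbrs[x] | nbrs[y]
    return len(nbrs[x] & nbrs[y]) / len(union) if union else 0.0

# example: nbrs = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```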
Parameters.
We set the origin vector o in SuRe and restart probability c in RWR, SRW, and QUINT to the ones that give the best performance. Also, in SuRe, SRW, and QUINT, we set λ = 1 among {0.01, 0.1, 1, 10}, and b = 10−2 among {10−1, 10−2, 10−3, 10−4} by grid search. The selected parameters give the best performance for each method.
Evaluation metrics.
To compare the methods, we use Mean Average Precision (MAP), Area under the ROC curve (AUC), and Precision@20. MAP is the mean of average precisions over multiple queries. AUC is the probability that a uniformly drawn random positive is ranked higher than a uniformly drawn random negative. Precision@20 is the precision at the top-20 positions of a ranking result. The higher the values of the metrics are, the better the performance is.
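For reference, the three metrics can be computed as in the following sketch, assuming binary relevance labels ordered by descending score; this mirrors the standard definitions rather than any specific evaluation script.

```python
import numpy as np

def average_precision(labels):
    """labels: 1/0 relevance of items, already sorted by descending score; MAP is the mean over queries."""
    labels = np.asarray(labels)
    hits = np.cumsum(labels)
    precis = hits / (np.arange(len(labels)) + 1)
    return (precis * labels).sum() / max(labels.sum(), 1)

def precision_at_k(labels, k=20):
    """Fraction of relevant items among the top-k positions."""
    return np.asarray(labels)[:k].mean()

def auc(scores, labels):
    """Probability that a random positive is scored higher than a random negative (ties count 0.5)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    pairs = pos[:, None] - neg[None, :]
    return ((pairs > 0).sum() + 0.5 * (pairs == 0).sum()) / (len(pos) * len(neg))
```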
4.2 Ranking performance
We evaluate the ranking performance of our method SuRe compared to that of other methods.
Experimental setup.
We perform this experiment on the Polblogs dataset, where a node represents a blog, and an edge between nodes indicates a hyperlink between blogs. In the dataset, each node has a label which is either liberal or conservative. Among nodes connected from the query node (i.e., neighbors), we choose nodes having the same political position as the query node as positive nodes, and nodes having the opposite propensity as negative nodes. Note that the numbers of positive and negative nodes do not exceed the query node’s degree. We sample 115 nodes whose degrees are greater than 4 as query nodes for this experiment. We use all nodes except the neighbors of a query node as test nodes. In this experiment, we aim to boost the ranks of nodes having the same political position as the query node.
Case study.
We analyze the ranking quality produced from each method in the Polblogs dataset. Table 4 shows the top-10 ranking list for a query node obsidianwings, a liberal blog. Red colored nodes are conservative, and the black colored ones are liberal. As shown in the table, our ranking result from SuRe is of a higher quality compared to those from RWR, SRW, and QUINT since top-10 ranking result from SuRe contains only liberal nodes while other ranking results have several conservative nodes, considering that the query node is liberal.
Bold nodes are conservative blogs, and the non-bold ones are liberal. Our ranking result from SuRe contains only liberal nodes, indicating the best result, while other ranking results wrongly contain conservative nodes.
Result.
To evaluate ranking performances, we measure MAP, Precision@20, and AUC for the ranking results produced from our method SuRe and other methods.
In Polblogs, if the query node is liberal (conservative), then the positive class is liberal (conservative), and the negative one is conservative (liberal). As shown in Fig 4, our method SuRe shows the best ranking performance compared to other methods in terms of MAP, Precision@20, and AUC. Also, SuRe shows better performance than the simple RWER method (S-RWER), which means SuRe learns the restart probabilities effectively.
Our method SuRe provides the best ranking performance compared to existing methods in terms of MAP and Precision@20. Note that the maximal values are 1 for each plot.
4.3 Link prediction performance
We examine the link prediction performance of our proposed method SuRe compared to other link prediction methods as well as RWR-based methods SRW and QUINT.
Experimental setup.
In the link prediction task, we aim to predict future links from a query node to other nodes based on relevance scores. We perform this experiment on the HepPh and HepTh datasets, which are time-stamped networks. We follow the setting of [5] and focus on predicting links to nodes that are 2 hops away from the query node since most new edges are created by closing a triangle. Fig 5 supports the triangle-closing setting used in the link prediction task: when an edge is newly added, we count the number of hops between the two nodes connected by the edge. Note that the majority of the newly connected nodes are 2 hops away in both datasets.
Note that the majority of nodes are within 2 hops in both real-world networks.
The detailed experimental setting of the link prediction task is as follows:
- We consider time-stamp information in networks to construct training and test datasets as in [5]. For each query node s, let ts,min and ts,max denote the minimum and maximum time-stamps of 1-hop neighbors of node s, respectively. Suppose ts,min < ts,1 < ts,2 < ts,max. We select links (s, v) created between ts,2 and ts,max as test data, where node v is a 2-hop neighbor of node s before the links are created. We choose the query node’s 1-hop neighbors created between ts,1 and ts,2 as positive nodes. We sample the same number of negative nodes as that of positive ones, which are 3 hops or more away from node s. We exploit all other links except the test links as training data. We select ts,1 and ts,2 such that ts,1 = ts,min + 0.3L and ts,2 = ts,min + 0.7L, respectively, where L = ts,max − ts,min is the total time length.
We divide nodes into two groups; the first group consists of query nodes of non-low degrees whose degrees are greater than or equal to 30, and the second group consists of 176 query nodes of low degrees whose degrees are less than 30. Figs 6, 7 and Table 5 show the link prediction performances of the first group; Tables 6 and 7 show the link prediction performances of the second group.
SuRe shows the highest accuracies: 10.8% higher MAP, and 5.7% higher Precision@20 compared to the best existing method.
SuRe shows the highest accuracies: 14.7% higher MAP, and 10.1% higher Precision@20 compared to the best existing method.
We compare SuRe with other baselines. SuRe provides the best link prediction accuracy.
Bold and italic fonts indicate the best and the second best methods, respectively.
Bold and italic fonts indicate the best and the second best methods, respectively.
Result on query nodes of non-low degree.
Figs 6, 7 and Table 5 show the link prediction performances in terms of MAP, Precision@20, and AUC. Here, we select nodes whose degrees are greater than 30 as query nodes. As shown in the results, our method SuRe outperforms other competitors including SRW and QUINT, which are the state-of-the-art methods for link prediction. In the HepTh dataset, compared to the best competitor SRW, SuRe achieves 14.7% improvement in terms of MAP, and 10.1% improvement in terms of Precision@20 (Fig 7). For the AUC results, SuRe gives the best performance as shown in Table 5; AUC is high because of class imbalance (# of positive class ≪ # of negative class). Note that SuRe provides the best prediction over all datasets. The results state that assigning a distinct restart probability to each node and learning the restart probabilities (RWER and SuRe) have a significant effect on link prediction compared to using a fixed restart probability for all nodes (RWR). Furthermore, the result indicates that learning restart probabilities (SuRe) provides better link prediction accuracy than existing supervised learning methods that focus on learning edge weights (SRW) or network topology (QUINT).
We perform a Welch’s t-test between the MAP results of SuRe and the best baseline SRW on HepPh. The p-value is 0.0038, which indicates the improvement over the baseline is significant.
Result on query nodes of low degree.
As shown in Tables 6 and 7, SuRe, SRW, and RWR show the best and comparable performances in general for low degree nodes. In the HepTh dataset, SuRe outperforms all other methods in all aspects. In the HepPh dataset, however, other methods including SRW outperform SuRe. We conjecture that this comes from small training instances for low degree nodes, since they have a small number of neighbors.
Discussion.
We discuss the above experimental results in terms of the number of model parameters. As shown in Table 8, SRW has too few parameters (i.e., the number of user-defined features in [5] is much smaller than the number of nodes n); thus, feature selection is important for the performance of applications of SRW. On the other hand, QUINT has too many parameters; thus, it is infeasible to learn O(n2) parameters in large-scale graphs. We conjecture that this huge number of parameters is the main reason for the poor performance of QUINT. Compared to these methods, SuRe has a moderate number of parameters, implying that 1) the expressiveness of SuRe is better than that of SRW, and 2) SuRe is not likely to overfit. This explains why SuRe provides better performance on the ranking and link prediction tasks than SRW and QUINT, as shown in Sections 4.2 and 4.3.
4.4 Effects of parameters
We investigate the parameter sensitivity of SuRe w.r.t. the value of the origin vector o and λ. The origin vector o serves as a model regularizer which helps avoid overfitting and improves accuracy, as described in Section 3.4. We evaluate MAP of the link prediction task, and report the results in Fig 8. Note that the performance of SuRe is improved by introducing the origin parameter o, compared to the case without o, which corresponds to the leftmost points in both plots of Fig 8.
We report the link prediction accuracy using the MAP measure, varying the values of the elements in the origin vector o, where all elements of o are set to the same value. Note that the performance of SuRe is improved by introducing the origin parameter o, compared to not using o, which corresponds to setting o = 0.
λ is a regularization parameter that controls the importance of the regularization term. We evaluate MAP of the link prediction task in the HepPh and HepTh datasets varying the value of λ. As shown in Fig 9, when λ is 1, the MAP score of SuRe is the highest. By introducing the regularization parameter λ, SuRe avoids the overfitting problem and improves accuracy.
We report the MAP scores in the link prediction task, varying the value of λ. Note that SuRe avoids the overfitting problem and shows improvement by introducing the regularization parameter λ. When λ = 10⁰ = 1, SuRe exhibits the best link prediction performance in both datasets.
4.5 Effect of number of labeled nodes
We evaluate the link prediction performance of SuRe by varying the number of labeled nodes. For a given seed node, we sample labeled nodes with a sampling rate. For example, the average number of labeled nodes in the HepTh dataset is 69.58; when the sampling rate is 0.2, the average number of labeled nodes becomes 13.91. As shown in Fig 10, when a seed node has many labeled nodes, SuRe shows better performance since it learns a better model with richer data.
We sample labeled nodes with a sampling rate. Note that a higher sampling rate leads to more training examples which in turn improve the performance.
4.6 Scalability
We examine the scalability of our proposed method SuRe compared to other baselines. We run SuRe to find the optimal restart probabilities and RWER to compute rankings on various sizes of the Wikipedia dataset to investigate the scalability of SuRe. For the dataset, we extract principal sub-matrices, which are the upper-left part of the adjacency matrix, of different sizes to obtain graphs with different numbers of edges. Fig 11 shows that SuRe scales near-linearly with the number of edges; this result is consistent with Theorem 3. The slope of the fitted line for SuRe is 0.76, the smallest value: those for RWR, SRW, and QUINT are 0.83, 0.88, and 2.57, respectively.
SuRe has the smallest slope 0.76 of the fitted line, while the slopes for RWR, SRW, and QUINT are 0.83, 0.88, and 2.57, respectively. Note that SuRe is the fastest among the supervised methods SRW and QUINT. Although SuRe shows slower running time compared to RWR which is an unsupervised method, SuRe provides higher accuracy than RWR in most cases in both ranking and link prediction tasks as shown in Figs 7, 4 and 6.
The number of iterations of SuRe is determined by the tolerance and the learning rate we set: if we set a small tolerance, the number of iterations will be large. Each SuRe computation for a query node takes 39.4 iterations on average in the HepTh dataset. Note that SuRe is the fastest among the supervised methods. Although RWR, which is an unsupervised method, is faster than SuRe, RWR shows low accuracy as shown in Figs 7, 4 and 6.
5 Related works
The related works fall into three main categories: 1) relevance measures in graphs, 2) label prediction and classification, and 3) ranking and link prediction based on relevance measures.
Relevance measures in graphs
There are various relevance measures in graphs based on link analysis and random walk, e.g., PageRank [27], HITS [4], SimRank [26], Random Walk Graph Kernel [28], and RWR (or Personalized PageRank) [6]. Among these measures, RWR has received much attention from the data mining community since it provides a personalized ranking w.r.t. a node, and it has been applied to many graph mining applications such as community detection [13], link prediction [5, 12], ranking [9], and graph matching [29]. Also, fast and scalable methods [10, 11, 9] for computing RWR in large graphs have been proposed to boost the performance of those applications in terms of time.
Label prediction and classification
Considering the problem setting, our work is related to label propagation. Hwang and Kuang [30] proposed the MINProp algorithm, a heterogeneous label propagation algorithm to discover disease genes. This algorithm is based on [31], which spreads label information over a graph. These works assume the graphs are undirected, while the graphs in our experiments are directed. Vanunu et al. [25] proposed a prioritization algorithm, which prioritizes a set of genes in protein-protein interaction networks. Sousa et al. [32] studied graph-based semi-supervised learning algorithms and graph construction methods on labeled graphs. Ji et al. [33] proposed a graph-based regularization framework to model the link structure in information networks; they treated different types of objects and links separately because of their different semantic meanings. They also proposed a ranking-based classification algorithm [34], which integrates ranking and classification. Sen et al. [35] provided a brief description of some of the most widely used algorithms for classifying networked data.
Ranking and link prediction
Jung et al. [36] extended the concept of RWR to design SRWR, a personalized ranking model in signed networks. Our proposed algorithm is different from SRWR in that SRWR is designed for edge-labeled graphs. Regarding edge-labeled graphs, there are also works on knowledge graphs [37, 38, 39]. Lao et al. [37] extended the Path Ranking algorithm by adjusting the weights associated with random walks that follow different paths through the graph. Wang et al. [38] developed a fast and easily-parallelized weight-learning algorithm for ProPPR, based on local partitioning methods. Wei et al. [39] introduced a goal-directed random walk algorithm which directs random walks by a specific inference target. Wang et al. [40] proposed an image annotation technique that generates candidate annotations and re-ranks them using RWR. Pan et al. [41] exploited RWR to discover correlations across multimedia data. Sun et al. [15] used RWR ranking results for detecting anomalies in bipartite networks. Gleich et al. [42] have empirically shown that random walk based models such as RWR are competitive with other cut-based approaches for detecting local communities in graphs. Liben-Nowell et al. [23] extensively studied the link prediction problem in social networks based on relevance measures such as PageRank, RWR, and Adamic-Adar [1]. Grover and Leskovec [43] proposed a representation learning algorithm based on random walks, which can be utilized for link prediction. Many researchers have proposed supervised learning methods for link prediction. Backstrom et al. [5] proposed Supervised Random Walk (SRW), a supervised learning method for link prediction based on RWR; SRW learns parameters for adjusting edge weights. Li et al. [12] developed QUINT, a learning method for finding a query-specific optimal network; QUINT modifies the network topology including edge weights.
In many real-world scenarios, however, modifying the graph structure would not be allowed. On the contrary, our SuRe method controls the behavior of the random surfer without modifying the graph structure, and provides better prediction accuracy than other competitors as shown in Section 4.
6 Conclusion
We propose Random Walk with Extended Restart (RWER), a novel relevance measure using distinct restart probabilities for each node. We also propose SuRe, a data-driven algorithm for learning restart probabilities of RWER. Experiments show that our method brings the best performance for ranking and link prediction tasks, outperforming the traditional RWR and recent supervised learning methods. Future works include extending the proposed method for heterogeneous and edge labeled graphs.
Acknowledgments
The Institute of Engineering Research and the ICT at Seoul National University provided research facilities for this work.
References
- 1. Adamic LA, Adar E. Friends and neighbors on the web. Social networks. 2003;25(3):211–230.
- 2. Agarwal A, Chakrabarti S, Aggarwal S. Learning to rank networked entities. In: KDD; 2006.
- 3. Agarwal A, Chakrabarti S. Learning random walks to rank nodes in graphs. In: ICML; 2007.
- 4. Kleinberg JM. Authoritative sources in a hyperlinked environment. JACM. 1999;46(5):604–632.
- 5. Backstrom L, Leskovec J. Supervised random walks: predicting and recommending links in social networks. In: WSDM; 2011.
- 6. Haveliwala TH. Topic-sensitive pagerank. In: WWW; 2002.
- 7. He J, Li M, Zhang HJ, Tong H, Zhang C. Manifold-ranking based image retrieval. In: MM; 2004.
- 8. Tong H, Faloutsos C. Center-piece subgraphs: problem definition and fast solutions. In: KDD; 2006.
- 9. Tong H, Faloutsos C, Pan JY. Fast random walk with restart and its applications. 2006.
- 10. Shin K, Jung J, Lee S, Kang U. Bear: Block elimination approach for random walk with restart on large graphs. In: SIGMOD; 2015.
- 11. Jung J, Shin K, Sael L, Kang U. Random Walk with Restart on Large Graphs Using Block Elimination. TODS. 2016;41(2):12.
- 12. Li L, Yao Y, Tang J, Fan W, Tong H. QUINT: On Query-Specific Optimal Networks. In: KDD; 2016.
- 13. Andersen R, Chung F, Lang K. Local graph partitioning using pagerank vectors. In: FOCS; 2006.
- 14. Zhu ZA, Lattanzi S, Mirrokni VS. A Local Algorithm for Finding Well-Connected Clusters. In: ICML; 2013.
- 15. Sun J, Qu H, Chakrabarti D, Faloutsos C. Neighborhood formation and anomaly detection in bipartite graphs. In: ICDM; 2005.
- 16. Fujimoto T, Ranade RR. Two characterizations of inverse-positive matrices: the Hawkins-Simon condition and the Le Chatelier-Braun principle. Electronic Journal of Linear Algebra. 2004;11(1):6.
- 17. Sherman J, Morrison WJ. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics. 1950;21(1):124–127.
- 18. Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes. Cambridge University Press, Cambridge; 2007.
- 19. Langville AN, Meyer CD. Google’s PageRank and beyond: The science of search engine rankings. Princeton University Press; 2011.
- 20. Mozer MC. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. 2003.
- 21. Petersen KB, Pedersen MS. The matrix cookbook. Technical University of Denmark. 2012.
- 22. Saad Y, Schultz MH. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing. 1986;7(3):856–869.
- 23. Liben-Nowell D, Kleinberg J. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology. 2007;58(7):1019–1031.
- 24. Salton G, McGill MJ. Introduction to modern information retrieval. 1986.
- 25. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. Associating genes and protein complexes with disease via network propagation. PLoS Computational Biology. 2010;6(1):e1000641. pmid:20090828
- 26. Jeh G, Widom J. SimRank: a measure of structural-context similarity. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2002. p. 538–543.
- 27. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: bringing order to the web. 1999.
- 28. Kang U, Tong H, Sun J. Fast Random Walk Graph Kernel. In: SDM. SIAM; 2012. p. 828–838.
- 29. Tong H, Faloutsos C, Gallagher B, Eliassi-Rad T. Fast best-effort pattern matching in large attributed graphs. In: KDD; 2007.
- 30. Hwang T, Kuang R. A heterogeneous label propagation algorithm for disease gene discovery. In: Proceedings of the 2010 SIAM International Conference on Data Mining. SIAM; 2010. p. 583–594.
- 31. Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B. Learning with local and global consistency. In: Advances in neural information processing systems; 2004. p. 321–328.
- 32. de Sousa CAR, Rezende SO, Batista GE. Influence of graph construction on semi-supervised learning. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer; 2013. p. 160–175.
- 33. Ji M, Sun Y, Danilevsky M, Han J, Gao J. Graph regularized transductive classification on heterogeneous information networks. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer; 2010. p. 570–586.
- 34. Ji M, Han J, Danilevsky M. Ranking-based classification of heterogeneous information networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2011. p. 1298–1306.
- 35. Sen P, Namata G, Bilgic M, Getoor L, Galligher B, Eliassi-Rad T. Collective classification in network data. AI Magazine. 2008;29(3):93.
- 36. Jung J, Jin W, Sael L, Kang U. Personalized Ranking in Signed Networks using Signed Random Walk with Restart. In: ICDM; 2016.
- 37. Lao N, Mitchell T, Cohen WW. Random walk inference and learning in a large scale knowledge base. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2011. p. 529–539.
- 38. Wang WY, Mazaitis K, Lao N, Cohen WW. Efficient inference and learning in a large knowledge base. Machine Learning. 2015;100(1):101–126.
- 39. Wei Z, Zhao J, Liu K. Mining inference formulas by goal-directed random walks. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing; 2016. p. 1379–1388.
- 40. Wang C, Jing F, Zhang L, Zhang HJ. Image annotation refinement using random walk with restarts. In: MM; 2006.
- 41. Pan JY, Yang HJ, Faloutsos C, Duygulu P. Automatic multimedia cross-modal correlation discovery. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2004. p. 653–658.
- 42. Gleich DF, Seshadhri C. Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2012. p. 597–605.
- 43. Grover A, Leskovec J. node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2016. p. 855–864.