
Invariant Node Representation Learning under Distribution Shifts with Multiple Latent Environments

Published: 18 August 2023

Abstract

Node representation learning methods, such as graph neural networks, show promising results when testing and training graph data come from the same distribution. However, the existing approaches fail to generalize under distribution shifts when the nodes reside in multiple latent environments. How to learn invariant node representations to handle distribution shifts with multiple latent environments remains unexplored. In this article, we propose a novel Invariant Node representation Learning (INL) approach capable of generating invariant node representations based on the invariant patterns under distribution shifts with multiple latent environments by leveraging the invariance principle. Specifically, we define invariant and variant patterns as ego-subgraphs of each node and identify the invariant ego-subgraphs through jointly accounting for node features and graph structures. To infer the latent environments of nodes, we propose a contrastive modularity-based graph clustering method based on the variant patterns. We further propose an invariant learning module to learn node representations that can generalize to distribution shifts. We theoretically show that our proposed method can achieve guaranteed performance under distribution shifts. Extensive experiments on both synthetic and real-world node classification benchmarks demonstrate that our method greatly outperforms state-of-the-art baselines under distribution shifts.

1 Introduction

Graph-structured data is ubiquitous in the real world, e.g., social networks [22], knowledge graphs [61], biology networks [5], chemical molecules [80], and so on. Learning node representations is critical for various graph analytical tasks such as node classification [38] and link prediction [67]. In particular, graph neural networks (GNNs) [38, 75, 81] have shown great success in learning effective node representations and handling applications from various fields [14, 35, 55, 70, 84, 94, 97, 100].
Despite their success, existing node representation learning approaches typically assume that the testing and training graph data are drawn from the same distribution, i.e., the node features and graph structures of the labeled training nodes and the testing nodes follow similar patterns. Under this assumption, node representation learning methods can naturally generalize to unseen testing nodes. However, this assumption is easily violated in real-world graphs, since nodes often reside in multiple latent environments, and distribution shifts widely exist between the latent environments of training and testing data, induced by complex underlying data generation mechanisms [6]. For example, in protein-protein interaction graphs, the distributions of protein features/interactions (i.e., input data) and their functions (i.e., labels) change significantly across the species that the proteins come from (i.e., environments) [15]. In citation networks, the papers’ citations (i.e., input data) and their subject topics (i.e., labels) are strongly affected by the publication time (i.e., environments) [33]. There is increasing evidence that most node representation learning approaches are vulnerable to distribution shifts [33, 78, 79] and fail to achieve out-of-distribution (OOD) generalization. If models capture correlations that vary across environments rather than the invariant patterns of the truly predictive properties in multiple latent environments, they will inevitably fail under distribution shifts, hindering their application to real-world graphs, especially high-stake applications such as molecular prediction [80], financial analysis [85], medical diagnosis [47], drug repurposing [32], and so on.
In this work, we study learning invariant node representations to handle distribution shifts with multiple latent environments, which remains unexplored and poses great challenges as follows:
First, nodes in the graph are connected by structures and cannot be modeled as independent samples for predictions. Distribution shifts can happen on both node features and graph structures, leading to complex invariant and variant patterns. How to define and identify these patterns to capture sufficiently predictive information is non-trivial.
Second, environment labels for nodes are usually unavailable or prohibitively expensive to collect. How to infer the environment labels, which is critical for designing invariant learning methods, is also challenging, since the environments of different nodes are also highly entangled.
Last but not least, even with the inferred environment labels of nodes, it requires tailored designs to learn invariant node representations capable of generalization under distribution shifts with theoretical guarantees.
To tackle these challenges, we propose the Invariant Node representation Learning (INL) approach, which learns invariant node representations under distribution shifts with multiple latent environments and achieves theoretically grounded generalization performance. The framework of INL is shown in Figure 1. In particular, we take a local view and define invariant patterns as ego-subgraphs, i.e., subgraphs of the \(L\)-order ego-graph of each node, and identify these ego-subgraphs by jointly considering node features and graph structures. Then, we use the variant ego-subgraphs, i.e., the complements of the invariant ego-subgraphs, to infer environment labels with a proposed contrastive modularity-based graph clustering method. The variant ego-subgraphs capture patterns that are correlated with but not truly predictive of node labels under distribution shifts and therefore contain discriminative information for inferring the environment labels of nodes. Finally, we optimize the maximal invariant pattern criterion, given the identified invariant ego-subgraphs and inferred environments, to produce invariant node representations. We theoretically show that INL achieves guaranteed generalization performance by finding a maximal invariant pattern. We conduct extensive experiments on both synthetic datasets and real-world benchmarks for the node classification task. The results show that INL achieves substantial performance gains on unseen testing nodes compared with various state-of-the-art baselines. Our contributions are summarized as follows:
Fig. 1.
Fig. 1. The framework of the INL model. Our proposed method jointly optimizes three modules: (1) The invariant ego-subgraph identification module uses \(\Psi (\cdot)\) to identify the invariant ego-subgraph \(G^I_v\) and the variant ego-subgraph \(G^S_v\) for each node \(v\). (2) The node environment inference module uses the variant ego-subgraphs \(\lbrace G_v^S\rbrace\) to infer the latent environments via a contrastive modularity-based graph clustering. (3) The invariance regularization module jointly optimizes the invariant ego-subgraph generator \(\Psi (\cdot)\), the representation learning function \(g(\cdot)\), and the classifier \(w(\cdot)\). Training stage (grey arrows): we back-propagate the objective function to update the model parameters. Testing stage (orange arrows): we use the optimized model to make predictions. This example assumes that the node labels have two classes, denoted by red and green, respectively.
We propose a novel Invariant Node representation Learning (INL) approach to learn invariant node representations capable of OOD generalization under distribution shifts. To the best of our knowledge, we are the first to study invariant node representation learning with multiple latent environments.
We design a contrastive modularity-based graph clustering method to infer the environment labels of nodes for the graph with complex multiple latent environments.
We propose a maximal invariant pattern criterion to learn node representations. We theoretically show that by finding maximal invariant ego-subgraphs, INL can achieve guaranteed OOD generalization performance under distribution shifts.
Extensive experimental results demonstrate the effectiveness of INL on various synthetic and benchmark datasets for the node classification task under distribution shifts.
We introduce the notations and preliminaries in Section 2. In Section 3, we describe the problem formulation and the details of our proposed INL. We present the experimental results in Section 4, including quantitative comparisons on both synthetic and real-world datasets, complexity analysis, ablation studies, hyper-parameter sensitivity, and so on. Related work is reviewed in Section 5. We conclude this work in Section 6.

2 Notations and Preliminaries

2.1 Notations

Consider a graph \(G = (V, E)\), the node feature matrix \(X = \lbrace x_v | v \in V \rbrace \in \mathbb {R}^{|V| \times F}\) (where \(F\) denotes the node feature dimension), and labels \(Y = \lbrace y_v | v \in V\rbrace\). The adjacency matrix is denoted as \(A=\lbrace a_{v,v^\prime } | v,v^\prime \in V\rbrace \in \mathbb {R}^{|V| \times |V|}\), where \(a_{v,v^\prime }=1\) means there exists an edge connecting nodes \(v\) and \(v^\prime\), and \(a_{v,v^\prime }=0\) otherwise. We assume the nodes \(V\) are collected from multiple environments, i.e., \(V = \lbrace V^e\rbrace _{e \in {\rm supp}(\mathcal {E}_{tr})}\), where \(V^e\) denotes the nodes from environment \(e\) and \(\mathrm{supp}(\mathcal {E}_{tr})\) is the support of the training environment variable. We use \(\mathbf {v}\) and \(\mathbf {y}\) to denote the random variables of a node and its label, respectively. We summarize the key notations of this article and the corresponding descriptions in Table 1.
Table 1.
Notation | Description
\(G = (V, E)\) | The input graph \(G\) with node set \(V\) and edge set \(E\)
\(X, A, Y\) | The node feature matrix, the adjacency matrix, and the label vector
\(G_v, \mathbf {G_v}\) | An instance and the random variable of node \(v\)’s ego-graph
\(G_v^I=\Psi (G_v)\) | An instance of the invariant ego-subgraph and the invariant ego-subgraph generator
\(\Psi ^*\) | The optimal invariant ego-subgraph generator
\(X_v, A_v\) | The local node feature matrix and the adjacency matrix of ego-graph \(G_v\)
\(G_v^S=G_v\backslash G_v^I\) | An instance of the variant ego-subgraph
\(\mathbf {G_v}, \mathbf {v}, \mathbf {Y}, \mathbf {y}\) | The random variables of ego-graph, node, label vector, and node label
\(X_v^I/X_v^S\) | The local node feature matrix of the invariant/variant ego-subgraph of \(G_v\)
\(A_v^I/A_v^S\) | The local adjacency matrix of the invariant/variant ego-subgraph of \(G_v\)
\(\mathbf {Z}^I\) | The invariant node representations
\(\mathcal {N}_v\) | The node \(v\)’s \(L\)-hop neighbors
\(K\) | The number of ground-truth environments
\(\mathcal {E}/\mathcal {E}_{tr}\) | A random variable on indices of all/training environments
\(\mathcal {E}_{infer}\) | A random variable on indices of the inferred environments
\(|\mathcal {E}_{infer}|\) | The number of inferred environments
\(C\) | The cluster assignment matrix
\(C_v\) | The one-hot vector indicating the environment of node \(v\), with dimensionality \({|\mathcal {E}_{infer}|}\)
\(e\) | An instance of an environment
\(\mathbb {G}, \mathbb {Y}\) | The graph space and label space
\(f\) | The predictor from \(\mathbb {G}\) to \(\mathbb {Y}\)
\(w\) | The classifier from \(\mathbb {R}^d\) to \(\mathbb {Y}\)
\(h\) | The representation learning function from \(\mathbb {G}\) to \(\mathbb {R}^d\)
\(g\) | The representation learning function for invariant ego-subgraphs
\(\mathcal {I}_{\mathcal {E}}\) | The invariant ego-subgraph generator set with respect to \(\mathcal {E}\)
\(\ell\) | The loss function
Table 1. Notations

2.2 Preliminaries

Recently, invariant learning has received surging attention as a way to generalize under distribution shifts, i.e., to achieve out-of-distribution (OOD) generalization. It aims to exploit the invariant relationships between the input data and labels across distribution shifts while filtering out variant spurious correlations. Following the invariant learning literature [2, 4, 11, 40, 42, 64], we formulate the problem of learning invariant node representations capable of generalizing under distribution shifts, i.e., out-of-distribution (OOD) generalized node representation learning, as:
Problem 1.
Let \(\mathcal {E}\) denote the random variable on indices of all possible environments of nodes \(V\). The goal is to find an optimal predictor \(f^{*}(\cdot)\) mapping nodes to their labels that performs well on all environments:
\begin{equation} f^{*}(\cdot) = \arg \min _f \sup _{e\in \mathrm{supp}(\mathcal {E})} \mathcal {R}(f|e), \end{equation}
(1)
where \(\mathcal {R}(f|e)\) is the risk of the predictor \(f\) on the nodes that belong to environment \(e\). Equation (1) encourages learning the predictor whose performance on the worst-case environment is optimal; such min-max optimality with respect to unseen test environments is proved to satisfy OOD generalization in the invariant learning literature [3, 40, 64]. We further decompose \(f(\cdot) = w \circ h\), where \(h(\cdot):\mathbb {G} \rightarrow \mathbb {R}^d\) is the representation learning function, \(\mathbb {G}\) is the graph space, \(d\) is the dimensionality, and \(w(\cdot):\mathbb {R}^d \rightarrow \mathbb {Y}\) is the classifier.
Note that \(\mathrm{supp}(\mathcal {E}_{tr}) \subset \mathrm{supp}(\mathcal {E})\). Distribution shifts indicate that \(P^e(\mathbf {v},\mathbf {y}) \ne P^{e^{\prime }}(\mathbf {v},\mathbf {y}), e \in \mathrm{supp}(\mathcal {E}_{tr}), e^{\prime } \in \mathrm{supp}(\mathcal {E}) \setminus \mathrm{supp}(\mathcal {E}_{tr})\), i.e., the joint distribution of nodes and labels differs between training and testing data. The testing nodes are not available in the training stage, meaning that we cannot obtain a prior distribution of the testing nodes for training. However, Problem 1 is difficult to solve directly, since (1) the nodes, which are connected by the graph structure, are non-independent, which complicates prediction, and (2) the environment labels of the nodes are unobserved [4, 40] and usually unavailable or prohibitively expensive to collect in most real scenarios.

3 Method

In this section, we introduce our proposed INL in detail. The framework of INL is shown in Figure 1. Specifically, we first propose an invariant ego-subgraph identification module. Then, we infer environment labels by proposing a contrastive modularity-based graph clustering method. Last, we optimize the maximal invariant pattern criterion to produce invariant node representations capable of generalizing under distribution shifts with theoretical guarantees.

3.1 Problem Formulation

In this article, we focus on learning invariant node representations by adopting message-passing GNNs. Since only the immediate neighbors of a node are aggregated in each message-passing layer, the representation of a node depends only on its \(L\)-hop neighbors, where \(L\) is the number of message-passing layers. Therefore, we learn the representation of each node by focusing only on its \(L\)-order ego-graph, which is a common assumption for most message-passing GNNs [34, 38, 78]. Denote the node \(v\)’s \(L\)-hop neighbors as \(\mathcal {N}_v = \lbrace u|d(v,u) \le L\rbrace\), where \(d(v,u)\) is the shortest-path distance between nodes \(v\) and \(u\). The nodes in \(\mathcal {N}_v\) and their connections form the ego-graph \(G_v\) of node \(v\), which is represented by a local node feature matrix \(X_v = \lbrace x_u | u \in \mathcal {N}_v\rbrace\) and a local adjacency matrix \(A_v = \lbrace a_{u,u^\prime } | u,u^\prime \in \mathcal {N}_v\rbrace\). We use \(\mathbf {G_v}\) and \(G_v\) to denote the random variable and an instance of an ego-graph, and use \(\mathbf {G}\) and \(\mathbf {Y}\) to denote the random variables of the input graph and the node label vector, respectively. Then, we can reformulate the problem using ego-graphs, i.e., an ego-graph dataset defined as \(\mathcal {G} = \lbrace \mathcal {G}^e\rbrace _{e \in {\rm supp}(\mathcal {E}_{tr})}\), where \(\mathcal {G}^e = \lbrace (G^e_v, y^e_v) | v \in V^e\rbrace\) denotes the ego-graphs in environment \(e\). Notice that ego-graphs are not independent samples, but each ego-graph can be seen as a Markov blanket of its center node [34, 78], so the conditional distribution can be decomposed (conditional independence), i.e., \(P(\mathbf {Y}|\mathbf {G}) = \prod _{\mathbf {v}} P(\mathbf {y}|\mathbf {G_v})\).
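For concreteness, the following sketch shows how the \(L\)-order ego-graph dataset described above could be constructed with PyTorch Geometric. It is a minimal illustration under our own assumptions (the graph is given as `x`/`edge_index`, and the function name `build_ego_graphs` is ours, not from the paper):

```python
# A minimal sketch of building the L-order ego-graph dataset described above.
import torch
from torch_geometric.utils import k_hop_subgraph, to_dense_adj

def build_ego_graphs(x, edge_index, num_hops):
    """Return, for every node v, its L-hop ego-graph (X_v and A_v)."""
    ego_graphs = []
    for v in range(x.size(0)):
        # subset: nodes in N_v; sub_edge_index: edges among them (relabeled);
        # mapping: position of v inside the subset; edge_mask: which full-graph edges are kept
        subset, sub_edge_index, mapping, edge_mask = k_hop_subgraph(
            v, num_hops, edge_index, relabel_nodes=True, num_nodes=x.size(0))
        ego_graphs.append({
            "center": mapping,                        # index of v inside the ego-graph
            "x_v": x[subset],                         # local node feature matrix X_v
            "edge_index_v": sub_edge_index,           # local structure (sparse form)
            "a_v": to_dense_adj(sub_edge_index,       # local adjacency matrix A_v (dense form)
                                max_num_nodes=subset.size(0))[0],
        })
    return ego_graphs
```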
Problem 2.
Given the training graph where nodes are from multiple latent environments but without environment labels, the task is to jointly infer the node environments \(\mathcal {E}_{infer}\) and learn \(f^{*}(\cdot)\) in Problem 1 with \(\mathcal {E}_{infer}\) to achieve good OOD generalization performance under distribution shifts.

3.2 Invariant Ego-subgraph Identification

To enable OOD generalization, recent studies on invariant learning [2, 4, 11, 40, 42, 64] propose to train a predictor using only a portion of features of each input instance that capture the invariant and sufficiently predictive relations with labels. Since we have transformed the node representation learning task into only using ego-graphs \(G_v\), we assume that each ego-graph instance has an invariant subgraph, i.e., ego-subgraph \(G_v^{I} \subset G_v\), that possesses invariant and sufficiently predictive information to the node’s label \(y_v\) in different environments under distribution shifts. We refer to the rest of each ego-graph, i.e., the complement of \(G_v^I\), as the variant ego-subgraph and denote it as \(G_v^S\). \(G_v^S\) represents the surrounding part of the node \(v\) whose relationship with the label is variant across different environments, e.g., spurious correlations for predicting node \(v\). The graph model will have a better OOD generalization ability if it can identify the invariant ego-subgraph \(G_v^I\) for each node accurately and learn node representation based on \(G_v^I\) for predictions.
Formally, we denote the generator that maps each node’s ego-graph to its invariant ego-subgraph as \(G_v^I = \Psi (G_v)\). Following the invariant learning literature [2, 11, 40, 42, 45, 50], we make the following assumption:
Assumption 1.
Given ego-graph \(\mathbf {G_v}\), there exists an optimal invariant ego-subgraph generator \(\Psi ^*(\mathbf {G_v})\) satisfying the following properties:
(a) \(\mathrm{Invariance\ property}\): \(\forall e, e^{\prime } \in \mathrm{supp}(\mathcal {E}), P^e(\mathbf {y}|\Psi ^*(\mathbf {G_v})) = P^{e^{\prime }}(\mathbf {y}|\Psi ^*(\mathbf {G_v})),\) where \(P^e(\cdot)\) and \(P^{e^{\prime }}(\cdot)\) denote the probability distribution in two environments \(e\) and \(e^{\prime }\), respectively.
(b) \(\mathrm{Sufficiency\ property}\): \(\mathbf {y} = w^*(g^*(\Psi ^*(\mathbf {G_v}))) + \epsilon ,\ \epsilon \perp \mathbf {G_v},\) where \(g^*(\cdot)\) denotes a representation learning function, \(w^*\) is the classifier, \(\perp\) indicates statistical independence, and \(\epsilon\) is random noise.
The invariance assumption means that the node representations learned on invariant ego-subgraphs have an invariant relation to the node labels across different environments. The sufficiency assumption means that the node representations learned on invariant ego-subgraphs are sufficiently predictive to the node labels.
In this article, we instantiate \(\Psi (\cdot)\) with two learnable masks on node features and graph structures (i.e., edges). First, the edge mask is responsible for splitting the local adjacency matrix \(A_v\) of the ego-graph \(G_v\) into the local adjacency matrix \(A_v^I\) of the invariant ego-subgraph \(G_v^I\) and the local adjacency matrix \(A_v^S\) of the variant ego-subgraph \(G_v^S\). A straightforward strategy is to train a binary mask matrix \(M^{A_v} \in \lbrace 0, 1\rbrace ^{|\mathcal {N}_v| \times |\mathcal {N}_v|}\) on the local adjacency matrix \(A_v\). However, directly optimizing such a mask matrix is a discrete optimization problem and intractable in practice, especially for large-scale graphs [88]. Besides, learning a separate mask for each ego-graph cannot share knowledge among different nodes. Therefore, we adopt a learnable GNN (denoted as \(\mathrm{GNN^{M}}\)) to parameterize the mask matrix. Specifically, we relax the edge masks from binary variables to continuous variables in \([0,1]\). The soft mask for each edge \((u,u^\prime), u,u^\prime \in \mathcal {N}_v\) in ego-graph \(G_v\) is:
\begin{equation} M_{u,u^\prime }^{A_v} = {\rm Sigmoid}\left({\mathbf {Z}_u^{\mathrm{M}}}^\top \cdot \mathbf {Z}_{u^\prime }^{\mathrm{M}}\right)\!,\ \ \ \mathbf {Z}^{\mathrm{M}} = \mathrm{GNN^{M}}(G_v) \in \mathbb {R}^{|\mathcal {N}_v| \times d}. \end{equation}
(2)
Besides the edge mask, we also adopt a soft \(F\)-dimensional feature mask \(M^X \in [0,1]^{F}\) shared by all the nodes for selecting the invariant node features in the ego-graph \(G_v\). The invariant ego-subgraph \(G_v^I = (A_v^I, X_v^I)\) and variant ego-subgraph \(G_v^S = (A_v^S, X_v^S)\) of \(G_v\) are calculated as:
\begin{equation} A_v^I = M^{A_v} \odot A_v, X_v^I = M^{X} \odot X_v; \ \ A_v^S = A_v - A_v^I, X_v^S = X_v - X_v^I, \end{equation}
(3)
where \(\odot\) denotes element-wise multiplication (with \(M^{X}\) broadcast over the rows of \(X_v\)). Using the above method, we can generate all the invariant ego-subgraphs \(\lbrace G_v^I|v \in V\rbrace\) and variant ego-subgraphs \(\lbrace G_v^S|v \in V\rbrace\).
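A possible instantiation of Equations (2) and (3) is sketched below, assuming dense local adjacency matrices and a simple two-layer GCN as \(\mathrm{GNN^{M}}\); the class and variable names are our own illustrative choices, not the authors’ implementation:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class InvariantSubgraphGenerator(nn.Module):
    """Sketch of Psi(.): soft edge mask (Eq. 2) and shared feature mask (Eq. 3)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.gnn_m1 = GCNConv(in_dim, hid_dim)
        self.gnn_m2 = GCNConv(hid_dim, hid_dim)
        # F-dimensional feature mask M^X shared by all nodes, squashed to [0, 1] via sigmoid
        self.feat_mask_logits = nn.Parameter(torch.zeros(in_dim))

    def forward(self, x_v, edge_index_v, a_v):
        # Z^M = GNN^M(G_v): node embeddings used to score each edge
        z = self.gnn_m2(torch.relu(self.gnn_m1(x_v, edge_index_v)), edge_index_v)
        # M^{A_v}_{u,u'} = sigmoid(z_u . z_{u'})  (Eq. 2), computed densely here
        edge_mask = torch.sigmoid(z @ z.t())
        feat_mask = torch.sigmoid(self.feat_mask_logits)   # M^X in [0, 1]^F

        a_i = edge_mask * a_v                               # A_v^I = M^{A_v} * A_v
        x_i = feat_mask * x_v                               # X_v^I = M^X * X_v (broadcast over rows)
        a_s, x_s = a_v - a_i, x_v - x_i                     # variant complement (Eq. 3)
        return (x_i, a_i), (x_s, a_s)
```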

3.3 Node Environment Inference

After splitting the nodes’ ego-graphs into invariant and variant ego-subgraphs, we can infer the environment labels \(\mathcal {E}_{infer}\) using the variant ego-subgraphs \(\lbrace G_v^S|v \in V\rbrace\). The intuition is that, since the invariant ego-subgraphs capture the invariant relationships of the predictive node features and graph structures with the node labels, the variant ego-subgraphs in turn capture variant spurious correlations under different distributions. Consider two nodes \(v, v^\prime\) from the same environment (e.g., two proteins from the same species or two papers published in the same period). Their variant ego-subgraphs \(G_v^S\) and \(G_{v^\prime }^S\) are likely to show similar environment patterns. Based on the graph homophily assumption [57] that similar nodes are more likely to connect to each other, nodes from the same environment will tend to be more densely connected in their variant ego-subgraphs than nodes from different environments (an illustrative example is shown in Figure 1). Therefore, we can infer the environments by conducting graph clustering based on the variant node features and edges.
Specifically, let \(X^S\) and \(A^S\) denote the node features and edges in \(\lbrace G_v^S|v \in V\rbrace\). Assuming there are \(K\) latent environments in the graph, we design a contrastive modularity-based clustering method to infer the environments by learning a cluster assignment matrix \(C = \lbrace C_v|v \in V\rbrace\), where \(C_v\) is a \(K\)-dimensional one-hot vector indicating the environment of node \(v\). We propose to minimize the following contrastive objective for clustering the nodes described by \((X^S, A^S)\):
\begin{equation} \min _{C} \ell = - \frac{1}{K}\sum _{k=1}^K{\rm log} \frac{{\rm exp} (B_{k,k})}{\sum _{k^\prime =1, k^\prime \ne k}^{K} {\rm exp} (B_{k,k^\prime })}, \end{equation}
(4)
where
\begin{equation} B = \frac{1}{2m} \left(C^\top A^S C - \frac{1}{2m} {\rm diag} \left({C^\top \mathbf {d} \mathbf {d}^\top C}\right) \right). \end{equation}
(5)
In Equation (5), \(\mathbf {d}\) and \(m\) denote the degree vector and the number of edges computed from \(A^S\), respectively. \({\rm diag}(\cdot)\) keeps only the diagonal elements of the input matrix. \(B \in \mathbb {R}^{K \times K}\) is the modularity matrix [60], whose entry \(B_{k,k^\prime }\) measures the probability of an edge existing between clusters \(k\) and \(k^\prime\). Optimizing Equation (4) maximizes the connection probability between nodes from the same cluster (i.e., positive pairs) and minimizes the connection probability between nodes from different clusters (i.e., negative pairs) via a contrastive scheme [13], encouraging the formation of clear clusters. Since optimizing a binary cluster assignment matrix is proven to be NP-hard [8], we follow Reference [73] to relax \(C \in [0,1]^{|V| \times K}\) into a soft cluster assignment and adopt a GNN to compute the assignment matrix, i.e., \(C = {\rm Softmax}({\rm GNN}^{\rm C}(X^S, A^S)).\) Finally, the optimal cluster assignment \(C^*\) indicates the inferred environments \(\mathcal {E}_{infer}\) of the nodes.
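For illustration, Equations (4) and (5) could be implemented as follows (a sketch under our assumptions: a dense adjacency \(A^S\), a soft assignment \(C\) already produced by \({\rm GNN^C}\), and \(K \ge 2\); function names are ours):

```python
import torch

def contrastive_modularity_loss(C, A_s):
    """Sketch of Eqs. (4)-(5): C is the soft cluster assignment (|V| x K),
    A_s is the (dense, symmetric) adjacency built from the variant ego-subgraphs."""
    deg = A_s.sum(dim=1, keepdim=True)            # degree vector d from A^S, shape (|V|, 1)
    m = A_s.sum() / 2.0                           # number of edges in A^S
    Cd = C.t() @ deg                              # C^T d, shape (K, 1)
    # B = (1 / 2m) * (C^T A^S C - (1 / 2m) * diag(C^T d d^T C))   (Eq. 5)
    B = (C.t() @ A_s @ C
         - torch.diag_embed(torch.diagonal(Cd @ Cd.t()) / (2 * m))) / (2 * m)

    # Contrastive objective (Eq. 4): pull the intra-cluster entry B_kk up and
    # push the inter-cluster entries B_kk' down.
    K = C.size(1)
    loss = 0.0
    for k in range(K):
        pos = B[k, k]
        neg = torch.cat([B[k, :k], B[k, k + 1:]])
        loss = loss - (pos - torch.logsumexp(neg, dim=0))
    return loss / K
```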

3.4 Invariance Regularization

After obtaining the inferred invariant ego-subgraphs \(\lbrace G_v^I | v \in V\rbrace\) and environment labels \(\mathcal {E}_{infer}\), we propose the invariance regularization module, which enables the graph model to generate node representations capable of OOD generalization under distribution shifts. Specifically, we aim to learn the optimal generator \(\Psi ^*\) in Assumption 1 by proposing and optimizing the maximal invariant ego-subgraph generator criterion. Following the invariant learning literature [11, 40, 50, 51], we give the following definition:
Definition 1
The invariant ego-subgraph generator set \(\mathcal {I}_{\mathcal {E}}\) with respect to \(\mathcal {E}\) is defined as:
\begin{equation} \begin{aligned}\mathcal {I}_{\mathcal {E}} &= \lbrace \Psi (\cdot): P^e(\mathbf {y}|\Psi (\mathbf {G_v}))=P^{e^\prime }(\mathbf {y}|\Psi (\mathbf {G_v})),e,e^\prime \in \mathrm{supp}(\mathcal {E}) \rbrace . \end{aligned} \end{equation}
(6)
Then, we show that the optimal generator \(\Psi ^*\) satisfies the following theorem:
Theorem 1.
A generator \(\Psi (\mathbf {G_v})\) is the optimal generator that satisfies Assumption 1 if and only if it is the maximal invariant ego-subgraph generator, i.e., \(\Psi ^* = \arg \max _{\Psi \in \mathcal {I}_{\mathcal {E}}}I\left(\mathbf {y}; \Psi (\mathbf {G_v}) \right)\), where \(I(\cdot ;\cdot)\) is the mutual information between the label and the generated invariant ego-subgraph.
Proof.
Denote \(\hat{\Psi } = \arg \max _{\Psi \in \mathcal {I}_{\mathcal {E}}}I\left(\mathbf {y}; \Psi (\mathbf {G_v}) \right)\). According to the invariance property of Assumption 1, we have \(\Psi ^* \in \mathcal {I}_{\mathcal {E}}\). Therefore, we prove the theorem by showing that \(I(\mathbf {y}; \hat{\Psi } (\mathbf {G_v})) \le I(\mathbf {y}; \Psi ^* (\mathbf {G_v}))\) and consequently, \(\hat{\Psi } = \Psi ^*\). To show the inequality, we use the functional representation lemma [23], which states that for any random variables \(\mathbf {X}_1\) and \(\mathbf {X}_2\), there exists a random variable \(\mathbf {X}_3\) independent of \(\mathbf {X}_1\) such that \(\mathbf {X}_2\) can be represented as a function of \(\mathbf {X}_1\) and \(\mathbf {X}_3\). So, for \(\Psi ^*(\mathbf {G_v})\) and \(\hat{\Psi }(\mathbf {G_v})\), there exists \(\Psi ^{\prime }(\mathbf {G_v})\) satisfying that \(\Psi ^{\prime }(\mathbf {G_v}) \perp \Psi ^*(\mathbf {G_v})\) and \(\hat{\Psi }(\mathbf {G_v}) = \gamma \left(\Psi ^*(\mathbf {G_v}),\Psi ^\prime (\mathbf {G_v})\right)\), where \(\gamma (\cdot)\) is a function. Then, we can derive that:
\begin{equation} \begin{aligned}I(\mathbf {y}; \hat{\Psi } (\mathbf {G_v})) &= I\left(\mathbf {y}; \gamma \left(\Psi ^*(\mathbf {G_v}),\Psi ^\prime (\mathbf {G_v})\right) \right) \\ & \le I\left(\mathbf {y}; \Psi ^*(\mathbf {G_v}),\Psi ^\prime (\mathbf {G_v}) \right) \\ &= I\left(w^*(g^*(\Psi ^*(\mathbf {G_v}))) ; \Psi ^*(\mathbf {G_v}), \Psi ^{\prime }(\mathbf {G_v}) \right) \\ &= I\left(w^*(g^*(\Psi ^*(\mathbf {G_v}))) ; \Psi ^*(\mathbf {G_v}) \right) \\ &= I\left(\mathbf {y}; \Psi ^* (\mathbf {G_v}) \right), \end{aligned} \end{equation}
(7)
which finishes the proof.□
Theorem 1 provides an objective function for optimizing the invariant ego-subgraph generator. However, directly solving the objective in Theorem 1 for a non-linear \(\Psi\) is difficult [40]. Following the invariant learning literature [40], we minimize the following invariance regularizer:
\begin{equation} \mathbb {E}_{e \in \mathrm{supp}(\mathcal {E}_{infer}) } \mathcal {R}^e\left(f\left(\mathbf {G_v}\right), \mathbf {y};\theta \right) + \lambda \mathrm{trace}\left(\mathrm{Var}_{\mathcal {E}_{infer}}\left(\nabla _\theta \mathcal {R}^e\right)\right), \end{equation}
(8)
where \(f(\cdot) = w \circ g \circ \Psi\), \(\mathcal {E}_{infer}\) denotes the inferred environment labels, and \(\theta\) denotes all the learnable parameters. Recall that \(g(\cdot)\) is the representation learning function of the invariant ego-subgraphs and \(w(\cdot)\) is the classifier. We instantiate \(g\) as another GNN: \(\mathbf {Z}_{I} = \mathrm{GNN^{I}}(G_v^I)\), where \(\mathbf {Z}_{I}\) are the node representations capturing invariant patterns from the ego-subgraphs. \(w(\cdot)\) is instantiated as a multilayer perceptron with the ReLU [1] activation function, followed by the softmax function. By optimizing Equation (8), we obtain the desired generator \(\Psi\) and the ego-subgraph representation learning function \(g(\cdot)\), which collectively serve as our representation learning method \(h(\cdot)\), i.e., \(h = g \circ \Psi\).
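One way to realize the penalty in Equation (8) is sketched below, assuming the per-environment risks \(\mathcal {R}^e\) have already been computed on the inferred environments and that the variance is taken over the gradients with respect to a chosen subset of shared parameters \(\theta\); the autograd details here are our own illustrative choice, not necessarily the authors’ exact implementation:

```python
import torch

def invariance_regularized_loss(env_risks, params, lam):
    """Sketch of Eq. (8): mean risk over inferred environments plus
    lambda * trace(Var_e(grad_theta R^e)).
    `env_risks`: list of scalar risks R^e, one per inferred environment (each must
    depend on every tensor in `params`); `params`: list of shared parameters theta."""
    mean_risk = torch.stack(env_risks).mean()

    grads = []
    for risk in env_risks:
        g = torch.autograd.grad(risk, params, create_graph=True)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    grads = torch.stack(grads)                     # shape (|E_infer|, num_params)
    # trace of the covariance = sum over parameters of the variance across environments
    penalty = grads.var(dim=0, unbiased=False).sum()
    return mean_risk + lam * penalty
```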
We further theoretically analyze our INL model by showing that the maximal invariant ego-subgraph generator can achieve OOD optimality.
Theorem 2.
Let \(\Psi ^*\) be the optimal invariant ego-subgraph generator for \(\mathbf {G_v}\) in Assumption 1 and denote the complement as \(\mathbf {G_v} \backslash \Psi ^*(\mathbf {G_v})\), i.e., the corresponding variant ego-subgraph. Then, we can obtain the optimal predictor under distribution shifts, i.e., the solution to Problem 1, as follows:
\begin{equation} \arg \min _{w,g} w \circ g \circ \Psi ^*(\mathbf {G_v}) = \arg \min _f \sup _{e \in \mathrm{supp}(\mathcal {E})} \mathcal {R}(f|e), \end{equation}
(9)
if the following conditions hold: (1) \(\Psi ^*(\mathbf {G_v}) \perp \mathbf {G_v}\backslash \Psi ^*(\mathbf {G_v})\); and (2) \(\forall \Psi \in \mathcal {I}_{\mathcal {E}}\), \(\exists \; e^{\prime } \in \mathrm{supp}(\mathcal {E})\) such that \(P^{e^{\prime }}(\mathbf {G_v},\mathbf {y}) = P^{e^\prime }(\Psi (\mathbf {G_v}),\mathbf {y})P^{e^{\prime }}(\mathbf {G_v} \backslash \Psi (\mathbf {G_v}))\) and \(P^{e^\prime }(\Psi (\mathbf {G_v})) = P^{e}(\Psi (\mathbf {G_v}))\).
Proof.
Denote the function to obtain the complement of invariant ego-subgraph as \(\Phi (\mathbf {G_v}) = \mathbf {G_v} \backslash \Psi (\mathbf {G_v})\) and \(\Phi ^*(\mathbf {G_v}) = \mathbf {G_v} \backslash \Psi ^*(\mathbf {G_v})\). By assumption, \(\Psi ^*(\mathbf {G_v}) \perp \Phi ^*(\mathbf {G_v})\). Further denote \(\hat{f} = \arg \min _{w,g} w \circ g \circ \Psi ^*(\mathbf {G_v})\). By Assumption 1, we have
\begin{equation} \hat{f}(\mathbf {G_v}) = w^* \circ g^* \circ \Psi ^*(\mathbf {G_v}). \end{equation}
(10)
To show that \(\hat{f}\) is \(f^*\), our proof strategy is to show that \(\forall e \in \mathrm{supp}(\mathcal {E})\) and for any possible \(f\), \(\mathcal {R}(\hat{f}|e) \le \mathcal {R}(f|e^\prime)\), where \(e^\prime \in \mathrm{supp}(\mathcal {E})\) is the environment given by condition (2), and therefore \(\sup _{e \in \mathrm{supp}(\mathcal {E})} \mathcal {R}(\hat{f}|e) \le \sup _{e \in \mathrm{supp}(\mathcal {E})} \mathcal {R}(f|e)\).
To show the inequality, we have:
\begin{equation} \begin{aligned} \mathcal {R}(\hat{f}|e) &= \mathbb {E}^e_{\mathbf {G_v}, \mathbf {y}}[\ell (\hat{f}(\mathbf {G_v}),\mathbf {y})] \\ &= \sum _{\mathbf {G_v}, \mathbf {y}} P^{e}(\mathbf {G_v}, \mathbf {y})\,\ell (\hat{f}(\mathbf {G_v}),\mathbf {y}) \\ &= \sum _{\Phi ^*(\mathbf {G_v})} P^e(\Phi ^*(\mathbf {G_v})) \left[ \sum _{\Psi ^*(\mathbf {G_v}), \mathbf {y}} P^e(\Psi ^*(\mathbf {G_v}), \mathbf {y})\, \ell \left(w^*(g^*(\Psi ^*(\mathbf {G_v}))),\mathbf {y}\right) \right] \\ &= \sum _{\Psi ^*(\mathbf {G_v}), \mathbf {y}} P^e(\Psi ^*(\mathbf {G_v}), \mathbf {y})\, \ell (w^*(g^*(\Psi ^*(\mathbf {G_v}))),\mathbf {y}) \\ &\le \sum _{\Psi (\mathbf {G_v}), \mathbf {y}} P^e(\Psi (\mathbf {G_v}), \mathbf {y})\, \ell (w(g(\Psi (\mathbf {G_v}))),\mathbf {y}) \\ &= \sum _{\Phi (\mathbf {G_v})} P^{e^\prime }(\Phi (\mathbf {G_v})) \sum _{\Psi (\mathbf {G_v}), \mathbf {y}} P^e(\Psi (\mathbf {G_v}), \mathbf {y})\, \ell (w(g(\Psi (\mathbf {G_v}))),\mathbf {y}) \\ &= \sum _{\Phi (\mathbf {G_v})} \sum _{\Psi (\mathbf {G_v}), \mathbf {y}} P^{e^\prime }(\Psi (\mathbf {G_v}), \mathbf {y})\, P^{e^\prime }(\Phi (\mathbf {G_v}))\, \ell (w(g(\Psi (\mathbf {G_v}))),\mathbf {y}) \\ &= \sum _{\mathbf {G_v}, \mathbf {y}} P^{e^\prime }(\mathbf {G_v}, \mathbf {y})\,\ell (f(\mathbf {G_v}),\mathbf {y}) \\ &= \mathbb {E}^{e^\prime }_{\mathbf {G_v}, \mathbf {y}}[\ell (f(\mathbf {G_v}),\mathbf {y})] \\ &= \mathcal {R}(f|e^\prime). \end{aligned} \end{equation}
(11)
Taking the supremum over \(e \in \mathrm{supp}(\mathcal {E})\) on both sides yields \(\sup _{e \in \mathrm{supp}(\mathcal {E})} \mathcal {R}(\hat{f}|e) \le \sup _{e \in \mathrm{supp}(\mathcal {E})} \mathcal {R}(f|e)\), which finishes the proof.□
Intuitively, Theorem 2 shows that we can transform the OOD generalization problem into finding the optimal invariant ego-subgraphs while maintaining optimality. The proofs of the above theorems are inspired by the invariant learning literature [45, 50, 51, 78], and a motivating example for better understanding is provided in Section 3.6. The theorem indicates that our method can get rid of spurious correlations and learn OOD generalized node representations based on the identified invariant ego-subgraphs.

3.5 Training Procedure

We present the pseudocode of INL in Algorithm 1 to show the training procedure. Specifically, we first obtain the invariant and variant ego-subgraphs for all nodes with the learnable masks on node features and edges. Then, we infer the environments of all nodes from the variant node features and edges of the variant ego-subgraphs. Finally, we learn the invariant node representations with the invariance regularization based on the inferred invariant ego-subgraphs and environment labels. Note that the adopted GNNs, including \(\mathrm{GNN^{M}}\), \(\mathrm{GNN^{C}}\), and \(\mathrm{GNN^{I}}\), are shared across all ego-graphs, following References [34, 78]. At the testing stage, we directly adopt the optimized \(f\) to make predictions. In Algorithm 1, “Epoch” denotes the overall number of epochs for optimizing the proposed method, and “Epoch_Cluster” denotes the number of epochs for clustering to infer environments in each training epoch. The settings of the hyper-parameters can be found in Section 4.1.3.
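Since Algorithm 1 is not reproduced here, the following loop sketches how the three modules could be alternated in practice. It reuses the illustrative sketches from the previous subsections and assumes that `generator`, `gnn_c`, `gnn_i`, `classifier`, the two optimizers, `ego_graphs`, `Epoch`, `Epoch_Cluster`, and `lam` have been constructed; `assemble_variant_graph` and `risk_on_environment` are hypothetical helpers, so this is our reconstruction of the flow rather than the authors’ code:

```python
# A schematic training loop for INL (our own reconstruction of Algorithm 1's flow).
for epoch in range(Epoch):
    # (1) Invariant ego-subgraph identification: split every ego-graph.
    invariant_parts, variant_parts = [], []
    for ego in ego_graphs:
        (x_i, a_i), (x_s, a_s) = generator(ego["x_v"], ego["edge_index_v"], ego["a_v"])
        invariant_parts.append((x_i, a_i))
        variant_parts.append((x_s, a_s))

    # (2) Node environment inference on the variant parts.
    x_s_all, a_s_all = assemble_variant_graph(variant_parts)      # hypothetical helper
    for _ in range(Epoch_Cluster):
        C = torch.softmax(gnn_c(x_s_all, a_s_all), dim=-1)
        cluster_opt.zero_grad()
        contrastive_modularity_loss(C, a_s_all).backward()
        cluster_opt.step()
    env_labels = C.argmax(dim=-1).detach()

    # (3) Invariance regularization on the invariant parts, grouped by inferred environment.
    env_risks = [risk_on_environment(e, invariant_parts, env_labels, gnn_i, classifier)
                 for e in env_labels.unique()]                    # hypothetical helper
    main_opt.zero_grad()
    invariance_regularized_loss(env_risks, list(classifier.parameters()), lam).backward()
    main_opt.step()
```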

3.6 A Motivating Example

To understand our proposed method more intuitively, we present a linear toy example and the corresponding theoretical analysis, inspired by Reference [78], to show why our method can achieve out-of-distribution generalization by learning node representations based on the invariant ego-subgraph \(G_v^I\) (i.e., the invariant node features \(X_v^I\) and structures \(A_v^I\)).
For simplicity, in this toy example, we consider the case where the ego-graph \(G_v\) (and \(\mathcal {N}_v\)) contains only the center node \(v\) and its 1-hop neighbors (i.e., \(L=1\)), and can be split into the invariant ego-subgraph \(G_v^I\) (and \(\mathcal {N}_v^I\)) and the variant ego-subgraph \(G_v^S\) (and \(\mathcal {N}_v^S\)). We consider node features of dimensionality \(F=2\), consisting of a one-dimensional invariant node feature \(x_v^I\) and a one-dimensional variant node feature \(x_v^S\), i.e., \(x_v = [x_v^I, x_v^S]\) for each node \(v\). The ego-graph \(G_v\) is illustrated in Figure 2, and the dependence among the variables in the toy example is shown in Figure 3. We do not distinguish between random variables and their instances when there is no risk of confusion in this toy example.
Fig. 2.
Fig. 2. The ego-graph \(G_v\) in the toy example.
Fig. 3.
Fig. 3. The dependence among variables in our synthetic datasets.
Suppose, following Assumption 1, that the representation learning function \(g^*\) averages the node representations in the invariant ego-subgraph \(G_v^I\) to produce the center node representation and the classifier \(w^*\) is the identity mapping. Then, the node label is determined by the invariant node features and structures:
\begin{equation} y_v = \frac{1}{|\mathcal {N}_v^I|} \sum _{u \in \mathcal {N}_v^I} x_u^I + \epsilon _1, \end{equation}
(22)
where \(\epsilon _1\) is standard normal noise. We further assume that the variant node feature \(x_v^S\) is generated by an identity mapping given the node’s label \(y_v\) and environment \(e_v\):
\begin{equation} x_v^S = y_v + e_v + \epsilon _2, \end{equation}
(23)
where \(\epsilon _2\) is standard normal noise and \(e_v\) denotes node \(v\)’s environment variable, which follows a normal distribution whose mean and variance depend on the node’s environment. Besides, we assume that the variant structures also depend on the node environment and that the environments of the nodes in \(\mathcal {N}_v^S\) are \(e_v\). For example, in citation networks, the invariant node features and structures can be the publication venues of papers and the citations among them that determine the subject topics (i.e., labels), while the variant node features and structures can be the citation indexes and the edges between highly cited papers in some publication periods (i.e., environments).
Therefore, given the invariant and variant ego-subgraph, we consider the following predictor model:
\begin{equation} \hat{y}_v = \frac{1}{|\mathcal {N}_v^I|} \sum _{u \in \mathcal {N}_v^I} \left(\theta _1 x_u^I + \theta _2 x_u^S\right) + \frac{1}{|\mathcal {N}_v^S|} \sum _{u \in \mathcal {N}_v^S} \left(\theta _3 x_u^I + \theta _4 x_u^S\right)\!. \end{equation}
(24)
Note that the ideal solution for the predictor model is \(\theta = [\theta _1, \theta _2, \theta _3, \theta _4] = [1, 0, 0, 0]\), indicating that the predictor accurately identifies the sufficiently predictive and invariant node features and structures for making OOD generalized predictions. However, the following proposition shows that we cannot obtain this ideal solution if we only use standard empirical risk minimization (ERM):
Proposition 3.
Denoting the risk (i.e., loss) of the predictor model \(f\) as \(\mathcal {R} = \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {y_v}|\mathbf {G_v}=G_v} || \hat{y}_v - y_v ||_2^2\), the optimal solution for the objective \(\min _\theta \mathcal {R}\) is \(\theta = [\theta _1, \theta _2, \theta _3, \theta _4] = \left[1 - \frac{\mu ^S}{2(\mu ^S-\mu ^I)}, \frac{\mu ^S}{2(\mu ^S-\mu ^I)}, \frac{\mu ^I}{2(\mu ^S-\mu ^I)}, \frac{-\mu ^I}{2(\mu ^S-\mu ^I)}\right]\), assuming \(\mu ^I \ne \mu ^S\), where \(\mu ^I = \frac{1}{|V|} \sum _{v \in V} \frac{1}{|\mathcal {N}_v^I|} \sum _{u \in \mathcal {N}_v^I} e_u\) and \(\mu ^S = \frac{1}{|V|} \sum _{v \in V} \frac{1}{|\mathcal {N}_v^S|} \sum _{u \in \mathcal {N}_v^S} e_u\) are dependent on the node environments.
The proof is in Appendix A.1. Proposition 3 indicates that directly optimizing with ERM will inevitably make the predictor model rely heavily on spurious correlations, since \(\theta _2, \theta _3, \theta _4\) are not constantly zero, so the model performs poorly under distribution shifts with multiple latent environments. Next, we show that our objective in Equation (8) mitigates this issue.
Proposition 4.
The solution of optimizing the invariance regularizer in Equation (8) to the minimum satisfies \([\theta _2, \theta _3, \theta _4] = [0, 0, 0]\).
The proof is in Appendix A.2. Proposition 4 indicates that our method can get rid of spurious correlations and learn OOD generalized node representations under distribution shifts with multiple latent environments by generating node representations based on the identified invariant ego-subgraph \(G_v^I\).
Intuitively, Proposition 3 shows that the optimal solution under standard empirical risk minimization (ERM) in this toy example (as shown in Figure 2) contains non-zero coefficients of the predictor model on the variant ego-subgraph, which means that the predictions rely on variant environment information, e.g., the species that proteins come from in protein-protein interaction graphs or the publication time of papers in citation networks. Therefore, the OOD generalization performance is poor. In contrast, Proposition 4 shows that the optimal solution of the proposed method in this toy example only includes non-zero coefficients of the predictor model on the invariant ego-subgraph, demonstrating that our method makes predictions only based on the invariant information and is not affected by variant spurious correlations, leading to strong OOD generalization ability.
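To make the toy example concrete, the following simulation (entirely our own, with arbitrary environment means and with neighborhood averages approximated by single draws) generates data in the spirit of Equations (22)-(23) and fits the linear predictor of Equation (24) by least squares; as Proposition 3 predicts, the ERM solution typically puts substantial weight on the variant features:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Aggregated invariant feature over N_v^I, approximated by one draw per node.
x_inv = rng.normal(size=n)
y = x_inv + rng.normal(size=n)                     # Eq. (22): y_v = invariant average + eps_1
env_mean = rng.choice([-2.0, 0.0, 2.0], size=n)    # training environments (arbitrary means)
x_var = y + env_mean + rng.normal(size=n)          # Eq. (23): x^S = y + e + eps_2, over N_v^I
# Variant-neighborhood averages, assumed driven by the same environments.
x_inv_var_nbr = rng.normal(size=n)                 # invariant features of nodes in N_v^S
x_var_var_nbr = y + env_mean + rng.normal(size=n)  # variant features of nodes in N_v^S

# ERM: least-squares fit of Eq. (24) over [theta_1, theta_2, theta_3, theta_4].
design = np.stack([x_inv, x_var, x_inv_var_nbr, x_var_var_nbr], axis=1)
theta_erm, *_ = np.linalg.lstsq(design, y, rcond=None)
print("ERM solution:", np.round(theta_erm, 3))     # theta_2 and theta_4 are far from zero
```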

4 Experiments

In this section, we empirically evaluate our proposed method through the experiments on both synthetic and real-world datasets, including the experimental setup, quantitative comparisons, complexity analysis, ablation studies, the impact of the hyper-parameters, and so on.

4.1 Experimental Setup

4.1.1 Datasets.

We adopt two synthetic datasets with artificial distribution shifts based on two representative node classification benchmarks, Citeseer [86] and Amazon-Photo [69], in which the ground-truth generation processes are controllable. We also consider two real-world datasets, OGB-Arxiv and OGB-Proteins, from the Open Graph Benchmark [33]. The statistics of these datasets are provided in Table 2.
Table 2.
 | Citeseer | Amazon-Photo | OGB-Arxiv | OGB-Proteins
#Nodes | 3,327 | 7,650 | 169,343 | 132,534
#Edges | 9,104 | 238,162 | 1,166,243 | 39,561,252
#Classes | 6 | 8 | 40 | 2
Metric | Accuracy | Accuracy | Accuracy | ROC-AUC
Table 2. Statistics of the Datasets
#Nodes/#Edges are the numbers of nodes and edges in the graph of each dataset, respectively. #Classes denotes the number of classes. Metric is the evaluation metric of the dataset.
Synthetic datasets. Citeseer and Amazon-Photo are two commonly used node classification benchmarks. Citeseer is a citation network where nodes represent papers and edges indicate their citations. Amazon-Photo is a co-purchasing network where nodes represent items and edges indicate that two items are purchased together. To evaluate the model’s out-of-distribution generalization ability, we introduce distribution shifts between the training and testing data.
Following Reference [78], we first use a randomly initialized 2-layer GCN to generate node labels \(Y\) based on the original node features and edges, which can be regarded as invariant and sufficiently predictive information to the labels and denoted by \((X^I, A^I)\). Then, we assign nodes into different environments and create spurious correlations between the label and environment. Based on the label and environment of each node, we generate an additional feature matrix and additional edges as the variant patterns, which are denoted by \((X^S, A^S)\). The generated feature (i.e., \(X^S\)) has the same dimensionality as the original feature (i.e., \(X^I\)) and the number of generated edges (i.e., \(A^S\)) equals the original number of edges (i.e., \(A^I\)). We then concatenate the two feature matrices and add the generated edges into the original graph as the input data, i.e., \((X=[X^I,X^S], A=A^I + A^S)\). The dependence among these variables is illustrated in Figure 3.
More specifically, we set the ground-truth number of environments as \(K=3\) and adopt a hyper-parameter \(r \in [0, 1]\) to control the strength of spurious correlations by setting the probability of node \(v\) belonging to the \(k\)th environment as \(P(v \in V^{e_k}) = r\) if \(k \equiv y_v (\mathrm{mod}\ K)\) and \(P(v \in V^{e_k}) = (1-r) / 2\) otherwise. Intuitively, nodes with the same labels are more likely to belong to the same environment. For example, for the nodes whose labels are 1 or 4, the probability of belonging to the 1st environment is \(r\) and the probability of belonging to the 2nd or 3rd environment is \((1-r)/2\). In the \(K=3\) case, \(r=1/3\) means there is no spurious correlation, and a larger \(r\) indicates a stronger spurious correlation between the label and environment. We set \(r_{test} = 1/3\) and vary \(r_{train}\) in \(\lbrace 1/3, 0.5, 0.7\rbrace\) to generate testing and training graphs, respectively, which simulates different strengths of distribution shifts. We hold out 10% of the nodes from the training graph for validation.
After obtaining the environment of each node, we generate the variant node features \(X^S\) with a two-layer MLP given the label and environment ID as input. Then, we generate the variant edges \(A^S\) by connecting nodes with similar variant node features. In particular, we first calculate the scores of all potential edges (i.e., edges not in \(A^I\)) by the cosine similarity of the variant node features of the two endpoints. According to the scores, we select the Top-\(t\) edges among all potential edges to form the variant edges \(A^S\), so that the numbers of invariant and variant edges are equal, i.e., \(t\) is the number of edges in \(A^I\).
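The environment assignment and variant-edge generation described above could be sketched as follows. Only the assignment rule for \(P(v \in V^{e_k})\) follows the text exactly; the function names and the dense similarity computation are our own simplifications:

```python
import torch

def assign_environments(y, K, r):
    """P(v in V^{e_k}) = r if k == y_v mod K, else (1 - r) / (K - 1), with K = 3 in the paper."""
    n = y.size(0)
    probs = torch.full((n, K), (1 - r) / (K - 1))
    probs[torch.arange(n), y % K] = r
    return torch.multinomial(probs, num_samples=1).squeeze(-1)   # environment id per node

def generate_variant_edges(x_s, num_edges, existing_mask):
    """Connect nodes with the most similar variant features (Top-t cosine-similarity scores)."""
    n = x_s.size(0)
    normed = torch.nn.functional.normalize(x_s, dim=-1)
    scores = normed @ normed.t()                 # pairwise cosine similarities
    scores[existing_mask] = -float("inf")        # exclude edges already in A^I
    scores.fill_diagonal_(-float("inf"))         # exclude self-loops
    flat_idx = scores.flatten().topk(num_edges).indices
    rows = torch.div(flat_idx, n, rounding_mode="floor")
    cols = flat_idx % n
    return torch.stack([rows, cols])             # edge_index of the variant edges A^S
```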
OGB-Arxiv. This dataset consists of Arxiv CS papers from 40 subject areas and their citations. The task is to predict the 40 subject areas of the papers, e.g., cs.AI, cs.LG, and cs.OS. Instead of the semi-supervised/adaptation setting where unlabeled testing data is available during training [33], we follow the more common and challenging out-of-distribution generalization setting [2, 4, 11, 40, 42, 64], i.e., the testing nodes are not available in the training stage. Since several latent influential environment factors (e.g., the popularity of research topics) can change significantly over time, the properties of citation networks vary across time ranges. Therefore, the node distribution shifts on OGB-Arxiv are introduced by selecting papers published before 2011 as the training set, within 2011–2014 as the validation set, and within 2014–2016/2016–2018/2018–2020 as the three testing sets.
OGB-Proteins. In this dataset, nodes represent proteins and edges indicate different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression, or homology [71]. The task is to predict the presence of protein functions in a binary classification setup. We also follow the out-of-distribution generalization setting [2, 4, 11, 40, 42, 64], i.e., the testing nodes are not available in the training stage, instead of the semi-supervised setting. Since the latent influential environment factors can vary across the species that the proteins come from, the properties and associations of proteins also differ across species. Therefore, the node distribution shifts on OGB-Proteins are introduced by splitting the nodes into training/validation/testing sets according to their species. Specifically, the training set and validation set include proteins and their associations from four species and one species, respectively, and each of the three testing sets consists of proteins and their associations from one of the remaining three species.
All of the datasets used in our experiments are publicly available.

4.1.2 Baselines.

We compare our INL with the following representative state-of-the-art methods:
ERM [74]: We use ERM to denote the backbone GNN models trained with standard empirical risk minimization, i.e., minimizing the average risk over all training samples across environments.
GroupDRO [65]: It addresses the problem that minority distributions receive insufficient training and explicitly optimizes the worst-case performance over a set of distributions to achieve OOD generalization.
IRM [4]: A representative invariant learning method. To learn invariances across environments and enable OOD generalization, it seeks data representations or features such that the optimal classifier on top of the representation is the same for all environments. Since this method requires explicit environment labels in advance, we randomly partition the nodes of the input graph into environments for training.
V-REx [42]: This method is proven to recover the causal mechanisms of the targets and to be robust to distribution shifts. Specifically, it minimizes the risk variance across training environments to reduce the risk variance across test environments, leading to good OOD generalization. Since this method relies on explicit environment labels, which are unavailable for nodes in multiple latent environments, we randomly partition the nodes of the input graph into environments during training.
EERM [78]: A recent pioneering work that tackles node-level prediction tasks under distribution shifts and achieves a valid solution to the node-level OOD problem under mild conditions. It studies invariant prediction on graphs by assuming that all nodes share a single environment, and thus ignores the more common and challenging situation where nodes come from multiple latent environments.
GIL [45]: It learns invariant graph-level representations under distribution shifts. However, it focuses only on graph-level generalization for graph classification tasks and cannot tackle the key problem studied in this article, where distribution shifts occur on nodes. In the experiments, we modify each of its modules from the graph level to the node level for comparison.
Since all the methods are model-agnostic, we use GCN [38] as the GNN backbone on the synthetic datasets and adopt GraphSAGE [30] and GAT [75] on the real-world datasets for a comprehensive comparison. Intuitively, node classification on the synthetic datasets is simpler than on the real-world datasets. Therefore, the classical GNN model, GCN, is used on the synthetic datasets, while the relatively advanced models, GraphSAGE and GAT, are adopted on the real-world datasets.
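Because the compared methods are model-agnostic, swapping the backbone amounts to changing the message-passing layer. A minimal sketch with PyTorch Geometric layers (the two-layer structure and hidden size are illustrative assumptions):

```python
import torch.nn as nn
from torch_geometric.nn import GCNConv, SAGEConv, GATConv

def make_backbone(name, in_dim, hid_dim):
    """Return a 2-layer message-passing backbone; `name` selects the layer type."""
    layer = {"gcn": GCNConv, "sage": SAGEConv, "gat": GATConv}[name]
    return nn.ModuleList([layer(in_dim, hid_dim), layer(hid_dim, hid_dim)])
```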

4.1.3 Implementation Details.

The number of epochs for optimizing our proposed method (i.e., Epoch in Algorithm 1) and baselines is set to 200 for the synthetic datasets (i.e., Citeseer and Amazon-Photo) and 500 for the real-world datasets (i.e., OGB-Arxiv and OGB-Proteins). The number of epochs for clustering to infer environments in each training epoch (i.e., Epoch_Cluster in Algorithm 1) is 20. The Adam optimizer is adopted for gradient descent. Since we focus on node classification tasks, we use the cross-entropy loss as the loss function \(\ell\). The classifier \(w\) is instantiated as a two-layer MLP. The activation function is ReLU [1]. The evaluation metric is ROC-AUC for OGB-Proteins datasets and accuracy for the others. For \(\mathrm{GNN^{M}}\), \(\mathrm{GNN^{C}}\), and \(\mathrm{GNN^{I}}\), the number of layers is set to 2 on all the datasets. The dimensionality of the node representations \(d\) is 32 on the synthetic datasets, 128 on OGB-Arxiv, and 256 on OGB-Proteins. Note that these GNNs including \(\mathrm{GNN^{M}}\), \(\mathrm{GNN^{C}}\), \(\mathrm{GNN^{I}}\) are shared for all ego-subgraphs following References [34, 78]. The invariance regularizer coefficient \(\lambda\) in Equation (8) is chosen from \(\lbrace 10^{-4}, 10^{-2}, 10^{0}\rbrace\). The number of the inferred environments \(|\mathcal {E}_{infer}|\) is chosen from \(\lbrace 2, 3, 4\rbrace\), which is the dimensionality of the vector \(C_v\) indicating the node \(v\)’s environment in the cluster assignment matrix \(C\). We report mean results and standard deviations of 10 runs. The selected \(\lambda\) and \(|\mathcal {E}_{infer}|\) are reported in Table 3.
Table 3.
 | Citeseer | Amazon-Photo | OGB-Arxiv | OGB-Proteins
\(\lambda\) | \(10^{-4}\) | \(10^{-4}\) | \(10^{-2}\) | \(10^{0}\)
\(|\mathcal {E}_{infer}|\) | 3 | 3 | 3 | 4
Table 3. Selected Hyper-parameters of \(\lambda\) and \(|\mathcal {E}_{infer}|\) of Our Method on Each Dataset
As for the baselines, we implement them using their official source code. We conduct a hyper-parameter search for each baseline covering the search ranges of both our method and the original paper (if a search range is reported). The search ranges and the selected hyper-parameters of the baselines are reported in Table 4. The other hyper-parameters of the baselines are kept consistent with our method as described above.
Table 4.
 | Method | Range | Citeseer | Amazon-Photo | OGB-Arxiv | OGB-Proteins
Number of Training Environments | IRM | {2, 3, 4} | 3 | 2 | 3 | 2
 | GroupDRO | {2, 3, 4, 5} | 2 | 2 | 4 | 4
 | V-REx | {2, 3, 4} | 3 | 4 | 2 | 2
 | EERM | {2, 3, 4, 5, 10} | 3 | 5 | 4 | 3
 | GIL | {2, 3, 4} | 2 | 2 | 3 | 3
Regularizer Coefficient | IRM | \(\lbrace 10^{-4}, 10^{-2}, 10^{0}\rbrace\) | \(10^{-2}\) | \(10^{-4}\) | \(10^{-2}\) | \(10^{-2}\)
 | V-REx | \(\lbrace 10^{-4}, 10^{-2}, 10^{0}, 10^{2}, 10^{4}\rbrace\) | \(10^{-4}\) | \(10^{-4}\) | \(10^{0}\) | \(10^{-2}\)
 | EERM | \(\lbrace 10^{-4}, 10^{-2}, \frac{1}{3}, 0.5, 1.0, 2.0, 5.0\rbrace\) | \(10^{-2}\) | 2.0 | 1.0 | 1.0
 | GIL | \(\lbrace 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}\rbrace\) | \(10^{-4}\) | \(10^{-3}\) | \(10^{-2}\) | \(10^{-2}\)
Table 4. Selected Hyper-parameters of the Baselines on Each Dataset
We conduct the experiments with the following hardware and software configurations:
Operating System: Ubuntu 18.04.1 LTS
CPU: Intel(R) Xeon(R) CPU E5-2699 [email protected] GHz
GPU: NVIDIA GeForce RTX 3090 with 24 GB of Memory
Software: Python 3.6.5; NumPy 1.19.2; PyTorch 1.10.1; PyTorch Geometric 2.0.3 [25].

4.2 Experiments on Synthetic Datasets

The experimental results are shown in Table 5, from which we have the following observations. Our proposed INL consistently and significantly outperforms the baselines and achieves the best performance in all settings. The results demonstrate the effectiveness of the proposed method in handling distribution shifts and its remarkable out-of-distribution generalization ability. The general invariant learning methods, e.g., IRM, GroupDRO, and V-REx, offer only slight improvements over ERM. EERM is a recently proposed invariant method specifically designed for learning node representations, but it assumes a single environment shared by all nodes; it yields competitive results in some settings yet fails to obtain consistent improvements, indicating that modeling multiple latent environments is crucial for handling distribution shifts on graphs. GIL achieves promising gains over the other baselines, but the proposed method still performs better.
Table 5.
Method | Citeseer (\(r_{train}=1/3\)) | Citeseer (\(r_{train}=0.5\)) | Citeseer (\(r_{train}=0.7\)) | Amazon-Photo (\(r_{train}=1/3\)) | Amazon-Photo (\(r_{train}=0.5\)) | Amazon-Photo (\(r_{train}=0.7\))
GCN(ERM) | 47.09\(\pm\)3.44 | 45.36\(\pm\)5.54 | 40.09\(\pm\)2.12 | 48.26\(\pm\)2.26 | 47.91\(\pm\)3.24 | 39.23\(\pm\)5.27
IRM | 48.84\(\pm\)2.75 | 45.39\(\pm\)2.07 | 42.89\(\pm\)2.38 | 53.75\(\pm\)1.31 | 50.98\(\pm\)3.09 | 42.23\(\pm\)2.75
GroupDRO | 49.32\(\pm\)6.47 | 46.30\(\pm\)5.44 | 40.68\(\pm\)2.83 | 49.62\(\pm\)6.45 | 47.65\(\pm\)8.34 | 41.15\(\pm\)5.50
V-REx | 47.53\(\pm\)3.65 | 43.11\(\pm\)4.06 | 41.03\(\pm\)4.29 | 47.13\(\pm\)8.01 | 48.53\(\pm\)8.37 | 37.49\(\pm\)5.39
EERM | 53.07\(\pm\)4.39 | 45.50\(\pm\)3.68 | 41.53\(\pm\)1.96 | 52.25\(\pm\)5.90 | 51.03\(\pm\)2.93 | 41.69\(\pm\)4.63
GIL | 55.71\(\pm\)1.24 | 47.42\(\pm\)2.10 | 44.87\(\pm\)3.26 | 53.19\(\pm\)2.74 | 50.01\(\pm\)2.06 | 41.79\(\pm\)3.98
INL | 60.48\(\pm\)0.77\(^*\) | 56.74\(\pm\)0.75\(^*\) | 54.78\(\pm\)2.50\(^*\) | 55.86\(\pm\)1.63\(^*\) | 55.07\(\pm\)2.27\(^*\) | 46.90\(\pm\)2.06\(^*\)
Improvement | 4.77\(\uparrow\) | 9.32\(\uparrow\) | 9.91\(\uparrow\) | 2.11\(\uparrow\) | 4.04\(\uparrow\) | 4.67\(\uparrow\)
Table 5. Node Classification Accuracy (%) on Testing Sets of the Synthetic Datasets
In each column, the boldfaced and underlined scores denote the best and the second-best results, respectively. Numbers after \(\pm\) denote standard deviations. “\(^*\)” indicates a statistically significant improvement (one-tailed t-test with \(p \lt 0.05\)) over the best baseline.
In addition, when \(r_{train} = 1/3\), i.e., there are no distribution shifts between training and testing data, our proposed method also achieves the best results, meaning that learning invariant ego-subgraphs for nodes is beneficial even without distribution shifts. As \(r_{train}\) grows larger, the performance of all methods tends to decrease, since the degree of distribution shift becomes larger. Nevertheless, our proposed method maintains the most stable performance. In fact, the performance gap between INL and the best baseline becomes more significant as the degree of distribution shift increases. For example, the accuracy improvement over the strongest baseline increases from 4.77% to 9.91% when \(r_{train}\) changes from \(1/3\) to 0.7 on Citeseer, indicating the powerful OOD generalization ability of our method under various complex distribution shifts.
To further analyze whether our method can accurately capture the invariant ego-subgraphs under distribution shifts, we compare the discovered invariant node features and structures with the ground truth on the synthetic Citeseer dataset. The evaluation metric is ROC-AUC. The results in Figure 4 show that the ROC-AUC for discovering invariant node features and structures is around 70% and 80%, respectively, which is significantly higher than random selection (50%). This demonstrates that INL can discover the truly predictive invariant ego-subgraphs and further make OOD generalized predictions.
Fig. 4. Results of discovering the ground-truth invariant node features and edges on Citeseer.
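As a concrete illustration (a minimal sketch of ours, not the authors' released evaluation code), this kind of ROC-AUC can be computed with scikit-learn by treating the learned soft mask values as scores against 0/1 ground-truth indicators of which edges (or feature dimensions) are invariant; the arrays below are hypothetical:

```python
# Minimal sketch of the ROC-AUC evaluation for invariant-edge discovery.
# Both arrays are hypothetical placeholders: a 0/1 ground-truth label and the
# learned soft mask score for each edge of a node's ego-graph.
import numpy as np
from sklearn.metrics import roc_auc_score

gt_edge_is_invariant = np.array([1, 0, 1, 1, 0, 0])
learned_edge_mask = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1])

print("edge ROC-AUC:", roc_auc_score(gt_edge_is_invariant, learned_edge_mask))
# The same call on per-dimension feature masks yields the node-feature ROC-AUC.
```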

4.3 Experiments on Real-world Graphs

We further evaluate the effectiveness of our method on two real-world graph datasets, OGB-Arxiv and OGB-Proteins from OGB [33]. The properties of citation networks can change significantly across time ranges, so the node distribution shifts on OGB-Arxiv are introduced by selecting papers published before 2011 as the training set, papers within 2011–2014 as the validation set, and papers within 2014–2016/2016–2018/2018–2020 as the testing sets. For the OGB-Proteins dataset, since the interactions between proteins can vary across the species that the proteins come from, we split the protein nodes into training/validation/testing sets according to their species. We assume the test nodes are strictly unseen during the training stage, which is more common in practice and more challenging than the default setting of OGB [33].
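As a rough sketch (not the authors' preprocessing code), the temporal split on OGB-Arxiv can be derived from the per-node publication year exposed by the OGB loader; the exact boundary handling below (inclusive vs. exclusive years) is our assumption, and the species split on OGB-Proteins can be built analogously from the node species attribute:

```python
# Sketch of the temporal node split on ogbn-arxiv, assuming the PyTorch Geometric
# loader shipped with the ogb package; `node_year` holds each paper's publication year.
from ogb.nodeproppred import PygNodePropPredDataset

dataset = PygNodePropPredDataset(name="ogbn-arxiv")
data = dataset[0]
year = data.node_year.squeeze()

train_mask = year < 2011                       # papers published before 2011
valid_mask = (year >= 2011) & (year < 2014)    # 2011-2014
test_masks = {                                 # three test periods
    "2014-2016": (year >= 2014) & (year < 2016),
    "2016-2018": (year >= 2016) & (year < 2018),
    "2018-2020": (year >= 2018) & (year < 2020),
}
```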
The experimental results are presented in Table 6. Our proposed method consistently achieves the best performance, indicating that INL can handle the distribution shifts that exist in real-world scenarios. For example, INL increases the classification accuracy by 3.41% on OGB-Arxiv (tested on 2016–2018 with the GraphSAGE backbone) and the ROC-AUC by 7.54% on OGB-Proteins (tested on Species-1 with the GAT backbone) over the strongest baselines. Besides, different datasets exhibit different distribution shifts, and none of the baselines achieves consistently promising OOD generalization performance as our method does. Therefore, the results show that our proposed method can handle diverse types of distribution shifts in real graph datasets.
Table 6. Node Classification Results (Accuracy for OGB-Arxiv, ROC-AUC for OGB-Proteins, %) on Testing Sets of the Real-world Datasets

Backbone  | Method   | OGB-Arxiv                                      | OGB-Proteins
          |          | 2014-2016 | 2016-2018 | 2018-2020              | Species-1 | Species-2 | Species-3
GraphSAGE | ERM      | 45.24\(\pm\)0.60 | 42.25\(\pm\)1.02 | 38.75\(\pm\)0.97 | 66.44\(\pm\)0.48 | 64.18\(\pm\)0.59 | 57.61\(\pm\)1.72
GraphSAGE | IRM      | 45.31\(\pm\)0.56 | 42.48\(\pm\)1.98 | 40.23\(\pm\)1.07 | 67.03\(\pm\)0.41 | 64.38\(\pm\)0.87 | 57.54\(\pm\)1.13
GraphSAGE | GroupDRO | 45.35\(\pm\)0.68 | 42.56\(\pm\)0.88 | 39.26\(\pm\)0.81 | 66.28\(\pm\)0.27 | 64.51\(\pm\)0.35 | 57.87\(\pm\)0.89
GraphSAGE | V-REx    | 45.27\(\pm\)0.71 | 42.51\(\pm\)1.13 | 39.31\(\pm\)0.96 | 67.43\(\pm\)0.18 | 64.38\(\pm\)0.51 | 57.71\(\pm\)1.42
GraphSAGE | EERM     | 46.15\(\pm\)0.98 | 43.27\(\pm\)1.01 | 41.61\(\pm\)0.96 | 66.40\(\pm\)0.59 | 64.39\(\pm\)0.12 | 57.12\(\pm\)1.21
GraphSAGE | GIL      | 47.92\(\pm\)0.45 | 45.78\(\pm\)0.62 | 41.27\(\pm\)0.91 | 67.39\(\pm\)0.86 | 66.54\(\pm\)1.38 | 55.81\(\pm\)1.76
GraphSAGE | INL      | 49.43\(\pm\)0.53\(^*\) | 49.19\(\pm\)0.98\(^*\) | 46.34\(\pm\)0.87\(^*\) | 72.20\(\pm\)0.41\(^*\) | 69.47\(\pm\)0.72\(^*\) | 61.07\(\pm\)1.45\(^*\)
GAT       | ERM      | 45.94\(\pm\)1.03 | 43.52\(\pm\)0.95 | 40.42\(\pm\)0.98 | 66.34\(\pm\)0.45 | 64.35\(\pm\)0.60 | 57.83\(\pm\)1.75
GAT       | IRM      | 46.73\(\pm\)0.91 | 44.32\(\pm\)0.91 | 42.04\(\pm\)0.99 | 66.33\(\pm\)0.30 | 64.61\(\pm\)0.43 | 56.91\(\pm\)0.93
GAT       | GroupDRO | 45.95\(\pm\)0.89 | 43.52\(\pm\)1.25 | 40.43\(\pm\)1.32 | 66.30\(\pm\)0.27 | 64.52\(\pm\)0.31 | 57.95\(\pm\)0.79
GAT       | V-REx    | 45.93\(\pm\)0.87 | 45.69\(\pm\)0.81 | 41.01\(\pm\)1.03 | 66.14\(\pm\)0.58 | 64.31\(\pm\)0.60 | 57.73\(\pm\)1.32
GAT       | EERM     | 45.99\(\pm\)1.22 | 45.32\(\pm\)0.84 | 42.01\(\pm\)1.36 | 66.35\(\pm\)0.48 | 64.32\(\pm\)0.21 | 56.13\(\pm\)0.98
GAT       | GIL      | 47.70\(\pm\)0.93 | 45.65\(\pm\)1.41 | 41.87\(\pm\)1.89 | 66.31\(\pm\)0.69 | 67.12\(\pm\)0.89 | 55.98\(\pm\)0.83
GAT       | INL      | 50.37\(\pm\)1.01\(^*\) | 49.12\(\pm\)1.23\(^*\) | 45.35\(\pm\)1.32\(^*\) | 73.89\(\pm\)0.39\(^*\) | 71.42\(\pm\)0.28\(^*\) | 60.36\(\pm\)1.12\(^*\)

The boldfaced and the underlined score denote the best and the second-best result, respectively. Numbers after \(\pm\) denote standard deviations. “\(^*\)” indicates statistically significant improvements (one-tailed t-test with \(p \lt 0.05\)) over the best baseline.
Besides the quantitative evaluation, we plot a showcase from OGB-Arxiv to intuitively validate the effectiveness of our method. Figure 5 shows the learned invariant ego-subgraph \(G_v^I\) (denoted by solid lines) and variant ego-subgraph \(G_v^S\) (denoted by dashed lines) of one node \(v\) (ID: 139,332). For simplicity, we plot the top-five edges selected by the masks. It can be observed that the invariant ego-subgraph \(G_v^I\) learned by our method accurately corresponds to the neighbors in the ego-graph from the same subject area (i.e., artificial intelligence), which have truly predictive and invariant relations with the centered node. In contrast, the variant ego-subgraph \(G_v^S\) highlights neighbors from different subject areas that are published in the same year as the centered node and have a high citation index (a spurious feature). Besides, there is another paper \(u\), whose subject area is information retrieval (IR), that also cites papers with high citation indexes, meaning that node \(u\) has variant patterns similar to those of node \(v\), so they are in the same environment. We can observe that these nodes form clear cluster structures based on the variant ego-subgraphs, demonstrating the effectiveness of the proposed graph clustering algorithm in inferring latent environments.
Fig. 5. The learned invariant and variant ego-subgraphs of the papers \(v\) and \(u\) from OGB-Arxiv.

4.4 Analysis of Node Environment Inference

In our proposed model, all components are jointly optimized. To show that the node environment inference module and the invariance regularization module mutually promote each other, we record, as the model is trained, the test accuracy, the modularity, which measures the quality of the graph clustering, and the normalized mutual information (NMI) [41], another metric (falling within the range \([0, 1]\)) for evaluating clustering accuracy. The results on Citeseer (\(r_{train} = 0.7\)) are shown in Figure 6. We observe that the test accuracy and the modularity (clustering quality) improve synchronously over training. The results show that, as training progresses, the invariant ego-subgraph generator is optimized so it can generate more informative invariant ego-subgraphs and therefore improve the performance on the testing set. In turn, accurately discovering invariant ego-subgraphs also promotes identifying variant ego-subgraphs, which capture the environment-discriminative features and better infer the latent environments. In addition, we observe that the test accuracy and the NMI (clustering accuracy) also improve together over training. Notice that INL achieves these results without needing any ground-truth environment label.
Fig. 6. The test accuracy and the performance of environment inference w.r.t. training epochs.
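For reference, the NMI between inferred and ground-truth environment labels can be computed with scikit-learn; the short sketch below is ours, and both label lists are hypothetical:

```python
# Sketch of NMI between ground-truth and inferred node environments.
from sklearn.metrics import normalized_mutual_info_score

true_envs = [0, 0, 1, 1, 2, 2, 2]        # hypothetical ground-truth environment labels
inferred_envs = [1, 1, 0, 0, 2, 2, 0]    # hypothetical inferred cluster assignments

print("NMI:", normalized_mutual_info_score(true_envs, inferred_envs))  # value in [0, 1]
```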
These empirical results support the following points: (1) Invariant and variant patterns widely exist in real-world graphs, and our proposed INL can identify invariant/variant ego-subgraphs under distribution shifts with multiple latent environments. (2) The variant ego-subgraphs form clear clustering structures, and INL can capture such patterns to accurately infer the environment labels of nodes. (3) Based on the inferred environments, INL learns node representations from the invariant ego-subgraph of each node, so it achieves better OOD generalization performance. The environment inference and invariance regularization modules mutually enhance each other.

4.5 Ablation Studies

We perform ablation studies over the key components of the invariant ego-subgraph generator \(\Psi\), i.e., the masks on node features and edges, to understand their functionalities more deeply. We compare INL with the following two ablated versions: (1) w/o node feature mask: it removes the node feature mask by setting both the invariant and variant node features in the ego-graph \(G_v\) to \(X_v\), i.e., \(X_v^I=X_v^S=X_v\); (2) w/o edge mask: it removes the edge mask by setting both the invariant and variant edges in the ego-graph \(G_v\) to \(A_v\), i.e., \(A_v^I=A_v^S=A_v\). As shown in Figure 7, the results of both ablated versions drop compared with INL. The performance gaps between INL and the two ablated versions become more significant as the degree of distribution shift increases (i.e., \(r_{train}\) from 1/3 to 0.7), which demonstrates the importance of accurately identifying the invariant node features and edges with the learnable masks.
Fig. 7. Ablation studies of our method. We plot the accuracy (%) on the Citeseer datasets with different strengths of spurious correlations.

4.6 Training Dynamics

Although the clustering objective in environment inference (i.e., Equation (4)) and the invariance objective in invariance regularization (i.e., Equation (8)) are optimized iteratively, we can empirically observe the convergence of our proposed method. Figures 8(a) and (b) show the two objectives during training on Citeseer (\(r_{train}=0.7\)) and OGB-Arxiv, respectively. The loss converges before reaching the maximum number of training epochs, and the results on the other datasets show similar patterns.
Fig. 8. The invariance objective and clustering objective in the training process on two datasets.
In Figure 9, we also show the objective of the inner iteration in Algorithm 1, i.e., the training dynamics of the clustering objective within one epoch of the outer iteration. The outer-iteration epoch is set to 100 and 250 for Citeseer (\(r_{train}=0.7\)) and OGB-Arxiv, respectively, which is roughly the middle of the whole training process; the results in other epochs of the outer iteration show similar patterns.
Fig. 9. The clustering objective in one epoch of the training process on two datasets.

4.7 Time Complexity Analysis

The time complexity of the proposed INL is \(O(\left|E\right|d+\left|V\right|d^2)\), where \(\left|V\right|\) and \(\left|E\right|\) denote the number of nodes and edges, respectively, and \(d\) is the dimensionality of the node representations. Specifically, we adopt the message-passing GNN, which has a complexity of \(O(\left|E\right|d+\left|V\right|d^2)\), to instantiate the GNN components in INL, and the GNNs are shared across all ego-graphs. Since we only need to generate masks for the existing edges in the graph, the time complexity of generating invariant and variant ego-subgraphs and obtaining their representations is \(O(\left|E\right|d+\left|V\right|d^2)\). The time complexity of calculating the modularity matrix \(B\) in environment inference is \(O(\left|E\right|(d+|\mathcal {E}_{infer}|)+\left|V\right|(d+|\mathcal {E}_{infer}|)^2)\), where \(|\mathcal {E}_{infer}|\) denotes the number of inferred environments. The time complexity of the invariance regularizer is \(O(|\mathcal {E}_{infer}|d^2)\), as the number of parameters of most GNNs is \(O(d^2)\). Since \(|\mathcal {E}_{infer}|\) is a small constant, the overall time complexity of INL is \(O(\left|E\right|d+\left|V\right|d^2)\). In comparison, the time complexity of other GNN-based node representation methods is also \(O(\left|E\right|d+\left|V\right|d^2)\). Therefore, the time complexity of our proposed INL is on par with that of existing methods.
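To make the two terms of this bound concrete, the following minimal sketch of a single message-passing layer (ours, not the paper's implementation) annotates where the \(O(|E|d)\) and \(O(|V|d^2)\) costs come from:

```python
# Cost accounting for one message-passing layer in PyTorch.
import torch

def message_passing_layer(x, edge_index, weight):
    # x: [|V|, d] node features; edge_index: [2, |E|] edge list; weight: [d, d].
    src, dst = edge_index
    # Gather-and-scatter over the edge list: every edge moves one d-dimensional
    # message, giving the O(|E| d) term.
    agg = torch.zeros_like(x).index_add_(0, dst, x[src])
    # Dense transform of every node representation: |V| products with a d x d
    # weight matrix, giving the O(|V| d^2) term.
    return torch.relu(agg @ weight)
```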
In addition to the time complexity analysis, the empirical time cost of the proposed method and the baselines is also tested. We show the results on Citeseer (\(r_{train}=0.7\)) in Figure 10; the results on other datasets show similar patterns. The results indicate that INL does not incur a prohibitive time cost for achieving the best performance in practice. Its time cost per training epoch is comparable with the baselines and lower than that of some competitive methods, demonstrating the efficiency and effectiveness of our method.
Fig. 10. The comparisons of empirical time cost per epoch during training our method and baselines on Citeseer (\(r_{train}=0.7\)).

4.8 Comparisons with GNNExplainer

We compare the invariant node features and structures generated by the proposed INL and by GNNExplainer [88] against the ground truth on the synthetic Citeseer dataset. Specifically, we take the post hoc explanations produced by GNNExplainer as its identified invariant ego-subgraphs, where the models to be explained are trained under ERM. The evaluation metric is ROC-AUC. The results in Table 7 show that the masks on invariant node features and edges generated by GNNExplainer are easily affected by the spurious correlations. Moreover, even when spurious correlations do not exist, the ROC-AUC of the masks on invariant node features and edges generated by INL still outperforms that of the explainability method GNNExplainer, showing the effectiveness of INL in identifying invariant patterns.
Table 7. Results (ROC-AUC, %) of Discovering the Ground-truth Invariant Node Features and Edges on Citeseer

Method       | Node Feature Mask                                               | Edge Mask
             | \(r_{train}=1/3\) | \(r_{train}=0.5\) | \(r_{train}=0.7\)       | \(r_{train}=1/3\) | \(r_{train}=0.5\) | \(r_{train}=0.7\)
GNNExplainer | 61.75\(\pm\)2.38 | 50.18\(\pm\)3.09 | 40.87\(\pm\)4.19 | 77.30\(\pm\)3.91 | 67.09\(\pm\)4.15 | 51.94\(\pm\)7.10
INL          | 68.04\(\pm\)2.19 | 69.18\(\pm\)2.06 | 70.16\(\pm\)2.54 | 78.68\(\pm\)3.10 | 79.09\(\pm\)3.21 | 80.51\(\pm\)3.13

4.9 Hyper-parameter Sensitivity

We investigate the sensitivity of the hyper-parameters of our method, including the number of inferred environments \(|\mathcal {E}_{infer}|\), the invariance regularizer coefficient \(\lambda\), and the number of epochs for clustering to infer environments in each training epoch (i.e., Epoch_Cluster in Algorithm 1). For simplicity, we only report the results on Citeseer (\(r_{train}=0.7\)) and OGB-Arxiv (2016–2018 with the GraphSAGE backbone) in Figures 11–13; the results on other datasets show similar patterns.
Fig. 11. Impact of the number of inferred environment \(|\mathcal {E}_{infer}|\). Red and blue lines denote the results of our INL, and grey dashed lines are the best results of all baselines.
Fig. 12. Impact of the invariance regularizer coefficient \(\lambda\). Red and blue lines denote the results of our INL, and grey dashed lines are the best results of all baselines.
Fig. 13. Impact of the number of epochs for clustering to infer environments in each training epoch (i.e., Epoch_Cluster in Algorithm 1). Red and blue lines denote the results of our INL, and grey dashed lines are the best results of all baselines.
First, the number of inferred environments has only a slight impact on the model performance. For Citeseer, the performance peaks at \(|\mathcal {E}_{infer}|=3\), showing that INL achieves the best result when the number of environments matches the ground truth. For OGB-Arxiv, the best number of environments is \(|\mathcal {E}_{infer}|=5\); a plausible reason is that OGB-Arxiv consists of more nodes and edges, which form more environments than Citeseer. Second, the coefficient \(\lambda\) also has a mild influence on the performance, indicating that the classification loss and the invariance regularizer term need to be properly balanced. Finally, a proper value of the hyper-parameter Epoch_Cluster is important: a small value may not be sufficient to infer the environments accurately, while a very large value is unnecessary and may hurt training efficiency. Although an appropriate choice of hyper-parameters can further improve the performance, our method is not very sensitive to them. Figures 11–13 show that INL outperforms the best baselines over a wide range of hyper-parameter choices.

5 Related Works

In this section, we review related work on node representation learning, generalization of GNNs, explainability of GNNs, invariant learning, and modularity.

5.1 Node Representation Learning

Node representation learning on graphs has been extensively studied, including random-walk-based methods [19, 29, 63] and matrix-factorization-based methods [10, 12, 62]. Recently, graph neural networks (GNNs) [28, 38, 75] have revolutionized the field of node representation learning [96]. They generally adopt a neighborhood aggregation (or message passing) paradigm to capture the structural information within a node's neighborhood. The message passing of the \(t\)th layer in GNNs is usually written as:
\begin{equation} \mathbf {Z}_v^{(t)} = \mathrm{COMBINE}^{(t)}\left(\mathbf {Z}_v^{(t-1)}, \mathbf {m}_v^{(t)}\right)\!,\ \ \mathbf {m}_v^{(t)} = \mathrm{AGGREGATION}^{(t)} \left(\lbrace \mathbf {Z}_u^{(t-1)}\rbrace \right)\!, \end{equation}
(25)
where \(u\) denotes a neighbor of node \(v\), \(\mathbf {Z}_v^{(t)}\) represents the embedding of node \(v\) at the \(t\)th layer, and \(\mathbf {Z}_v^{(0)}\) is initialized with the input node features. \(\mathbf {m}_v^{(t)}\) represents the aggregated message from the neighbors of node \(v\), and \(\mathrm{COMBINE}^{(t)}(\cdot)\) and \(\mathrm{AGGREGATION}^{(t)}(\cdot)\) are the combination and aggregation functions of GNNs [89]. Many GNNs and their variants [30, 46, 53, 59, 90, 98] have been proposed, achieving state-of-the-art performance on various tasks and demonstrating profound success in challenging applications, such as recommendation systems [9, 26, 31, 77, 83], information retrieval [17, 91, 95], drug discovery [18, 80], protein function prediction [33, 36], and traffic forecasting [21, 37]. However, most existing GNNs do not consider out-of-distribution generalization, so their performance drops substantially on testing data with distribution shifts [33, 44, 80].
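As an illustration only, Equation (25) can be instantiated with, say, mean aggregation and a linear-plus-ReLU combination; the concrete choices in the sketch below are ours and do not correspond to any particular GNN:

```python
# Toy instantiation of the message passing in Equation (25): mean AGGREGATION
# over neighbor embeddings and a linear + ReLU COMBINE step.
import torch

def aggregation(neighbor_embs):                    # m_v = AGGREGATION({Z_u})
    # Assumes every node has at least one neighbor.
    return torch.stack(neighbor_embs).mean(dim=0)

def combine(z_v, m_v, weight):                     # Z_v = COMBINE(Z_v, m_v)
    # weight has shape [d_out, 2 * d]; z_v and m_v have shape [d].
    return torch.relu(weight @ torch.cat([z_v, m_v]))

def gnn_layer(Z, neighbors, weight):
    # Z: dict node id -> embedding tensor; neighbors: dict node id -> list of neighbor ids.
    return {v: combine(Z[v], aggregation([Z[u] for u in neighbors[v]]), weight)
            for v in Z}
```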

5.2 Generalization of GNNs

A few recent works have begun to study the generalization ability of GNNs. Early works [27, 48, 66, 76] focus on generalization bounds over the training distribution, i.e., in-distribution generalization, which is orthogonal to OOD generalization and not suited to the distribution shifts studied in this article. More recently, the OOD generalization ability of GNNs has started to receive research interest [7, 39, 43, 58, 79, 82, 87]. In particular, Bevilacqua et al. [7] learn size-invariant representations to tackle distribution shifts in graph size. DIR [79] discovers invariant rationales for GNNs. GIL [45] focuses on capturing the invariant relationships between predictive graph structural information and labels under distribution shifts for OOD generalization. These works mostly concentrate on graph-level tasks and largely ignore the more challenging node-level tasks with multiple latent environments. Some works [24, 54, 99] deal with semi-supervised node classification under non-I.I.D. settings. They focus on the adaptation ability of GNNs under distribution shifts, i.e., transferring GNN models trained on a source domain (i.e., environment) to a related target domain with a different distribution. For example, SR-GNN [99] handles distribution shifts between the selected training and testing nodes by adopting CMD [93] and importance sampling. Reference [24] proposes to learn GNN models under agnostic label selection bias. However, these works assume that test data are available and participate in the training process, which is outside the scope of the OOD generalization problem studied in this article. One exception is the very recent pioneering work EERM [78], which studies invariant node learning by assuming that all nodes share a single environment. However, it ignores the more common and challenging situation in which nodes come from multiple latent environments. We empirically show that our proposed method greatly outperforms EERM by effectively identifying and modeling multiple latent environments.

5.3 Explainability of GNNs

Studies on the explainability of GNNs aim to understand the predictions of black-box GNNs by providing explanations [20, 72, 92]. They generally try to answer which nodes, edges, or features of the input graph are more important for predicting the labels. Several works find a subgraph structure and a small subset of node features of the target nodes as explanations for GNN predictions [49, 52, 88]. For example, GNNExplainer [88] learns soft masks on edges and node features to explain the predictions via mask optimization. PGExplainer [52] further learns approximated discrete masks on edges to explain the predictions with a parameterized mask predictor. GraphMask [68] is a post hoc method for explaining the importance of edges in the graph convolution layers. A recent work [79] finds that, like most GNN models, these explainability methods are very sensitive to distribution shifts and proposes discovering invariant explanations in graph-level classification tasks. However, these works focus on understanding the predictions of GNNs instead of learning node representations with better generalization ability under distribution shifts.

5.4 Invariant Learning

Invariant learning has received surging attention as a way to enable OOD generalization, aiming to generalize to unseen environments by exploiting the invariant relationships between features and labels across distribution shifts. Several works [2, 4, 11, 40, 42, 64] learn invariant models with guaranteed generalization under distribution shifts. However, most existing methods heavily rely on additional environment labels that have to be explicitly provided in the training dataset. Such annotations for nodes in graph data are usually unavailable and prohibitively expensive to collect, so these invariant learning methods are inapplicable. A few works study OOD generalization with latent environments in computer vision [16, 51, 56], but they cannot be directly applied to graph data. In summary, how to learn invariant node representations under distribution shifts without explicit environment labels remains largely unexplored in the literature.

5.5 Modularity

Modularity is generally used to measure the divergence between the number of intra-cluster edges and the expected number in a random graph [60], where nodes \(v\) and \(u\) with degrees \(d_v\) and \(d_u\) are connected with probability \({d_v d_u}/{2m}\) and \(m\) is the number of edges. By maximizing the modularity, the nodes become densely connected within each cluster [73]:
\begin{equation} \max _{C} Q = \frac{1}{2m} \mathrm{trace} \left(C^\top A C - \frac{1}{2m} {\rm diag} \left({C^\top \mathbf {d} \mathbf {d}^\top C}\right) \right), \end{equation}
(26)
where \(C\) is the cluster assignment matrix, \(A\) is the adjacency matrix of the input graph for clustering, and \(\mathbf {d}\) and \(m\) denote the degree vector and the number of edges, respectively. However, there are two obstacles to directly adopting this classical modularity maximization method to learn cluster assignments as the inferred environments. First, modularity maximization ignores the inter-cluster edges, whose connecting probability should be minimized at the same time. Second, we should use the variant patterns \((X^S, A^S)\) of the input graph for clustering rather than the whole input graph \((X, A)\): since the invariant patterns capture the invariant relationships between the predictive node features and graph structures and the node labels, the variant patterns in turn capture the variant spurious correlations under different distributions.
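For reference, the objective in Equation (26) can be evaluated directly; the short sketch below (ours, written for a hard one-hot assignment \(C\)) uses the equivalent form \(Q=\frac{1}{2m}\mathrm{trace}(C^\top B C)\) with the modularity matrix \(B=A-\mathbf{d}\mathbf{d}^\top /2m\):

```python
# Sketch: modularity of a hard cluster assignment, matching Equation (26).
import numpy as np

def modularity(A, C):
    # A: [n, n] symmetric adjacency matrix; C: [n, k] one-hot cluster assignment.
    d = A.sum(axis=1, keepdims=True)      # degree vector
    m = A.sum() / 2.0                     # number of edges
    B = A - (d @ d.T) / (2.0 * m)         # modularity matrix
    return float(np.trace(C.T @ B @ C)) / (2.0 * m)
```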

6 Conclusions

In this article, we study learning invariant node representations under distribution shifts with multiple latent environments and propose a principled and novel method (INL). The proposed method can identify the invariant and variant ego-subgraphs of nodes, infer the environment labels of nodes without supervision, and learn invariant node representations through regularization. Extensive experiments on both synthetic and real-world node classification benchmarks demonstrate the superiority of our method over state-of-the-art baselines under distribution shifts.

Footnotes

1
Although the variant spurious correlations can be potentially useful for predictions in some environments, such correlations are not stable and can change across different environments. It is infeasible to judge whether the variant spurious correlations are still correct or not when the model is deployed in unknown testing environments with distribution shifts. Therefore, for achieving good OOD generalization rather than trivially overfitting the training data, the key idea of invariant learning is to learn invariant models for guaranteed generalization under distribution shifts.
2
We follow the more challenging out-of-distribution generalization setting [2, 4, 11, 40, 42, 64] instead of the semi-supervised/adaptation setting, in which unlabeled testing graph data are available during training.

A Proofs

A.1 Proof of Proposition 3

Proof.
Let \(a_v^{I,I} = \frac{1}{|\mathcal {N}_v^I|} \sum _{u \in \mathcal {N}_v^I} x_u^I\) be the aggregated invariant node features from the invariant ego-subgraph \(G_v^I\). Similarly, we define \(a_v^{S,I} = \frac{1}{|\mathcal {N}_v^I|} \sum _{u \in \mathcal {N}_v^I} x_u^S\), \(a_v^{I,S} = \frac{1}{|\mathcal {N}_v^S|} \sum _{u \in \mathcal {N}_v^S} x_u^I\), and \(a_v^{S,S} = \frac{1}{|\mathcal {N}_v^S|} \sum _{u \in \mathcal {N}_v^S} x_u^S\). The first and second superscripts of \(a_v\) indicate the invariant/variant node features and structures, respectively. We further denote \(e_v^I = \frac{1}{|\mathcal {N}_v^I|} \sum _{u \in \mathcal {N}_v^I} e_u\) and \(e_v^S = \frac{1}{|\mathcal {N}_v^S|} \sum _{u \in \mathcal {N}_v^S} e_u\). The risk of the predictor \(f\) is:
\begin{align} \mathcal {R} &= \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {y_v}|\mathbf {G_v}=G_v} \left[ || \hat{y}_v - y_v ||_2^2 \right] \nonumber \nonumber\\ &= \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \Bigl [ || \left(\theta _1 a_v^{I,I} + \theta _2 a_v^{S,I} + \theta _3 a_v^{I,S} + \theta _4 a_v^{S,S}\right) - \left(a_v^{I,I} + \epsilon _1 \right) ||_2^2 \Bigr ] \nonumber \nonumber\\ &= \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \Bigl [ || \left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,S} + \theta _2 \left(\epsilon _1 + \epsilon _2 + e_v^I \right) + \theta _4\left(\epsilon _1 + \epsilon _2 + e_v^S\right) -\epsilon _1 ||_2^2 \Bigr ]. \end{align}
(27)
The first-order derivative w.r.t. \(\theta _1\) is:
\begin{align} &\frac{\partial \mathcal {R}}{\partial \theta _1} \nonumber \nonumber\\ &= \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \Bigl [ 2 \left(\left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,S} + \theta _2 \left(\epsilon _1 + \epsilon _2 + e_v^I \right) + \theta _4\left(\epsilon _1 + \epsilon _2 + e_v^S\right) -\epsilon _1 \right) a_v^{I,I} \Bigr ] \nonumber \nonumber\\ &= \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \left[ 2 \left(\left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,I} a_v^{I,S} \right) \right], \end{align}
(28)
where the second equation holds because \(a_v^{I,I}\) is independent of \(\epsilon _1\), \(\epsilon _2\), \(e_v^I\), and \(e_v^S\). Therefore, setting \(\frac{\partial \mathcal {R}}{\partial \theta _1} = 0\), we have
\begin{equation} \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \left[ \left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,I} a_v^{I,S} \right] = 0. \end{equation}
(29)
The first-order derivative w.r.t. \(\theta _2\) is:
\begin{align} &\frac{\partial \mathcal {R}}{\partial \theta _2} \nonumber \nonumber\\ &= \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \Bigl [ 2 \left(\left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,S} + \theta _2 (\epsilon _1 + \epsilon _2 + e_v^I) + \theta _4 \left(\epsilon _1 + \epsilon _2 + e_v^S\right)-\epsilon _1 \right) \left(a_v^{I,I} + \epsilon _1 + \epsilon _2 + e_v^I \right) \Bigr ] \nonumber \nonumber\\ &= \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \Bigl [ 2 \left(\left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,I} a_v^{I,S} + \theta _2 \left(\epsilon _1^2 + \epsilon _2^2 + e_v^I e_v^I \right) + \theta _4 \left(\epsilon _1^2 + \epsilon _2^2 + e_v^I e_v^S \right) - 1\right) \Bigr ] \nonumber \nonumber\\ &= \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \left[ 2 \left(\theta _2 \left(\epsilon _1^2 + \epsilon _2^2 + e_v^I e_v^I \right) + \theta _4 \left(\epsilon _1^2 + \epsilon _2^2 + e_v^I e_v^S \right) - 1\right) \right] \nonumber \nonumber\\ &= \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \left[ 2 \left(\theta _2 \left(2 + e_v^I e_v^I \right) + \theta _4 \left(2 + e_v^I e_v^S \right) - 1 \right) \right], \end{align}
(30)
where the second equation holds because of the independence among \(a_v^{I,I}\), \(\epsilon _1\), \(\epsilon _2\), and \(e_v^I\) or \(e_v^S\); the third equation holds since we set \(\frac{\partial \mathcal {R}}{\partial \theta _1} = 0\); and the last equation holds since \(\epsilon _1\) and \(\epsilon _2\) follow the standard normal distribution. We further set \(\frac{\partial \mathcal {R}}{\partial \theta _2} = 0\) and obtain:
\begin{equation} \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \left[ \theta _2 \left(2 + e_v^I e_v^I \right) + \theta _4 \left(2 + e_v^I e_v^S \right) - 1 \right] = 0. \end{equation}
(31)
Similarly, setting \(\frac{\partial \mathcal {R}}{\partial \theta _3} = 0\), we have
\begin{equation} \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \left[ \left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} a_v^{I,S} + \left(\theta _3 + \theta _4 \right) a_v^{I,S} a_v^{I,S} \right] = 0. \end{equation}
(32)
Setting \(\frac{\partial \mathcal {R}}{\partial \theta _4} = 0\), we have
\begin{equation} \frac{1}{|V|} \sum _{v \in V} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \left[ \theta _2 \left(2 + e_v^I e_v^S \right) + \theta _4 \left(2 + e_v^S e_v^S \right) - 1 \right] = 0. \end{equation}
(33)
Finally, given Equations (29) and (31)–(33), we can derive the solution:
\begin{equation} {\theta _1 = 1 - \frac{\mu ^S}{2(\mu ^S-\mu ^I)},\ \ \theta _2 = \frac{\mu ^S}{2(\mu ^S-\mu ^I)},\ \ \theta _3 = \frac{\mu ^I}{2(\mu ^S-\mu ^I)},\ \ \theta _4 = \frac{-\mu ^I}{2(\mu ^S-\mu ^I)}. } \end{equation}
(34)

A.2 Proof of Proposition 4

Proof.
If the invariance regularizer \(\mathrm{trace}(\mathrm{Var}_{\mathcal {E}_{infer}}(\nabla _\theta \mathcal {R}^e))\) in Equation (8) reaches its minimum, then \(\mathrm{trace}(\mathrm{Var}_{\mathcal {E}_{infer}}(\nabla _\theta \mathcal {R}^e)) = 0\). This means that the variance of \(\frac{\partial \mathcal {R}^e}{\partial \theta _i}\) among all environments is 0, i.e., \(\frac{\partial \mathcal {R}^e}{\partial \theta _i}\) remains invariant across any two environments for \(i=1,2,3,4\). Recall that
\begin{align} &\frac{\partial \mathcal {R}^e}{\partial \theta _1} \nonumber \nonumber\\ &= \frac{1}{|V^e|} \sum _{v \in V^e} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \Bigl [ 2 \left(\left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,S} + \theta _2 \left(\epsilon _1 + \epsilon _2 + e_v^I \right) + \theta _4\left(\epsilon _1 + \epsilon _2 + e_v^S\right) -\epsilon _1 \right) a_v^{I,I} \Bigr ] \nonumber \nonumber\\ &= \frac{1}{|V^e|} \sum _{v \in V^e} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \Bigl [ 2 \left(\left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,I} a_v^{I,S} \right) \Bigr ] \end{align}
(35)
and
\begin{equation*} { \begin{aligned}&\frac{\partial \mathcal {R}^e}{\partial \theta _2} \nonumber \nonumber \\ &= \frac{1}{|V^e|} \sum _{v \in V^e} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \Bigl [ 2 \left(\left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,S} + \theta _2 \left(\epsilon _1 + \epsilon _2 + e_v^I\right) + \theta _4 \left(\epsilon _1 + \epsilon _2 + e_v^S\right)-\epsilon _1 \right) \left(a_v^{I,I} + \epsilon _1 + \epsilon _2 + e_v^I \right) \Bigr ] \\ &= \frac{1}{|V^e|} \sum _{v \in V^e} \mathbb {E}_{\mathbf {\epsilon _1}, \mathbf {\epsilon _2}} \Bigl [ 2 \left(\left(\theta _1 + \theta _2 - 1\right) a_v^{I,I} a_v^{I,I} + \left(\theta _3 + \theta _4 \right) a_v^{I,I} a_v^{I,S} + \theta _2 \left(\epsilon _1^2 + \epsilon _2^2 + e_v^I e_v^I \right) + \theta _4 \left(\epsilon _1^2 + \epsilon _2^2 + e_v^I e_v^S \right) - 1\right) \Bigr ]. \nonumber \nonumber \\ \end{aligned}} \end{equation*}
Therefore, \(\frac{\partial \mathcal {R}^e}{\partial \theta _i}\) can remain invariant across any two environments for \(i=1,2,3,4\) only when \(\theta _3 + \theta _4 = 0\), \(\theta _2=0\), and \(\theta _4=0\). Finally, optimizing the invariance regularizer in Equation (8) to its minimum leads to \([\theta _2, \theta _3, \theta _4] = [0, 0, 0]\), so the model makes predictions based only on the invariant patterns and achieves promising OOD generalization under distribution shifts.□

References

[1]
Abien Fred Agarap. 2018. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018).
[2]
Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Yoshua Bengio, Ioannis Mitliagkas, and Irina Rish. 2021. Invariance principle meets information bottleneck for out-of-distribution generalization. In Neural Information Processing Systems (NeurIPS).
[3]
Kartik Ahuja, Jun Wang, Amit Dhurandhar, Karthikeyan Shanmugam, and Kush R. Varshney. 2021. Empirical or invariant risk minimization? A sample complexity perspective. In International Conference on Learning Representations.
[4]
Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2019. Invariant risk minimization. arXiv preprint arXiv:1907.02893 (2019).
[5]
Albert-Laszlo Barabasi and Zoltan N. Oltvai. 2004. Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet. 5, 2 (2004), 101–113.
[6]
Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. 2019. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations.
[7]
Beatrice Bevilacqua, Yangze Zhou, and Bruno Ribeiro. 2021. Size-invariant graph representations for graph classification extrapolations. In 38th International Conference on Machine Learning. 837–851.
[8]
Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Görke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. 2006. Maximizing modularity is hard. arXiv preprint physics/0608255 (2006).
[9]
Desheng Cai, Shengsheng Qian, Quan Fang, Jun Hu, and Changsheng Xu. 2022. User cold-start recommendation via inductive heterogeneous graph neural network. ACM Trans. Inf. Syst. 41, 3 (2022).
[10]
Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. GraRep: Learning graph representations with global structural information. In 24th ACM International on Conference on Information and Knowledge Management (CIKM’15). 891–900.
[11]
Shiyu Chang, Yang Zhang, Mo Yu, and Tommi Jaakkola. 2020. Invariant rationalization. In International Conference on Machine Learning. PMLR, 1448–1458.
[12]
Chong Chen, Min Zhang, Yongfeng Zhang, Yiqun Liu, and Shaoping Ma. 2020. Efficient neural matrix factorization without sampling for recommendation. ACM Trans. Inf. Syst. 38, 2 (2020), 1–28.
[13]
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning.
[14]
Xu Chen, Kun Xiong, Yongfeng Zhang, Long Xia, Dawei Yin, and Jimmy Xiangji Huang. 2020. Neural feature-aware recommendation with signed hypergraph convolutional network. ACM Trans. Inf. Syst. 39, 1 (2020), 1–22.
[15]
Gene Ontology Consortium. 2019. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Res. 47, D1 (2019).
[16]
Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. 2021. Environment inference for invariant learning. In International Conference on Machine Learning. PMLR, 2189–2200.
[17]
Hejie Cui, Jiaying Lu, Yao Ge, and Carl Yang. 2022. How can graph neural networks help document retrieval: A case study on CORD19 with concept map generation. In European Conference on Information Retrieval. Springer, 75–83.
[18]
Limeng Cui and Dongwon Lee. 2022. KETCH: Knowledge graph enhanced thread recommendation in healthcare forums. In 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 492–501.
[19]
Peng Cui, Shao-Wei Liu, Wen-Wu Zhu, Huan-Bo Luan, Tat-Seng Chua, and Shi-Qiang Yang. 2014. Social-sensed image search. ACM Trans. Inf. Syst. 32, 2 (2014), 1–23.
[20]
Enyan Dai and Suhang Wang. 2021. Towards self-explainable graph neural network. In 30th ACM International Conference on Information & Knowledge Management (CIKM’21). 302–311.
[21]
Austin Derrow-Pinion, Jennifer She, David Wong, Oliver Lange, Todd Hester, Luis Perez, Marc Nunkesser, Seongjae Lee, Xueying Guo, Brett Wiltshire, et al. 2021. ETA prediction with graph neural networks in Google Maps. In 30th ACM International Conference on Information & Knowledge Management (CIKM’21). 3767–3776.
[22]
David Easley, Jon Kleinberg, et al. 2012. Networks, Crowds, and Markets. Cambridge Books.
[23]
Abbas El Gamal and Young-Han Kim. 2011. Network Information Theory. Cambridge University Press.
[24]
Shaohua Fan, Xiao Wang, Chuan Shi, Kun Kuang, Nian Liu, and Bai Wang. 2022. Debiased graph neural networks with agnostic label selection bias. IEEE Trans. Neural Netw. Learn. Syst. (2022).
[25]
Matthias Fey and Jan E. Lenssen. 2019. Fast graph representation learning with PyTorch geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
[26]
Chen Gao, Xiang Wang, Xiangnan He, and Yong Li. 2022. Graph neural networks for recommender system. In 15th ACM International Conference on Web Search and Data Mining (WSDM’22). 1623–1625.
[27]
Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. 2020. Generalization and representational limits of graph neural networks. In International Conference on Machine Learning. PMLR, 3419–3430.
[28]
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In International Conference on Machine Learning. PMLR, 1263–1272.
[29]
Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
[30]
William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In 31st International Conference on Neural Information Processing Systems. 1025–1035.
[31]
Xiangnan He, Zhaochun Ren, Emine Yilmaz, Marc Najork, and Tat-Seng Chua. 2021. Graph technologies for user modeling and recommendation: Introduction to the special issue—Part 1. ACM Trans. Inf. Syst. 40, 2 (Sep. 2021).
[32]
Kanglin Hsieh et al. 2021. Drug repurposing for COVID-19 using graph neural network and harmonizing multiple evidence. Scient. Rep. 11, 1 (2021).
[33]
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open graph benchmark: Datasets for machine learning on graphs. Adv. Neural Inf. Process. Syst. 33 (2020), 22118–22133.
[34]
Kexin Huang and Marinka Zitnik. 2020. Graph meta learning via local subgraphs. Neural Inf. Process. Syst. 33 (2020).
[35]
Liwei Huang, Yutao Ma, Yanbo Liu, Bohong Danny Du, Shuliang Wang, and Deyi Li. 2021. Position-enhanced and time-aware graph convolutional network for sequential recommendations. ACM Trans. Inf. Syst. 41, 1 (2021).
[36]
Biaobin Jiang, Kyle Kloster, David F. Gleich, and Michael Gribskov. 2017. AptRank: An adaptive PageRank model for protein function prediction on bi-relational graphs. Bioinformatics 33, 12 (2017), 1829–1836.
[37]
Weiwei Jiang and Jiayun Luo. 2021. Graph neural network for traffic forecasting: A survey. arXiv preprint arXiv:2101.11174 (2021).
[38]
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
[39]
Boris Knyazev, Graham W. Taylor, and Mohamed Amer. 2019. Understanding attention and generalization in graph neural networks. Adv. Neural Inf. Process. Syst. 32 (2019), 4202–4212.
[40]
Masanori Koyama and Shoichiro Yamaguchi. 2020. Out-of-distribution generalization with maximal invariant predictor. arXiv preprint arXiv:2008.01883 (2020).
[41]
Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. 2004. Estimating mutual information. Phys. Rev. E 69, 6 (2004), 066138.
[42]
David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville. 2021. Out-of-distribution generalization via risk extrapolation (REX). In International Conference on Machine Learning.
[43]
Haoyang Li, Xin Wang, Ziwei Zhang, and Wenwu Zhu. 2022. OOD-GNN: Out-of-distribution generalized graph neural network. IEEE Trans. Knowl. Data Eng. (2022).
[44]
Haoyang Li, Xin Wang, Ziwei Zhang, and Wenwu Zhu. 2022. Out-of-distribution generalization on graphs: A survey. arXiv preprint arXiv:2202.07987 (2022).
[45]
Haoyang Li, Ziwei Zhang, Xin Wang, and Wenwu Zhu. 2022. Learning invariant graph representations for out-of-distribution generalization. In Advances in Neural Information Processing Systems.
[46]
Jianxin Li, Hao Peng, Yuwei Cao, Yingtong Dou, Hekai Zhang, Philip Yu, and Lifang He. 2021. Higher-order attribute-enhancing heterogeneous graph neural networks. IEEE Trans. Knowl. Data Eng. 35, 1 (2021).
[47]
Yang Li, Buyue Qian, Xianli Zhang, and Hui Liu. 2020. Graph neural network-based diagnosis prediction. Big Data 8, 5 (2020), 379–390.
[48]
Renjie Liao, Raquel Urtasun, and Richard Zemel. 2020. A PAC-Bayesian approach to generalization bounds for graph neural networks. In International Conference on Learning Representations.
[49]
Wanyu Lin, Hao Lan, and Baochun Li. 2021. Generative causal explanations for graph neural networks. In International Conference on Machine Learning. PMLR, 6666–6679.
[50]
Jiashuo Liu, Zheyuan Hu, Peng Cui, Bo Li, and Zheyan Shen. 2021. Heterogeneous risk minimization. In International Conference on Machine Learning. PMLR.
[51]
Jiashuo Liu, Zheyuan Hu, Peng Cui, Bo Li, and Zheyan Shen. 2021. Integrated latent heterogeneity and invariance learning in kernel space. In Advances in Neural Information Processing Systems.
[52]
Dongsheng Luo, Wei Cheng, Dongkuan Xu, Wenchao Yu, Bo Zong, Haifeng Chen, and Xiang Zhang. 2020. Parameterized explainer for graph neural network. Adv. Neural Inf. Process. Syst. 33 (2020).
[53]
Zihan Luo, Jianxun Lian, Hong Huang, Hai Jin, and Xing Xie. 2022. Ada-GNN: Adapting to local patterns for improving graph neural networks. In 15th ACM International Conference on Web Search and Data Mining (WSDM’22). 638–647.
[54]
Jiaqi Ma, Junwei Deng, and Qiaozhu Mei. 2021. Subgroup generalization and fairness of graph neural networks. Adv. Neural Inf. Process. Syst. 34 (2021).
[55]
Ting Ma, Longtao Huang, Qianqian Lu, and Songlin Hu. 2022. KR-GCN: Knowledge-aware reasoning with graph convolution network for explainable recommendation. ACM Trans. Inf. Syst. 41, 1 (2022).
[56]
Toshihiko Matsuura and Tatsuya Harada. 2020. Domain generalization using a mixture of multiple latent domains. In AAAI Conference on Artificial Intelligence. 11749–11756.
[57]
Miller McPherson, Lynn Smith-Lovin, and James M. Cook. 2001. Birds of a feather: Homophily in social networks. Ann. Rev. Sociol. 27, 1 (2001).
[58]
Siqi Miao, Mia Liu, and Pan Li. 2022. Interpretable and generalizable graph learning via stochastic attention mechanism. In International Conference on Machine Learning.
[59]
Christopher Morris, Martin Ritzert, Matthias Fey, William L. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. 2019. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence. 4602–4609.
[60]
Mark E. J. Newman. 2006. Modularity and community structure in networks. Proc. Nat. Acad. Sci. 103, 23 (2006).
[61]
Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2015. A review of relational machine learning for knowledge graphs. Proc. IEEE 104, 1 (2015), 11–33.
[62]
Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric transitivity preserving graph embedding. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1105–1114.
[63]
Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 701–710.
[64]
Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. 2018. Invariant models for causal transfer learning. J. Mach. Learn. Res. 19, 1 (2018).
[65]
Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. 2019. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731 (2019).
[66]
Franco Scarselli, Ah Chung Tsoi, and Markus Hagenbuchner. 2018. The Vapnik–Chervonenkis dimension of graph and recursive neural networks. Neural Netw. 108 (2018), 248–259.
[67]
Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
[68]
Michael Sejr Schlichtkrull, Nicola De Cao, and Ivan Titov. 2021. Interpreting graph neural networks for NLP with differentiable edge masking. In International Conference on Learning Representations.
[69]
Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
[70]
Jonathan M. Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M. Donghia, Craig R. MacNair, Shawn French, Lindsey A. Carfrae, Zohar Bloom-Ackermann, et al. 2020. A deep learning approach to antibiotic discovery. Cell 180, 4 (2020), 688–702.
[71]
Damian Szklarczyk, Annika L. Gable, David Lyon, Alexander Junge, Stefan Wyder, Jaime Huerta-Cepas, Milan Simonovic, Nadezhda T. Doncheva, John H. Morris, Peer Bork, et al. 2019. STRING v11: Protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D1 (2019), D607–D613.
[72]
Juntao Tan, Shijie Geng, Zuohui Fu, Yingqiang Ge, Shuyuan Xu, Yunqi Li, and Yongfeng Zhang. 2022. Learning and evaluating graph neural network explanations based on counterfactual and factual reasoning. In ACM Web Conference. 1018–1027.
[73]
Anton Tsitsulin, John Palowitch, Bryan Perozzi, and Emmanuel Müller. 2020. Graph clustering with graph neural networks. arXiv preprint arXiv:2006.16904 (2020).
[74]
Vladimir N. Vapnik. 1999. An overview of statistical learning theory. IEEE Trans. Neural Netw. 10, 5 (1999), 988–999.
[75]
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.
[76]
Saurabh Verma and Zhi-Li Zhang. 2019. Stability and generalization of graph convolutional neural networks. In 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1539–1548.
[77]
Ziyang Wang, Wei Wei, Gao Cong, Xiao-Li Li, Xian-Ling Mao, and Minghui Qiu. 2020. Global context enhanced graph neural networks for session-based recommendation. In 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 169–178.
[78]
Qitian Wu, Hengrui Zhang, Junchi Yan, and David Wipf. 2022. Handling distribution shifts on graphs: An invariance perspective. In International Conference on Learning Representations.
[79]
Yingxin Wu, Xiang Wang, An Zhang, Xiangnan He, and Tat-Seng Chua. 2022. Discovering invariant rationales for graph neural networks. In International Conference on Learning Representations.
[80]
Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. 2018. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 9, 2 (2018), 513–530.
[81]
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How powerful are graph neural networks? In International Conference on Learning Representations.
[82]
Keyulu Xu, Mozhi Zhang, Jingling Li, Simon S. Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2021. How neural networks extrapolate: From feedforward to graph neural networks. In International Conference on Learning Representations.
[83]
Jun Yang, Weizhi Ma, Min Zhang, Xin Zhou, Yiqun Liu, and Shaoping Ma. 2021. LegalGNN: Legal information enhanced graph neural network for recommendation. ACM Trans. Inf. Syst. 40, 2 (2021), 1–29.
[84]
Tianchi Yang, Linmei Hu, Chuan Shi, Houye Ji, Xiaoli Li, and Liqiang Nie. 2021. HGAT: Heterogeneous graph attention networks for semi-supervised short text classification. ACM Trans. Inf. Syst. 39, 3 (2021), 1–29.
[85]
Yiying Yang, Zhongyu Wei, Qin Chen, and Libo Wu. 2019. Using external knowledge for financial event prediction based on graph neural networks. In 28th ACM International Conference on Information and Knowledge Management (CIKM’19). 2161–2164.
[86]
Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning. PMLR, 40–48.
[87]
Gilad Yehudai, Ethan Fetaya, Eli Meirom, Gal Chechik, and Haggai Maron. 2021. From local structures to size generalization in graph neural networks. In International Conference on Machine Learning. PMLR, 11975–11986.
[88]
Rex Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. 2019. GNNexplainer: Generating explanations for graph neural networks. Adv. Neural Inf. Process. Syst. 32 (2019), 9240.
[89]
Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Adv. Neural Inf. Process. Syst. 33 (2020), 5812–5823.
[90]
Wenhui Yu, Xiao Lin, Jinfei Liu, Junfeng Ge, Wenwu Ou, and Zheng Qin. 2021. Self-propagation graph neural network for recommendation. IEEE Trans. Knowl. Data Eng. 34, 12 (2021).
[91]
Xueli Yu, Weizhi Xu, Zeyu Cui, Shu Wu, and Liang Wang. 2021. Graph-based hierarchical relevance matching signals for ad-hoc retrieval. In Web Conference. 778–787.
[92]
Hao Yuan, Haiyang Yu, Shurui Gui, and Shuiwang Ji. 2020. Explainability in graph neural networks: A taxonomic survey. arXiv preprint arXiv:2012.15445 (2020).
[93]
Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. 2017. Central moment discrepancy (CMD) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811 (2017).
[94]
Ge Zhang, Zhao Li, Jiaming Huang, Jia Wu, Chuan Zhou, Jian Yang, and Jianliang Gao. 2022. eFraudCom: An e-commerce fraud detection system via competitive graph neural networks. ACM Trans. Inf. Syst. 40, 3 (2022), 1–29.
[95]
Yuan Zhang, Dong Wang, and Yan Zhang. 2019. Neural IR meets graph embedding: A ranking model for product search. In World Wide Web Conference. 2390–2400.
[96]
Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2020. Deep learning on graphs: A survey. IEEE Trans. Knowl. Data Eng. 34, 1 (2020).
[97]
Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. 2019. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transport. Syst. 21, 9 (2019), 3848–3858.
[98]
Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. 2021. Interpreting and unifying graph neural networks with an optimization framework. In Web Conference. 1215–1226.
[99]
Qi Zhu, Natalia Ponomareva, Jiawei Han, and Bryan Perozzi. 2021. Shift-robust GNNs: Overcoming the limitations of localized graph training data. Adv. Neural Inf. Process. Syst. 34 (2021).
[100]
Marinka Zitnik, Monica Agrawal, and Jure Leskovec. 2018. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34, 13 (2018), i457–i466.
