4.1.1 Datasets.
We adopt two synthetic datasets with artificial distribution shifts based on two representative node classification benchmarks, Citeseer [86] and Amazon-Photo [69], in which the ground-truth generation processes are controllable. We also consider two real-world datasets, OGB-Arxiv and OGB-Proteins, from the Open Graph Benchmark [33]. The statistics of these datasets are provided in Table 2.
Synthetic datasets. Citeseer and Amazon-Photo are two commonly used node classification benchmarks. Citeseer is a citation network in which nodes represent papers and edges indicate citations between them. Amazon-Photo is a co-purchasing network in which nodes represent items and edges connect items that are frequently purchased together. To evaluate the model's out-of-distribution generalization ability, we introduce distribution shifts between the training and testing data.
Following Reference [78], we first use a randomly initialized 2-layer GCN to generate node labels \(Y\) from the original node features and edges, which can therefore be regarded as invariant and sufficiently predictive information for the labels, denoted by \((X^I, A^I)\). Then, we assign nodes to different environments and create spurious correlations between the label and the environment. Based on the label and environment of each node, we generate an additional feature matrix and additional edges as the variant patterns, denoted by \((X^S, A^S)\). The generated features (i.e., \(X^S\)) have the same dimensionality as the original features (i.e., \(X^I\)), and the number of generated edges (i.e., in \(A^S\)) equals the number of original edges (i.e., in \(A^I\)). We then concatenate the two feature matrices and add the generated edges to the original graph to form the input data, i.e., \((X=[X^I, X^S],\ A=A^I + A^S)\). The dependence among these variables is illustrated in Figure 3.
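For concreteness, the invariant label-generation step can be sketched as follows with PyTorch Geometric; this is a minimal illustration of the procedure, where the module name, the helper generate_labels, and the hidden dimensionality are our own placeholders:

```python
import torch
from torch_geometric.nn import GCNConv

# A randomly initialized, untrained 2-layer GCN maps (X^I, A^I) to logits;
# the argmax of the logits serves as the ground-truth label Y.
class LabelGCN(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, num_classes)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

@torch.no_grad()
def generate_labels(x_inv, edge_index_inv, num_classes, hid_dim=32):
    gcn = LabelGCN(x_inv.size(1), hid_dim, num_classes)  # random weights, never trained
    return gcn(x_inv, edge_index_inv).argmax(dim=-1)     # Y depends only on (X^I, A^I)
```

Because the GCN sees only \((X^I, A^I)\), these inputs are sufficiently predictive of \(Y\) by construction.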
More specifically, we set the ground-truth number of environments to \(K=3\) and adopt a hyper-parameter \(r \in [0, 1]\) to control the strength of the spurious correlation by setting the probability of node \(v\) belonging to the \(k\)th environment as \(P(v \in V^{e_k}) = r\) if \(k \equiv y_v \ (\mathrm{mod}\ K)\) and \(P(v \in V^{e_k}) = (1-r)/2\) otherwise. Intuitively, nodes with the same label are more likely to belong to the same environment. For example, for nodes whose labels are 1 or 4, the probability of belonging to the 1st environment is \(r\) and the probability of belonging to the 2nd or 3rd environment is \((1-r)/2\) each. In the \(K=3\) case, \(r=1/3\) means there is no spurious correlation, and a larger \(r\) indicates a stronger spurious correlation between the label and the environment. We set \(r_{test} = 1/3\) and vary \(r_{train}\) in \(\lbrace 1/3, 0.5, 0.7\rbrace\) to generate the testing and training graphs, respectively, which simulates different strengths of distribution shift. We hold out 10% of the nodes from the training graph for validation.
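A minimal sketch of this sampling rule is given below; the helper name and the 0-based label/environment indexing are our own conventions:

```python
import torch

def sample_environments(y, K=3, r=1/3):
    """Sample one environment ID per node, spuriously correlated with its label.

    P(v in V^{e_k}) = r if k == y_v mod K, and (1 - r) / (K - 1) otherwise,
    so for K = 3 the remaining two environments each get (1 - r) / 2 and
    r = 1/K recovers the no-correlation case.
    """
    probs = torch.full((y.numel(), K), (1 - r) / (K - 1))
    probs[torch.arange(y.numel()), y % K] = r            # spurious label-env link
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```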
After obtaining the environment of each node, we generate the variant node features \(X^S\) with a two-layer MLP that takes the label and environment ID as input. Then, we generate the variant edges \(A^S\) by connecting nodes with similar variant node features. In particular, we first score every potential edge (i.e., every edge not in \(A^I\)) by the cosine similarity of the variant node features of its two endpoints. According to these scores, we select the top-\(t\) potential edges to form the variant edges \(A^S\), where \(t\) is the number of edges in \(A^I\), so that the numbers of invariant and variant edges are equal.
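These two generation steps might look like the following sketch; the MLP width, the dense similarity matrix, and the directed top-\(t\) selection are our simplifying assumptions (a real implementation would restrict scoring to the upper triangle for an undirected graph):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_variant_patterns(y, env, inv_mask, t, feat_dim, num_classes, K=3):
    """y: node labels; env: environment IDs; inv_mask: boolean [N, N] matrix
    marking edges already in A^I plus the diagonal (no self-loops);
    t: number of edges in A^I."""
    # X^S: a randomly initialized two-layer MLP of (one-hot label, one-hot env).
    inp = torch.cat([F.one_hot(y, num_classes).float(),
                     F.one_hot(env, K).float()], dim=-1)
    mlp = torch.nn.Sequential(torch.nn.Linear(inp.size(1), 2 * feat_dim),
                              torch.nn.ReLU(),
                              torch.nn.Linear(2 * feat_dim, feat_dim))
    x_var = mlp(inp)

    # A^S: score candidate pairs by cosine similarity of X^S, keep the top-t.
    z = F.normalize(x_var, dim=-1)
    sim = (z @ z.t()).masked_fill(inv_mask, float('-inf'))
    flat_idx = sim.flatten().topk(t).indices
    n = sim.size(0)
    var_edge_index = torch.stack([torch.div(flat_idx, n, rounding_mode='floor'),
                                  flat_idx % n])
    return x_var, var_edge_index
```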
OGB-Arxiv. This dataset consists of Arxiv CS papers from 40 subject areas and their citations. The task is to predict a paper's subject area, e.g., cs.AI, cs.LG, or cs.OS. Instead of the semi-supervised/adaptation setting in which unlabeled testing data is available during training [33], we follow the more common and challenging out-of-distribution generalization setting [2, 4, 11, 40, 42, 64], i.e., the testing nodes are not available during training. Since several latent influential environment factors (e.g., the popularity of research topics) can change significantly over time, the properties of citation networks vary across time ranges. Therefore, we introduce node distribution shifts on OGB-Arxiv by selecting papers published before 2011 as the training set, papers published within 2011–2014 as the validation set, and papers published within 2014–2016, 2016–2018, and 2018–2020 as three testing sets.
OGB-Proteins. In this dataset, nodes represent proteins and edges indicate different types of biologically meaningful associations between proteins, e.g., physical interactions, co-expression, or homology [71]. The task is to predict the presence of protein functions in a binary classification setup. We again follow the out-of-distribution generalization setting [2, 4, 11, 40, 42, 64] instead of the semi-supervised setting, i.e., the testing nodes are not available during training. Since the latent influential environment factors can vary across the species the proteins come from, the properties and associations of proteins also differ between species. Therefore, we introduce node distribution shifts on OGB-Proteins by assigning nodes to the training/validation/testing sets according to their species. Specifically, the training and validation sets include proteins and their associations from four species and one species, respectively, and each of the three testing sets consists of proteins and their associations from one of the three remaining species.
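The species-based split admits an analogous sketch; the helper and its arguments are hypothetical, with the species IDs passed in explicitly:

```python
import torch

def proteins_species_splits(species, train_ids, valid_id, test_ids):
    """species: LongTensor of per-node species IDs; train_ids: four species
    for training; valid_id: one species for validation; test_ids: the three
    remaining species, one per testing set."""
    masks = {'train': torch.isin(species, torch.as_tensor(train_ids)),
             'valid': species == valid_id}
    for i, sid in enumerate(test_ids, start=1):
        masks[f'test_{i}'] = species == sid
    return masks
```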
All of these datasets are publicly available.
4.1.3 Implementation Details.
The number of epochs for optimizing our proposed method (i.e., Epoch in Algorithm 1) and the baselines is set to 200 for the synthetic datasets (i.e., Citeseer and Amazon-Photo) and 500 for the real-world datasets (i.e., OGB-Arxiv and OGB-Proteins). The number of clustering epochs for inferring environments within each training epoch (i.e., Epoch_Cluster in Algorithm 1) is 20. The Adam optimizer is adopted for gradient descent. Since we focus on node classification tasks, we use the cross-entropy loss as the loss function \(\ell\). The classifier \(w\) is instantiated as a two-layer MLP, and the activation function is ReLU [1]. The evaluation metric is ROC-AUC for OGB-Proteins and accuracy for the other datasets. For \(\mathrm{GNN^{M}}\), \(\mathrm{GNN^{C}}\), and \(\mathrm{GNN^{I}}\), the number of layers is set to 2 on all datasets. The dimensionality \(d\) of the node representations is 32 on the synthetic datasets, 128 on OGB-Arxiv, and 256 on OGB-Proteins. Note that \(\mathrm{GNN^{M}}\), \(\mathrm{GNN^{C}}\), and \(\mathrm{GNN^{I}}\) are shared across all ego-subgraphs, following References [34, 78]. The invariance regularizer coefficient \(\lambda\) in Equation (8) is chosen from \(\lbrace 10^{-4}, 10^{-2}, 10^{0}\rbrace\). The number of inferred environments \(|\mathcal{E}_{infer}|\) is chosen from \(\lbrace 2, 3, 4\rbrace\); it equals the dimensionality of the vector \(C_v\) indicating node \(v\)'s environment in the cluster assignment matrix \(C\). We report the mean results and standard deviations over 10 runs. The selected \(\lambda\) and \(|\mathcal{E}_{infer}|\) are reported in Table 3.
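For concreteness, the classifier and optimization setup described above could be instantiated as follows; the hidden width of the MLP and the class count are placeholders, since the text does not fix them:

```python
import torch

d, num_classes = 32, 6                     # d = 32 on the synthetic datasets
classifier = torch.nn.Sequential(          # classifier w: a two-layer MLP
    torch.nn.Linear(d, d),
    torch.nn.ReLU(),                       # ReLU activation [1]
    torch.nn.Linear(d, num_classes),
)
optimizer = torch.optim.Adam(classifier.parameters())
criterion = torch.nn.CrossEntropyLoss()    # the loss function ell
```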
As for the baselines, we implement them using their official source code. For each baseline, we conduct a hyper-parameter search covering the search ranges of both our method and the original paper (when a search range is reported). The search ranges and the selected hyper-parameters of the baselines are reported in Table 4. The other hyper-parameters of the baselines are kept consistent with our method as described above.
We conduct the experiments with the following hardware and software configurations:
• Operating System: Ubuntu 18.04.1 LTS
• GPU: NVIDIA GeForce RTX 3090 with 24 GB of memory
• Software: Python 3.6.5; NumPy 1.19.2; PyTorch 1.10.1; PyTorch Geometric 2.0.3 [25].