FairGAT: Fairness-Aware Graph Attention Networks

Authors:

O. Deniz Kose,

Yanning Shen

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 7

Article No.: 164, Pages 1 - 20

https://doi.org/10.1145/3645096

Published: 19 June 2024

Abstract

Graphs can facilitate modeling various complex systems such as gene networks and power grids as well as analyzing the underlying relations within them. Learning over graphs has recently attracted increasing attention, particularly graph neural network (GNN)–based solutions, among which graph attention networks (GATs) have become one of the most widely utilized neural network structures for graph-based tasks. Although it has been shown that the use of graph structures in learning results in the amplification of algorithmic bias, the influence of the attention design in GATs on algorithmic bias has not been investigated. Motivated by this, the present study first carries out a theoretical analysis to demonstrate the sources of algorithmic bias in GAT-based learning for node classification. Then, a novel algorithm, FairGAT, which leverages a fairness-aware attention design, is developed based on the theoretical findings. Experimental results on real-world networks demonstrate that FairGAT improves group fairness measures while also providing comparable utility to the fairness-aware baselines for node classification and link prediction.

1 Introduction

We live in the era of connectivity, in which the behaviors of humans and devices are increasingly driven by their relations to others. Thus, a significant amount of data is collected from various interconnected systems, such as social networks [44], power grid networks [39], and gene networks [49]. Learning from such data can lead to a better understanding and design of the corresponding networked systems. This motivates the increasing attention towards learning over graphs [8]. Among different approaches, graph neural networks (GNNs) have become the state-of-the-art in graph-based learning due to their success in various tasks [3, 11, 50].

GNNs employ different aggregation mechanisms to obtain node embeddings by aggregating the representations of neighbors. Among different GNN layers, graph attention networks (GATs) [47] have become one of the most widely utilized GNN designs [54]. GATs improve the conventional aggregation schemes over graph structure by leveraging masked self-attention layers. Specifically, instead of assigning uniform weights to each neighbor, hence, assuming each neighbor to be of equal importance, the attention mechanism learns neighbor-specific weights. Such non-uniform weights result in a more flexible aggregation framework and can find the more relevant neighbors.

Algorithmic bias refers to the performance gap incurred by machine learning (ML) algorithms with respect to certain sensitive/protected attributes (e.g., gender, ethnicity) in the context of group fairness [41]. For example, the accuracy differences between groups of people from different ethnicities in a face recognition model correspond to the algorithmic bias with respect to the sensitive attribute race. It has been demonstrated that GNN-based frameworks not only propagate the already existing algorithmic bias but may even amplify it due to the utilization of biased structural information [10]. For example, nodes in social networks tend to connect to other nodes with similar attributes, which leads to a denser connectivity between the nodes with the same sensitive attributes [23]. Hence, by aggregating information from the neighbors, the representations obtained by GNNs may be highly correlated with the sensitive attributes. This causes indirect discrimination in the ensuing learning tasks, even when the sensitive attributes are not directly used in training [20].

Despite the increasing popularity of GATs, the impact of attention on algorithmic bias has not yet been studied to the best of our knowledge. Motivated by this, the present work analyzes the sources of bias in a GAT-based neural network trained for node classification. Specifically, our analysis characterizes all the factors that play a role in the disparity of predictions for different sensitive groups. Based on the theoretical findings for the sources of bias, a novel algorithm, called FairGAT, is designed. FairGAT introduces a fairness-aware attention layer that can mitigate bias via a novel attention learning strategy. The proposed algorithm improves fairness while providing comparable utility to state-of-the-art fairness-aware schemes, is efficient in terms of time complexity, and can be flexibly employed together with other fairness enhancement methods. The contributions of this work can be summarized as follows:

(c1)

For a neural network consisting of multiple GAT layers, we present a theoretical analysis that illuminates the sources of bias leading to disparities between different sensitive groups.

(c2)

The presented analysis strategy can pave the way for further theoretical findings illustrating the sources of bias for different GNN layers, e.g., graph convolutional networks [28].

(c3)

Based on the developed analysis, we devise a novel algorithm, FairGAT, with three main steps. Each step in the algorithm combats one of the theoretically identified sources of bias for a GAT-based neural network.

(c4)

FairGAT introduces a new attention mechanism that can mitigate algorithmic bias. The proposed fair attention learning strategy is efficient, i.e., it does not incur significant additional computational complexity compared with conventional GATs.

(c5)

The experimental results are obtained over six real-world networks for node classification and link prediction. The comparative results show that FairGAT can improve group fairness measures compared with the fairness-aware baselines while providing similar utility.

2 Related Work

ML over graphs. Conventional graph learning approaches can be summarized under two categories: factorization-based and random walk-based approaches. Factorization-based methods minimize the difference between the inner product of the created node representations and a deterministic similarity metric (that is typically based on the graph structure) between the nodes, e.g., [2, 7, 38]. Random walk-based approaches, on the other hand, employ stochastic measures of similarity and target node representations whose inner products reflect the considered stochastic similarity measure between the nodes, e.g., [9, 18, 42, 46]. Recently, GNNs have gained popularity and have become the state-of-the-art for a number of graph-based tasks; see, e.g., [16, 21, 24, 48, 51, 55, 56].

Fairness-aware learning over graphs. Fairwalk [43] serves as a seminal work for fairness-aware random walk-based studies [26]. In addition, [4, 10, 15] propose to use adversarial regularizers to mitigate bias in GNNs. Another strategy is to use a Bayesian approach in which the sensitive information is modeled in the prior distribution to enhance fairness over graphs [6]. Furthermore, [35] performs a PAC-Bayesian analysis and links the notion of subgroup generalization to accuracy disparity, and [53] proposes several strategies, including GNN-based ones, to reduce bias for the heterogeneous information networks. Specifically for fairness-aware link prediction, [5] introduces a regularizer, and [33, 34] propose strategies that alter the adjacency matrix. With a specific consideration of individual fairness over graphs, [12] proposes a ranking-based framework. An alternative approach in fairness-aware graph-based learning is to modify the input graph to combat bias via automated or manually designed fair graph data augmentations [1, 13, 30, 32, 45]. Differing from the majority of these prior works, our work first develops a theoretical analysis for algorithmic bias for our system of interest (GATs), based on which it proposes a systematic framework to mitigate the bias. In our preliminary work [31], we formulated a fairness-aware attention mechanism in order to reduce the correlation between the sensitive attributes and aggregated node representations. This article extends and completes our prior work [31] by theoretically analyzing a statistical parity-related bias measure in an end-to-end GAT-based neural network, thereby providing a more thorough strategy for bias mitigation. We also provide the rigorous proof of the theorem presented in [31] in addition to our new findings.

3 Preliminaries

The focus of this study is designing a fair GAT-based neural network for node classification for a given graph \(\mathcal {G}:=(\mathcal {V}, \mathcal {E})\), where \(\mathcal {V}:=\) \(\left\lbrace v_{1}, v_{2}, \ldots , v_{N}\right\rbrace\) is the set of nodes and \(\mathcal {E} \subseteq \mathcal {V} \times \mathcal {V}\) denotes the set of edges. Nodal features and graph adjacency matrices of the input graph \(\mathcal {G}\) are denoted by \(\mathbf {X} \in \mathbb {R}^{N \times F}\) and \(\mathbf {A} \in \lbrace 0,1\rbrace ^{N \times N}\), respectively, where \(\mathbf {A}_{i j}=1\) if and only if \((v_{i}, v_{j}) \in \mathcal {E}\). This work considers a single, binary sensitive attribute for each node, which is denoted by \(\mathbf {s} \in \lbrace 0,1\rbrace ^{N}\). The learned node representations in the neural network at layer l are denoted by \(\mathbf {H}^{l+1}\). \(\mathbf {x}_{i} \in \mathbb {R}^{F}\), \(\mathbf {h}^{l+1}_{i} \in \mathbb {R}^{F^{l}}\), \(s_{i} \in \lbrace 0,1\rbrace\), and \(\mathcal {N}_i\) denote the feature vector, the representation at layer l, the sensitive attribute, and the set of neighbors of node \(v_{i}\), respectively. \(\mathcal {S}_{0}\) and \(\mathcal {S}_{1}\) denote the set of nodes whose sensitive attributes are 0 and 1, respectively. Define inter-edge set \(\mathcal {E}^{\chi }:=\lbrace e_{ij}|v_i \in \mathcal {S}_a, v_j \in \mathcal {S}_b, a\ne b\rbrace\), while intra-edge set is defined as \(\mathcal {E}^{\omega }:= \lbrace e_{ij}|v_i \in \mathcal {S}_a, v_j \in \mathcal {S}_b, a= b\rbrace\). Similarly, the set of nodes having at least one inter-edge is denoted by \(\mathcal {S}^{\chi }\), while \(\mathcal {S}^{\omega }\) defines the set of nodes that have no inter-edges (i.e., they only have intra-edges). The intersection of the sets \(\mathcal {S}_{0}\) and \(\mathcal {S}^{\chi }\) is denoted by \(\mathcal {S}_{0}^{\chi }\).

GNNs create node embeddings by aggregating the representations of neighbors for each node. Most GNN structures implicitly assume that all neighbors have the same importance to the anchor node. On the other hand, GATs learn weights \(\alpha _{ij}\), which indicate the importance of neighbor node j to the anchor node i. Via learning attention coefficients \(\alpha _{ij}\), GATs can select the most relevant neighbors to the anchor node, which results in a more flexible framework than equal weight assignment.

4 Methodology

The success of GATs for various tasks has increased their popularity for graph-based learning [54], whereas the effect of such attention design on algorithmic bias has not yet been considered. However, the attention coefficients are often correlated with the node representations, which may entail bias. Hence, the employment of the attention mechanism may inherit this bias and even amplify it, which motivates us to design a fairness-aware attention mechanism herein. To this end, we first group the neighboring nodes of each anchor node into two subsets based on their sensitive attributes. Our goal is then to find the optimal amount of attention to be assigned to each subset to mitigate bias. For a particular node \(v_{k}\), let \(\alpha _{k}^{\chi }\) denote the attention that is assigned to the neighbors from different sensitive groups than the anchor node, i.e., \(\alpha _{k}^{\chi } := \sum _{a \in \mathcal {N}_{k} \cap \mathcal {S}_i} \alpha _{k a} \text{ if } v_k \in \mathcal {S}_j \text{ and } i \ne j\). Ideally, it would be of interest to search for the optimal amount of \(\alpha _{k}^{\chi }\) for each node \(v_{k}\) separately. However, such a per-node approach would incur high complexity, which can be undesirable (potentially even infeasible) for large graphs. Motivated by this, we take a global approach instead, which seeks the optimal value of \(\alpha ^{\chi } := \alpha _{k}^{\chi }=\sum _{a \in \mathcal {N}_{k} \cap \mathcal {S}_i} \alpha _{k a}, \forall v_k \in \mathcal {S}_j \text{ if } i \ne j\). The numerical results in Section 5 show that such a global approach can indeed effectively mitigate bias while protecting the utility of the ensuing task.

Before moving forward to develop a bias mitigation strategy, this section first investigates the sources of bias in a GAT-based neural network trained for node classification. As the bias measure, the disparity between the predictions for different sensitive groups is utilized:

\begin{equation} \delta _{\hat{y}}:=\left\Vert \operatorname{mean}(\mathbf {\hat{y}}_{j} \mid s_{j}=0) - \operatorname{mean}(\mathbf {\hat{y}}_{j} \mid s_{j}=1) \right\Vert _{2}, \end{equation}

(1)

where \(\mathbf {\hat{y}}_{j}\) denotes the predicted soft label for node \(v_{j}\) and \(\operatorname{mean}(\cdot)\) is the sample mean operator. Note that \(\delta _{\hat{y}}\) generalizes the commonly utilized group fairness metric statistical parity [14], \(\Delta _{S P}:=|P(\hat{c}_{j}=1 \mid s_{j}=0)-P(\hat{c}_{j}=1 \mid s_{j}=1)|,\) where \(\hat{c}_{j}\) stands for the predicted hard (class) label of \(v_{j}\). The two measures coincide when \(\mathbf {\hat{y}}_{j}\) is output by the sigmoid activation and denotes the probability that \(v_{j}\) has the class label of 1. Specifically, focusing on the term \(P(\hat{c}_{j}=1 \mid s_{j}=0)\), it follows that

\begin{equation} \begin{split} P(\hat{c}_{j}=1 \mid s_{j}=0) &= \int _{0}^{1} P(\hat{c}_{j}=1 \mid \hat{y}_{j}, s_{j}=0) P(\hat{y}_{j} \mid s_{j}=0) d\hat{y}_{j} \\ & = \int _{0}^{1} P(\hat{c}_{j}=1 \mid \hat{y}_{j}) P(\hat{y}_{j} \mid s_{j}=0) d\hat{y}_{j}, \end{split} \end{equation}

(2)

where the last equality follows from the random variables \(s_{j} \rightarrow \hat{y}_{j} \rightarrow \hat{c}_{j}\) forming a Markov chain. Furthermore, \(P(\hat{c}_{j}=1 \mid \hat{y}_{j}) = \hat{y}_{j}\), as \(\hat{y}_{j}\) is assumed to be a soft label denoting the probability that \(v_{j}\) has the class label of 1, leading to

\begin{equation} P(\hat{c}_{j}=1 \mid s_{j}=0) = \int _{0}^{1} \hat{y}_{j} P(\hat{y}_{j} \mid s_{j}=0) d\hat{y}_{j}=\operatorname{mean}(\hat{y}_{j} \mid s_{j}=0). \end{equation}

(3)

The same can also be derived for \(P(\hat{c}_{j}=1 \mid s_{j}=1)\), where \(P(\hat{c}_{j}=1 \mid s_{j}=1) = \operatorname{mean}(\hat{y}_{j} \mid s_{j}=1)\), proving that \(\Delta _{S P}\) is a special case of \(\delta _{\hat{y}}\) when \(\mathbf {\hat{y}}_{j}\) is output by sigmoid activation and denotes the probability that \(v_{j}\) has the class label of 1. After the sources of bias are demonstrated, our proposed algorithm, FairGAT, will be developed and presented in the remainder of this section.
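As a minimal sketch of the bias measure in Equation (1) for scalar soft labels, the following Python snippet computes the gap between group-wise sample means. The function name `group_mean_gap` and the toy values are illustrative assumptions, not part of the original work. Applying the same function to hard labels yields the empirical \(\Delta _{S P}\). Note that the equality in Equation (3) relies on \(P(\hat{c}_{j}=1 \mid \hat{y}_{j}) = \hat{y}_{j}\); with deterministic thresholding, as in this toy example, the two quantities can differ.

```python
from statistics import fmean


def group_mean_gap(values, s):
    """Return |mean(values | s=0) - mean(values | s=1)| for scalar values.

    For scalar outputs, the 2-norm in Equation (1) reduces to an absolute value.
    """
    group0 = [v for v, si in zip(values, s) if si == 0]
    group1 = [v for v, si in zip(values, s) if si == 1]
    return abs(fmean(group0) - fmean(group1))


# Toy data, made up for illustration only.
y_hat = [0.9, 0.7, 0.2, 0.4]   # predicted soft labels
s = [0, 0, 1, 1]               # binary sensitive attributes

# Equation (1), scalar case: gap between the group-wise mean soft labels.
delta_y_hat = group_mean_gap(y_hat, s)

# Empirical statistical parity computed from thresholded hard labels.
c_hat = [1 if y >= 0.5 else 0 for y in y_hat]
delta_sp = group_mean_gap(c_hat, s)

print(delta_y_hat, delta_sp)
```

Here the soft-label disparity is roughly 0.5, while the thresholded hard labels give a statistical parity gap of 1, which illustrates why the soft-label measure is treated as the more general quantity.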

4.1 Bias Analysis

This subsection aims to illuminate the factors that lead to the disparity between the predictions for different sensitive groups, \(\delta _{\hat{y}}\), in a GAT-based network trained for node classification. Let \(\mathbf {Z}^{l+1}\) denote the aggregated representations by the lth GAT layer with ith row \(\mathbf {z}^{l+1}_{i}:= \sum _{j \in \mathcal {N}_{i}} \alpha ^{l}_{i j} \mathbf {c}^{l+1}_{j}\), where \(\mathbf {c}^{l+1}_{i}:=\mathbf {W}^{l} \mathbf {h}^{l}_{i}\). The sample means of \(\mathbf {c}^{l+1}\) and \(\mathbf {z}^{l+1}\) vectors are represented by \(\bar{\mathbf {c}}_{s}^{l+1} :={\rm mean}(\mathbf {c}^{l+1}_j \mid v_{j} \in \mathcal {S}_{s})\) and \(\bar{\mathbf {z}}_{s}^{l+1} :={\rm mean}(\mathbf {z}^{l+1}_j \mid v_{j} \in \mathcal {S}_{s})\) for the nodes in sensitive group \(\mathcal {S}_{s}\) for \(s=0,1\). The following assumption is made for the theoretical findings in this work:

A1 (Finite-valued representations): \(\Vert \mathbf {c}^{l+1}_j-\bar{\mathbf {c}}_{s}^{l+1}\Vert _{\infty } \le (\Delta _{c}^{(s)})^{l+1}\), \(\forall v_{j} \in \mathcal {S}_{s}\) with \(s \in \lbrace 0,1\rbrace\), where \(\Delta ^{l+1}_{c} = \operatorname{max}((\Delta _{c}^{(0)})^{l+1}, (\Delta _{c}^{(1)})^{l+1})\). \(\Vert \mathbf {z}^{l+1}_j-\bar{\mathbf {z}}_{s}^{l+1}\Vert _{\infty } \le (\Delta _{z}^{(s)})^{l+1}\), \(\forall v_{j} \in \mathcal {S}_{s}\) with \(s \in \lbrace 0,1\rbrace\), where \(\Delta ^{l+1}_{z} = \operatorname{max}((\Delta _{z}^{(0)})^{l+1}, (\Delta _{z}^{(1)})^{l+1})\). Here, \(\operatorname{max}(\cdot ,\cdot)\) outputs the element-wise maximum of the input vectors.

Based on this assumption, Theorem 4.1 demonstrates the factors that contribute to the disparity between the representations of different sensitive groups obtained at the lth GAT layer. Specifically, Theorem 4.1 upper bounds the term \(\delta _{h}^{l+1}:=\Vert \operatorname{mean}(\mathbf {h}^{l+1}_{j} \mid s_{j}=0) - \operatorname{mean}(\mathbf {h}^{l+1}_{j} \mid s_{j}=1)\Vert _{2}\). The proof of this theorem is presented in Appendix B.

Theorem 4.1.

The disparity between the representations of different sensitive groups that are output by the lth GAT layer, \(\delta _{h}^{l+1}\), can be upper bounded by

\begin{equation} \delta _{h}^{l+1} \le L \Big (\sigma _{max}(\mathbf {W}^{l}) \big | (R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1)\big |\delta _{h}^{l} + 2 \sqrt {N} \Delta ^{l+1}_{c} + 2 \sqrt {N} \Delta ^{l+1}_{z} \Big), \end{equation}

(4)

where L is the Lipschitz constant of the utilized nonlinear activation, \(\sigma _{max}(\cdot)\) denotes the largest singular value of the input matrix, and \(R_{1}^{\chi }:=\frac{|\mathcal {S}^{\chi }_{1}|}{|\mathcal {S}_{1}|}, R_{0}^{\chi }:=\frac{|\mathcal {S}^{\chi }_{0}|}{|\mathcal {S}_{0}|}\).

Theorem 4.1 explains the factors that can amplify the disparity of representations throughout a GAT layer. We observed that the majority of our baselines herein employ a single, final, fully connected layer for supervised node classification after multiple GNN layers. Thus, we extend the analysis in Theorem 4.1 to systems that also include a fully connected layer through Lemma 4.2. Specifically, Lemma 4.2 investigates the sources of bias by upper bounding \(\delta _{h}^{l+1}\) for a fully connected layer; its proof is provided in Appendix C.

Lemma 4.2.

The disparity between the representations from different sensitive groups that are output by the lth fully connected layer with input–output relationship \(\mathbf {H}^{l+1} = \sigma (\mathbf {H}^{l} \mathbf {W}^{l})\) (\(\sigma\) denoting the nonlinear activation), \(\delta _{h}^{l+1}\), can be upper bounded by

\begin{equation} \delta _{h}^{l+1} \le L (\sigma _{max}(\mathbf {W}^{l}) \delta _{h}^{l} + 2 \sqrt {N} \Delta ^{l+1}_{z}), \end{equation}

(5)

where \(\mathbf {W}^{l}\) is the learnable weight matrix at the lth fully connected layer and \(\mathbf {Z}^{l+1} = \mathbf {H}^{l} \mathbf {W}^{l}\).

Remark. The results of Theorem 4.1 together with Lemma 4.2 can help characterize the sources of bias in a neural network with multiple GAT layers, possibly followed by a fully connected layer. To exemplify a special case, consider a neural network consisting of a single GAT layer with input–output relation \(\boldsymbol {h}_i^{1}=\sigma (\sum _{j \in \mathcal {N}_i} \alpha ^{0}_{i j} \cdot \boldsymbol {W}^{0} \boldsymbol {x}^{0}_j)\), followed by a fully connected layer with input–output relation \(\mathbf {\hat{y}}_{i}= \mathbf {h}^{2}_{i} =\sigma (\boldsymbol {W}^{1} \mathbf {h}^{1}_i)\). Lemma 4.2 implies that the disparity between the outputs of different sensitive groups, \(\delta _{\hat{y}}:=\left\Vert \operatorname{mean}(\mathbf {\hat{y}}_{j} \mid s_{j}=0) - \operatorname{mean}(\mathbf {\hat{y}}_{j} \mid s_{j}=1) \right\Vert _{2}\), can be upper bounded by

\begin{equation} \delta _{\hat{y}} \le L (\sigma _{max}(\mathbf {W}^{1}) \delta _{h}^{1} + 2 \sqrt {N} \Delta ^{2}_{z}). \end{equation}

(6)

Theorem 4.1 further shows that

\begin{equation} \delta _{h}^{1} \le L (\sigma _{max}(\mathbf {W}^{0}) |(R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1)| \delta _{x} + 2 \sqrt {N} \Delta ^{1}_{c} + 2 \sqrt {N} \Delta ^{1}_{z}), \end{equation}

(7)

where \(\delta _{x}:=\Vert \operatorname{mean}(\mathbf {x}_{j} \mid s_{j}=0) - \operatorname{mean}(\mathbf {x}_{j} \mid s_{j}=1) \Vert _{2}\). The bias measure \(\delta _{\hat{y}}\) for the overall network can be upper bounded based on Equations (6) and (7). Our framework, FairGAT, is motivated by lowering this upper bound so that the overall scheme leads to a smaller \(\delta _{\hat{y}}\) value. Note that while the results in Equations (6) and (7) are presented for an example network with a single GAT layer for demonstrative purposes, they can easily be extended to multiple layers of GATs by utilizing Theorem 4.1 and Lemma 4.2.
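To make the chaining of Equations (6) and (7) concrete, the following Python sketch evaluates the two right-hand sides numerically. The function names, argument names, and all numeric values are hypothetical and chosen only for illustration; in particular, the deviation terms \(\Delta _{c}\) and \(\Delta _{z}\) are treated here as given scalars.

```python
import math


def gat_layer_bound(lipschitz, sigma_w, r0, r1, alpha_inter, delta_in, n, dev_c, dev_z):
    """Right-hand side of Equation (4) (or (7)) for one GAT layer."""
    attention_factor = abs((r1 + r0) * alpha_inter - 1.0)
    return lipschitz * (
        sigma_w * attention_factor * delta_in
        + 2.0 * math.sqrt(n) * dev_c
        + 2.0 * math.sqrt(n) * dev_z
    )


def fc_layer_bound(lipschitz, sigma_w, delta_in, n, dev_z):
    """Right-hand side of Equation (5) (or (6)) for one fully connected layer."""
    return lipschitz * (sigma_w * delta_in + 2.0 * math.sqrt(n) * dev_z)


# Hypothetical values: 1-Lipschitz activation, unit spectral norms, N = 100 nodes.
delta_h1_bound = gat_layer_bound(
    lipschitz=1.0, sigma_w=1.0, r0=0.5, r1=0.5,
    alpha_inter=0.4, delta_in=0.3, n=100, dev_c=0.01, dev_z=0.01,
)
# Feed the GAT-layer bound into the fully connected layer, as in Equation (6).
delta_y_bound = fc_layer_bound(
    lipschitz=1.0, sigma_w=1.0, delta_in=delta_h1_bound, n=100, dev_z=0.01,
)
print(delta_h1_bound, delta_y_bound)
```

With these made-up inputs, the attention-dependent factor is \(|1 \cdot 0.4 - 1| = 0.6\), so only 60% of the input disparity \(\delta _{x}\) enters the first bound before the deviation terms are added.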

Overall, our above analysis illuminates the factors that play a role in the propagated bias towards the predictions of a GAT-based neural network, which also hints at the possible solutions to combat such bias. Since \(\delta _{h}^{l}\) characterizes the disparity between the output representations at layer l, the relevant terms in Equations (4) and (5) should be properly controlled to avoid the amplification of bias. Specifically, the following steps can be applied in order to design a fair GAT-based network.

(1)

Fair attention learning: Theorem 4.1 implies that the total amount of attention assigned to inter-edges, \(\alpha ^{\chi }\), can be manipulated to reduce the resulting intrinsic bias, since the upper bound on \(\delta _{\hat{y}}\) is a function of \(|(R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1)|\).

(2)

Spectral normalization of weight matrices: The analysis shows that the propagated bias towards the predictions is affected by the spectral properties of the weight matrices \(\mathbf {W}^{l}\) at every layer l. Specifically, it can be observed from the upper bounds of \(\delta _{h}^{l}\) in Equations (4) and (5) that the largest singular value \(\sigma _{max}(\mathbf {W}^{l})\) (i.e., spectral norm) of the weight matrix should not be larger than 1 in order not to amplify the already existing disparity. Therefore, spectral normalization is applied to \(\mathbf {W}^{l}\) at every layer l to guarantee that \(\sigma _{max}(\mathbf {W}^{l}) \le 1\).

(3)

Scaling representations: Finally, Theorem 4.1 and Lemma 4.2 suggest that the maximal deviation \(\Delta _{c}\) at every attention layer, and \(\Delta _{z}\) at every attention or fully connected layer, influence the bias measure \(\delta _{\hat{y}}\). Such deviations can be manipulated for bias mitigation.

Remark. Theorem 4.1 also suggests that the disparity between the nodal features of different sensitive groups, \(\delta _{x}\), should be decreased for a lower upper bound on \(\delta _{\hat{y}}\). Since \(\delta _{x}\) is not influenced by the changes in other factors, the present framework will not focus on this term. However, the proposed framework herein can be employed in conjunction with a fairness-aware nodal feature manipulation strategy, such as the ones developed in [13, 32], to reduce \(\delta _{x}\).

4.2 Proposed Scheme: FairGAT

Building upon the theoretical analysis in Section 4.1, this subsection develops a novel framework to reduce \(\delta _{\hat{y}}\) for a GAT-based network trained for node classification. The overall scheme includes three steps as mentioned before, the design of which will be discussed in detail. The overall algorithm is presented in Algorithm 1.

4.2.1 Fair Attention Learning.

Theorem 4.1 demonstrates that the total amount of attention assigned to inter-edges, \(\alpha ^{\chi }\), can be manipulated to lower the term \(|(R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1)|\) for bias mitigation. Motivated by this, a fairness-aware graph-attention layer is designed in this step, where the learned attention minimizes the bias-related term \(|(R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1)|\). To this end, the optimal amount of attention that should be assigned to the inter-edges is analyzed herein. Specifically, we consider the following optimization problem:

\begin{equation} \begin{array}{rrclcl} \displaystyle (\alpha ^{\chi })^{*} = \mathop {\mathrm {arg\,min}}_{\alpha ^{\chi }} & {|R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1|}\\ \textrm {s.t.} & 0 \le \alpha ^{\chi } \le \alpha ^{\chi }_{max}. \\ \end{array} \end{equation}

(8)

Similar to its \(\alpha ^{\chi }\) counterpart, let \(\alpha ^{\omega }\) denote the total amount of attention assigned to the neighbors from the same sensitive group, that is, \(\alpha ^{\omega } := \alpha _{k}^{\omega }=\sum _{a \in \mathcal {N}_{k} \cap \mathcal {S}_i} \alpha _{k a}, \forall v_k \in \mathcal {S}_j \text{ if } i = j\). Then, by their definitions, the optimal amount of attention that should be assigned to the intra-edges equals \((\alpha ^{\omega })^{*} = 1 - (\alpha ^{\chi })^{*}\). In Equation (8), \(\alpha ^{\chi }_{max} \le 1\) specifies the maximum amount of attention that can be assigned to the inter-edges and is thus a hyperparameter used to provide a trade-off between utility and fairness. Note that the extreme case of having \((\alpha ^{\chi })^{*}=1\) would mean that \((\alpha ^{\omega })^{*}=0\), implying that the information coming from the neighbors with the same sensitive attribute as the anchor node is not used at all, which is expected to degrade the overall utility.

It always holds that \(|R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1| \ge 0\), where equality is achieved when \(\alpha ^{\chi }= \frac{1}{R_{1}^{\chi } + R_{0}^{\chi }}\). Therefore, if \(0 \le \frac{1}{R_{1}^{\chi } + R_{0}^{\chi }} \le \alpha _{max}^{\chi }\), the optimal solution becomes \((\alpha ^{\chi })^{*}= \frac{1}{R_{1}^{\chi } + R_{0}^{\chi }}\). For the case in which \(\frac{1}{R_{1}^{\chi } + R_{0}^{\chi }} \ge \alpha _{max}^{\chi }\), the optimal solution is obtained on the boundary where \((\alpha ^{\chi })^{*}= \alpha _{max}^{\chi }\). Overall, the optimal solution of the problem in Equation (8) can then be obtained in closed form as

\begin{equation} (\alpha ^{\chi })^{*}= {\left\lbrace \begin{array}{ll} \alpha ^{\chi }_{max}, & \text{if}\ R_{1}^{\chi } + R_{0}^{\chi } \lt \frac{1}{\alpha ^{\chi }_{max}}, \\ \frac{1}{R_{1}^{\chi } + R_{0}^{\chi }} , & \text{else}. \end{array}\right.} \end{equation}

(9)
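The closed form in Equation (9) can be sketched in a few lines of Python. The helper names below are hypothetical, the edge list is treated as undirected, both sensitive groups are assumed to be non-empty, and \(\alpha ^{\chi }_{max}\) is assumed to be strictly positive; none of these choices is prescribed by the original text.

```python
def inter_edge_ratios(edges, s):
    """Return (R0, R1): the fraction of nodes in each group with at least one inter-edge."""
    has_inter = [False] * len(s)
    for i, j in edges:
        if s[i] != s[j]:
            # Assumption: edges are undirected, so both endpoints are marked.
            has_inter[i] = True
            has_inter[j] = True
    ratios = []
    for group in (0, 1):
        members = [k for k, sk in enumerate(s) if sk == group]
        ratios.append(sum(has_inter[k] for k in members) / len(members))
    return tuple(ratios)


def optimal_inter_attention(r0, r1, alpha_max):
    """Closed-form solution of Equation (9); assumes 0 < alpha_max <= 1."""
    total = r0 + r1
    if total < 1.0 / alpha_max:
        return alpha_max          # unconstrained minimizer exceeds the cap
    return 1.0 / total            # bias-related term is driven to zero


# Toy graph, made up for illustration: nodes 0-2 have s=0, nodes 3-4 have s=1.
s = [0, 0, 0, 1, 1]
edges = [(0, 3), (1, 2), (3, 4)]
r0, r1 = inter_edge_ratios(edges, s)
alpha_star = optimal_inter_attention(r0, r1, alpha_max=0.8)
print(r0, r1, alpha_star)
```

In this toy graph only one inter-edge exists, so \(R_{0}^{\chi } + R_{1}^{\chi } < 1/\alpha ^{\chi }_{max}\) and the solution sits on the boundary \(\alpha ^{\chi }_{max}\). When every node has an inter-edge (\(R_{0}^{\chi } = R_{1}^{\chi } = 1\)), the same function returns 0.5 for any \(\alpha ^{\chi }_{max} \ge 0.5\).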

We design our fair attention layer such that the utilized attention coefficients satisfy the optimal amount of attention that should be assigned to the inter-edges, \((\alpha ^{\chi })^{*}\), presented in Equation (9). Overall, the fair attention design and information aggregation process in FairGAT can be summarized as

(1)

\(e\left(\boldsymbol {h}_i^{l}, \boldsymbol {h}^{l}_j\right)=\operatorname{LReLU}\left((\boldsymbol {a}^{l})^{\top } \cdot \left[\boldsymbol {W}^{l} \boldsymbol {h}^{l}_i \Vert \boldsymbol {W}^{l} \boldsymbol {h}^{l}_j\right]\right)\),

(2)

\begin{align} \alpha ^{l}_{i j}={\left\lbrace \begin{array}{ll} (\alpha ^{\chi })^{*} \frac{\exp \left(e\left(\boldsymbol {h}^{l}_i, \boldsymbol {h}^{l}_j\right)\right)}{\sum _{j^{\prime } \in \mathcal {N}_i \cap \mathcal {S}_{q}} \exp \left(e\left(\boldsymbol {h}^{l}_i, \boldsymbol {h}^{l}_{j^{\prime }}\right)\right)}, & \text{ if } v_i \in \mathcal {S}_{p}, v_j \in \mathcal {S}_{q} \text{ and } p \ne q \\ (\alpha ^{\omega })^{*} \frac{\exp \left(e\left(\boldsymbol {h}^{l}_i, \boldsymbol {h}^{l}_j\right)\right)}{\sum _{j^{\prime } \in \mathcal {N}_i \cap \mathcal {S}_{q}} \exp \left(e\left(\boldsymbol {h}^{l}_i, \boldsymbol {h}^{l}_{j^{\prime }}\right)\right)}, & \text{ if } v_i \in \mathcal {S}_{p}, v_j \in \mathcal {S}_{q} \text{ and } p = q \end{array}\right.} \end{align}

(10)

(3)

\(\boldsymbol {h}_i^{l+1}=\sigma \left(\sum _{j \in \mathcal {N}_i} \alpha ^{l}_{i j} \cdot \boldsymbol {W}^{l} \boldsymbol {h}^{l}_j\right)\).

Here, \(\mathbf {a}^{l}\), \(\mathbf {W}^{l}\) are learnable parameters at layer l, while LReLU stands for LeakyReLU [36].
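Step 2 above can be sketched for a single anchor node as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `fair_attention` is hypothetical, the unnormalized scores \(e(\boldsymbol {h}^{l}_i, \boldsymbol {h}^{l}_j)\) from step 1 are taken as given inputs, and the anchor is assumed to have at least one neighbor in each sensitive group (how nodes lacking one group are handled is not specified in this section).

```python
import math


def fair_attention(scores, neighbor_groups, anchor_group, alpha_inter):
    """Attention coefficients of Equation (10) for one anchor node.

    scores[j]          -- unnormalized score e(h_i, h_j) of the j-th neighbor
    neighbor_groups[j] -- sensitive attribute (0 or 1) of the j-th neighbor
    anchor_group       -- sensitive attribute of the anchor node
    alpha_inter        -- total attention budget for inter-edges, (alpha^chi)^*
    """
    # One softmax denominator per sensitive group of the neighbors.
    denom = {0: 0.0, 1: 0.0}
    for e, g in zip(scores, neighbor_groups):
        denom[g] += math.exp(e)

    alphas = []
    for e, g in zip(scores, neighbor_groups):
        # Inter-edges share alpha_inter; intra-edges share 1 - alpha_inter.
        budget = alpha_inter if g != anchor_group else 1.0 - alpha_inter
        alphas.append(budget * math.exp(e) / denom[g])
    return alphas


# Toy example: anchor in group 0 with one intra-neighbor and two inter-neighbors.
scores = [0.0, 0.0, 1.0]
groups = [0, 1, 1]
alphas = fair_attention(scores, groups, anchor_group=0, alpha_inter=0.6)
print(alphas, sum(alphas))
```

Each group receives its own softmax, so the inter-edge coefficients sum to \((\alpha ^{\chi })^{*}\) and the intra-edge coefficients to \((\alpha ^{\omega })^{*}\), while the relative weights within a group are still learned from the scores. A production version would typically subtract the per-group maximum score before exponentiating for numerical stability.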

Note that step 2 in the proposed fair attention design ensures the optimal values of \((\alpha ^{\chi })^{*}\) and \((\alpha ^{\omega })^{*}\) in Equation (9). Furthermore, the proposed scheme differs from conventional attention learning in the employment of individual softmax activation functions over different sensitive groups, which does not add significant additional computational complexity over the conventional scheme (see Table 7 in Appendix F for empirical runtime results). Therefore, fair attention learning proposed herein provides an efficient solution for bias mitigation while also enjoying the flexible non-uniform weights assigned to different neighbors, similar to conventional GATs.

Table 1.

	Pokec-z			Pokec-n			Recidivism
	Acc (\(\%\))	\(\Delta _{S P}\) (\(\%\))	\(\Delta _{E O}\) (\(\%\))	Acc (\(\%\))	\(\Delta _{S P}\) (\(\%\))	\(\Delta _{E O}\) (\(\%\))	Acc (\(\%\))	\(\Delta _{S P}\) (\(\%\))	\(\Delta _{E O}\) (\(\%\))
GAT	\(66.26 \pm 0.9\)	\(3.63 \pm 2.6\)	\(4.30 \pm 2.4\)	\(67.50 \pm 0.4\)	\(2.26 \pm 2.8\)	\(3.56 \pm 1.7\)	\(\mathbf {95.63} \pm 0.2\)	\(8.08 \pm 1.7\)	\(2.09 \pm 1.0\)
FairGNN	\(\mathbf {67.71} \pm 0.7\)	\(\mathbf {2.27} \pm 0.9\)	\(2.31 \pm 1.0\)	\(65.81 \pm 0.8\)	\(2.21 \pm 1.4\)	\(2.97 \pm 1.3\)	\(95.18 \pm 0.2\)	\(7.31 \pm 1.9\)	\(1.27 \pm 1.0\)
EDITS	\(63.89 \pm 0.7\)	\(3.27 \pm 2.0\)	\(2.93 \pm 2.2\)	\(63.47 \pm 0.9\)	\(2.01 \pm 1.5\)	\(2.48 \pm 2.3\)	\(88.52 \pm 0.6\)	\(\mathbf {6.59} \pm 2.1\)	\(1.73 \pm 1.1\)
NIFTY	\(66.59 \pm 0.8\)	\(4.21 \pm 1.4\)	\(4.19 \pm 2.7\)	\(\mathbf {68.41} \pm 1.5\)	\(1.41 \pm 0.7\)	\(2.30 \pm 1.8\)	\(88.52 \pm 2.3\)	\(6.74 \pm 2.4\)	\(1.48 \pm 1.9\)
FairGAT	\(66.29 \pm 0.6\)	\(2.55 \pm 0.5\)	\(\mathbf {1.63} \pm 0.9\)	\(67.81 \pm 1.1\)	\(\mathbf {0.71} \pm 0.7\)	\(\mathbf {1.23} \pm 0.6\)	\(94.93 \pm 0.1\)	\(7.39 \pm 1.8\)	\(\mathbf {1.02} \pm 0.05\)

Table 1. Comparative Results

Table 2.

	Pokec-z			Pokec-n			Recidivism
	Acc (\(\%\))	\(\Delta _{S P}\) (\(\%\))	\(\Delta _{E O}\) (\(\%\))	Acc (\(\%\))	\(\Delta _{S P}\) (\(\%\))	\(\Delta _{E O}\) (\(\%\))	Acc (\(\%\))	\(\Delta _{S P}\) (\(\%\))	\(\Delta _{E O}\) (\(\%\))
GAT	\(66.26 \pm 0.9\)	\(3.63 \pm 2.6\)	\(4.30 \pm 2.4\)	\(67.50 \pm 0.4\)	\(2.26 \pm 2.8\)	\(3.56 \pm 1.7\)	\(\mathbf {95.63} \pm 0.2\)	\(8.08 \pm 1.7\)	\(2.09 \pm 1.0\)
1& 2	\(66.84 \pm 0.3\)	\(3.93 \pm 2.8\)	\(4.85 \pm 2.7\)	\(64.60 \pm 0.7\)	\(4.37 \pm 1.7\)	\(4.29 \pm 3.3\)	\(89.95 \pm 0.4\)	\(\mathbf {7.06} \pm 2.2\)	\(1.81 \pm 2.1\)
1& 3	\(66.74 \pm 0.8\)	\(\mathbf {2.10} \pm 1.5\)	\(2.21 \pm 1.8\)	\(66.91 \pm 1.0\)	\(1.95 \pm 1.5\)	\(3.64 \pm 2.6\)	\(95.23 \pm 0.2\)	\(8.12 \pm 2.1\)	\(1.44 \pm 0.5\)
2& 3	\(\mathbf {67.78} \pm 0.6\)	\(5.59 \pm 1.1\)	\(5.69 \pm 1.5\)	\(\mathbf {67.59} \pm 0.8\)	\(1.57 \pm 1.1\)	\(2.27 \pm 1.0\)	\(95.38 \pm 0.2\)	\(7.94 \pm 1.7\)	\(1.71 \pm 1.0\)
FairGAT	\(66.29 \pm 0.6\)	\(2.55 \pm 0.5\)	\(\mathbf {1.63} \pm 0.9\)	\(67.14 \pm 1.1\)	\(\mathbf {0.90} \pm 0.6\)	\(\mathbf {1.68} \pm 1.3\)	\(94.93 \pm 0.1\)	\(7.39 \pm 1.8\)	\(\mathbf {1.02} \pm 0.1\)

Table 2. Ablation Study

Table 3.

Cora	Accuracy (\(\%\))	\(\Delta DP_{m}\) (\(\%\))	\(\Delta EO_{m}\) (\(\%\))	\(\Delta DP_{g}\) (\(\%\))	\(\Delta EO_{g}\) (\(\%\))	\(\Delta DP_{s}\) (\(\%\))	\(\Delta EO_{s}\) (\(\%\))
GAT	\(\mathbf {75.82} \pm 0.99\)	\(43.16 \pm 0.52\)	\(25.39 \pm 1.71\)	\(13.16 \pm 2.41\)	\(19.80 \pm 2.29\)	\(82.13 \pm 3.83\)	\(98.33 \pm 3.73\)
FairDrop	\(74.94 \pm 1.21\)	\(\mathbf {40.84} \pm 2.73\)	\(25.48 \pm 2.04\)	\(17.99 \pm 2.61\)	\(25.56 \pm 3.69\)	\(85.06 \pm 3.56\)	\(100.00 \pm 0.00\)
FairGAT	\(74.73 \pm 1.34\)	\(\mathbf {40.37} \pm 2.80\)	\(\mathbf {17.83} \pm 2.54\)	\(\mathbf {12.97} \pm 4.82\)	\(\mathbf {17.55} \pm 3.62\)	\(\mathbf {73.19} \pm 7.64\)	\(\mathbf {96.26} \pm 7.33\)
Citeseer	Accuracy (\(\%\))	\(\Delta DP_{m}\) (\(\%\))	\(\Delta EO_{m}\) (\(\%\))	\(\Delta DP_{g}\) (\(\%\))	\(\Delta EO_{g}\) (\(\%\))	\(\Delta DP_{s}\) (\(\%\))	\(\Delta EO_{s}\) (\(\%\))
GAT	\(68.49 \pm 1.15\)	\(28.01 \pm 2.60\)	\(16.39 \pm 2.70\)	\(24.64 \pm 8.60\)	\(35.65 \pm 13.65\)	\(62.62 \pm 8.00\)	\(73.81 \pm 13.05\)
FairDrop	\(65.98 \pm 1.57\)	\(\mathbf {23.80} \pm 2.87\)	\(12.50 \pm 4.26\)	\(32.05 \pm 5.74\)	\(47.32 \pm 7.72\)	\(65.37 \pm 3.87\)	\(73.81 \pm 5.82\)
FairGAT	\(\mathbf {69.06} \pm 0.91\)	\(25.50 \pm 2.24\)	\(\mathbf {8.67} \pm 2.28\)	\(\mathbf {24.28} \pm 5.56\)	\(\mathbf {31.26} \pm 5.03\)	\(\mathbf {55.14} \pm 6.48\)	\(\mathbf {60.58} \pm 6.60\)
PubMed	Accuracy (\(\%\))	\(\Delta DP_{m}\) (\(\%\))	\(\Delta EO_{m}\) (\(\%\))	\(\Delta DP_{g}\) (\(\%\))	\(\Delta EO_{g}\) (\(\%\))	\(\Delta DP_{s}\) (\(\%\))	\(\Delta EO_{s}\) (\(\%\))
GAT	\(\mathbf {75.82} \pm 0.50\)	\(34.61 \pm 0.67\)	\(17.39 \pm 0.98\)	\(6.84 \pm 0.89\)	\(6.75 \pm 1.34\)	\(41.71 \pm 0.82\)	\(31.17 \pm 1.34\)
FairDrop	\(75.09 \pm 0.45\)	\(34.04 \pm 0.97\)	\(17.99 \pm 0.95\)	\(\mathbf {5.96} \pm 1.08\)	\(\mathbf {6.38} \pm 1.84\)	\(41.89 \pm 1.42\)	\(36.08 \pm 2.29\)
FairGAT	\(75.39 \pm 0.87\)	\(\mathbf {28.68} \pm 1.31\)	\(\mathbf {7.92} \pm 0.80\)	\(7.10 \pm 0.66\)	\(9.38 \pm 0.75\)	\(\mathbf {35.20} \pm 2.13\)	\(\mathbf {17.87} \pm 2.24\)

Table 3. Comparative Results of FairGAT for Link Prediction

4.2.2 Spectral Normalization.

As shown in Equations (6) and (7), the disparity between the input representations, \(\delta _{h}^{l}\), is multiplied by \(\sigma _{max}(\mathbf {W}^{l})\) at every layer l. FairGAT ensures that this factor does not amplify the already existing bias by applying spectral normalization to \(\mathbf {W}^{l}\), i.e., \(\sigma _{max}(\operatorname{SN}(\mathbf {W}^{l}))=1\), where \(\operatorname{SN}(\cdot)\) denotes a spectral normalization operator that applies to the input matrix.

The largest singular value, also known as the spectral norm, of a matrix \(\mathbf {W} \in \mathbb {R}^{F_1 \times F_2}\) equals \(\sigma _{max}(\mathbf {W})=\max _{\boldsymbol {\xi } \in \mathbb {R}^{F_2}, \boldsymbol {\xi } \ne \mathbf {0}} \frac{\Vert \mathbf {W} \boldsymbol {\xi }\Vert _2}{\Vert \boldsymbol {\xi }\Vert _2}\). Consider the input–output relation \(\mathbf {\hat{y}}=\sigma \left(\mathbf {W} \mathbf {h}\right)\), where the activation \(\sigma\) is Lipschitz continuous with Lipschitz constant L. If a perturbation \(\boldsymbol {\xi }\) is applied to the input, i.e., \(\mathbf {\tilde{y}}=\sigma \left(\mathbf {W} (\mathbf {h} + \boldsymbol {\xi })\right)\), we have that

\begin{equation} \begin{split} \frac{\Vert \mathbf {\tilde{y}} - \mathbf {\hat{y}}\Vert _{2}}{\Vert \boldsymbol {\xi }\Vert _{2}} &= \frac{\Vert \sigma \left(\mathbf {W} (\mathbf {h} + \boldsymbol {\xi })\right) - \sigma \left(\mathbf {W} \mathbf {h}\right)\Vert _{2}}{\Vert \boldsymbol {\xi }\Vert _{2}}\\ &\le \frac{L\Vert \mathbf {W} (\mathbf {h} + \boldsymbol {\xi }) - \mathbf {W} \mathbf {h}\Vert _{2}}{\Vert \boldsymbol {\xi }\Vert _{2}}=\frac{L\Vert \mathbf {W} \boldsymbol {\xi }\Vert _{2}}{\Vert \boldsymbol {\xi }\Vert _{2}}\le L \sigma _{max}(\mathbf {W}). \end{split} \end{equation}

(11)

Therefore, although limiting the spectral norm of weight matrices is primarily utilized to prevent bias amplification in our work, it can also help to improve the robustness and generalizability of neural networks [52]. In particular, spectral normalization can help FairGAT be more robust when the training and test data distributions do not match.
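The bound in Equation (11) can be checked numerically. The following is a minimal sketch, assuming a ReLU activation (so that \(L=1\)) and randomly drawn \(\mathbf {W}\), \(\mathbf {h}\), and \(\boldsymbol {\xi }\); the dimensions and the perturbation scale are illustrative choices, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 10))           # maps R^10 -> R^6 (illustrative sizes)
h = rng.normal(size=10)                # input representation
xi = 0.1 * rng.normal(size=10)         # input perturbation


def relu(x):
    """ReLU is 1-Lipschitz, so L = 1 in Equation (11)."""
    return np.maximum(x, 0.0)


L = 1.0
y_hat = relu(W @ h)
y_tilde = relu(W @ (h + xi))

# Left-hand side of Equation (11): output change relative to input change.
ratio = np.linalg.norm(y_tilde - y_hat) / np.linalg.norm(xi)
# Right-hand side: L * sigma_max(W); ord=2 gives the spectral norm.
bound = L * np.linalg.norm(W, ord=2)

print(ratio <= bound)
```

The same check holds for any Lipschitz activation once `L` is set to its Lipschitz constant.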

Remark. It is important to note that, even though FairGAT applies spectral normalization to ensure \(\sigma _{max}(\mathbf {W}^{l})=1\) at every layer l, \(\sigma _{max}(\mathbf {W}^{l})\) can be further reduced to obtain a lower upper bound on \(\delta _{\hat{y}}\). For this purpose, a hyperparameter \(0 \le \kappa \le 1\) can be utilized to scale the normalized weight matrices, since scaling a matrix scales all of its singular values (hence, also the largest one), i.e., \(\kappa \mathbf {W} = \sum _{i=1}^r \kappa \sigma _i \mathbf {u}_i \mathbf {v}_i^*\). Note that the upper bound on \(\delta _{\hat{y}}\) is minimized when \(\sigma _{max}(\mathbf {W})=0\), which is obtained by setting \(\kappa =0\). However, such scaling would also prevent any learning, as the model would output the same prediction for all nodes, providing perfect group fairness yet completely disregarding utility. Thus, introducing \(\kappa\) allows a trade-off between fairness and utility, which can potentially improve performance. However, we did not introduce such a hyperparameter in order to limit the hyperparameter-tuning effort.
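As an illustration of the normalization and of the optional scaling discussed in the remark, the sketch below assumes that \(\operatorname{SN}(\mathbf {W})=\mathbf {W}/\sigma _{max}(\mathbf {W})\), with the spectral norm computed exactly via NumPy. The function name `spectral_normalize` and the value of `kappa` are hypothetical; how the operator is implemented in the FairGAT code is not specified here.

```python
import numpy as np


def spectral_normalize(W):
    """Divide W by its largest singular value so that sigma_max(SN(W)) = 1."""
    sigma_max = np.linalg.norm(W, ord=2)   # spectral norm = largest singular value
    return W / sigma_max


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))               # illustrative weight matrix

W_sn = spectral_normalize(W)
kappa = 0.5                                # hypothetical scaling factor in [0, 1]

# Singular values in descending order, before and after scaling by kappa.
s_sn = np.linalg.svd(W_sn, compute_uv=False)
s_scaled = np.linalg.svd(kappa * W_sn, compute_uv=False)

print(s_sn[0])       # largest singular value after normalization
print(s_scaled[0])   # largest singular value after additional scaling
```

In practice, deep learning libraries often estimate \(\sigma _{max}\) with a few power-iteration steps instead of a full decomposition; the exact computation is used here only to keep the example short.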

4.2.3 Scaling Representations.

Both Theorem 4.1 and Lemma 4.2 demonstrate that the maximal deviation of the aggregated representations, \(\Delta _{z}\), influences the disparity between the output representations of different sensitive groups. Furthermore, Theorem 4.1 suggests that the maximal deviation \(\Delta _{c}\) is another factor in the disparity of attention layers. Motivated by these findings, FairGAT scales \(\mathbf {Z}^{l+1}\) and \(\mathbf {C}^{l+1}\) by a factor \(\eta\) at every layer l, which also scales \(\Delta _{z}\) and \(\Delta _{c}\) by the same factor, i.e., \(\Vert \eta \mathbf {z}^{l+1}_j-\eta \bar{\mathbf {z}}_{s}^{l+1}\Vert _{\infty } = \eta \Vert \mathbf {z}^{l+1}_j- \bar{\mathbf {z}}_{s}^{l+1}\Vert _{\infty }\le \eta (\Delta _{z}^{(s)})^{l+1}\), \(\forall v_{j} \in \mathcal {S}_{s}\), where the superscript \(l+1\) indexes the layer. Here, \(\eta\) is a hyperparameter utilized to provide a trade-off between fairness and utility.
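The effect of the scaling on the maximal deviation can be sketched as follows, assuming random aggregated representations, a random binary sensitive attribute, and an illustrative value of \(\eta\); the helper `max_deviation` is a hypothetical name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 8))        # aggregated representations, one row per node
s = rng.integers(0, 2, size=50)     # binary sensitive attribute
s[:2] = [0, 1]                      # make sure both groups are non-empty
eta = 0.5                           # illustrative scaling factor


def max_deviation(Z, s):
    """Largest infinity-norm deviation of a node from its own group mean."""
    return max(
        np.abs(Z[s == g] - Z[s == g].mean(axis=0)).max() for g in (0, 1)
    )


delta_before = max_deviation(Z, s)
delta_after = max_deviation(eta * Z, s)

print(delta_after / delta_before)   # ratio of the two maximal deviations
```

Because the group means scale together with the representations, the deviation shrinks by exactly the factor \(\eta\) for any nonnegative \(\eta\).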

5 Experiments

5.1 Datasets and Experimental Setup

Datasets. The performance of the proposed FairGAT framework is evaluated on node classification over real-world social networks Pokec-z and Pokec-n [10] and the Recidivism graph [25]. Pokec-z and Pokec-n are sampled from an anonymized version of the Pokec network of 2012 (a social network from Slovakia), where nodes correspond to users who live in two major regions and the region information is utilized as the sensitive attribute [10]. The working field of the users is binarized and utilized as the labels to be predicted in node classification. For building the Recidivism graph, the information of defendants (corresponding to nodes) who were released on bail by U.S. state courts during 1990–2009 [25] is used, where the edges are created based on the affinity of past criminal records and demographics. Ethnicity of the defendants is used as the sensitive attribute for this graph, and the node classification task classifies defendants into bail or no bail [1].

Although we focused on node classification when building FairGAT in Section 4, we also consider link prediction as an alternative task for evaluation. For link prediction, experimental results are obtained over real-world citation networks Cora, Citeseer, and PubMed. These citation networks consider articles as nodes and descriptions of articles as their nodal attributes. In these datasets, similar to the setups in [34, 45], the category of the articles is used as the sensitive attribute for link prediction. Statistical information for all datasets is presented in Appendix D.

Evaluation Metrics. Classification accuracy is used to measure the utility for node classification. As fairness metrics, two quantitative measures of group fairness are used: statistical parity [14], \(\Delta _{S P}:=|P(\hat{c}_{j}=1 \mid s_{j}=0)-P(\hat{c}_{j}=1 \mid s_{j}=1)|\), and equal opportunity [22], \(\Delta _{E O}:=|P(\hat{c}_{j}=1 \mid y_{j}=1, s_{j}=0)-P(\hat{c}_{j}=1 \mid y_{j}=1, s_{j}=1)|\), where y is the ground truth label, and \(\hat{c}\) stands for the predicted binary class label. For both metrics, lower values are desired.
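The two node-classification fairness metrics can be computed directly from their definitions. The sketch below uses a small hand-made example; the function names and the toy arrays are assumptions made for illustration, and the empirical group frequencies stand in for the probabilities.

```python
import numpy as np


def statistical_parity(c_hat, s):
    """|P(c_hat = 1 | s = 0) - P(c_hat = 1 | s = 1)| from empirical frequencies."""
    return abs(c_hat[s == 0].mean() - c_hat[s == 1].mean())


def equal_opportunity(c_hat, y, s):
    """Same gap as above, restricted to nodes whose true label is y = 1."""
    pos0 = (y == 1) & (s == 0)
    pos1 = (y == 1) & (s == 1)
    return abs(c_hat[pos0].mean() - c_hat[pos1].mean())


# Toy data: predicted labels, ground-truth labels, sensitive attributes.
c_hat = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y = np.array([1, 0, 1, 0, 1, 0, 1, 1])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])

d_sp = statistical_parity(c_hat, s)     # |0.75 - 0.25|
d_eo = equal_opportunity(c_hat, y, s)   # |1 - 1/3|
print(d_sp, d_eo)
```

The values reported in the tables are these quantities expressed as percentages.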

For link prediction experiments, accuracy is again employed as the utility metric. For fairness evaluation, \(\Delta DP_{m}\), \(\Delta EO_{m}\), \(\Delta DP_{g}\), \(\Delta EO_{g}\), \(\Delta DP_{s}\), and \(\Delta EO_{s}\) that are introduced in [45] are utilized. These metrics measure the demographic parity difference and equalized odds difference among multiple sensitive groups, where \(\Delta DP=\max _s E[\hat{Y}_{j} \mid e_{j} \in \mathcal {S}_{s}]-\min _s E[\hat{Y}_{j} \mid e_{j} \in \mathcal {S}_{s}]\) and \(\Delta EO=\max (\Delta \mathrm{TPR}, \Delta \mathrm{FPR})\) for \(\Delta \mathrm{TPR}:= \max _s E[\hat{Y}_{j}=1 \mid e_{j} \in \mathcal {S}_{s}, Y_{j}=1] -\min _s E[\hat{Y}_{j}=1 \mid e_{j} \in \mathcal {S}_{s}, Y_{j}=1]\) and \(\Delta \mathrm{FPR}:= \max _s E[\hat{Y}_{j}=1 \mid e_{j} \in \mathcal {S}_{s}, Y_{j}=0] -\min _s E[\hat{Y}_{j}=1 \mid e_{j} \in \mathcal {S}_{s}, Y_{j}=0]\). Here, \(\hat{Y}_{j}\) is the prediction for the existence of the edge \(e_{j}\) and \(Y_{j}\) is the ground-truth label for whether \(e_{j}\) exists in the input graph (\(Y_{j}=1\)) or not (\(Y_{j}=0\)). In the fairness measures, different subindices correspond to different sensitive group definitions. Specifically, subscripts m, g, and s represent the sensitive groups: mixed dyadic, group dyadic, and subgroup dyadic defined in [37, 45], respectively.
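A minimal sketch of the multi-group \(\Delta DP\) and \(\Delta EO\) computations is given below. It assumes that every candidate edge already carries an integer group id; how edges are assigned to mixed, group, or subgroup dyadic groups follows [37, 45] and is not reproduced here. The function names and the toy arrays are illustrative.

```python
import numpy as np


def delta_dp(y_hat, groups):
    """Max minus min positive-prediction rate over the edge groups."""
    rates = [y_hat[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)


def delta_eo(y_hat, y, groups):
    """max(delta TPR, delta FPR) over the edge groups."""
    def gap(label):
        rates = [
            y_hat[(groups == g) & (y == label)].mean()
            for g in np.unique(groups)
        ]
        return max(rates) - min(rates)

    return max(gap(1), gap(0))   # label 1 -> TPR gap, label 0 -> FPR gap


# Toy data: binary edge predictions, ground truth, and edge group ids.
y_hat = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0])
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

dp = delta_dp(y_hat, groups)
eo = delta_eo(y_hat, y, groups)
print(dp, eo)
```

The sketch assumes every group contains both existing and non-existing edges; otherwise the corresponding rate is undefined and would need separate handling.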

Implementation details. In the experiments, a network that consists of two attention layers (conventional GAT layers for the baselines) and one fully connected layer is trained in a supervised manner for node classification, where each attention layer is followed by a ReLU activation. This structure is kept the same for all baselines as well as FairGAT for a fair performance comparison. The model is trained over \(40\%\) of the nodes, while the remaining nodes are equally divided into validation and test sets. The test-set performance of the model that performs the best on the validation set is reported. For all experiments, results are collected for five random data splits.
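The data split described above can be sketched as follows, assuming a uniformly random permutation of node indices per seed. The function name `split_nodes`, the seeds, and the node count are hypothetical; the actual splitting code may differ.

```python
import numpy as np


def split_nodes(num_nodes, train_frac=0.4, seed=0):
    """Random split: 40% train, remainder halved into validation and test."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    n_train = int(train_frac * num_nodes)
    n_val = (num_nodes - n_train) // 2
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]
    return train, val, test


# Five random splits, one per seed (illustrative node count).
splits = [split_nodes(1000, seed=k) for k in range(5)]
for train, val, test in splits:
    print(len(train), len(val), len(test))
```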

For link prediction, the experimental setting in [45] is kept the same, where a two-layer attention network (consisting of conventional GAT layers for the baseline) is trained for supervised link prediction. The hyperparameters of FairGAT and all other baselines are tuned via a grid search on cross-validation sets; see Appendix E for the utilized hyperparameter values in all experiments.

Baselines. Herein, we present the performances of three fairness-aware baseline studies for node classification: FairGNN [10], EDITS [13], and NIFTY [1]. For improving fairness in a supervised setting, FairGNN [10] employs adversarial debiasing and a covariance-based regularizer together. EDITS [13] creates debiased versions of the nodal attributes and the graph structure, which are then input to the GAT network for node classification for the results in this work. Finally, NIFTY [1] employs a layer-wise weight normalization scheme along with a fair graph augmentation. For link prediction, FairDrop [45] is utilized as the fairness-aware baseline, which applies an edge manipulation on the input graph to mitigate structural bias.

5.2 Results for Node Classification

The results of node classification are presented in Table 1 in terms of fairness and utility metrics for both FairGAT and baselines. In Table 1, “GAT” represents the natural baseline where the conventional GAT layers [47] are employed for node classification without any spectral normalization (step 2 in FairGAT algorithm) or representation scaling (step 3 in FairGAT algorithm).

The results in Table 1 demonstrate that FairGAT significantly improves upon the naïve baseline, GAT, in terms of fairness metrics while yielding similar utility. Specifically, FairGAT achieves \(30\%\) to \(60\%\) improvement in all fairness measures for every dataset compared with GAT, except for \(\Delta _{SP}\) for Recidivism. Furthermore, FairGAT consistently outperforms every fairness-aware baseline in terms of \(\Delta _{EO}\) on all datasets. While FairGAT also leads to the best \(\Delta _{SP}\) value on Pokec-n, FairGNN [10] results in a better performance in terms of \(\Delta _{SP}\) on Pokec-z. However, the fairness improvements provided by FairGNN are observed to vary over different datasets (e.g., it is the worst-performing fairness-aware baseline on Pokec-n), which can be explained by the instability issues related to adversarial training [29]. On the Recidivism graph, all other fairness-aware baselines result in better or similar \(\Delta _{SP}\) values compared with FairGAT. However, for NIFTY [1] and EDITS [13], the superior \(\Delta _{SP}\) performance on the Recidivism dataset is accompanied by a considerable decrease in classification accuracy. Furthermore, it can be observed that FairGAT leads to the lowest standard deviation values for fairness measures for all datasets. Hence, it provides better robustness in terms of fairness. Overall, the results demonstrate that FairGAT generally improves the fairness measures and consistently achieves better stability in terms of fairness compared with other state-of-the-art fairness-aware baselines while providing similar utility to the conventional GAT network.

An ablation study is also provided in Table 2 in order to demonstrate the influences of different steps in Algorithm 1. In Table 2, “1” stands for the employment of fair attention layers, as described in Equation (10). Moreover, “2” and “3” represent spectral normalization of weight matrices and hidden representation scaling that are detailed in Section 4.2, respectively. Overall, the ablation study signifies that FairGAT typically achieves the best fairness measures, together with similar or better utility, compared with a framework that lacks one of the steps in Algorithm 1. Therefore, the ablation study suggests that all steps in FairGAT are essential for the success and robustness of the algorithm.

In order to demonstrate the time complexity incurred by the proposed framework, Table 7 in Appendix F presents the average runtime of each epoch for FairGAT and the baselines. Overall, the results confirm our claim that the proposed fair attention design does not add significant computational complexity over the conventional GAT layers. Furthermore, the results also demonstrate that FairGAT can provide a more computationally efficient solution to combat bias compared with other fairness-aware baselines.

5.3 Results for Link Prediction

Although our analysis is developed for node classification and for a binary sensitive attribute, we also obtain experimental results for link prediction. For this task, the utilized datasets Cora, Citeseer, and PubMed contain non-binary sensitive attributes for which we still employ the fair attention described in Equation (10) by directly tuning \((\alpha ^{\chi })^{*}\) as a hyperparameter. The results in Table 3 demonstrate that FairGAT typically outperforms FairDrop [45] in terms of fairness measures while providing similar or better utility. Overall, the experimental results signify that FairGAT also shows promising results for the link prediction task and non-binary sensitive attributes.

6 Conclusion and Future Work

This study presents a fairness-aware graph-based learning framework, FairGAT, which leverages a novel attention learning strategy that can mitigate bias. The design of the proposed scheme is based on a theoretical analysis that illuminates the sources of bias in a GAT-based neural network trained for node classification. The fair attention design in FairGAT incurs negligible additional computational complexity compared with the conventional GAT layer, and it can be flexibly employed with other fairness enhancement strategies. Experiments on real-world networks for node classification demonstrate that FairGAT typically provides better fairness measures together with similar utility compared with the state-of-the-art fairness-aware baselines. Furthermore, our link prediction results show the promising fairness performance of FairGAT for link prediction and non-binary sensitive attributes as well.

To address the limitations of FairGAT in terms of its applicability to broader settings, our primary future directions include (i) extension of the present analysis to multiple, non-binary sensitive attributes; and (ii) the consideration of different aggregation schemes in bias analysis.

Supplementary Material

tkdd-2023-09-0605-File002 (tkdd-2023-09-0605-file002.zip)

A Lemma A.1 and Its Proof

Lemma A.1.

The disparity between the aggregated representations \(\mathbf {Z} \in \mathbb {R}^{N \times F}\) from different sensitive groups is related to the disparity between the hidden representations \(\mathbf {H}:= \sigma (\mathbf {Z}) \in \mathbb {R}^{N \times F}\) from different sensitive groups as follows:

\begin{equation} \begin{split}\delta _{h}&:=\left\Vert \operatorname{mean}(\mathbf {h}_{j} \mid s_{j}=0) - \operatorname{mean}(\mathbf {h}_{j} \mid s_{j}=1)\right\Vert _{2}\\ &= \left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {h}_{j}\right\Vert _{2}\\ &\le L\left(\left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {z}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {z}_{j}\right\Vert _{2} + 2 \sqrt {N} \Delta _{z} \right). \end{split} \end{equation}

(12)

Here, L is the Lipschitz constant of nonlinear activation \(\sigma\), and maximal deviation \(\Delta _{z}\) is defined to be \(\Delta _{z}:= \operatorname{max}(\Delta ^{(0)}_{z}, \Delta ^{(1)}_{z})\), where \(\Vert \mathbf {z}_j-\bar{\mathbf {z}}_{s}\Vert _{\infty } \le \Delta ^{(s)}_{z}\), \(\forall v_j \in \mathcal {S}_{s}\) with \(\bar{\mathbf {z}}_{s} = \frac{1}{|\mathcal {S}_{s}|} \sum _{v_j \in \mathcal {S}_{s}} \mathbf {z}_{j}\) for \(s=0,1\).

Proof of Lemma A.1: Note that as this analysis applies to every layer in the same way, we drop the superscript l used to denote the layer. The disparity between the representations from different sensitive groups follows as

\begin{equation} \left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {h}_{j}\right\Vert _{2}= \left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \sigma (\mathbf {z}_{j})-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}}\sigma (\mathbf {z}_{j})\right\Vert _{2} . \end{equation}

(13)

We can write \(\mathbf {z}_{j}= \bar{\mathbf {z}}_{s} + \boldsymbol {\delta }^{(s)}_{j}\), \(\forall v_j \in \mathcal {S}_{s}\), where \(\bar{\mathbf {z}}_{s} = \frac{1}{|\mathcal {S}_{s}|} \sum _{v_j \in \mathcal {S}_{s}} \mathbf {z}_{j}\) for \(s=0,1\). If the activation function \(\sigma (\cdot)\) is Lipschitz continuous with Lipschitz constant L (which holds for several nonlinear activations, such as the rectified linear unit (ReLU) and the sigmoid), the following holds:

\begin{equation} \begin{split}\operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - L|\delta ^{(0)}_{j,i}| \le \operatorname{\sigma }(z_{j,i})&= \operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i} + \delta ^{(0)}_{j,i})\\ &\le \operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) + L |\delta ^{(0)}_{j,i}|, \forall i=1, \ldots , F, \forall v_j \in \mathcal {S}_{0} \\ \operatorname{\sigma }(\bar{\mathbf {z}}_{0}) - L|\boldsymbol {\delta }^{(0)}_{j}| \preccurlyeq \operatorname{\sigma }(\mathbf {z}_{j})&= \operatorname{\sigma }(\bar{\mathbf {z}}_{0} + \boldsymbol {\delta }^{(0)}_{j})\\ &\preccurlyeq \operatorname{\sigma }(\bar{\mathbf {z}}_{0}) + L|\boldsymbol {\delta }^{(0)}_{j}|, \forall v_j \in \mathcal {S}_{0} \end{split} , \end{equation}

(14)

where \(|.|\) takes the element-wise absolute value of the input. The same inequalities can also be written for \(\mathcal {S}_{1}\):

\begin{equation} \begin{split} \operatorname{\sigma }(\bar{\mathbf {z}}_{1}) - L|\boldsymbol {\delta }^{(1)}_{j}| \preccurlyeq \operatorname{\sigma }(\mathbf {z}_{j})= \operatorname{\sigma }(\bar{\mathbf {z}}_{1} + \boldsymbol {\delta }^{(1)}_{j}) \preccurlyeq \operatorname{\sigma }(\bar{\mathbf {z}}_{1}) + L|\boldsymbol {\delta }^{(1)}_{j}|,\\ \forall v_j \in \mathcal {S}_{1}. \end{split} \end{equation}

(15)

Based on Equations (13), (14), and (15), the following holds:

\begin{equation} \begin{split}\frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} &\left(\operatorname{\sigma }(\bar{\mathbf {z}}_{0}) - L|\boldsymbol {\delta }^{(0)}_{j}| \right) - \frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \left(\operatorname{\sigma }(\bar{\mathbf {z}}_{1}) + L|\boldsymbol {\delta }^{(1)}_{j}| \right) \preccurlyeq \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {h}_{j}\\ &\preccurlyeq \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \left(\operatorname{\sigma }(\bar{\mathbf {z}}_{0}) + L|\boldsymbol {\delta }^{(0)}_{j}|\right) - \frac{1}{|\mathcal {S}_{1}|} \sum _{v_j\in \mathcal {S}_{1}} \left(\operatorname{\sigma }(\bar{\mathbf {z}}_{1}) - L|\boldsymbol {\delta }^{(1)}_{j}| \right) \end{split} \end{equation}

(16)

\begin{equation} \begin{split}\operatorname{\sigma }(\bar{\mathbf {z}}_{0}) &- \operatorname{\sigma }(\bar{\mathbf {z}}_{1}) - \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} L|\boldsymbol {\delta }^{(0)}_{j}| - \frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} L|\boldsymbol {\delta }^{(1)}_{j}| \preccurlyeq \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {h}_{j}\\ &\preccurlyeq \operatorname{\sigma }(\bar{\mathbf {z}}_{0}) - \operatorname{\sigma }(\bar{\mathbf {z}}_{1}) + \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} L|\boldsymbol {\delta }^{(0)}_{j}| + \frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} L|\boldsymbol {\delta }^{(1)}_{j}| \end{split} . \end{equation}

(17)

Define \(\mathbf {a}:=\operatorname{\sigma }(\bar{\mathbf {z}}_{0}) - \operatorname{\sigma }(\bar{\mathbf {z}}_{1}) - \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} L|\boldsymbol {\delta }^{(0)}_{j}| - \frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} L|\boldsymbol {\delta }^{(1)}_{j}|\) and \(\mathbf {b}:= \operatorname{\sigma }(\bar{\mathbf {z}}_{0}) - \operatorname{\sigma }(\bar{\mathbf {z}}_{1}) + \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} L|\boldsymbol {\delta }^{(0)}_{j}| + \frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} L|\boldsymbol {\delta }^{(1)}_{j}|\). Let \(\bar{\mathbf {h}}_{s}\) denote \(\frac{1}{|\mathcal {S}_{s}|} \sum _{v_j \in \mathcal {S}_{s}} \mathbf {h}_{j}\) for \(s=0,1\). Then, Equation (17) leads to

\begin{equation} |(\bar{\mathbf {h}}_{0})_{i}-(\bar{\mathbf {h}}_{1})_{i}| \le \operatorname{max}(|a_{i}|,|b_{i}|), \forall i=1,\ldots , F. \end{equation}

(18)

First, consider the case \(|a_{i}| \ge |b_{i}|\). Then,

\begin{equation} \begin{split} |(\bar{\mathbf {h}}_{0})_{i}-(\bar{\mathbf {h}}_{1})_{i}| &\le \left|\operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) - \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} L|\delta ^{(0)}_{j,i}| - \frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} L|\delta ^{(1)}_{j,i}|\right|\\ &\le \left|\operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i})\right| + \left|\frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} L|\delta ^{(0)}_{j,i}| \right| + \left|\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} L|\delta ^{(1)}_{j,i}| \right|\\ &\le \left|\operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) \right| + L| \Delta ^{(0)} | +L | \Delta ^{(1)} |, \end{split} \end{equation}

(19)

where \(\Delta ^{(0)}_{i}:= \max _{j}|\delta ^{(0)}_{j,i}|\), \(\Delta ^{(1)}_{i}:= \max _{j}|\delta ^{(1)}_{j,i}|\) and \(\Delta ^{(0)}:=\Vert \boldsymbol {\Delta }^{(0)}\Vert _{\infty }\) and \(\Delta ^{(1)}:=\Vert \boldsymbol {\Delta }^{(1)}\Vert _{\infty }\).

Consider the term \(\operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i})\):

\begin{equation} \operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) = \operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i} + (\bar{\mathbf {z}}_{1})_{i} - (\bar{\mathbf {z}}_{1})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}). \end{equation}

(20)

Utilizing Equations (14) and (15), \(\operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i} + (\bar{\mathbf {z}}_{1})_{i} - (\bar{\mathbf {z}}_{1})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i})\) can be bounded from below and above:

\begin{equation} \begin{split}\operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) - L|(\bar{\mathbf {z}}_{0})_{i} - (\bar{\mathbf {z}}_{1})_{i}| - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) &\le \operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i} + (\bar{\mathbf {z}}_{1})_{i} - (\bar{\mathbf {z}}_{1})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) \\ &\le \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) + L|(\bar{\mathbf {z}}_{0})_{i} - (\bar{\mathbf {z}}_{1})_{i}| - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) \end{split} \end{equation}

(21)

\begin{equation} \begin{split}- L|(\bar{\mathbf {z}}_{0})_{i} - (\bar{\mathbf {z}}_{1})_{i}| \le \operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i} + (\bar{\mathbf {z}}_{1})_{i} &- (\bar{\mathbf {z}}_{1})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) \le L|(\bar{\mathbf {z}}_{0})_{i} - (\bar{\mathbf {z}}_{1})_{i}| \end{split} \end{equation}

(22)

\begin{equation} \left| \operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) \right| \le L|(\bar{\mathbf {z}}_{0})_{i} - (\bar{\mathbf {z}}_{1})_{i}|. \end{equation}

(23)

Therefore,

\begin{equation} |(\bar{\mathbf {h}}_{0})_{i}-(\bar{\mathbf {h}}_{1})_{i}| \le L|(\bar{\mathbf {z}}_{0})_{i} - (\bar{\mathbf {z}}_{1})_{i}| + L\left| \Delta ^{(0)} \right| + L\left| \Delta ^{(1)} \right|, \forall i \text{ such that } |a_{i}| \ge |b_{i}|. \end{equation}

(24)

The next step is to consider the case \(|a_{i}| \lt |b_{i}|\). For this case, the following inequalities hold:

\begin{equation} \begin{split} |(\bar{\mathbf {h}}_{0})_{i}-(\bar{\mathbf {h}}_{1})_{i}| &\le \left|\operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) + \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} L|\delta ^{(0)}_{j,i}| + \frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} L|\delta ^{(1)}_{j,i}|\right|\\ &\le \left|\operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i})\right| + \left|\frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} L|\delta ^{(0)}_{j,i}| \right| + \left|\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} L|\delta ^{(1)}_{j,i}| \right|\\ &\le \left|\operatorname{\sigma }((\bar{\mathbf {z}}_{0})_{i}) - \operatorname{\sigma }((\bar{\mathbf {z}}_{1})_{i}) \right| + L\left| \Delta ^{(0)} \right| +L \left| \Delta ^{(1)} \right|\\ & \le L|(\bar{\mathbf {z}}_{0})_{i} - (\bar{\mathbf {z}}_{1})_{i}| + L\left| \Delta ^{(0)} \right| + L\left| \Delta ^{(1)} \right|, \forall i \text{ such that } |a_{i}| \lt |b_{i}|. \end{split} \end{equation}

(25)

Combining Equations (24) and (25) and defining \(\Delta _{z}:=\operatorname{max}(\Delta ^{(0)}, \Delta ^{(1)})\), the following inequality can be written:

\begin{equation} |(\bar{\mathbf {h}}_{0})_{i}-(\bar{\mathbf {h}}_{1})_{i}| \le L|(\bar{\mathbf {z}}_{0})_{i} - (\bar{\mathbf {z}}_{1})_{i}| + 2L\left| \Delta _{z} \right| , \forall i=1, \dots , F, \end{equation}

(26)

which concludes as follows:

\begin{equation} \Vert \bar{\mathbf {h}}_{0}-\bar{\mathbf {h}}_{1}\Vert _2 \le L\left(\Vert \mathbf {\bar{z}}_{0} - \mathbf {\bar{z}}_{1}\Vert _{2} + 2 \sqrt {N} \Delta _{z} \right). \end{equation}

(27)
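The inequality in Equation (27) can be sanity-checked numerically. The sketch below assumes a ReLU activation (\(L=1\)), randomly drawn aggregated representations, and a random binary sensitive attribute with \(N \ge F\); all sizes are illustrative and unrelated to the datasets used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 200, 8                        # nodes and feature dimension (illustrative)
Z = rng.normal(size=(N, F))          # aggregated representations
s = rng.integers(0, 2, size=N)       # binary sensitive attribute
s[:2] = [0, 1]                       # make sure both groups are non-empty

H = np.maximum(Z, 0.0)               # H = sigma(Z) with ReLU, so L = 1
L = 1.0

z_bar = [Z[s == g].mean(axis=0) for g in (0, 1)]
h_bar = [H[s == g].mean(axis=0) for g in (0, 1)]

# Maximal infinity-norm deviation from the group mean, over both groups.
delta_z = max(np.abs(Z[s == g] - z_bar[g]).max() for g in (0, 1))

lhs = np.linalg.norm(h_bar[0] - h_bar[1])
rhs = L * (np.linalg.norm(z_bar[0] - z_bar[1]) + 2 * np.sqrt(N) * delta_z)

print(lhs <= rhs)
```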

B Proof of Theorem 4.1

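Here, without loss of generality, we consider the lth GAT layer, where the input representations are denoted by \(\mathbf {H}^{l}\) and the output representations by \(\mathbf {H}^{l+1}\). The disparity between the output representations follows:

\begin{equation} \delta _{h}^{l+1}:=\left\Vert \operatorname{mean}(\mathbf {h}^{l+1}_{j} \mid s_{j}=0) - \operatorname{mean}(\mathbf {h}^{l+1}_{j} \mid s_{j}=1)\right\Vert _{2} = \left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}^{l+1}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {h}^{l+1}_{j}\right\Vert _{2}. \end{equation}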

(28)

Let us redefine the aggregated representation of node i at the lth GAT layer as \({\mathbf {z}}^{l+1}_{i}= \sum _{v_j \in \mathcal {N}_{i}} \alpha ^{l}_{i j} {\mathbf {c}}^{l+1}_{j}\) for GATs, where \({\mathbf {c}}^{l+1}_{i}=\mathbf {W}^{l} {\mathbf {h}}^{l}_{i}\). Lemma A.1 shows that the deviation between the output representations can be upper bounded by the following term:

\begin{equation} \left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}^{l+1}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {h}^{l+1}_{j}\right\Vert _{2} \le L\left(\left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {z}^{l+1}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {z}^{l+1}_{j}\right\Vert _{2} + 2 \sqrt {N} \Delta ^{l+1}_{z} \right), \end{equation}

(29)

where \(\Delta ^{l+1}_{z}\) is the maximal deviation of aggregated representations \(\mathbf {Z}^{l+1}\) at the lth GAT layer. Based on this upper bound, the term \(\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {z}^{l+1}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {z}^{l+1}_{j}\Vert _{2}\) will be analyzed. We first consider the terms \(\frac{1}{|\mathcal {S}_{1}|}\sum _{v_{j} \in S_{1}} \mathbf {z}^{l+1}_{j}\) and \(\frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} \mathbf {z}^{l+1}_{j}\) individually. Based on assumption A1 in the main text, the following can be written:

\begin{equation} \begin{split} \frac{1}{|\mathcal {S}_{1}|}\sum _{v_{j} \in S_{1}} \mathbf {z}^{l+1}_{j} &\in \frac{1}{|\mathcal {S}_{1}|}\left(\sum _{v_{k} \in S^{\chi }_{1}}\left(\sum _{a \in \mathcal {N}(k) \cap S_0} \alpha ^{l}_{k a} \bar{\mathbf {c}}_{0}^{l+1} + \sum _{b \in \mathcal {N}(k) \cap S_1} \alpha ^{l}_{k b} \bar{\mathbf {c}}_{1}^{l+1} \right)\right.\\ &\left.+ \sum _{v_{n} \in S^{\omega }_{1}} \sum _{o \in \mathcal {N}(n) \cap S_1} \alpha ^{l}_{n o} \bar{\mathbf {c}}_{1}^{l+1}\right) \pm \Delta ^{l+1}_{c}\boldsymbol {1}, \\ &\in \frac{1}{|\mathcal {S}_{1}|}\left(\sum _{v_{k} \in S^{\chi }_{1}}(\alpha ^{\chi } \bar{\mathbf {c}}_{0}^{l+1} + \alpha ^{\omega } \bar{\mathbf {c}}_{1}^{l+1}) + \sum _{v_{n} \in S^{\omega }_{1}}\bar{\mathbf {c}}_{1}^{l+1}\right) \pm \Delta ^{l+1}_{c}\boldsymbol {1} , \\ &\in \frac{|S^{\chi }_{1}|}{|\mathcal {S}_{1}|} (\alpha ^{\chi } \bar{\mathbf {c}}_{0}^{l+1} + \alpha ^{\omega } \bar{\mathbf {c}}_{1}^{l+1}) + \frac{|S^{\omega }_{1}|}{|\mathcal {S}_{1}|} \bar{\mathbf {c}}_{1}^{l+1} \pm \Delta ^{l+1}_{c}\boldsymbol {1}, \end{split} \end{equation}

(30)

where \(\alpha ^{\chi }=\sum _{a \in \mathcal {N}(k) \cap S_0} \alpha ^{l}_{k a}\) and \(\alpha ^{\omega }=\sum _{b \in \mathcal {N}(k) \cap S_1} \alpha ^{l}_{k b}\) for a node \(v_{k} \in \mathcal {S}_{1}\) based on assumption A2 in the main text with \(\alpha ^{\chi }+\alpha ^{\omega }=1\), and \(\boldsymbol {1} \in \mathbb {R}^{F}\) is a vector with all elements being equal to 1. Define \(R_{1}^{\chi }:=\frac{|S^{\chi }_{1}|}{|\mathcal {S}_{1}|}\) and \(R_{0}^{\chi }:=\frac{|S^{\chi }_{0}|}{|\mathcal {S}_{0}|}\), where \(R_{1}^{\omega }= 1- R_{1}^{\chi }\), \(R_{0}^{\omega }= 1- R_{0}^{\chi }\). Similarly, the expression for the term \(\frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} \mathbf {z}^{l+1}_{j}\) can also be derived as

\begin{equation} \frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} \mathbf {z}_{j}^{l+1} \in R_{0}^{\chi } (\alpha ^{\chi } \bar{\mathbf {c}}_{1}^{l+1} + \alpha ^{\omega } \bar{\mathbf {c}}_{0}^{l+1})+ R_{0}^{\omega } \bar{\mathbf {c}}_{0}^{l+1} \pm \Delta ^{l+1}_{c}\boldsymbol {1}. \end{equation}

(31)

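Define \(\boldsymbol {\epsilon }^{l+1}:=\frac{1}{|\mathcal {S}_{1}|}\sum _{v_{j} \in S_{1}} \mathbf {z}^{l+1}_{j} - \frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} \mathbf {z}^{l+1}_{j}\). Subtracting Equation (31) from Equation (30) and collecting the coefficients of \(\bar{\mathbf {c}}_{0}^{l+1}\) and \(\bar{\mathbf {c}}_{1}^{l+1}\) gives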

\begin{equation} \boldsymbol {\epsilon }^{l+1} \in \bar{\mathbf {c}}_{0}^{l+1} (R_{1}^{\chi }\alpha ^{\chi } - R_{0}^{\chi }\alpha ^{\omega } - R_{0}^{\omega }) - \bar{\mathbf {c}}_{1}^{l+1} (R_{0}^{\chi }\alpha ^{\chi } - R_{1}^{\chi }\alpha ^{\omega } - R_{1}^{\omega }) \pm 2\Delta ^{l+1}_{c}\boldsymbol {1}. \end{equation}

(32)

Substituting \(R_{1}^{\omega }= 1- R_{1}^{\chi }\) and \(R_{0}^{\omega }= 1- R_{0}^{\chi }\) yields

\begin{equation} \boldsymbol {\epsilon }^{l+1} \in \bar{\mathbf {c}}_{0}^{l+1} (R_{1}^{\chi }\alpha ^{\chi } - R_{0}^{\chi }\alpha ^{\omega } - 1 + R_{0}^{\chi }) - \bar{\mathbf {c}}_{1}^{l+1} (R_{0}^{\chi }\alpha ^{\chi } - R_{1}^{\chi }\alpha ^{\omega } - 1 + R_{1}^{\chi }) \pm 2\Delta ^{l+1}_{c}\boldsymbol {1}. \end{equation}

(33)

Substituting \(\alpha ^{\omega }=1-\alpha ^{\chi }\) yields

\begin{equation} \begin{split} \boldsymbol {\epsilon }^{l+1} &\in \bar{\mathbf {c}}_{0}^{l+1} (R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1) - \bar{\mathbf {c}}_{1}^{l+1} (R_{0}^{\chi }\alpha ^{\chi } + R_{1}^{\chi }\alpha ^{\chi } - 1) \pm 2\Delta ^{l+1}_{c}\boldsymbol {1}, \\ &= (\bar{\mathbf {c}}_{0}^{l+1} - \bar{\mathbf {c}}_{1}^{l+1})(R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1) \pm 2\Delta ^{l+1}_{c}\boldsymbol {1}. \end{split} \end{equation}

(34)
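The simplification from Equation (32) to Equation (34) can be checked numerically. The sketch below is illustrative only: the values chosen for \(R_{1}^{\chi }\), \(R_{0}^{\chi }\), \(\alpha ^{\chi }\) and the two group means are arbitrary assumptions, the function names are hypothetical, and the deviation term \(\pm 2\Delta ^{l+1}_{c}\boldsymbol {1}\) is dropped (treated as zero).

```python
import random


def group_mean_gap(r1_chi, r0_chi, alpha_chi, c0, c1):
    """Mean of z over S_1 minus mean of z over S_0, per Equations (30) and (31).

    The deviation term is omitted, so the result is exact rather than an interval.
    """
    alpha_omega = 1.0 - alpha_chi
    r1_omega, r0_omega = 1.0 - r1_chi, 1.0 - r0_chi
    mean_s1 = [r1_chi * (alpha_chi * a + alpha_omega * b) + r1_omega * b
               for a, b in zip(c0, c1)]
    mean_s0 = [r0_chi * (alpha_chi * b + alpha_omega * a) + r0_omega * a
               for a, b in zip(c0, c1)]
    return [x - y for x, y in zip(mean_s1, mean_s0)]


def factored_gap(r1_chi, r0_chi, alpha_chi, c0, c1):
    """Closed form of Equation (34): (c0 - c1) * (R1_chi*alpha_chi + R0_chi*alpha_chi - 1)."""
    factor = r1_chi * alpha_chi + r0_chi * alpha_chi - 1.0
    return [(a - b) * factor for a, b in zip(c0, c1)]


rng = random.Random(0)
c0 = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # hypothetical group mean for S_0
c1 = [rng.uniform(-1.0, 1.0) for _ in range(4)]  # hypothetical group mean for S_1
r1_chi, r0_chi, alpha_chi = 0.3, 0.6, 0.4        # arbitrary illustrative values

direct = group_mean_gap(r1_chi, r0_chi, alpha_chi, c0, c1)
closed = factored_gap(r1_chi, r0_chi, alpha_chi, c0, c1)
max_abs_diff = max(abs(x - y) for x, y in zip(direct, closed))
print(max_abs_diff)  # agreement up to floating-point rounding
```

Both routes produce the same vector, which confirms that the coefficients of the two group means collapse to a single scalar factor.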

Therefore, \(\Vert \boldsymbol {\epsilon }^{l+1}\Vert _{2}\) can be upper bounded by

\begin{equation} \Vert \boldsymbol {\epsilon }^{l+1}\Vert _{2} \le |(R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1)| \Vert \bar{\mathbf {c}}_{0}^{l+1} - \bar{\mathbf {c}}_{1}^{l+1}\Vert _{2} + 2 \sqrt {N} \Delta ^{l+1}_{c} . \end{equation}

(35)

Furthermore, the term \(\Vert \bar{\mathbf {c}}_{0}^{l+1} - \bar{\mathbf {c}}_{1}^{l+1}\Vert _{2}\), where \(\bar{\mathbf {c}}_{s}^{l+1} ={\rm mean}(\mathbf {c}^{l+1}_j \mid v_{j} \in \mathcal {S}_{s})\) for \(s \in \lbrace 0, 1\rbrace\) and \({\mathbf {c}}^{l+1}_{j}=\mathbf {W}^{l} {\mathbf {h}}^{l}_{j}\), can be bounded as

\begin{equation} \begin{split} \left\Vert \frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} \mathbf {c}_{j}^{l+1} - \frac{1}{|\mathcal {S}_{1}|}\sum _{v_{j} \in S_{1}} \mathbf {c}_{j}^{l+1}\right\Vert _{2} &= \left\Vert \frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} \mathbf {W}^{l} {\mathbf {h}}^{l}_{j} - \frac{1}{|\mathcal {S}_{1}|}\sum _{v_{j} \in S_{1}} \mathbf {W}^{l} {\mathbf {h}}^{l}_{j}\right\Vert _{2}\\ &=\left\Vert \mathbf {W}^{l} \left(\frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} {\mathbf {h}}^{l}_{j} - \frac{1}{|\mathcal {S}_{1}|}\sum _{v_{j} \in S_{1}} {\mathbf {h}}^{l}_{j}\right)\right\Vert _{2} \\ &\le \sigma _{max}(\mathbf {W}^{l}) \left\Vert \frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} {\mathbf {h}}^{l}_{j} - \frac{1}{|\mathcal {S}_{1}|}\sum _{v_{j} \in S_{1}} {\mathbf {h}}^{l}_{j}\right\Vert _{2}, \end{split} \end{equation}

(36)

where \(\sigma _{max}(\cdot )\) outputs the largest singular value of the input matrix. Based on Equations (35) and (36), it follows that

\begin{equation} \Vert \boldsymbol {\epsilon }^{l+1}\Vert _{2} \le \sigma _{max}(\mathbf {W}^{l}) |(R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1)| \left\Vert \frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} {\mathbf {h}}^{l}_{j} - \frac{1}{|\mathcal {S}_{1}|}\sum _{v_{j} \in S_{1}} {\mathbf {h}}^{l}_{j}\right\Vert _{2} + 2 \sqrt {N} \Delta ^{l+1}_{c} . \end{equation}

(37)
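The last step of Equation (36) relies on the standard inequality \(\Vert \mathbf {W}\mathbf {x}\Vert _{2} \le \sigma _{max}(\mathbf {W})\Vert \mathbf {x}\Vert _{2}\). The following minimal sketch checks it on a small example. It assumes a \(2 \times 2\) matrix so that the largest singular value has a closed form; the matrix entries and helper names are hypothetical and not taken from the paper.

```python
import math
import random


def sigma_max_2x2(w):
    """Largest singular value of a 2x2 matrix.

    Uses the largest eigenvalue of W^T W, whose trace is the squared
    Frobenius norm of W and whose determinant is det(W)^2.
    """
    (a, b), (c, d) = w
    frob_sq = a * a + b * b + c * c + d * d
    det = a * d - b * c
    disc = math.sqrt(max(frob_sq * frob_sq - 4.0 * det * det, 0.0))
    return math.sqrt((frob_sq + disc) / 2.0)


def matvec(w, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def norm2(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


W = [[1.5, -0.4], [0.3, 0.8]]  # arbitrary illustrative weight matrix
s_max = sigma_max_2x2(W)

rng = random.Random(1)
bound_holds = True
for _ in range(1000):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    # Small tolerance guards against floating-point rounding.
    if norm2(matvec(W, x)) > s_max * norm2(x) + 1e-12:
        bound_holds = False
print(bound_holds)
```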

Finally, combining the results in Equations (29) and (37), the deviation between the output representations from different sensitive groups can be upper bounded by

\begin{equation} \begin{split}&\left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}^{l+1}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {h}^{l+1}_{j}\right\Vert _{2} \le L\left(\left\Vert \boldsymbol {\epsilon }^{l+1} \right\Vert _{2} + 2 \sqrt {N} \Delta ^{l+1}_{z} \right),\\ &\le L\left(\sigma _{max}(\mathbf {W}^{l}) |(R_{1}^{\chi }\alpha ^{\chi } + R_{0}^{\chi }\alpha ^{\chi } - 1)| \left\Vert \frac{1}{|\mathcal {S}_{0}|}\sum _{v_{j} \in S_{0}} {\mathbf {h}}^{l}_{j} - \frac{1}{|\mathcal {S}_{1}|}\sum _{v_{j} \in S_{1}} {\mathbf {h}}^{l}_{j}\right\Vert _{2} + 2 \sqrt {N} \Delta ^{l+1}_{c} + 2 \sqrt {N} \Delta ^{l+1}_{z} \right). \end{split} \end{equation}

(38)
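To see how the attention mass \(\alpha ^{\chi }\) enters the bound, the right-hand side of Equation (38) can be evaluated directly. The sketch below is a plain transcription of that expression under assumed inputs: the function name and every numeric value are hypothetical, and `lipschitz` stands for the constant \(L\) from Lemma A.1.

```python
import math


def representation_gap_bound(lipschitz, sigma_max, r1_chi, r0_chi, alpha_chi,
                             delta_h, n, delta_c, delta_z):
    """Right-hand side of Equation (38).

    delta_h is the gap between the group means of the layer-l representations,
    n is the quantity N under the square root, and delta_c / delta_z are the
    maximal deviation terms.
    """
    attention_factor = abs(r1_chi * alpha_chi + r0_chi * alpha_chi - 1.0)
    return lipschitz * (sigma_max * attention_factor * delta_h
                        + 2.0 * math.sqrt(n) * delta_c
                        + 2.0 * math.sqrt(n) * delta_z)


# Illustrative values only; none of these numbers come from the paper.
r1_chi, r0_chi = 0.9, 0.7
bounds = {
    alpha_chi: representation_gap_bound(1.0, 1.0, r1_chi, r0_chi, alpha_chi,
                                        delta_h=1.0, n=4, delta_c=0.0, delta_z=0.0)
    for alpha_chi in (0.25, 0.50, 0.625, 0.75)
}
for alpha_chi, value in bounds.items():
    print(alpha_chi, round(value, 6))
```

With the deviation terms set to zero, the bound reduces to \(L\,\sigma _{max}(\mathbf {W}^{l})\,|(R_{1}^{\chi }+R_{0}^{\chi })\alpha ^{\chi }-1|\) times the input gap, so it vanishes when \(\alpha ^{\chi }(R_{1}^{\chi }+R_{0}^{\chi })=1\), which is \(\alpha ^{\chi }=0.625\) for the assumed ratios.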

C Proof of Lemma 4.2

Lemma A.1 shows that the deviation between the output representations can be upper bounded by the following term:

\begin{equation} \begin{split} \delta _{h}^{l+1}&:=\left\Vert \operatorname{mean}(\mathbf {h}^{l+1}_{j} \mid s_{j}=0) - \operatorname{mean}(\mathbf {h}^{l+1}_{j} \mid s_{j}=1)\right\Vert _{2}\\ &= \left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}^{l+1}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {h}^{l+1}_{j}\right\Vert _{2} \\ &\le L\left(\left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {z}^{l+1}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {z}^{l+1}_{j}\right\Vert _{2} + 2 \sqrt {N} \Delta ^{l+1}_{z} \right), \end{split} \end{equation}

(39)

where \(\Delta ^{l+1}_{z}\) is the maximal deviation of aggregated representations \(\mathbf {Z}^{l+1}=\mathbf {H}^{l} \mathbf {W}^{l}\) at the \(l\)th fully connected layer. The deviation between aggregated representations from different sensitive groups can be upper bounded by

\begin{equation} \begin{split} \left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {z}^{l+1}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {z}^{l+1}_{j}\right\Vert _{2} &= \left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {W}^{l} \mathbf {h}^{l}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {W}^{l} \mathbf {h}^{l}_{j}\right\Vert _{2} \\ &= \left\Vert \mathbf {W}^{l} \left(\frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}^{l}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}}\mathbf {h}^{l}_{j}\right)\right\Vert _{2}\\ & \le \sigma _{max}(\mathbf {W}^{l}) \left\Vert \left(\frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}^{l}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}}\mathbf {h}^{l}_{j}\right)\right\Vert _{2}. \end{split} \end{equation}

(40)

Therefore, combining the results of Equations (39) and (40), it follows that

\begin{equation} \begin{split} &\left\Vert \frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}^{l+1}_{j} -\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}} \mathbf {h}^{l+1}_{j}\right\Vert _{2} \\ &\le L \left(\sigma _{max}(\mathbf {W}^{l}) \left\Vert \left(\frac{1}{|\mathcal {S}_{0}|} \sum _{v_j \in \mathcal {S}_{0}} \mathbf {h}^{l}_{j}-\frac{1}{|\mathcal {S}_{1}|} \sum _{v_j \in \mathcal {S}_{1}}\mathbf {h}^{l}_{j}\right)\right\Vert _{2} + 2 \sqrt {N} \Delta ^{l+1}_{z}\right), \end{split} \end{equation}

(41)

which concludes the proof.

D Additional Statistics On Datasets

Further statistical information for the utilized datasets is presented in Tables 4 and 5. \(F\) in Table 4 denotes the dimension of nodal features.

Dataset	\(|\mathcal {S}_{0}|\)	\(|\mathcal {S}_{1}|\)	\(|\mathcal {E}^{\chi }|\)	\(|\mathcal {E}^{\omega }|\)	\(F\)
Pokec-z	4,851	2,808	1,140	28,336	59
Pokec-n	4,040	2,145	943	20,901	59
Recidivism	9,317	9,559	298,098	325,642	17

Table 4. Dataset Statistics for Social Networks

Dataset	\(|\mathcal {V}|\)	# sensitive attr.	\(|\mathcal {E}^{\chi }|\)	\(|\mathcal {E}^{\omega }|\)
Cora	2,708	7	1,428	5,964
Citeseer	3,327	6	1,628	4,746
PubMed	19,717	3	12,254	49,802

Table 5. Dataset Statistics for Citation Networks
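The edge counts in Tables 4 and 5 can be turned into a single ratio per dataset. The sketch below assumes that \(\mathcal {E}^{\chi }\) and \(\mathcal {E}^{\omega }\) denote the inter-group and intra-group edge sets, consistent with the use of the \(\chi\) and \(\omega\) superscripts in the proofs, and computes the fraction of edges that cross sensitive groups. The dictionary and function names are hypothetical.

```python
# (|E^chi|, |E^omega|) copied from Tables 4 and 5.
EDGE_COUNTS = {
    "Pokec-z": (1_140, 28_336),
    "Pokec-n": (943, 20_901),
    "Recidivism": (298_098, 325_642),
    "Cora": (1_428, 5_964),
    "Citeseer": (1_628, 4_746),
    "PubMed": (12_254, 49_802),
}


def inter_edge_fraction(inter_edges, intra_edges):
    """Share of all edges that connect nodes from different sensitive groups."""
    return inter_edges / (inter_edges + intra_edges)


fractions = {name: inter_edge_fraction(chi, omega)
             for name, (chi, omega) in EDGE_COUNTS.items()}
for name, value in fractions.items():
    print(f"{name}: {value:.3f}")
```

Under this reading, the two Pokec graphs have very few inter-group edges relative to intra-group edges, whereas Recidivism is close to balanced.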

E Hyperparameters

We provide the selected hyperparameter values for the GNN model and the proposed framework, FairGAT, for the reproducibility of the presented results. For the node classification task, weights are initialized utilizing Glorot initialization [17] in the GAT-based classifier. All models are trained for 500 epochs by employing an Adam optimizer [27] together with a learning rate of 0.005 and \(\ell _2\) weight decay factor of 0.0005. A 2-layer GAT network followed by a linear layer is employed for node classification (for baselines, the conventional GAT layer [47] is utilized, whereas the layer used in FairGAT employs the fair attention calculation in Equation (10)). The hidden dimension of the node representations is selected as 128 on all datasets. The experimental settings for link prediction are kept the same as in FairDrop [45], where GAT layers are utilized to build the encoder network.

Hyperparameter	Pokec-z	Pokec-n	Recidivism	Cora	Citeseer	PubMed
\(\alpha _{max}^{\chi }\)	0.75	0.25	0.75	0.50	0.75	0.50
\(\eta\)	1.00	0.75	1.00	1.00	1.00	1.00

Table 6. Utilized \(\alpha _{max}^{\chi }\) and \(\eta\) Values for the Presented Results in Tables 1 and 3

The results for the baseline schemes, FairGNN [10], EDITS [13], and NIFTY [1], are obtained by choosing the hyperparameters in the corresponding studies via grid search on cross-validation sets with 5 different data splits. Specifically, the values \(0.01, 0.1, 0.2, 1\) are examined as the multiplier of the adversarial regularizer for FairGNN. The results in Table 1 are obtained by setting this hyperparameter to 0.01, 0.2, and 0.1 for the Pokec-z, Pokec-n, and Recidivism datasets, respectively. Moreover, the threshold values \(0.001, 0.01, 0.05, 0.3\) are examined for EDITS (except for the Recidivism dataset, for which the study already suggests the optimal hyperparameter value 0.015). The results in Table 1 are obtained with 0.001 and 0.05 for Pokec-z and Pokec-n, respectively. Finally, for NIFTY, the coefficient of the unsupervised loss is tuned among the values \(0.5, 0.6, 0.7, 0.8, 0.9\). The results in Table 1 are obtained with 0.9, 0.5, and 0.5 for Pokec-z, Pokec-n, and Recidivism, respectively.

Hyperparameter values for \(\alpha _{max}^{\chi }\) and \(\eta\) that lead to the results in Tables 1 and 3 are presented in Table 6. In this work, these hyperparameters are selected via grid search on cross-validation sets over 5 different data splits. Specifically, the values \(0.25, 0.50, 0.75\) are examined for \(\alpha _{max}^{\chi }\). In the tuning of \(\eta\), for a more stable tuning process, both \(\mathbf {Z}^{l+1}\) and \(\mathbf {C}^{l+1}\) are normalized so that the variation of each feature equals 1 before scaling with \(\eta\). This normalization before scaling also allows the overall scaling factors to be different for different layers l, which provides a better fairness-utility trade-off for different \(\eta\) values. The values \(0.75, 1\) are examined for the final \(\eta\) selection.
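The search space and the normalization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "variation equals 1" is read as unit standard deviation per feature, no centering is applied, constant features are left unscaled, and the result is simply multiplied by \(\eta\) (how the scaled representations enter the layer is defined in Section 4.2.3 and is not reproduced here). All names are hypothetical.

```python
import itertools
import statistics

ALPHA_MAX_CHI_GRID = [0.25, 0.50, 0.75]  # values examined for alpha_max^chi
ETA_GRID = [0.75, 1.0]                   # values examined for eta

# Every (alpha_max^chi, eta) pair visited by the grid search.
search_space = list(itertools.product(ALPHA_MAX_CHI_GRID, ETA_GRID))


def normalize_and_scale(feature_columns, eta):
    """Rescale each feature column to unit standard deviation, then multiply by eta.

    feature_columns is a list of columns, each a list of per-node values.
    """
    scaled = []
    for column in feature_columns:
        std = statistics.pstdev(column)
        # Assumption: a constant feature has no spread, so it is not divided.
        denom = std if std > 0 else 1.0
        scaled.append([eta * value / denom for value in column])
    return scaled


toy_features = [[2.0, 6.0], [1.0, 1.0]]  # two features over two nodes (toy data)
scaled_features = normalize_and_scale(toy_features, eta=0.75)
print(len(search_space), scaled_features)
```

After this step every non-constant feature has standard deviation \(\eta\), so a single \(\eta\) value has a comparable effect across layers whose raw representations differ in magnitude.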

F Runtime Analysis

In order to demonstrate the time complexity incurred by the proposed framework, Table 7 presents the average runtime of each epoch for FairGAT and the baselines. Note that the runtime for EDITS [13] is obtained by excluding the time incurred by the fairness-aware preprocessing step. For EDITS, the table shows the average runtime of each epoch for the training of a GAT network with preprocessed inputs.

Dataset	GAT	FairGNN	EDITS	NIFTY	FairGAT
Pokec-z	0.30	0.68	0.74	1.28	0.31
Pokec-n	0.23	0.50	0.41	0.91	0.24
Recidivism	3.07	6.91	9.31	13.97	3.11

Table 7. Runtime Comparison (Average Seconds per Epoch)
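The per-epoch times in Table 7 are easier to compare as overhead relative to the plain GAT. The short sketch below computes that ratio from the table values; the dictionary layout and function name are hypothetical, and the numbers are copied from Table 7 as reported.

```python
# Average seconds per epoch, copied from Table 7.
EPOCH_SECONDS = {
    "Pokec-z": {"GAT": 0.30, "FairGNN": 0.68, "EDITS": 0.74, "NIFTY": 1.28, "FairGAT": 0.31},
    "Pokec-n": {"GAT": 0.23, "FairGNN": 0.50, "EDITS": 0.41, "NIFTY": 0.91, "FairGAT": 0.24},
    "Recidivism": {"GAT": 3.07, "FairGNN": 6.91, "EDITS": 9.31, "NIFTY": 13.97, "FairGAT": 3.11},
}


def relative_overhead(times, method, reference="GAT"):
    """Extra time per epoch of `method` as a fraction of the reference time."""
    return times[method] / times[reference] - 1.0


overheads = {
    dataset: {method: relative_overhead(times, method)
              for method in times if method != "GAT"}
    for dataset, times in EPOCH_SECONDS.items()
}
for dataset, per_method in overheads.items():
    summary = ", ".join(f"{m}: {100 * v:.1f}%" for m, v in per_method.items())
    print(f"{dataset}: {summary}")
```

On all three datasets the FairGAT overhead over GAT stays below 5%, while each of the other fairness-aware baselines adds more than 75% per epoch. Note that the EDITS figures exclude its preprocessing step.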

G Computing Infrastructures

Software infrastructure: All GNN-based models are trained utilizing PyTorch 1.10.1 [40] and NetworkX 2.5.1 [19].

Hardware infrastructure: Experiments are carried out on 8 AMD Ryzen Threadripper 3970X CPUs.

References

[1]

Chirag Agarwal, Himabindu Lakkaraju, and Marinka Zitnik. 2021. Towards a unified framework for fair and stable graph representation learning. In Uncertainty in Artificial Intelligence. PMLR, Online, 2114–2124.
