1 Introduction

Machine learning (ML) models are now ubiquitous in nearly every human setting. They have even been used in judicial systems to determine whether or not to parole convicted criminals (Angwin et al., 2016). While ML systems have undoubtedly improved human productivity (Furman & Seamans, 2019), they can also propagate biases in data and consequently discriminate against certain demographic groups. Notably, COMPAS, a software tool used to inform parole and sentencing decisions, was found to label black defendants as twice as likely as their white counterparts to recidivate over a two-year period (Angwin et al., 2016).

In recent years, there has been significant research on so-called fairness in machine learning, which is primarily concerned with measuring and mitigating algorithmic biases.Footnote 1 Several fairness algorithms constrained to satisfy certain fairness metrics have been designed (Mehrabi et al., 2021). The majority of existing fairness algorithms are designed on the assumption that the training and test datasets are independent and identically distributed (iid) (Li et al., 2024; Maity et al., 2021). In other words, it is assumed that ensuring fairness on the training dataset approximates fairness guarantees on the test dataset.

In reality, however, it is common for the distribution of the test dataset (target environment) to drift from that of the training dataset (source environment), so the two are no longer identically distributed. For example, a recent study demonstrated distributional drift of online learning behaviors within the educational setting at the height of COVID-19 (Impey & Formanek, 2021). In such circumstances, we posit that fairness algorithms cannot guarantee fairness. To date, only a handful of research works have investigated the robustness of fairness algorithms amidst distributional drift (Kamp et al., 2021). Nonetheless, there remain gaps in the literature. Firstly, the existing literature does not quantify the size of differential driftFootnote 2 in data distribution across demographic groups. It is therefore difficult to characterize the relationship between differential distributional drift and the corresponding fairness, and this characterization is needed in order to develop fairness algorithms that are robust against distributional drift. Secondly, there is no comprehensive study that benchmarks fairness algorithms vis-à-vis fairness metrics in the context of data distributional drift. Our study fills this gap by considering a large collection of metrics, algorithms and datasets, and by exploring the conditions under which fairness algorithms may be considered fair amidst distributional drift. More formally, we explore the following research questions (RQs).

  • RQ1: What is the relationship between differential drift in data across demographic groups and algorithmic fairness?

  • RQ2: Are existing fairness algorithms distributional drift aware?

By investigating these RQs, we can diagnose and reveal various ways in which distributional drift impacts algorithmic fairness. To answer these questions, we perform an extensive experimental analysis involving 4 baseline models and 7 fairness algorithms, using 3 predictive performance metrics and 10 fairness metrics. We run our analysis on 5 real-world datasets across 4 domains: Education, Finance, Employment, and Criminal Justice. The main contributions of our work are as follows:

  • We reveal several interesting relationships between distributional drift—specifically covariate drift—and algorithmic fairness. In particular, we show how covariate drift can result in discrimination (or reverse discrimination).

  • We demonstrate the lack of robustness of existing algorithms in the face of covariate drift, while highlighting the need for the proper contextualization of what is fair.

  • We perform extensive experimental analysis that contributes critical empirical evidence on the impact of covariate drift on algorithmic fairness and recommend important policy implications of our findings for relevant stakeholders.

The rest of this paper is organized as follows. Section 2 discusses relevant related work. Section 3 provides preliminary information regarding notation, evaluation metrics, and fairness algorithms. Section 4 describes our experimental setup. We discuss the results and their implications in Sect. 5. We conclude the paper in Sect. 6.

2 Related work

Fairness algorithms either modify biased data (aka pre-processing algorithms), add a fairness constraint directly to a model’s objective function (aka in-processing algorithms), or modify a biased model’s outcomes (aka post-processing algorithms). In recent years, a number of studies have assessed fairness algorithms against different evaluation criteria (Friedler et al., 2017; Roth, 2018; Deho et al., 2022). Existing works tend to compare fairness algorithms in terms of their predictive performance, fairness, and/or fairness-accuracy trade-off. For instance, Hamilton (2017) compared four fairness algorithms using four fairness metrics across three different datasets and found that the fairness of the algorithms varied across datasets. In a similar study, Roth (2018) compared three fairness algorithms and found that, in most cases, the in-processing approaches tend to achieve fairness better than the pre-processing approaches. Nonetheless, Roth found that the algorithms were inconsistent across datasets and remarked on the need for further extensive experiments. Furthermore, studies by Friedler et al. (2017) and Deho et al. (2022) have compared fairness algorithms across several datasets, investigated the effect of hyperparameter variation on fairness algorithms, and evaluated the fairness-accuracy trade-offs of various fairness algorithms. Both studies also found that the performance of fairness algorithms tends to vary across datasets. In this regard, a new line of research has recently emerged that compares the utility of fairness algorithms under drifts in data distribution.

Only a handful of works investigate the robustness of fairness algorithms in the face of distributional drift. For example, Castelnovo et al. (2021) trained three fairness algorithms on banking data and observed that the fairness algorithms failed to satisfy demographic parity one year after deployment when the financial data drifted for certain demographic groups. Similarly, Ghodsi et al. (2022) trained one pre-processing and one in-processing algorithm on the newly released Adult dataset and found that the fair models deteriorate significantly in terms of predictive performance and fairness due to drift in the spatial distribution of races. In what is perhaps the study closest to ours, Islam et al. (2022) performed a comprehensive analysis of different fairness algorithms across different datasets and investigated the robustness of the algorithms to what they refer to as data errors, which, in some sense, amounts to distributional drift. Specifically, Islam et al. (2022) introduced data errors by (1) swapping certain column values, (2) scaling certain column values, and (3) imputing missing values for certain columns. All the data errors were introduced randomly and disproportionately across demographic groups. Interestingly, Islam et al. (2022) found that pre- and in-processing algorithms tend to be less generalizable, whereas post-processing algorithms were found to be more robust to data errors. In a recent related work, Gardner et al. (2023) performed a cross-institutional analysis in which learning analytics models from one institution were deployed in another institution. Contrary to prior findings, Gardner et al. (2023) found almost no drop in the fairness of their models when deployed in a different institution where distributional drift is likely. It is worth noting, however, that the models used in the study by Gardner et al. (2023) were traditional ML models without any fairness constraints.

None of these related works measures the level of distributional drift in the data, which makes it difficult to draw a relationship between distributional drift and fairness. Further, existing works that investigate the impact of distributional drift on fairness (Castelnovo et al., 2021; Ghodsi et al., 2022; Kamp et al., 2021) cover a very limited breadth of fairness metrics and algorithms. In particular, they are often limited to three fairness metrics, namely equalized odds, statistical parity (aka demographic parity), and equal opportunity. However, given the fluidity of fairness metrics, accompanied by the impossibility theorem (Chouldechova, 2017), it has become critical to have a more comprehensive benchmarking of fairness algorithms and notions through the lens of data distributional drift.

3 Preliminaries and background

Consistent with the literature, we use the term protected attributes, denoted A, for attributes such as race and gender that often cannot legally be used as the basis for decisions (Siegel, 2003). We represent non-protected attributes by X, actual outcomes by Y and predicted outcomes by \({\hat{Y}}\). We denote the privileged group and the favourable outcome by \(A=1\) and \(Y=1\) (or \({\hat{Y}}=1\)) respectively. Conversely, we represent the unprivileged group and the unfavourable outcome by \(A=0\) and \(Y=0\) (or \({\hat{Y}}=0\)) respectively. As per convention in the algorithmic fairness literature, we denote demographic groups that are historically “disadvantaged”, e.g., racial minorities, as unprivileged groups. These groups are protected by law and are also simply referred to as protected groups (Mehrabi et al., 2021). Non-protected/privileged groups are the demographic groups that are historically advantaged, e.g., racial majorities.

3.1 Types of distributional drift and drift detection metric

Given a non-protected input attribute X and a target outcome Y at time t (e.g., training) and time \(t+k\) (e.g., testing), where \(k >0\), distributional drift can manifest in the following 3 ways (Moreno-Torres et al., 2012):

  a) Covariate drift: This occurs when the input marginal probability P(x) changes but the conditional probability P(y|x) does not change. Covariate drift is expressed as:

    $$\begin{aligned} \small P_{t}(x)\ne P_{t+k}(x) \wedge P_{t}(y|x)= P_{t+k}(y|x) \end{aligned}$$
    (1)

  b) Target drift: This occurs when the target marginal probability P(y) changes but the conditional probability P(x|y) does not change. Target drift is expressed as:

    $$\begin{aligned} \small P_{t}(y)\ne P_{t+k}(y) \wedge P_{t}(x|y)= P_{t+k}(x|y) \end{aligned}$$
    (2)

  c) Concept drift: This occurs when the underlying relationship between the input and the target label changes but the input marginal probability P(x) does not change. Concept drift is expressed as:

    $$\begin{aligned} \small P_{t}(y|x)\ne P_{t+k}(y|x) \wedge P_{t}(x)= P_{t+k}(x) \end{aligned}$$
    (3)

Consistent with well-known related works (Maity et al., 2021; Xu & Wilson, 2021), we focus on the most prevalent type of drift, covariate drift. Further, we assume that the condition \(P_{t}(y|x)= P_{t+k}(y|x)\) in covariate drift is satisfied. Consistent with Xu and Wilson (2021), this assumption is reasonable because the available data are typically insufficient to reliably estimate changes in the conditional distribution.

To measure the covariate drift condition (i.e., \(P_t(x) \ne P_{t+k}(x)\)), we adopt the well-known statistical drift detection metric called the Jensen-Shannon Distance (JSD) (Endres & Schindelin, 2003). JSD works well for both categorical and numerical variables and is based on the KL divergence. However, unlike the KL divergence, JSD is symmetric and returns a finite score between 0 and 1, where 0 signifies no drift and 1 signifies maximum drift. More formally, given \(P_t\) and \(P_{t+k}\):

$$\begin{aligned} \small JSD(P_t,P_{t+k})= \sqrt{\frac{D_{KL}(P_t||M)+ D_{KL}(P_{t+k}||M)}{2}} \end{aligned}$$
(4)

where,

  • \(M = \frac{P_t + P_{t+k}}{2}\) is the mixture of the two distributions and \(D_{KL}\) denotes the KL divergence.

  • We consider \(JSD \ge 0.1\) in this study as a significant driftFootnote 3.
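
To make Eq. (4) concrete, the short sketch below estimates the JSD for a single covariate by binning its values at times \(t\) and \(t+k\) into histograms; scipy's jensenshannon function already returns the distance (the square root of the divergence), and base 2 keeps the score in [0, 1]. The helper name, bin count and toy data are our own illustrative choices, not part of the original study.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def covariate_jsd(x_t, x_tk, bins=20):
    """Estimate JSD between the distributions of one covariate at times t and t+k.

    Both samples are binned on a common grid so the two histograms are
    comparable; jensenshannon(..., base=2) returns a distance in [0, 1].
    """
    lo = min(np.min(x_t), np.min(x_tk))
    hi = max(np.max(x_t), np.max(x_tk))
    edges = np.linspace(lo, hi, bins + 1)
    p_t, _ = np.histogram(x_t, bins=edges)
    p_tk, _ = np.histogram(x_tk, bins=edges)
    # Normalise counts to probability mass functions (jensenshannon also
    # normalises internally, but doing it explicitly keeps the intent clear).
    p_t = p_t / p_t.sum()
    p_tk = p_tk / p_tk.sum()
    return jensenshannon(p_t, p_tk, base=2)

# Example: a covariate whose mean shifts between training and test time.
rng = np.random.default_rng(0)
drift = covariate_jsd(rng.normal(0, 1, 5000), rng.normal(0.8, 1, 5000))
print(f"JSD = {drift:.3f}, significant drift: {drift >= 0.1}")
```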

3.2 Evaluation metrics

In this section we present the various predictive performance and fairness metrics considered in this work.

3.2.1 Predictive performance metrics

To measure the predictive performance of our models, we consider 3 popularly used metrics, namely accuracy for balanced data, and balanced accuracy and weighted F1-score for imbalanced data. All 3 metrics return a value between 0 (worst) and 1 (best).
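
A minimal sketch, assuming scikit-learn and illustrative label vectors, of how the three scores can be computed:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 1]   # actual outcomes Y
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]   # predicted outcomes Y-hat

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
}
print(scores)  # each score lies in [0, 1], higher is better
```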

3.2.2 Fairness metrics

To measure the fairness of our models, we carefully consider 10 fairness metrics such that each of them belongs to at least one of the 7 clusters that Majumder et al. (2021) discovered from 26 fairness metrics. The 7 clusters represent misclassification metrics (clusters 0 and 3), differential fairness metrics (cluster 1), individual fairness metrics (cluster 2), confusion-matrix-based group fairness metrics (cluster 4), between-group individual fairness metrics (cluster 5), and intermediate metrics (cluster 6). Metrics in the same cluster were found by the authors to satisfy similar fairness notions. Table 1 summarizes these metrics into the two types described as follows:

Group Fairness Metrics. Group fairness seeks to ensure that individuals belonging to different demographic groups are treated equally. The group fairness metrics that we considered in this work are statistical parity (SP) (Dwork et al., 2012), disparate impact (DI) (Feldman et al., 2015), error rate difference (ERD) (Berk et al., 2021), equalized odds (EO) (Hardt et al., 2016), equal opportunity (EOP) (Hardt et al., 2016), positive predictive value difference (PPV-DIFF) (Chouldechova, 2017), and negative predictive value difference (NPV-DIFF) (Berk et al., 2021).
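
To illustrate how such group metrics are computed from model outputs, the sketch below implements statistical parity difference and disparate impact for a binary protected attribute. The sign and ratio conventions (unprivileged relative to privileged) are a common choice and may differ from the exact formulae in Table 1.

```python
import numpy as np

def statistical_parity_diff(y_pred, a):
    """SP difference: P(Y_hat=1 | A=0) - P(Y_hat=1 | A=1).

    Negative values indicate that the unprivileged group (A=0) receives the
    favourable outcome less often; 0 is ideal.
    """
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    return y_pred[a == 0].mean() - y_pred[a == 1].mean()

def disparate_impact(y_pred, a):
    """DI ratio: P(Y_hat=1 | A=0) / P(Y_hat=1 | A=1); 1 is ideal."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    return y_pred[a == 0].mean() / y_pred[a == 1].mean()

y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # A=1 privileged, A=0 unprivileged
print(statistical_parity_diff(y_hat, group), disparate_impact(y_hat, group))
```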

Individual Fairness Metrics. Individual fairness is based on the idea that similar individuals should be treated similarly. There are fewer individual fairness metrics compared to group fairness metrics. The individual fairness metrics considered in this study are the within-group generalized entropy index (WGEI) (Speicher et al., 2018), the within-group Theil index (WGTI) (Speicher et al., 2018), and consistency (Zemel et al., 2013).

Table 1 Formulae of various fairness metrics

3.3 Fairness algorithms

Fairness algorithms are designed to help ML models satisfy one or more fairness metrics; an ML model that does so is said to be “fair”. Seven well-known fairness algorithms covering the pre-processing, in-processing and post-processing stages of an ML pipeline are considered in this work.

3.3.1 Pre-processing algorithms

Pre-processing algorithms tackle algorithmic unfairness by debiasing data. The fair data generated by the pre-processing algorithm can then be used to train any downstream ML model. The 3 pre-processing techniques we used in this study are Suppression (SUP) (Kamiran & Calders, 2012), Reweighing (RW) (Kamiran & Calders, 2012), and Disparate Impact Remover (DIR) (Feldman et al., 2015).
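
As an illustration of the pre-processing idea, the following is a minimal from-scratch sketch of Reweighing (RW): each (group, label) combination receives the weight that would make the protected attribute and the label statistically independent, and the resulting sample weights can be passed to any downstream learner that accepts them. This is a simplified sketch, not the reference AIF360 implementation.

```python
import numpy as np
import pandas as pd

def reweighing_weights(a, y):
    """Instance weights w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y).

    Under-represented (group, label) combinations receive weights > 1 and
    over-represented ones < 1, so the weighted data looks unbiased.
    """
    df = pd.DataFrame({"a": a, "y": y})
    n = len(df)
    p_a = df["a"].value_counts(normalize=True)
    p_y = df["y"].value_counts(normalize=True)
    p_ay = df.groupby(["a", "y"]).size() / n
    weights = df.apply(
        lambda r: p_a[r["a"]] * p_y[r["y"]] / p_ay[(r["a"], r["y"])], axis=1
    )
    return weights.to_numpy()

# Usage with any scikit-learn estimator that accepts sample weights, e.g.:
#   LogisticRegression().fit(X_train, y_train,
#                            sample_weight=reweighing_weights(a_train, y_train))
```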

3.3.2 In-processing algorithms

In-processing algorithms achieve fairness by explicitly introducing fairness constraints into the ML algorithm. We consider 2 in-processing algorithms in this study, namely Prejudice Remover (PR) (Kamishima et al., 2012) and Adversarial Debiasing (AdDeb) (Zhang et al., 2018).

3.3.3 Post-Processing algorithms

Post-processing methods involve altering the outcomes of a pre-trained model to attain specific fairness criteria across various groups. The 2 post-processing algorithms considered in this study are Equal Odds algorithm (EQ) (Hardt et al., 2016) and Calibrated Equal Odds (CEq) (Pleiss et al., 2017).
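
To illustrate the post-processing idea, the sketch below implements a heavily simplified, deterministic variant in the spirit of equal opportunity: the unprivileged group's decision threshold is adjusted on a labelled validation set so that its true positive rate matches that of the privileged group. Hardt et al.'s actual method solves a randomised mixing problem over both error rates; the function and variable names here are purely illustrative.

```python
import numpy as np

def group_threshold_for_tpr(scores, y_true, target_tpr):
    """Smallest threshold achieving at least target_tpr within this group."""
    pos_scores = np.sort(scores[y_true == 1])[::-1]   # positives, high to low
    k = int(np.ceil(target_tpr * len(pos_scores)))
    k = min(max(k, 1), len(pos_scores))
    # Using the k-th highest positive score as threshold labels at least k
    # positives correctly, i.e. TPR >= k / n_pos.
    return pos_scores[k - 1]

def equal_opportunity_postprocess(scores, y_true, a, base_threshold=0.5):
    """Per-group thresholds so both groups share the privileged group's TPR.

    In practice the thresholds are fitted on a held-out labelled set and then
    applied to new, unlabelled data.
    """
    tpr_priv = np.mean(scores[(a == 1) & (y_true == 1)] >= base_threshold)
    thr_unpriv = group_threshold_for_tpr(scores[a == 0], y_true[a == 0], tpr_priv)
    thresholds = np.where(a == 1, base_threshold, thr_unpriv)
    return (scores >= thresholds).astype(int)
```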

4 Experimental setup

4.1 Datasets and baseline ML algorithms

We used 5 real-world datasets for our comparative analysis. Three of the datasets are commonly used publicly available datasetsFootnote 4Footnote 5 in the algorithmic fairness community, while two are proprietary datasets. The two proprietary datasets are anonymized counts of first-semester MoodleFootnote 6 engagement records for 2 mandatory STEM courses—coded as NWF and ITF—in a large public Australian university. Table 2 presents a summary of all the datasets used, which are also briefly described as follows. Proprietary Datasets: NWF and ITF contain students’ demographic and engagement records from 2015 to 2020 for the two courses respectively. Public Datasets: BAF is synthetic data generated from an anonymized real-world bank fraud detection dataset spanning 8 months (Jesus et al., 2022). The Adult dataset is an income census dataset detailing whether a person’s income exceeds 50K US dollars (Becker & Kohavi, 1996). The COMPAS dataset contains criminal history and COMPAS risk scores for defendants from Broward County (Angwin et al., 2016).

Table 2 Description of all datasets. The seasons show how each dataset was partitioned into the 3-season format

For the baseline algorithms, i.e., the algorithms without any fairness constraint, we consider 4 classical ML algorithms that have been extensively used in several related works as baselines for comparing fairness algorithms, namely Logistic Regression (LR), eXtreme Gradient Boosted Trees (XGB), Random Forest (RF), and Support Vector Machines (SVM) (Islam et al., 2022).

4.2 Analysis of covariate drift in datasets

Given that the important covariates have a significant influence on the prediction outcomes (Zien et al., 2009), we are interested in investigating the relationship between covariate importance, covariate drift, and algorithmic unfairness. To that end, we pursued two key objectives. Firstly, we are interested in knowing whether a significant drift in the important covariates corresponds to significant levels of unfairness. Secondly, we are interested in knowing whether unfairness flows in the direction of the drift of the important covariates; that is, if the most important covariates drift significantly for a particular demographic group, does the measured bias also flow in that same direction?

Ranking of Covariates. To determine the important covariates, for each dataset we trained the 4 baseline models, i.e., RF, LR, XGB, and SVM, and used the coefficient weights (for LR and SVM), covariate importance scores (for XGB and RF), and the well-established SHapley Additive exPlanations (SHAP) values (for all 4 models). All the covariate importance scores were normalized to be between 0 and 1 and used to rank the covariates. Ranking results have been excluded due to space constraints; the supplementary results of all experiments in this paper are available at this link.
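
A minimal sketch of the ranking step, using only the model-native importance scores (coefficients and impurity-based importances) with min-max normalisation; the paper additionally uses SHAP values, which are omitted here, and the synthetic data below is purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def normalised_importance(model, X, y, feature_names):
    """Fit the model and return covariate importances scaled to [0, 1]."""
    model.fit(X, y)
    if hasattr(model, "feature_importances_"):      # RF, XGB
        raw = model.feature_importances_
    else:                                           # LR, linear SVM
        raw = np.abs(model.coef_).ravel()
    scaled = (raw - raw.min()) / (raw.max() - raw.min() + 1e-12)
    return pd.Series(scaled, index=feature_names).sort_values(ascending=False)

# Illustrative data; in the paper each real dataset is used instead.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
names = [f"x{i}" for i in range(X.shape[1])]
print(normalised_importance(RandomForestClassifier(random_state=0), X, y, names).head(6))
print(normalised_importance(LogisticRegression(max_iter=1000), X, y, names).head(6))
```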

Covariate Drift: All datasets were partitioned into 3 seasons based on the timestamp of the records in each dataset, thus simulating a historical training dataset at time \(t_0\) and two test datasets at times \(t_1\) and \(t_2\) respectively. For example, the NWF dataset is partitioned as (1) Historical data at time \(t_0\) (2015–2017), i.e., long before COVID-19 (LBC); (2) Pre-covid data at time \(t_1\) (2018–2019), i.e., immediately pre-COVID-19 (PC); and (3) Peri-covid data at time \(t_2\) (2020), i.e., during or peri-COVID (PeC). We computed the size of covariate drift across the 3 seasons using the Jensen-Shannon Distance (JSD) in the following fashion: \(JSD_{t_{01}}(X_{t_0},X_{t_1})\) and \(JSD_{t_{02}}(X_{t_0},X_{t_2})\) (c.f. Equation 4). Figures 1 and 2 show the covariate drift patterns of the NWF and BAF datasets. It can be observed that the figures show an inconsistent mix of gradual and sudden drifts. This observation is important as it demonstrates that there can be different levels of drift, which require tailored attention, unlike the works of Islam et al. (2022) and Ghodsi et al. (2022), which assume the same level of drift and thus treat the impact of drift on fairness equally. We noted that the Adult and COMPAS datasets did not have timestamped records. In this case, we first set the original dataset as the Historical dataset at season \(t_0\). We then introduced a calculated artificial drift to the most important covariates and denote this dataset as the Pre-covid equivalent at season \(t_1\); we did the same for the Peri-covid equivalent at season \(t_2\). The calculated drifts were introduced randomly across demographic groups. We refer to these two drifted datasets as Drift 1 and Drift 2 respectively. For instance, Fig. 3 shows the introduced drift for the COMPAS dataset. We created the artificial drift using the following equation:

$$\begin{aligned} \text {Artificial Drift} = {\bar{X}} \cdot k + c \end{aligned}$$
(5)

where,

  • \(\bar{X}\) is the mean of the covariate; \(c \in [0.001,0.1]\) is a small constant that ensures non-zero drift; and \(k \in [0,3]\) is the level of drift to be introduced.
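
A sketch of how Eq. (5) can be applied to inject drift into a covariate; the specific k and c values and the optional group mask (used to make the drift differential across demographic groups) are illustrative choices within the stated ranges.

```python
import numpy as np

def add_artificial_drift(x, k, c, group_mask=None):
    """Shift a covariate by mean(x) * k + c (Eq. 5).

    If group_mask is given, the shift is applied only to the selected rows,
    e.g. a randomly chosen demographic subset, producing differential drift.
    """
    x = np.asarray(x, dtype=float).copy()
    shift = x.mean() * k + c
    if group_mask is None:
        x += shift
    else:
        x[group_mask] += shift
    return x

rng = np.random.default_rng(42)
age = rng.normal(35, 10, 1000)                       # illustrative covariate
age_drift1 = add_artificial_drift(age, k=0.5, c=0.05)   # "Drift 1"
age_drift2 = add_artificial_drift(age, k=1.5, c=0.05)   # "Drift 2"
```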

4.3 Model training and testing

We refer to the models without any fairness constraints as baseline models, and the models with fairness constraints as fairness-aware models. Overall, we used a total of 4 baseline models and 22Footnote 7 fairness-aware models for all our experiments, as shown in Table 3.

Train-Test Split: Recall that all datasets are partitioned into 3 seasons at times \(t_0\), \(t_1\), and \(t_2\), which represent the historical training dataset, test \(\hbox {dataset}_1\), and test \(\hbox {dataset}_2\) respectively. To assess the impact of the implicit iid assumption often made in the literature, we used cross-validation and bootstrapping as follows. In the cross-validation, we performed 40 shuffled repeats of 5-fold cross-validation, making a total of 200 train-test runs on the historical training dataset. The results from this cross-validation represent the situation where the training and test datasets are drawn from the same distribution, i.e., the iid assumption. In the bootstrapping, we trained each model on 200 bootstrapped samples of the historical dataset and tested them on test \(\hbox {dataset}_1\) and test \(\hbox {dataset}_2\). The results from this bootstrapping experiment represent the scenario where the training and test datasets do not follow the iid assumption. All hyper-parameters were optimized accordingly.
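
A condensed sketch of the two evaluation regimes, assuming numpy arrays for the seasonal splits and a generic scikit-learn estimator and metric; the split and bootstrap counts mirror the description above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.utils import resample

def iid_evaluation(model, X0, y0, metric, n_repeats=40):
    """40 x 5-fold shuffled CV on the historical data (t0): the iid setting."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats, random_state=0)
    scores = []
    for train_idx, test_idx in cv.split(X0, y0):
        m = clone(model).fit(X0[train_idx], y0[train_idx])
        scores.append(metric(y0[test_idx], m.predict(X0[test_idx])))
    return np.mean(scores)

def drift_evaluation(model, X0, y0, X_test, y_test, metric, n_boot=200):
    """200 bootstrap fits on t0, each tested on a later season (t1 or t2)."""
    scores = []
    for b in range(n_boot):
        Xb, yb = resample(X0, y0, random_state=b)
        m = clone(model).fit(Xb, yb)
        scores.append(metric(y_test, m.predict(X_test)))
    return np.mean(scores)
```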

5 Results and discussion

Fig. 1 Drifts of top-6 important covariates for NWF dataset. P = privileged group, UP = unprivileged group, LBC = long before covid (\(t_0\)), PC = pre-covid (\(t_1\)), and PeC = peri-covid (\(t_2\))

Fig. 2 Drifts of top-6 important covariates for BAF dataset

Fig. 3 Drifts of top-6 important covariates for COMPAS dataset

Table 3 Configurations of fairness-aware models w.r.t. baseline models

Fig. 4 DCD versus unfairness for DIR+XGB model for NWF dataset. Results are similar for all models, metrics and datasets

Fig. 5 DCD versus unfairness for DIR+XGB model for BAF dataset

5.1 Relationship between differential covariate drift and fairness

We first calculate the Differential Covariate Drift (DCD), which is the difference between the covariate drift of the privileged group and that of the unprivileged group. We then compute the overall DCD as the signed and absolute averages of the DCDs of the top-6 most important covariates.Footnote 8 From here on, we simply refer to the overall DCD as DCD. To establish the relationship between DCD and fairness, we computed the rate of change of DCD with respect to fairness. Figures 4 and 5 show the relationship between DCD and fairness on the NWF and BAF datasets respectively. In both figures, \(\alpha\) and \(\mu\) represent the absolute and signed DCDs respectively, and the annotations represent the gradient from the origin (0,0). Furthermore, in the x–y plane, all the negative coordinates are against the unprivileged group and the positive coordinates are against the privileged group. In both figures, it can be seen that the relationship between DCD and fairness is not straightforward.
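
A sketch of how the DCD described above can be computed per covariate and then summarised, reusing the covariate_jsd helper sketched in Sect. 3.1; the sign convention (privileged drift minus unprivileged drift) and the pandas-based data layout are assumptions for illustration.

```python
import numpy as np

def differential_covariate_drift(df_t0, df_t1, a_t0, a_t1, covariates):
    """Signed and absolute average DCD between seasons t0 and t1.

    df_t0/df_t1 are DataFrames of the two seasons, a_t0/a_t1 the aligned
    protected-attribute arrays (1 = privileged, 0 = unprivileged), and
    covariates e.g. the top-6 most important covariates.
    """
    dcds = []
    for col in covariates:
        drift_priv = covariate_jsd(df_t0.loc[a_t0 == 1, col], df_t1.loc[a_t1 == 1, col])
        drift_unpriv = covariate_jsd(df_t0.loc[a_t0 == 0, col], df_t1.loc[a_t1 == 0, col])
        dcds.append(drift_priv - drift_unpriv)
    dcds = np.array(dcds)
    # Overall DCD summarised both as a signed and an absolute average.
    return dcds.mean(), np.abs(dcds).mean()
```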

Firstly, we observed that unfairness does not always flow in the direction of the DCD. In other words, if the unprivileged group, for example, has a higher DCD, it does not necessarily mean that the models will be biased in favour of the unprivileged group. For example, consider Fig. 4. In terms of the ERD metric, we observed that the model was biased in favour of the unprivileged group (ERD = 0.07) in the Peri-covid season. Consistently, we observed that the absolute DCD for Peri-covid (i.e., \(t_0\) vs. \(t_2\)) was also higher for the unprivileged group (DCD = 0.2). In contrast, for the same ERD metric in the Pre-covid season, we observed that the model was biased against the unprivileged group (ERD = \(-\)0.06) even though the absolute DCD for Pre-covid (i.e., \(t_0\) vs. \(t_1\)) was still higher for the unprivileged group (DCD = 0.15). We made similar observations across all metrics and all models—both baseline and fairness-aware.

Key takeaway: Across all datasets, fairness metrics and models, we show that positive (resp. negative) DCD does not imply positive (resp. negative) discrimination.

Secondly, we observed that the size of the DCD is not always proportional to the size of the unfairness. That is, given two DCDs, the lower DCD does not necessarily imply lower unfairness and vice versa. For instance, in Fig. 4, for the EO metric, the absolute DCD for Peri-covid (0.2) is higher than that for Pre-covid (0.15); however, the unfairness in the EO metric for Peri-covid (0.05) is lower than that for Pre-covid (0.07). Similar observations were made for different models across different datasets (e.g., see Fig. 5).

Key takeaway: Differential covariate drift may not always cause unfairness. This finding is consistent with Gardner et al. (2023) and contrary to that of Castelnovo et al. (2021).

Furthermore, we observed that the significance of the drift in the important covariates is indicative of its consequent impact on algorithmic unfairness. Consider Figs. 1 and 2, which show the covariate drifts across the privileged and unprivileged demographic groups for the NWF and BAF datasets respectively. In these figures, the red dashed lines represent the baseline drift above which covariate drifts are considered significant. The blue bars represent the covariate drifts at times \(t_0\) versus \(t_1\), and the orange bars represent the covariate drifts at times \(t_0\) versus \(t_2\). The top-6 covariates are shown in order of importance row-wise, from left to right. For the NWF dataset (Fig. 1), it can clearly be seen that the most important covariates, e.g., last login and quiz, drifted significantly. On top of that, the DCDs were relatively high—especially for the unprivileged group. Consequently, unfairness was high across all fairness metrics, as shown in Fig. 4; indeed, unfairness was as high as 0.28 (28%) in terms of the NPV-DIFF metric. In contrast, consider Fig. 2 for the BAF dataset. We observed that the top-4 most important covariates showed insignificant drifts. In addition, the DCDs were very small; in fact, the highest DCD for the BAF dataset was approximately \(-\)0.03 (3%), compared to DCDs as high as 0.2 (20%) for NWF. Consequently, we observed that the models trained on the BAF dataset displayed little unfairness. Similar observations were made across the other datasets.

Key takeaway: Significant drifts coupled with high DCDs in important covariates are likely to result in higher algorithmic unfairness.

5.2 Robustness of fairness algorithms to covariate drift

This section focuses on the ability of fairness-aware models to maintain their predictive accuracy and fairness guarantees in the presence of covariate drift. Figures 6 and 7 show the results of this experiment in terms of equalized odds and statistical parity for the BAF and ITF datasets respectively. In both figures, the green dashed line indicates the ideal fairness score. The gray bars indicate the fairness score when the training and test datasets are from the same distribution at \(t_0\), thus following the iid assumption. The purple and olive bars indicate the fairness scores for the training dataset at \(t_0\) and the test datasets at \(t_1\) and \(t_2\) respectively; these bars represent the non-iid setting. Negative fairness scores represent unfairness against the unprivileged group and positive scores indicate otherwise. Furthermore, the annotations on the bars are the average weighted F1-scores for each model on each test dataset. The bar heights are the averages of the particular fairness metric and the error bars are the standard errors of the means. We made the following observations.

Fig. 6 Robustness of baseline and fairness-aware models on the BAF dataset in terms of the EO metric

In both Figs. 6 and 7, we observed that none of the fairness-aware models we evaluated is consistently robust: none was consistently able to maintain its fairness and predictive capability in the face of covariate drift across different metrics and datasets. Nevertheless, we observed that certain algorithms performed better than others on certain datasets at times. For instance, on the BAF dataset, we observed that the PR algorithm consistently achieved near-perfect fairness in the EO metric with minimal drop in predictive accuracy, as shown in Fig. 6. In fact, except for the consistency metric, which we found to be insensitive to the fairness-aware models just as Majumder et al. (2021) did, we found that the PR algorithm consistently performed well in terms of all predictive accuracy and fairness metrics from \(t_0\) through to \(t_2\). It is worth noting that the BAF dataset is highly imbalanced in the target class and its demographic groups have equal base rates. In general, we observed that the in-processing approaches, specifically the PR algorithm, tend to be relatively robust compared to the pre- and post-processing approaches, which partly contradicts the findings of Islam et al. (2022). We made similar observations across other datasets and fairness metrics, except for the consistency metric, which was insensitive.

Key takeaway: Existing fairness-aware models are often not able to maintain their utility in the presence of covariate drift across different fairness metrics and datasets.

We also observed that the fairness of an algorithm is a function of time. We found that claims about the fairness of an algorithm based on a static snapshot of data—as is commonly done in the literature—can be misleading. We made this observation across datasets for different fairness metrics. For instance, consider Fig. 7 for the ITF dataset. In the fourth quadrant (clockwise), it can clearly be seen that DIR+XGB was the least fair model in terms of the SP metric at \(t_0\), and even at \(t_1\). However, at \(t_2\), DIR+XGB was the fairest. Similar trends can be observed in the other quadrants. Indeed, our results suggest that a supposedly unfair algorithm might actually be fair, using the same metric, in the same domain, but on a (test) dataset from a different time; the reverse of this logic also holds. Therefore, we posit that concluding that a fairness approach is fair or unfair should be contextualized.

Key takeaway: Unless claims of fairness or unfairness are contextualized, e.g., with respect to time, conclusions about a particular fairness algorithm’s “superiority” can be misleading.

Fig. 7 Robustness of baseline and fairness-aware models on the ITF dataset in terms of the SP metric

5.3 Impact of design choices on fairness

We made other interesting observations based on design choices such as hyper-parameter optimization and baseline model selection across all datasets and fairness metrics. For example, consider Fig. 6, which shows the fairness of both baseline and fairness-aware models in terms of the EO metric on the BAF dataset at times \(t_0\) to \(t_2\). Given the same pre-processed data from the DIR algorithm, DIR+XGB appears to be the fairest whereas DIR+RF appears to be the least fair at times \(t_0\), \(t_1\), and \(t_2\). In fact, DIR+RF was discriminatory against the unprivileged group whereas DIR+SVM appeared to discriminate in favour of the unprivileged group. We made similar observations for the post-processing approaches that were run on top of the ML models. Furthermore, whereas Islam et al. (2022) observed that post-processing approaches tend to be generally less impacted by the choice of ML model than pre-processing approaches, we observed that neither approach consistently has an edge over the other. Similar to Roth (2018), we instead found the in-processing approaches, particularly the PR algorithm, to be less impacted since they are tied to a single model by design.

Key takeaway: The choice of downstream ML model is a key determinant of the fairness and predictive accuracy of pre- and post-processing algorithms, and thus should be carefully considered.

5.4 Implications for practice

From our findings, we discuss some useful practical implications.

There is a need for continuous monitoring and evaluation of fairness algorithms. An implication of this is that, before classifying a fairness algorithm as fair and fit for deployment, or classifying an algorithm as unfair and to be discarded, fairness practitioners may have to continuously monitor the deployed algorithm, as the tides of fairness can turn quickly. A few recent works have started to explore this line of research (Henzinger et al., 2023a, b). This is an emerging research area that requires significant attention.

The flexibility (e.g., in the choice of base ML models) of pre- and post-processing approaches can be both a strength and a weakness. The ability of pre- and post-processing approaches to pair with any downstream model gives them the convenience of variety. Moreover, pre- and post-processing approaches allow seamless integration with popular and powerful ML libraries such as scikit-learn. However, the downstream models that are applied to the fair data are not designed with any fairness constraints; the fairness of the downstream models is therefore not guaranteed. Moreover, extra customization and hyper-parameter optimization may undo whatever fairness was incorporated into the fair data. Additionally, pre-processing approaches mostly aim to correct the ground truth labels; therefore, predictive error or accuracy-based fairness metrics cannot be catered for by pre-processing approaches (Islam et al., 2022). Fairness practitioners may address this weakness by ensembling pre- and in-processing approaches, where the flexibility of pre-processing is hybridized with the strictness of in-processing approaches.

The cause of algorithmic unfairness is a cocktail of latent variables. Differential covariate drift is, without a doubt, a source of algorithmic unfairness and thus should be addressed. However, wrongly attributing unfairness to differential covariate drift may cause the unfairness to persist even if the algorithm is monitored round-the-clock for drift and appropriately handled. The sources of algorithmic unfairness are multifaceted; therefore, fairness researchers may have to identify the source of unfairness on a case-by-case basis and address it accordingly instead of proffering a one-size-fits-all solution.

6 Conclusion

In this work, we investigated the relationship between differential covariate drift and algorithmic unfairness, and we further analyzed the robustness of 7 existing fairness algorithms in the face of covariate drift. We found that significant drifts in important covariates, in addition to higher differential covariate drifts, often lead to unfairness. We also found that none of the existing fairness-aware algorithms we evaluated is robust in the presence of covariate drift. Even more interestingly, in contrast to certain prior studies, we found no consistent correlation between the magnitude and direction of data distributional drift and the ensuing level and direction of unfairness. Based on these insights, the study offers policy implications related to the impact of data distributional drift on fairness algorithms. These implications are important for relevant stakeholders, offering valuable guidance on addressing and mitigating fairness issues in the presence of covariate drift.

Recently, some algorithms have been designed with distributional drift in mind (Chen et al., 2022; Du & Wu, 2021; Taskesen et al., 2020; Rezaei et al., 2021). In a future study, we intend to perform similar investigations on these algorithms to ascertain whether they are indeed robust to distributional drift as claimed. Furthermore, we intend to investigate an interesting line of research which suggests that the stability of fairness algorithms can be achieved via surrogate functions by reducing surrogate fairness gaps and variance (Yao et al., 2024).

https://shorturl.at/4ErID