This section describes the proposed framework for reasoning about adaptation of ML-based systems. We start by discussing the assumptions and design goals underlying the framework, along with the requirements it must satisfy to ensure those goals. Then, we focus on its novel aspects, namely (1) how to formally model ML components in order to reason about the impacts of ML mispredictions on system utility, (2) how to predict the costs/benefits of different adaptation tactics, and (3) how to integrate these predictions with the formal model.
4.2 Self-adaptation Manager’s Architecture
Similarly to previous work [6, 8, 49], the proposed framework leverages a self-adaptation manager that adopts a MAPE-K [37] architecture, as illustrated in Figure 1. The following paragraphs briefly describe each module of the architecture.
Environment. Generates events that constitute the inputs to the ML-based system, and hence to its ML component(s). These events may cause ML mispredictions and a decrease in system utility.
ML-based system. Implements the domain-specific tasks, for which it relies on at least one ML component and may rely on several other components, both ML and non-ML. ML-based systems include, for instance, financial fraud detection [3, 54] and machine translation systems [23, 59]. These systems normally rely on both ML and non-ML components to fulfill their objectives, namely detecting fraudulent transactions and translating sentences, respectively. A system component that incorporates an ML model is considered an ML component. Such components can be adapted via the execution of the tactic selected by the self-adaptation manager.
Self-adaptation manager. Provides the required functionalities for ML adaptation. The novelty of the proposed framework with respect to existing self-adaptation managers lies in the operation of the Analyze and Plan components:
Analyze—contains the cost/benefit predictors (referred to as Adaptation Impact Predictors (AIPs), introduced in Section 4.4), which leverage historical data of previous adaptations and of their impact on the ML component's quality (e.g., accuracy) to estimate its future performance depending on whether or not an adaptation is executed, and
Plan—comprises the adaptation planner, which relies on a formal model of the system being adapted and on a probabilistic model checker to synthesize the adaptation strategy that optimizes system utility.
The self-adaptation manager triggers the execution of any of the tactics available to adapt the system. Although the diagram in Figure 1 considers two example tactics (retraining ML components or NOP (not performing any adaptation)), as discussed in our prior work [13], the ML literature has proposed a number of approaches that can be leveraged as adaptation tactics, such as:
— Unlearning [9, 10]: useful when data in the ML model's knowledge base no longer represents the environment or contributes to increased model quality. Removes unwanted samples faster and more efficiently than retraining the model without those samples.
— Transfer learning [34, 51]: helpful if ML model \(M1\) is expected to have to deal with environmental conditions that a different ML model \(M2\) has dealt with. Leverages data from \(M2\) so that \(M1\) learns how to react and predict in the (expected) upcoming, previously unseen scenario or context.
— Human-based labeling [62]: convenient when the ML model has high uncertainty and the decision/prediction is critical or has high impact on the system. Offloads the decision to a human operator/user, who can typically offer more assurance (especially in the case of expert users/operators).
The self-adaptation manager should thus abide by the following key requirements:
R1—Provide the means to predict the effects of adapting and not adapting the model on its future accuracy;
R2—Include a way to characterize in a compact but meaningful way the error of an ML component;
R3—Be able to determine the impact of ML mispredictions on overall system utility.
The following sections describe the Analyze and Plan components, explaining how these requirements are met.
4.3 Formally Modeling ML Components
This section details how we formally model the ML components to capture their error in a compact but meaningful way (realizing R2) and its impact on system utility (realizing R3).
ML component definition. Depending on the domain of operation of a system, the most appropriate type of ML component varies. For instance, while in cyber-physical systems it is common to see reinforcement learning ML components [38], in the context of fraud detection [44, 54] or medical diagnosis systems [25], offline trained ML components (e.g., decision trees or neural networks) are more common. We focus our analysis on offline trained ML components and specifically on ML classifiers. (Note that it is possible to transform a regressor into a classifier by discretizing the target domain, although this implies introducing an intrinsic prediction error due to the chosen discretization granularity.)
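As a minimal illustration of the regressor-to-classifier transformation mentioned above, the following Python sketch discretizes a regressor's continuous output into classes (the risk-score regressor and bin edges are hypothetical, chosen only for illustration):

```python
import numpy as np

def regressor_to_classifier(predict, bin_edges):
    """Turn a regressor into a classifier by discretizing the target domain
    into bins; a finer granularity reduces the intrinsic discretization error
    at the cost of more classes to predict."""
    def classify(x):
        return int(np.digitize(predict(x), bin_edges))
    return classify

# Hypothetical regressor predicting a risk score, discretized into
# classes low (0), medium (1), and high (2) at edges 0.3 and 0.7.
classify = regressor_to_classifier(lambda x: 0.4 * x, bin_edges=[0.3, 0.7])
classify(0.5)  # score 0.2 -> class 0 (low)
classify(2.0)  # score 0.8 -> class 2 (high)
```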
ML component state. Since the goal of the proposed framework is to model ML components, and in particular the impact of their mispredictions on system utility, we require a way to evaluate their classification performance. Classification models are typically evaluated based on a popular construct known as the confusion matrix [61], which provides a statistical characterization of the model's quality by describing the distribution of its misclassification errors. For a classification problem with N classes, the confusion matrix normalized by rows \(\mathcal {C}\) (the rows represent the actual sample class and sum to 1) contains, in each cell \((i,j)\), the ratio of samples of class i (ground truth) that have been classified as being of class j (prediction). For the simpler case of binary classification problems, the confusion matrix is reduced to a \(2\times 2\) matrix where each cell specifies the following:
True-Positive Rate (TPR)—percentage of examples of the positive class that the model classified as such;
True-Negative Rate (TNR)—percentage of examples of the negative class that the model classified as such;
False-Positive Rate (FPR)—percentage of examples of the negative class that the model classified as positive;
False-Negative Rate (FNR)—percentage of examples of the positive class that the model classified as negative.
This representation allows for extracting further error metrics such as the model's accuracy and F1-score.
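As an illustration, the following Python sketch row-normalizes a made-up binary confusion matrix of raw counts and derives the rates and error metrics discussed above:

```python
import numpy as np

def row_normalize(counts):
    """Row-normalize a confusion matrix of raw counts so each row sums to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical binary confusion matrix of raw counts:
# rows = ground truth (negative, positive), columns = prediction.
counts = np.array([[90, 10],   # 90 true negatives, 10 false positives
                   [ 5, 45]])  # 5 false negatives, 45 true positives
C = row_normalize(counts)
tnr, fpr = C[0, 0], C[0, 1]
fnr, tpr = C[1, 0], C[1, 1]

# Further error metrics extracted from the same representation.
accuracy = np.trace(counts) / counts.sum()
precision = counts[1, 1] / counts[:, 1].sum()
recall = tpr
f1 = 2 * precision * recall / (precision + recall)
```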
The row-normalized confusion matrix has the following relevant properties:
(1) generic—can be computed for different ML models (e.g., random forest or neural network);
(2) tractable—is compact and abstract enough to be encoded into a formal model;
(3) expressive—captures the predictive performance of the ML model and allows for computing several error metrics; and
(4) extensible—can be used to model the impacts of executing different adaptation tactics [13] (e.g., retrain, NOP) by updating its cells. These properties make it a natural fit to model ML components and hence realize R2. Depending on the predictive models used by the model checker to estimate the evolution of the confusion matrix (or the cost of executing an adaptation tactic), the state of the ML component can be extended with additional variables, e.g., variables that describe the expected data shifts on the input or output.
ML component interface. Since the framework aims to adapt offline-trained ML components, we define the base interface as being composed of the methods query and retrain. The ML component interface can be extended to incorporate additional adaptation tactics (e.g., transfer learning [51] or unlearning [9, 10]), if those are indeed available for the managed system. As the name suggests, retrain models the execution of a retrain procedure of the ML component by triggering an update of its row-normalized confusion matrix. The techniques employed to predict how the confusion matrix of an ML component evolves as a result of a retrain procedure are described in Section 4.4. The query method models the process of asking the ML component for predictions for a set of inputs. Specifically, this method should abstract over the concrete input/output values of the samples and of the predictions, requiring only the total number of inputs for the ML component and the expected distribution of (real) output classes \({\bf O}\) (given by the probability \(p_i\) for an input to be of class i, for all classes \(i\in [1,N]\)). The method returns a (non-normalized) confusion matrix \(\mathcal {C}^*\) that reports in position \((i,j)\) the (absolute) number of inputs of class i that are classified as of class j by the model. \(\mathcal {C}^*\) can be simply computed by multiplying each row i of the normalized confusion matrix \(\mathcal {C}\) by \(p_i\) and by the total number of inputs. The interface can be extended with further adaptation tactics, allowing the framework to be tailored to specific adaptation scenarios.
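A minimal sketch of the query computation follows; the function and variable names are ours, not part of the framework's actual interface, and we assume the total number of inputs is supplied alongside the class distribution:

```python
import numpy as np

def query(C, p, n_inputs):
    """Compute the non-normalized confusion matrix C* expected for a batch:
    C is the row-normalized confusion matrix, p the expected (real) class
    distribution, and n_inputs the total number of inputs in the batch."""
    C = np.asarray(C, dtype=float)
    p = np.asarray(p, dtype=float)
    # Row i of C* = (expected number of class-i inputs) * (row i of C).
    return n_inputs * p[:, None] * C

# Hypothetical binary classifier: rows sum to 1, 70% of inputs are class 0.
C = np.array([[0.9, 0.1],
              [0.2, 0.8]])
C_star = query(C, p=[0.7, 0.3], n_inputs=1000)
# e.g., expected correctly classified class-0 inputs: 1000 * 0.7 * 0.9 = 630
```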
Dealing with uncertainty. As shown by recent work [7, 33, 49], capturing uncertainty and including it when reasoning about adaptation contributes to improved decision-making. To capture uncertainty, we leverage the probabilistic framework proposed by Moreno et al. [48], which makes it possible to account for different sources of uncertainty in the system (e.g., uncertainty on the effects of an adaptation tactic or on the input class distribution) and which generates memoryless strategies (strategies that depend only on the current state of the system). This framework accounts for uncertainty by modeling each source of uncertainty as a probability tree that is approximated via the Extended Pearson-Tukey (EP-T) [36] three-point approximation. The current state of the source of uncertainty is represented by the root node of the probability tree, and the child nodes are its possible realizations.
4.4 Predicting the Effects of Adapting and Not Adapting
A key requirement of our framework is the ability to predict the costs and benefits of executing adaptation tactics on the ML components (requirement R1). For this purpose, the proposed framework associates with each adaptation tactic a dedicated component, which we call the AIP. The AIP is in charge of predicting (1) the adaptation tactic’s cost that is charged to the system utility and (2) the impact of the adaptation on the future quality of the ML component. We also include an adaptation tactic corresponding to performing no changes to the ML component (NOP). While the AIP for tactic NOP always predicts zero costs (this tactic inherently has no cost), its model quality predictor captures the evolution of the model’s performance if no action is taken, e.g., the possible degradation of accuracy of the ML component due to data shifts. Overall, this approach allows the model checker to quantify the impact of different adaptation tactics on system utility and reason about their cost/benefit tradeoffs.
We focus on the problem of how to predict the performance evolution of the ML component and describe, in the next section, how we tackle the problem of implementing AIPs for the retrain and NOP tactics for generic ML components. Indeed, for tactics such as retrain, the problem of estimating cost has already been investigated in the systems community. The literature has shown that data-driven approaches [11] based on observing previous retraining procedures, possibly mixed with white-box methods [66], can generate accurate predictive models of the retrain cost.
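As a minimal sketch of such a data-driven cost model, the snippet below fits a simple linear model to made-up observations of past retrain procedures (a plain least-squares fit stands in for the richer approaches cited above):

```python
import numpy as np

# Hypothetical observations from past retrain procedures:
# training-set size (thousands of samples) vs. measured retrain time (minutes).
sizes = np.array([10, 20, 40, 80, 160])
times = np.array([1.1, 2.0, 4.2, 7.9, 16.3])

# Data-driven cost model: fit a line, then predict the cost of a future
# retrain, here at a hypothetical 100k training samples.
slope, intercept = np.polyfit(sizes, times, deg=1)
predicted_cost = slope * 100 + intercept
```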
Predicting future quality of ML components. Given the reliance on a row-normalized confusion matrix \(\mathcal {C}\) to characterize the performance of ML components, predicting their performance evolution requires estimating how \(\mathcal {C}\) will evolve in the future, e.g., due to shifts affecting the quality of the current model or as a consequence of retraining the model to incorporate newly available data.
The proposed method abstracts over the specific adaptation \(a()\) by modeling it as a generic function \(\mathcal {M}^{\prime } \leftarrow a(\mathcal {M}, \mathcal {I}, \mathcal {N})\) that produces a new ML model \(\mathcal {M}^{\prime }\) and takes as input (1) model \(\mathcal {M}\) prior to the execution of the adaptation; (2) data \(\mathcal {I}\), used to generate model \(\mathcal {M}\); and (3) new data, \(\mathcal {N}\), that became available since the last adaptation, e.g., by deploying the model in production and gathering new samples and corresponding ground-truth labels. We assume that both \(\mathcal {I}\) and \(\mathcal {N}\) contain ground-truth labels. Additionally, we assume that \(\mathcal {M}\) and \(\mathcal {M}^{\prime }\) are generic supervised ML models that can be queried and return predictions for the input samples. These two assumptions make it possible to determine the confusion matrices of models \(\mathcal {M}\) and \(\mathcal {M}^{\prime }\) at any future time interval, since their predictions can be compared with the ground-truth labels.
We seek to build blackbox regressors (e.g., random forests or neural networks) that, given model \(\mathcal {M}\) obtained at time 0 with dataset \(\mathcal {I}\), and given new data \(\mathcal {N}\) available at time \(t\gt 0\), predict the confusion matrices of both models (\(\mathcal {M}\) and \(\mathcal {M}^{\prime }\)) at time \(t+k\), where \(k\gt 0\) is the prediction lookahead window.
Adaptation impact dataset. In order to train such a blackbox regressor, we build an Adaptation Impact Dataset (AID) by systematically simulating the execution of the adaptation tactic using production data at different points in time. This allows for gathering observations characterizing the execution of the adaptation tactic in different environmental contexts, such as (1) different sets of data used to adapt the model, (2) variation in the time passed since the last execution of the tactic, and (3) different ML performance before and after adaptation.
The first step of the procedure consists of monitoring model \(\mathcal {M}_0\) of an ML component in production over T time intervals. During this period, given the absence of AIPs, we assume that no adaptation is executed. Next, we deploy \(\mathcal {M}_0\) on a testing platform (so as not to affect the production environment) and systematically apply adaptation \(a()\) at each time interval \(i\gt 0\), i.e., \(a(\mathcal {M}_0, \mathcal {I}_0, \mathcal {N}_i)\). This yields a new model \(\mathcal {M}_i\), which we evaluate at every future time interval \(i\lt j\le T\), obtaining the corresponding confusion matrices, noted as \(\mathcal {C}_{i}(j)\). Overall, this procedure yields T models, resulting from the adaptation of \(\mathcal {M}_0\) at different time intervals, and produces \(T\cdot (T-1)\) measurements of the confusion matrices at times \(j\gt i\). This testing platform is required to support the data pre-processing pipeline, model building, and inference stages of the ML components targeted by the adaptation. Such a testing platform is then leveraged by the framework to create and evaluate different versions of these components, eschewing the need to reproduce the full production system, comprising the whole set of ML and non-ML components. We expect such testing platforms to be normally available due to the common DevOps/MLOps practice [60] of testing ML models' quality prior to their actual deployment in production.
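The data-collection loop can be sketched as follows. The adapt and evaluate callables are placeholders for the actual retraining and evaluation procedures, and we assume (one plausible bookkeeping) that each pair \(i\lt j\) records the confusion matrices of both the adapted model and the original \(\mathcal {M}_0\), which accounts for the \(T\cdot (T-1)\) measurements:

```python
def build_aid_measurements(T, M_0, adapt, evaluate):
    """Collect AID measurements by simulating the adaptation tactic.
    adapt(M_0, i) stands in for a(M_0, I_0, N_i) and returns model M_i;
    evaluate(model, j) returns that model's confusion matrix at interval j."""
    measurements = []
    for i in range(1, T + 1):          # adapt M_0 at every interval i > 0
        M_i = adapt(M_0, i)
        for j in range(i + 1, T + 1):  # evaluate at every later interval j
            # Record matrices of both the adapted and the original model,
            # so each (i, j) pair contributes two measurements.
            measurements.append((i, j, evaluate(M_i, j), evaluate(M_0, j)))
    return measurements

# Toy stand-ins: the loop structure, not real training, is the point here.
measurements = build_aid_measurements(
    T=4,
    M_0="M_0",
    adapt=lambda base, i: f"M_{i}",
    evaluate=lambda model, j: (model, j),
)
# T = 4 gives 6 (i, j) pairs and hence 12 = T*(T-1) confusion matrices.
```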
For each of the aforementioned \(T\cdot (T-1)\) measurements, we generate an AID entry, \(e_{i,j,k}\), which describes the quality at time \(j+k\) of model \(\mathcal {M}_j\) obtained by executing \(a(\mathcal {M}_i, \mathcal {I}_i, \mathcal {N}_j)\) at time j on model \(\mathcal {M}_i\), where \(\mathcal {I}_i\) denotes the data used at time i to generate model \(\mathcal {M}_i\), and \(\mathcal {N}_j\) the new data gathered from time i until time j. Each entry \(e_{i,j,k}\) has as target variables the \(N^2-N\) independent entries of the confusion matrix at time \(j+k\) of model \(\mathcal {M}_j\) and stores the following features:
— Basic Features: provide basic information on (BF1) the amount of data (i.e., number of examples) used to generate model \(\mathcal {M}_i\), i.e., \(\mathcal {I}_i\), and gathered thereafter, i.e., \(\mathcal {N}_j\); (BF2) the predictive quality of the model shortly after its generation and at the present time; (BF3) the time elapsed since the last execution of the adaptation tactic, i.e., \(j-i\); (BF4) the ground-truth distribution of classes at the time model \(\mathcal {M}_i\) was generated and at the present time.
— Output Characteristics Features: describe variations in the distribution of the output of models \(\mathcal {M}_i\) and \(\mathcal {M}_j\). They also include the distribution of the uncertainty of the models' predictions. This feature is included only when the ML model provides information regarding the uncertainty of a prediction, which is usually the case for commonly employed ML models like random forests, Gaussian processes, and ensembles.
— Input Characteristics Features: aim to capture variations in the distributions of the features of datasets \(\mathcal {I}_i\) and \(\mathcal {N}_j\). The current version of the framework computes, for each feature f, the Pearson correlation coefficient (PCC) between its values in \(\mathcal {I}_i\) and \(\mathcal {N}_j\). However, other metrics could also be used to detect shifts in the input distributions, e.g., different distributional distances like the Jensen-Shannon divergence (JSD) [46] or the Kolmogorov distance [45].
Overall, the AID can be seen as composed of pairs of features, where each pair describes a specific "characteristic" of the data or model at two different points in time, e.g., the amount of data available at times i and j or the distribution of classes predicted at time \(j+k\) by models \(\mathcal {M}_i\) and \(\mathcal {M}_j\). The last step of the process consists of extending the AID by encoding the variation of each feature as follows: (1) for scalar features (e.g., amount of data) we encode their variation using the ratio and difference and (2) for features described via probability distributions (e.g., prediction uncertainty) we quantify their variation using the JSD [46] (inspired by previous work [54]), which yields a scalar measurement of the similarity between two probability distributions. This generic methodology can also be applied to the case of the NOP tactic. In this case, the dataset describes how the accuracy of a model originally obtained at time i will evolve at time \(j+k\), based on the information available at time j.
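The variation-encoding step can be sketched as follows (the JSD implementation and helper names are ours; the base-2 logarithm makes the JSD range over [0, 1]):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two discrete distributions;
    eps guards against log of zero."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        return float(np.sum(a * np.log2((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def encode_variation(x_i, x_j):
    """Encode the variation of a paired AID feature between times i and j:
    scalars via ratio and difference, distributions via the JSD."""
    if np.isscalar(x_i):
        return {"ratio": x_j / x_i, "diff": x_j - x_i}
    return {"jsd": jsd(x_i, x_j)}

encode_variation(1000, 1500)              # scalar feature: ratio and diff
encode_variation([0.5, 0.5], [0.5, 0.5])  # identical distributions: JSD ~ 0
```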
Building the AIPs. We exploit the AID dataset to train a set of independent AIPs, which can be simple linear models or blackbox predictors such as random forests or neural networks. Each AIP is trained to predict the value of a different cell of the confusion matrix. Given an n-ary classification problem, we have \(n^2-n\) independent values for the corresponding confusion matrix, since each row must sum to 1. For the case of binary classification, where \(n=2\), it is sufficient to predict the values of the two elements on the diagonal, which, being in different rows, are not subject to any mutual constraint. For the general case of \(n\gt 2\), it is necessary to ensure that the predictions of the AIPs targeting different cells of the same row sum to 1. This can be achieved by using a softmax function [4] to normalize the predictions generated by the AIPs into a probability distribution.
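The softmax normalization step for \(n\gt 2\) can be sketched as follows (the raw per-cell AIP outputs are made up):

```python
import numpy as np

def normalize_rows(raw_predictions):
    """Normalize the per-cell AIP predictions for each row of an n-class
    confusion matrix into a probability distribution via a softmax."""
    raw = np.asarray(raw_predictions, dtype=float)
    exp = np.exp(raw - raw.max(axis=1, keepdims=True))  # numerically stable
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw outputs of the independent AIPs for each row (n = 3).
raw = np.array([[2.0, 0.1, 0.3],
                [0.2, 1.8, 0.4],
                [0.1, 0.3, 2.2]])
C_pred = normalize_rows(raw)
# Each row of C_pred now sums to 1, as required of a row-normalized matrix.
```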
Integrating the AIPs in the formal model. As for the integration of the AIPs in the formal model, which is checked via a tool such as PRISM, a key practical issue is that these tools do not typically allow for interacting with external processes (which could be used to encapsulate the implementation of the AIPs) during model analysis. Such interaction would be beneficial when the model checker is used to reason over a look-ahead horizon of \(l\gt 1\) time intervals. In such a case, up to \(a^l\) possible adaptation strategies are generated, where a is the number of adaptation tactics available, thus requiring up to \(l\cdot a^l\) predictions.
This problem can be circumvented by integrating the AIPs directly into the formal model to be checked. This approach is reasonable if the AIPs are implemented via simple methods, such as linear models, but is cumbersome and impractical for more complex models, such as neural networks. An alternative approach, which is the one currently implemented in our framework, is to precompute all the predictions that will be required during the model checking phase and provide them as input constants to the model checker tool. This approach is viable only when the lookahead window and the set of available adaptations are small, but it allows us to use arbitrary external predictors.
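One possible realization of this precomputation step is sketched below: every prediction the model checker may need over the horizon is computed up front and rendered as PRISM-style constants (the constant-naming scheme and the toy predictor are our assumptions, not the framework's actual encoding):

```python
def precompute_constants(aip_predict, tactics, horizon):
    """Precompute every AIP prediction needed over the look-ahead horizon
    and render them as PRISM constant declarations."""
    lines = []
    for step in range(1, horizon + 1):
        for tactic in tactics:
            value = aip_predict(tactic, step)
            lines.append(f"const double {tactic}_tpr_t{step} = {value:.4f};")
    return "\n".join(lines)

# Toy predictor: retraining keeps TPR at 0.9; NOP degrades it by 0.05/step.
text = precompute_constants(
    aip_predict=lambda tactic, step: 0.9 if tactic == "retrain" else 0.9 - 0.05 * step,
    tactics=["retrain", "nop"],
    horizon=2,
)
```

The resulting text can be prepended to (or passed alongside) the PRISM model, so the checker consumes the AIP predictions as ordinary constants without any external interaction.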