research-article

Open access

Understanding Any Time Series Classifier with a Subsequence-based Explainer

Authors:

Fosca GiannottiAuthors Info & Claims

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 2

Article No.: 36, Pages 1 - 34

https://doi.org/10.1145/3624480

Published: 13 November 2023 Publication History

PDF eReader

Abstract

The growing availability of time series data has increased the usage of classifiers for this data type. Unfortunately, state-of-the-art time series classifiers are black-box models and, therefore, not usable in critical domains such as healthcare or finance, where explainability can be a crucial requirement. This paper presents a framework to explain the predictions of any black-box classifier for univariate and multivariate time series. The provided explanation is composed of three parts. First, a saliency map highlighting the most important parts of the time series for the classification. Second, an instance-based explanation exemplifies the black-box’s decision by providing a set of prototypical and counterfactual time series. Third, a factual and counterfactual rule-based explanation, revealing the reasons for the classification through logical conditions based on subsequences that must, or must not, be contained in the time series. Experiments and benchmarks show that the proposed method provides faithful, meaningful, stable, and interpretable explanations.

1 Introduction

The increasing availability of high-dimensional data stored in the form of time series such as electrocardiogram records, stock indices, motion sensors data, and so on, contributed to the diffusion of a wide range of time series classifiers [6, 80] in a variety of essential applications, ranging from the identification of stock market anomalies to the automated detection of heart diseases. The rising interest in this topic is confirmed by two surveys [6, 80] in which different kinds of univariate and multivariate time series classification models are tested and compared. The most common baseline, to which all other models are compared, is k-Nearest Neighbors (KNN) [59, 85, 97], usually paired with the Euclidean distance, Dynamic Time Warping (DTW) [72] or others [8, 98]. Transformation-based classifiers are also becoming very relevant, extracting different kinds of features from entire time series like BOSS [81] and WEASEL+MUSE [82], or sub-intervals like CIF [67], RISE [26] and FIT [60]. Further, the surveys focus both on traditional ensemble-based approaches like HIVE-COTE [63] and more recent deep learning-based models like ResNet [39] and TapNet [96]. Rocket [17], and its faster version MiniRocket [18], are regarded as the current best-performing state-of-the-art models. They use random convolutional kernels to quickly classify univariate and multivariate time series and have the best trade-off between speed and accuracy. The drawback of most of these models lies in their complexity, which makes them black-boxes and causes the non-interpretability of the internal decision process for humans [22]. However, when it comes to making high-stakes decisions, such as clinical diagnosis, the explanation aspect of the models used by Artificial Intelligence (AI) systems becomes a critical building block of a trustworthy interaction between the machine and human experts. Meaningful explanations [75] of time series classification would augment the cognitive ability of domain experts, such as medical doctors, to make informed and accurate decisions and better support AI accountability and responsibility in the decision-making.

A line of research exploring interpretable, transparent-by-design, and efficient time series classifiers is based on shapelets [95]. Shapelet decision trees [95] and shapelet transforms [62] extract shapelets from the time series of the training set by selecting subsequences with high discriminatory power and exploiting them for the classification process. Alternative approaches for mining discriminatory subsequences are the Matrix Profile [14] and SAX approximation [57, 83]. Unfortunately, in terms of accuracy and stability, all such methods lag far behind black-box time series classifiers, particularly in the presence of noisy data [25].

In this paper, we investigate the problem of black-box explanation for time series classifiers. We propose lasts(local agnostic subsequence-based time series explainer), an explainable AI (XAI) method unveiling the logic of any black-box classifier operating on time series. Given a time series X labeled with class \({\hat{y}}\) by a black-box b, lasts returns an explanation e composed of three parts that reveal the reasons for the opaque model’s decision via different representations. First, a saliency-based explanation highlights the most important parts of the time series, which are responsible for driving the black-box towards the outcome \({\hat{y}}\) or away from it. Second, an instance-based explanation composed by a set of exemplar and counterexemplar time series. Exemplars are instances classified with the same label of X and highlight common parts responsible for the classification. On the other hand, counterexemplars are instances similar to X but with a different label and provide evidence of how the time series should be “morphed” for being classified with a different label. Third, a factual and counterfactual rule-based explanation reveals the reasons for the classification through logic conditions expressed as subsequences that must (or must not) be contained in the time series in order to obtain (or not) the outcome \({\hat{y}}\). We emphasize the importance counterfactual components of the explanation of lasts, i.e., counterexemplars and counterfactual rules, which are becoming essential ingredients in XAI methods [3, 12, 89]. While factual, direct explanations such as decision rules [53], and feature importance [64, 78], are crucial for understanding the reasons for a certain prediction, a counterfactual reveals what should change in a given instance to obtain a different classification outcome [89]. Counterfactuals are useful because they facilitate reasoning about the cause-effect relationships between observed features and classification outputs.

This work generalizes the approach proposed in [37] as a modular framework that is able to explain any black-box classifier for univariate and multivariate time series, also extending it in several ways.¹ In line with recent studies on XAI [64, 78], we tackle the time series black-box outcome explanation problem by deriving a local explanation to understand the behavior of the black-box in the neighborhood of the instance to explain [36]. Inspired by [34, 37], we develop a unified, modular framework and overcome many state-of-the-art limitations. First, we propose a novel neighborhood generation strategy (cfs) to generate consistent synthetic time series. For this purpose, we exploit autoencoders [40] for generating, encoding, and decoding a local neighborhood composed of exemplar and counterexemplar instances. Second, we present a novel way of generating a saliency map that does not require performing arbitrary time series segmentations, usually needed in competitor approaches [38, 64]. This saliency map provides a quick and immediate assessment of the crucial points of the time series for the classification, allowing a better understanding of the most critical observations. Third, we use the shapelet [62] and SAX [61] transformations paired with a local surrogate tree for designing meaningful rule-based explanations, useful and easy to understand [33], which are based on interpretable time series subsequences.

Thus, the explanation e returned by lasts has the unique characteristic of being simultaneously saliency-based, rule-based, and instance-based. Further, it explains the black-box decision by exploiting three different data granularities, i.e., time points, subsequences, and entire time series, contrary to other state-of-the-art XAI approaches, which are typically limited to one of the three [87]. The benefit of having heterogeneous explanations is that they can be adopted in various contexts to convey the reasoning behind a black-box decision to different types of users through multiple alternative forms. Figure 1 shows an illustrative example of the explanations provided by LASTS. First, the saliency-based explainer highlights the most significant or relevant points in a time series, providing a high-level understanding of key elements within the data. Such a tool could be useful for decision-makers or executives who need to understand key points of a time series prediction but do not have the time or need to investigate the finer details, such as project managers attempting to understand key progress markers or setbacks in a project timeline. Exemplars and counterexemplars are used in the instance-based explanation to present a complete data picture. They may be most useful for users seeking a comprehensive understanding of the data or experimenting with “what-if” scenarios: for example, strategists and planners interested in what happened and what could have happened in different circumstances. Finally, rule-based explainers provide explanations based on the presence or absence of specific subsequences in the data. Deciphering these patterns can have practical implications. For example, compliance officers tasked with detecting fraudulent activities in financial transactions must recognize specific patterns that indicate irregularities. Similarly, engineers monitoring production processes may look for operational anomalies that have specific shapes.

Fig. 1.

It is important to highlight the inherent heterogeneity of these explanations, each deriving from different methodologies and each offering unique insights into the time series data. This is both a strength and a challenge. On the one hand, it is a strength because each type of explanation complements the others by highlighting different aspects of the time series data. On the other hand, it poses a challenge as not all explanations are equally suited to represent all aspects of the data. For instance, saliency-based explainers are excellent at spotlighting the most significant points within a time series, providing a condensed understanding of key elements. However, their representation method, focused on what is present in the data, limits their ability to signify the importance of patterns not contained within the data. Conversely, due to their focus on the presence or absence of specific sequences, subsequence-based explainers can effectively highlight both existing patterns and the significance of absent ones, thus offering a different lens through which to view the data. Hence, a comprehensive understanding of the time series data necessitates amalgamating these heterogeneous explanations. By employing a diverse set of explainer types, the strengths of one can compensate for the limitations of others, ensuring a complete and nuanced understanding of the time series data. Each type of explanation, therefore, is not just a standalone analytical tool but an integral component of a comprehensive explanatory system. Overall, the effectiveness and relevance of each explainer type are contingent on multiple factors, such as the complexity of the time series data, the specificity of the questions posed, and the user’s technical comprehension and individual needs. The primary advantage of using LASTS, as compared to separately adopting competitor explainers, is that all explanations, even though they are produced by different approaches, come from the same original source: the autoencoder.

To the best of our knowledge, lasts is the only model-agnostic approach able to return a set of heterogeneous explanations, offering an in-depth understanding of the local decision of the black box. We present a wide experimentation, testing different alternatives to the proposed approach. We empirically demonstrate that lasts provides faithful, stable, useful, and really understandable explanations by benchmarking it against state-of-the-art competitors on 15 time series datasets [64, 79].

To summarize, the main contributions of this work are:

•

a local, model agnostic, framework for explaining time series classifiers: lasts can explain the prediction of any univariate and multivariate time series classifier;

•

a heterogeneous set of explanations for time series classification: a saliency map, subsequence-based factual and counterfactual rules, exemplar and counterexemplar time series instances;

•

a quantitative evaluation on several datasets: lasts is evaluated on univariate and multivariate datasets and against state-of-the-art competitors, demonstrating the effectiveness of each part of the explanations.

The rest of the paper is organized as follows. Section 2 discusses related works. Section 3 formalizes concepts used to design lasts, which is then described in Section 4. Section 5 presents the experiments. Section 6 summarizes our contribution, its limitations, and future research directions.

2 Related Work

Research on black-box explanation has recently received much attention [1, 36]. This growing interest is driven by the idea of incorporating opaque classifiers accompanied by explainers into AI systems, allowing the coexistence of high performance and explainability [68]. XAI approaches can be categorized according to many aspects [9]. First, a common taxonomy differentiates between ante-hoc, i.e., directly interpretable white-box models, and post-hoc explainability approaches, which explain black-box models after training without changing their underlying structure. XAI approaches can be further divided into model-specific if they exploit knowledge of the internal structure of the black-box and model-agnostic if they do not. Moreover, local XAI approaches provide explanations for a specific instance of the dataset, while global approaches explain the logic of the black-box as a whole. Finally, XAI approaches can be categorized depending on their explanation output. Many kinds of explanations exist, depending on the specific task, the problem domain, and most of all, the kind of data under analysis [9]. In our setting, i.e., time series classification, explanations can be divided into time-step-based, e.g., saliency maps, when they focus on the importance of each observation towards the classification output, subsequence-based when they explain the classification outcome using discriminative patterns of the time series, or instance-based, e.g., prototypes or counterfactuals, when they use whole time series to exemplify some salient property of the data.

XAI for univariate and multivariate time series data is a rising topic in the literature, with a plethora of approaches tackling the problem from different points of view. In the following, we focus on overviewing the most pertinent methods related to our work, i.e., post-hoc, model-agnostic, and local XAI approaches. On this point, one cannot fail to mention lime and shap. lime [78] randomly generates instances “around” the instance to explain, creating a local neighborhood. Then, it trains a linear model on the neighborhood labeled with the black-box. The explanation consists of the feature importance of the linear model. shap [64] connects game theory with local explanations and overcomes lime’s limitations, exploiting the Shapley values of a conditional expectation function of the black-box providing the unique additive importance for each feature. Methods like lime and shap are thought for tabular data. However, if naively applied to time series classifiers, they can provide a time-point-based explanation for time series classifiers, considering each point as a separate feature [4]. Unfortunately, this procedure can work only on toy examples due to computation costs and given that time series classifiers are usually robust w.r.t. perturbations on single observations. Instead, a more involved approach is first to segment the time series and then use each segment as a feature [37, 38, 70]; however, the choice of the segmentation and type of perturbation is completely arbitrary. This paper proposes a saliency-based explanation that does not require any segmentation or discretization of the dataset analyzed.

Regarding subsequence-based explanations, the vast majority of XAI approaches are model-specific and usually extract patterns in the form of shapelets [95], or symbolic subsequences [61]. These subsequences with high discriminatory power are then used to transform the time-series dataset into a simpler representation [62] which can be paired with white-box classifiers such as decision trees or logistic regressors, guaranteeing an explanation for the decision. Given their simplicity, these approaches are generally lacking in terms of accuracy. There are many works in the literature aimed at improving the efficiency of the subsequence search using various optimization techniques [41, 42, 47], or by learning subsequences via gradient descent [30], or by first discretizing the time series using SAX [61] to speed up the search of greedy algorithms [56, 57]. Model-specific methods like ADSNs [65] and XCNN [91] take a different approach instead, directly generating patterns with adversarial learning to ensure their realism via a discriminator network. We reiterate that, differently from the aforementioned methods, our approach is model agnostic. We use subsequences to generate simple decision rules that explain the output of any black-box in terms of logical conditions. To the best of our knowledge, there are only a handful of XAI approaches that output rules as explanations, such as Anchor [79], and RuleMatrix [69], but none of them are tested on time series data.

Instance-based explanations are becoming more and more common in the literature. They exemplify a model’s decision by providing salient and important instances that can directly explain the black-box decision, i.e., prototypes, or indicate the minimal changes that result in a different classification outcome, i.e., counterfactuals. Again, most of these approaches are model-specific, like [45] which can generate counterfactuals for k-nearest neighbor and Random Shapelet Forest [44] classifiers, or CEM [19], which uses an LSTM and a fully connected network to find the minimal perturbations that change the model’s prediction. To the best of our knowledge, the only model-agnostic instance-based approaches for time series are native guides [16], LatentCF++ [92] and CoMTE [5]. Native guides extracts potential counterfactuals starting from the original dataset and adapts them to generate novel ones. LatentCF++ uses generative models to create counterfactual for Convolutional and LSTM networks. The main drawback of these approaches is that they only work on univariate time series data, contrary to our proposal. CoMTE can provide counterfactual explanations for multivariate time series classification by computing the minimal number of substitutions in order to change the predicted class of the original time series. Different from our approach, CoMTE can explain only black-boxes that return a prediction as class probability, thus precluding its applicability on widely used models like Rocket [17] and Minirocket [18] which are commonly combined with a Ridge classifier. Also, in our approach, the perturbations are performed in a so-called latent space to ensure the generation of a set of consistentcounterfactual instances.

3 Setting the Stage

In this paper, we address the black-box outcome explanation problem [36] in the domain of time series classification. We keep our paper self-contained by summarizing the key concepts necessary to comprehend the proposed explanation method.

3.1 Time Series

A time series signal is defined as follows:

Definition 3.1 (Time Series Signal).

A time series signal (or channel, dimension) \({\bf x}\) is a set of m real-valued observations sampled at equal time intervals, \({\bf x} = \lbrace x_{1}, \dots , x_{m}\rbrace \in \mathbb {R}^{m}\).

A set of one or more time series signals forms a time series:

Definition 3.2 (Time Series).

A time series X is a set of d signals, \(X = \lbrace {\bf x}_{1}, \dots , {\bf x}_{d}\rbrace \in \mathbb {R}^{m\times d}\).

When \(d=1\), the time series is univariate, while if \(d\gt 1\) the series is multivariate. A Time Series Classification (TSC) dataset is a set of time series with a vector of labels (or classes) attached.

Definition 3.3 (TSC Dataset).

A time series classification dataset \(\mathcal {D} = (\mathcal {X}, {\bf y})\) is a set of n time series, \(\mathcal {X}=\lbrace X_{1}, \dots , X_{n}\rbrace \in \mathbb {R}^{n \times m \times d}\), with a set of assigned labels, \({\bf y}=\lbrace y_1, \dots , y_n\rbrace \in \mathbb {N}^n\).

For a dataset \(\mathcal {D}\) containing c classes, \(y_i\) can take c different values. When \(c=2\), \(\mathcal {D}\) is a binary classification dataset, while if \(c\gt 2\), then \(\mathcal {D}\) is a multi-class classification dataset. In order to ensure clarity and consistency in notation, we have adopted a tensor-like notation based on [52]. Lowercase letters are used to denote single scalar observations (e.g., x), while bold lowercase letters are used to denote vectors and individual signals of time series (e.g., \({\bf x}\)). Capital letters are used to denote matrices and time series instances (e.g., X), and Euler script letters are used to denote tensors and time series datasets (e.g., \(\mathcal {X}\)). To indicate a specific observation in a time series dataset, we use the notation \(x_{i,j,k}\), where i denotes the \(i^{th}\) multivariate time series in the dataset, j denotes the \(j^{th}\) time-step, and k denotes the \(k^{th}\) signal of the time series. Although time series do not always form a proper tensor since different channels can have different numbers of observations, we use a unique symbol m to denote the length of the time series for simplicity of notation. When indexes are not relevant, we omit them for better readability. We can now define the TSC problem as:

Definition 3.4 (TSC).

Given a TSC dataset \(\mathcal {D}\), Time Series Classification is the task of training a function f from the space of possible inputs to a probability distribution over the class values in \({\bf y}\).

The resulting TSC function f takes as input a time series X and returns \({\hat{y}}\) according to what f learned, i.e., \({\hat{y}}= f(X)\). In general, \({\hat{y}}\) can either be a discrete label or the probability of X belonging to a specific class. We use \(f(\mathcal {X}) = {\hat{{\bf y}}}\) as a shorthand for \(\lbrace f(X) \;|\; X \in \mathcal {X}\rbrace = {\hat{{\bf y}}}\). Typically, the classifier f can be queried at will.

A common way to build classifiers in the time series domain is to use subsequences. In simple terms, a subsequence is a continuous sample of observations from a time series. Formally:

Definition 3.5 (Subsequence).

Given a signal \({\bf x} = \lbrace x_{1}, \dots , x_{m}\rbrace\) of a time series X, a subsequence \({\bf s} = \lbrace x_{j}, \dots , x_{j+l-1} \rbrace\) of length l is an ordered sequence of values such that \(1\le j\le m-l+1\).

Subsequence-based classification approaches search for patterns that better discriminate the dataset labels, i.e., they try to find those subsequences that are most dissimilar between instances belonging to different classes. The two most common kinds of subsequences used for this purpose are shapelets [95] and symbolic subsequences [61].

Shapelets . Typically, shapelet-based methods extract a set containing p-most discriminative shapelets by minimizing an information gain-like metric. Once the most discriminative shapelets are found, the Shapelet Transform [62] can be applied in order to transform the time series dataset into a simplified representation. Formally:

Definition 3.6 (Shapelet Transform).

Given a time series dataset \(\mathcal {X}\) and a set \(S\in \mathbb {R}^{p\times l}\) containing p shapelets, the Shapelet Transform, \({\varsigma }\), converts \(\mathcal {X} \in \mathbb {R}^{n \times m\times d}\) into a real-valued matrix \(T \in \mathbb {R}^{n \times p}\), obtained by taking the minimum distance between each time series in \(\mathcal {X}\), and each shapelet in S, via a sliding-window, i.e., \(T = {\varsigma }(\mathcal {X})\).

As a note, the sliding-window distance is method-dependent, and it is calculated w.r.t. the signal the shapelet was extracted from. In practice, the shapelet transform extracts the p-most discriminative shapelets from a time series dataset and returns a new representation of the data where the attributes represent the distances between each time series and the p shapelets. Hence, any classification algorithm can be used, potentially increasing the accuracy while reducing training time.

Symbolic Subsequences . Given that exhaustive subsequences search is computationally expensive, a common approach is to first transform the time series into a simplified representation by using SAX. The Symbolic Aggregate approXimation (SAX) algorithm [61] transforms time series into sequences of strings. For each time series signal \({\bf x} = \lbrace x_{1}, \dots , x_{m}\rbrace \in \mathbb {R}^{m}\), SAX uses the Piecewise Aggregate Approximation (PAA) [46] to split it into w equally sized intervals and averages the values of each interval. Then, the time series signal is discretized using a finite set of symbols, i.e., an alphabet \(\mathbb {A}\). Formally, \(\text{SAX}({\bf x}) = \tilde{{\bf x}} = \lbrace \tilde{x}_{1}, \dots , \tilde{x}_{w}\rbrace \in \mathbb {A}^{w}\), with \(|\mathbb {A}| \gt 1\) being the number of symbols in the chosen alphabet. This approximation reduces running time while denoising the time series. Once the time series are converted to a symbolic representation, the most discriminative symbolic subsequences can be found using different techniques. Similar to the Shapelet Transform, the dataset can be transformed into a new representation, having as features the extracted subsequences and as values 0 or 1 depending on the absence or presence of subsequences inside the time series. Formally:

Definition 3.7 (Symbolic Subsequence Transform).

Given a time series dataset \(\mathcal {X}\) and a set \(S\in \mathbb {A}^{p\times l}\) containing p symbolic subsequences, the Symbolic Subsequence Transform, \({\varsigma }\), converts \(\mathcal {X} \in \mathbb {R}^{n \times m\times d}\) into a binary-valued matrix \(T \in \mathbb {\lbrace }0,1\rbrace ^{n \times p}\), obtained by checking if each subsequence in S is contained or not in each time series in \(\mathcal {X}\), i.e., \(T = {\varsigma }(\mathcal {X})\).

For interpretability purposes, the symbolic subsequences can be easily mapped back to the original segments of the time series. We emphasize that we use the same symbol \({\varsigma }\) to denote the symbolic subsequence transform and the shapelet transform, given that they both convert a time series dataset into a tabular representation with subsequences as features.

3.2 Types of Explanations

Given a not interpretable, i.e., black-box, time series classifier b and a time series X classified by b, i.e., \(b(X) = {\hat{y}}\), our aim is to provide an explanation e for the decision \(b(X)={\hat{y}}\). More formally:

Definition 3.8 (Time Series Black-box Outcome Explanation Problem).

Let b be a not interpretable time series classifier, and X a time series whose decision \({\hat{y}}=b(X)\) has to be explained, the time series black-box outcome explanation problem consists in finding an explanation \(e \in E\) belonging to a human-interpretable domain E.

To build a complete, human-interpretable explanation in the time series domain, we consider three kinds of explanations: saliency maps, examples, and decision rules.

Saliency Maps. . Saliency maps are explanations that highlight the contribution of each feature for the classification [36]. Formally, for time series:

Definition 3.9 (Saliency Map).

Given a time series X a saliency map \({\Phi }= \lbrace {\phi }_{1,1}, \dots {\phi }_{j,k}, \dots {\phi }_{m,d}\rbrace \in \mathbb {R}^{m \times d}\) contains a score \({\phi }_{j,k}\) for every real-valued observation \(x_{j,k}\) of X.

In practice, the saliency map assigns an importance score to each observation in X depending on its contribution to the classification output.

Examples. . Examples, also called example-based (or instance-based) explanations, use whole time series objects to add interpretability to classification models by providing a comparison of the instance to explain with a salient time series. The two main different types of examples are exemplar and counterexemplar instances [34, 37]. Exemplars, also called prototypes, are time series that exemplify the main characteristics that influence the classifier’s decision. Formally:

Definition 3.10 (Exemplar).

Given a classifier f, an instance \({X_=}\) is an exemplar if there is a set of instances \(\mathcal {X}^{\prime }\subset \mathcal {X}\) represented by \({X_=}\), and such that \(\forall X \in \mathcal {X}^{\prime }, f({X_=})=f(X)\).

The explanation is obtained by comparing the instance X for which we have the decision \(f(X)\) with the exemplar \({X_=}\) that represents it. Representation is usually formalized with a notion of similarity. On the other hand, counterexemplars, also called counterfactual instances, are very similar w.r.t. the instance to explain but are classified differently. The explanation is achieved by comparing the minimal differences in shape that lead to a different classification outcome. Formally:

Definition 3.11 (Counterexemplar).

Given a classifier f that outputs the decision \({\hat{y}}= f(X)\) for an instance X, a counterexemplar consists of an instance \({X_\ne }\) such that the decision for f on \({X_\ne }\) is different from \({\hat{y}}\), i.e., \(f({X_\ne }) \ne {\hat{y}}\), and such that the difference between X and \({X_\ne }\) is minimal, and that \({X_\ne }\) is plausible.

Minimality usually refers to a distance metric, while plausibility is assessed by checking if the instance is not simply an adversarial example [29] and is semantically coherent with the dataset.

Decision Rules . Explainable AI methods increasingly simplify explanations to increase user trust by ensuring they identify cause-effect relations between events [12]. In this sense, rule-based explanations are arguably the most interpretable from a human standpoint as they allow the user to understand the reason behind a classifier’s decision in terms of if-then statements. Formally:

Definition 3.12 (Decision Rule).

Given an instance X, a decision rule is a function \(r:p \rightarrow {\hat{y}}\), where the premise p is a set of logical conditions on feature values, and \({\hat{y}}= r(X)\) is the predicted class value for X.

Decision rules are generated by rule-based classifiers or can be inferred by analyzing the splits of decision trees. They can be naively applied to time series by considering observations as features. In this case, a condition is in the form \((j,k) \in [v_{\text{low}}, v_{\text{up}}]\), where \((j,k)\) are the indexes identifying the \(j^{th}\) time-step of the \(k^{th}\) signal, and \(v_{\text{low}}, v_{\text{up}} \in \mathbb {\bar{R}}\) are the lower and upper bounds on the observation value. A time series \(X_i\) is covered by the rule if every condition in p is true. A local rule-based model rm can be used for explaining the black-box prediction for X if it imitates sufficiently well its behavior, i.e., if \(rm(X) = b(X)\) and also for every \(X^{\prime }\) in the neighborhood of X, \(rm(X^{\prime }) = b(X^{\prime })\). The definition of the neighborhood is method-dependent. A rule that directly explains the prediction of a black-box is called a factual rule. In contrast, the rules obtained by minimally removing or adding conditions in the factual rule premises are called counterfactual rules. Counterfactual rules are extremely useful for what-if analysis because they allow understanding of the minimal variation that results in a different classification by the black-box. The most notable rule-based XAI approaches for tabular data are Anchor [79], a model-agnostic method explaining the behavior black-boxes with high precision factual rules, and lore [35], which generates the local neighborhood via a genetic algorithm, trains a decision tree to then extract factual and counterfactual decision rules. The first version of lasts [37] extends lore to univariate time series data, generating rules that explain the decision of the black-box in terms of logical conditions based on time series subsequences. In this case, the conditions in the premise refer to subsequences instead of single time series observations, and the feature values represent the presence/absence of subsequences inside the time series. We build upon this to extend the approach to multivariate time series data.

3.3 Autoencoders in XAI

A standard autoencoder (AE) [40] is a type of neural network trained for learning a representation that reduces the dimensionality from \(m \times d\) to k and captures non-linear relationships. An encoder \({g}: \mathbb {R}^{m \times d} {\rightarrow } \mathbb {R}^{q}\), and a decoder \({h}: \mathbb {R}^{q}{\rightarrow } \mathbb {R}^{m \times d}\) are simultaneously trained with the objective of minimizing the reconstruction loss. Starting from the encoding \({\bf z} = {g}(X)\), the autoencoder tries to reconstruct a representation as close as possible to its original input \(X \simeq {h}({\bf z}) = \hat{X}\). Autoencoders learn to encode their input in a latent representation, which is usually of smaller dimensionality and, therefore, simpler and easier to deal with, w.r.t. the original input. The latent space can also be used to sample synthetic instances that can be decoded into completely new and unseen records. In order to use autoencoders as generators, it is useful to have an encoder with a specific latent distribution from which to sample. In this sense, the two leading solutions used in the literature are Variational Autoencoders (VAEs) [51], which learn the parameters of a latent distribution, usually the mean and standard deviation of a Gaussian, or Adversarial Autoencoders (AAEs) [66], that use a GAN inspired approach [28], adding a discriminator in the network architecture, trained to discriminate the latent distribution, constraining it to the desired form.

In order to support the generation of a good explanation, the autoencoder used in our proposal needs to have some desirable properties:

(1)

a known latent distribution to sample consistentsynthetic instances;

(2)

a latent distribution that allows for the generation of meaningful neighborhoods around sampled points [2];

(3)

a latent space that has as few dimensions as possible, to facilitate the neighborhood generation;

(4)

good performances in the (i) reconstruction error and (ii) reconstruction accuracy, i.e., (i) the reconstructed instances need to be similar to the original instances, and (ii) the autoencoder must be good enough for the black-box to be able to predict the same class before and after the autoencoding.

The first property can be ensured by using VAEs or AAEs. These are superior to traditional AEs as generative models because they allow sampling from a known latent distribution. Usually, a VAE is easier to train w.r.t. an AAE, given the former has to optimize only one loss function, while the latter also has to consider the discriminator [34, 66]. For this reason, we adopt VAEs by default as autoencoders for the proposed approach. The second property can be ensured by checking the sampled instances with a discriminator or by using specific sampling techniques that minimize the distribution mismatch [2] (Section 3.4). The third and fourth properties can be achieved with hyperparameter tuning. In our setting, the reconstruction error is measured in terms of mean squared error between the original instances and the reconstructed ones, \(\text{rec}_\text{mse} = \frac{1}{n} \sum _{i=1}^{n}(X_i-\hat{X_i})^2\). The reconstruction accuracy is the percentage of instances in \(\mathcal {X}\) that are correctly classified by the black-box after being reconstructed by the autoencoder. Formally, \(\text{rec}_\text{acc} = \frac{1}{n} \sum _{i=1}^{n} \mathbb {1}_{b(X_i) = b(\hat{X_i})}\), where \(\mathbb {1}\) is a function which outputs 1 if the condition in the subscript is true and 0 otherwise.

3.4 Neighborhood Generation in XAI

Neighborhood generation, being a central issue in the definition of local surrogates, is an increasingly studied topic in the domain of XAI [37, 64, 78]. Local surrogates mimic the local decision boundary of a black-box for the single instance they are tasked to explain. In order to be able to imitate the black-box, a representative neighborhood of the instance to explain has to be defined. It should be composed of (i) similar, i.e., spatially close, instances having the same label as the instance to explain (prototypes, exemplars), but it also should contain (ii) close instances having a different label (counterfactual instances, counterexemplars, distractors). Post-hoc interpretability approaches often rely on perturbations of the input data that query the black-box to understand how its prediction changes. Two of the most notable examples are lime [78] and shap [64]; however, the perturbation methods used by these models do not ensure the generation of realistic data. Moreover, even if these approaches can be applied to time series data with some modifications, they are not explicitly thought for it [34, 37, 54].

For time series data, it is often hard to generate a meaningful neighborhood in the manifest space by directly perturbing the original instances because the risk of generating unrealistic adversarial examples is relatively high. This is the main reason why generative models such as VAEs and AAEs are increasingly used in this field [7, 34, 37, 43, 54, 74]. In fact, these models are usually trained to produce a specific prior distribution in the latent space, typically a Gaussian. After the training phase, they can be used as generators by sampling new instances or perturbing existing ones. Perturbation techniques can greatly vary: from the usage of different sampling or searching algorithms [74], to genetic approaches [34, 37], to gradient descent-based methods [7, 43]. Even if the technical aspects of these methods are very different, they all rely on autoencoders to generate a suitable latent space to explore and analyze.

In [93], it is shown that latent space operations commonly used in the literature, such as instance interpolation and vicinity sampling, induce a so-called distribution mismatch between the outputs and the prior distribution the model was trained on. This is a delicate issue, given that decoders and generators are usually trained on fixed priors and thus assume that their inputs will have statistical properties that align with their distributions. For this reason, in [2] is proposed a Gaussian-matched neighborhood generation function which avoids the sampling from latent space locations that are highly unlikely given the prior distribution of the autoencoder, and allows the creation of more consistent, i.e., higher quality synthetic instances. Given a latent space vector \({\bf z}\in \mathbb {R}^{{q}}\) in which each scalar entry is independently sampled from a standard Gaussian distribution, i.e., \(\forall z \in {\bf z},\; z\sim \mathcal {N}(0,1)\), a standard Gaussian (ga) vicinity sampling is defined as \({\bf z}_{{\rm\small GA}} = {\bf z}+\theta {\bf u}\) with \(\theta\) being a scaling factor and \({\bf u}\in \mathbb {R}^{{q}}\) a randomly sampled normal Gaussian vector. The Gaussian-matched (gm) operation is defined as \({\bf z}_{{\rm\small GM}} = {\bf z}_{{\rm\small GA}} / \sqrt {1+\theta ^2}\). In [2] it is shown that this approach is guaranteed to produce samples coming from a standard Gaussian distribution, i.e., samples that are indistinguishable from randomly sampled instances from that distribution. The authors show that this property is independent of the number of latent dimensions, and it is proven to work even in high-dimensional latent spaces. Incorporating this operation into our framework helps to reduce the distribution mismatch and generate a synthetic neighborhood that is more consistent, resulting in samples that seem as if they were drawn from the original training set.

4 Local Agnostic Time Series Explainer

In this section we present lasts, a local agnostic subsequence-based time series explainer, solving the black-box outcome explanation problem. Given a black-box b and a univariate or multivariate time series X, the human-interpretable explanation \(e \in E\) returned by lasts for the classification \({\hat{y}}= b(X)\) is composed by three parts: (i) a saliency map highlighting the most sensible part of the time series, (ii) a set of exemplars and counterexemplars, and (iii) subsequence-based factual and counterfactual rules. The saliency map highlights the observations that are most responsible for a class change. Exemplars and counterexemplars illustrate time series classified with the same and with a different outcome than X. They can be visually analyzed to understand the reasons for the classification and make comparisons between X and them. Finally, the factual rule shows the subsequences contained (and not contained) in X responsible for the class \({\hat{y}}\), and vice-versa, the counterfactual highlights how the rule should change to have a different classification outcome. The explanation returned by lasts satisfies the requirements of counterfactuability, usability, and meaningfulness [12, 68, 75], and offers to the final user a multi-modal explanation unveiling the reasons for the classification in different and complementary ways. A simple schema of lasts can be viewed in Figure 2.

Fig. 2.

Besides the black-box b and the time series X, lasts requires a trained encoder \({g}\) and decoder \({h}\) for modeling times series in a simplified representation. The explanation process of lasts, described in Algorithm 1 and in Figure 2, involves the following steps. First, lasts encodes the time series X in its latent representation \({\bf z}\) (line 1). Then, through the cfs function detailed in Algorithm 2, it searches for the closest instance to \({\bf z}\) having a different class, i.e., the closest counterexemplar \({{\bf z}_\ne }\) (line 2). Once \({{\bf z}_\ne }\) is found, lasts generates a synthetic neighborhood Z around it using the \(\mathit {neighgen}\) function, exploiting the distribution of the VAE (line 3). After that, the latent instances are decoded and labeled using the black-box, i.e., \({\hat{{\bf y}}}= b({\hat{\mathcal {X}}})\) (line 4). The decoded neighborhood is used to get exemplar and counterexemplar time series (line 6), while the closest counterexemplar is used to compute the saliency map (line 5). The neighborhood is then represented as the presence/absence of subsequences in a time series through the function \({\varsigma ^*}\), which extracts the most discriminative subsequences and performs the subsequence transform \({\varsigma }\), obtaining the set T (line 7). Finally, a decision tree \(\mathit {dt}\) is trained on \((T, {\hat{{\bf y}}})\) (line 8) and used to retrieve the subsequence-based factual rule and counterfactual rules \({r_=}, {{r_\ne }}\) (line 9). The explanation comprises the saliency map, exemplar and counterexemplar instances, and the factual and counterfactual rules (line 10). Details of each step are presented in the rest of this section.

4.1 Latent Encoding

The time series X is passed to the encoder part of the VAE that compresses it into a so-called latent representation \({\bf z}={g}(X)\). The time series in Figure 3 is used as a running example. In our case, X originally has 128 observations, and it is compressed in a bidimensional vector, \({\bf z}=[1.344, -2.005]\), depicted in Figure 4 (left). The latent vector \({\bf z}\) can also be passed to the decoder to reconstruct the original time series. Both X and its reconstructed version, \(\hat{X}={h}({\bf z})\), are depicted in Figure 3 (right). The original time series is much more noisy w.r.t. its reconstructed version. This suggests that the autoencoder is able to capture the most relevant features of the time series for the classification, i.e., its general shape, discarding the random noise.

Fig. 3.

Fig. 4.

4.2 Counterfactual Search

The second step of lasts is to find the closest counterfactual instance w.r.t. \({\bf z}\). To perform this search, lasts adopts an algorithm that iteratively samples latent instances around \({\bf z}\), decodes them, and checks their label using b. The counterfactual search algorithm (cfs) is reported in Algorithm 2. cfs uses a function \({\it sample}\) to sample a distribution of synthetic instances around \({\bf z}\). The function sample can theoretically be any sampling function, ranging from a pure random approach like in lime [78] to a genetic algorithm maximizing a fitness function like in lore [33]. A good generation function is fundamental to create consistentsynthetic instances by avoiding the aforementioned distribution mismatch [2], i.e., the sampling from locations in the latent space that are highly unlikely given the prior distribution of the autoencoder. We emphasize that we employ the term “consistent” as mentioned in [2], i.e., as a synonym for probable, likely, or, in other words, coherent with the autoencoder’s underlying distribution. Our aim is to convey that the generated samples possess a similar distribution to that of the original training data. The first version of lasts presented in [37] uses the genetic approach of lore; in the version presented in this paper, lasts adopts by default a procedure inspired by the growing sphere algorithm proposed in [55].

cfs depends on a threshold \(\theta\) that decides the amount of space around \({\bf z}\) that the generated neighborhood is going to occupy. This threshold can be the radius for a spherical random uniform, as in the original growing sphere [55], or a scaling factor for a Gaussian distribution, as detailed in Section 3.4. The goal of cfs is to systematically explore the latent space in order to find the instance closest to \({\bf z}\) belonging to a different class, i.e., the closest counterfactual. The counterfactual search is performed by iteratively generating a neighborhood Z around \({\bf z}\), using the aforementioned sample function, and by checking if there is at least one counterfactual among the generated instances. The presence of counterfactuals indicates that the sampled neighborhood is still crossing the decision boundary; therefore, the threshold is halved, and the procedure loops until the sampling function does not generate any counterfactual. We highlight that in this setting, the generation happens in the latent space, while the presence of counterfactuals is checked in the time series domain by decompressing the latent instances through \({h}\). At each step of the iteration, the cfs algorithm stores all counterfactuals generated and, once out of the loop, it selects the closest one to \({\bf z}\), i.e., \({{\bf z}_\ne }\), as the best counterfactual. In the running example, the closest counterfactual instance can be viewed in its latent form in Figure 4 (left).

4.3 Neighborhood Generation

The third step we detail is the generation of the neighborhood around \({{\bf z}_\ne }\) using the function \(\mathit {neighgen}\). In principle, \(\mathit {neighgen}\) can be a different sampling function w.r.t. the sample function used for the search. In Section 5, we experiment with different combinations of the two. The sampling is performed extremely close to the black-box decision boundary; therefore, this synthetic neighborhood contains, by construction, both a set of time series having the same class as \({\bf z}\) and a set of time series having a different class, i.e., exemplars, \({Z_=}\), and counterexemplars, \({Z_\ne }\) (Figure 4, left). The number of distinct counterexemplar classes is influenced by the final threshold, and increases as \(\theta\) increases. Note that the sampling is performed in the latent space, while the labels are retrieved by decoding the latent neighborhood and then applying the black-box function to obtain its predictions. For ease of viewing, we depict all the counterexemplar instances with the red color, independently of their specific class. Z is decoded and classified by the black-box into \({\hat{\mathcal {X}}_=}\) and \({\hat{\mathcal {X}}_\ne }\). These instances represent the example-based explanation, showing how the decision of the black-box changes depending on the shape of the time series (Figure 4, right). Moreover, by computing the absolute difference between the decoded closest counterfactual \({\hat{X}_\ne }= b({{\bf z}_\ne })\), and \({\hat{X}}=b({\bf z})\), we can discover the time points for which a change in the values is more likely to modify the decision of the black-box. Thus, the saliency map is defined as \(\Phi = |{\hat{X}}- {\hat{X}_\ne }\)|.

Figure 4 (left) shows the latent neighborhood Z highlighting in green the exemplar instances labeled as \({\it bell}\), and in red the counterexemplar instances labeled with a different class value, in this case cylinder. In gray is depicted a standard bivariate normal distribution, highlighting the probability of hidden vectors in the latent space; formally \(\mathcal {N}_2(\mathbf {\mu }, \Sigma)\) with \(\mathbf {\mu } = [0, 0]^T\) and \(\Sigma = I_2\) where \(I_2\) is a 2-dimensional identity matrix. For ease of viewing, we provide an explanation for an instance at the edge of the distribution. The corresponding instances in the manifest space can be viewed in the same figure to the right. The synthetic neighborhood sampled in the latent space perfectly summarizes the separation of the different class values, unveiling a local decision boundary that is easy to detect even with a simple classifier. Both \({Z_=}\) and \({Z_\ne }\), forming the neighborhood, densely surround \({{\bf z}_\ne }\), helping to capture the black-box behavior locally around the closest decision boundary to \({\bf z}\). Note that \({\bf z}\) remains at the edge of the generated neighborhood by design because our area of interest is the decision boundary and not the instance to explain itself.

4.4 Subsequence Extraction

Given the decoded local neighborhood \({\hat{\mathcal {X}}}\) and \({\hat{{\bf y}}}= b({\hat{\mathcal {X}}})\), lasts extracts a set of subsequences S and performs the subsequence transform \({\varsigma ^*}\), encoding time series into a space of presence/absence of subsequences (line 7, Algorithm 1), as detailed in Section 3.1. Subsequences can be of many kinds; in Section 5, we test both shapelet-based and SAX-based subsequences. Figure 5 presents an example of SAX-based subsequences and a heatmap depiction of the transformed dataset. Each column represents a time series in \({\hat{\mathcal {X}}}\). For each row, a colored cell indicates the presence of the corresponding subsequence, while a white cell indicates its absence. The colored cells, green and red, indicate the SAX subsequences contained by the exemplar and counterexemplar time series, respectively. In some cases, we can observe a sort of complementarity. In other cases, a time series can simultaneously contain a combination of subsequences describing exemplars and counterexemplars.

Fig. 5.

4.5 Local Surrogate

Given T and \({\hat{{\bf y}}}\), lasts trains a subsequence-based decision tree classifier \(\mathit {dt}\) that allows to identify subsequence-based factual and counterfactual rules \({r_=}, {{r_\ne }}\) (lines 8 - 9, Algorithm 1). lasts adopts decision trees because factual and counterfactual decision rules can be naturally derived by following the root-leaf paths [33]. Figure 6 (bottom) reports the explanation rules for our example: \({r_=}= \lbrace s_{327} \in X \rbrace \rightarrow {\it bell}\), \({{r_\ne }} = \lbrace s_{327} \not\in X \rbrace \rightarrow \lnot {\it bell}\). The visual representation of the rules shows the position of the subsequences that must be contained and those that must not be contained at their best alignment with X. Looking at the rules, a user can truly understand the reasons for the classification. In this case, the presence of an increasing pattern in the time series differentiates between a bell instance and instances of other classes. In general, rules can be longer and include an arbitrary number of contained and non-contained subsequences, depending on the complexity of the classification task and the resulting surrogate tree.

Fig. 6.

4.6 Explanation

In summary, lasts explains the prediction of a black-box b for a time series instance X. The final explanation, \(e = \lbrace {\Phi }, ({\hat{\mathcal {X}}_=}, {\hat{\mathcal {X}}_\ne }), ({r_=}, {{r_\ne }}) \rbrace\), is composed of three parts. First, a saliency map, \({\Phi }\), highlights the most important timesteps of the time series, i.e., those timesteps that, if changed, would bring the prediction toward a different class. Second, a neighborhood composed of exemplar and counterexemplar instances. Exemplars, \({\hat{\mathcal {X}}_=}\), are instances close to X in the latent space, with the same label as X. They are prototypes, showing the main characteristics of a specific class. On the other hand, counterexemplars, \({\hat{\mathcal {X}}_\ne }\), are instances close to X in the latent space but classified by the black-box differently and can help the user understand how the prediction of the black-box changes by changing the shape of the time series consistently. Finally, the factual, \({r_=}\), and counterfactual, \({{r_\ne }}\), rules logically explain the prediction of the black-box, both in direct and contrastive ways, showing the minimum change in contained/not contained subsequences to modify the black-box prediction. The complete explanation for our running example is presented in Figure 6.

5 Experiments

We experiment with lasts both quantitatively and qualitatively. First, in Section 5.3, we compare different alternatives for neighborhood generation and subsequence extraction on 4 univariate datasets and 2 black-box models. Once the best framework combination is found, we benchmark it on 15 datasets, 10 univariate and 5 multivariate, respectively, from the UCR and UEA time series machine learning repositories². Each part of the explanation returned by lasts is evaluated. The instance-based part of the explanation is assessed through usefulness (Section 5.4). Then, the saliency-based part of the explanation is tested w.r.t. stability, correctness and by running insertion/deletion benchmarks against shap (Section 5.5). Furthermore, the rule-based part of the explanation is evaluated using fidelity, precision and coverage against a global SAX-based decision tree surrogate (glo-sax) and against anchor [79] re-adapted for time series (Section 5.6). Finally, in Section 5.7, we propose two qualitative examples of the explanation of lasts on one univariate and one multivariate time series.

5.1 Datasets and Black-box Models

In Table 1, all the information about the datasets used for evaluating our approach is reported. The training set \(\mathcal {X}_{\text{train}}\) is used both to train the black-box and the autoencoders. In principle, the dataset used for the black-box training, \(\mathcal {X}_{\text{bb}}\), and the dataset used for the autoencoder training, \(\mathcal {X}_{\text{ae}}\), can be different. However, due to the small dimensionality of some datasets, we set \(\mathcal {X}_{\text{bb}} = \mathcal {X}_{\text{ae}} = \mathcal {X}_{\text{train}}\) for all the datasets, with the exception CBF and CBM which are synthetic, and their instances can be sampled at will. The test set \(\mathcal {X}_{\text{test}}\) is used to benchmark both autoencoders and black-box models, and to sample \(\mathcal {X}_{\text{exp}}\), which is composed of a maximum of 50 instances to explain and evaluate.

Table 1.

datasets			details						autoencoder			rocket
name	abbr.	ref.	\(n_{\text{train}}\)	\(n_{\text{test}}\)	\(n_{\text{exp}}\)	m	d	c	\({q}\)	\({\it rec}_\text{mse}\)	\({\it rec}_\text{acc}\)	acc
ArticularyWordRecognition	ART	[90]	275	300	50	144	9	25	32	0.492	0.95	0.99
Cylinder-Bell-Funnel	CBF	[23]	268	84	36	128	1	3	2	1.067	1	1
Cylinder-Bell-Funnel-Multi	CBM	[23]	268	84	36	128	3	3	2	1.029	1	1
Coffee	COF	[11]	28	28	28	286	1	2	4	0.004	1	1
ECG200	EC2	[73]	100	100	50	96	1	2	4	0.180	0.99	0.90
ECG5000	EC5	[27]	500	4500	50	140	1	5	2	0.138	0.98	0.95
ERing	ERI	[94]	30	270	50	65	4	6	16	0.513	0.98	0.99
GunPoint	GUN	[77]	50	150	50	150	1	2	4	0.054	1	1
ItalyPowerDemand	ITA	[48]	67	1029	50	24	1	2	2	0.066	0.96	0.97
Libras	LIB	[20]	180	180	50	45	2	15	16	0.002	0.91	0.91
PenDigits	PEN	[21]	7494	3498	50	8	2	10	4	59.194	0.95	0.98
PhalangesOutlinesCorrect	PHA	[15]	1800	858	50	80	1	2	16	0.002	0.98	0.84
Plane	PLA	[86]	105	105	50	144	1	7	2	0.033	1	1
Strawberry	STR	[13]	613	370	50	235	1	2	4	0.002	0.99	0.98
TwoLeadECG	TWO	[27]	23	1139	50	82	1	2	2	0.042	1.00	1.00

Table 1. Dataset Details, Autoencoder Performance in Terms of Reconstruction Accuracy and MSE, Rocket Performance in Terms of Accuracy

To test different framework alternatives, we train and explain a ResNet [39] (res) implemented in keras according to [24], and a k-Nearest Neighbor [85] (knn) baseline as implemented by scikit-learn. Once the best framework setup is found, we choose as a black-box to explain Rocket [17], as implemented by sktime, using the default parameters for the transform and a RidgeClassifierCV as classification model. The ResNet comprises three residual blocks, each containing three convolutional layers, a global average pooling layer, and a dense layer. The layers inside each residual block have respectively \(64, 128, 256\) filters, of size \(8, 5, 3\). We train res with a batch size of 16, monitoring the loss, with a patience parameter of 50 epochs. As optimizer, we select Adam, with the default keras parameters: \(\mathit {learning\_rate} = 0.001\), \(\beta _1=0.9\), \(\beta _2=0.999\), minimizing the sparse categorical crossentropy. For knn we use the Euclidean distance and the k parameter is selected via grid-search with \(k \in [1, |\mathcal {X}_{\text{bb}}|]\).

5.2 Implementation Details

In the following, we specify the implementation details adopted for each module of lasts³.

Variational Autoencoder. . As autoencoder, we adopt a VAE for the reasons discussed in Section 3. For simplicity, we use the same network structure for all the datasets, i.e., a convolutional autoencoder composed of 16 layers, 8 for the encoder, and 8 for the decoder. The number of filters per layer is 8, and the kernel size is set to 3. The latent dimension is chosen on a dataset basis by starting with \({q}=2\) and iteratively building autoencoders with an increasing value of \({q}\) until the accuracy for the reconstructed instances increases and reaches 0.90, but also keeping \({q}\le m/2\). As an activation function, we always use ReLU. As optimizer, we select Adam with default keras parameters: \(\mathit {learning\_rate}=0.001\), \(\beta _1=0.9\), \(\beta _2=0.999\), minimizing the Mean Squared Error (MSE) with the Kullback–Leibler divergence regularization term. The VAEs are trained with a batch size equal to 32, for a maximum of 8,000 epochs, monitoring the validation loss with a patience of 500 epochs before halving the learning rate and a patience of 1,250 epochs before early stopping. We measure the performance of the autoencoders through the reconstruction error between the original and reconstructed time series, in terms of Mean Squared Error (the lower, the better), and in terms of accuracy of the classifiers on the reconstructed time series (the higher, the better). Since the encoding operation in a VAE is stochastic, \({\bf z}\) can vary slightly. Therefore, to improve the stability of the framework, X is encoded 1000 times, and the latent representation \({\bf z}\) is chosen by checking the most similar reconstruction \(h({\bf z})\) to X using the Euclidean distance. Given a set of 1,000 encodings of X named \(Z^{\prime }\), \({\bf z} = \arg \min _{{\bf z}^{\prime }\in Z^{\prime }} \text{dist}({h}({\bf z}^{\prime }), X)\).

Neighborhood Generators . For the counterfactual search and neighborhood generation, we experiment with four alternatives for the neighgen and sample functions of Algorithm 1 and Algorithm 2: (i) Gaussian (ga) that samples a normal distribution around \({\bf z}\) scaling each vector by a factor \(\theta\) as detailed in Section 3.4; (ii) Gaussian-matched (gm) that is a Gaussian-based sampling [2] that uses distribution matching transport maps to minimize the problem of distribution mismatch. As in Gaussian, it samples the distribution around \({\bf z}\) scaling each vector by a factor \(\theta\) as detailed in Section 3.4; (iii) Uniform Sphere (us) that samples a uniform sphere distribution around \({\bf z}\) with radius \(\theta\); and (iv) Matched Uniform (mu) that combines a Gaussian-matched search to find the closest counterfactual, and Uniform Sphere sampling to generate the neighborhood. In the first three cases, the counterfactual search and the neighborhood generation use the same function, i.e., \({\it sample}={\it neighgen}\), while the last approach uses two different ones, i.e., \({\it sample}\ne {\it neighgen}\).

For the counterfactual search, we generate \(n_s = 10,\!000\) instances at each iteration, starting with a threshold \(\theta = 2\). Once the closest counterfactual is found, the neighborhood generation function \(\mathit {neighgen}\) is run with neighborhood size equal to \(N = 500\) latent instances. If \(\theta\) represents a radius, we perform the final sampling using as \(\theta\) the distance between \({\bf z}\) and \({{\bf z}_\ne }\) while if \(\theta\) represents a scaling factor, we take the last \(\theta\) used in the counterfactual search. In the last generation step, we impose a balance between instance labels by oversampling the minority class. Moreover, we compare these sampling functions against the genetic approach adopted in [37]. For the genetic approach, we also generate \(N = 500\) latent instances using the same parameters as in [37], namely 10 generations, normalized Euclidean distance as the genetic fitness function, probability of mutation equal to 0.5, probability of crossover equal to 0.7.

Subsequence-based Surrogates . We implement the function \({\varsigma ^*}\) of Algorithm 1 in two ways in order to test two different strategies to retrieve subsequences: SAX-based and shapelet-based. To extract SAX-based subsequences, we use the SAX-SEQL algorithm illustrated in [57], which extracts the p-most discriminative subsequences S for \({\hat{{\bf y}}}\), and then converts time series in a binary-valued matrix T using the symbolic subsequence transform detailed in Section 3.1. Furthermore, to extract shapelets-based subsequences, we adopt the LTS algorithm described in [30] that learns the p-most discriminative shapelets S with respect to \({\hat{{\bf y}}}\) via gradient descent. The time series dataset is then converted to a simplified representation via the shapelet transform detailed in Section 3.1. In order to have the same representation as in the SAX-subsequences alternative, the distances in T are replaced with binary values using a threshold \(\tau\) such that \(\forall t_{i,j} \in T\) if the distance \(t_{i,j} \lt \tau\), \(t_{i,j}\) is replaced with 1 else with a 0. The distance threshold \(\tau\) is chosen via grid search by testing the accuracy of the decision tree surrogate for every decile of the distribution of all the distances in T. To further simplify the classification task of the local surrogate, we binarize the label vector by considering only the predicted class of the instance to explain as 1 and all the others as 0. Therefore, \(\forall {\hat{y}}_i \in {\hat{{\bf y}}}\), if \({\hat{y}}_i = b(X)\), \({\hat{y}}_i\) is replaced with a 1 else with a 0.

5.3 LASTS Framework Alternatives Analysis

In the following, we analyze the effect of various alternatives in terms of neighborhood generation and subsequence types on the explanation returned by lasts.

We measure the performance in terms of fidelity, aiming to evaluate how good the explanation model is at mimicking the black-box decisions. The fidelity (fid) is simply defined as the accuracy between the prediction of the black-box and that of the explanation model [9]. In our setting, we compare the prediction of the black-box for the explanation dataset, \({\hat{{\bf y}}}= b(\mathcal {X}_\text{exp})\), and \({\hat{{\bf y}}}^{\prime } = \lbrace {\hat{y}}^{\prime } | \forall X \in \mathcal {X}_\text{exp},\; {\hat{y}}^{\prime } = \mathit {dt}({\varsigma ^*}({h}({g}(X))))\rbrace\), where \(\mathit {dt}\) is the local subsequence-based decision tree learned for each X processed by lasts. Formally, the fidelity is the percentage of times in which \({\hat{{\bf y}}}= {\hat{{\bf y}}}^{\prime }\), i.e., \({\it fid} = \frac{1}{n} \sum _{i=1}^{n} \mathbb {1}_{{\hat{{\bf y}}}= {\hat{{\bf y}}}^{\prime }}\).

In [37], we showed that lasts is more faithful than a version of lasts using the same random neighborhood generation adopted by lime [78]. In the following, we compare the version of lasts proposed in [37] using the genetic-based neighborhood generation (named gen-shp) with the alternatives of lasts proposed in this paper by combining the various sampling strategies for the neighborhood generation and types of subsequences. Besides, we show that extracting an explanation from the local neighborhood of a given instance is a winning strategy compared to an approach that builds a single global interpretable surrogate. Thus, we compare lasts against shapelet/SAX-based global decision tree (\(\mathit {gdt}\)) classifiers, namely glo-shp and glo-sax, trained on \(\mathcal {X}_\text{bb}\). In this case, the fidelity is calculated as the accuracy between \({\hat{{\bf y}}}=b(\mathcal {X}_\text{exp})\) and \({\hat{{\bf y}}}^{\prime }=\mathit {gdt}(\mathcal {X}_\text{exp})\).

To help readability, we provide the average ranking (\(\overline{{\it rk}}\)) (lower is better) at the bottom of each table, which is the average of the ranks of each method for all datasets and black-boxes. Table 2 reports the values of the fidelity. We observe very high fidelity for all the approaches. However, the null hypothesis that all methods are equivalent is rejected (\(\mathit {p{-}value} \lt 0.01\)) from the non-parametric Friedman over multiple datasets and black-boxes. We notice that the global approaches glo-shp and glo-sax always have slightly lower values than the local alternatives of lasts. Among them, the variants employing SAX-based \(\mathit {dt}\) generally have higher fidelity than those using shapelet-based \(\mathit {dt}\). Concerning the neighborhood generation options, results are very similar even though mu seem to return slightly better results. We better elaborate on this fact in the following paragraph.

Table 2.

	model	gen-shp	ga-shp	ga-sax	gm-shp	gm-sax	mu-shp	mu-sax	us-shp	us-sax	glo-shp	glo-sax
CBF	RES	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	0.94
CBF	KNN	1.00	1.00	1.00	0.97	1.00	1.00	1.00	1.00	1.00	1.00	0.97
COF	RES	1.00	0.93	0.93	0.93	1.00	0.93	1.00	1.00	1.00	0.75	0.89
COF	KNN	1.00	0.93	1.00	0.93	1.00	0.96	1.00	1.00	1.00	0.89	0.89
EC2	RES	0.98	0.94	1.00	0.96	0.99	0.98	1.00	0.95	0.99	0.69	0.78
EC2	KNN	0.97	0.96	0.97	0.90	0.99	0.95	0.98	0.97	0.98	0.83	0.83
GUN	RES	0.98	0.82	0.98	0.78	0.98	0.84	0.98	0.80	0.98	0.93	0.90
GUN	KNN	0.96	0.86	0.96	0.82	0.98	0.86	0.98	0.86	0.98	0.74	0.81
\(\overline{{rk}}\)		4.38	7.31	4.44	8.50	3.31	6.69	3.25	5.88	3.50	8.81	9.94

Table 2. Fidelity Comparison (Higher is Better)

Top 3 performing models in bold.

The following test is to estimate the precision and coverage of the factual and counterfactual rules (indicated as \(\mathit {pre}_{r_=}\), \(\mathit {pre}_{{r_\ne }}\), \(\mathit {cov}_{r_=}\), \(\mathit {cov}_{{r_\ne }}\), respectively). The coverage is measured as the relative number of time series that respect the factual/counterfactual rule premises in the neighborhood \({\hat{\mathcal {X}}}\), while the precision is measured as the relative number of covered time series for which the prediction outcome is correct. In addition, we measure also the cohesion and separation of the neighborhoods \({\hat{\mathcal {X}}}\) through the silhouette (\(\mathit {sil}\)) coefficient [85] with respect to the two clusters identified as \({\hat{\mathcal {X}}_=}\) and \({\hat{\mathcal {X}}_\ne }\) [32]. Table 3 illustrates the average precision, coverage, and silhouette scores when applying lasts for explaining the res classifier. For every metric, the higher the value, the better the score. Again, we notice that SAX-based methods seem to perform better than shapelet-based ones. More in detail, mu-sax has always the highest \(\mathit {pre}_{r_=}\), while ga-sax is among the bests for \(\mathit {pre}_{{r_\ne }}\). Factual coverage scores are, in general, quite close, while shapelet-based models usually have better counterfactual coverage. Finally, for the silhouette, the approaches using the uniform sphere neighgen function (mu and us) perform better by quite a margin. In summary, we can state that lasts explanations obtained using SAX and uniform sphere or matched uniform are the best ones. However, in the following, we only consider lasts using SAX-based subsequences and the mu neighborhood generation, unless otherwise specified, because the gm sampling does not guarantee a perfectly centered neighborhood generation around the closest counterexemplar. This problem can be detected by computing the Local Outlier Factor (LOF) [10] for \({\bf z}\), as implemented in sklearn. In particular, a LOF score of \(-1\) indicates an outlier, while a score of 1 indicates an inlier. In our tests LOF_MU = 1 for every dataset and for both black-boxes, while LOF_GM \(\in [-0.55, 0.64]\), indicating a higher degree of outlierness. Moreover, given that the radius of the uniform sampling is, by construction, as small as possible, the problem of distribution mismatch in a Gaussian space is minimized in most cases.

Table 3.

	metric	ga-shp	ga-sax	gm-shp	gm-sax	mu-shp	mu-sax	us-shp	us-sax
CBF	\(\mathit {pre}_{r_=}\)	0.998	1.000	0.999	1.000	0.999	1.000	0.999	1.000
	\(\mathit {pre}_{{r_\ne }}\)	0.984	1.000	0.994	1.000	0.984	1.000	0.990	1.000
	\(\mathit {cov}_{r_=}\)	0.482	0.500	0.476	0.500	0.481	0.499	0.470	0.500
	\(\mathit {cov}_{{r_\ne }}\)	0.392	0.375	0.340	0.375	0.423	0.472	0.379	0.445
	\(\mathit {sil}\)	0.306	0.306	0.303	0.303	0.342	0.342	0.342	0.342
COF	\(\mathit {pre}_{r_=}\)	0.870	0.998	0.866	0.998	0.884	0.998	0.912	0.996
	\(\mathit {pre}_{{r_\ne }}\)	0.833	0.967	0.872	0.897	0.889	0.947	0.876	0.975
	\(\mathit {cov}_{r_=}\)	0.476	0.441	0.498	0.458	0.458	0.482	0.460	0.481
	\(\mathit {cov}_{{r_\ne }}\)	0.368	0.058	0.372	0.039	0.358	0.039	0.389	0.076
	\(\mathit {sil}\)	0.206	0.206	0.208	0.208	0.228	0.228	0.229	0.229
EC2	\(\mathit {pre}_{r_=}\)	0.871	0.994	0.879	0.993	0.890	0.994	0.878	0.992
	\(\mathit {pre}_{{r_\ne }}\)	0.829	0.920	0.838	0.882	0.828	0.888	0.842	0.885
	\(\mathit {cov}_{r_=}\)	0.402	0.379	0.398	0.392	0.386	0.389	0.384	0.401
	\(\mathit {cov}_{{r_\ne }}\)	0.306	0.022	0.287	0.036	0.326	0.022	0.305	0.025
	\(\mathit {sil}\)	0.147	0.147	0.145	0.145	0.160	0.160	0.160	0.160
GUN	\(\mathit {pre}_{r_=}\)	0.863	0.999	0.860	0.998	0.884	0.999	0.888	0.999
	\(\mathit {pre}_{{r_\ne }}\)	0.886	0.956	0.881	0.935	0.862	0.951	0.883	0.989
	\(\mathit {cov}_{r_=}\)	0.490	0.469	0.479	0.470	0.474	0.475	0.495	0.476
	\(\mathit {cov}_{{r_\ne }}\)	0.434	0.083	0.472	0.106	0.467	0.112	0.443	0.096
	\(\mathit {sil}\)	0.289	0.289	0.284	0.284	0.319	0.319	0.322	0.322
\(\overline{{\it rk}}\)	\(\mathit {pre}_{r_=}\)	7.500	2.000	7.000	2.875	5.750	2.000	5.750	3.125
	\(\mathit {pre}_{{r_\ne }}\)	6.875	1.875	6.250	3.625	7.125	2.625	5.750	1.875
	\(\mathit {cov}_{r_=}\)	3.000	6.500	3.500	4.875	6.125	4.000	5.250	2.750
	\(\mathit {cov}_{{r_\ne }}\)	3.250	7.000	3.750	6.250	2.500	5.250	3.000	5.000
	\(\mathit {sil}\)	6.000	6.000	7.000	7.000	3.000	3.000	2.000	2.000

Table 3. Comparison of Precision, Coverage, and Silhouette for Alternative Neighborhood Generations and Types of Subsequences for the Explanation Returned by lasts for RES (Higher is Better)

Best 3 models for each metric in bold.

The time required by lasts for the explanation is mainly affected by the cfs algorithm, and the subsequence transform (\({\varsigma ^*}\)). The cfs algorithm’s most expensive operations are the black-box (b) and decoder (h) functions. They dominate the time complexity of all the sample functions described in Section 4.3, which is only \(O(q)\), where q is the dimensionality of the latent vector \({\bf z}\) [88]. Following Algorithm 2, b and h must be repeated until no counterexemplar in the latent space is found. Therefore, the worst-case scenario can appear when the latent encoding of \({\bf z}\) is extremely close to the decision boundary, and we have to reduce the \(\theta\) parameter a great number of times. This is extremely unlikely, and we found that in most of our tests, starting with \(\theta =2\), only a few iterations are needed to exit the loop, with the worst case not exceeding 20 iterations. In Table 4, we report the average runtime and standard deviation for the two critical phases and for the complete explanation process (all). We notice that on every dataset, the cfs strategy is one order of magnitude faster than the genetic one⁴. The subsequence extraction runtime is comparable only for the Coffee dataset but, on average, is at least four times faster when using SAX instead of shapelets. In conclusion, we notice that the complete explanation process is much faster in the novel version of lasts when using SAX with respect to the one proposed in [37] and the current one when using shapelets.

Table 4.

		CBF	COF	EC2	GUN
neighgen	cfs	4.83 \(\pm\) 0.18	8.18 \(\pm\) 0.44	6.81 \(\pm\) 0.54	7.47 \(\pm\) 0.63
neighgen	genetic	55.57 \(\pm\) 4.55	80.67 \(\pm\) 2.11	72.25 \(\pm\) 5.14	80.81 \(\pm\) 18.00
\({\varsigma ^*}\)	shp	60.02 \(\pm\) 5.43	46.19 \(\pm\) 5.24	47.33 \(\pm\) 8.53	42.25 \(\pm\) 7.85
\({\varsigma ^*}\)	sax	8.91 \(\pm\) 0.47	43.75 \(\pm\) 9.33	13.04 \(\pm\) 2.61	12.63 \(\pm\) 2.14
all	gen-shp	115.59 \(\pm\) 9.98	126.86 \(\pm\) 7.35	119.58 \(\pm\) 13.67	123.06 \(\pm\) 9.83
	mu-shp tot	64.85 \(\pm\) 5.61	54.38 \(\pm\) 5.69	54.14 \(\pm\) 9.07	49.72 \(\pm\) 8.48
	mu-sax tot	13.74 \(\pm\) 0.65	51.93 \(\pm\) 9.77	19.85 \(\pm\) 3.15	20.10 \(\pm\) 2.77

Table 4. Mean Running Time in Seconds (Lower is Better)

Best results in bold.

In summary, mu-sax is the most well-rounded approach: (i) it generates a neighborhood of well-separated exemplar and counterexemplar instances around the decision boundary, which contains \({\bf z}\) as an inlier, (ii) it has excellent performance as a local surrogate, with high fidelity scores, factual and counterfactual rules precision, and (iii) it ensures that the closest counterexemplar is sampled from a “probable” portion of latent space, minimizing the distribution mismatch.

5.4 Instance-based Explanation Experiments

Since it is hard to validate the usefulness of the generated factual and counterfactual instances with an experiment involving humans, inspired by [50], we tested their effectiveness with a memory-based machine learning technique. This experiment gives an objective and indirect estimation of the usefulness of exemplars and counterexemplars by checking how the performance of a simple classifier changes by training it on an increasing number of instances. For each instance \(X \in \mathcal {X}_\text{exp}\) we apply lasts, i.e., we encode X, find its closest counterexemplar, generate its latent neighborhood, and decode it into \({\hat{\mathcal {X}}}\). From this synthetic dataset \({\hat{\mathcal {X}}}\), using the black-box to retrieve the predicted labels, we extract n exemplars, i.e., n instances having the same class as X, and n counterexemplars from each other class, i.e., \(n(c-1)\) instances having a different class w.r.t. X. This extraction is random and without replacement. Then, we use the selected exemplars and counterexemplars to train a 1-Nearest Neighbor (1-NN), and we classify X. For each dataset, we compare the classification performances on X with a 1-NN trained on real time series, n per class, randomly selected from \(\mathcal {X}_\text{exp} \setminus \lbrace X\rbrace\). Specifically, \(n \in \lbrace 1, 2, 4, 8, 16\rbrace\) and the Euclidean distance is used as a distance function. Figure 7 shows the accuracy of a 1-NN using an increasing number of exemplars and counterexemplars as training. The plot aggregates the accuracy for each dataset at each n, showing the estimate of the central tendency and the respective confidence interval. On average, the performance of lasts is higher and more constant, revealing that few exemplars and counterexemplars are a good proxy for recognizing the classification outcome. On the other hand, the accuracy of real increases more steeply on the various datasets with increasing n, meaning that more exemplars and counterexemplars are necessary to distinguish the reasons for the classification. Hence, we can state that exemplars and counterexemplars help in discovering the decision boundary and in highlighting similarities and differences. This experiment shows that time series must be carefully chosen, as lasts does, in order to be suitable for class recognition.

Fig. 7.

5.5 Saliency-based Explanation Experiments

In these experiments, we compare lasts with shap [64] employed to solve the time series black-box outcome explanation problem as in [4]. Similar to the proposal in [71], we adapt shap to time series classifiers by performing an adaptive segmentation [49] of the time series in order to divide it into meaningful intervals that are used as features by shap. As is commonly done with images, the absence of a feature is simulated via linear interpolation by connecting the observations before and after the ablated segment.

A common strategy to validate a saliency-based explanation is to observe how the performance of the black-box changes by adding/removing features in order of importance [76]. Regarding deletion, the intuition is that removing the most important time-steps in a time series will force the black-box to change its decision. On the other hand, the insertion evaluation adopts a complementary approach by starting from an “empty” time series and adding the most important time-steps. From a practical standpoint, the removal of time-steps is approximated by taking the average of the time series values. Time-steps are added/deleted in order of importance one by one, and the black-box prediction is checked at each step. Insertion/deletion metrics are computed as \(\text{AUC} / md\), i.e., the Area Under the Curve of the line obtained by computing the accuracy of the black-box after each insertion/deletion, divided by the total number of time-steps m of every dimension d of the time series. For the insertion benchmark, we want the black-box performance to improve as fast as possible; therefore, a high score is desirable. On the contrary, we expect a sudden drop in performance for deletion benchmarks, resulting in a lower score. Results can be seen in Table 5. For this benchmark, lasts scores very similarly to shap, indicating that both methods are able to indicate important observations of the time series.

Table 5.

	lasts	shap	lasts	shap	lasts	shap
	deletion \(\downarrow\)		insertion \(\uparrow\)		stability \(\downarrow\)
ART	0.30 \(\pm\) 0.20	0.30 \(\pm\) 0.26	0.35 \(\pm\) 0.23	0.30 \(\pm\) 0.22	0.16 \(\pm\) 0.04	0.32 \(\pm\) 0.04
CBF	0.78 \(\pm\) 0.17	0.75 \(\pm\) 0.23	0.72 \(\pm\) 0.25	0.78 \(\pm\) 0.28	0.04 \(\pm\) 0.08	0.20 \(\pm\) 0.17
CBM	0.82 \(\pm\) 0.08	0.79 \(\pm\) 0.18	0.90 \(\pm\) 0.11	0.73 \(\pm\) 0.20	0.04 \(\pm\) 0.04	0.31 \(\pm\) 0.09
COF	0.47 \(\pm\) 0.50	0.51 \(\pm\) 0.46	0.47 \(\pm\) 0.50	0.48 \(\pm\) 0.49	0.09 \(\pm\) 0.03	0.16 \(\pm\) 0.09
EC2	0.74 \(\pm\) 0.37	0.50 \(\pm\) 0.26	0.79 \(\pm\) 0.28	0.92 \(\pm\) 0.17	0.11 \(\pm\) 0.06	0.27 \(\pm\) 0.13
EC5	0.64 \(\pm\) 0.37	0.71 \(\pm\) 0.22	0.93 \(\pm\) 0.10	0.93 \(\pm\) 0.11	0.03 \(\pm\) 0.03	0.27 \(\pm\) 0.13
ERI	0.43 \(\pm\) 0.30	0.50 \(\pm\) 0.30	0.41 \(\pm\) 0.29	0.39 \(\pm\) 0.31	0.19 \(\pm\) 0.08	0.33 \(\pm\) 0.05
GUN	0.64 \(\pm\) 0.44	0.62 \(\pm\) 0.46	0.61 \(\pm\) 0.48	0.60 \(\pm\) 0.47	0.16 \(\pm\) 0.06	0.30 \(\pm\) 0.20
ITA	0.50 \(\pm\) 0.38	0.65 \(\pm\) 0.30	0.83 \(\pm\) 0.19	0.73 \(\pm\) 0.26	0.07 \(\pm\) 0.08	0.27 \(\pm\) 0.25
LIB	0.33 \(\pm\) 0.23	0.26 \(\pm\) 0.22	0.36 \(\pm\) 0.23	0.22 \(\pm\) 0.18	0.23 \(\pm\) 0.06	0.35 \(\pm\) 0.07
PEN	0.51 \(\pm\) 0.24	0.46 \(\pm\) 0.18	0.41 \(\pm\) 0.23	0.46 \(\pm\) 0.16	0.21 \(\pm\) 0.10	0.38 \(\pm\) 0.15
PHA	0.32 \(\pm\) 0.41	0.48 \(\pm\) 0.21	0.30 \(\pm\) 0.42	0.40 \(\pm\) 0.24	0.21 \(\pm\) 0.05	0.32 \(\pm\) 0.22
PLA	0.40 \(\pm\) 0.28	0.40 \(\pm\) 0.33	0.36 \(\pm\) 0.36	0.45 \(\pm\) 0.27	0.06 \(\pm\) 0.06	0.17 \(\pm\) 0.05
STR	0.55 \(\pm\) 0.48	0.54 \(\pm\) 0.46	0.55 \(\pm\) 0.48	0.58 \(\pm\) 0.44	0.14 \(\pm\) 0.07	0.11 \(\pm\) 0.11
TWO	0.49 \(\pm\) 0.47	0.59 \(\pm\) 0.39	0.59 \(\pm\) 0.38	0.57 \(\pm\) 0.42	0.10 \(\pm\) 0.09	0.27 \(\pm\) 0.13
\(\overline{{\it rk}}\)	1.53	1.47	1.53	1.47	1.07	1.93

Table 5. Saliency Map Deletion (Lower is Better), Insertion (Higher is Better), Stability (Lower is Better)

Furthermore, we measure the stability of an explainer as its ability to produce a similar explanation for close instances, i.e., given similar instances, their saliency map should also be similar. In practice, given an instance X, we find its closest instance \(X^{\prime }\) in the latent space using the Euclidean distance. Then, we compare their saliency maps using the Mean Absolute Error (MAE). Intuitively, stable explanations should have lower MAE, while unstable explanations should have a higher error. As can be seen in Table 5, lasts outperforms shap in stability for all but one dataset, indicating its ability to give similar explanations for similar instances.

Since it is impossible to know exactly if explanations are correct with real datasets and standard black-box models, we use a synthetic experiment to check if the saliency maps obtained with agnostic explainers match with custom-defined ground truths, i.e., to check if they are correct. To perform this study, inspired by [31], we generate a univariate synthetic dataset having two classes. Time series belonging to each class have a distinct and easily identifiable pattern that unequivocally defines their class, plus some random noise. Then, we build a synthetic classifier that performs the classification by only looking at the presence of these predefined patterns. The time-steps of the time series that are checked by the synthetic model during classification are thus known, and the ground truth is defined as a vector of length m with the value 1 if the pattern is present at a given time-step and 0 otherwise. In other words, the only important points for the time series classification have a value of 1, while the noise has a 0 value. Given a saliency-based explanation returned as a vector of length m, we compare it with the ground truth by first normalizing it in the range \([0, 1]\) and then computing MAE between the ground truth and the normalized saliency vector. Intuitively, if the saliency vector correctly identifies important and irrelevant points, the MAE will tend to 0. To improve the test’s significance, we perform it on six synthetic classifiers that use different random and continuous subsets of the original patterns. A comparison of the saliency maps returned by lasts and shap w.r.t. the ground truth is depicted in Figure 8, while box-plots of the MAE for each dataset can be viewed in Figure 9. In general, lasts performs better, i.e., has a lower median MAE in four of the six tests. Moreover, lasts has a lower standard deviation, indicating that even its worst saliency maps are not that far from the ground truth. In general, we observe that lasts tends to give a more targeted and precise explanation, while shap tends to give importance to more points in the time series, resulting in a very wide interquartile range for the MAE.

Fig. 8.

Fig. 9.

5.6 Rule-based Explanation Benchmarks

To the best of our knowledge, lasts is the only local agnostic time series explainer that outputs rules as an explanation. Thus, we decided to compare the factual and counterfactual rules returned by lasts with those of a global decision tree, as in Section 5.3, and also with anchor [79]. In order to adapt anchor to time series, we consider each observation as a separate feature. As a note, anchor can return only factual rules whose conditions depend on single time series observations. We use precision and coverage metrics to evaluate the goodness of a rule. Moreover, given that simpler explanations are to be preferred, we also measure the length of the returned rules.

The results of precision, coverage and length for the factual rules returned by lasts, glo-sax and anchor are presented in Table 6. Regarding precision, anchor is the clear winner in all but one dataset. This result is not surprising given that anchor, by definition, constructs rules that are guaranteed to have a precision above 0.95. lasts scores slightly lower, with a precision that does not drop under 0.9, while glo-sax is the clear loser in this benchmark. The coverage metric helps to show the whole picture of these benchmarks. lasts performs best, followed by glo-sax and anchor in last place. This indicated that while rules returned by anchor are indeed slightly more precise, they are also less generalizable, i.e., they cover a much lower number of instances. Furthermore, as shown by the average lengths, factual rules returned by anchor and glo-sax are also considerably longer, i.e., more difficult to understand from a human standpoint. As a note, anchor is also extremely inefficient for longer time series, requiring hours of runtime to explain a single instance. For this reason, completing the benchmarks for the ART dataset was impossible. Regarding counterfactual rules, results are presented in Table 6. lasts performs better than a global surrogate in precision and length, tying in coverage. This experiment demonstrates that, while the number of instances covered by the rules is comparable for the two methods, counterfactual rules returned by lasts are more precise, shorter, and thus easier to understand.

Table 6.

	\(\mathit {pre}_{r_=}\) \(\uparrow\)			\(\mathit {cov}_{r_=}\) \(\uparrow\)			\(\mathit {len}_{r_=}\) \(\downarrow\)			\(\mathit {pre}_{{r_\ne }}\) \(\uparrow\)		\(\mathit {cov}_{{r_\ne }}\) \(\uparrow\)		\(\mathit {len}_{{r_\ne }}\) \(\downarrow\)
	lasts	glo-sax	anchor	lasts	glo-sax	anchor	lasts	glo-sax	anchor	lasts	glo-sax	lasts	glo-sax	lasts	glo-sax
ART	0.92	0.66	-	0.30	0.04	-	4.70	10.36	-	0.78	0.54	0.09	0.04	4.92	11.04
CBF	1.00	0.89	1.00	0.50	0.33	0.20	1.14	1.69	2.00	0.99	0.86	0.43	0.33	1.22	2.00
CBM	0.99	0.97	1.00	0.46	0.33	0.18	1.33	1.67	2.00	0.96	0.95	0.38	0.33	1.78	2.00
COF	1.00	0.93	1.00	0.45	0.50	0.21	3.64	1.00	2.00	0.93	0.95	0.01	0.50	3.96	1.00
EC2	0.99	0.84	1.00	0.43	0.36	0.13	3.84	1.96	2.36	0.94	0.65	0.02	0.15	4.24	3.16
EC5	1.00	0.98	1.00	0.48	0.47	0.22	2.34	5.10	2.00	0.98	1.00	0.11	0.01	2.80	5.14
ERI	0.98	0.58	1.00	0.43	0.17	0.06	3.00	3.72	3.04	0.89	0.60	0.10	0.17	3.50	4.20
GUN	0.99	0.94	1.00	0.44	0.50	0.13	2.90	1.00	2.44	0.94	0.93	0.06	0.50	3.56	1.00
ITA	0.99	0.80	1.00	0.46	0.37	0.19	2.38	2.48	2.14	0.94	0.54	0.12	0.11	3.26	2.58
LIB	0.94	0.72	0.97	0.26	0.05	0.02	4.64	5.78	11.70	0.79	0.88	0.05	0.02	5.02	6.12
PEN	0.94	0.88	0.99	0.33	0.02	0.04	2.46	10.26	4.12	0.85	0.25	0.17	\(\lt\)0.01	3.44	10.76
PHA	0.91	0.76	1.00	0.30	0.23	0.08	3.58	9.50	2.56	0.81	0.77	0.09	0.01	4.26	10.02
PLA	1.00	0.92	1.00	0.49	0.13	0.08	1.98	4.32	3.54	0.99	0.91	0.12	0.12	2.28	4.92
STR	0.99	0.90	1.00	0.38	0.43	0.12	3.76	2.02	2.48	0.91	0.84	0.03	0.05	4.50	2.50
TWO	1.00	0.96	1.00	0.43	0.50	0.10	2.06	1.00	2.92	0.93	0.96	0.15	0.50	2.96	1.00
\(\overline{{rk}}\)	1.90	2.93	1.04	1.27	1.80	2.93	1.80	2.07	2.07	1.27	1.73	1.47	1.53	1.40	1.60

Table 6. Precision (Higher is Better), Coverage (Higher is Better), and Length (Lower is Better) of Factual and Counterfactual Rules

5.7 Qualitative Examples

This section shows qualitative examples from two real-world datasets, ECG5000 and Libras. Figure 10 presents an explanation of an instance from the ECG5000 dataset. ECG5000 contains \(5,\!000\) heartbeats belonging to five different classes, one corresponding Normal instances, and four corresponding to different kinds of Abnormal heartbeats. The instance we are trying to explain is correctly classified by rocket as Normal (Figure 10 top-left). By looking at the difference between exemplars and counterexemplars, we can clearly see that the main difference between normal and abnormal instances is in the rightmost part of the time series, which is much lower for abnormal time series, and presents an evident V-shape. More specifically, these abnormal series belong all to the Premature Ventricular Contraction class. The saliency map confirms the assessment deduced by the example-based explanation, highlighting only the last observations of the time series. The rules show the other main difference between classes. The factual rule, \({r_=}= \lbrace s_{221} \in X \wedge s_{264} \in X\rbrace \rightarrow {\it Normal}\), shows that normal instances contain subsequence \(s_{264}\), while the counterfactual rule, \({{r_\ne }} = \lbrace s_{221} \in X \wedge s_{264} \not\in X\rbrace \rightarrow {\it Premature Ventricular Contraction}\), shows that abnormal time series have a flatter shape, not containing \(s_{264}\). In general, we do not expect the saliency map and the rules to cover the same exact areas because the saliency map only highlights the parts of the time series that change the most between classes, whereas the rules can be based on subsequences emphasizing even a small shape change from any part of the time series.

Fig. 10.

In Figure 11, we present an explanation of a multivariate time series from the Libras dataset. This dataset contains instances having two signals each, belonging to 15 classes that correspond to different hand movements. The instance we are trying to explain is labeled as face-up curve, and it is correctly classified by rocket. The example-based part of the explanation shows exemplars and counterexemplars that are extremely similar to the naked eye. This means that even very small changes in the time series shape can result in a change of prediction from the black-box. In this sense, the closest instances belong all to class horizontal wavy. The most salient observations are highlighted in the saliency map and show that the top signal, and in particular the central sigmoidal part, is the most relevant for the change in classification. This is also confirmed by the factual and counterfactual rules, \({r_=}= \lbrace s_{412} \in {\bf x}_0 \wedge s_{1026} \not\in {\bf x}_0 \wedge s_{1855} \in {\bf x}_1 \wedge s_{2184} \in {\bf x}_1 \rbrace \rightarrow {\it face-up curve}\), \({{r_\ne }} = \lbrace s_{412} \in {\bf x}_0 \wedge s_{1026} \in {\bf x}_0 \wedge s_{1855} \in {\bf x}_1 \wedge s_{2184} \in {\bf x}_1 \rbrace \rightarrow {\it horizontal wavy}\). The rules show that the most significant subsequence is \(s_{1026}\), given that its presence/absence results in a change in class. Figure 11 (bottom-right) shows a counterexemplar which contains \(s_{1026}\), i.e., has a lower dip in the central sinusoidal pattern w.r.t. X.

Fig. 11.

6 Conclusions

We have presented lasts, a local model-agnostic subsequence-based explainer that returns an easy-to-understand explanation for univariate and multivariate time series classifiers. lasts succeeds in addressing the time series black-box outcome explanation problem, returning three different kinds of explanations: a saliency map, examples, and decision rules. The saliency map highlights the most important observations of the time series for the classification. Exemplar and counterexemplar instances can be compared to the time series to explain the black-box behavior. Finally, subsequence-based decision rules allow an understanding of the logic of the classification, showing the reasons for the outcome in terms of patterns that must and must not be contained. Extensive experimentation shows that lasts outperforms existing explainers in returning meaningful, useful, faithful, and coherent explanations.

The proposed method has some limitations. Indeed, the subsequences-based rules do not consider multiple alignments of the same shapelet at different time series points. On the other hand, multiple occurrences could help better explain a predictive phenomenon. Also, technical and conceptual extensions are possible. First, we would like to test lasts on longer and more complex, real-world time series datasets while also extending it to different types of sequential data like trajectories, text, and shopping transactions. Second, we would like to deepen the study of the relationship between the latent and subsequence spaces. Third, we aim to empower the explanations’ expressiveness and to enable higher levels of abstraction with grammar-based decision trees [58]. Fourth, we also aim to explain the overall logic of a time series classifier by aggregating the local subsequence-based rules into a global explanation model [84]. Finally, a human decision-making task driven by lasts explanations could objectively evaluate the real effectiveness of the explanations.

Footnotes

This work extends “Explaining Any Time Series Classifier” presented at the IEEE International Conference on Cognitive Machine Intelligence (CogMI) 2020 [37].

https://www.timeseriesclassification.com/

Code available at github.com/fspinna/lasts

⁴

We adopted Gaussian-matched sampling for this experiment for the cfs strategy. However, different sampling methods do not significantly affect performance.

A Notation

Table 7.

Data
\(\mathcal {X}, X, {\bf x}, x\)	time series dataset, instance, signal, observation
\({\bf y}, y, c\)	labels vector, value, number of unique labels
\(n, i\)	number of instances in a dataset, instance index
\(m, j\)	number of observations in a time series, feature index
\(d, k\)	number of signals in a time series, signal index
\(S, {\bf s}\)	collection of subsequences, subsequence
\(p, l\)	number, length of extracted subsequences
Models
\(f(\cdot)\)	generic function
\(b(\cdot)\)	black-box classifier
\(rm(\cdot)\)	rule-based classifier
\(dt(\cdot)\)	decision tree classifier
\({\hat{{\bf y}}}, {\hat{y}}\)	classifier prediction for a time series dataset, instance
Transform
\({\varsigma }(\cdot), T\)	shapelet or subsequence transform, transformed dataset
\(\tilde{{\bf x}}\)	SAX-transformed time series signal
\(w\)	number of intervals of PAA
\(\mathbb {A}\)	SAX alphabet
Autoencoder
\({g}(\cdot), {h}(\cdot)\)	encoder, decoder
\({q}\)	number of latent dimensions
\(\theta\)	latent vector scaling factor
\({\bf u}\)	randomly sampled normal Gaussian vector
\(Z, {\bf z}\)	latent encoding of a time series dataset, instance
\({\hat{\mathcal {X}}}, {\hat{X}}\)	autoencoding of a time series dataset, instance
Explanation
\(E, e\)	human-interpretable domain, explanation
\({\Phi }, {\phi }\)	saliency map, saliency value
\({Z_=}, {Z_\ne }\)	exemplar, counterexemplar time series encodings
\({\hat{\mathcal {X}}_=}, {\hat{\mathcal {X}}_\ne }\)	exemplar, counterexemplar time series dataset,
\({{\bf z}_\ne }, {\hat{X}_\ne }\)	best counterfactual latent vector, time series
\({r_=}, {{r_\ne }}\)	factual, counterfactual rule

Table 7. Summary of Notation

References

[1]

Amina Adadi and Mohammed Berrada. 2018. Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI). IEEE Access 6 (2018), 52138–52160.

Abstract

1 Introduction

2 Related Work

3 Setting the Stage

3.1 Time Series

3.2 Types of Explanations

3.3 Autoencoders in XAI

3.4 Neighborhood Generation in XAI

4 Local Agnostic Time Series Explainer

4.1 Latent Encoding

4.2 Counterfactual Search

4.3 Neighborhood Generation

4.4 Subsequence Extraction

4.5 Local Surrogate

4.6 Explanation

5 Experiments

5.1 Datasets and Black-box Models

5.2 Implementation Details

5.3 LASTS Framework Alternatives Analysis

5.4 Instance-based Explanation Experiments

5.5 Saliency-based Explanation Experiments

5.6 Rule-based Explanation Benchmarks

5.7 Qualitative Examples

6 Conclusions

Footnotes

A Notation

References

Cited By

Index Terms

Recommendations

Robust explainer recommendation for time series classification

Denoising Optimization-Based Counterfactual Explanations for Time Series Classification

A Model-Agnostic Approach to Quantifying the Informativeness of Explanation Methods for Time Series Classification

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations