Time Series Classification and Extrinsic Regression are important and challenging machine learning tasks. Deep learning has revolutionized natural language processing and computer vision and holds great promise in other fields such as time series analysis where the relevant features must often be abstracted from the raw data but are not known a priori. This article surveys the current state of the art in the fast-moving field of deep learning for time series classification and extrinsic regression. We review different network architectures and training methods used for these tasks and discuss the challenges and opportunities when applying deep learning to time series data. We also summarize two critical applications of time series classification and extrinsic regression, human activity recognition and satellite earth observation.
1 Introduction
Time series analysis has been identified as one of the 10 most challenging research issues in the field of data mining in the 21st century [1]. Time series classification (TSC) is a key time series analysis task [2]. TSC builds a machine learning model to predict categorical class labels for data consisting of ordered sets of real-valued attributes. The many applications of time series analysis include human activity recognition [3, 4, 5], diagnosis based on electronic health records [6, 7], and systems monitoring problems [8]. The wide variety of dataset types in the University of California, Riverside (UCR) [9] and University of East Anglia (UEA) [8] benchmark archives further illustrates the breadth of TSC applications. Time series extrinsic regression (TSER) [10] is the counterpart of TSC for which the output is numeric rather than categorical. Note that TSER is not a forecasting method; rather, it models the relationship between a time series and a variable external to it. TSER is an emerging field with potential applications across a wide range of domains.
Deep learning has been very successful, especially in computer vision and natural language processing, and many modern applications integrate it. Deep learning can autonomously learn informative features from raw data, eliminating the need for manual feature engineering. Consequently, there has been much interest in developing deep learning methods for TSC and TSER because of their ability to learn relevant latent feature representations. It is worth noting that the majority of TSC and TSER research has focused on non-deep-learning approaches. A recent benchmark [11] shows that the deep learning method InceptionTime [12] is competitive but does not outperform the state of the art on benchmarking archives. One reason is that the popular UCR and UEA benchmarking archives were not designed for deep learning models. In particular, they are relatively small, while deep learning often excels when data quantities are large. Deep learning models also benefit from strong compatibility with current hardware, particularly GPUs, leading to fast and efficient execution. Their scalability further allows them to handle growing data volumes and computational complexity, which makes them well suited to processing large datasets. Indeed, ConvTran [13], a recent deep architecture for TSC, outperforms one of the fastest conventional models, ROCKET [14], in terms of both speed and accuracy when there are more than 10k training samples.
A highly influential review paper on deep-learning-based TSC [15] was published in 2019. However, the field of research is very fast moving, and that prior survey does not cover the current state of the art. For example, it does not include InceptionTime [12], a system that consistently outperforms ResNet [16], the best performing system from the prior survey. Nor does it cover attention models, which have received huge interest in recent years and have shown excellent capacity to model long-range dependencies in sequential data and are well suited for time series modeling [17]. Many attention variants have been proposed to address particular challenges in time series modeling and have been successfully applied to TSC [13, 18, 19]. Moreover, the previous survey does not include self-supervised learning, which is emerging as a new paradigm [20]. Self-supervised learning induces supervision by designing pretext tasks instead of relying on predefined prior knowledge and has shown very promising results, especially in datasets with a low label regime [21, 22, 23, 24].
In light of the emergence of attention mechanisms, self-supervised learning, and various new network configurations for TSC, a systematic and comprehensive survey on deep learning in TSC would greatly benefit the time series community. This article aims to fill that gap by summarizing recent developments in deep-learning-based time series analytics, specifically TSC and TSER. Following definitions and a brief introduction to the time series classification and extrinsic regression tasks, we propose a new taxonomy based on various methodological perspectives. Diverse architectures, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), Graph Neural Networks (GNNs), and attention-based models, are discussed, along with refinements made to improve performance. Additionally, various types of self-supervised learning pretexts, such as contrastive learning and self-prediction, are explored. We also conduct a review of useful data augmentation and transfer learning strategies for time series data. Furthermore, we provide a summary of two key applications of TSC and TSER, namely Human Activity Recognition and Earth Observation.
2 Background and Definitions
This section provides the definitions and background needed to understand how deep neural networks (DNNs) are trained for TSC and TSER tasks. We begin by defining key terms and concepts, such as time series data and time series supervised learning. We then present our proposed taxonomy of the different deep learning methods that have been used for TSC and TSER tasks.
2.1 Time Series
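Time series data are sequences of data points indexed by time. A time series \(X\) of length \(L\) can be written as an ordered sequence \(X = (x_1, x_2, \ldots, x_L)\), where \(x_i\) is the observation at the \(i\)-th time step.
Each \(x_i\) is a \(D\)-dimensional vector of values, one for each feature captured in the series. When \(D=1,\) the series is called univariate. When \(D\gt 1,\) the series is called multivariate.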
2.2 Time Series Supervised Learning Tasks
This article focuses on two time series learning tasks: time series extrinsic regression and time series classification. Classification and regression are both supervised learning tasks that learn the relationship between a target variable and a set of time series. We consider learning from a dataset \(D=\left\lbrace (X_1,Y_1), (X_2,Y_2), \ldots ,(X_N,Y_N)\right\rbrace\) of \(N\) time series where \(Y_i\) denotes the target variable for each \(X_i\). It is important to note that for ease of exposition, we assume in our discussion that the series are of the same length, but most methods extend trivially to the case of unequal-length series. The main difference between TSER and TSC is that TSC predicts a categorical value for a time series from a set of finite categories, while TSER predicts a continuous value for a variable external to the input time series. Typically \(Y_i\) is a one-hot encoded vector for TSC or a numeric value for TSER.
In the context of deep learning, a supervised learning model is a neural network that applies a composition of layer functions to map the input time series to a target variable. For a network with \(n\) layers and output \(\hat{Y}\), this composition can be written as
\[\hat{Y} = f_n(\theta _n, f_{n-1}(\theta _{n-1}, \ldots f_2(\theta _2, f_1(\theta _1, X)) \ldots )),\]
where \(f_i\) represents the non-linear function and \(\theta _i\) denotes the parameters at layer \(i\). For TSC, the neural network model is trained to map each time series in a dataset \(D\) to one of \(C\) class labels. After training, the neural network outputs a vector of \(C\) values that estimates the probability of a series \(X\) belonging to each class. This is typically achieved using the softmax activation function in the final layer of the neural network. The softmax function produces class probabilities that always sum to 1 across all classes. The cross-entropy loss is commonly used for training classification networks with softmax outputs.
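The softmax output and cross-entropy loss described above can be illustrated with a minimal NumPy sketch. The logits and one-hot targets below are made-up values for two series and three classes, and the function names are chosen for illustration only.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores of shape (N, C) into probabilities that sum to 1 per row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, one_hot_targets, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and one-hot labels."""
    return float(-np.mean(np.sum(one_hot_targets * np.log(probs + eps), axis=1)))

# Hypothetical final-layer outputs for N=2 series and C=3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
targets = np.array([[1, 0, 0],
                    [0, 0, 1]])

probs = softmax(logits)
loss = cross_entropy(probs, targets)
predicted_classes = probs.argmax(axis=1)
```

The predicted class is simply the index of the largest probability, so the softmax changes the scale of the outputs but not their ranking.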
On the other hand, TSER trains the neural network model to map a time series dataset \(D\) to a set of numeric values \(Y\). Instead of outputting probabilities, a regression neural network outputs a numerical value for the time series. It is typically used with a linear activation function in the final layer of the neural network. However, any non-linear functions with a single-value output such as sigmoid or ReLU can also be used. A regression neural network typically trains using the mean square error or mean absolute error loss function. However, depending on the distribution of the target variable and the choice of final activation functions, other loss functions can be used.
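For the regression case, the two loss functions named above can be sketched as follows. The target and prediction values are invented for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared residuals; large errors dominate the loss."""
    return float(np.mean((y_true - y_pred) ** 2))

def mean_absolute_error(y_true, y_pred):
    """Average of absolute residuals; less sensitive to a single large error."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical extrinsic targets and network outputs for three series.
y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```

In this example the single residual of 2 contributes 4 to the squared sum but only 2 to the absolute sum, which is one reason the choice of loss may depend on the distribution of the target variable.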
2.3 TSC and TSER
TSC is a fast-growing field, with hundreds of papers being published every year [8, 9, 15, 25, 26]. The majority of works in TSC are non–deep learning based. In this survey, we focus on deep learning approaches and refer interested readers to Appendix A and benchmark papers [11, 25, 26] for more details on non–deep learning approaches. Most deep learning approaches to TSC have real-valued outputs that are mapped to a class label. TSER [10, 27] is a less widely studied task in which the predicted values are numeric, rather than categorical. While the majority of the architectures covered in this survey were designed for TSC, it is important to note that it is trivial to adapt most of them for TSER.
Deep-learning-based TSC methods can be classified into two main types: generative and discriminative [28]. In the TSC community, generative methods are often considered model based [25], aiming to understand and model the joint probability distribution of input series \(X\) and output labels \(Y\), denoted as \(p(X, Y)\). On the other hand, discriminative models focus on modeling the conditional probability of output labels \(Y\) given input series \(X\), expressed as \(p(Y | X)\).
Generative models, such as the Stacked Denoising Auto-encoders (SDAEs), have been proposed by Bengio et al. [29] to identify the salient structure of input data distributions, and Hu et al. [30] used the same model for the pre-training phase before training a classifier for time series tasks. A universal neural network encoder has been developed to convert variable-length time series to a fixed-length representation [31]. Also, a Deep Belief Network (DBN) combined with a transfer learning method was used in an unsupervised manner to model the latent features of time series [32]. An Echo State Network (ESN) has been used to learn the appropriate time series representation by reconstructing the original raw time series prior to training the classifier [33]. Generative Adversarial Networks (GANs) are one of the popular generative models that generate new examples by learning to discriminate between real and synthetic examples. Various GANs have been developed for time series and have been reviewed in a recent survey [34]. Often, implementing generative methods is more complex due to an additional step of training. Furthermore, generative methods are typically less efficient than discriminative methods, which directly map raw time series to class probability distributions. Due to these barriers, researchers tend to focus on discriminative methods. Therefore, this survey mainly focuses on the end-to-end discriminative approaches.
2.4 Taxonomy of Deep Learning in TSC and TSER
To provide an organized summary of the existing deep learning models for TSC, we propose a taxonomy that categorizes these models based on deep learning methods and application domains. This taxonomy is illustrated in Figure 1. In Section 3, we review various network architectures used for TSC, including MLPs, CNNs, RNNs, GNNs, and attention-based models. We also discuss refinements made to these models to improve their performance on time series tasks. Additionally, various types of self-supervised learning pretexts, such as contrastive learning and self-prediction, are explored in Section 4. We also conduct a review of useful data augmentation and transfer learning strategies for time series data in Sections 5 and 6. In addition to methods, we summarize key applications of TSC and TSER in Section 7 of this article. These applications include human activity recognition and satellite earth observation, which are important and challenging tasks that can benefit from the use of deep learning models. Overall, our proposed taxonomy and the discussions in these sections provide a comprehensive overview of the current state of the art in deep learning for time series analysis and outline future research directions.
Fig. 1. Proposed taxonomy of deep learning methods for TSC and TSER.
3 Supervised Models
This section reviews the deep-learning-based models for TSC and discusses their architectures by highlighting their strengths as well as limitations. More details on deep model architectures and their adaptations to time series data are available in Appendix B.
3.1 Multi-Layer Perceptron (MLP)
The most straightforward neural network architecture is a fully connected network, also called an MLP. The number of layers and neurons are defined as hyperparameters in MLP models. However, studies such as auto-adaptive MLP [35] have attempted to determine the number of neurons in the hidden layers automatically, based on the nature of the training time series data. This allows the network to adapt to the training data’s characteristics and optimize its performance on the task at hand.
One of the main limitations of using MLPs for time series data is that they are not well suited to capturing the temporal dependencies in this type of data. MLPs are feedforward networks that process input data in a fixed and predetermined order without considering the temporal relationships between the input values. Various studies used MLPs alongside other feature extractors like Dynamic Time Warping (DTW) to address this problem [36, 37]. DTW-NN is a feedforward neural network that exploits DTW’s elastic matching ability to dynamically align a layer’s inputs to the weights instead of using a fixed and predetermined input-to-weight mapping. This weight alignment replaces the standard dot product within a neuron with DTW. In this way, the DTW-NN is able to tackle difficulties with time series recognition, such as temporal distortions and variable pattern length within a feedforward architecture [37]. Similarly, Symbolic Aggregate Approximation (SAX) is used to transform time series into a symbolic representation and produce sequences of words based on the symbolic representation [38]. The symbolic time-series-based words are later used as input for training a two-layer MLP for classification.
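To make the elastic matching idea concrete, the following is a minimal sketch of the classic DTW distance computed by dynamic programming. It is plain DTW between two univariate series, not the DTW-NN layer of [37], and the example series are invented.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with squared point-wise cost and no warping window."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible alignment moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])  # same pattern, delayed by one step

dist_shifted = dtw_distance(a, b)
dist_offset = dtw_distance(a, a + 1.0)
```

Because the first point of `a` can be matched to the first two points of `b`, the delayed copy is aligned at zero cost even though the two series differ in length, whereas a vertical offset cannot be removed by warping.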
Although the models mentioned above attempt to address the inability of MLPs to capture temporal dependencies, they remain limited in capturing time-invariant features [16]. Additionally, MLP models cannot process input data in a hierarchical or multi-scale manner. Time series data often exhibit patterns and structures at different scales, such as long-term trends and short-term fluctuations. MLP models fail to capture these patterns, as they can only process input data as a single, fixed-length representation. In addition, MLPs may encounter difficulties when confronted with irregularly sampled time series data, where observations are not uniformly recorded in time. Other deep learning models, such as RNNs, CNNs, and transformers, are specifically designed to capture the temporal dependencies and patterns in time series data and are therefore better suited to it.
3.2 CNN-based Models
Several improvements have been made to CNN since the success of AlexNet in 2012 [39], such as using deeper networks, applying smaller and more efficient convolutional filters, adding pooling layers to reduce the dimensionality of the feature maps, and utilizing batch normalization to improve the stability of training [40]. They have been demonstrated to be very successful in many domains, such as computer vision, speech recognition, and natural language processing problems [40, 41, 42]. As a result of the success of CNN architectures in these various domains, researchers have also started adopting them for TSC. See Table 1 for a list of reviewed CNN models in this article.
Table 1. Summary of CNN Models for Time Series Classification and Extrinsic Regression
3.2.1 Adapted CNNs for TSC and TSER.
This section presents the first category, which we refer to as Adapted CNNs for TSC and TSER. The papers discussed here are mostly adaptations without any particular preprocessing or mathematical characteristics, such as transforming the series to an image or using multi-scale convolution, and therefore do not fit into one of the other categories.
The first CNN for TSC was the Multi-Channel Deep Convolutional Neural Network (MC-DCNN) [43]. It handles multivariate data by independently applying convolutions to each input channel. Each input dimension undergoes two convolutional stages with ReLU activation, followed by max pooling. The output from each dimension is concatenated and passed to a fully connected layer, which is then fed to a final softmax classifier for classification. Similar to MC-DCNN, a three-layer convolution neural network was proposed for human activity recognition (MC-CNN) [44]. Unlike the MC-DCNN, this model applies 1D convolutions to all input channels simultaneously to capture the temporal and spatial relationships in the early stages. The two-stage version of the MC-CNN architecture was used by Zhao et al. [45] on the earliest version of the UCR Time Series Data Mining Archive. The authors also conducted an ablation study to evaluate the performance of the CNN models with differing numbers of convolution filters and pooling types.
Fully Convolutional Networks (FCN) [46] and Residual Network (ResNet) [47] are two deep neural networks that are commonly used for image and video recognition tasks and have been adapted for end-to-end TSC [16]. FCNs are a variant of CNNs designed to operate on inputs of arbitrary size rather than being constrained to fixed-size inputs like traditional CNNs. This is achieved by replacing the fully connected layers in a traditional CNN with a Global Average Pooling (GAP) [46]. FCN was adapted for univariate TSC [16], and similar to the original model, it contains three convolution blocks where each block contains a convolution layer followed by batch normalization and ReLU activation. Each block uses 128, 256, and 128 filters with 8, 5, and 3 filter lengths, respectively. The output from the last convolution block is averaged with a GAP layer and passed to a final softmax classifier. The GAP layer has the property of reducing the spatial dimensions of the input while retaining the channel-wise information, which allows it to be used in conjunction with a class activation map (CAM) [48] to highlight the regions in the input that are most important for the predicted class. This can provide useful insights into how the network is making its predictions and help identify potential improvement areas. Similar to FCN, the ResNet was also proposed in [16] for univariate TSC. ResNet is a deep architecture containing three residual blocks followed by a GAP layer and a softmax classifier. It uses residual connections between blocks to reduce the vanishing gradient effect that affects deep learning models. The structure of each residual block is similar to the FCN architecture, containing three convolution layers followed by batch normalization and ReLU activation. Each convolution layer uses 64 filters with 8, 5, and 3 filter lengths, respectively. ResNet was found to be one of the most accurate deep learning TSC architectures on 85 univariate TSC datasets [15, 25]. 
Additionally, integration of ResNet and FCN has been proposed to combine the strength of both networks [49].
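The relationship between the GAP layer and the class activation map can be sketched in a few lines of NumPy. The feature maps and classifier weights below are random stand-ins for a trained network, the sizes are hypothetical, and the bias term is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, C = 50, 4, 3               # series length, feature maps, classes (hypothetical sizes)
features = rng.random((T, K))    # stand-in for the output of the last convolution block
W = rng.normal(size=(K, C))      # stand-in for the weights feeding the softmax layer

gap = features.mean(axis=0)      # GAP: average each feature map over time -> shape (K,)
logits = gap @ W                 # class scores -> shape (C,)
predicted = int(np.argmax(logits))

# CAM: weight each feature map by its contribution to the predicted class,
# giving one importance value per time step -> shape (T,)
cam = features @ W[:, predicted]
```

Under these assumptions, averaging the CAM over time recovers the score of the predicted class, which is why the map can be read as a per-time-step decomposition of the prediction.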
In addition to adapting the network architecture, some research has focused on modifying the convolution kernel to suit TSC tasks better. Dilated convolution neural networks (DCNNs) [50] are a type of CNN that uses dilated convolutions to increase the receptive field of the network without increasing the number of parameters. Dilated convolutions create gaps between elements of the kernel and perform convolution, thereby covering a larger area of the input. This allows the network to capture long-range dependencies in the data, making it well suited to TSC tasks [51]. Recently, Disjoint-CNN [52] showed that factorization of 1D convolution kernels into disjoint temporal and spatial components yields accuracy improvements with almost no additional computational cost. Applying disjoint temporal convolution and then spatial convolution behaves similarly to the Inverted Bottleneck [53]. Like the Inverted Bottleneck, the temporal convolutions expand the number of input channels, and spatial convolutions later project the expanded hidden state back to the original size to capture the temporal and spatial interaction.
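The effect of dilation on the receptive field can be seen in a small sketch. The function below is an illustrative, unpadded 1D dilated convolution (implemented as cross-correlation, as is common in deep learning libraries) and is not taken from any of the cited models.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Unpadded 1D convolution with gaps of `dilation` steps between kernel taps."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive field of one output value
    out_len = len(x) - span + 1
    return np.array([
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(out_len)
    ])

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])         # three parameters in both cases

y_dense = dilated_conv1d(x, kernel, dilation=1)    # each output covers 3 consecutive steps
y_dilated = dilated_conv1d(x, kernel, dilation=3)  # each output covers a span of 7 steps
```

With the same three weights, the dilated kernel reads time steps 0, 3, and 6 for its first output, so the span grows from 3 to 7 without adding parameters.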
3.2.2 Imaging Time Series.
In TSC, a common approach is to convert the time series data into a fixed-length representation, such as a vector or matrix, which can then be input to a deep learning model. However, this can be challenging for time series data that vary in length or have complex temporal dependencies. One solution to this problem is to represent the time series data in an image-like format, where each time step is treated as a separate channel in the image. This allows the model to learn from the spatial relationships within the data rather than just the temporal relationships. In this context, the term spatial refers to the relationships between different variables or features within a single time step of the time series.
As an alternative to using raw time series data as input, Wang and Oates encoded univariate time series data into different types of images that were then processed by a regular CNN [54]. This image-based framework initiated a new branch of deep learning approaches for time series, which consider image transformation as one of the feature engineering techniques. Wang and Oates presented two approaches for transforming a time series into an image. The first generates a Gramian Angular Field (GAF), while the second generates a Markov Transition Field (MTF). GAF represents time series data in polar coordinates and uses various operations to convert the resulting angles into a symmetric matrix, and MTF encodes the matrix entries using the transition probability of a data point from one time step to another [54]. In both cases, the image generation increases the time series size, making the images potentially prohibitively large. Therefore, they propose strategies to reduce their size without losing too much information. Afterward, the two types of images are combined into a two-channel image, which yields better results than using each image separately. Finally, a Tiled CNN model is applied to classify the time-series images. In other studies, a variety of transformation methods, including Recurrence Plots (RPs) [55], Gramian Angular Difference Field (GADF) [56], bilinear interpolation [57], and Gramian Angular Summation Field (GASF) [58] have been proposed to transform time series into input images, expecting that the 2D images could reveal features and patterns not found in the 1D sequence of the original time series.
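As an illustration of the GAF encoding, the sketch below rescales a univariate series to \([-1, 1]\), maps each value to an angle, and builds the summation (GASF) and difference (GADF) variants. It assumes a non-constant input series and omits the size-reduction step mentioned above; the function name is chosen here for illustration.

```python
import numpy as np

def gramian_angular_field(x, kind="summation"):
    """Encode a 1D series of length T as a T x T image (assumes x is not constant)."""
    x = np.asarray(x, dtype=float)
    # Rescale to [-1, 1] so that arccos is defined for every value.
    x_scaled = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    x_scaled = np.clip(x_scaled, -1.0, 1.0)   # guard against rounding error
    phi = np.arccos(x_scaled)                 # angular (polar) representation
    if kind == "summation":
        return np.cos(phi[:, None] + phi[None, :])   # GASF
    return np.sin(phi[:, None] - phi[None, :])       # GADF

x = np.sin(np.linspace(0.0, 2.0 * np.pi, 8))
gasf = gramian_angular_field(x, "summation")
gadf = gramian_angular_field(x, "difference")
```

A series of length 8 becomes an 8 x 8 image, which shows why the encoding grows quadratically with series length. The summation variant is symmetric, while the difference variant changes sign under transposition.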
Hatami et al. [55] propose a representation method based on RPs [59] to convert time series into 2D images that are then classified with a CNN model. In their study, time series are regarded as exhibiting distinct recurrent behaviors such as periodicities and irregular cyclicities, which are typical phenomena of dynamic systems. The main idea of the RP method is to reveal at which points trajectories return to a previous state. Two convolution stages and two fully connected layers are then applied to classify the images generated by RP. In a subsequent study [56], a pre-trained Inception v3 [60] was used to map GADF images into a 2,048-dimensional vector space, and the final stage used an MLP with three hidden layers, followed by a softmax activation function. Following the same framework, Chen and Shi [61] adopted the Relative Position Matrix and VGGNet (RPMCNN) to classify time series data using transformed 2D images. Their results showed promising performance when univariate time series data were converted to 2D images using relative positions between two timestamps. Following the convention, three image encoding methods, GASF, GADF, and MTF, were used to encode multivariate time series (MTS) data into 2D images [58]. That study showed that a simple ConvNet structure is sufficient for classification, as it performed as well as the more complex VGGNet.
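A recurrence plot can be sketched as a pairwise distance matrix between delay-embedded states of the series. The version below is a generic illustration, not the exact procedure of [55]; the embedding dimension, delay, and threshold are hypothetical parameters, and leaving the threshold unset returns the raw distance image.

```python
import numpy as np

def recurrence_plot(x, dim=2, delay=1, threshold=None):
    """Pairwise distances between delay-embedded states; binarized if a threshold is given."""
    x = np.asarray(x, dtype=float)
    n_states = len(x) - (dim - 1) * delay
    # Row i is the state (x[i], x[i + delay], ..., x[i + (dim - 1) * delay]).
    states = np.stack([x[k * delay: k * delay + n_states] for k in range(dim)], axis=1)
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=2)
    if threshold is None:
        return dists
    return (dists <= threshold).astype(int)   # 1 where the trajectory revisits a state

x = np.sin(np.linspace(0.0, 4.0 * np.pi, 20))
rp = recurrence_plot(x, dim=2, delay=1, threshold=0.3)
```

Entries equal to 1 away from the main diagonal mark pairs of time steps at which the trajectory returns close to an earlier state, which is the structure a CNN is then expected to pick up.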
Overall, representing time series data as 2D images can be difficult because preserving the temporal relationships and patterns in the data can be challenging. This transformation can also result in a loss of information, making it difficult for the model to classify the data accurately. Chen and Shi [61] have also shown that the specific transformation methods like GASF, GADF, and MTF used in this process do not significantly improve the prediction outcome.
3.2.3 Multi-scale Operation.
The papers discussed here apply a multi-scale convolutional kernel to the input series or apply regular convolutions on the input series at different scales. Multi-scale CNNs (MCNN) [62] and Time LeNet (t-LeNet) [63] were considered the first models that preprocess the input series to apply convolution on multi-scale series rather than raw series. The designs of both MCNNs and t-LeNet were inspired by computer vision models, which means that they were adapted from models originally developed for image recognition tasks. These models may not be well suited to TSC tasks and may not perform as well as models specifically designed for this purpose. One potential reason for this is the use of progressive pooling layers in these models, commonly used in computer vision models, to reduce the input data size and make it easier to process. However, these pooling layers may not be as effective when applied to time series data and may limit the performance of the model.
MCNN has a simple architecture and comprises two convolutions and a pooling layer, followed by a fully connected and softmax layer. However, this approach involves heavy data preprocessing. Specifically, before any training, a sliding window is used to extract time series subsequences, and each subsequence then undergoes three transformations: (1) identity mapping, (2) down-sampling, and (3) smoothing, which results in the transformation of a univariate input time series into a multivariate one. Finally, the transformed output is fed to the CNN model to train a classifier [62]. t-LeNet uses two data augmentation techniques, window slicing (WS) and window warping (WW), to prevent overfitting [63]. The WS method is identical to MCNN’s data augmentation. The second data augmentation technique, WW, employs a warping technique that squeezes or dilates the time series. WS is also adopted to ensure that subsequences of the same length are extracted for training the network to deal with multi-length time series. Therefore, a given input time series of length \(L\) is first dilated \((\times 2)\) and then squeezed \((\times 1/2)\) using WW, resulting in three time series (the original and its two warped versions) of lengths \(L\), \(2L\), and \(L/2\) that are fed to WS to extract equal-length subsequences for training. Finally, as both MCNN and t-LeNet predict a class for each extracted subsequence, majority voting is applied to obtain the class prediction for the full time series.
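The preprocessing and voting steps described above can be sketched as follows. The helper names, window length, stride, down-sampling rate, and smoothing width are all hypothetical choices for illustration and are not the settings used in [62] or [63].

```python
from collections import Counter

import numpy as np

def window_slices(x, window, stride=1):
    """Window slicing: extract equal-length subsequences with a sliding window."""
    return [x[s:s + window] for s in range(0, len(x) - window + 1, stride)]

def downsample(x, rate):
    """Keep every `rate`-th point to view the series at a coarser time scale."""
    return x[::rate]

def smooth(x, width):
    """Moving average to suppress high-frequency noise."""
    return np.convolve(x, np.ones(width) / width, mode="valid")

def majority_vote(labels):
    """Series-level prediction from the per-subsequence predictions."""
    return Counter(labels).most_common(1)[0][0]

x = np.arange(12, dtype=float)
slices = window_slices(x, window=8, stride=2)       # windows starting at steps 0, 2, 4
branches = [(s, downsample(s, 2), smooth(s, 3)) for s in slices]
series_label = majority_vote(["a", "b", "a"])       # made-up per-slice predictions
```

Each slice yields three views (identity, down-sampled, smoothed) of different lengths, which corresponds to the multi-branch input described for MCNN.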
Inception was first proposed by Szegedy et al. [70] for end-to-end image classification. The network has since evolved into Inception-v4, in which Inception was coupled with residual connections to further improve performance [71]. Inspired by the Inception architecture, a multivariate convolutional neural network (MVCNN) was designed using multi-scale convolution kernels to find the optimal local construction [64]. MVCNN uses three scales of filters, \(2\times 2\), \(3 \times 3\), and \(5 \times 5\), to extract features of the interaction between sensors. A 1D Inception model was used for supernovae classification using the light flux of a region in space as an input MTS for the network [65]. However, the authors restricted their design to the first version of the Inception architecture [70]. The Inception-ResNet [72] architecture includes convolutional layers, followed by Inception modules and residual blocks. The Inception modules are used to learn multiple scales and aspects of the data, allowing the network to capture more complex patterns. The residual blocks are then used to learn the residuals, or differences, between the input and output of the network, improving its performance.
InceptionTime [12] explores much larger filters than any previously proposed network for TSC to reach state-of-the-art performance on the UCR benchmark. InceptionTime is an ensemble of five randomly initialized inception network models, each of which consists of two blocks of inception modules. Each inception module first reduces the dimensionality of a multivariate time series using a bottleneck layer with a length and stride of 1 while maintaining the same length. Then, 1D convolutions of different lengths are applied to the output of the bottleneck layer to extract patterns at different sizes. In parallel, a max pooling layer followed by a bottleneck layer are also applied to the original time series to increase the robustness of the model to small perturbations. The outputs from the convolution and max pooling layers are stacked to form a new multivariate time series, which is then passed to the next layer. Residual connections are used between each inception block to reduce the vanishing gradient effect. The output of the second inception block is passed to a GAP layer before feeding into a softmax classifier.
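The data flow through a single inception module, as described above, can be sketched in NumPy. This is a shape-level illustration only: the weights are random rather than learned, normalization and activation functions are omitted, and the bottleneck size, filter count, and kernel lengths are hypothetical values rather than the published configuration.

```python
import numpy as np

def conv1d_same(x, kernels):
    """x: (T, C_in); kernels: (k, C_in, C_out). Zero-padded so the output keeps length T."""
    k = kernels.shape[0]
    left = (k - 1) // 2
    xp = np.pad(x, ((left, k - 1 - left), (0, 0)))
    return np.stack([np.einsum("kc,kco->o", xp[t:t + k], kernels)
                     for t in range(x.shape[0])])

def max_pool_same(x, size=3):
    """Stride-1 max pooling over time that keeps length T."""
    left = (size - 1) // 2
    xp = np.pad(x, ((left, size - 1 - left), (0, 0)), constant_values=-np.inf)
    return np.stack([xp[t:t + size].max(axis=0) for t in range(x.shape[0])])

def inception_module(x, rng, bottleneck=4, n_filters=4, lengths=(10, 20, 40)):
    c_in = x.shape[1]
    # Bottleneck: length-1, stride-1 convolution that reduces the number of channels.
    reduced = conv1d_same(x, rng.normal(size=(1, c_in, bottleneck)))
    # Parallel convolutions of different lengths on the bottleneck output.
    branches = [conv1d_same(reduced, rng.normal(size=(k, bottleneck, n_filters)))
                for k in lengths]
    # Parallel branch: max pooling on the original input, then a bottleneck.
    pooled = max_pool_same(x)
    branches.append(conv1d_same(pooled, rng.normal(size=(1, c_in, n_filters))))
    # Stack all branches into a new multivariate series.
    return np.concatenate(branches, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 6))          # hypothetical series: 64 time steps, 6 channels
out = inception_module(x, rng)
```

The output keeps the original length of 64 while the channel dimension becomes four branches of four filters each, so the module can be stacked repeatedly.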
The strong performance of InceptionTime has inspired a number of extensions. Like InceptionTime, EEG-inception [66] uses several inception layers and residual connections as its backbone. It additionally proposes noise-addition-based data augmentation of electroencephalogram (EEG) signals, which increases the average accuracy. InceptionFCN [67] combines two well-known deep learning techniques, namely the Inception module and the Fully Convolutional Network. KDCTime [68] introduced label smoothing (LSTime) and knowledge distillation (KDTime) for InceptionTime; the latter generates soft labels automatically while compressing the inference model. Additionally, knowledge distillation with calibration (KDC) in KDCTime offers two calibrating strategies: KDC by translating (KDCT) and KDC by reordering (KDCR). LITE [69] addresses InceptionTime’s complexity while preserving its TSC performance. Utilizing DepthWise Separable Convolutions, LITE incorporates multiplexing, dilated convolution, and custom filters [73] to enhance efficiency.
3.3 Recurrent Neural Network
RNNs are types of neural networks built with internal memory to work with time series and sequential data. Conceptually similar to feed-forward neural networks (FFNs), RNNs differ in their ability to handle variable-length inputs and produce variable-length outputs.
RNNs for TSC were proposed in [74], where input series are classified based on their dynamic behavior. The authors used a sequence-to-sequence architecture in which each sub-series of the input series is classified in the first step. The argmax function is then applied to the entire output, and the neuron with the highest rate specifies the classification result. To improve model parallelization and capacity, a two-layer RNN was proposed in [75]. In the first layer, the input sequence is split across several independent RNNs to improve parallelization, followed by a second layer that uses the first layer’s output to capture long-term dependencies [75]. Further, RNNs have been used in some hierarchical architectures [76, 77]. Hermans and Schrauwen showed that a deeper version of RNNs could perform hierarchical processing on complex temporal tasks and capture the time series structure more naturally than a shallow version [77]. RNNs are usually trained iteratively using a procedure known as backpropagation through time (BPTT). When unfolded in time, RNNs look like very deep networks with shared parameters. Because weights are shared across RNN cells, the gradients are summed up at each time step to train the model. Thus, gradients undergo repeated matrix multiplication due to the chain rule and either shrink exponentially toward very small values, known as vanishing gradients, or blow up to very large values, referred to as exploding gradients [78]. These problems motivated the development of gated recurrent architectures, namely long short-term memory (LSTM) [79] and the Gated Recurrent Unit (GRU) [80].
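The vanishing and exploding gradient behavior can be reproduced with a deliberately simplified sketch: a linear recurrence whose weight matrix is a scaled orthogonal matrix, so every backward step multiplies the gradient norm by exactly the chosen scale. Real RNNs also include the derivative of the activation function, which this illustration ignores.

```python
import numpy as np

def gradient_norm_through_time(scale, steps=50, size=8, seed=0):
    """Norm of a gradient after being propagated back through `steps` time steps."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(size, size)))   # random orthogonal matrix
    W = scale * q                                         # all singular values equal `scale`
    grad = np.ones(size)
    for _ in range(steps):
        grad = W.T @ grad                                 # one step of the chain rule
    return float(np.linalg.norm(grad))

vanishing = gradient_norm_through_time(scale=0.9)
exploding = gradient_norm_through_time(scale=1.1)
```

After 50 steps the gradient norm has changed by a factor of `scale ** 50`, so a scale slightly below 1 drives it toward zero and a scale slightly above 1 makes it grow by orders of magnitude.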
3.3.2 Long Short-Term Memory (LSTM).
LSTM addresses the common vanishing/exploding gradient issue in vanilla RNNs by integrating memory cells with gate control into their state dynamics [79]. By design, LSTM is suited to problems involving sequence data, such as language translation [81], video representation learning [82], and image caption generation [83]. The TSC problem is no exception and mainly adopts a model similar to that used for language translation [81]. Sequence-to-Sequence with Attention (S2SwA) [84] incorporates two LSTMs, one encoder and one decoder, in a sequence-to-sequence fashion for TSC. In this model, the encoder LSTM accepts input time series of arbitrary lengths and extracts information from the raw data based on which the decoder LSTM constructs fixed-length sequences that can be regarded as automatically extracted features for classification.
3.3.3 Gated Recurrent Unit (GRU).
GRU, another widely used variant of RNNs, shares similarities with LSTM in its ability to control information flow and memorize context across multiple time steps [80]. Similar to S2SwA [84], a sequence auto-encoder (SAE) based on GRU has been defined to deal with the TSC problem [85]. A fixed-size output is produced by processing inputs of various lengths using GRUs as the encoder and decoder. The model’s accuracy was also improved by pre-training the parameters on massive unlabeled data.
3.3.4 Hybrid Models.
CNNs and RNNs are often combined for TSC because they have complementary strengths. As mentioned previously, CNNs are well suited for learning from spatial relationships in data, such as the patterns and correlations between the channels of different time steps in a time series. This allows them to learn useful features from the time series data that can help improve the classification performance. RNNs, on the other hand, are well suited for learning from temporal dependencies in data, such as the past values of a time series that can help predict its future values. This allows them to capture the dynamic nature of time series data and make more accurate predictions. Combining the strengths of CNNs and RNNs makes it possible to learn spatial and temporal features from the time series data, improving the model’s performance for TSC. Additionally, the two models can be trained together, allowing them to learn from each other and improve the model’s overall performance.
Various extensions like MLSTM-FCN [86], TapNet [87], and SMATE [88] were proposed later to deal with time series data. MLSTM-FCN extends the univariate LSTM-FCN model [89] to the multivariate case. Like the LSTM-FCN, the multivariate version comprises LSTM blocks and fully convolutional blocks for extracting features from input series. A squeeze and excite block is also added to the FCN block and can execute a form of self-attention on the output feature maps of previous layers [86]. Two further proposals for multivariate TSC are the Time series attentional prototype Network (TapNet) and Semi-Supervised Spatio-Temporal (SMATE) [87, 88]. These methods combine and seek to leverage the relative strengths of both traditional distance-based and deep learning approaches.
MLSTM-FCN, TapNet, and SMATE were designed as dual-network architectures. The input is fed separately into the CNN and RNN models, and their outputs are concatenated before the fully connected layer for the final task. However, one branch cannot fully use the hidden states of the other during feature extraction, since the final classification results are generated by concatenating the outputs of the two branches. This motivates different architectures, such as GCRNN [90] and CNN-LSTM [91], which aim to integrate CNNs and RNNs in a layer-wise fashion.
While RNNs are commonly used for time series forecasting, only a few studies have applied them to TSC, mainly due to four reasons: (1) RNNs typically struggle with the gradient vanishing and exploding problem due to training on long time series [92]; (2) RNNs are considered difficult to train and parallelize, so researchers are less likely to use them as they are computationally expensive [78]; (3) recurrent architectures are designed mainly to learn from the previous data to make predictions about the future [28]; and (4) RNN models can fail to effectively capture and utilize long-range dependencies in long sequences [84].
3.4 Attention-based Model
Despite the excellent performance of CNN models for capturing local temporal/spatial correlations, these models cannot effectively capture and utilize long-range dependencies. Additionally, they only consider the local order of data points rather than the overall order of all data points. Therefore, many recent studies have embedded RNNs such as LSTMs alongside the CNNs to capture this information [86, 87, 89]. The disadvantage of RNN-based models is that they are computationally expensive, and their capability to capture long-range dependencies is limited [18, 93]. On the other hand, attention models can capture long-range dependencies, and their broader receptive fields provide more contextual information, which can improve the models’ learning capacity. The attention mechanism aims to enhance a network’s representation ability by focusing on essential features and suppressing unnecessary ones. Not surprisingly, with the success of attention models in natural language processing [93, 94], many previous studies have attempted to bring the power of attention models into various domains such as computer vision [95] and time series analysis [18, 19, 96, 97, 98]. Table 2 presents a list of the attention-based models reviewed in this article.
Table 2. Summary of Attention-based Models for Time Series Classification and Extrinsic Regression
3.4.1 Self-attention.
Self-attention has been demonstrated to be effective in various natural language processing tasks due to its ability to capture long-term dependencies in text [93]. Recently, it has also been shown to be effective for TSC tasks [18, 99, 100, 101]. As we mentioned, the self-attention module is embedded in the encoder-decoder models to improve the model performance. However, only the encoder and the self-attention module have been used for TSC. Early models of TSC follow the same backbone of natural language processing models and use the recurrent-based models such as RNN [102], GRU [99] and LSTM [103, 104] for encoding the input series. For example, the Multi-View Attention Network (MuVAN) applies bidirectional GRUs independently to each input dimension as the encoder and then feeds all the representations into a self-attention block [99].
As a result of the excellent performance of the CNN models, many studies have attempted to encode the time series using CNNs before applying attention [18, 100, 105, 106]. Cross-Attention Stabilized Fully Convolutional Neural Network (CA-SFCN) [18] and Locality-Aware eXplainable Convolutional ATtention network (LAXCAT) [100] applied the self-attention mechanism to leverage the long-term dependencies for the MTSC task. CA-SFCN combines FCN and two types of self-attention, temporal attention (TA) and variable attention (VA), which interact to capture the long-range dependencies and variables’ interactions. LAXCAT also used temporal and variable attention to identify informative variables and the time intervals where they have informative patterns for classification. WaveletDTW Hybrid attEntion Networks (WHENs) [107] integrate two attention mechanisms, namely wavelet attention and DTW attention, into the BiLSTM to enhance model performance. In wavelet attention, they leverage wavelets to compute attention scores, specifically targeting the analysis of dynamic frequency components in nonstationary time series. Simultaneously, DTW attention employs the DTW distance to calculate attention scores, addressing the challenge of time distortion in multiple time series.
Several self-attention models have been developed to improve network performance [108, 109], including Squeeze-and-Excitation (SE) [110], which focuses on channel attention and is often used to classify time series data [86, 101, 111]. The SE block allows the whole network to use global information to selectively focus on the informative feature maps and suppress less important ones [110]. More importantly, the SE block can increase the quality of the shared lower-level representations in the early layers and becomes increasingly specialized when responding to different inputs in later layers. The weight of each feature map is automatically learned at each layer of the network, and the SE block can boost feature discrimination throughout the whole network. Multi-scale Attention Convolutional Neural Network (MACNN) [101] applies the different kernel size convolutions to capture different scales of information along the time axis by generating feature maps at differing scales. Then an SE block is used to enhance useful feature maps and suppress less useful ones by automatically learning each feature map’s importance.
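The squeeze, excitation, and rescaling steps of an SE block can be sketched for one-dimensional feature maps as follows. This is a minimal illustration under stated assumptions rather than the implementation used in [110] or [101]: the function name `se_block` and the weight arguments `w1` and `w2` are hypothetical, bias terms are omitted, and the weights are supplied by the caller instead of being learned.

```python
import math


def se_block(feature_maps, w1, w2):
    """Apply a squeeze-and-excitation step to 1D feature maps.

    feature_maps: list of C channels, each a list of T floats.
    w1: R x C weight matrix of the bottleneck layer (assumed, no bias).
    w2: C x R weight matrix of the expansion layer (assumed, no bias).
    Returns the rescaled feature maps and the per-channel gates.
    """
    # Squeeze: global average pooling over the time axis, one value per channel.
    squeezed = [sum(channel) / len(channel) for channel in feature_maps]
    # Excitation: bottleneck layer with ReLU, then expansion layer with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value of a channel by that channel's gate.
    scaled = [[g * v for v in channel] for g, channel in zip(gates, feature_maps)]
    return scaled, gates
```

Each gate lies between 0 and 1, so a channel with a gate close to 1 passes through almost unchanged, while a channel with a gate close to 0 is suppressed.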
3.4.2 Transformers.
The impressive performance of multi-headed attention has led to numerous attempts to adapt it to the TSC domain. Transformers for classification usually employ a simple encoder structure consisting of attention and feed-forward layers. The Simply Attend and Diagnose (SAnD) [112] architecture was the first to adopt a multi-head attention mechanism similar to a vanilla transformer [93] to classify clinical time series. The model uses both positional encoding and a dense interpolation embedding technique to incorporate temporal order into representation learning. In another study that classified vibration signals [113], time-frequency features such as Frequency Coefficients and Short-Time Fourier Transform (STFT) spectra are used as input embeddings to the transformers. A multi-head attention-based model was applied to raw optical satellite TSC using Gaussian Process Interpolation [114] embedding and outperformed CNNs and RNNs [115].
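The core operation shared by these models is scaled dot-product attention [93]. The sketch below shows a single attention head applied to a short multivariate series. As a simplifying assumption, the learned query, key, and value projections are replaced by the identity (Q = K = V = input), and positional encoding is left out; the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x: list of T time steps, each a list of d features.
    Assumption: queries, keys, and values are all equal to x.
    Returns the attended series and the T x T attention weights.
    """
    d = len(x[0])
    # Similarity of every time step (query) with every other time step (key).
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x] for q in x
    ]
    weights = [softmax(row) for row in scores]
    # Each output step is a weighted average of all value vectors.
    out = [
        [sum(w * v[j] for w, v in zip(w_row, x)) for j in range(d)]
        for w_row in weights
    ]
    return out, weights
```

Because every time step attends to every other time step in one operation, the receptive field covers the whole series regardless of its length. A multi-head layer runs several such heads in parallel on different learned projections and concatenates their outputs.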
Gated Transformer Networks (GTNs) [116] use two-tower multi-headed attention to capture the discriminative information from the input series. The outputs of the two towers are merged using a learnable matrix named gating. To enhance locality awareness of transformers for TSC, flexible multi-head linear attention (FMLA) [117] integrates deformable convolutional blocks and online knowledge distillation, as well as a random mask to reduce noise. For each TSC dataset, AutoTransformer searches for a suitable network architecture using the neural architecture search (NAS) algorithm before feeding the output to the multi-headed attention blocks. ConvTran [13] currently stands as the state of the art in multivariate TSC. Its authors reviewed existing absolute and relative position encoding methods in TSC and, based on the limitations of these encodings for time series, introduced two novel ones named tAPE and eRPE for absolute and relative positions, respectively. Integrating the proposed position encodings into a transformer block and combining them with a convolution layer yields ConvTran, a novel deep learning framework for multivariate time series classification.
3.5 Graph Neural Networks
While both CNNs and RNNs perform well on Euclidean data, many time series problems have data that are more naturally represented as graphs [119]. For example, in a network of sensors, the sensors may be irregularly spaced, instead of the sensors forming a regular grid. A graph representation of data collected by this network can model this irregular layout more accurately than can be done using a Euclidean space. However, using standard deep learning algorithms to learn from graph structures is challenging [120]. For example, nodes may have a varying number of neighboring nodes, making it difficult to apply a convolution operation.
GNNs [121] are methods that adapt deep learning techniques to the graph domain. Much of the early research using GNNs for time series analysis concentrated on forecasting tasks [119]. However, recent works consider GNNs for TSC [122, 123] and TSER [124] tasks. A list of the GNN models reviewed in this article is provided in Table 3. Time2Graph+ [125] transforms each time series into a shapelet graph. Shapelets are extracted from the time series and form the graph nodes. The graph edges are weighted based on transition probabilities between the two shapelets. Once the input graphs have been constructed, a graph attention network is used to create a representation of the time series that is fed into a classifier. SimTSC [137] constructs a pairwise similarity graph where each time series forms a node and edge weights are computed based on the DTW distance measure. Node attributes are generated using a feature vector encoder. GNN operations are used to enhance the node features based on similarities between adjacent time series. These representations are then used for the final classification step, which produces a classification for each node. LB-SimTSC [122] replaces the expensive DTW computation with the LB-Keogh lower-bounding method [141].
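The construction of a pairwise similarity graph from DTW distances can be sketched as follows. This is an illustrative sketch, not the SimTSC implementation: the function names are hypothetical, the DTW variant is the basic unconstrained one with a squared point-wise cost, and the mapping from distance to edge weight (`exp(-alpha * distance)`, with an assumed parameter `alpha`) is chosen only so that similar series receive larger weights.

```python
import math


def dtw_distance(a, b):
    """Unconstrained DTW distance between two univariate series."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def similarity_graph(series, alpha=1.0):
    """Return a dense adjacency matrix with one node per time series.

    Assumption: edge weight = exp(-alpha * DTW distance), so identical
    series (after warping) get weight 1 and distant series approach 0.
    """
    n = len(series)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            weight = math.exp(-alpha * dtw_distance(series[i], series[j]))
            adj[i][j] = adj[j][i] = weight
    return adj
```

The quadratic cost of each DTW computation, repeated for every pair of series, is what makes this step expensive and motivates replacing it with a cheaper lower bound.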
Table 3. Summary of Graph Neural Network Models for Time Series Classification and Extrinsic Regression
Spatiotemporal GNNs model both spatial (or inter-variable) and temporal dependencies using two modules that work in tandem. The spatial module models the dependencies between the time series by applying graph convolutions over a GNN (GCNs [142]). The temporal module models the dependencies within the time series using an RNN [129, 132], 1D-CNN [134, 135], Attention [133, 139], or a combination of these [119]. The features extracted from the graph layers are then fed into the classification or regression layers to make either a single prediction [132, 133, 135, 139] or a prediction for each node [129, 134]. Spatiotemporal GCNs are often used to analyze sensor arrays, where the graph structure models the physical layout of the sensors. A common example is EEG data, where the location of EEG electrodes is represented as a graph that is used to analyze the EEG signal. Some of these applications are epilepsy detection [131], seizure detection [126, 132], emotion recognition [127], and sleep classification [128]. Besides EEG, GCNs have also been applied to engineering applications such as machine fault diagnosis [130], slope deformation prediction [129], and seismic activity prediction [124]. MTPool [136] uses a spatiotemporal GCN for multivariate time series classification. In this study, each channel in the time series is represented by a node in the graph, and the graph edges model the correlations between the channels. The GCN is combined with temporal convolutions and a hierarchical graph pooling technique. Spatiotemporal GNNs have also been used for object-based image analysis [134] and semantic segmentation [138] of image time series. However, these assume the labels and spatial relationships are static over time. In many cases these may both change. Spatiotemporal graphs (STGs), which include temporal edges as well as spatial edges, can model these dynamic relationships [140]. In STGs, each node represents an object at one timestamp. Spatial edges connect the object to adjacent objects, and temporal edges connect two objects in consecutive images if they have common pixels.
4 Self-supervised Models
Obtaining labeled data for large time series datasets poses significant costs and challenges. Machine learning models trained on large labeled time series datasets often exhibit superior performance compared to models trained on sparsely labeled datasets, on small datasets with limited labels, or without supervision, all of which tend to yield suboptimal performance across various time series machine learning tasks [23, 143]. As a result, rather than depending on high-quality annotations for large datasets, researchers and practitioners are increasingly shifting their focus toward self-supervised representation learning for time series.
Self-supervised representation learning, a subfield of machine learning, focuses on learning representations from data without explicit supervision [24]. In contrast to supervised learning, which relies on labeled data, self-supervised learning methods utilize the inherent structure of the data to learn valuable representations in an unsupervised manner. The learned representations can then be used for a variety of downstream tasks including classification, anomaly detection, and forecasting. This survey specifically emphasizes classification as a downstream task. We categorize self-supervised learning approaches for TSC into three groups based on the pretext task. Table 4 shows a list of the self-supervised models reviewed in this article.
Table 4. Summary of Self-supervised Models for Time Series Classification and Extrinsic Regression
4.1 Contrastive Learning
Contrastive learning involves training a model to differentiate between positive and negative time series examples. Time-Contrastive Learning (TCL) [144], Scalable Representation Learning (SRL or T-Loss) [145], and Temporal Neighborhood Coding (TNC) [146] apply subsequence-based sampling and assume that distant segments are negative pairs and neighboring segments are positive pairs. TNC takes advantage of the local smoothness of a signal’s generative process to define neighborhoods in time with stationary properties to further improve the sampling quality for the contrastive loss function. TS2Vec [23] uses contrastive learning to obtain robust contextual representations for each timestamp hierarchically. It involves randomly sampling two overlapping subseries from the input and encouraging consistency of contextual representations on the common segment. The encoder is optimized using both temporal contrastive loss and instance-wise contrastive loss.
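The general shape of a contrastive objective can be sketched with a generic InfoNCE-style loss. This is not the exact loss of any of the methods cited above, each of which defines its own sampling scheme and loss; the function names, the cosine similarity, and the `temperature` parameter are assumptions made for illustration. Embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding.

    The loss is the negative log-probability of picking the positive
    among the positive and all negatives, based on scaled similarities.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]
```

The loss is small when the anchor is closer to its positive than to any negative, and large otherwise. In subsequence-based sampling, the positive would be the embedding of a neighboring segment and the negatives the embeddings of distant segments.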
In addition to the subsequence-based methods, other models employ instance-based sampling [21, 143, 147, 148, 149, 150], treating each sample individually to generate positive and negative samples for contrastive loss. Time-series Temporal and Contextual Contrasting (TS-TCC) [21] uses weak and strong augmentations to transform the input series into two views and then uses a temporal contrasting module to learn robust temporal representations. The contrasting contextual module is then built upon the contexts from the temporal contrasting module and aims to maximize similarity among contexts of the same sample while minimizing similarity among contexts of different samples. Similarly, TimeCLR [148] introduces DTW data augmentation to enhance robustness against phase shift and amplitude change phenomena. Bilinear Temporal-Spectral Fusion (BTSF) [143] uses simple dropout as the augmentation method and aims to incorporate spectral information into the feature representation. Similarly, Time-Frequency Consistency (TF-C) [149] is a self-supervised learning method that leverages the frequency domain to achieve better representation. It proposes that the time-based and frequency-based representations, learned from the same time series sample, should be more similar to each other in the time-frequency space compared to representations of different time series samples.
4.2 Self-prediction
The primary objective of self-prediction-based self-supervised models is to reconstruct the input or representation of input data. Studies have explored using transformer-based self-supervised learning methods for TSC [19, 22, 98, 151, 152, 153], following the success of models like BERT [94]. BErt-inspired Neural Data Representations (BENDR) [98] uses the transformer structure to model EEG sequences and shows that it can effectively handle massive amounts of EEG data recorded with differing hardware. Another study, Voice-to-Series with Transformer-based Attention (V2Sa) [22], utilizes a large-scale pre-trained speech processing model for TSC.
The Transformer-based Framework (TST) [19] and TARNet [151] adapt vanilla transformers to the multivariate time series domain and use a self-prediction-based self-supervised pre-training approach with masked data. These studies demonstrate the potential of using transformer-based self-supervised learning methods for TSC.
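The idea behind masked self-prediction can be sketched as follows: part of the input is hidden, a model reconstructs the series, and the loss is computed only on the hidden positions. The sketch makes several simplifying assumptions that do not necessarily match TST or TARNet: positions are masked independently and uniformly at random, masked values are set to zero, the series is univariate, and the reconstruction model itself is left out. The function names are hypothetical.

```python
import random


def mask_series(series, mask_ratio, rng):
    """Hide a random subset of positions in a univariate series.

    rng: a random.Random instance, passed in for reproducibility.
    Returns the corrupted series and the sorted masked indices.
    Assumption: masked values are replaced by 0.0.
    """
    n_mask = max(1, round(mask_ratio * len(series)))
    masked = set(rng.sample(range(len(series)), n_mask))
    corrupted = [0.0 if i in masked else v for i, v in enumerate(series)]
    return corrupted, sorted(masked)


def masked_mse(original, reconstruction, masked_idx):
    """Mean squared error restricted to the masked positions."""
    errors = [(original[i] - reconstruction[i]) ** 2 for i in masked_idx]
    return sum(errors) / len(errors)
```

During pre-training, the corrupted series would be passed through the encoder and the loss from `masked_mse` minimized; no class labels are required at this stage.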
4.3 Other Pretext Tasks
While many pretext tasks in self-supervised learning are typically contrastive or self-predictive, specific tasks are tailored for time series data. In image-based self-supervised learning, synthetic transformations (augmentation) of an image are created, and the model learns to contrast the image and its transforms with other images in the training data, which works well for object interpretation. However, time series analysis fundamentally differs from vision or natural language processing concerning the definition of meaningful self-supervised learning tasks.
Guided by this insight, Foumani et al. [24] introduce Series2Vec, a novel self-supervised representation learning approach. Unlike other contrastive self-supervised methods in time series, which carry the risk of positive sample variants being less similar to the anchor sample than series in the negative set, Series2Vec is trained to predict the similarity between two series in both temporal and spectral domains through a self-supervised task. Series2Vec relies primarily on the consistency of the unsupervised similarity step, rather than the intrinsic quality of the similarity measurement, without the need for hand-crafted data augmentation. Pre-trained H-InceptionTime (PHIT) [154] is pre-trained using a novel pretext task designed to identify the originating dataset of each time series sample. The objective is to generate flexible convolution filters that can be applied across diverse datasets. Furthermore, PHIT demonstrates its capability to mitigate overfitting in small datasets.
5 Data Augmentation
In the field of deep learning, the concept of data augmentation has emerged as an important tool for enhancing performance, particularly in scenarios where the availability of training data is limited [155]. Originally proposed in computer vision, data augmentation involves a variety of transformations to images, such as cropping, rotating, flipping, and applying filters like blurring and sharpening. These transformations serve to introduce a diverse range of scenarios within the training data, thereby aiding in the development of more robust and generalizable models. However, the direct application of these image-based augmentation techniques to time series data often proves to be inadequate or inappropriate. Operations like rotation may disrupt the intrinsic temporal structure of time series data.
The challenge of overfitting is particularly pronounced in the field of deep learning models for TSC. These models are characterized by a high number of trainable parameters, which can lead to a model that performs well on training data but fails to generalize to unseen data. In such cases, data augmentation can be a valuable strategy. It offers an alternative to the costly and sometimes impractical approach of collecting additional real-world data. By generating synthetic samples from existing datasets, we can effectively augment the size and variety of our training data. The following paragraphs describe the methods that have been investigated for producing synthetic time series for data augmentation.
Random Transformations.
Several augmentations have been developed for the magnitude domain. Jittering, as explored by Um et al. [156], involves the addition of random noise to the time series. Another method, flipping [157], reverses the time series values. Scaling is a technique where the time series is multiplied by a factor drawn from a Gaussian distribution. Magnitude warping, which shares similarities with scaling, distorts the series along a curve that varies smoothly. For time domain transformations, permutation algorithms play a significant role. For example, the slicing transformation involves removing a sub-sequence from the series. There are also various warping methods like Random Warping [158], Time Warping [156], Time Stretching [159], and Time Perturbation [160], each introducing different forms of distortion to the time series. Finally, in the frequency domain, transformations often utilize the Fourier transform. For example, Gao et al. [161] introduce perturbations to both the magnitude and phase spectrum following a Fourier transform.
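Two of the magnitude-domain transformations, jittering and scaling, can be sketched in a few lines. The function names, the default standard deviations, and the choice of a Gaussian centered at 1 for the scaling factor are assumptions made for illustration and are not taken from the cited works.

```python
import random


def jitter(series, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to each value.

    rng: a random.Random instance, passed in for reproducibility.
    """
    return [v + rng.gauss(0.0, sigma) for v in series]


def scale(series, sigma, rng):
    """Multiply the whole series by one factor drawn from N(1, sigma^2).

    Assumption: the Gaussian is centered at 1 so that the augmented
    series stays close to the original in magnitude.
    """
    factor = rng.gauss(1.0, sigma)
    return [v * factor for v in series]
```

Jittering perturbs every time step independently, whereas scaling applies a single random factor to the whole series and therefore preserves its shape.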
Window methods.
A primary approach in window methods is to create new time series by combining segments from various series of the same class. This technique effectively enriches the data pool with a variety of samples. Window slicing, as introduced by Cui et al. [162], involves dividing a time series into smaller segments, with each segment retaining the class label of the original series. These segments are then used to train classifiers, offering a detailed view of the data. During classification, each segment is evaluated individually, and a collective decision on the final label is reached through a voting system among the slices. Another technique is window warping, based on the DTW algorithm. This method adjusts segments of a time series along the temporal axis, either stretching or compressing them. This introduces variability in the time dimension of the data. Le Guennec et al. [163] provide examples of the application of both window slicing and window warping, showcasing their effectiveness in enhancing the diversity and representativeness of time series datasets.
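Window slicing and the voting step at classification time can be sketched as follows. The function names and the `stride` parameter are hypothetical, and `classify_slice` stands in for any trained classifier that maps one slice to a class label; the sketch is not the implementation of Cui et al. [162].

```python
from collections import Counter


def window_slices(series, window, stride=1):
    """Return all slices of length `window`, taken every `stride` steps."""
    last_start = len(series) - window
    return [series[s:s + window] for s in range(0, last_start + 1, stride)]


def predict_by_voting(series, window, classify_slice, stride=1):
    """Classify each slice and return the most frequent label.

    classify_slice: assumed callable mapping one slice to a class label.
    Ties are resolved in favor of the label that was seen first.
    """
    votes = Counter(classify_slice(s) for s in window_slices(series, window, stride))
    return votes.most_common(1)[0][0]
```

During training, every slice returned by `window_slices` would inherit the class label of the series it was cut from, which multiplies the number of training samples.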
Averaging methods.
Averaging methods in time series data augmentation combine multiple series to form a new, unified series. This process is more difficult than it might seem, as it requires careful consideration of factors like noise and distortions in both the time and magnitude aspects of the data. In this context, weighted DTW Barycenter Averaging (wDBA) introduced by Forestier et al. [164] provides an averaging method by aligning time series in a way that accounts for their temporal dynamics. The practical application of wDBA is illustrated in the study by Ismail Fawaz et al. [165], where it is employed in conjunction with a ResNet classifier, demonstrating its effectiveness. Additionally, the research conducted by Terefe et al. [166] uses an auto-encoder for averaging a set of time series. This method represents a more advanced approach in time series data augmentation, exploiting the auto-encoder’s capacity for learning and reconstructing data to generate averaged representations of time series.
Selection of data augmentation methods.
The selection of the appropriate data augmentation technique is critical and must be adapted to the specific characteristics of the dataset and the architecture of the neural network being used. Studies like those conducted by Iwana and Uchida [167], Pialla et al. [168], and Gao et al. [169] highlight the complexity of this task. These studies demonstrate that the effectiveness of augmentation techniques can vary significantly across different datasets and neural network architectures. Consequently, a method that proves effective in one scenario may not necessarily yield similar results in another. To this end, practitioners in the field of TSC must engage in a careful and informed process of method selection and tuning. While the array of available data augmentation techniques offers a comprehensive toolkit for tackling the challenges of limited data and overfitting, their successful application depends heavily on a nuanced understanding of both the methods themselves and the specific demands of the task at hand.
6 Transfer Learning
Transfer learning, initially popularized in the field of computer vision, is increasingly becoming relevant in the domain of TSC. In computer vision, this approach involves using a pre-trained network, typically on large datasets like ImageNet [170], as a starting point rather than initiating with random network weights. This method is also related to the concept of foundation or base models, which are large-scale machine learning models trained on extensive data, often using self-supervised or semi-supervised learning. These models are adaptable to a wide array of tasks, showcasing their versatility. The principle of transfer learning is also closely associated with domain adaptation, which focuses on applying a model trained on a source data distribution to a different, but related, target data distribution. This approach is crucial in leveraging pre-trained models for various applications, particularly in scenarios where data is scarce or specific to certain domains.
In the context of TSC, insights have been contributed by the work of Ismail Fawaz et al. [171], who conducted a study using the UCR archive. Their extensive experiments demonstrated that transfer learning could lead to positive or negative outcomes, depending on the chosen datasets for transfer. This finding underscores the importance of the relationship between source and target datasets in transfer learning efficacy. Ismail Fawaz et al. [171] also introduced an approach to predict the success of transfer learning in TSC by using DTW to measure similarities between datasets. This metric serves as a guide to select the most appropriate source dataset for a given target dataset, thereby enhancing accuracy in a majority of cases.
Other researchers have also explored transfer learning in TSC. Spiegel’s [172] work on using dissimilarity spaces to enrich feature representations in TSC set a precedent for employing unconventional data sources. This approach of enhancing learning with diverse data types finds a parallel in Li et al.’s [173] method, which leverages sensor modality labels from various fields to train a deep network, emphasizing the importance of versatile data in transfer learning. Building on the concept of data diversity, Rotem et al. [174] pushed the boundaries further by generating a synthetic univariate time series dataset for transfer learning. This synthetic dataset, used for regression tasks, underscores the potential of artificial data in overcoming the limitations of real-world datasets. Furthermore, Senanayaka et al. [175] introduced the similarity-based multi-source transfer learning (SiMuS-TL) approach. By establishing a “mixed domain” to model similarities among various sources, Senanayaka et al. demonstrated the effectiveness of carefully selected and related data sources in transfer learning. Finally, Kashiparekh et al. [176] with their ConvTimeNet (CTN) focused on the adaptability of pre-trained networks across diverse time scales.
While the explored studies collectively advance our understanding of transfer learning in TSC, the field remains open for further investigation. A key challenge lies in determining the most suitable source models for transfer, a task complicated by the relative scarcity of large, curated, and annotated datasets in time series analysis compared to the field of computer vision. This restricts the utility of transfer learning in TSC, as the availability of extensive and diverse datasets is crucial for developing robust and generalizable models. Furthermore, the question of developing filters that are generic enough to be effective across a wide range of applications remains unresolved. This aspect is critical for the success of transfer learning, as the applicability of a pre-trained model to new tasks depends on the universality of its learned features. Additionally, the strategy of whether to freeze certain layers of the network during transfer or to fine-tune the entire network is another area that warrants deeper exploration.
7 Applications: Recent Developments and Challenges
TSC and TSER techniques have been used to analyze and model time-dependent data in a wide range of applications. These include human activity recognition, Earth observation, medical diagnosis including EEG [177] and electrocardiogram (ECG) [178] monitoring, air quality and pollution prediction [179, 180], structural and machine health monitoring [181, 182], Industrial Internet of Things (IIOT) [183], energy consumption and anomaly detection [184], and bio-acoustics [185].
Due to the extensive range of applications that use TSC and TSER, it is infeasible to cover them all in detail in a single review. Therefore, in this survey, we focus on just two applications: human activity recognition and satellite Earth observation. (References to recent reviews have been provided for the other applications mentioned above.) These are two important but quite different domains and were chosen to give the reader an idea of the diversity of time series applications of deep learning. The following sections provide an overview of the use of TSC and TSER, the latest developments, and challenges in these two applications.
7.1 Human Activity Recognition
Human activity recognition (HAR) is the identification or monitoring of human activity through the analysis of data collected by sensors or other instruments [186]. The recent growth of wearable technologies and the Internet of Things has resulted in not only the collection of large volumes of activity data [187] but also easy deployment of applications utilizing this data to improve the safety and quality of human life [5, 186]. HAR is therefore an important field of research with applications including healthcare, fitness monitoring, smart homes [188], and assisted living [189].
Devices used to collect HAR data can be categorized as visual or sensor based [4, 5]. Sensor-based devices can be further categorized as object sensors (e.g., RFIDs embedded into objects), ambient sensors (motion sensors, WiFi or Bluetooth devices in fixed locations), and wearable sensors [4], including smartphones [3]. However, the majority of HAR studies use data from wearable sensors or visual devices [186]. Additionally, HAR from visual device data requires the use of computer vision techniques and is therefore out of scope for this review. Accordingly, this section reviews wearable sensor-based methods of HAR. For reviews of vision-based HAR, refer to Kong and Fu [190] or Zhang et al. [191].
The main sensors used in wearable devices are accelerometers, gyroscopes, and magnetic sensors [192], which each collect three-dimensional spatial data over time. Inertial measurement units (IMUs) are wearable devices that combine all three sensors in one unit [193, 194]. Wearable device studies typically collect data from multiple IMUs located on different parts of the body [195, 196]. To create a dataset suitable for HAR modeling, the sensor data is split into (usually equally sized) time windows [197]. The task is then to learn a function that maps the multi-variate sensor data for each time window to a set of activities. Thus, the data forms multi-variate time series suited to TSC.
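As a minimal sketch of the windowing step described above, the following Python splits a multivariate sensor stream into equally sized, possibly overlapping windows. The function name `sliding_windows`, the window and step sizes, and the toy tri-axial stream are assumptions made for illustration; they are not taken from any of the cited studies.

```python
def sliding_windows(series, window, step):
    """Split a multivariate series (one row of channel values per time step)
    into equally sized windows; consecutive windows start `step` rows apart."""
    return [series[start:start + window]
            for start in range(0, len(series) - window + 1, step)]


# Hypothetical tri-axial accelerometer stream: 10 time steps x 3 channels.
stream = [[float(t), float(2 * t), float(-t)] for t in range(10)]

# Windows of 4 time steps with 50% overlap (step of 2).
windows = sliding_windows(stream, window=4, step=2)
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each window keeps its channel dimension, so every element of `windows` is itself a short multivariate time series that a TSC model can label with an activity.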
Given the broad scope of our survey, this section necessarily only provides a brief overview of the studies using deep learning for HAR. However, there are several surveys that provide a more in-depth review of machine learning and deep learning for HAR. Lara and Labrador [197] provide a comprehensive introduction to HAR, including machine learning methods used and the principal issues and challenges. Both Nweke et al. [3] and Wang et al. [4] provide a summary of deep learning methods, highlighting their advantages and limitations. Chen et al. [5] discuss challenges in HAR and the appropriate deep learning methods for addressing each challenge. They also provide a comprehensive list of publicly available HAR datasets. Gu et al. [198] focus on deep learning methods, reviewing preprocessing and evaluation techniques as well as the deep learning models.
The deep learning methods used for HAR include both CNNs and RNNs, as well as hybrid CNN-RNN models. While some of the models include an attention module, we did not find any studies proposing a full attention or transformer model. A summary of the studies reviewed and the type of model built is provided in Table 5. Hammerla et al. [199] compared several deep learning models for HAR, including three LSTM variants, a CNN model, and a DNN model. They found that a bi-directional LSTM performed best on naturalistic datasets where long-term effects are important. However, they found that some applications need to focus on short-term movement patterns and suggested CNNs are more appropriate for these applications. Thus, research across all model types is beneficial for the ongoing development of models for HAR applications.
Many of the papers reviewed in this section used commonly available datasets to build and evaluate their models.
7.1.1 Convolutional Neural Networks.
One of the most common types of convolutional kernels for HAR is the \(k \times 1\) kernel. This kernel convolves \(k\) time steps together, moving along each time series in the input features in turn [221], so while weights are shared between the input features, there is no mixing between features. The outputs from the final convolutional layer are flattened and processed by fully connected layers before the final classification is made. Ronao et al. [203] performed a comprehensive evaluation of CNN models for HAR, evaluating the effect of changing the number of layers, filters, and filter sizes. The input data was collected from smartphone accelerometer and gyroscope sensors. Ignatov [207] used a one-layer CNN and augmented the extracted features with statistical features before passing them to fully connected layers. The architecture was effective with short time series (1 second) and therefore useful for real-time activity modeling. One drawback of the above method is that it forces weight sharing across all the input features. This may not be optimal, especially when using data collected from multiple devices. In this case, using a separate CNN for each device [208] allows independent weighting of the features. Similarly, as each sensor is typically tri-axial, a separate CNN can be used for each axis [200, 213]. The features extracted by each CNN are then concatenated and processed either by fully connected layers [200] or an attention head [213].
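The behavior of a \(k \times 1\) kernel can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not code from any cited model: the function name `conv_k_by_1` and the example values are hypothetical, and, as in most deep learning libraries, the operation is implemented as a cross-correlation without padding.

```python
def conv_k_by_1(window, kernel):
    """Apply one temporal kernel of k weights along each channel separately.

    window: list of T rows, each holding C channel values.
    kernel: list of k weights shared by every channel.
    Returns T - k + 1 rows of C values; channels are never mixed.
    """
    k = len(kernel)
    n_steps, n_channels = len(window), len(window[0])
    return [[sum(kernel[j] * window[t + j][c] for j in range(k))
             for c in range(n_channels)]
            for t in range(n_steps - k + 1)]


# Two channels on very different scales (hypothetical values).
window = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
features = conv_k_by_1(window, kernel=[0.5, 0.5])
print(features)  # [[1.5, 15.0], [2.5, 25.0], [3.5, 35.0]]
```

The same two weights are applied to both channels, which is the forced weight sharing discussed above; using a separate kernel per device or per axis would amount to calling the function with a different `kernel` for each channel group.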
While the above two methods are the most common, other studies have proposed alternative CNNs for HAR. DCNN [201] pre-processes the sensor data using a Discrete Fourier Transform to convert IMU data to frequency signals, then uses two-dimensional convolutions to extract combined temporal and frequency features. Lee et al. [205] pre-processed the tri-axial accelerometer data to a magnitude vector, which was then processed in parallel by CNNs with varying kernel sizes, extracting features at different scales. Xu et al. [222] used deformable convolutions [223] in both a 2D-CNN and a ResNet model and found these models performed better than their non-deformable counterparts. Yao et al. [209] proposed a fully convolutional model using 2D temporal and feature convolutions. Their model has two advantages as (1) it handles arbitrary length input sequences and (2) it makes a prediction for each timestep, which avoids the need to pre-process the data into windows and can detect transitions between activities.
7.1.2 Recurrent Neural Networks.
Several LSTM models have been proposed for HAR. Murad and Pyun [206] designed and compared three multi-layered LSTMs, a uni-directional LSTM, a bi-directional LSTM, and a “cascading” LSTM, which has a bi-directional first layer, followed by uni-directional layers. In each case the output from all time steps is used as input to the classification layer. Zeng et al. [210] added two attention layers to an LSTM, a sensor attention layer before the LSTM and a temporal attention layer after the LSTM. They include a regularization term they called “continuous attention” to smooth the transition between attention weights. Guan and Plötz [204] created an ensemble of LSTM models by saving the models at every training epoch, then selecting the best “M” models based on validation set results, thus aiming to reduce model variance.
7.1.3 Hybrid Models.
Many recent studies have focused on hybrid models, combining both CNNs and RNNs. DeepConvLSTM [192] comprises four temporal convolutional layers followed by two LSTM layers, which the authors found to perform better than an equivalent CNN (replacing the LSTM layers with fully connected layers). As the LSTM layers have fewer parameters than fully connected layers, the DeepConvLSTM model was also much smaller. Singh et al. [220] used a CNN to encode the spatial data (i.e., the sensor readings at each timestamp) followed by a single LSTM layer to encode the temporal data, then a self-attention layer to weight the time steps. They found this model performed better than an equivalent model using temporal convolutions in the CNN layers. Challa et al. [214] proposed using three 1D-CNNs with different kernel sizes in parallel, followed by two bi-directional LSTM layers and a fully connected layer. Nafea et al. [219] also used 1D-CNNs with different kernel sizes and bi-directional LSTMs. However, they used separate branches for the CNNs and LSTMs, merging the features extracted in each branch for the final fully connected layer. Mekruksavanich and Jitpattanakul [217] compared a four-layer CNN-LSTM model with a smaller CNN-LSTM model and LSTM models, finding the extra convolutional layers improved performance over the smaller models. DEBONAIR [216] is another multi-layered model. It uses parallel 1D-CNNs, each having different kernel, filter, and pooling sizes to extract different types of features associated with different types of activity. These are followed by a combined 1D-CNN, then two LSTM layers. Mekruksavanich and Jitpattanakul [218] ensembled four different models: a CNN, an LSTM, a CNN-LSTM, and a ConvLSTM model. They aimed to produce a model for biometric user identification that could identify not only the activity being performed but also the participant performing the activity.
A few hybrid models use GRUs instead of LSTMs. InnoHAR [212] is a modified DeepConvLSTM [192], replacing the four CNN layers with inception layers and the two LSTM layers with GRU layers. The authors found this inception model performed better than both the original DeepConvLSTM model and a straight CNN model [202]. AttnSense [211] uses a Fast Fourier transform to generate frequency features, which are then convolved separately for each time step. Attention layers are used to weight the extracted frequency features. These are then passed through a GRU with temporal attention to extract temporal features. CNN-BiGRU [215] uses a CNN layer to extract spatial features from the sensor data, then one or more GRU layers to extract temporal features. The final section of the model is a fully connected module consisting of one or more hidden layers and a softmax output layer.
7.2 Satellite Earth Observation
Ever since NASA launched the first Landsat satellite in 1972 [224], Earth-observing satellites have been recording images of the Earth’s surface, providing 50 years of continuous Earth observation data that can be used to estimate environmental variables informing us about the state of the Earth. Instruments on board the satellites record reflected or emitted electromagnetic radiation from the Earth’s surface and vegetation [225]. The regular, repeated observations from these instruments form satellite image time series (SITS) that are useful for analyzing the dynamic properties of some variables, such as plant phenology. The main modalities used for SITS analysis are multispectral spectrometers and spectroradiometers, which observe the visible and infrared frequencies, and Synthetic Aperture Radar (SAR) systems, which emit a microwave signal and measure the backscatter.
Raw data collected by satellite instruments needs to be pre-processed before being used in machine learning. This is frequently done by the data providers to produce analysis-ready datasets (ARDs). With the increasing availability of compatible ARDs from sources such as Google Earth Engine [226] and various data cubes [227, 228], models combining data from multiple data sources (multi-modal) are becoming more common. These data sources make it straightforward to obtain data that are co-registered (spatially aligned and with the same resolution and projection), thus avoiding the need for complex pre-processing.
Satellite image time series can be processed either (1) as 2D temporal and spectral data, processing each pixel independently and ignoring the spatial dimensions, or (2) as 4D data, including the two spatial dimensions, in which models thus extract spatio-temporal features. This latter method allows estimates to be made at pixel, patch, or object level; however, it requires either more complex models or spatial features to be extracted in a pre-processing step. Feature extraction can be as simple as extracting the mean value for each band. However, both clustering (TASSEL [229]) and neural-network-based methods, such as the Pixel-Set Encoder [230], have been used for more complex feature extraction. The most common use of SITS deep learning is for the classification of the Earth’s surface by land cover and agricultural land by crop types. The classes used can range from very broad land cover categories (such as forest, grasslands, agriculture) through to specific crop types. Other classification tasks include identifying specific features, such as sinkholes [231], burnt areas [232], flooded areas [233], roads [234], deforestation [235], vegetation quality [236], and forest understory and litter types [237].
Extrinsic regression tasks are less common than classification tasks, but several recent studies have investigated methods of estimating water content in vegetation, as measured by the variable Live Fuel Moisture Content (LFMC) [238, 239, 240, 241]. Other regression tasks include estimating the wood volume of forests [242], using a hybrid CNN-MLP model that combines a time series of Sentinel-2 images with a single LiDAR image, and estimating crop yield [243], using a hybrid of CNN and LSTM.
Many different approaches to learning from SITS data have been studied, with studies using all the main deep learning architectures, adapting them for multi-modal learning, and combining architectures in hybrid and ensemble models. The rest of this section reviews the architectures that have been used to model SITS data. A summary of these papers and the embedding architecture is provided in Table 6.
7.2.1 Recurrent Neural Networks (RNNs).
One of the first papers to use RNNs for land cover classification was Ienco et al. [260], who showed that an LSTM model out-performed non-deep-learning methods such as Random Forest (RF) and Support Vector Machines (SVMs). However, they also showed that the performance of both RF and SVM improves if trained on features extracted by the LSTM model, and in some cases they were more accurate than the straight LSTM model. Rao et al. [238] used an extrinsic regression LSTM model to estimate LFMC in the western United States.
More commonly, however, RNNs are combined with an attention layer to allow the model to focus on the most important time steps. The OD2RNN model [262] used separate GRU layers followed by attention layers to process Sentinel-1 and Sentinel-2 data, combining the features extracted by each source for the final fully connected layers. HOb2sRNN [261] refined OD2RNN by using a hierarchy of land cover classifications; the model was pretrained using broad land cover classifications, then further trained using the finer-grained classifications. DCM [247] and HierbiLSTM [248] both use a bi-directional LSTM, processing the time series in both directions, followed by a self-attention transformer for a pixel-level crop-mapping model. All these studies found that adding the attention layers improved model performance over a straight GRU or LSTM model.
7.2.2 Convolutional Neural Networks (CNNs).
While many authors have claimed that RNNs out-perform CNNs for land cover and crop-type classification, most of these comparisons are to 2D-CNNs, which ignore the temporal ordering of SITS data [254]. However, other studies show using 1D-CNNs to extract temporal information and 3D-CNNs to extract spatio-temporal information are both effective methods of learning from SITS data. TempCNN [254] consists of three 1D convolutional layers. The output from the final convolutional layer is passed through a fully connected layer, then the final softmax classification layer. TASSEL [229] is an adaptation of TempCNN for OBIA classification, using TempCNN models to process features extracted from the objects, followed by an attention layer to weight the convolved features. TempCNN has also been adapted for extrinsic regression [239] and used for LFMC estimation [239, 240, 241].
2D-CNNs are mainly used to extract spatial or spatio-temporal features for both pixel and object classification. The model input is usually 4D and the data is convolved spatially, with two main methods used to handle the temporal dimension. In the first method, each time step is convolved separately and the extracted features are merged in later stages of the model [245]. In the second method, the time steps and channels are flattened to form a large multivariate image [242, 253]. FG-UNet [259] is a fully convolutional model that combines both the above methods, first grouping time steps by threes to produce images with 30 channels (10 spectral \(\times\) 3 temporal), which are passed through both U-Net and 2D-CNN layers. Ji et al. [246] used a 3D-CNN to convolve the spatial and temporal dimensions together, combining the strengths of 1D-CNN and 2D-CNNs. The study found that a 3D-CNN crop classification model performed significantly better than the 2D-CNN, again showing the importance of the temporal features. Another study, SSTNN [264], obtained good results for crop yield prediction by using a 3D-CNN to convolve the spatial and spectral dimensions, extracting spatio-spectral features for each time step. These features were then processed by LSTM layers to perform the temporal modeling.
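The channel-stacking idea behind the second method, and the grouping of time steps by threes described for FG-UNet, can be sketched for a single pixel as follows. This is only an illustration of the reshaping: the helper `group_time_steps` and the toy values are assumptions, and real models apply the same stacking to whole image patches before the 2D convolutions.

```python
def group_time_steps(pixel_series, group=3):
    """Concatenate the band values of every `group` consecutive time steps,
    turning `group` dates of B bands into one vector of group * B channels."""
    return [[value for step in pixel_series[start:start + group] for value in step]
            for start in range(0, len(pixel_series) - group + 1, group)]


# Hypothetical pixel: 6 acquisition dates x 10 spectral bands.
pixel_series = [[date * 100 + band for band in range(10)] for date in range(6)]

grouped = group_time_steps(pixel_series)              # 2 groups x 30 channels
flattened = group_time_steps(pixel_series, group=6)   # 1 group x 60 channels
print(len(grouped), len(grouped[0]), len(flattened[0]))
```

With `group=3` and 10 bands, each group has the 30 channels (10 spectral \(\times\) 3 temporal) mentioned above; setting `group` to the full series length reproduces the fully flattened multivariate image.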
7.2.3 Transformer and Attention Models.
As an alternative to including attention layers with a CNN or RNN, several studies have designed models that process temporal information using only attention layers. PSE-TAE [230] used a modified transformer called a temporal attention encoder (TAE) for crop mapping and found the TAE performed better than either a CNN or an RNN. L-TAE [249] replaced the TAE with a lightweight transformer that is both computationally efficient and more accurate than the full TAE. Ofori-Ampofo et al. [250] adapted the TAE model for multi-modal inputs using Sentinel-1 and Sentinel-2 data for crop-type mapping. Rußwurm and Körner [265] compared a self-attention model with RNN and CNN architectures. They found that this model was more robust to noise than either RNN or CNN and suggested that self-attention is suitable for processing raw, cloud-affected satellite data.
Building on the success of pre-trained transformers for natural language processing such as BERT [94], pre-trained transformers have been proposed for Earth observation tasks [251]. Earth observation tasks are particularly suited for pre-trained models as large quantities of Earth observation data are readily available, while labeled data can be difficult to obtain [266], especially in remote locations. SITS-BERT [251] is an adaptation of BERT [94] for pixel-based SITS classification. For the pretext task, random noise is added to the pixels, and the model is trained to identify and remove this noise. The pre-trained model is then further trained for required tasks such as crop type or land cover mapping. SITS-Former [263] modifies SITS-BERT for patch classification by using 3D-Conv layers to encode the spatial-spectral information, which is then passed through the temporal attention layers. The pretext task used for SITS-Former is to predict randomly masked pixels.
7.2.4 Hybrid Models.
A common use of hybrid models is to use a CNN to extract spatial features and an RNN to extract temporal features. Garnot et al. [267] compared a straight 2D-CNN model (thus ignoring the temporal aspect), a straight GRU model (thus ignoring the spatial aspect), and a combined 2D-CNN and GRU model (thus using both spatial and temporal information) and found the combined model gave the best results, demonstrating that both the spatial and temporal dimensions provide useful information for land cover mapping and crop classification. DuPLO [257] was one of the first models to exploit this method, running a CNN and ConvGRU model in parallel, then fusing the outputs using a fully connected network for the final classifier. During training, an auxiliary classifier for each component was used to enhance the discriminative power. TWINNS [256] extended DuPLO to a multi-modal model, using time series of both Sentinel-1 (SAR) and Sentinel-2 (Optical) images. Each modality was processed by separate CNN and convGRU models, and then the output features from all four models were fused for classification.
Other hybrid models include Li et al. [244], who used a CNN for spatial and spectral unification of Landsat-8 and Sentinel-2 images, which were then processed by a GRU. MLDL-Net [243] is a 2D-CNN extrinsic regression model, using CNNs to extract time step features, which are then passed through an LSTM model to extract temporal features. Fully connected layers combine the feature sets to predict crop yield. Rußwurm and Körner [258] extracted temporal features first, using a bi-directional LSTM, then used a fully convolutional 2D-CNN to incorporate spatial information and classify each pixel in the input patch.
7.2.5 Ensemble Models.
One of the easiest ways to ensemble DL models is to train multiple homogeneous models that vary only in the random weight initialization [268]. Di Mauro et al. [252] ensembled 100 LULC models with different weight initializations by averaging the softmax predictions. They found this produced a more stable and stronger classifier that outperformed the individual models. Multi-tempCNN [240], a model for LFMC estimation, is an ensemble of homogeneous models for extrinsic regression. The authors suggested that as an additional benefit, the variance of the individual model predictions can be used to obtain a measure of uncertainty of the estimates. TSI [255] also ensembles a set of homogeneous models, but instead of relying on random weight initialization to introduce model diversity, the time series are segmented and models trained on each segment.
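A minimal sketch of homogeneous ensembling, assuming the member models have already been trained and have produced their outputs: class probabilities are averaged for classification, and for regression the across-member variance is reported alongside the mean as a rough uncertainty measure. The function names and the toy numbers are hypothetical and are not taken from the cited studies.

```python
from statistics import mean, pvariance


def ensemble_classify(member_probs):
    """Average the softmax outputs of all members; return (class index, mean probs)."""
    n_classes = len(member_probs[0])
    avg = [mean(p[c] for p in member_probs) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


def ensemble_regress(member_preds):
    """Return the mean prediction and the across-member variance of the estimates."""
    return mean(member_preds), pvariance(member_preds)


# Softmax outputs of three members for one sample (hypothetical values).
probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
label, avg = ensemble_classify(probs)

# Regression estimates of three members for one sample (hypothetical values).
estimate, spread = ensemble_regress([100.0, 110.0, 120.0])
print(label, estimate, round(spread, 2))
```

Note that a single member (the first) would have chosen class 0 here, while the averaged ensemble chooses class 1; a large `spread` flags samples on which the members disagree.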
Other methods create ensembles of heterogeneous models. Kussul et al. [253] compared ensembles of 1D-CNN and 2D-CNN models for land cover classification. Each model in the ensemble used a different number of filters, thus finding different feature sets useful for classification. Xie et al. [241] ensembled three heterogeneous models, namely a causal temporal convolutional neural network (TCN), an LSTM, and a hybrid TCN-LSTM model, to estimate LFMC by extrinsic regression. The ensembles were created using stacking [269]. The authors compared this method to boosting their TCN-LSTM model, using Adaboost [270] to create a three-member ensemble, and found that stacking a diverse set of models out-performed boosting.
7.2.6 EO Surveys and Reviews.
This survey is one of very few that include a section focusing specifically on deep learning TSC and TSER tasks using SITS data. However, there are other reviews that provide further information about related topics. Gomez et al. [271] is an older review highlighting the important role of SITS data for land cover classification. Zhu et al. [272] reviewed the advances and challenges in DL for remote sensing and the resources available that are potentially useful to help DL address some of the major challenges facing humanity. Ma et al. [273] studied the role of deep learning in Earth observation using remotely sensed data. Their review covers a broad range of tasks including image fusion, image segmentation, and object-based analysis, as well as classification tasks. Yuan et al. [274] provide a review of DL applications for remote sensing, comparing the role of DL versus physical modeling of environmental variables and highlighting challenges in DL for remote sensing that need to be addressed. Chaves et al. [275] reviewed recent research using Landsat 8 and/or Sentinel-2 data for land cover mapping. While not focused on SITS DL methods, the review notes the growing importance of these methods. Moskolai et al. [276] review forecasting applications using DL with SITS data and analyze the main DL architectures that are relevant for classification as well as forecasting.
8 Conclusion
In conclusion, this survey article has discussed a variety of deep network architectures for time series classification and extrinsic regression tasks, including multilayer perceptrons, convolutional neural networks, recurrent neural networks, and attention-based models. We have also highlighted refinements that have been made to improve the performance of these models on time series tasks. Additionally, we have discussed two critical applications of time series classification and regression, human activity recognition and satellite Earth observation. Overall, using deep network architectures and refinements has enabled significant progress in the field of time series classification and will continue to be essential for addressing a wide range of real-world problems. We hope this survey will stimulate further research using deep learning techniques for time series classification and extrinsic regression. Additionally, we provide a carefully curated collection of sources, available at https://github.com/Navidfoumani/TSC_Survey, to further support the research community.
In this section, we aim to give a brief introduction to the field of TSC and discuss its current status. We refer interested readers to the “bake-off” papers [11, 25, 26] that describe TSC methods in much more detail and benchmark them.
Research in TSC started with distance-based approaches that find discriminating patterns in the shape of the time series. Distance-based approaches usually consist of coupling a 1-nearest neighbor (1NN) classifier with a time series distance measure [277, 278]. Small distortions in the time series can lead to false matches when measuring the distance between time series using standard distance measurements such as Euclidean distance [277]. A time series distance measure aims to compensate for these distortions by aligning two time series such that the alignment cost between the two is minimized. There are many time series distances proposed in the literature; among these, the \(DTW\) distance is one of the most popular choices for many time series tasks, due to its intuitiveness and effectiveness in aligning two time series. The 1NN-\(DTW\) has been the go-to method for TSC for decades. Indeed, by comparing several time series distance measures, the work in [277] showed that as of 2015, there was no single distance that significantly outperformed \(DTW\) when used with a 1NN classifier. The recent Amerced \(DTW\) [279] distance is the first distance that is significantly more accurate than \(DTW\). Individual 1NN classifiers with different distances can be combined into an ensemble, such as the Ensemble of Elastic distances (EE), that significantly outperforms each of them individually [277, 278]. However, since most distances have a complexity of \(O(L^2),\) where \(L\) is the length of the series, performing a nearest neighbor search becomes very costly. Hence, distance-based approaches are considered to be one of the slowest methods for TSC [280, 281].
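To make the distance-based approach concrete, the sketch below implements an unconstrained dynamic time warping cost with a squared pointwise difference and couples it with a 1NN classifier. It is a textbook-style illustration under those assumptions (no warping window, no lower bounding), not the implementation benchmarked in the cited work; the function names and the toy series are hypothetical.

```python
def dtw(a, b):
    """Unconstrained DTW cost between two univariate series.

    Fills a (len(a) + 1) x (len(b) + 1) table, hence the quadratic
    cost in the series length noted in the text."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j - 1]
                                 cost[i][j - 1],      # repeat a[i - 1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[-1][-1]


def classify_1nn(query, train):
    """Return the label of the training series closest to `query` under DTW."""
    return min(train, key=lambda item: dtw(query, item[0]))[1]


train = [([0, 0, 1, 2, 1, 0], "peak"), ([2, 2, 1, 0, 1, 2], "valley")]
query = [0, 1, 2, 1, 0, 0]  # the same peak, shifted one step earlier

squared_euclidean = sum((q - p) ** 2 for q, p in zip(query, train[0][0]))
print(dtw(query, train[0][0]), squared_euclidean, classify_1nn(query, train))
```

The shifted peak has a nonzero squared Euclidean distance to its own class prototype, while the DTW alignment absorbs the shift entirely, which is the kind of distortion the text refers to.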
As a result of EE, recent studies have focused mainly on developing ensembling methods that significantly outperform 1NN-\(DTW\) [278, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290]. These approaches use either an ensemble of tree-based approaches [289, 290] or an ensemble of different types of discriminant classifiers, such as NN with several distances and SVM on one or several feature spaces [282, 285, 286, 287]. All of these approaches share a common property—the data transformation phase where the time series is transformed into a new feature space such as the shapelets transform [286] or DTW features [285]. Taking advantage of this notion led to the development of the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) [281, 284]. HIVE-COTE is a meta ensemble for TSC and forms its ensemble from ensemble classifiers of multiple domains. Since its introduction in 2016 [284], HIVE-COTE has gone through a few iterations. Recently, the latest HIVE-COTE version, HIVE-COTEv2.0 (HC2), was proposed [281]. It is composed of four ensemble members, each of them being the then state of the art in their respective domains. It is currently one of the most accurate classifiers for both univariate and multivariate TSC tasks [281]. Despite being accurate on 26 multivariate and 142 univariate TSC benchmark datasets that are relatively small, HC2 scales poorly on large datasets with long time series as well as datasets with large numbers of channels.
Considerable work has been done on speeding up TSC methods without sacrificing accuracy [14, 278, 291, 292, 293, 294, 295]. A recent breakthrough is the development of Rocket [14], which was able to process 109 univariate time series datasets in under 4 hours, while the previous fastest method took days. Rocket leverages a large number of random convolutional filters to extract features from each series that might be relevant to classifying a series. These features are then passed to a linear model for classification. Rocket has been improved to be faster (Minirocket [291]) and more accurate (Multirocket [292] and Hydra [293]). Hydra when combined with Multirocket is now one of the fastest and most accurate methods for TSC.
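The random-kernel idea can be sketched as follows. This is a heavily simplified illustration and not the Rocket algorithm itself: it uses only a handful of kernels and omits dilation, padding, and the mean-centering of weights used in Rocket [14]. The function name `random_kernel_features`, the number of kernels, and the toy series are assumptions. Each random kernel contributes two features, the maximum response and the proportion of positive values.

```python
import math
import random


def random_kernel_features(series, n_kernels=4, seed=0):
    """Convolve `series` with random kernels and pool each response into
    two features: its maximum and its proportion of positive values (PPV)."""
    rng = random.Random(seed)
    features = []
    for _ in range(n_kernels):
        length = rng.choice((7, 9, 11))
        weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
        bias = rng.uniform(-1.0, 1.0)
        outputs = [bias + sum(w * series[t + j] for j, w in enumerate(weights))
                   for t in range(len(series) - length + 1)]
        features.append(max(outputs))
        features.append(sum(o > 0 for o in outputs) / len(outputs))
    return features


# Hypothetical input series of 30 observations.
series = [math.sin(0.5 * t) for t in range(30)]
features = random_kernel_features(series)
print(len(features))  # 8: two features per kernel
```

The kernels are never trained; only the linear model fitted on the resulting feature vectors learns from the labels, which is what makes this family of methods fast.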
B DNN Architectures for Time Series
In this section, we provide a descriptive overview of deep-learning-based models for TSC. The focus is on clarifying their architectures and outlining their adaptations to the specific characteristics of time series data.
B.1 Convolutional Neural Networks (CNNs)
Many variants of CNN architectures have been proposed in the literature, but their primary components are very similar. Taking LeNet-5 [296] as an example, a CNN consists of three types of layers: convolutional, pooling, and fully connected. The purpose of the convolutional layer is to learn feature representations of the inputs. Figure 2(a) shows the architecture of the t-LeNet network, which is a time-series-specific version of LeNet. This figure shows that the convolution layer is composed of several convolution kernels (or filters) used to compute different feature maps. In particular, each neuron of a feature map is connected to a region of neighboring neurons in the previous layer called the receptive field. Feature maps can be created by first convolving inputs with learned kernels and then applying an element-wise nonlinear activation function to the convolved results. It is important to note that all spatial locations of the input share the kernel for each feature map, and several kernels are used to obtain the entire feature map.
Fig. 2.
The feature value of the \(l\)th layer of the \(k\)th feature map at location \((i,j)\) is obtained by
\[Z^l_{i,j,k} = {{\bf W}^l_k}^{T} {\bf A}^{l-1}_{i,j} + b^l_k,\]
where \({\bf W}^l_k\) and \(b^l_k\) are the weight vector and bias term of the \(k\)th filter of the \(l\)th layer, respectively, and \({\bf A}^{l-1}_{i,j}\) is the input patch centered at location \((i, j)\) of the \(l\)th layer. Note that the kernel \({\bf W}^l_k\) that generates the feature map \(Z^l_{:,:,k}\) is shared. A weight-sharing mechanism has several advantages, such as reducing model complexity and making the network easier to train. Let \(f\left(.\right)\) denote the nonlinear activation function. The activation value of convolutional feature \(Z^l_{i,j,k}\) can be computed as
\[A^l_{i,j,k} = f\left(Z^l_{i,j,k}\right).\]
The most common activation functions are sigmoid, tanh, and ReLU [297]. As shown in Figure 2(a), a pooling layer is often placed between two convolution layers to reduce the resolution of the feature maps and to achieve shift invariance. Following several convolution stages (a stage being the block comprising convolution, activation, and pooling), there may be one or more fully connected layers that aim to perform high-level reasoning. As discussed in Section 3.1, each neuron in the previous layer is connected to every neuron in the current layer to generate global semantic information. In the final layer of CNNs, there is the output layer in which the Softmax operators are commonly used for classification tasks [40].
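One convolution stage (convolution, activation, pooling) can be sketched for a univariate series as follows. This is an illustrative toy, assuming a single kernel, a ReLU activation, and non-overlapping max pooling; the function name `conv_stage` and the example values are hypothetical and do not correspond to any specific network in this survey.

```python
def conv_stage(x, kernel, bias, pool=2):
    """Apply one convolution stage to a univariate series `x`."""
    k = len(kernel)
    # Convolution: weighted sum of each input patch plus the bias term.
    z = [bias + sum(w * x[i + j] for j, w in enumerate(kernel))
         for i in range(len(x) - k + 1)]
    # Activation: element-wise ReLU.
    a = [max(0.0, v) for v in z]
    # Pooling: non-overlapping max pooling halves the resolution for pool=2.
    return [max(a[i:i + pool]) for i in range(0, len(a) - pool + 1, pool)]


x = [1.0, 3.0, 2.0, 0.0, 4.0]
out = conv_stage(x, kernel=[1.0, -1.0], bias=0.0)
print(out)  # [1.0, 2.0]
```

The same kernel weights are reused at every position of `x`, which is the weight sharing described above, and the pooled output is half the length of the activation map.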
B.2 Recurrent Neural Networks (RNN)
RNNs are a type of neural network specifically designed to process time series and other sequential data. RNNs are conceptually similar to FFNs. However, while FFNs map from fixed-size inputs to fixed-size outputs, RNNs can process variable-length inputs and produce variable-length outputs. This capability is enabled by sharing parameters over time through directed connections between individual layers. RNN models for TSC can be classified as sequence-to-sequence or sequence-to-one based on their outputs. Figure 2(b) shows a sequence-to-sequence architecture for RNN models, with an output for each input sub-series. In a sequence-to-one architecture, on the other hand, decisions are made using only \(y^T\), ignoring the other outputs.
At each time step \(t\), RNNs maintain a hidden vector \(h\), which is updated as follows [298, 299]:
\[h^t = \tanh\left(W h^{t-1} + I x^t\right),\]
where \(X =\left\lbrace x^1, \ldots , x^{t-1},x^t, \ldots , x^T\right\rbrace\) contains all of the observations, \(\tanh\) denotes the hyperbolic tangent function, and the recurrent weight matrix and the projection matrix are denoted by \(W\) and \(I\), respectively. The hidden-to-hidden connections also model the short-term time dependency. The hidden state \(h\), together with an output weight matrix written here as \(W_o\), is used to make a prediction as
\[y^t = \sigma_s\left(W_o h^t\right),\]
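where \(\sigma _s\) is a softmax function and provides a normalized probability distribution over the possible classes. As depicted in Figure 2(b), the hidden state \(h\) can be used to stack RNNs in order to build deeper networks, with \(h^t_l\) denoting the hidden state of layer \(l\) at time step \(t\):
\[h^t_l = \sigma\left(W h^{t-1}_l + I h^t_{l-1}\right),\]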
where \(\sigma\) is the logistic sigmoid function. As an alternative to feeding each time step to the RNN, the data can be divided into time windows of \(\omega\) observations, optionally with overlap between consecutive windows. Each time window is labeled with the majority response label among its \(\omega\) observations.
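The recurrence and the windowing scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: `W_out` is a hypothetical name for the output projection, the weights are random instead of trained, and ties in the majority vote are resolved in favor of the smallest label, which is an assumption of this sketch.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_sequence_to_one(x, W, I, W_out):
    """Run h^t = tanh(W h^{t-1} + I x^t) and classify from the last state only.

    x: (T, d_in) series of any length T; W: (d_h, d_h); I: (d_h, d_in).
    W_out: (n_classes, d_h), a hypothetical output projection.
    """
    h = np.zeros(W.shape[0])
    for x_t in x:
        h = np.tanh(W @ h + I @ x_t)  # the same W and I are shared over time
    return softmax(W_out @ h)

def window_majority_labels(labels, omega, step):
    """Label each window of omega observations with its most frequent label.

    step < omega gives overlapping windows. np.unique returns sorted values,
    so ties go to the smallest label.
    """
    out = []
    for start in range(0, len(labels) - omega + 1, step):
        values, counts = np.unique(labels[start:start + omega], return_counts=True)
        out.append(values[np.argmax(counts)])
    return np.array(out)

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))            # 20 time steps, 3 channels
W = rng.normal(scale=0.5, size=(8, 8))  # recurrent weights
I = rng.normal(scale=0.5, size=(8, 3))  # input projection
W_out = rng.normal(size=(4, 8))         # 4 classes
probs = rnn_sequence_to_one(x, W, I, W_out)

labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
window_labels = window_majority_labels(labels, omega=4, step=2)
print(window_labels)  # [0 1 1 2]
```

Because the loop runs over however many time steps the input has, the same parameters handle series of different lengths, which is the property that distinguishes RNNs from FFNs in the text.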
B.3 Attention-based Model
B.3.1 Self-attention.
The attention mechanism was introduced by [300] for improving the performance of encoder-decoder models [301] in neural machine translation. The encoder-decoder in neural machine translation encodes a source sentence into a vector in latent space and decodes the latent vector into a target language sentence. As shown in Figure 3(a), the attention mechanism allows the decoder to pay attention to the relevant segments of the source for each target step through a context vector \(c_t\). For this model, a variable-length attention vector \(\alpha _t\), whose length equals the number of source time steps, is derived by comparing the current target hidden state \(h_t\) with each source hidden state \(\overline{h}_s\) as follows [302]:
\[\alpha_t(s) = \frac{\exp\left(\mathrm{score}(h_t, \overline{h}_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, \overline{h}_{s'})\right)}.\]
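The function \(\mathrm{score}\) is referred to as an alignment model and is used to compare the target hidden state \(h_t\) with each of the source hidden states \(\overline{h}_s\), and the result is normalized to produce attention weights (a distribution over source positions). There are various choices of the scoring function:
\[\mathrm{score}(h_t, \overline{h}_s) = \begin{cases} h_t^{\top} \overline{h}_s & \text{(dot)} \\ h_t^{\top} W_a \overline{h}_s & \text{(general)} \\ v_a^{\top} \tanh\left(W_a [h_t; \overline{h}_s]\right) & \text{(concat)} \end{cases}\]
where \(W_a\) and \(v_a\) are learned parameters.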
These scores influence the attention distribution, impacting how the model attends to different parts of the input sequence during predictions. The score function can be parameterized as an FFN that is jointly trained with all the other components of the model. The model directly computes soft attention, allowing the cost function’s gradient to be backpropagated [300]. Given the alignment vector as weights, the context vector \(c_t\) is computed as the weighted average over all the source hidden states:
\[c_t = \sum_{s} \alpha_t(s)\, \overline{h}_s.\]
Accordingly, the computation path goes from \(h_t\rightarrow \alpha _t \rightarrow c_t \rightarrow \widetilde{h}_t\), and a prediction is then made using a softmax function [302]. Note that \(\widetilde{h}_t\) is a refined hidden state that incorporates both the original hidden state \(h_t\) and the context information \(c_t\) obtained through the attention mechanism.
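The path from \(h_t\) to \(\alpha_t\) to \(c_t\) can be illustrated with a short NumPy sketch. It uses a dot-product score and, optionally, a bilinear score with a hypothetical matrix `W_a`; both are common choices in the attention literature and are assumed here for illustration only. The final step that produces \(\widetilde{h}_t\) is omitted.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_context(h_t, h_src, W_a=None):
    """Attention weights alpha_t and context vector c_t for one target step.

    h_t:   (d,) current target hidden state
    h_src: (S, d) source hidden states, one row per source step
    W_a:   optional (d, d) matrix; None selects the plain dot-product score
    """
    if W_a is None:
        scores = h_src @ h_t            # score = h_t . h_s for every s
    else:
        scores = h_src @ (W_a.T @ h_t)  # score = h_t^T W_a h_s for every s
    alpha = softmax(scores)             # distribution over source positions
    c_t = alpha @ h_src                 # weighted average of source states
    return alpha, c_t

h_src = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
h_t = np.array([1.0, 0.0])
alpha, c_t = attention_context(h_t, h_src)
print(alpha.shape, c_t.shape)  # (3,) (2,)
```

In this toy input the first and third source states have the same dot product with \(h_t\), so they receive equal weight, while the orthogonal second state receives the smallest weight.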
Fig. 3.
B.3.2 Transformers.
Similar to self-attention and other competitive neural sequence models, the original transformer developed for natural language processing (hereinafter the vanilla transformer) has an encoder-decoder structure that takes as input a sequence of words from the source language and then generates the translation in the target language [93]. Both the encoder and decoder are composed of multiple identical blocks. Each encoder block consists of a multi-head self-attention module and a position-wise FFN, while each decoder block inserts a cross-attention module between the multi-head self-attention module and the position-wise FFN. Unlike RNNs, transformers do not use recurrence and instead model sequence information using the positional encoding in the input embeddings.
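The transformer architecture is based on finding associations or correlations between various input segments using the dot product. As shown in Figure 3(b), the attention operation in transformers starts with building three different linearly weighted vectors from the input \(x_i\), referred to as query (\(q_i\)), key (\(k_i\)), and value (\(v_i\)):
\[{\bf q}_i = W_q x_i, \quad {\bf k}_i = W_k x_i, \quad {\bf v}_i = W_v x_i.\]
The output \({\bf z}_i\) at position \(i\) is then a weighted sum of the value vectors:
\[{\bf z}_i = \sum_{j} \mathrm{softmax}_j\left(\frac{{\bf q}_i^{\top} {\bf k}_j}{\sqrt{d_q}}\right) {\bf v}_j.\]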
Note that the weight applied to the value vector \({\bf v}_j\) depends on the mapped correlation between the query vector \({\bf q}_i\) at position \(i\) and the key vector \({\bf k}_j\) at position \(j\). The value of the dot product tends to grow with the increasing size of the query and key vectors. As the softmax function is sensitive to large values, the dot products are divided by the square root of the dimension \(d_q\) of the query and key vectors before the softmax is applied. The input data may contain several levels of correlation information, and the learning process may benefit from processing the input data in multiple different ways. Multiple attention heads are therefore introduced that operate on the same input in parallel and use different weight matrices \(W_q,W_k\), and \(W_v\) to extract various levels of correlation between the input data.
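The scaled dot-product attention and the multi-head mechanism described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: inputs are stored as row vectors, so each projection is written `x @ W_q` rather than \(W_q x_i\); the weights are random rather than learned; and the positional encoding and the final output projection of a full transformer block are omitted.

```python
import numpy as np

def softmax_rows(s):
    """Numerically stable softmax applied to each row of a 2D array."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    x: (T, d_model) one row per position; W_q, W_k: (d_model, d_q);
    W_v: (d_model, d_v). Returns the outputs (T, d_v) and weights (T, T).
    """
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    d_q = q.shape[-1]
    # Divide by sqrt(d_q) so large dot products do not saturate the softmax.
    weights = softmax_rows(q @ k.T / np.sqrt(d_q))
    return weights @ v, weights

def multi_head_self_attention(x, heads):
    """Run several heads in parallel on the same input and concatenate them."""
    outputs = [self_attention(x, W_q, W_k, W_v)[0] for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
T, d_model, d_head, n_heads = 6, 8, 4, 2
x = rng.normal(size=(T, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]

out_single, attn = self_attention(x, *heads[0])
out_multi = multi_head_self_attention(x, heads)
print(out_single.shape, attn.shape, out_multi.shape)  # (6, 4) (6, 6) (6, 8)
```

Row \(i\) of `attn` holds the weights that position \(i\) assigns to every position \(j\), so each row sums to one.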
References
[1] Qiang Yang and Xindong Wu. 2006. 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5, 4 (2006), 597–604.
Henry Friday Nweke, Ying Wah Teh, Mohammed Ali Al-Garadi, and Uzoma Rita Alo. 2018. Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges. Expert Systems with Applications 105 (2018), 233–261.
Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Wolfram Burgard, and Tonio Ball. 2017. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping 38, 11 (2017), 5391–5420.
A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, P. Sundberg, H. Yee, K. Zhang, Y. Zhang, G. Flores, G. E. Duggan, J. Irvine, Q. Le, K. Litsch, A. Mossin, J. Tansuwan, D. Wang, J. Wexler, J. Wilson, D. Ludwig, S. L. Volchenboum, K. Chou, M. Pearson, S. Madabushi, N. H. Shah, A. J. Butte, M. D. Howell, C. Cui, G. S. Corrado, and J. Dean. 2018. Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine 1 (2018), 18. PMID: 31304302; PMCID: PMC6550175.
Anthony Bagnall, Hoang Anh Dau, Jason Lines, Michael Flynn, James Large, Aaron Bostrom, Paul Southam, and Eamonn Keogh. 2018. The UEA multivariate time series classification archive, 2018. arXiv preprint:1811.00075 (2018).
Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, and Eamonn Keogh. 2019. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica 6, 6 (2019), 1293–1305.
Chang Wei Tan, Christoph Bergmeir, François Petitjean, and Geoffrey I. Webb. 2021. Time series extrinsic regression. Data Mining and Knowledge Discovery 35, 3 (2021), 1032–1060.
Matthew Middlehurst, Patrick Schäfer, and Anthony Bagnall. 2023. Bake off redux: A review and experimental evaluation of recent time series classification algorithms. arXiv preprint:2304.13029 (2023).
Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F. Schmidt, Jonathan Weber, Geoffrey I. Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petitjean. 2020. Inceptiontime: Finding Alexnet for time series classification. Data Mining and Knowledge Discovery 34, 6 (2020), 1936–1962.
Navid Mohammadi Foumani, Chang Wei Tan, Geoffrey I. Webb, and Mahsa Salehi. 2024. Improving position encoding of transformers for multivariate time series classification. Data Mining and Knowledge Discovery 38, 1 (2024), 22–48.
Angus Dempster, François Petitjean, and Geoffrey I. Webb. 2020. ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery 34, 5 (2020), 1454–1495.
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2019. Deep learning for time series classification: A review. Data Mining and Knowledge Discovery 33, 4 (2019), 917–963.
Zhiguang Wang, Weizhong Yan, and Tim Oates. 2017. Time series classification from scratch with deep neural networks: A strong baseline. In 2017 International Joint Conference on Neural Networks (IJCNN’17). IEEE, 1578–1585.
Yifan Hao and Huiping Cao. 2020. A new attention mechanism to classify multivariate time series. In 29th International Joint Conference on Artificial Intelligence.
George Zerveas, Srideepika Jayaraman, Dhaval Patel, Anuradha Bhamidipaty, and Carsten Eickhoff. 2021. A transformer-based framework for multivariate time series representation learning. In 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2114–2124.
Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. 2021. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering 35, 1 (2021), 857–876.
Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, Chee Keong Kwoh, Xiaoli Li, and Cuntai Guan. 2021. Time-series representation learning via temporal and contextual contrasting. In Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI’21). 2352–2359.
Chao-Han Huck Yang, Yun-Yun Tsai, and Pin-Yu Chen. 2021. Voice2series: Reprogramming acoustic models for time series classification. In International Conference on Machine Learning. PMLR, 11808–11819.
Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. 2022. Ts2vec: Towards universal representation of time series. Proceedings of the AAAI Conference on Artificial Intelligence 36, 8 (2022), 8980–8987.
Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn Keogh. 2017. The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery 31, 3 (2017), 606–660.
Alejandro Pasos Ruiz, Michael Flynn, James Large, Matthew Middlehurst, and Anthony Bagnall. 2021. The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery 35, 2 (2021), 401–449.
Chang Wei Tan, Christoph Bergmeir, Francois Petitjean, and Geoffrey I Webb. 2020. Monash university, UEA, UCR time series regression archive. arXiv preprint:2006.10996 (2020).
Martin Längkvist, Lars Karlsson, and Amy Loutfi. 2014. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters 42 (2014), 11–24.
Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. 2013. Generalized denoising auto-encoders as generative models. Advances in Neural Information Processing Systems 26 (2013), 899–907.
Qinghua Hu, Rujia Zhang, and Yucan Zhou. 2016. Transfer learning for short-term wind speed prediction with deep neural networks. Renewable Energy 85 (2016), 83–95.
Debrup Banerjee, Kazi Islam, Keyi Xue, Gang Mei, Lemin Xiao, Guangfan Zhang, Roger Xu, Cai Lei, Shuiwang Ji, and Jiang Li. 2019. A deep transfer learning approach for improved post-traumatic stress disorder diagnosis. Knowledge and Information Systems 60, 3 (2019), 1693–1724.
Witali Aswolinskiy, René Felix Reinhart, and Jochen Steil. 2018. Time series classification in reservoir-and model-space. Neural Processing Letters 48, 2 (2018), 789–809.
Felipe Arias Del Campo, María Cristina Guevara Neri, Osslan Osiris Vergara Villegas, Vianey Guadalupe Cruz Sánchez, Humberto de Jesús Ochoa Domínguez, and Vicente García Jiménez. 2021. Auto-adaptive multilayer perceptron for univariate time series classification. Expert Systems with Applications 181 (2021), 115147.
Brian Kenji Iwana, Volkmar Frinken, and Seiichi Uchida. 2016. A robust dissimilarity-based neural network for temporal pattern recognition. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR’16). IEEE, 265–270.
Brian Kenji Iwana, Volkmar Frinken, and Seiichi Uchida. 2020. DTW-NN: A novel neural network for time series recognition using dynamic alignment between inputs and weights. Knowledge-based Systems 188 (2020), 104971.
Nuzhat Tabassum, Sujeendran Menon, and Agnieszka Jastrzebska. 2022. Time-series classification with SAFE: Simple and fast segmented word embedding-based neural time series classifier. Information Processing & Management 59, 5 (2022), 103044.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
Seyed Navid Mohammadi Foumani and Ahmad Nickabadi. 2019. A probabilistic topic model using deep visual word representation for simultaneous image classification and annotation. Journal of Visual Communication and Image Representation 59 (2019), 195–203.
Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J Leon Zhao. 2014. Time series classification using multi-channels deep convolutional neural networks. In International Conference on Web-age Information Management. Springer, 298–310.
Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. 2015. Deep convolutional neural networks on multichannel time series for human activity recognition. In 24th International Joint Conference on Artificial Intelligence.
Bendong Zhao, Huanzhang Lu, Shangfeng Chen, Junliang Liu, and Dongya Wu. 2017. Convolutional neural networks for time series classification. Journal of Systems Engineering and Electronics 28, 1 (2017), 162–169.
Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In IEEE Conference on Computer Vision and Pattern Recognition. 2921–2929.
Xiaowu Zou, Zidong Wang, Qi Li, and Weiguo Sheng. 2019. Integration of residual network and convolutional neural network along with various activation functions and global pooling for time series classification. Neurocomputing 367 (2019), 39–45.
Yuhong Li, Xiaofan Zhang, and Deming Chen. 2018. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In IEEE Conference on Computer Vision and Pattern Recognition. 1091–1100.
Omolbanin Yazdanbakhsh and Scott Dick. 2019. Multivariate time series classification using dilated convolutional neural network. arXiv preprint:1905.01697 (2019).
Seyed Navid Mohammadi Foumani, Chang Wei Tan, and Mahsa Salehi. 2021. Disjoint-CNN for multivariate time series classification. In 2021 International Conference on Data Mining Workshops (ICDMW’21). IEEE, 760–769.
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNet-V2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
Zhiguang Wang and Tim Oates. 2015. Encoding time series as images for visual inspection and classification using tiled convolutional neural networks. In Workshops at the 29th AAAI Conference on Artificial Intelligence.
Nima Hatami, Yann Gavet, and Johan Debayle. 2018. Classification of time-series images using deep convolutional neural networks. In 10th International Conference on Machine Vision (ICMV’17), Vol. 10696. SPIE, 242–249.
Saeed Karimi-Bidhendi, Faramarz Munshi, and Ashfaq Munshi. 2018. Scalable classification of univariate and multivariate time series. In 2018 IEEE International Conference on Big Data (Big Data’18). IEEE, 1598–1605.
Yuxuan Zhao and Zhongmin Cai. 2019. Classify multivariate time series by deep neural network image classification. In 2019 2nd China Symposium on Cognitive Computing and Hybrid Intelligence (CCHI’19). IEEE, 93–98.
Chao-Lung Yang, Zhi-Xuan Chen, and Chen-Yi Yang. 2019. Sensor classification using convolutional neural network by encoding multivariate time series as two-dimensional colored images. Sensors 20, 1 (2019), 168.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
Wei Chen and Ke Shi. 2019. A deep learning framework for time series classification using Relative Position Matrix and Convolutional Neural Network. Neurocomputing 359 (2019), 384–394.
Zhicheng Cui, Wenlin Chen, and Yixin Chen. 2016. Multi-scale convolutional neural networks for time series classification. arXiv preprint:1603.06995 (2016).
Arthur Le Guennec, Simon Malinowski, and Romain Tavenard. 2016. Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data.
Chien-Liang Liu, Wen-Hoar Hsaio, and Yao-Chung Tu. 2018. Time series classification with multivariate convolutional neural network. IEEE Transactions on Industrial Electronics 66, 6 (2018), 4788–4797.
Anthony Brunel, Johanna Pasquet, Jérôme Pasquet, Nancy Rodriguez, Frédéric Comby, Dominique Fouchez, and Marc Chaumont. 2019. A CNN adapted to time series for the classification of Supernovae. Electronic Imaging 2019, 14 (2019), 90–1.
Jingyu Sun, Susumu Takeuchi, and Ikuo Yamasaki. 2021. Prototypical inception network with cross branch attention for time series classification. In 2021 International Joint Conference on Neural Networks (IJCNN’21). IEEE, 1–7.
Saidrasul Usmankhujaev, Bunyodbek Ibrokhimov, Shokhrukh Baydadaev, and Jangwoo Kwon. 2021. Time series classification with InceptionFCN. Sensors 22, 1 (2021), 157.
Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, and Germain Forestier. 2023. LITE: Light inception with boosTing tEchniques for time series classification. In 2023 IEEE 10th International Conference on Data Science and Advanced Analytics (DSAA’23). IEEE, 1–10.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, inception-resnet and the impact of residual connections on learning. In 31st AAAI Conference on Artificial Intelligence.
Mutegeki Ronald, Alwin Poulose, and Dong Seog Han. 2021. iSPLInception: An inception-ResNet deep learning architecture for human activity recognition. IEEE Access 9 (2021), 68985–69001.
Ali Ismail-Fawaz, Maxime Devanne, Jonathan Weber, and Germain Forestier. 2022. Deep learning for time series classification using new hand-crafted convolution filters. In 2022 IEEE International Conference on Big Data (Big Data’22). IEEE, 972–981.
Don Dennis, Durmus Alp Emre Acar, Vikram Mandikal, Vinu Sankar Sadasivan, Venkatesh Saligrama, Harsha Vardhan Simhadri, and Prateek Jain. 2019. Shallow RNN: Accurate time-series classification on resource constrained devices. Advances in Neural Information Processing Systems 32 (2019), 11 pages.
Santiago Fernández, Alex Graves, and Jürgen Schmidhuber. 2007. Sequence labelling in structured domains with hierarchical recurrent neural networks. In 20th International Joint Conference on Artificial Intelligence (IJCAI’07).
Michiel Hermans and Benjamin Schrauwen. 2013. Training and analysing deep recurrent neural networks. Advances in Neural Information Processing Systems 26 (2013), 190–198.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning. PMLR, 1310–1318.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27 (2014), 3104–3112.
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition. 2625–2634.
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition. 3128–3137.
Yujin Tang, Jianfeng Xu, Kazunori Matsumoto, and Chihiro Ono. 2016. Sequence-to-sequence model with attention for time series classification. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW’16). IEEE, 503–510.
Pankaj Malhotra, Vishnu TV, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. 2017. TimeNet: Pre-trained deep recurrent neural network for time series classification. arXiv preprint:1706.08838 (2017).
Fazle Karim, Somshubra Majumdar, Houshang Darabi, and Samuel Harford. 2019. Multivariate LSTM-FCNs for time series classification. Neural Networks 116 (2019), 237–245.
Xuchao Zhang, Yifeng Gao, Jessica Lin, and Chang-Tien Lu. 2020. Tapnet: Multivariate time series classification with attentional prototypical network. In AAAI Conference on Artificial Intelligence, Vol. 34. 6845–6852.
Jingwei Zuo, Karine Zeitouni, and Yehia Taher. 2021. SMATE: Semi-supervised spatio-temporal representation learning on multivariate time series. In 2021 IEEE International Conference on Data Mining (ICDM’21). IEEE, 1565–1570.
Sangdi Lin and George C. Runger. 2017. GCRNN: Group-constrained convolutional recurrent neural network. IEEE Transactions on Neural Networks and Learning Systems 29, 10 (2017), 4709–4718.
Ronald Mutegeki and Dong Seog Han. 2020. A CNN-LSTM approach to human activity recognition. In 2020 IEEE International Conference on Computational Intelligence and Communications Technologies (ICAIIC’20). IEEE, 362–366.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 4171–4186. arXiv:1810.04805
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint:2010.11929 (2020).
Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. 2019. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems 32 (2019).
Demetres Kostas, Stephane Aroca-Ouellette, and Frank Rudzicz. 2021. BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience 15 (2021).
Ye Yuan, Guangxu Xun, Fenglong Ma, Yaqing Wang, Nan Du, Kebin Jia, Lu Su, and Aidong Zhang. 2018. Muvan: A multi-view attention network for multivariate temporal data. In 2018 IEEE International Conference on Data Mining (ICDM’18). IEEE, 717–726.
Tsung-Yu Hsieh, Suhang Wang, Yiwei Sun, and Vasant Honavar. 2021. Explainable multivariate time series classification: A deep neural network which learns to attend to important variables as well as time intervals. In 14th ACM International Conference on Web Search and Data Mining. 607–615.
Ye Yuan, Guangxu Xun, Fenglong Ma, Qiuling Suo, Hongfei Xue, Kebin Jia, and Aidong Zhang. 2018. A novel channel-aware attention framework for multi-channel EEG seizure detection via multi-view deep learning. In 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI’18). IEEE, 206–209.
Xu Cheng, Peihua Han, Guoyuan Li, Shengyong Chen, and Houxiang Zhang. 2020. A novel channel and temporal-wise attention in convolutional networks for multivariate time series classification. IEEE Access 8 (2020), 212247–212257.
Zhiwen Xiao, Xin Xu, Huanlai Xing, Shouxi Luo, Penglin Dai, and Dawei Zhan. 2021. RTFN: A robust temporal feature network for time series classification. Information Sciences 571 (2021), 65–86.
Jingyuan Wang, Chen Yang, Xiaohan Jiang, and Junjie Wu. 2023. WHEN: A Wavelet-DTW hybrid attention network for heterogeneous time series analysis. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2361–2373.
Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. Advances in Neural Information Processing Systems 28 (2015), 2017–2025.
Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional block attention module. In European Conference on Computer Vision. 3–19.
Tian Wang, Zhaoying Liu, Ting Zhang, and Yujian Li. 2021. Time series classification based on multi-scale dynamic convolutional features and distance features. In 2021 2nd Asia Symposium on Signal Processing (ASSP’21). IEEE, 239–246.
Huan Song, Deepta Rajan, Jayaraman Thiagarajan, and Andreas Spanias. 2018. Attend and diagnose: Clinical time series analysis using attention models. In AAAI Conference on Artificial Intelligence, Vol. 32.
Can-can Jin and Xi Chen. 2021. An end-to-end framework combining time–frequency expert knowledge and modified Transformer networks for vibration signal classification. Expert Systems with Applications 171 (2021), 114570.
Tarek Allam Jr. and Jason D. McEwen. 2021. Paying attention to astronomical transients: Photometric classification with the time-series transformer. arXiv preprint:2105.06178 (2021).
Yankun Ren, Longfei Li, Xinxing Yang, and Jun Zhou. 2022. AutoTransformer: Automatic transformer architecture design for time series classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 143–155.
Ming Jin, Huan Yee Koh, Qingsong Wen, Daniele Zambon, Cesare Alippi, Geoffrey I. Webb, Irwin King, and Shirui Pan. 2023. A survey on graph neural networks for time series: Forecasting, classification, imputation, and anomaly detection. arXiv 14, 8 (July 2023), 1–27. arXiv:2307.03759
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2021. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (Jan. 2021), 4–24. arXiv:1901.00596
Wenjie Xi, Arnav Jain, Li Zhang, and Jessica Lin. 2023. LB-SimTSC: An efficient similarity-aware graph neural network for semi-supervised time series classification. arXiv (Jan. 2023). arXiv:2301.04838
Stefan Bloemheuvel, Jurgen van den Hoogen, Dario Jozinović, Alberto Michelini, and Martin Atzmueller. 2023. Graph neural networks for multivariate time series regression with application to seismic data. International Journal of Data Science and Analytics 16, 3 (Sept. 2023), 317–332. arXiv:2201.00818
Ziqiang Cheng, Yang Yang, Shuo Jiang, Wenjie Hu, Zhangchi Ying, Ziwei Chai, and Chunping Wang. 2021. Time2Graph+: Bridging time series and graph representation learning via multiple attentions. IEEE Transactions on Knowledge and Data Engineering 35, 2 (2021), 1–1.
Ian C. Covert, Balu Krishnan, Imad Najm, Jiening Zhan, Matthew Shore, John Hixson, and Ming Jack Po. 2019. Temporal graph convolutional networks for automatic seizure detection. In Machine Learning for Healthcare Conference. PMLR, 160–180.
Zhengjing Ma, Gang Mei, Edoardo Prezioso, Zhongjian Zhang, and Nengxiong Xu. 2021. A deep learning approach using graph convolutional networks for slope deformation prediction based on time-series displacement data. Neural Computing and Applications 33, 21 (2021), 14441–14457.
D. Nhu, M. Janmohamed, P. Perucca, A. Gilligan, P. Kwan, T. O’Brien, C. W. Tan, and L. Kuhlmann. 2021. Graph convolutional network for generalized epileptiform abnormality detection on EEG. In 2021 IEEE Signal Processing in Medicine and Biology Symposium (SPMB’21). IEEE, 1–6.
Siyi Tang, Jared A. Dunnmon, Khaled Saab, Xuan Zhang, Qianying Huang, Florian Dubost, Daniel L. Rubin, and Christopher Lee-Messer. 2021. Self-supervised graph neural networks for improved electroencephalographic seizure analysis. In 10th International Conference on Learning Representations (ICLR’22), 1–23. arXiv:2104.08336
Xiang Zhang, Marko Zeman, Theodoros Tsiligkaridis, and Marinka Zitnik. 2021. Graph-guided network for irregularly sampled multivariate time series. In 10th International Conference on Learning Representations (ICLR’22), 1–21. arXiv:2110.05357
Tiago Azevedo, Alexander Campbell, Rafael Romero-Garcia, Luca Passamonti, Richard A. I. Bethlehem, Pietro Liò, and Nicola Toschi. 2022. A deep graph neural network architecture for modelling spatio-temporal dynamics in resting-state functional MRI data. Medical Image Analysis 79 (July 2022), 102471.
Daochen Zha, Kwei-herng Lai, Kaixiong Zhou, and Xia Hu. 2022. Towards similarity-aware time-series classification. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM’22). 199–207.
Lukasz Tulczyjew, Michal Kawulok, Nicolas Longepe, Bertrand Le Saux, and Jakub Nalepa. 2022. Graph neural networks extract high-resolution cultivated land maps from sentinel-2 image series. IEEE Geoscience and Remote Sensing Letters 19 (2022), 1–5.
Le Sun, Chenyang Li, Bo Liu, and Yanchun Zhang. 2023. Class-driven graph attention network for multi-label time series classification in mobile health digital twins. IEEE Journal on Selected Areas in Communications 41, 10 (2023), 3267–3278.
Corentin Dufourg, Charlotte Pelletier, Stéphane May, and Sébastien Lefèvre. 2023. Graph dynamic earth net: Spatio-temporal graph benchmark for satellite image time series. In 2023 IEEE International Geoscience and Remote Sensing Symposium (IGARSS’23). IEEE, 7164–7167.
Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR’17) - Conference Track Proceedings. 1–14. arXiv:1609.02907
Ling Yang and Shenda Hong. 2022. Unsupervised time-series representation learning with iterative bilinear temporal-spectral fusion. In ICML. 25038–25054.
Aapo Hyvarinen and Hiroshi Morioka. 2016. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in Neural Information Processing Systems 29 (2016), 3772–3780.
Kristoffer Wickstrøm, Michael Kampffmeyer, Karl Øyvind Mikalsen, and Robert Jenssen. 2022. Mixing up contrastive learning: Self-supervised representation learning for time series. Pattern Recognition Letters 155 (2022), 54–61.
Xinyu Yang, Zhenguo Zhang, and Rongyi Cui. 2022. TimeCLR: A self-supervised contrastive learning framework for univariate time series representation. Knowledge-based Systems 245 (2022), 108606.
Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. 2022. Self-supervised contrastive pre-training for time series via time-frequency consistency. In Proceedings of Neural Information Processing Systems (NeurIPS’22).
Ranak Roy Chowdhury, Xiyuan Zhang, Jingbo Shang, Rajesh K. Gupta, and Dezhi Hong. 2022. TARNet: Task-aware reconstruction for time-series transformer. In 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 14–18.
Wenrui Zhang, Ling Yang, Shijia Geng, and Shenda Hong. 2023. Self-supervised time series representation learning via cross reconstruction transformer. IEEE Transactions on Neural Networks and Learning Systems (2023), 1–10.
Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, and Germain Forestier. 2023. Finding foundation models for time series classification with a PreText task. arXiv preprint arXiv:2311.14534 (2023).
Terry T. Um, Franz M. J. Pfister, Daniel Pichler, Satoshi Endo, Muriel Lang, Sandra Hirche, Urban Fietzek, and Dana Kulić. 2017. Data augmentation of wearable sensor data for Parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International Conference on Multimodal Interaction. 216–220.
Khandakar M. Rashid and Joseph Louis. 2019. Window-warping: A time series data augmentation of IMU data for construction equipment activity identification. In Proceedings of the International Symposium on Automation and Robotics in Construction (ISARC’19), Vol. 36. IAARC Publications, 651–657.
Brian Kenji Iwana and Seiichi Uchida. 2021. Time series data augmentation for neural networks by time warping with a discriminative teacher. In 2020 25th International Conference on Pattern Recognition (ICPR’21). IEEE, 3558–3565.
Thai-Son Nguyen, Sebastian Stueker, Jan Niehues, and Alex Waibel. 2020. Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 7689–7693.
Bhavik Vachhani, Chitralekha Bhat, and Sunil Kumar Kopparapu. 2018. Data augmentation using healthy speech for dysarthric speech recognition. In Interspeech. 471–475.
Zhicheng Cui, Wenlin Chen, and Yixin Chen. 2016. Multi-Scale Convolutional Neural Networks for Time Series Classification. arxiv:cs.CV/1603.06995 (2016).
Arthur Le Guennec, Simon Malinowski, and Romain Tavenard. 2016. Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD on Advanced Analytics and Learning on Temporal Data.
Germain Forestier, François Petitjean, Hoang Anh Dau, Geoffrey I. Webb, and Eamonn Keogh. 2017. Generating synthetic time series to augment sparse datasets. In 2017 IEEE International Conference on Data Mining (ICDM’17). IEEE, 865–870.
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2018. Data Augmentation Using Synthetic Data for Time Series Classification with Deep Residual Networks. arxiv:cs.CV/1808.02455 (2018).
Tsegamlak Terefe, Maxime Devanne, Jonathan Weber, Dereje Hailemariam, and Germain Forestier. 2020. Time series averaging using multi-tasking autoencoder. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI’20). IEEE, 1065–1072.
Brian Kenji Iwana and Seiichi Uchida. 2021. An empirical survey of data augmentation for time series classification with neural networks. Plos One 16, 7 (2021), e0254841.
Gautier Pialla, Maxime Devanne, Jonathan Weber, Lhassane Idoumghar, and Germain Forestier. 2022. Data augmentation for time series classification with deep learning models. In International Workshop on Advanced Analytics and Learning on Temporal Data. Springer, 117–132.
Zijun Gao, Lingbo Li, and Tianhua Xu. 2023. Data augmentation for time-series classification: An extensive empirical study and comprehensive survey. arXiv preprint arXiv:2310.10060 (2023).
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2018. Transfer learning for time series classification. In 2018 IEEE International Conference on Big Data (Big Data’18). IEEE, 1367–1376.
Frédéric Li, Kimiaki Shirahama, Muhammad Adeel Nisar, Xinyu Huang, and Marcin Grzegorzek. 2020. Deep transfer learning for time series data based on sensor modality classification. Sensors 20, 15 (2020), 4271.
Yarden Rotem, Nathaniel Shimoni, Lior Rokach, and Bracha Shapira. 2022. Transfer learning for time series classification using synthetic data generation. In International Symposium on Cyber Security, Cryptology, and Machine Learning. Springer, 232–246.
Ayantha Senanayaka, Abdullah Al Mamun, Glenn Bond, Wenmeng Tian, Haifeng Wang, Sara Fuller, T. C. Falls, Shahram Rahimi, and Linkan Bian. 2022. Similarity-based multi-source transfer learning approach for time series classification. International Journal of Prognostics and Health Management 13, 2 (2022).
Kathan Kashiparekh, Jyoti Narwariya, Pankaj Malhotra, Lovekesh Vig, and Gautam Shroff. 2019. Convtimenet: A pre-trained deep convolutional neural network for time series classification. In 2019 International Joint Conference on Neural Networks (IJCNN’19). IEEE, 1–8.
D. Merlin Praveena, D. Angelin Sarah, and S. Thomas George. 2022. Deep learning techniques for EEG signal applications–A review. IETE Journal of Research 68, 4 (2022), 3030–3037. DOI:
Nur’atiah Zaini, Lee Woen Ean, Ali Najah Ahmed, and Marlinda Abdul Malek. 2022. A systematic literature review of deep learning neural network for time series air quality forecasting. Environmental Science and Pollution Research 29, 4 (Jan. 2022), 4958–4990.
Bo Zhang, Yi Rong, Ruihan Yong, Dongming Qin, Maozhen Li, Guojian Zou, and Jianguo Pan. 2022. Deep learning for air pollutant concentration prediction: A review. Atmospheric Environment 290 (Dec. 2022), 119347.
Gyungmin Toh and Junhong Park. 2020. Review of vibration-based structural health monitoring using deep learning. Applied Sciences 10, 5 (2020), 1680. DOI:
Nikhil M. Thoppil, V. Vasu, and C. S. P. Rao. 2021. Deep learning algorithms for machinery health prognostics using time-series data: A review. Journal of Vibration Engineering & Technologies 9, 6 (Sept.2021), 1123–1145. DOI:
Lei Ren, Zidi Jia, Yuanjun Laili, and Di Huang. 2023. Deep learning for time-series prediction in IIoT: Progress, challenges, and prospects. IEEE Transactions on Neural Networks and Learning Systems PP (2023), 1–20. DOI:
Yassine Himeur, Khalida Ghanem, Abdullah Alsalemi, Faycal Bensaali, and Abbes Amira. 2021. Artificial intelligence based anomaly detection of energy consumption in buildings: A review, current trends and new perspectives. Applied Energy 287 (2021), 116601. DOI:
Neha Gupta, Suneet K. Gupta, Rajesh K. Pathak, Vanita Jain, Parisa Rashidi, and Jasjit S. Suri. 2022. Human activity recognition in artificial intelligence framework: A narrative review. Artificial Intelligence Review 55, 6 (Aug.2022), 4755–4808. DOI:
E. Ramanujam, Thinagaran Perumal, and S. Padmavathi. 2021. Human activity recognition with smartphone and wearable sensors using deep learning techniques: A review. IEEE Sensors Journal 21, 12 (June2021), 13029–13040. DOI:
Jeffrey W. Lockhart, Tony Pulickal, and Gary M. Weiss. 2012. Applications of mobile activity recognition. In 2012 ACM Conference on Ubiquitous Computing (UbiComp’12). ACM Press, New York, NY, USA, 1054. DOI:
Emmanuel Munguia Tapia, Stephen S. Intille, and Kent Larson. 2004. Activity recognition in the home using simple and ubiquitous sensors. In Lecture Notes in Computer Science. Vol. 3001. Springer, Berlin, 158–175. DOI:
Yu Kong and Yun Fu. 2022. Human action recognition and prediction: A survey. International Journal of Computer Vision 130, 5 (May 2022), 1366–1401. arxiv:1806.11230
Francisco Ordóñez and Daniel Roggen. 2016. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 16, 1 (Jan.2016), 115. DOI:
Attila Reiss and Didier Stricker. 2012. Introducing a new benchmarked dataset for activity monitoring. In 16th International Symposium on Wearable Computers. 108–109. DOI:
Mi Zhang and Alexander A. Sawchuk. 2012. USC-HAD: A daily activity dataset for ubiquitous activity recognition using wearable sensors. In 2012 ACM Conference on Ubiquitous Computing (UbiComp’12). ACM Press, New York, NY, USA, 1036. DOI:
Daniel Roggen, Alberto Calatroni, Mirco Rossi, Thomas Holleczek, Kilian Förster, Gerhard Tröster, Paul Lukowicz, David Bannach, Gerald Pirkl, et al. 2010. Collecting complex activity datasets in highly rich networked sensor environments. In 7th International Conference on Networked Sensing Systems. IEEE, 233–240.
Timo Sztyler, Heiner Stuckenschmidt, and Wolfgang Petrich. 2017. Position-aware activity recognition with wearable devices. Pervasive and Mobile Computing 38 (July2017), 281–295. DOI:
Oscar D. Lara and Miguel A. Labrador. 2013. A survey on human activity recognition using wearable sensors. IEEE Communications Surveys & Tutorials 15, 3 (2013), 1192–1209. DOI:
Fuqiang Gu, Mu-Huan Chung, Mark Chignell, Shahrokh Valaee, Baoding Zhou, and Xue Liu. 2022. A survey on deep learning for human activity recognition. ACM Computing Surveys 54, 8 (Nov. 2022), 1–34.
Nils Y. Hammerla, Shane Halloran, and Thomas Ploetz. 2016. Deep, convolutional, and recurrent models for human activity recognition using wearables. In International Joint Conference on Artificial Intelligence (IJCAI’16). 1533–1540. arxiv:1604.08880
Ming Zeng, Le T. Nguyen, Bo Yu, Ole J. Mengshoel, Jiang Zhu, Pang Wu, and Joy Zhang. 2014. Convolutional neural networks for human activity recognition using mobile sensors. In 6th International Conference on Mobile Computing, Applications and Services. ICST, 718–737. DOI:
Wenchao Jiang and Zhaozheng Yin. 2015. Human activity recognition using wearable sensors by deep convolutional neural networks. In 23rd ACM International Conference on Multimedia. ACM, New York, NY, USA, 1307–1310. DOI:
Jian Bo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. 2015. Deep convolutional neural networks on multichannel time series for human activity recognition. In International Joint Conference on Artificial Intelligence (IJCAI’15). 3995–4001.
Charissa Ann Ronao and Sung-Bae Cho. 2016. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Systems with Applications 59 (Oct.2016), 235–244. DOI:
Yu Guan and Thomas Plötz. 2017. Ensembles of deep LSTM learners for activity recognition using wearables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 2 (June 2017), 1–28. arxiv:1703.09370
Song-Mi Lee, Sang Min Yoon, and Heeryon Cho. 2017. Human activity recognition from accelerometer data using Convolutional Neural Network. In 2017 IEEE International Conference on Big Data and Smart Computing (BigComp’17), Vol. 83. IEEE, 131–134. DOI:
Fernando Moya Rueda, René Grzeszick, Gernot Fink, Sascha Feldhorst, and Michael ten Hompel. 2018. Convolutional neural networks for human activity recognition using body-worn sensors. Informatics 5, 2 (May2018), 26. DOI:
Rui Yao, Guosheng Lin, Qinfeng Shi, and Damith C. Ranasinghe. 2018. Efficient dense labelling of human activity sequences from wearables using fully convolutional networks. Pattern Recognition 78 (June 2018), 252–266. arxiv:1702.06212
Ming Zeng, Haoxiang Gao, Tong Yu, Ole J. Mengshoel, Helge Langseth, Ian Lane, and Xiaobing Liu. 2018. Understanding and improving recurrent networks for human activity recognition by continuous attention. In ACM International Symposium on Wearable Computers. New York, NY, USA, 56–63. arxiv:1810.04038
Cheng Xu, Duo Chai, Jie He, Xiaotong Zhang, and Shihong Duan. 2019. InnoHAR: A deep neural network for complex human activity recognition. IEEE Access 7 (2019), 9893–9902. DOI:
Haoxi Zhang, Zhiwen Xiao, Juan Wang, Fei Li, and Edward Szczerbicki. 2020. A novel IoT-perceptive human activity recognition (HAR) approach using multihead convolutional attention. IEEE Internet of Things Journal 7, 2 (Feb.2020), 1072–1080. DOI:
Sravan Kumar Challa, Akhilesh Kumar, and Vijay Bhaskar Semwal. 2021. A multibranch CNN-BiLSTM model for human activity recognition using wearable sensor data. The Visual Computer 38 (Aug. 2021), 4095–4109.
Sakorn Mekruksavanich and Anuchit Jitpattanakul. 2021. LSTM networks using smartphone data for sensor-based human activity recognition in smart homes. Sensors 21, 5 (Feb.2021), 1636. DOI:
Sakorn Mekruksavanich and Anuchit Jitpattanakul. 2021. Biometric user identification based on human activity recognition using wearable sensors: An experiment using deep learning models. Electronics 10, 3 (Jan.2021), 308. DOI:
Satya P. Singh, Madan Kumar Sharma, Aime Lay-Ekuakille, Deepak Gangwar, and Sukrit Gupta. 2021. Deep ConvLSTM with self-attention for human activity decoding using wearable sensors. IEEE Sensors Journal 21, 6 (March 2021), 8575–8582. arxiv:2005.00698
Xing Wang, Lei Zhang, Wenbo Huang, Shuoyuan Wang, Hao Wu, Jun He, and Aiguo Song. 2022. Deep convolutional networks with tunable speed–accuracy tradeoff for human activity recognition using wearables. IEEE Transactions on Instrumentation and Measurement 71 (2022), 1–12. DOI:
Shige Xu, Lei Zhang, Wenbo Huang, Hao Wu, and Aiguo Song. 2022. Deformable convolutional networks for multimodal human activity recognition using wearable sensors. IEEE Transactions on Instrumentation and Measurement 71 (2022), 1–14. DOI:
Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 2017. Deformable convolutional networks. In 2017 IEEE International Conference on Computer Vision (ICCV’17). 764–773. arxiv:1703.06211
Michael A. Wulder, Joanne C. White, Samuel N. Goward, Jeffrey G. Masek, James R. Irons, Martin Herold, Warren B. Cohen, Thomas R. Loveland, and Curtis E. Woodcock. 2008. Landsat continuity: Issues and opportunities for land cover monitoring. Remote Sensing of Environment 112, 3 (March2008), 955–969. DOI:
William Emery and Adriano Camps. 2017. Basic electromagnetic concepts and applications to optical sensors. In Introduction to Satellite Remote Sensing, William Emery and Adriano Camps (Eds.). Elsevier, Chapter 2, 43–83. DOI:
Noel Gorelick, Matt Hancher, Mike Dixon, Simon Ilyushchenko, David Thau, and Rebecca Moore. 2017. Google earth engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment 202 (Dec.2017), 18–27. DOI:
Grégory Giuliani, Bruno Chatenoux, Andrea De Bono, Denisa Rodila, Jean-Philippe Richard, Karin Allenbach, Hy Dao, and Pascal Peduzzi. 2017. Building an Earth observations data cube: Lessons learned from the Swiss data cube (SDC) on generating analysis ready data (ARD). Big Earth Data 1, 1–2 (Dec.2017), 100–117. DOI:
Adam Lewis, Simon Oliver, Leo Lymburner, Ben Evans, Lesley Wyborn, Norman Mueller, Gregory Raevski, Jeremy Hooke, Rob Woodcock, Joshua Sixsmith, et al. 2017. The Australian geoscience data cube: Foundations and lessons learned. Remote Sensing of Environment 202 (2017), 276–292.
Dino Ienco, Yawogan Jean Eudes Gbodjo, Roberto Interdonato, and Raffaele Gaetano. 2020. Attentive weakly supervised land cover mapping for object-based satellite image time series data with spatial interpretation. arXiv (2020), 1–12. arxiv:2004.14672
Vivien Sainte Fare Garnot, Loic Landrieu, Sebastien Giordano, and Nesrine Chehata. 2020. Satellite image time series classification with pixel-set encoders and temporal self-attention. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 12322–12331. DOI:
Anurag Kulshrestha, Ling Chang, and Alfred Stein. 2022. Use of LSTM for sinkhole-related anomaly detection and classification of InSAR deformation time series. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 15 (2022), 4559–4570. DOI:
Yifang Ban, Puzhao Zhang, Andrea Nascetti, Alexandre R. Bevington, and Michael A. Wulder. 2020. Near real-time wildfire progression monitoring with Sentinel-1 SAR time series and deep learning. Scientific Reports 10, 1 (Dec.2020), 1322. DOI:
C. Rambour, N. Audebert, E. Koeniguer, B. Le Saux, M. Crucianu, and M. Datcu. 2020. Flood detection in time series of optical and SAR images. International Archives of Photogrammetry, Remote Sensing, & Spatial Information Sciences XLIII-B2-2, B2 (Aug.2020), 1343–1346. DOI:
G. Kamdem De Teyou, Y. Tarabalka, I. Manighetti, R. Almar, and S. Tripodi. 2020. Deep neural networks for automatic extraction of features in time series optical satellite images. International Archives of Photogrammetry, Remote Sensing, & Spatial Information Sciences 43 (2020), 1529–1535.
Bruno Menini Matosak, Leila Maria Garcia Fonseca, Evandro Carrijo Taquary, Raian Vargas Maretto, Hugo Do Nascimento Bendini, and Marcos Adami. 2022. Mapping deforestation in Cerrado based on hybrid deep learning architecture and medium spatial resolution satellite time series. Remote Sensing 14, 1 (2022), 1–22. DOI:
Pia Labenski, Michael Ewald, Sebastian Schmidtlein, and Fabian Ewald Fassnacht. 2022. Classifying surface fuel types based on forest stand photographs and satellite time series using deep learning. International Journal of Applied Earth Observation and Geoinformation 109 (May2022), 102799. DOI:
Krishna Rao, A. Park Williams, Jacqueline Fortin Flefil, and Alexandra G. Konings. 2020. SAR-enhanced mapping of live fuel moisture content. Remote Sensing of Environment 245 (2020), 111797. DOI:
Liujun Zhu, Geoffrey I. Webb, Marta Yebra, Gianluca Scortechini, Lynn Miller, and François Petitjean. 2021. Live fuel moisture content estimation from MODIS: A deep learning approach. ISPRS Journal of Photogrammetry and Remote Sensing 179 (Sept.2021), 81–91. DOI:
Lynn Miller, Liujun Zhu, Marta Yebra, Christoph Rüdiger, and Geoffrey I Webb. 2022. Multi-modal temporal CNNs for live fuel moisture content estimation. Environmental Modelling & Software 156 (Oct.2022), 105467. DOI:
Jiangjian Xie, Tao Qi, Wanjun Hu, Huaguo Huang, Beibei Chen, and Junguo Zhang. 2022. Retrieval of live fuel moisture content based on multi-source remote sensing data and ensemble deep learning model. Remote Sensing 14, 17 (Sept.2022), 4378. DOI:
Jie Sun, Zulong Lai, Liping Di, Ziheng Sun, Jianbin Tao, and Yonglin Shen. 2020. Multilevel deep learning network for county-level corn yield estimation in the U.S. Corn Belt. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13 (2020), 5048–5060. DOI:
Zhengtao Li, Gang Zhou, and Qiong Song. 2020. A temporal group attention approach for multitemporal multisensor crop classification. Infrared Physics and Technology 105 (2020), 103152. DOI:
Valentin Barriere and Martin Claverie. 2022. Multimodal crop type classification fusing multi-spectral satellite time series with farmers crop rotations and local crop distribution. arXiv preprint arXiv:2208.10838 (2022).
Vivien Sainte Fare Garnot and Loic Landrieu. 2020. Lightweight temporal self-attention for classifying satellite images time series. In Lecture Notes in Computer Science. Vol. 12588 LNAI. Springer International Publishing, 171–181. DOI:
Stella Ofori-Ampofo, Charlotte Pelletier, and Stefan Lang. 2021. Crop type mapping from optical and radar time series using attention-based deep learning. Remote Sensing 13, 22 (Nov.2021), 4668. DOI:
Yuan Yuan and Lei Lin. 2021. Self-Supervised pretraining of transformers for satellite image time series classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021), 474–487. DOI:
Nicola Di Mauro, Antonio Vergari, Teresa Maria Altomare Basile, Fabrizio G. Ventola, and Floriana Esposito. 2017. End-to-end learning of deep spatio-temporal representations for satellite image time series classification. In DC@PKDD/ECML.
Nataliia Kussul, Mykola Lavreniuk, Sergii Skakun, and Andrii Shelestov. 2017. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geoscience and Remote Sensing Letters 14, 5 (May2017), 778–782. DOI:
Charlotte Pelletier, Geoffrey Webb, and François Petitjean. 2019. Temporal convolutional neural network for the classification of satellite image time series. Remote Sensing 11, 5 (March2019), 523. DOI:
Peng Dou, Huanfeng Shen, Zhiwei Li, and Xiaobin Guan. 2021. Time series remote sensing image classification framework using combination of deep learning and multiple classifiers system. International Journal of Applied Earth Observation and Geoinformation 103 (2021), 102477. DOI:
Dino Ienco, Roberto Interdonato, Raffaele Gaetano, and Dinh Ho Tong Minh. 2019. Combining Sentinel-1 and Sentinel-2 satellite image time series for land cover mapping via a multi-source deep learning architecture. ISPRS Journal of Photogrammetry and Remote Sensing 158 (2019), 11–22. DOI:
Roberto Interdonato, Dino Ienco, Raffaele Gaetano, and Kenji Ose. 2019. DuPLO: A DUal view Point deep Learning architecture for time series classificatiOn. ISPRS Journal of Photogrammetry and Remote Sensing 149 (March2019), 91–104. DOI:
Marc Rußwurm and Marco Körner. 2018. Multi-temporal land cover classification with sequential recurrent encoders. ISPRS International Journal of Geo-Information 7, 4 (March2018), 129. DOI:
Andrei Stoian, Vincent Poulain, Jordi Inglada, Victor Poughon, and Dawa Derksen. 2019. Land cover maps production with high resolution satellite image time series and convolutional neural networks: Adaptations and limits for operational systems. Remote Sensing 11, 17 (2019), 1–26. DOI:
Dino Ienco, Raffaele Gaetano, Claire Dupaquier, and Pierre Maurel. 2017. Land cover classification via multitemporal spatial data by deep recurrent neural networks. IEEE Geoscience and Remote Sensing Letters 14, 10 (Oct.2017), 1685–1689. DOI:
Dino Ienco, Raffaele Gaetano, Roberto Interdonato, Kenji Ose, and Dinh Ho Tong Minh. 2019. Combining Sentinel-1 and Sentinel-2 time series via RNN for object-based land cover classification. In 2019 IEEE International Geoscience and Remote Sensing Symposium (IGARSS’19). IEEE, 4881–4884.
Yuan Yuan, Lei Lin, Qingshan Liu, Renlong Hang, and Zeng-Guang Zhou. 2022. SITS-Former: A pre-trained spatio-spectral-temporal representation model for Sentinel-2 time series classification. International Journal of Applied Earth Observation and Geoinformation 106 (Feb.2022), 102651. DOI:
Marc Rußwurm and Marco Körner. 2020. Self-attention for raw optical satellite time series classification. ISPRS Journal of Photogrammetry and Remote Sensing 169 (2020), 421–435. DOI:
Devis Tuia, Claudio Persello, and Lorenzo Bruzzone. 2016. Domain adaptation for the classification of remote sensing data: An overview of recent advances. IEEE Geoscience and Remote Sensing Magazine 4, 2 (2016), 41–57. DOI:
V. Sainte Fare Garnot, Loic Landrieu, Sebastien Giordano, and Nesrine Chehata. 2019. Time-space tradeoff in deep learning models for crop classification on satellite multi-spectral image time series. In 2019 IEEE International Geoscience and Remote Sensing Symposium (IGARSS’19). IEEE, 6247–6250.
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2019. Deep neural network ensembles for time series classification. In 2019 International Joint Conference on Neural Networks (IJCNN’19), Vol. 2019. IEEE, 1–6. DOI:
Cristina Gómez, Joanne C. White, and Michael A. Wulder. 2016. Optical remotely sensed time series data for land cover classification: A review. ISPRS Journal of Photogrammetry and Remote Sensing 116 (2016), 55–72. DOI:
Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer. 2017. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5, 4 (2017), 8–36. DOI:
Lei Ma, Yu Liu, Xueliang Zhang, Yuanxin Ye, Gaofei Yin, and Brian Alan Johnson. 2019. Deep learning in remote sensing applications: A meta-analysis and review. ISPRS Journal of Photogrammetry and Remote Sensing 152 (June2019), 166–177. DOI:
Michel E. D. Chaves, Michelle C. A. Picoli, and Ieda D. Sanches. 2020. Recent applications of Landsat 8/OLI and Sentinel-2/MSI for land use and land cover mapping: A systematic review. Remote Sensing 12, 18 (Sept. 2020), 3062.
Waytehad Rose Moskolaï, Wahabou Abdou, Albert Dipanda, and Kolyang. 2021. Application of deep learning architectures for satellite image time series prediction: A review. Remote Sensing 13, 23 (Nov.2021), 4822. DOI:
Jason Lines and Anthony Bagnall. 2015. Time series classification with ensembles of elastic distance measures. Data Mining and Knowledge Discovery 29, 3 (2015), 565–592.
Chang Wei Tan, François Petitjean, and Geoffrey I. Webb. 2020. FastEE: Fast ensembles of elastic distances for time series classification. Data Mining and Knowledge Discovery 34, 1 (2020), 231–272.
Matthieu Herrmann and Geoffrey I. Webb. 2021. Amercing: An intuitive, elegant and effective constraint for dynamic time warping. arXiv preprint arXiv:2111.13314 (2021).
Anthony Bagnall, Michael Flynn, James Large, Jason Lines, and Matthew Middlehurst. 2020. On the usage and performance of the hierarchical vote collective of transformation-based ensembles version 1.0 (HIVE-COTE v1.0). In International Workshop on Advanced Analytics and Learning on Temporal Data. 3–18.
Matthew Middlehurst, James Large, Michael Flynn, Jason Lines, Aaron Bostrom, and Anthony Bagnall. 2021. HIVE-COTE 2.0: A new meta ensemble for time series classification. Machine Learning 110, 11 (2021), 3211–3243.
Anthony Bagnall, Jason Lines, Jon Hills, and Aaron Bostrom. 2015. Time-series classification with COTE: The collective of transformation-based ensembles. IEEE Transactions on Knowledge and Data Engineering 27, 9 (2015), 2522–2535.
Jason Lines, Sarah Taylor, and Anthony Bagnall. 2018. Time series classification with HIVE-COTE: The hierarchical vote collective of transformation-based ensembles. ACM Trans. Knowl. Discov. Data 12, 5, Article 52 (October 2018), 35 pages.
Jason Lines, Sarah Taylor, and Anthony Bagnall. 2016. HIVE-COTE: The hierarchical vote collective of transformation-based ensembles for time series classification. In 2016 IEEE 16th International Conference on Data Mining (ICDM’16). IEEE, 1041–1046.
Rohit J. Kate. 2016. Using dynamic time warping distances as features for improved time series classification. Data Mining and Knowledge Discovery 30, 2 (2016), 283–312.
Aaron Bostrom and Anthony Bagnall. 2015. Binary shapelet transform for multiclass time series classification. In International Conference on Big Data Analytics and Knowledge Discovery. Springer, 257–269.
Patrick Schäfer. 2015. The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery 29, 6 (2015), 1505–1530.
Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. 2014. Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery 28, 4 (2014), 851–881.
Houtao Deng, George Runger, Eugene Tuv, and Vladimir Martyanov. 2013. A time series forest for classification and feature extraction. Information Sciences 239 (2013), 142–153.
Mustafa Gokce Baydogan, George Runger, and Eugene Tuv. 2013. A bag-of-features framework to classify time series. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11 (2013), 2796–2802.
Angus Dempster, Daniel F. Schmidt, and Geoffrey I. Webb. 2021. Minirocket: A very fast (almost) deterministic transform for time series classification. In 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 248–257.
Chang Wei Tan, Angus Dempster, Christoph Bergmeir, and Geoffrey I. Webb. 2022. MultiRocket: Multiple pooling operators and transformations for fast and effective time series classification. Data Mining and Knowledge Discovery 36 (June 2022), 1623–1646. arxiv:2102.00457
Angus Dempster, Daniel F. Schmidt, and Geoffrey I. Webb. 2023. Hydra: Competing convolutional kernels for fast and accurate time series classification. Data Mining and Knowledge Discovery 37 (2023), 1–27.
Benjamin Lucas, Ahmed Shifaz, Charlotte Pelletier, Lachlan O’Neill, Nayyar Zaidi, Bart Goethals, François Petitjean, and Geoffrey I. Webb. 2019. Proximity forest: An effective and scalable distance-based classifier for time series. Data Mining and Knowledge Discovery 33, 3 (2019), 607–635.
Matthieu Herrmann, Chang Wei Tan, Mahsa Salehi, and Geoffrey I. Webb. 2023. Proximity forest 2.0: A new effective and scalable similarity-based classifier for time series. arXiv preprint arXiv:2304.05800 (2023).
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
Salah Hihi and Yoshua Bengio. 1995. Hierarchical recurrent neural networks for long-term dependencies. Advances in Neural Information Processing Systems 8 (1995), 493–499.
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026 (2013).
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).