Deep Learning for Time Series Classification and Extrinsic Regression: A Current Survey

Authors:

Navid Mohammadi Foumani,

Mahsa Salehi

ACM Computing Surveys, Volume 56, Issue 9

Article No.: 217, Pages 1 - 45

https://doi.org/10.1145/3649448

Published: 25 April 2024

Abstract

Time Series Classification and Extrinsic Regression are important and challenging machine learning tasks. Deep learning has revolutionized natural language processing and computer vision and holds great promise in other fields such as time series analysis where the relevant features must often be abstracted from the raw data but are not known a priori. This article surveys the current state of the art in the fast-moving field of deep learning for time series classification and extrinsic regression. We review different network architectures and training methods used for these tasks and discuss the challenges and opportunities when applying deep learning to time series data. We also summarize two critical applications of time series classification and extrinsic regression, human activity recognition and satellite earth observation.

1 Introduction

Time series analysis has been identified as one of the 10 most challenging research issues in the field of data mining in the 21st century [1]. Time series classification (TSC) is a key time series analysis task [2]. TSC builds a machine learning model to predict categorical class labels for data consisting of ordered sets of real-valued attributes. The many applications of time series analysis include human activity recognition [3, 4, 5], diagnosis based on electronic health records [6, 7], and systems monitoring problems [8]. The wide variety of dataset types in the University of California, Riverside (UCR) [9] and University of East Anglia (UEA) [8] benchmark archives further illustrates the breadth of TSC applications. Time series extrinsic regression (TSER) [10] is the counterpart of TSC for which the output is numeric rather than categorical. Note that TSER is not a forecasting method: rather than predicting future values of the series, it models the relationship between a time series and an extrinsic variable. TSER is an emerging field with great potential to be used in a wide range of applications.

Deep learning has been very successful, especially in computer vision and natural language processing, and many modern applications integrate it. Deep learning can autonomously learn informative features from raw data, eliminating the need for manual feature engineering. Consequently, there has been much interest in developing deep TSC and TSER models due to their ability to learn relevant latent feature representations. It is worth noting, however, that the majority of TSC and TSER research has focused on non-deep-learning approaches. A recent benchmark [11] shows that a deep learning method (InceptionTime [12]) is competitive with, but does not outperform, the state of the art on the benchmarking archives. One reason is that the popular UCR and UEA benchmarking archives were not designed for deep learning models. In particular, they are relatively small, while deep learning often excels when data quantities are large. Deep learning also benefits from strong compatibility with current hardware, particularly GPUs, leading to fast and efficient execution. Its scalability further allows it to handle growing data volumes and computational complexity, which makes it well suited to processing large datasets. Indeed, ConvTran [13], a recent deep architecture for TSC, outperforms one of the fastest conventional models, ROCKET [14], in terms of both speed and accuracy when there are more than 10k training samples.

A highly influential review paper on deep-learning-based TSC [15] was published in 2019. However, the field of research is very fast moving, and that prior survey does not cover the current state of the art. For example, it does not include InceptionTime [12], a system that consistently outperforms ResNet [16], the best performing system from the prior survey. Nor does it cover attention models, which have received huge interest in recent years and have shown excellent capacity to model long-range dependencies in sequential data and are well suited for time series modeling [17]. Many attention variants have been proposed to address particular challenges in time series modeling and have been successfully applied to TSC [13, 18, 19]. Moreover, the previous survey does not include self-supervised learning, which is emerging as a new paradigm [20]. Self-supervised learning induces supervision by designing pretext tasks instead of relying on predefined prior knowledge and has shown very promising results, especially in datasets with a low label regime [21, 22, 23, 24].

In light of the emergence of attention mechanisms, self-supervised learning, and various new network configurations for TSC, a systematic and comprehensive survey on deep learning in TSC would greatly benefit the time series community. This article aims to fill that gap by summarizing recent developments in deep-learning-based time series analytics, specifically TSC and TSER. Following definitions and a brief introduction to the time series classification and extrinsic regression tasks, we propose a new taxonomy based on various methodological perspectives. Diverse architectures, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), Graph Neural Networks (GNNs), and attention-based models, are discussed, along with refinements made to improve performance. Additionally, various types of self-supervised learning pretexts, such as contrastive learning and self-prediction, are explored. We also conduct a review of useful data augmentation and transfer learning strategies for time series data. Furthermore, we provide a summary of two key applications of TSC and TSER, namely Human Activity Recognition and Earth Observation.

2 Background and Definitions

This section provides the definitions and background needed to understand the training of deep neural networks (DNNs) for TSC and TSER tasks. We begin by defining key terms and concepts, such as time series data and time series supervised learning. We then present our proposed taxonomy of the different deep learning methods that have been used for TSC and TSER tasks.

2.1 Time Series

Time series data are sequences of data points indexed by time.

Definition 2.1.

A time series \(X\) is an ordered collection of \(T\) pairs of measurements and timestamps,

\(X=\lbrace (x_1,t_1),(x_2,t_2), \ldots , (x_T,t_T)\rbrace\), where \(x_i\in \mathbb {R}^D\) and \(t_1\) to \(t_T\) are the timestamps for some measurements \(x_1\) to \(x_T\).

Each \(x_i\) is a \(D\)-dimensional vector of values, one for each feature captured in the series. When \(D=1,\) the series is called univariate. When \(D\gt 1,\) the series is called multivariate.
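As an illustration of Definition 2.1, the following minimal sketch stores a series as a list of (measurement, timestamp) pairs. The type alias, the helper names, and the sample values are hypothetical and chosen only for this example.

```python
from typing import List, Tuple

# One observation (x_i, t_i): x_i is a D-dimensional vector, t_i a timestamp.
Observation = Tuple[List[float], float]


def dimensionality(series: List[Observation]) -> int:
    """Return D, checking that every x_i has the same number of features."""
    dims = {len(x) for x, _ in series}
    if len(dims) != 1:
        raise ValueError("all observations must share the same dimension D")
    return dims.pop()


def is_univariate(series: List[Observation]) -> bool:
    """A series is univariate when D = 1 and multivariate when D > 1."""
    return dimensionality(series) == 1


# Hypothetical data: a heart-rate trace (D = 1) and a 3-axis accelerometer (D = 3).
heart_rate = [([61.0], 0.0), ([63.0], 1.0), ([62.0], 2.0)]
accelerometer = [([0.1, 9.8, 0.0], 0.00), ([0.2, 9.7, 0.1], 0.02)]
```

In practice, equally sampled series are usually stored as a \(T \times D\) array and the timestamps are left implicit.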

2.2 Time Series Supervised Learning Tasks

This article focuses on two time series learning tasks: time series extrinsic regression and time series classification. Classification and regression are both supervised learning tasks that learn the relationship between a target variable and a set of time series. We consider learning from a dataset \(D=\left\lbrace (X_1,Y_1), (X_2,Y_2), \ldots ,(X_N,Y_N)\right\rbrace\) of \(N\) time series where \(Y_i\) denotes the target variable for each \(X_i\). It is important to note that for ease of exposition, we assume in our discussion that the series are of the same length, but most methods extend trivially to the case of unequal-length series. The main difference between TSER and TSC is that TSC predicts a categorical value for a time series from a set of finite categories, while TSER predicts a continuous value for a variable external to the input time series. Typically \(Y_i\) is a one-hot encoded vector for TSC or a numeric value for TSER.

In the context of deep learning, a supervised learning model is a neural network that executes the following functions to map the input time series to a target variable:

\begin{equation} f_L(\theta _L, X) = f_{L-1}(\theta _{L-1}, f_{L-2}(\theta _{L-2}, \ldots , f_1(\theta _1, X))), \tag{1} \end{equation}

where \(f_i\) represents the non-linear function and \(\theta _i\) denotes the parameters at layer \(i\). For TSC, the neural network model is trained to map a time series dataset \(D\) to a set of class labels \(Y\) with \(C\) class labels. After training, the neural network outputs a vector of \(C\) values that estimates the probability of a series \(X\) belonging to each class. This is typically achieved using the softmax activation function in the final layer of the neural network. The softmax function produces one probability per class such that the probabilities always sum to 1. The cross-entropy loss is commonly used to train classification networks with softmax outputs.
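The softmax output and cross-entropy loss described above can be sketched in a few lines of standard-library Python. The logits and the one-hot label are hypothetical values for a problem with \(C=3\) classes; a real model would obtain the logits from its final layer.

```python
import math


def softmax(logits):
    """Turn C raw network outputs into C probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    eps = 1e-12  # guards against log(0)
    return -sum(y * math.log(p + eps) for p, y in zip(probs, one_hot))


logits = [2.0, 0.5, -1.0]  # hypothetical final-layer outputs for C = 3
label = [1, 0, 0]          # one-hot encoded target Y_i
probs = softmax(logits)
loss = cross_entropy(probs, label)
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability shrinks.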

On the other hand, TSER trains the neural network model to map a time series dataset \(D\) to a set of numeric values \(Y\). Instead of outputting probabilities, a regression neural network outputs a numerical value for the time series. It is typically used with a linear activation function in the final layer of the neural network. However, any non-linear functions with a single-value output such as sigmoid or ReLU can also be used. A regression neural network typically trains using the mean square error or mean absolute error loss function. However, depending on the distribution of the target variable and the choice of final activation functions, other loss functions can be used.
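For reference, the two regression losses named above can be written as follows. This is a minimal sketch with hypothetical target and prediction values; it is not tied to any particular TSER model.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared residuals; penalizes large errors heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute residuals; less sensitive to outliers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical extrinsic targets and network outputs for four series.
y_true = [1.0, 2.0, 3.0, 10.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```

The single large residual on the last series dominates the squared error far more than the absolute error, which is one reason the choice of loss can depend on the distribution of the target variable.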

2.3 TSC and TSER

TSC is a fast-growing field, with hundreds of papers being published every year [8, 9, 15, 25, 26]. The majority of works in TSC are non–deep learning based. In this survey, we focus on deep learning approaches and refer interested readers to Appendix A and benchmark papers [11, 25, 26] for more details on non–deep learning approaches. Most deep learning approaches to TSC have real-valued outputs that are mapped to a class label. TSER [10, 27] is a less widely studied task in which the predicted values are numeric, rather than categorical. While the majority of the architectures covered in this survey were designed for TSC, it is important to note that it is trivial to adapt most of them for TSER.

Deep-learning-based TSC methods can be classified into two main types: generative and discriminative [28]. In the TSC community, generative methods are often considered model based [25], aiming to understand and model the joint probability distribution of input series \(X\) and output labels \(Y\), denoted as \(p(X, Y)\). On the other hand, discriminative models focus on modeling the conditional probability of output labels \(Y\) given input series \(X\), expressed as \(p(Y | X)\).

Generative models, such as the Stacked Denoising Auto-encoders (SDAEs), have been proposed by Bengio et al. [29] to identify the salient structure of input data distributions, and Hu et al. [30] used the same model for the pre-training phase before training a classifier for time series tasks. A universal neural network encoder has been developed to convert variable-length time series to a fixed-length representation [31]. Also, a Deep Belief Network (DBN) combined with a transfer learning method was used in an unsupervised manner to model the latent features of time series [32]. An Echo State Network (ESN) has been used to learn the appropriate time series representation by reconstructing the original raw time series prior to training the classifier [33]. Generative Adversarial Networks (GANs) are one of the popular generative models that generate new examples by learning to discriminate between real and synthetic examples. Various GANs have been developed for time series and have been reviewed in a recent survey [34]. Often, implementing generative methods is more complex due to an additional step of training. Furthermore, generative methods are typically less efficient than discriminative methods, which directly map raw time series to class probability distributions. Due to these barriers, researchers tend to focus on discriminative methods. Therefore, this survey mainly focuses on the end-to-end discriminative approaches.

2.4 Taxonomy of Deep Learning in TSC and TSER

To provide an organized summary of the existing deep learning models for TSC, we propose a taxonomy that categorizes these models based on deep learning methods and application domains. This taxonomy is illustrated in Figure 1. In Section 3, we review various network architectures used for TSC, including MLPs, CNNs, RNNs, GNNs, and attention-based models. We also discuss refinements made to these models to improve their performance on time series tasks. Additionally, various types of self-supervised learning pretexts, such as contrastive learning and self-prediction, are explored in Section 4. We also conduct a review of useful data augmentation and transfer learning strategies for time series data in Sections 5 and 6. In addition to methods, we summarize key applications of TSC and TSER in Section 7 of this article. These applications include human activity recognition and satellite earth observation, which are important and challenging tasks that can benefit from the use of deep learning models. Overall, our proposed taxonomy and the discussions in these sections provide a comprehensive overview of the current state of the art in deep learning for time series analysis and outline future research directions.

Fig. 1.

3 Supervised Models

This section reviews the deep-learning-based models for TSC and discusses their architectures by highlighting their strengths as well as limitations. More details on deep model architectures and their adaptations to time series data are available in Appendix B.

3.1 Multi-Layer Perceptron (MLP)

The most straightforward neural network architecture is a fully connected network, also called an MLP. The number of layers and neurons are defined as hyperparameters in MLP models. However, studies such as auto-adaptive MLP [35] have attempted to determine the number of neurons in the hidden layers automatically, based on the nature of the training time series data. This allows the network to adapt to the training data’s characteristics and optimize its performance on the task at hand.

One of the main limitations of using MLPs for time series data is that they are not well suited to capturing the temporal dependencies in this type of data. MLPs are feedforward networks that process input data in a fixed and predetermined order without considering the temporal relationships between the input values. Various studies used MLPs alongside other feature extractors like Dynamic Time Warping (DTW) to address this problem [36, 37]. DTW-NN is a feedforward neural network that exploits DTW’s elastic matching ability to dynamically align a layer’s inputs to the weights instead of using a fixed and predetermined input-to-weight mapping. This weight alignment replaces the standard dot product within a neuron with DTW. In this way, the DTW-NN is able to tackle difficulties with time series recognition, such as temporal distortions and variable pattern length within a feedforward architecture [37]. Similarly, Symbolic Aggregate Approximation (SAX) is used to transform time series into a symbolic representation and produce sequences of words based on the symbolic representation [38]. The symbolic time-series-based words are later used as input for training a two-layer MLP for classification.
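To make the elastic matching idea concrete, the sketch below computes the classic DTW distance between two univariate series with dynamic programming. It illustrates DTW itself, not the DTW-NN layer of [37]; the squared pointwise cost and the example series are assumptions made for this illustration.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two univariate series."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m] ** 0.5


pattern = [0.0, 1.0, 2.0, 1.0, 0.0]
delayed = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, starts one step later
shifted = [0.0, 0.0, 1.0, 2.0, 1.0]       # same length, shifted by one step
```

Here `dtw_distance(pattern, delayed)` is zero because the warping path absorbs the delay, whereas a fixed input-to-weight mapping would treat the two series as different. The series may also differ in length, which a plain dot product cannot handle.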

Although the models mentioned above attempt to address the inability of MLPs to capture temporal dependencies, they still have limitations in capturing time-invariant features [16]. Additionally, MLP models cannot process input data in a hierarchical or multi-scale manner. Time series data often exhibit patterns and structures at different scales, such as long-term trends and short-term fluctuations. MLP models fail to capture these patterns, as they can only process input data in a single, fixed-length representation. In addition, MLPs may encounter difficulties when confronted with irregularly sampled time series data, where observations are not uniformly recorded in time. Other deep learning models, such as RNNs, CNNs, and transformers, are specifically designed to capture the temporal dependencies and patterns in time series data and are therefore better suited to it.

3.2 CNN-based Models

Several improvements have been made to CNNs since the success of AlexNet in 2012 [39], such as using deeper networks, applying smaller and more efficient convolutional filters, adding pooling layers to reduce the dimensionality of the feature maps, and utilizing batch normalization to improve the stability of training [40]. CNNs have been demonstrated to be very successful in many domains, such as computer vision, speech recognition, and natural language processing [40, 41, 42]. As a result of this success, researchers have also started adopting CNN architectures for TSC. See Table 1 for a list of the CNN models reviewed in this article.

Model	Year	Baseline Architecture	Other Features
Adapted
MC-DCNN [43]	2014	2-Stage Conv	Independent convolutions per channel
MC-CNN [44]	2015	3-Stage Conv	1D convolutions on all channels
Zhao et al. [45]	2015	2-Stage Conv	Similar architecture to MC-CNN
FCN [16]	2017	FCN	Using GAP instead of FC layer
ResNet [16]	2017	ResNet 9	Using 3-residual block
Res-CNN [49]	2019	ResNet+FCN	Using 1-residual block + FCN
DCNNs [51]	2019	4-Stage Conv	Using dilated convolutions
Disjoint-CNN [52]	2021	4-Stage Conv	Disjoint temporal and spatial convolution
Series To Image
Wang and Oates [54]	2015	Tiled CNN	GAF, MTF
Hatami et al. [55]	2017	2-Stage Conv	Recurrence plots
Karimi-Bidhendi et al. [56]	2018	Inception V3	GADF
Zhao and Cai [57]	2019	ResNet18, ShuffleNet V2	Bilinear interpolation
RPMCNN [61]	2019	VGGNet, 2-Stage Conv	Relative position matrix
Yang et al. [58]	2019	VGGNet	GASF, GADF, MTF
Multi-Scale Operation
MCNN [62]	2016	2-Stage Conv	Identity mapping, smoothing, down-sampling
t-LeNet [63]	2016	2-Stage Conv	Squeeze and dilation
MVCNN [64]	2019	4-Stage Conv	Inception V1 based
Brunel et al. [65]	2019	Inception V1
InceptionTime [12]	2019	Inception V4	Ensemble
EEG-inception [66]	2021	InceptionTime
Inception-FCN [67]	2021	InceptionTime + FCN
KDCTime [68]	2022	InceptionTime	Knowledge distillation, label smoothing
LITE [69]	2023	InceptionTime	Multiplexing, dilated, and custom filters

Table 1. Summary of CNN Models for Time Series Classification and Extrinsic Regression

3.2.1 Adapted CNNs for TSC and TSER.

This section presents the first category, which we refer to as Adapted CNNs for TSC and TSER. The papers discussed here are mostly adaptations without any particular preprocessing or mathematical characteristics, such as transforming the series to an image or using multi-scale convolution, and therefore do not fit into one of the other categories.

The first CNN for TSC was the Multi-Channel Deep Convolutional Neural Network (MC-DCNN) [43]. It handles multivariate data by independently applying convolutions to each input channel. Each input dimension undergoes two convolutional stages with ReLU activation, followed by max pooling. The output from each dimension is concatenated and passed to a fully connected layer, which is then fed to a final softmax classifier for classification. Similar to MC-DCNN, a three-layer convolution neural network was proposed for human activity recognition (MC-CNN) [44]. Unlike the MC-DCNN, this model applies 1D convolutions to all input channels simultaneously to capture the temporal and spatial relationships in the early stages. The two-stage version of the MC-CNN architecture was used by Zhao et al. [45] on the earliest version of the UCR Time Series Data Mining Archive. The authors also conducted an ablation study to evaluate the performance of the CNN models with differing numbers of convolution filters and pooling types.

Fully Convolutional Networks (FCN) [46] and Residual Network (ResNet) [47] are two deep neural networks that are commonly used for image and video recognition tasks and have been adapted for end-to-end TSC [16]. FCNs are a variant of CNNs designed to operate on inputs of arbitrary size rather than being constrained to fixed-size inputs like traditional CNNs. This is achieved by replacing the fully connected layers in a traditional CNN with a Global Average Pooling (GAP) layer [46]. FCN was adapted for univariate TSC [16], and similar to the original model, it contains three convolution blocks, where each block contains a convolution layer followed by batch normalization and ReLU activation. The three blocks use 128, 256, and 128 filters with filter lengths of 8, 5, and 3, respectively. The output from the last convolution block is averaged with a GAP layer and passed to a final softmax classifier. The GAP layer has the property of reducing the spatial dimensions of the input while retaining the channel-wise information, which allows it to be used in conjunction with a class activation map (CAM) [48] to highlight the regions in the input that are most important for the predicted class. This can provide useful insights into how the network is making its predictions and help identify potential areas for improvement. Similar to FCN, ResNet was also proposed in [16] for univariate TSC. ResNet is a deep architecture containing three residual blocks followed by a GAP layer and a softmax classifier. It uses residual connections between blocks to reduce the vanishing gradient effect that affects deep learning models. The structure of each residual block is similar to the FCN architecture, containing three convolution layers, each followed by batch normalization and ReLU activation. Each convolution layer uses 64 filters, with filter lengths of 8, 5, and 3, respectively. ResNet was found to be one of the most accurate deep learning TSC architectures on 85 univariate TSC datasets [15, 25]. Additionally, an integration of ResNet and FCN has been proposed to combine the strengths of both networks [49].
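The role of the GAP layer can be illustrated with a deliberately small sketch: a single convolution stage with ReLU followed by global average pooling. It omits batch normalization and uses three hand-picked kernels of length 3 rather than the 128/256/128 learned filters of the FCN in [16]; all names and values are assumptions made for illustration.

```python
def conv1d(series, kernel):
    """Valid 1D cross-correlation of a univariate series with one kernel."""
    k = len(kernel)
    return [sum(series[t + i] * kernel[i] for i in range(k))
            for t in range(len(series) - k + 1)]


def relu(values):
    return [max(0.0, v) for v in values]


def global_average_pooling(feature_maps):
    """Average every feature map over time: one value per filter."""
    return [sum(fm) / len(fm) for fm in feature_maps]


# Hypothetical, hand-picked kernels standing in for learned filters.
kernels = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [-1.0, 2.0, -1.0]]


def embed(series):
    """Convolution -> ReLU -> GAP, giving a vector with one entry per filter."""
    return global_average_pooling([relu(conv1d(series, k)) for k in kernels])


short = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
longer = short * 4  # a series four times as long
```

Both `embed(short)` and `embed(longer)` return three values, so the classifier that follows GAP sees a fixed-size input regardless of series length. This is what lets a fully convolutional model accept inputs of arbitrary size.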

In addition to adapting the network architecture, some research has focused on modifying the convolution kernel to suit TSC tasks better. Dilated convolution neural networks (DCNNs) [50] are a type of CNN that uses dilated convolutions to increase the receptive field of the network without increasing the number of parameters. Dilated convolutions create gaps between elements of the kernel and perform convolution, thereby covering a larger area of the input. This allows the network to capture long-range dependencies in the data, making it well suited to TSC tasks [51]. Recently, Disjoint-CNN [52] showed that factorization of 1D convolution kernels into disjoint temporal and spatial components yields accuracy improvements with almost no additional computational cost. Applying disjoint temporal convolution and then spatial convolution behaves similarly to the Inverted Bottleneck [53]. Like the Inverted Bottleneck, the temporal convolutions expand the number of input channels, and spatial convolutions later project the expanded hidden state back to the original size to capture the temporal and spatial interaction.
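The effect of dilation on the receptive field can be checked with a short sketch. The functions below are illustrative, assume stride 1 and no padding, and are not taken from [50] or [51].

```python
def dilated_conv1d(series, kernel, dilation=1):
    """Valid 1D convolution whose kernel taps are `dilation` steps apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # input steps covered by one output value
    return [sum(series[t + i * dilation] * kernel[i] for i in range(k))
            for t in range(len(series) - span + 1)]


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolution layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


series = [float(v) for v in range(10)]
out = dilated_conv1d(series, [1.0, 1.0, 1.0], dilation=2)  # taps at t, t+2, t+4

plain = receptive_field(3, [1, 1, 1, 1])    # four ordinary layers
dilated = receptive_field(3, [1, 2, 4, 8])  # four layers, doubling dilation
```

With the same kernel size and therefore the same number of weights per layer, four ordinary layers cover 9 time steps while four layers with doubling dilation cover 31.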

3.2.2 Imaging Time Series.

In TSC, a common approach is to convert the time series data into a fixed-length representation, such as a vector or matrix, which can then be input to a deep learning model. However, this can be challenging for time series data that vary in length or have complex temporal dependencies. One solution to this problem is to represent the time series data in an image-like format, where each time step is treated as a separate channel in the image. This allows the model to learn from the spatial relationships within the data rather than just the temporal relationships. In this context, the term spatial refers to the relationships between different variables or features within a single time step of the time series.

As an alternative to using raw time series data as input, Wang and Oates encoded univariate time series data into different types of images that were then processed by a regular CNN [54]. This image-based framework initiated a new branch of deep learning approaches for time series, which consider image transformation as one of the feature engineering techniques. Wang and Oates presented two approaches for transforming a time series into an image. The first generates a Gramian Angular Field (GAF), while the second generates a Markov Transition Field (MTF). GAF represents time series data in polar coordinates and uses various operations to convert the resulting angles into a symmetric matrix, and MTF encodes the matrix entries using the transition probability of a data point from one time step to another [54]. In both cases, the image generation increases the size of the time series, making the images potentially prohibitively large. Therefore, the authors propose strategies to reduce the image size without losing too much information. Afterward, the two types of images are combined into a two-channel image, which produces better results than those achieved when using each image separately. Finally, a Tiled CNN model is applied to classify the time series images. In other studies, a variety of transformation methods, including Recurrence Plots (RPs) [55], Gramian Angular Difference Field (GADF) [56], bilinear interpolation [57], and Gramian Angular Summation Field (GASF) [58], have been proposed to transform time series into input images, expecting that the 2D images could reveal features and patterns not found in the 1D sequence of the original time series.
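A minimal sketch of the summation variant of the Gramian Angular Field (GASF) is given below. It rescales the series to \([-1, 1]\), maps each value to an angle \(\phi_i = \arccos(\tilde{x}_i)\), and fills the image with \(\cos(\phi_i + \phi_j)\). The size-reduction step mentioned above is omitted, and the function name and the min-max rescaling are assumptions made for this illustration.

```python
import math


def gasf(series):
    """Gramian Angular Summation Field of a univariate series (T x T image)."""
    lo, hi = min(series), max(series)  # assumes the series is not constant
    scaled = [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]
    scaled = [max(-1.0, min(1.0, s)) for s in scaled]  # guard rounding errors
    phi = [math.acos(s) for s in scaled]               # polar-coordinate angles
    return [[math.cos(p_i + p_j) for p_j in phi] for p_i in phi]


image = gasf([0.0, 1.0, 2.0, 3.0])
```

A series of length \(T\) becomes a \(T \times T\) image, which shows why the encoding grows quickly for long series.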

Hatami et al. [55] propose a representation method based on RP [59] to convert the time series to 2D images, which are then classified with a CNN model. In their study, time series are regarded as exhibiting distinct recurrent behaviors, such as periodicities and irregular cyclicities, which are typical phenomena of dynamic systems. The main idea of using the RP method is to reveal at which points some trajectories return to a previous state. Finally, two-stage convolution and two fully connected layers are applied to classify the images generated by RP. Subsequently, Karimi-Bidhendi et al. [56] used a pre-trained Inception v3 [60] to map GADF images into a 2,048-dimensional vector space, followed by an MLP with three hidden layers and a softmax activation function. Following the same framework, Chen and Shi [61] adopted the Relative Position Matrix (RPMCNN) and VGGNet to classify time series data using transformed 2D images. Their results showed promising performance when converting univariate time series data to 2D images using the relative positions between two timestamps. Following the same convention, Yang et al. [58] used three image encoding methods, GASF, GADF, and MTF, to encode multivariate time series (MTS) data into 2D images. They showed that a simple ConvNet structure is sufficient for classification, as it performed as well as the more complex VGGNet.
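The idea behind a recurrence plot can be sketched as follows: the series is embedded into short state vectors, and the image records how close every pair of states is. The embedding dimension, delay, and threshold used here are illustrative choices and are not claimed to be the settings of [55].

```python
import math


def delay_embed(series, dim=2, delay=1):
    """Build state vectors (x_t, x_{t+delay}, ...) from a univariate series."""
    n = len(series) - (dim - 1) * delay
    return [[series[t + k * delay] for k in range(dim)] for t in range(n)]


def recurrence_plot(series, dim=2, delay=1, eps=None):
    """Pairwise state distances; binarized with threshold eps if given."""
    states = delay_embed(series, dim, delay)
    dist = [[math.dist(a, b) for b in states] for a in states]
    if eps is None:
        return dist  # unthresholded, grey-scale image
    return [[1 if d <= eps else 0 for d in row] for row in dist]


periodic = [0.0, 1.0, 0.0, -1.0] * 3  # period of four time steps
rp = recurrence_plot(periodic, eps=0.1)
```

For this periodic input, entries such as `rp[0][4]` and `rp[0][8]` are 1 because the trajectory returns to the same state every four steps, which produces the diagonal stripes typical of periodic signals.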

Overall, representing time series data as 2D images is difficult because the temporal relationships and patterns in the data are hard to preserve. The transformation can also result in a loss of information, making it difficult for the model to classify the data accurately. Yang et al. [58] have also shown that specific transformation methods such as GASF, GADF, and MTF do not significantly improve the prediction outcome.

3.2.3 Multi-scale Operation.

The papers discussed here apply a multi-scale convolutional kernel to the input series or apply regular convolutions on the input series at different scales. Multi-scale CNNs (MCNN) [62] and Time LeNet (t-LeNet) [63] were considered the first models that preprocess the input series to apply convolution on multi-scale series rather than raw series. The designs of both MCNNs and t-LeNet were inspired by computer vision models, which means that they were adapted from models originally developed for image recognition tasks. These models may not be well suited to TSC tasks and may not perform as well as models specifically designed for this purpose. One potential reason for this is the use of progressive pooling layers in these models, commonly used in computer vision models, to reduce the input data size and make it easier to process. However, these pooling layers may not be as effective when applied to time series data and may limit the performance of the model.

MCNN has a simple architecture comprising two convolutions and a pooling layer, followed by a fully connected layer and a softmax layer. However, this approach involves heavy data preprocessing. Specifically, before any training, MCNN uses a sliding window to extract time series subsequences, and each subsequence then undergoes three transformations: (1) identity mapping, (2) down-sampling, and (3) smoothing, which turns a univariate input time series into a multivariate one. Finally, the transformed output is fed to the CNN model to train a classifier [62]. t-LeNet uses two data augmentation techniques, window slicing (WS) and window warping (WW), to prevent overfitting [63]. The WS method is identical to MCNN’s data augmentation. The second data augmentation technique, WW, squeezes or dilates the time series. WS is also adopted to ensure that subsequences of the same length are extracted for training the network to deal with multi-length time series. Therefore, a given input time series of length \(L\) is first dilated \((\times 2)\) and then squeezed \((\times 1/2)\) using WW, resulting in three time series of lengths \(L\), \(2L\), and \(L/2\) that are fed to WS to extract equal-length subsequences for training. Finally, as both MCNN and t-LeNet predict a class for each extracted subsequence, majority voting is applied to obtain the class prediction for the full time series.
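The preprocessing pipeline described above can be sketched with a few helper functions: window slicing, the three MCNN-style transformations applied to each slice, and majority voting over per-slice predictions. The window length, step, down-sampling rate, smoothing window, and class labels are hypothetical values, not those used in [62] or [63].

```python
from collections import Counter


def window_slices(series, length, step=1):
    """Window slicing: extract equal-length subsequences."""
    return [series[s:s + length]
            for s in range(0, len(series) - length + 1, step)]


def down_sample(series, rate):
    """Keep every `rate`-th value."""
    return series[::rate]


def smooth(series, window):
    """Moving average with the given window length."""
    return [sum(series[t:t + window]) / window
            for t in range(len(series) - window + 1)]


def multi_scale_views(subsequence):
    """Identity, down-sampled, and smoothed views of one subsequence."""
    return {"identity": subsequence,
            "down_sampled": down_sample(subsequence, 2),
            "smoothed": smooth(subsequence, 3)}


def majority_vote(labels):
    """Combine per-subsequence predictions into one series-level label."""
    return Counter(labels).most_common(1)[0][0]


series = [float(v) for v in range(12)]
slices = window_slices(series, length=8, step=2)
views = [multi_scale_views(s) for s in slices]
prediction = majority_vote(["walk", "run", "walk"])  # hypothetical outputs
```

Each of the three slices yields three views of different lengths (8, 4, and 6 here), which is how a univariate input becomes a multi-branch input to the network.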

Inception was first proposed by Szegedy et al. [70] for end-to-end image classification. The network has since evolved into Inception-v4, in which Inception is coupled with residual connections to further improve performance [71]. Inspired by the Inception architecture, a multivariate convolutional neural network (MVCNN) was designed using multi-scale convolution kernels to find the optimal local construction [64]. MVCNN uses three scales of filters, \(2\times 2\), \(3 \times 3\), and \(5 \times 5\), to extract features of the interaction between sensors. A 1D Inception model was used for supernovae classification using the light flux of a region in space as an input MTS for the network [65]. However, the authors limited the design of their Inception architecture to the first version of this model [70]. The Inception-ResNet [72] architecture includes convolutional layers, followed by Inception modules and residual blocks. The Inception modules are used to learn multiple scales and aspects of the data, allowing the network to capture more complex patterns. The residual blocks are then used to learn the residuals, or differences, between the input and output of the network, improving its performance.

InceptionTime [12] explores much larger filters than any previously proposed network for TSC to reach state-of-the-art performance on the UCR benchmark. InceptionTime is an ensemble of five randomly initialized inception network models, each of which consists of two blocks of inception modules. Each inception module first reduces the dimensionality of a multivariate time series using a bottleneck layer with a length and stride of 1, which leaves the series length unchanged. Then, 1D convolutions of different lengths are applied to the output of the bottleneck layer to extract patterns at different sizes. In parallel, a max pooling layer followed by a bottleneck layer is also applied to the original time series to increase the robustness of the model to small perturbations. The outputs from the convolution and max pooling layers are stacked to form a new multivariate time series, which is then passed to the next layer. Residual connections are used between each inception block to reduce the vanishing gradient effect. The output of the second inception block is passed to a GAP layer before feeding into a softmax classifier.
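The data flow of a single inception module can be sketched with untrained random weights. This is a structural illustration only: the kernel lengths (3, 5, 9), the channel counts, and the helper names are assumptions, batch normalization and activations are omitted, and the code is not the InceptionTime implementation of [12].

```python
import random


def conv1d_same(x, weights):
    """Multi-channel 1D convolution, zero-padded so the length is preserved.

    x: list of C_in channels, each a list of T values.
    weights: list of C_out filters, each a list of C_in odd-length kernels.
    """
    t_len = len(x[0])
    out = []
    for filt in weights:
        pad = len(filt[0]) // 2
        channel = []
        for t in range(t_len):
            acc = 0.0
            for c, kernel in enumerate(filt):
                for i, w in enumerate(kernel):
                    idx = t + i - pad
                    if 0 <= idx < t_len:
                        acc += w * x[c][idx]
            channel.append(acc)
        out.append(channel)
    return out


def max_pool_same(x, size=3):
    """Stride-1 max pooling that keeps the series length."""
    pad = size // 2
    return [[max(ch[max(0, t - pad):t + pad + 1]) for t in range(len(ch))]
            for ch in x]


def random_weights(rng, c_out, c_in, k):
    return [[[rng.uniform(-1, 1) for _ in range(k)] for _ in range(c_in)]
            for _ in range(c_out)]


def inception_module(x, rng, bottleneck=2, filters=2, kernel_sizes=(3, 5, 9)):
    c_in = len(x)
    # Bottleneck: length-1 convolution that reduces the number of channels.
    reduced = conv1d_same(x, random_weights(rng, bottleneck, c_in, 1))
    branches = []
    for k in kernel_sizes:  # parallel convolutions of different lengths
        branches += conv1d_same(reduced, random_weights(rng, filters, bottleneck, k))
    # Parallel branch: max pooling on the original input, then a bottleneck.
    pooled = max_pool_same(x)
    branches += conv1d_same(pooled, random_weights(rng, filters, c_in, 1))
    return branches  # stacked channels form a new multivariate series


rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(3)]  # D = 3, T = 20
y = inception_module(x, rng)
```

The output has `filters * (len(kernel_sizes) + 1)` channels (8 here) and the same length as the input, so modules can be stacked and connected with residual links.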

The strong performance of InceptionTime has inspired a number of extensions. Like InceptionTime, EEG-inception [66] uses several inception layers and residual connections as its backbone. Additionally, noise-addition-based data augmentation of electroencephalogram (EEG) signals is proposed, which increases the average accuracy. InceptionFCN [67] combines two well-known deep learning techniques, namely the Inception module and the Fully Convolutional Network. In KDCTime [68], label smoothing (LSTime) and knowledge distillation (KDTime) were introduced for InceptionTime, the latter automatically generating soft labels while compressing the inference model. Additionally, knowledge distillation with calibration (KDC) in KDCTime offers two calibrating strategies: KDC by translating (KDCT) and KDC by reordering (KDCR). LITE [69] addresses InceptionTime’s complexity while preserving its TSC performance. Using depthwise separable convolutions, LITE incorporates multiplexing, dilated convolution, and custom filters [73] to enhance efficiency.

3.3 Recurrent Neural Network

RNNs are a type of neural network built with internal memory to work with time series and sequential data. Although conceptually similar to feed-forward neural networks (FFNs), RNNs differ in their ability to handle variable-length inputs and produce variable-length outputs.

3.3.1 Vanilla Recurrent Neural Networks (Vanilla RNNs).

RNNs for TSC were proposed in [74], where the input series are classified based on their dynamic behavior. The authors used a sequence-to-sequence architecture in which each sub-series of the input series is classified in the first step. Then the argmax function is applied to the entire output, and finally, the neuron with the highest rate specifies the classification result. To improve model parallelization and capacity, [75] proposed a two-layer RNN. In the first layer, the input sequence is split across several independent RNNs to improve parallelization, followed by a second layer that uses the first layer’s output to capture long-term dependencies [75]. Further, RNNs have been used in some hierarchical architectures [76, 77]. Hermans and Schrauwen showed that a deeper version of RNNs could perform hierarchical processing on complex temporal tasks and capture the time series structure more naturally than a shallow version [77]. RNNs are usually trained iteratively using a procedure known as backpropagation through time (BPTT). When unfolded in time, RNNs look like very deep networks with shared parameters. Because the weights are shared across time steps, the gradients are summed over all time steps to train the model. These gradients undergo repeated matrix multiplication due to the chain rule and therefore either shrink exponentially toward zero, known as vanishing gradients, or blow up to very large values, referred to as exploding gradients [78]. These problems motivated the development of gated recurrent architectures, namely long short-term memory (LSTM) [79] and Gated Recurrent Unit (GRU) [80].
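The effect of repeated matrix multiplication on gradients can be checked numerically. The sketch below assumes a linear recurrence h_t = W h_{t-1}, for which the Jacobian of the state after n steps with respect to the initial state is W^n. The helper `jacobian_norm` is hypothetical, and W is chosen as a scaled orthogonal matrix only so that the spectral norm equals scale^n exactly; real RNNs add nonlinearities and inputs, which this example ignores.

```python
import numpy as np


def jacobian_norm(w, steps):
    """Spectral norm of d h_steps / d h_0 for the linear recurrence h_t = w @ h_{t-1}."""
    jac = np.eye(w.shape[0])
    for _ in range(steps):
        jac = w @ jac  # chain rule: one more factor of w per time step
    return np.linalg.norm(jac, 2)


rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal matrix

shrinking = jacobian_norm(0.9 * q, 100)  # equals 0.9 ** 100, about 2.7e-5
growing = jacobian_norm(1.1 * q, 100)    # equals 1.1 ** 100, about 1.4e4
```

A recurrent weight matrix whose singular values sit only slightly below or above 1 therefore changes the gradient magnitude by several orders of magnitude over 100 time steps.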

3.3.2 Long Short-Term Memory (LSTM).

LSTM addresses the common vanishing/exploding gradient issue in vanilla RNNs by integrating memory cells with gate control into their state dynamics [79]. Due to its design, LSTM is suited to problems involving sequence data, such as language translation [81], video representation learning [82], and image caption generation [83]. The TSC problem is no exception and mainly adopts a model similar to that used for language translation [81]. Sequence-to-Sequence with Attention (S2SwA) [84] incorporates two LSTMs, one encoder and one decoder, in a sequence-to-sequence fashion for TSC. In this model, the encoder LSTM accepts input time series of arbitrary lengths and extracts information from the raw data, based on which the decoder LSTM constructs fixed-length sequences that can be regarded as automatically extracted features for classification.

3.3.3 Gated Recurrent Unit (GRU).

GRU, another widely used variant of RNNs, shares similarities with LSTM in its ability to control information flow and memorize context across multiple time steps [80]. Similar to S2SwA [84], a sequence auto-encoder (SAE) based on GRU has been defined to deal with the TSC problem [85]. A fixed-size output is produced by processing inputs of various lengths using GRUs as the encoder and decoder. The model’s accuracy was also improved by pre-training the parameters on massive unlabeled data.

3.3.4 Hybrid Models.

CNNs and RNNs are often combined for TSC because they have complementary strengths. As mentioned previously, CNNs are well suited for learning from spatial relationships in data, such as the patterns and correlations between the channels of different time steps in a time series. This allows them to learn useful features from the time series data that can help improve the classification performance. RNNs, on the other hand, are well suited for learning from temporal dependencies in data, such as the past values of a time series that can help predict its future values. This allows them to capture the dynamic nature of time series data and make more accurate predictions. Combining the strengths of CNNs and RNNs makes it possible to learn spatial and temporal features from the time series data, improving the model’s performance for TSC. Additionally, the two models can be trained together, allowing them to learn from each other and improve the model’s overall performance.

Various extensions like MLSTM-FCN [86], TapNet [87], and SMATE [88] were proposed later to deal with time series data. MLSTM-FCN extends the univariate LSTM-FCN model [89] to the multivariate case. Like the LSTM-FCN, the multivariate version comprises LSTM blocks and fully convolutional blocks for extracting features from input series. A squeeze and excite block is also added to the FCN block and can execute a form of self-attention on the output feature maps of previous layers [86]. Two further proposals for multivariate TSC are the Time series attentional prototype Network (TapNet) and Semi-Supervised Spatio-Temporal (SMATE) [87, 88]. These methods combine and seek to leverage the relative strengths of both traditional distance-based and deep learning approaches.

MLSTM-FCN, TapNet, and SMATE were designed as dual-network architectures. The input is fed separately into the CNN and RNN models, and their outputs are concatenated before the fully connected layer for the final task. However, one branch cannot fully use the hidden states of the other during feature extraction since the final classification results are generated by concatenating the outputs of the two branches. This motivates different architectures, such as GCRNN [90] and CNN-LSTM [91], which aim to integrate CNNs and RNNs in a layer-wise fashion.

While RNNs are commonly used for time series forecasting, only a few studies have applied them to TSC, mainly due to four reasons: (1) RNNs typically struggle with the gradient vanishing and exploding problem due to training on long time series [92]; (2) RNNs are considered difficult to train and parallelize, so researchers are less likely to use them as they are computationally expensive [78]; (3) recurrent architectures are designed mainly to learn from the previous data to make predictions about the future [28]; and (4) RNN models can fail to effectively capture and utilize long-range dependencies in long sequences [84].

3.4 Attention-based Model

Despite the excellent performance of CNN models for capturing local temporal/spatial correlations, these models cannot effectively capture and utilize long-range dependencies. Additionally, they only consider the local order of data points rather than the overall order of all data points. Therefore, many recent studies have embedded RNNs such as LSTMs alongside the CNNs to capture this information [86, 87, 89]. The disadvantage of RNN-based models is that they are computationally expensive, and their capability to capture long-range dependencies is limited [18, 93]. On the other hand, attention models can capture long-range dependencies, and their broader receptive fields provide more contextual information, which can improve the models’ learning capacity. The attention mechanism aims to enhance a network’s representation ability by focusing on essential features and suppressing unnecessary ones. Not surprisingly, with the success of attention models in natural language processing [93, 94], many previous studies have attempted to bring the power of attention models into various domains such as computer vision [95] and time series analysis [18, 19, 96, 97, 98]. Table 2 presents a list of the attention-based models reviewed in this article.

Model	Year	Embedding	Attention
MuVAN [99]	2018	Bi-GRU	Self-attention
ChannelAtt [102]	2018	RNN	Self-attention
GeoMAN [103]	2018	LSTM	Self-attention
Multi-Stage-Att [104]	2020	LSTM	Self-attention
CT_CAM [105]	2020	FCN + Bi-GRU	Self-attention
CA-SFCN [18]	2020	FCN	Self-attention
RTFN [106]	2021	CNN + LSTM	Self-attention
LAXCAT [100]	2021	CNN	Self-attention
MACNN [101]	2021	Multi-scale CNN	Squeeze-and-excitation
WHEN [107]	2023	CNN + BiLSTM	Self-attention
SAnD [112]	2018	Linear Embedding	Multi-head
T2 [115]	2021	Gaussian Process Regression + 1D Conv	Multi-head
GTN [116]	2021	Linear Embedding	Multi-head
TRANS_tf [113]	2021	Time-Frequency Features	Multi-head
FMLA [117]	2022	Deformable CNN	Multi-head
AutoTransformer [118]	2022	Multi-scale CNN + NAS	Multi-head
ConvTran [13]	2023	Disjoint-CNN	Multi-head

Table 2. Summary of Attention-based Models for Time Series Classification and Extrinsic Regression

3.4.1 Self-attention.

Self-attention has been demonstrated to be effective in various natural language processing tasks due to its ability to capture long-term dependencies in text [93]. Recently, it has also been shown to be effective for TSC tasks [18, 99, 100, 101]. In natural language processing, the self-attention module is embedded in encoder-decoder models to improve model performance. For TSC, however, only the encoder and the self-attention module are used. Early TSC models follow the same backbone as natural language processing models and use recurrent models such as RNN [102], GRU [99], and LSTM [103, 104] for encoding the input series. For example, the Multi-View Attention Network (MuVAN) applies bidirectional GRUs independently to each input dimension as the encoder and then feeds all the representations into a self-attention block [99].
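For reference, the core computation behind a self-attention block can be sketched in a few lines. The example below implements single-head scaled dot-product self-attention as in the vanilla transformer [93], applied to an already encoded series of shape (T, d). The function names and the projection matrices `wq`, `wk`, and `wv` are assumptions for this sketch; it does not reproduce the attention variant of any specific model discussed here.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over the time axis.

    x: (T, d) encoded series; wq, wk, wv: (d, d_k) projection matrices.
    Returns the attended representation (T, d_k) and the (T, T) attention weights.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of every pair of time steps
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


# Usage with hypothetical sizes: 6 time steps, 4 input features, 5 projected features.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
wq, wk, wv = (rng.standard_normal((4, 5)) for _ in range(3))
out, weights = self_attention(x, wq, wk, wv)
```

Because every time step attends to every other time step, the receptive field covers the whole series in a single layer, at a cost that grows quadratically with the series length T.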

As a result of the excellent performance of the CNN models, many studies have attempted to encode the time series using CNNs before applying attention [18, 100, 105, 106]. Cross-Attention Stabilized Fully Convolutional Neural Network (CA-SFCN) [18] and Locality-Aware eXplainable Convolutional ATtention network (LAXCAT) [100] applied the self-attention mechanism to leverage the long-term dependencies for the MTSC task. CA-SFCN combines FCN and two types of self-attention, temporal attention (TA) and variable attention (VA), which interact to capture the long-range dependencies and variables’ interactions. LAXCAT also used temporal and variable attention to identify informative variables and the time intervals where they have informative patterns for classification. WaveletDTW Hybrid attEntion Networks (WHENs) [107] integrate two attention mechanisms, namely wavelet attention and DTW attention, into the BiLSTM to enhance model performance. In wavelet attention, they leverage wavelets to compute attention scores, specifically targeting the analysis of dynamic frequency components in nonstationary time series. Simultaneously, DTW attention employs the DTW distance to calculate attention scores, addressing the challenge of time distortion in multiple time series.

Several self-attention models have been developed to improve network performance [108, 109], including Squeeze-and-Excitation (SE) [110], which focuses on channel attention and is often used to classify time series data [86, 101, 111]. The SE block allows the whole network to use global information to selectively focus on the informative feature maps and suppress less important ones [110]. More importantly, the SE block can increase the quality of the shared lower-level representations in the early layers and becomes increasingly specialized when responding to different inputs in later layers. The weight of each feature map is automatically learned at each layer of the network, and the SE block can boost feature discrimination throughout the whole network. Multi-scale Attention Convolutional Neural Network (MACNN) [101] applies convolutions with different kernel sizes to capture different scales of information along the time axis, generating feature maps at differing scales. Then an SE block is used to enhance useful feature maps and suppress less useful ones by automatically learning each feature map’s importance.
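The channel reweighting performed by an SE block can be sketched as follows for 1D feature maps. The sketch follows the usual squeeze (global average over time), excitation (two small dense layers with ReLU and sigmoid), and scale steps; the function name, the weight shapes, and the reduction from 8 to 2 channels in the usage lines are assumptions for illustration, and biases are omitted.

```python
import numpy as np


def squeeze_excite(u, w1, w2):
    """Squeeze-and-excitation over 1D feature maps.

    u: (C, T) feature maps; w1: (C // r, C) and w2: (C, C // r) dense weights.
    Returns the rescaled feature maps and the per-channel gates in (0, 1).
    """
    z = u.mean(axis=1)                           # squeeze: one statistic per channel
    hidden = np.maximum(w1 @ z, 0.0)             # excitation, ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # excitation, sigmoid gate per channel
    return u * gate[:, None], gate               # scale each feature map by its gate


# Usage with hypothetical sizes: 8 feature maps of length 20, reduction ratio r = 4.
rng = np.random.default_rng(0)
u = rng.standard_normal((8, 20))
out, gate = squeeze_excite(u, rng.standard_normal((2, 8)), rng.standard_normal((8, 2)))
```

Each gate depends on the global averages of all channels, which is how the block uses global information to emphasize or suppress individual feature maps.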

3.4.2 Transformers.

The impressive performance of multi-headed attention has led to numerous attempts to adapt it to the TSC domain. Transformers for classification usually employ a simple encoder structure consisting of attention and feed-forward layers. The Simply Attend and Diagnose (SAnD) [112] architecture was the first to adopt a multi-head attention mechanism similar to the vanilla transformer [93] to classify clinical time series. The model uses both positional encoding and a dense interpolation embedding technique to incorporate temporal order into representation learning. In another study that classified vibration signals [113], time-frequency features such as Frequency Coefficients and Short-Time Fourier Transform (STFT) spectra are used as input embeddings to the transformers. A multi-head attention-based model was applied to raw optical satellite TSC using Gaussian Process Interpolation [114] embedding and outperformed CNNs and RNNs [115].

Gated Transformer Networks (GTNs) [116] use two-tower multi-headed attention to capture the discriminative information from the input series. The outputs of the two towers are merged using a learnable matrix named gating. To enhance locality awareness of transformers for TSC, flexible multi-head linear attention (FMLA) [117] integrates deformable convolutional blocks and online knowledge distillation, as well as a random mask to reduce noise. For each TSC dataset, AutoTransformer [118] searches for a suitable network architecture using a neural architecture search (NAS) algorithm before feeding the output to the multi-headed attention blocks. ConvTran [13] currently stands as the state of the art in multivariate TSC. Its authors reviewed existing absolute and relative position encoding methods in TSC. Based on the limitations of the current position encodings for time series, they introduced two novel ones named tAPE and eRPE for absolute and relative positions, respectively. By integrating these position encodings into a transformer block and combining them with a convolution layer, they presented ConvTran, a novel deep learning framework for multivariate time series classification.

3.5 Graph Neural Networks

While both CNNs and RNNs perform well on Euclidean data, many time series problems have data that are more naturally represented as graphs [119]. For example, in a network of sensors, the sensors may be irregularly spaced, instead of the sensors forming a regular grid. A graph representation of data collected by this network can model this irregular layout more accurately than can be done using a Euclidean space. However, using standard deep learning algorithms to learn from graph structures is challenging [120]. For example, nodes may have a varying number of neighboring nodes, making it difficult to apply a convolution operation.

GNNs [121] are methods that adapt deep learning techniques to the graph domain. Much of the early research using GNNs for time series analysis concentrated on forecasting tasks [119]. However, recent works consider GNNs for TSC [122, 123] and TSER [124] tasks. A list of the GNN models reviewed in this article is provided in Table 3. Time2Graph+ [125] transforms each time series into a shapelet graph. Shapelets are extracted from the time series and form the graph nodes. The graph edges are weighted based on transition probabilities between the two shapelets. Once the input graphs have been constructed, a graph attention network is used to create a representation of the time series that is fed into a classifier. SimTSC [137] constructs a pairwise similarity graph where each time series forms a node and edge weights are computed based on the DTW distance measure. Node attributes are generated using a feature vector encoder. GNN operations are used to enhance the node features based on similarities between adjacent time series. These representations are then used for the final classification step, which produces a classification for each node. LB-SimTSC [122] replaces the expensive DTW computation with the LB-Keogh lower-bounding method [141].
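To make the cost difference between DTW and its lower bound concrete, the sketch below computes a window-constrained DTW distance with dynamic programming and the LB-Keogh bound, which only compares the query against an upper and lower envelope of the candidate. It assumes univariate series of equal length and a squared pointwise cost, and the function names are hypothetical; how SimTSC and LB-SimTSC turn these distances into edge weights is not reproduced here.

```python
import numpy as np


def dtw_distance(a, b, window):
    """DTW distance with a Sakoe-Chiba band of half-width `window` (equal lengths)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))


def lb_keogh(query, candidate, window):
    """LB-Keogh lower bound on dtw_distance(query, candidate, window)."""
    query = np.asarray(query, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    total = 0.0
    for i, q in enumerate(query):
        segment = candidate[max(0, i - window): i + window + 1]
        upper, lower = segment.max(), segment.min()  # envelope around the candidate
        if q > upper:
            total += (q - upper) ** 2
        elif q < lower:
            total += (lower - q) ** 2
    return float(np.sqrt(total))


# A shifted peak is matched exactly by DTW once warping of one step is allowed.
shifted = dtw_distance([0, 0, 1, 0], [0, 1, 0, 0], window=1)  # 0.0
```

The DTW recursion costs on the order of T times the window width per pair, whereas the bound needs a single pass over the query once the envelope is available, which is what makes it attractive when a graph over all pairs of series has to be built.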

Model	Year	GNN Type	Other Components
TGCN [126]	2019	Graph convolutional network	1D-CNN
DGCNN [127]	2020	Graph convolutional network	1x1 CNN
GraphSleepNet [128]	2020	Graph convolutional network	Temporal attention
T-GCN [129]	2021	Graph convolutional network	GRU
MRF-GCN [130]	2021	Graph convolutional network	Fast Fourier transforms (FFT)
Nhu et al. [131]	2021	Graph convolutional network	1D-CNN
DCRNN [132]	2021	Graph convolutional network	GRU
Time2Graph+ [125]	2021	Graph attention	Shapelet transform
RAINDROP [133]	2021	Graph guided network	Temporal attention
STEGON [134]	2021	Graph attention	1D-CNN
Azevedo et al. [135]	2022	Graph network block with pooling	1D-CNN
MTPool [136]	2022	Variational graph pooling	1D-CNN
SimTSC [137]	2022	Graph convolutional network	DTW, ResNet
Tulczyjew et al. [138]	2022	Graph convolutional network	Adaptive pooling
C-DGAM [139]	2023	Graph attention	1D-CNN with attention
Dufourg et al. [140]	2023	Spatio-temporal graph	Simple linear iterative clustering
TISER-GCN [124]	2023	Graph convolutional network	1D-CNN
TodyNet [123]	2023	Dynamic graph neural network	1D-CNN
LB-SimTSC [122]	2023	Graph convolutional network	Lower-bound DTW, ResNet

Table 3. Summary of Graph Neural Network Models for Time Series Classification and Extrinsic Regression

Spatiotemporal GNNs model both spatial (or inter-variable) and temporal dependencies using two modules that work in tandem. The spatial module models the dependencies between the time series using graph convolutions, as in graph convolutional networks (GCNs) [142]. The temporal module models the dependencies within the time series using an RNN [129, 132], 1D-CNN [134, 135], Attention [133, 139], or a combination of these [119]. The features extracted from the graph layers are then fed into the classification or regression layers to make either a single prediction [132, 133, 135, 139] or a prediction for each node [129, 134]. Spatiotemporal GCNs are often used to analyze sensor arrays, where the graph structure models the physical layout of the sensors. A common example is EEG data, where the location of EEG electrodes is represented as a graph that is used to analyze the EEG signal. Some of these applications are epilepsy detection [131], seizure detection [126, 132], emotion recognition [127], and sleep classification [128]. Besides EEG, GCNs have also been applied to engineering applications such as machine fault diagnosis [130], slope deformation prediction [129], and seismic activity prediction [124]. MTPool [136] uses a spatiotemporal GCN for multivariate time series classification. In this study, each channel in the time series is represented by a node in the graph, and the graph edges model the correlations between the channels. The GCN is combined with temporal convolutions and a hierarchical graph pooling technique. Spatiotemporal GNNs have also been used for object-based image analysis [134] and semantic segmentation [138] of image time series. However, these assume the labels and spatial relationships are static over time. In many cases, both may change. Spatiotemporal graphs (STGs), which include temporal edges as well as spatial edges, can model these dynamic relationships [140]. In STGs, each node represents an object at one timestamp. Spatial edges connect the object to adjacent objects, and temporal edges connect two objects in consecutive images if they have common pixels.
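The spatial module can be illustrated with a single graph convolution over the channels of a multivariate series. The sketch below uses the symmetric-normalized propagation rule commonly associated with GCNs (self-loops added, then degree normalization) and, as an assumption for this example, builds the adjacency matrix from absolute Pearson correlations between channels. The function names are hypothetical, and the raw series is used directly as node features, whereas the surveyed models typically obtain node features from a temporal module.

```python
import numpy as np


def correlation_adjacency(x):
    """x: (channels, T) -> absolute Pearson correlation between channels, zero diagonal."""
    a = np.abs(np.corrcoef(x))
    np.fill_diagonal(a, 0.0)
    return a


def gcn_layer(a, h, w):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    a: (N, N) adjacency, h: (N, F_in) node features, w: (F_in, F_out) weights.
    """
    a_tilde = a + np.eye(a.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))  # inverse square root of degrees
    a_hat = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_hat @ h @ w, 0.0)


# Usage with hypothetical sizes: 5 channels (nodes), series length 40, 16 output features.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 40))
adj = correlation_adjacency(x)
node_features = gcn_layer(adj, x, rng.standard_normal((40, 16)))  # shape (5, 16)
```

Each output row mixes the features of a channel with those of the channels it is connected to, weighted by edge strength, before the learned projection is applied.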

4 Self-supervised Models

Obtaining labeled data for large time series datasets poses significant costs and challenges. Machine learning models trained on large labeled time series datasets often outperform models trained on sparsely labeled datasets, on small datasets with limited labels, or without supervision, all of which tend to yield suboptimal performance across various time series machine learning tasks [23, 143]. As a result, rather than depending on high-quality annotations for large datasets, researchers and practitioners are increasingly shifting their focus toward self-supervised representation learning for time series.

Self-supervised representation learning, a subfield of machine learning, focuses on learning representations from data without explicit supervision [24]. In contrast to supervised learning, which relies on labeled data, self-supervised learning methods utilize the inherent structure of the data to learn valuable representations in an unsupervised manner. The learned representations can then be used for a variety of downstream tasks including classification, anomaly detection, and forecasting. This survey specifically emphasizes classification as a downstream task. We categorize self-supervised learning approaches for TSC into three groups based on the pretext task. Table 4 shows a list of the self-supervised models reviewed in this article.

Model	Year	Encoder Backbones	Other Features
Contrastive Learning
TCL [144]	2016	MLP	Sequence-based contrast
T-Loss/SRL [145]	2019	Causal CNN	Sequence-based contrast
TNC [146]	2021	Bidirectional RNN	Sequence-based contrast
TS-TCC [21]	2021	CNN + Transformers	Instance/sequence-based contrast
MCL [147]	2021	FCN	Instance-based contrast
TimeCLR [148]	2021	InceptionTime	Instance-based contrast
TS2Vec [23]	2021	Dilated CNN	Sequence-based contrast
BTSF [143]	2022	Causal CNN	Instance-based contrast
TF-C [149]	2022	ResNets	Instance-based contrast
MHCCL [150]	2023	ResNet	Instance-based contrast
Self-Prediction
BENDR [98]	2021	CNN + Transformers	Sequence masking
Voice2Series [22]	2021	CNN + Transformers	Binary masking
TST [19]	2021	Transformers	Binary masking
TARNet [151]	2022	Transformers	Binary masking
TimeMAE [152]	2023	CNN + Transformers	Sequence masking
CRT [153]	2023	Transformers	Sequence masking
Other Pretext tasks
PHIT [154]	2023	H-InceptionTime
Series2Vec [24]	2023	Disjoint CNN	Similarity-based representation learning

Table 4. Summary of Self-supervised Models for Time Series Classification and Extrinsic Regression

4.1 Contrastive Learning

Contrastive learning involves a model learning to differentiate between positive and negative time series examples. Time-Contrastive Learning (TCL) [144], Scalable Representation Learning (SRL or T-Loss) [145], and Temporal Neighborhood Coding (TNC) [146] apply subsequence-based sampling and assume that distant segments are negative pairs and neighboring segments are positive pairs. TNC takes advantage of the local smoothness of a signal’s generative process to define neighborhoods in time with stationary properties to further improve the sampling quality for the contrastive loss function. TS2Vec [23] uses contrastive learning to obtain robust contextual representations for each timestamp hierarchically. It involves randomly sampling two overlapping subseries from the input and encouraging consistency of contextual representations on the common segment. The encoder is optimized using both temporal contrastive loss and instance-wise contrastive loss.

In addition to the subsequence-based methods, other models employ instance-based sampling [21, 143, 147, 148, 149, 150], treating each sample individually to generate positive and negative samples for contrastive loss. Time-series Temporal and Contextual Contrasting (TS-TCC) [21] uses weak and strong augmentations to transform the input series into two views and then uses a temporal contrasting module to learn robust temporal representations. The contrasting contextual module is then built upon the contexts from the temporal contrasting module and aims to maximize similarity among contexts of the same sample while minimizing similarity among contexts of different samples. Similarly, TimeCLR [148] introduces DTW data augmentation to enhance robustness against phase shift and amplitude change phenomena. Bilinear Temporal-Spectral Fusion (BTSF) [143] uses simple dropout as the augmentation method and aims to incorporate spectral information into the feature representation. Similarly, Time-Frequency Consistency (TF-C) [149] is a self-supervised learning method that leverages the frequency domain to achieve better representation. It proposes that the time-based and frequency-based representations, learned from the same time series sample, should be more similar to each other in the time-frequency space compared to representations of different time series samples.
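The idea of maximizing similarity between views of the same sample while minimizing similarity to other samples can be written as a generic InfoNCE-style loss. The sketch below is not the loss of any particular model above: it assumes that two augmented views of each series have already been encoded into embeddings `z1` and `z2`, treats matching rows as positive pairs and all other rows in the batch as negatives, and uses a hypothetical temperature of 0.1.

```python
import numpy as np


def info_nce(z1, z2, temperature=0.1):
    """Instance-based contrastive loss for two views of a batch.

    z1, z2: (N, d) embeddings; row i of z1 and row i of z2 come from the same series.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # cosine similarity of every pair
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))    # positives lie on the diagonal


# Toy check with orthonormal embeddings: aligned views give a loss near zero,
# while views paired with the wrong sample give a large loss.
z = np.eye(8)
aligned_loss = info_nce(z, z)
misaligned_loss = info_nce(z, np.roll(z, 1, axis=0))
```

Minimizing this loss pulls the two views of each series together in embedding space and pushes apart the views of different series in the batch.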

4.2 Self-prediction

The primary objective of self-prediction-based self-supervised models is to reconstruct the input or representation of input data. Studies have explored using transformer-based self-supervised learning methods for TSC [19, 22, 98, 151, 152, 153], following the success of models like BERT [94]. BErt-inspired Neural Data Representations (BENDR) [98] uses the transformer structure to model EEG sequences and shows that it can effectively handle massive amounts of EEG data recorded with differing hardware. Another study, Voice-to-Series with Transformer-based Attention (V2Sa) [22], utilizes a large-scale pre-trained speech processing model for TSC.

The Transformer-based Framework (TST) [19] and TARNet [151] adapt vanilla transformers to the multivariate time series domain and use a self-prediction-based self-supervised pre-training approach with masked data. These studies demonstrate the potential of using transformer-based self-supervised learning methods for TSC.
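The masked pre-training objective can be sketched independently of the encoder. In the example below, a binary mask hides part of a multivariate series, the masked series would be passed to a model (not shown), and the reconstruction error is computed only on the hidden positions. Independent Bernoulli masking, the zero fill value, and the function names are assumptions made for brevity; the surveyed models use their own masking schemes.

```python
import numpy as np


def make_mask(shape, mask_ratio, rng):
    """Boolean mask with True marking the values hidden from the model."""
    return rng.random(shape) < mask_ratio


def apply_mask(x, mask):
    """Replace hidden values with zeros before feeding the series to the encoder."""
    return np.where(mask, 0.0, x)


def masked_mse(x, x_hat, mask):
    """Mean squared reconstruction error over the hidden positions only."""
    return float(((x - x_hat) ** 2)[mask].mean())


# Usage with hypothetical sizes: 100 time steps, 3 variables, 30% of values hidden.
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 3))
mask = make_mask(x.shape, 0.3, rng)
x_in = apply_mask(x, mask)          # input to the (omitted) transformer encoder
loss = masked_mse(x, x_in, mask)    # loss of a trivial model that outputs its input
```

Restricting the loss to hidden positions forces the model to infer the missing values from their context rather than copy the visible input.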

4.3 Other Pretext Tasks

While many pretext tasks in self-supervised learning are typically contrastive or self-predictive, specific tasks are tailored for time series data. In image-based self-supervised learning, synthetic transformations (augmentation) of an image are created, and the model learns to contrast the image and its transforms with other images in the training data, which works well for object interpretation. However, time series analysis fundamentally differs from vision or natural language processing concerning the definition of meaningful self-supervised learning tasks.

Guided by this insight, Foumani et al. [24] introduce Series2Vec, a novel self-supervised representation learning approach. Unlike other contrastive self-supervised methods in time series, which carry the risk of positive sample variants being less similar to the anchor sample than series in the negative set, Series2Vec is trained to predict the similarity between two series in both temporal and spectral domains through a self-supervised task. Series2Vec relies primarily on the consistency of the unsupervised similarity step, rather than the intrinsic quality of the similarity measurement, without the need for hand-crafted data augmentation. Pre-trained H-InceptionTime (PHIT) [154] is pre-trained using a novel pretext task designed to identify the originating dataset of each time series sample. The objective is to generate flexible convolution filters that can be applied across diverse datasets. Furthermore, PHIT demonstrates its capability to mitigate overfitting in small datasets.

5 Data Augmentation

In the field of deep learning, the concept of data augmentation has emerged as an important tool for enhancing performance, particularly in scenarios where the availability of training data is limited [155]. Originally proposed in computer vision, data augmentation involves a variety of transformations to images, such as cropping, rotating, flipping, and applying filters like blurring and sharpening. These transformations serve to introduce a diverse range of scenarios within the training data, thereby aiding in the development of more robust and generalizable models. However, the direct application of these image-based augmentation techniques to time series data often proves to be inadequate or inappropriate. Operations like rotation may disrupt the intrinsic temporal structure of time series data.

The challenge of overfitting is particularly pronounced in the field of deep learning models for TSC. These models are characterized by a high number of trainable parameters, which can lead to a model that performs well on training data but fails to generalize to unseen data. In such cases, data augmentation can be a valuable strategy. It offers an alternative to the costly and sometimes impractical approach of collecting additional real-world data. By generating synthetic samples from existing datasets, we can effectively augment the size and variety of our training data. The following paragraphs describe the methods that have been investigated for producing synthetic time series for data augmentation.

Random Transformations.

Several augmentations have been developed for the magnitude domain. Jittering, as explored by Um et al. [156], involves the addition of random noise to the time series. Another method, flipping [157], reverses the time series values. Scaling is a technique where the time series is multiplied by a factor drawn from a Gaussian distribution. Magnitude warping, which shares similarities with scaling, distorts the series along a curve that varies smoothly. For time domain transformations, permutation algorithms play a significant role. For example, the slicing transformation involves removing a subsequence from the series. There are also various warping methods like Random Warping [158], Time Warping [156], Time Stretching [159], and Time Perturbation [160], each introducing different forms of distortion to the time series. Finally, in the frequency domain, transformations often utilize the Fourier transform. For example, Gao et al. [161] introduce perturbations to both the magnitude and phase spectrum following a Fourier transform.
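Three of these transformations are simple enough to sketch directly. The functions below implement jittering, scaling, and the removal of a random subsequence for a series stored with time on the first axis. The function names and the noise levels in the usage lines are hypothetical, and in practice the parameters need tuning for each dataset.

```python
import numpy as np


def jitter(x, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to every value."""
    return x + rng.normal(0.0, sigma, size=x.shape)


def scale(x, sigma, rng):
    """Multiply the whole series by one factor drawn from a Gaussian centered at 1."""
    return x * rng.normal(1.0, sigma)


def remove_subsequence(x, length, rng):
    """Delete a randomly placed contiguous block of `length` time steps."""
    start = rng.integers(0, len(x) - length + 1)
    return np.delete(x, np.s_[start:start + length], axis=0)


# Usage on a toy series of 50 time steps.
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 6.0, 50)) + 2.0
x_jittered = jitter(x, 0.05, rng)
x_scaled = scale(x, 0.1, rng)
x_sliced = remove_subsequence(x, 10, rng)  # 40 time steps remain
```

Jittering and scaling preserve the length of the series, whereas removing a subsequence shortens it, so the result may need resampling or padding before it is fed to a network that expects a fixed length.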

Window methods.

A primary approach in window methods is to create new time series by combining segments from various series of the same class. This technique effectively enriches the data pool with a variety of samples. Window slicing, as introduced by Cui et al. [162], involves dividing a time series into smaller segments, with each segment retaining the class label of the original series. These segments are then used to train classifiers, offering a detailed view of the data. During classification, each segment is evaluated individually, and a collective decision on the final label is reached through a voting system among the slices. Another technique is window warping, based on the DTW algorithm. This method adjusts segments of a time series along the temporal axis, either stretching or compressing them. This introduces variability in the time dimension of the data. Le Guennec et al. [163] provide examples of the application of both window slicing and window warping, showcasing their effectiveness in enhancing the diversity and representativeness of time series datasets.
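Window slicing and the voting step at classification time can be sketched as follows. The slice width, the stride, and the stand-in classifier used in the usage lines are hypothetical; any model that maps a slice to a class label could take its place.

```python
import numpy as np


def window_slices(x, width, stride):
    """Cut a series (time on the first axis) into overlapping slices of equal width."""
    starts = range(0, len(x) - width + 1, stride)
    return np.stack([x[s:s + width] for s in starts])


def majority_vote(labels):
    """Most frequent label; ties are resolved in favor of the smallest label."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]


def classify_by_voting(x, classify_slice, width, stride):
    """Classify every slice separately and let the slices vote on the final label."""
    slices = window_slices(x, width, stride)
    return majority_vote([classify_slice(s) for s in slices])


# Training time: every slice inherits the label of the series it was cut from.
x, y = np.arange(10.0), 1
train_slices = window_slices(x, width=4, stride=2)  # shape (4, 4)
train_labels = np.full(len(train_slices), y)

# Test time: a stand-in classifier (threshold on the slice mean) and a vote.
predicted = classify_by_voting(x, lambda s: int(s.mean() > 3.0), width=4, stride=2)
```

Slicing multiplies the number of training examples at the cost of showing the classifier only part of each series, which is why the final decision is taken over all slices.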

Averaging methods.

Averaging methods in time series data augmentation combine multiple series to form a new, unified series. This process is more difficult than it might seem, as it requires careful consideration of factors like noise and distortions in both the time and magnitude aspects of the data. In this context, weighted DTW Barycenter Averaging (wDBA) introduced by Forestier et al. [164] provides an averaging method by aligning time series in a way that accounts for their temporal dynamics. The practical application of wDBA is illustrated in the study by Ismail Fawaz et al. [165], where it is employed in conjunction with a ResNet classifier, demonstrating its effectiveness. Additionally, the research conducted by Terefe et al. [166] uses an auto-encoder for averaging a set of time series. This method represents a more advanced approach in time series data augmentation, exploiting the auto-encoder’s capacity for learning and reconstructing data to generate averaged representations of time series.

Selection of data augmentation methods.

The selection of the appropriate data augmentation technique is critical and must be adapted to the specific characteristics of the dataset and the architecture of the neural network being used. Studies like those conducted by Iwana and Uchida [167], Pialla et al. [168], and Gao et al. [169] highlight the complexity of this task. These studies demonstrate that the effectiveness of augmentation techniques can vary significantly across different datasets and neural network architectures. Consequently, a method that proves effective in one scenario may not necessarily yield similar results in another. To this end, practitioners in the field of TSC must engage in a careful and informed process of method selection and tuning. While the array of available data augmentation techniques offers a comprehensive toolkit for tackling the challenges of limited data and overfitting, their successful application depends heavily on a nuanced understanding of both the methods themselves and the specific demands of the task at hand.

6 Transfer Learning

Transfer learning, initially popularized in the field of computer vision, is increasingly becoming relevant in the domain of TSC. In computer vision, this approach involves using a pre-trained network, typically on large datasets like ImageNet [170], as a starting point rather than initiating with random network weights. This method is also related to the concept of foundation or base models, which are large-scale machine learning models trained on extensive data, often using self-supervised or semi-supervised learning. These models are adaptable to a wide array of tasks, showcasing their versatility. The principle of transfer learning is also closely associated with domain adaptation, which focuses on applying a model trained on a source data distribution to a different, but related, target data distribution. This approach is crucial in leveraging pre-trained models for various applications, particularly in scenarios where data is scarce or specific to certain domains.

In the context of TSC, insights have been contributed by the work of Ismail Fawaz et al. [171], who conducted a study using the UCR archive. Their extensive experiments demonstrated that transfer learning could lead to positive or negative outcomes, depending on the chosen datasets for transfer. This finding underscores the importance of the relationship between source and target datasets in transfer learning efficacy. Ismail Fawaz et al. [171] also introduced an approach to predict the success of transfer learning in TSC by using DTW to measure similarities between datasets. This metric serves as a guide to select the most appropriate source dataset for a given target dataset, thereby enhancing accuracy in a majority of cases.

Other researchers have also explored transfer learning in TSC. Spiegel’s [172] work on using dissimilarity spaces to enrich feature representations in TSC set a precedent for employing unconventional data sources. This approach of enhancing learning with diverse data types finds a parallel in Li et al.’s [173] method, which leverages sensor modality labels from various fields to train a deep network, emphasizing the importance of versatile data in transfer learning. Building on the concept of data diversity, Rotem et al. [174] pushed the boundaries further by generating a synthetic univariate time series dataset for transfer learning. This synthetic dataset, used for regression tasks, underscores the potential of artificial data in overcoming the limitations of real-world datasets. Furthermore, Senanayaka et al. [175] introduced the similarity-based multi-source transfer learning (SiMuS-TL) approach. By establishing a “mixed domain” to model similarities among various sources, Senanayaka et al. demonstrated the effectiveness of carefully selected and related data sources in transfer learning. Finally, Kashiparekh et al. [176] with their ConvTimeNet (CTN) focused on the adaptability of pre-trained networks across diverse time scales.

While the explored studies collectively advance our understanding of transfer learning in TSC, the field remains open for further investigation. A key challenge lies in determining the most suitable source models for transfer, a task complicated by the relative scarcity of large, curated, and annotated datasets in time series analysis compared to the field of computer vision. This restricts the utility of transfer learning in TSC, as the availability of extensive and diverse datasets is crucial for developing robust and generalizable models. Furthermore, the question of developing filters that are generic enough to be effective across a wide range of applications remains unresolved. This aspect is critical for the success of transfer learning, as the applicability of a pre-trained model to new tasks depends on the universality of its learned features. Additionally, the strategy of whether to freeze certain layers of the network during transfer or to fine-tune the entire network is another area that warrants deeper exploration.

7 Applications: Recent Developments and Challenges

TSC and TSER techniques have been used to analyze and model time-dependent data in a wide range of applications. These include human activity recognition, Earth observation, medical diagnosis including EEG [177] and electrocardiogram (ECG) [178] monitoring, air quality and pollution prediction [179, 180], structural and machine health monitoring [181, 182], Industrial Internet of Things (IIOT) [183], energy consumption and anomaly detection [184], and bio-acoustics [185].

Due to the extensive range of applications that use TSC and TSER, it is infeasible to cover them all in detail in a single review. Therefore, in this survey, we focus on just two applications: human activity recognition and satellite Earth observation. (References to recent reviews have been provided for the other applications mentioned above.) These are two important but quite different domains and were chosen to give the reader an idea of the diversity of time series use in deep learning. The following sections provide an overview of the use of TSC and TSER, the latest developments, and challenges in these two applications.

7.1 Human Activity Recognition

Human activity recognition (HAR) is the identification or monitoring of human activity through the analysis of data collected by sensors or other instruments [186]. The recent growth of wearable technologies and the Internet of Things has resulted in not only the collection of large volumes of activity data [187] but also easy deployment of applications utilizing this data to improve the safety and quality of human life [5, 186]. HAR is therefore an important field of research with applications including healthcare, fitness monitoring, smart homes [188], and assisted living [189].

Devices used to collect HAR data can be categorized as visual or sensor based [4, 5]. Sensor-based devices can be further categorized as object sensors (e.g., RFIDs embedded into objects), ambient sensors (motion sensors, WiFi or Bluetooth devices in fixed locations), and wearable sensors [4], including smartphones [3]. However, the majority of HAR studies use data from wearable sensors or visual devices [186]. Additionally, HAR from visual device data requires the use of computer vision techniques and is therefore out of scope for this review. Accordingly, this section reviews wearable sensor-based methods of HAR. For reviews of vision-based HAR, refer to Kong and Fu [190] or Zhang et al. [191].

The main sensors used in wearable devices are accelerometers, gyroscopes, and magnetic sensors [192], which each collect three-dimensional spatial data over time. Inertial measurement units (IMUs) are wearable devices that combine all three sensors in one unit [193, 194]. Wearable device studies typically collect data from multiple IMUs located on different parts of the body [195, 196]. To create a dataset suitable for HAR modeling, the sensor data is split into (usually equally sized) time windows [197]. The task is then to learn a function that maps the multi-variate sensor data for each time window to a set of activities. Thus, the data forms multi-variate time series suited to TSC.
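The windowing step described above can be sketched as follows. This is an illustrative example only: the function name, the `stride` parameter (which allows overlapping windows), and the choice to label each window with its majority activity are assumptions, not details taken from [197].

```python
from collections import Counter


def segment_windows(samples, labels, window, stride):
    """Split a multivariate sensor stream into equally sized windows.

    samples: list of per-time-step readings (one value per sensor channel).
    labels:  list of per-time-step activity labels.
    Each window is labeled with its most frequent activity (an assumption).
    """
    windows, window_labels = [], []
    for start in range(0, len(samples) - window + 1, stride):
        end = start + window
        windows.append(samples[start:end])
        majority = Counter(labels[start:end]).most_common(1)[0][0]
        window_labels.append(majority)
    return windows, window_labels
```

Each returned window is a small multivariate time series, so the resulting pairs of windows and labels form a standard TSC training set.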

Given the broad scope of our survey, this section necessarily only provides a brief overview of the studies using deep learning for HAR. However, there are several surveys that provide a more in-depth review of machine learning and deep learning for HAR. Lara and Labrador [197] provide a comprehensive introduction to HAR, including machine learning methods used and the principal issues and challenges. Both Nweke et al. [3] and Wang et al. [4] provide a summary of deep learning methods, highlighting their advantages and limitations. Chen et al. [5] discuss challenges in HAR and the appropriate deep learning methods for addressing each challenge. They also provide a comprehensive list of publicly available HAR datasets. Gu et al. [198] focus on deep learning methods, reviewing preprocessing and evaluation techniques as well as the deep learning models.

The deep learning methods used for HAR include both CNNs and RNNs, as well as hybrid CNN-RNN models. While some of the models include an attention module, we did not find any studies proposing a full attention or transformer model. A summary of the studies reviewed and the type of model built is provided in Table 5. Hammerla et al. [199] compared several deep learning models for HAR, including three LSTM variants, a CNN model, and a DNN model. They found that a bi-directional LSTM performed best on naturalistic datasets where long-term effects are important. However, they found that some applications need to focus on short-term movement patterns and suggested CNNs are more appropriate for these applications. Thus, research across all model types is beneficial for the ongoing development of models for HAR applications.

Table 5.

Model	Year	Embedding	Other Features
Zeng et al. [200]	2014	CNN
DCNN [201]	2015	CNN	Discrete Fourier transform
Yang et al. [202]	2015	CNN
DeepConvLSTM [192]	2016	CNN, LSTM
Hammerla et al. [199]	2016	CNN, LSTM	Bi-directional
Ronao and Cho [203]	2016	CNN
Guan and Plötz [204]	2017	LSTM	Ensemble
Lee et al. [205]	2017	CNN
Murad and Pyun [206]	2017	LSTM	Uni- and bi-directional
Ignatov [207]	2018	CNN	Statistical features
Moya Rueda et al. [208]	2018	CNN
Yao et al. [209]	2018	CNN	Fully convolutional
Zeng et al. [210]	2018	LSTM	2 attention layers
AttnSense [211]	2019	CNN, GRU	Fast Fourier transform, 2 attention layers
InnoHAR [212]	2019	CNN, GRU	Inception
Zhang et al. [213]	2020	CNN	Attention
Challa et al. [214]	2021	CNN, LSTM	Bi-directional
CNN-biGRU [215]	2021	CNN, GRU	Bi-directional
DEBONAIR [216]	2021	ConvLSTM
Mekruksavanich and Jitpattanakul [217]	2021	CNN, LSTM
Mekruksavanich and Jitpattanakul [218]	2021	CNN, LSTM	Ensemble
Nafea et al. [219]	2021	CNN, LSTM	Bi-directional
Singh et al. [220]	2021	CNN, LSTM	Attention
Wang et al. [221]	2022	CNN
Xu et al. [222]	2022	CNN, ResNet	Deformable convolutions

Table 5. Summary of HAR Deep Learning Models

Many of the papers reviewed in this section used commonly available datasets to build and evaluate their models.

7.1.1 Convolutional Neural Networks.

One of the most common types of convolutional kernels for HAR is the \(k \times 1\) kernel. This kernel convolves \(k\) time steps together, moving along each time series in the input features in turn [221], so while weights are shared between the input features, there is no mixing between features. The outputs from the final convolutional layer are flattened and processed by fully connected layers before the final classification is made. Ronao and Cho [203] performed a comprehensive evaluation of CNN models for HAR, evaluating the effect of changing the number of layers, filters, and filter sizes. The input data was collected from smartphone accelerometer and gyroscope sensors. Ignatov [207] used a one-layer CNN and augmented the extracted features with statistical features before passing them to fully connected layers. The architecture was effective with short time series (1 second) and is therefore useful for real-time activity modeling. One drawback of the \(k \times 1\) kernel approach is that it forces weight sharing across all the input features. This may not be optimal, especially when using data collected from multiple devices. In this case, using a separate CNN for each device [208] allows independent weighting of the features. Similarly, as each sensor is typically tri-axial, a separate CNN can be used for each axis [200, 213]. The features extracted by each CNN are then concatenated and processed either by fully connected layers [200] or an attention head [213].
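To make the behavior of the \(k \times 1\) kernel concrete, the sketch below applies a single kernel along the time axis of every input channel. The function name and the list-of-channels data layout are assumptions for illustration; real models would use many such kernels inside a deep learning framework, followed by a nonlinearity.

```python
def conv_k_by_1(window, kernel):
    """Apply one k x 1 kernel along time to each channel separately.

    window: list of channels, each a list of values over time.
    kernel: list of k weights, shared by every channel.
    Returns one output sequence per channel ("valid" convolution, no padding).
    """
    k = len(kernel)
    return [
        [sum(kernel[j] * channel[t + j] for j in range(k))
         for t in range(len(channel) - k + 1)]
        for channel in window
    ]
```

Because the same weights are used for every channel and each output row depends on one channel only, the weights are shared across features but the features are never mixed.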

While the above two methods are the most common, other studies have proposed alternative CNNs for HAR. DCNN [201] pre-processes the sensor data using a Discrete Fourier Transform to convert IMU data to frequency signals, then uses two-dimensional convolutions to extract combined temporal and frequency features. Lee et al. [205] pre-processed the tri-axial accelerometer data to a magnitude vector, which was then processed in parallel by CNNs with varying kernel sizes, extracting features at different scales. Xu et al. [222] used deformable convolutions [223] in both a 2D-CNN and a ResNet model and found these models performed better than their non-deformable counterparts. Yao et al. [209] proposed a fully convolutional model using 2D temporal and feature convolutions. Their model has two advantages as (1) it handles arbitrary length input sequences and (2) it makes a prediction for each timestep, which avoids the need to pre-process the data into windows and can detect transitions between activities.

7.1.2 Recurrent Neural Networks.

Several LSTM models have been proposed for HAR. Murad and Pyun [206] designed and compared three multi-layered LSTMs, a uni-directional LSTM, a bi-directional LSTM, and a “cascading” LSTM, which has a bi-directional first layer, followed by uni-directional layers. In each case the output from all time steps is used as input to the classification layer. Zeng et al. [210] added two attention layers to an LSTM, a sensor attention layer before the LSTM and a temporal attention layer after the LSTM. They include a regularization term they called “continuous attention” to smooth the transition between attention weights. Guan and Plötz [204] created an ensemble of LSTM models by saving the models at every training epoch, then selecting the best “M” models based on validation set results, thus aiming to reduce model variance.

7.1.3 Hybrid Models.

Many recent studies have focused on hybrid models, combining both CNNs and RNNs. DeepConvLSTM [192] comprises four temporal convolutional layers followed by two LSTM layers, which the authors found to perform better than an equivalent CNN (replacing the LSTM layers with fully connected layers). As the LSTM layers have fewer parameters than fully connected layers, the DeepConvLSTM model was also much smaller. Singh et al. [220] used a CNN to encode the spatial data (i.e., the sensor readings at each timestamp) followed by a single LSTM layer to encode the temporal data, then a self-attention layer to weight the time steps. They found this model performed better than an equivalent model using temporal convolutions in the CNN layers. Challa et al. [214] proposed using three 1D-CNNs with different kernel sizes in parallel, followed by two bi-directional LSTM layers and a fully connected layer. Nafea et al. [219] also used 1D-CNNs with different kernel sizes and bi-directional LSTMs. However, they used separate branches for the CNNs and LSTMs, merging the features extracted in each branch for the final fully connected layer. Mekruksavanich and Jitpattanakul [217] compared a four-layer CNN-LSTM model with a smaller CNN-LSTM model and LSTM models, finding the extra convolutional layers improved performance over the smaller models. DEBONAIR [216] is another multi-layered model. It uses parallel 1D-CNNs, each having different kernel, filter, and pooling sizes to extract different types of features associated with different types of activity. These are followed by a combined 1D-CNN, then two LSTM layers. Mekruksavanich and Jitpattanakul [218] ensembled four different models: a CNN, an LSTM, a CNN-LSTM, and a ConvLSTM model. They aimed to produce a model for biometric user identification that could identify not only the activity being performed but also the participant performing the activity.

A few hybrid models use GRUs instead of LSTMs. InnoHAR [212] is a modified DeepConvLSTM [192], replacing the four CNN layers with inception layers and the two LSTM layers with GRU layers. The authors found this inception model performed better than both the original DeepConvLSTM model and a straight CNN model [202]. AttnSense [211] uses a Fast Fourier transform to generate frequency features, which are then convolved separately for each time step. Attention layers are used to weight the extracted frequency features. These are then passed through a GRU with temporal attention to extract temporal features. CNN-BiGRU [215] uses a CNN layer to extract spatial features from the sensor data, then one or more GRU layers to extract temporal features. The final section of the model is a fully connected module consisting of one or more hidden layers and a softmax output layer.

7.2 Satellite Earth Observation

Ever since NASA launched the first Landsat satellite in 1972 [224], Earth-observing satellites have been recording images of the Earth’s surface, providing 50 years of continuous Earth observation data that can be used to estimate environmental variables informing us about the state of the Earth. Instruments on board the satellites record reflected or emitted electromagnetic radiation from the Earth’s surface and vegetation [225]. The regular, repeated observations from these instruments form satellite image time series (SITS) that are useful for analyzing the dynamic properties of some variables, such as plant phenology. The main modalities used for SITS analysis are multispectral spectrometers and spectroradiometers, which observe the visible and infrared frequencies, and Synthetic Aperture Radar (SAR) systems, which emit a microwave signal and measure the backscatter.

Raw data collected by satellite instruments needs to be pre-processed before being used in machine learning. This is frequently done by the data providers to produce analysis-ready datasets (ARDs). With the increasing availability of compatible ARDs from sources such as Google Earth Engine [226] and various data cubes [227, 228], models combining data from multiple data sources (multi-modal) are becoming more common. These data sources make it straightforward to obtain data that are co-registered (spatially aligned and with the same resolution and projection), thus avoiding the need for complex pre-processing.

Satellite image time series can be processed either (1) as 2D temporal and spectral data, processing each pixel independently and ignoring the spatial dimensions, or (2) as 4D data, including the two spatial dimensions, in which models thus extract spatio-temporal features. This latter method allows estimates to be made at pixel, patch, or object level; however, it requires either more complex models or spatial features to be extracted in a pre-processing step. Feature extraction can be as simple as extracting the mean value for each band. However, both clustering (TASSEL [229]) and neural-network-based methods, such as the Pixel-Set Encoder [230], have been used for more complex feature extraction. The most common use of SITS deep learning is for the classification of the Earth’s surface by land cover and agricultural land by crop types. The classes used can range from very broad land cover categories (such as forest, grasslands, agriculture) through to specific crop types. Other classification tasks include identifying specific features, such as sinkholes [231], burnt areas [232], flooded areas [233], roads [234], deforestation [235], vegetation quality [236], and forest understory and litter types [237].

Extrinsic regression tasks are less common than classification tasks, but several recent studies have investigated methods of estimating water content in vegetation, as measured by the variable Live Fuel Moisture Content (LFMC) [238, 239, 240, 241]. Other regression tasks include estimating the wood volume of forests [242], using a hybrid CNN-MLP model that combines a time series of Sentinel-2 images with a single LiDAR image, and predicting crop yield [243], using a hybrid CNN-LSTM model.

Many different approaches to learning from SITS data have been studied, with studies using all the main deep learning architectures, adapting them for multi-modal learning, and combining architectures in hybrid and ensemble models. The rest of this section reviews the architectures that have been used to model SITS data. A summary of these papers and the embedding architecture is provided in Table 6.

Table 6.

Model	Year	Embedding	Other Features
Crop type classification
TAN [244]	2019	2D-CNN and GRU	Attention—temporal
TGA [245]	2020	2D-CNN	Attention—squeeze and excitation
3D-CNN [246]	2018	3D-CNN
DCM [247]	2020	LSTM	Self-attention
HierbiLSTM [248]	2022	LSTM	Self-attention
L-TAE [249]	2020	MLP	Attention—temporal
PSE-TAE [230, 250]	2020	MLP	Attention—temporal optionally multi-modal
SITS-BERT [251]	2021		Pre-trained transformer
Land Cover classification
1D-CNN [252]	2017	1D-CNN and MLP	Hybrid model
1D & 2D-CNNs [253]	2017	1D-CNN; 2D-CNN	Ensemble model
TempCNN [254]	2019	1D-CNN
TASSEL [229]	2020	1D-CNN	Self-attention
TSI [255]	2021	1D-CNN; LSTM	Ensemble model
TWINNS [256]	2019	2D-CNN and GRU	Attention—temporal; multi-modal
DuPLO [257]	2019	2D-CNN and GRU	Attention—temporal
Sequential RNN [258]	2018	2D-FCN and LSTM	Hybrid model
FG-UNET [259]	2019	UNet and 2D-CNN	Hybrid model
LSTM [260]	2017	LSTM
HOb2sRNN [261]	2020	GRU	Attention—temporal
OD2RNN [262]	2019	GRU	Attention—temporal; multi-modal
SITS-Former [263]	2022	3D-CNN	Pre-trained transformer
Other classification tasks
Deforestation [235]	2022	U-Net and LSTM	Hybrid model
Flood detection [233]	2020	ResNet and GRU	Hybrid model
Forest understory [237]	2022	2D-CNN and LSTM	Ensemble model
Road detection [234]	2020	U-Net and convLSTM	Hybrid model
Vegetation quality [236]	2017	LSTM; GRU
Extrinsic regression tasks
TempCNN-LFMC [239]	2021	1D-CNN
Multi-tempCNN [240]	2022	1D-CNN	Multi-modal, ensemble model
LFMC estimation [238]	2020	LSTM	Multi-modal
LFMC estimation [241]	2022	1D-CNN and LSTM	Multi-modal, hybrid, ensemble
MLDL-net [243]	2020	2D-CNN and LSTM	Hybrid model
SSTNN [264]	2021	3D-CNN and LSTM	Hybrid model
MMFVE [242]	2022	2D-CNN	Hybrid model

Table 6. Summary of SITS Deep Learning Models

7.2.1 Recurrent Neural Networks (RNNs).

One of the first papers to use RNNs for land cover classification was Ienco et al. [260], who showed that an LSTM model out-performed non-deep-learning methods such as Random Forest (RF) and Support Vector Machines (SVMs). However, they also showed that the performance of both RF and SVM improves if trained on features extracted by the LSTM model, and in some cases they were more accurate than the straight LSTM model. Rao et al. [238] used an extrinsic regression LSTM model to estimate LFMC in the western United States.

More commonly, however, RNNs are combined with an attention layer to allow the model to focus on the most important time steps. The OD2RNN model [262] used separate GRU layers followed by attention layers to process Sentinel-1 and Sentinel-2 data, combining the features extracted by each source for the final fully connected layers. HOb2sRNN [261] refined OD2RNN by using a hierarchy of land cover classifications; the model was pretrained using broad land cover classifications, then further trained using the finer-grained classifications. DCM [247] and HierbiLSTM [248] both use a bi-directional LSTM, processing the time series in both directions, followed by a self-attention transformer for a pixel-level crop-mapping model. All these studies found that adding the attention layers improved model performance over a straight GRU or LSTM model.
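As a rough illustration of how an attention layer lets a model focus on the most important time steps, the sketch below pools a sequence of RNN hidden states into a single feature vector. It is a generic dot-product formulation, not the exact layer used in any of the cited models; the function name and the `query` vector (which would be a learned parameter in practice) are assumptions.

```python
import math


def temporal_attention(hidden_states, query):
    """Pool per-time-step hidden states into one context vector.

    hidden_states: list of hidden vectors, one per time step.
    query: vector of the same size (learned in a real model; assumed here).
    Returns (weights, context): softmax attention weights over time steps
    and the attention-weighted sum of the hidden states.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query))
              for h in hidden_states]
    top = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context
```

The weights sum to one, so they can also be inspected after training to see which acquisition dates contributed most to a prediction.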

7.2.2 Convolutional Neural Networks (CNNs).

While many authors have claimed that RNNs out-perform CNNs for land cover and crop-type classification, most of these comparisons are to 2D-CNNs, which ignore the temporal ordering of SITS data [254]. However, other studies show that using 1D-CNNs to extract temporal information and 3D-CNNs to extract spatio-temporal information are both effective methods of learning from SITS data. TempCNN [254] consists of three 1D convolutional layers. The output from the final convolutional layer is passed through a fully connected layer, then the final softmax classification layer. TASSEL [229] is an adaptation of TempCNN for object-based image analysis (OBIA) classification, using TempCNN models to process features extracted from the objects, followed by an attention layer to weight the convolved features. TempCNN has also been adapted for extrinsic regression [239] and used for LFMC estimation [239, 240, 241].

2D-CNNs are mainly used to extract spatial or spatio-temporal features for both pixel and object classification. The model input is usually 4D and the data is convolved spatially, with two main methods used to handle the temporal dimension. In the first method, each time step is convolved separately and the extracted features are merged in later stages of the model [245]. In the second method, the time steps and channels are flattened to form a large multivariate image [242, 253]. FG-UNet [259] is a fully convolutional model that combines both the above methods, first grouping time steps by threes to produce images with 30 channels (10 spectral \(\times\) 3 temporal), which are passed through both U-Net and 2D-CNN layers. Ji et al. [246] used a 3D-CNN to convolve the spatial and temporal dimensions together, combining the strengths of 1D-CNNs and 2D-CNNs. The study found that a 3D-CNN crop classification model performed significantly better than the 2D-CNN, again showing the importance of the temporal features. Another study, SSTNN [264], obtained good results for crop yield prediction by using a 3D-CNN to convolve the spatial and spectral dimensions, extracting spatio-spectral features for each time step. These features were then processed by LSTM layers to perform the temporal modeling.
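The grouping of time steps into multi-channel images can be illustrated with a short sketch. It assumes non-overlapping groups and a nested-list layout `sits[t][c]` holding the 2D plane for time step `t` and spectral band `c`; neither detail is specified in the description of FG-UNet [259], and the function name is hypothetical.

```python
def group_time_steps(sits, group_size=3):
    """Stack consecutive time steps along the channel axis.

    sits[t][c] is the 2D image plane for time step t and spectral band c.
    Returns a list of images, each with group_size * n_bands channels.
    Trailing time steps that do not fill a whole group are dropped.
    """
    grouped = []
    for start in range(0, len(sits) - group_size + 1, group_size):
        channels = []
        for t in range(start, start + group_size):
            channels.extend(sits[t])   # append this time step's bands
        grouped.append(channels)
    return grouped
```

With 10 spectral bands and groups of three time steps, each output image has 30 channels, matching the 10 spectral \(\times\) 3 temporal layout described above. Setting `group_size` to the full series length gives the second method, in which all time steps and channels are flattened into one multivariate image.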

7.2.3 Transformer and Attention Models.

As an alternative to including attention layers with a CNN or RNN, several studies have designed models that process temporal information using only attention layers. PSE-TAE [230] used a modified transformer called a temporal attention encoder (TAE) for crop mapping and found the TAE performed better than either a CNN or an RNN. L-TAE [249] replaced the TAE with a lightweight transformer that is both computationally efficient and more accurate than the full TAE. Ofori-Ampofo et al. [250] adapted the TAE model for multi-modal inputs using Sentinel-1 and Sentinel-2 data for crop-type mapping. Rußwurm and Körner [265] compared a self-attention model with RNN and CNN architectures. They found that this model was more robust to noise than either RNN or CNN and suggested that self-attention is suitable for processing raw, cloud-affected satellite data.

Building on the success of pre-trained transformers for natural language processing such as BERT [94], pre-trained transformers have been proposed for Earth observation tasks [251]. Earth observation tasks are particularly suited for pre-trained models as large quantities of Earth observation data are readily available, while labeled data can be difficult to obtain [266], especially in remote locations. SITS-BERT [251] is an adaptation of BERT [94] for pixel-based SITS classification. For the pretext task, random noise is added to the pixels, and the model is trained to identify and remove this noise. The pre-trained model is then further trained for required tasks such as crop type or land cover mapping. SITS-Former [263] modifies SITS-BERT for patch classification by using 3D-Conv layers to encode the spatial-spectral information, which is then passed through the temporal attention layers. The pretext task used for SITS-Former is to predict randomly masked pixels.
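The idea of a self-supervised pretext task can be sketched as the construction of training examples from unlabeled pixel time series. The example below follows the noise-removal idea in simplified form: it treats each observation as a single scalar rather than a spectral vector, and the function name, corruption probability, and uniform noise model are assumptions rather than the settings used by SITS-BERT [251].

```python
import random


def make_denoising_example(series, corrupt_prob=0.15, noise_scale=0.5, seed=None):
    """Build one pretext-task example from an unlabeled time series.

    Returns (corrupted, target, mask):
      corrupted: copy of the series with random noise added at some steps,
      target:    the original clean series the model should recover,
      mask:      True at the time steps that were corrupted.
    """
    rng = random.Random(seed)
    corrupted, mask = [], []
    for value in series:
        if rng.random() < corrupt_prob:
            corrupted.append(value + rng.uniform(-noise_scale, noise_scale))
            mask.append(True)
        else:
            corrupted.append(value)
            mask.append(False)
    return corrupted, list(series), mask
```

No class labels are needed to build these examples, which is why such tasks suit Earth observation data. A masked-prediction task like the one used for SITS-Former [263] could be sketched the same way by replacing the selected values with a placeholder instead of adding noise.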

7.2.4 Hybrid Models.

A common use of hybrid models is to use a CNN to extract spatial features and an RNN to extract temporal features. Garnot et al. [267] compared a straight 2D-CNN model (thus ignoring the temporal aspect), a straight GRU model (thus ignoring the spatial aspect), and a combined 2D-CNN and GRU model (thus using both spatial and temporal information) and found the combined model gave the best results, demonstrating that both the spatial and temporal dimensions provide useful information for land cover mapping and crop classification. DuPLO [257] was one of the first models to exploit this method, running a CNN and ConvGRU model in parallel, then fusing the outputs using a fully connected network for the final classifier. During training, an auxiliary classifier for each component was used to enhance the discriminative power. TWINNS [256] extended DuPLO to a multi-modal model, using time series of both Sentinel-1 (SAR) and Sentinel-2 (Optical) images. Each modality was processed by separate CNN and ConvGRU models, and then the output features from all four models were fused for classification.

Other hybrid models include Li et al. [244], who used a CNN for spatial and spectral unification of Landsat-8 and Sentinel-2 images, which were then processed by a GRU. MLDL-Net [243] is a 2D-CNN extrinsic regression model, using CNNs to extract time step features, which are then passed through an LSTM model to extract temporal features. Fully connected layers combine the feature sets to predict crop yield. Rußwurm and Körner [258] extracted temporal features first, using a bi-directional LSTM, then used a fully convolutional 2D-CNN to incorporate spatial information and classify each pixel in the input patch.

7.2.5 Ensemble Models.

One of the easiest ways to ensemble DL models is to train multiple homogeneous models that vary only in the random weight initialization [268]. Di Mauro et al. [252] ensembled 100 land use and land cover (LULC) models with different weight initializations by averaging the softmax predictions. They found this produced a more stable and stronger classifier that outperformed the individual models. Multi-tempCNN [240], a model for LFMC estimation, is an ensemble of homogeneous models for extrinsic regression. The authors suggested that as an additional benefit, the variance of the individual model predictions can be used to obtain a measure of uncertainty of the estimates. TSI [255] also ensembles a set of homogeneous models, but instead of relying on random weight initialization to introduce model diversity, the time series are segmented and models trained on each segment.
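Both ways of combining homogeneous models mentioned above reduce to simple statistics over the member outputs. The following sketch is illustrative only: the function names are hypothetical, and the use of the population variance as the uncertainty measure is an assumption rather than the exact formulation in [240].

```python
from statistics import mean, pvariance


def average_softmax(member_probs):
    """Classification ensemble: average the members' softmax outputs.

    member_probs[m][k] is the probability of class k from member m.
    Returns (averaged probabilities, index of the predicted class).
    """
    n_classes = len(member_probs[0])
    avg = [mean(p[k] for p in member_probs) for k in range(n_classes)]
    return avg, max(range(n_classes), key=avg.__getitem__)


def regression_ensemble(member_preds):
    """Regression ensemble: mean prediction plus the variance across
    members, usable as a rough indication of uncertainty."""
    return mean(member_preds), pvariance(member_preds)
```

A large variance across members suggests the ensemble disagrees about an estimate, which can be used to flag predictions that deserve less trust.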

Other methods create ensembles of heterogeneous models. Kussul et al. [253] compared ensembles of 1D-CNNs and 2D-CNNs models for land cover classification. Each model in the ensemble used a different number of filters, thus finding different feature sets useful for classification. Xie et al. [241] ensembled three heterogeneous models—a causal temporal convolutional neural network (TCN), an LSTM, and a hybrid TCN-LSTM model—for an extrinsic regression model to estimate LFMC. The ensembles were created using stacking [269]. The authors compared this method to boosting their TCN-LSTM model, using Adaboost [270] to create a three-member ensemble, and found that stacking a diverse set of models out-performed boosting.

7.2.6 EO Surveys and Reviews.

This survey is one of very few that include a section focusing specifically on deep learning TSC and TSER tasks using SITS data. However, there are other reviews that provide further information about related topics. Gomez et al. [271] is an older review highlighting the important role of SITS data for land cover classification. Zhu et al. [272] reviewed the advances and challenges in DL for remote sensing and the resources available that are potentially useful to help DL address some of the major challenges facing humanity. Ma et al. [273] studied the role of deep learning in Earth observation using remotely sensed data. Their review covers a broad range of tasks including image fusion, image segmentation, and object-based analysis, as well as classification tasks. Yuan et al. [274] provide a review of DL applications for remote sensing, comparing the role of DL versus physical modeling of environmental variables and highlighting challenges in DL for remote sensing that need to be addressed. Chaves et al. [275] reviewed recent research using Landsat 8 and/or Sentinel-2 data for land cover mapping. While not focused on SITS DL methods, the review notes the growing importance of these methods. Moskolai et al. [276] provide a review of forecasting applications using DL with SITS data, including an analysis of the main DL architectures that are relevant for classification as well as forecasting.

8 Conclusion

In conclusion, this survey article has discussed a variety of deep network architectures for time series classification and extrinsic regression tasks, including multilayer perceptrons, convolutional neural networks, recurrent neural networks, and attention-based models. We have also highlighted refinements that have been made to improve the performance of these models on time series tasks. Additionally, we have discussed two critical applications of time series classification and regression, human activity recognition and satellite Earth observation. Overall, using deep network architectures and refinements has enabled significant progress in the field of time series classification and will continue to be essential for addressing a wide range of real-world problems. We hope this survey will stimulate further research using deep learning techniques for time series classification and extrinsic regression. Additionally, we provide a carefully curated collection of sources, available at https://github.com/Navidfoumani/TSC_Survey, to further support the research community.

A Non-deep-learning Time Series Classification

In this section, we aim to give a brief introduction to the field of TSC and discuss its current status. We refer interested readers to the “bake-off” papers [11, 25, 26], which describe TSC methods in much more detail and benchmark them.

Research in TSC started with distance-based approaches that find discriminating patterns in the shape of the time series. Distance-based approaches usually consist of coupling a 1-nearest neighbor (1NN) classifier with a time series distance measure [277, 278]. Small distortions in the time series can lead to false matches when measuring the distance between time series using standard distance measures such as Euclidean distance [277]. A time series distance measure aims to compensate for these distortions by aligning two time series such that the alignment cost between them is minimized. Many time series distances have been proposed in the literature; among these, the \(DTW\) distance is one of the most popular choices for many time series tasks, due to its intuitiveness and effectiveness in aligning two time series. 1NN-\(DTW\) has been the go-to method for TSC for decades. Indeed, by comparing several time series distance measures, the work in [277] showed that as of 2015, there was no single distance that significantly outperformed \(DTW\) when used with a 1NN classifier. The recent Amerced \(DTW\) [279] distance is the first distance that is significantly more accurate than \(DTW\). Individual 1NN classifiers with different distances can be combined into an ensemble, such as the Ensemble of Elastic distances (EE), that significantly outperforms each of them individually [277, 278]. However, since most distances have a complexity of \(O(L^2)\), where \(L\) is the length of the series, performing a nearest neighbor search becomes very costly. Hence, distance-based approaches are considered to be among the slowest methods for TSC [280, 281].
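To make the contrast between Euclidean distance and \(DTW\) concrete, the following is a minimal sketch of unconstrained \(DTW\) coupled with a 1NN classifier. It assumes univariate series and a squared pointwise cost; the function names and the toy series are hypothetical and are not taken from any of the cited implementations. The double loop over both series is the source of the \(O(L^2)\) cost mentioned above.

```python
import math

def dtw_distance(a, b):
    """Unconstrained DTW between two univariate series (squared pointwise cost)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):          # O(L^2): every pair (i, j) is visited once
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

def euclidean_distance(a, b):
    """Lock-step distance: point i of a is only ever compared with point i of b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train_series, train_labels, query, distance=dtw_distance):
    """Return the label of the training series closest to the query."""
    best = min(range(len(train_series)), key=lambda i: distance(train_series[i], query))
    return train_labels[best]

# Toy data (hypothetical): the query has the same peak as `peak_early`, shifted in time.
peak_early = [0, 0, 1, 2, 1, 0, 0, 0]
peak_late = [0, 0, 0, 0, 1, 2, 1, 0]
flat = [0.5] * 8
train, labels = [peak_early, flat], ["peak", "flat"]

label_dtw = predict_1nn(train, labels, peak_late)
label_euclidean = predict_1nn(train, labels, peak_late, distance=euclidean_distance)
```

On this toy example the time shift makes the Euclidean 1NN match the query to the flat series, whereas \(DTW\) aligns the two peaks at zero cost and recovers the intended label.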

As a result of EE, recent studies have focused mainly on developing ensembling methods that significantly outperform 1NN-\(DTW\) [278, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290]. These approaches use either an ensemble of tree-based approaches [289, 290] or an ensemble of different types of discriminant classifiers, such as NN with several distances and SVM on one or several feature spaces [282, 285, 286, 287]. All of these approaches share a common property—the data transformation phase where the time series is transformed into a new feature space such as the shapelets transform [286] or DTW features [285]. Taking advantage of this notion led to the development of the Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) [281, 284]. HIVE-COTE is a meta ensemble for TSC and forms its ensemble from ensemble classifiers of multiple domains. Since its introduction in 2016 [284], HIVE-COTE has gone through a few iterations. Recently, the latest HIVE-COTE version, HIVE-COTEv2.0 (HC2), was proposed [281]. It is composed of four ensemble members, each of them being the then state of the art in their respective domains. It is currently one of the most accurate classifiers for both univariate and multivariate TSC tasks [281]. Despite being accurate on 26 multivariate and 142 univariate TSC benchmark datasets that are relatively small, HC2 scales poorly on large datasets with long time series as well as datasets with large numbers of channels.

Considerable work has been done on speeding up TSC methods without sacrificing accuracy [14, 278, 291, 292, 293, 294, 295]. A recent breakthrough is the development of Rocket [14], which was able to process 109 univariate time series datasets in under 4 hours, while the previous fastest method took days. Rocket leverages a large number of random convolutional filters to extract features from each series that might be relevant to classifying a series. These features are then passed to a linear model for classification. Rocket has been improved to be faster (Minirocket [291]) and more accurate (Multirocket [292] and Hydra [293]). Hydra, when combined with Multirocket, is now one of the fastest and most accurate methods for TSC.
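The idea of extracting features with random convolutional filters can be illustrated with a heavily simplified sketch. It is not the reference Rocket implementation: the uniform sampling of the dilation, the absence of padding, the small number of kernels, and all function names are assumptions made for brevity, and the series is assumed to contain at least 11 points. Each kernel contributes two features, the maximum response and the proportion of positive values, and the resulting vector is what would be handed to a linear classifier.

```python
import math
import random

def random_kernel(rng, series_length):
    """Draw one random kernel: mean-centred weights, a bias, and a dilation."""
    length = rng.choice([7, 9, 11])
    weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
    mean = sum(weights) / length
    weights = [w - mean for w in weights]
    bias = rng.uniform(-1.0, 1.0)
    # Largest dilation that still lets the kernel fit inside the series.
    max_dilation = max(1, (series_length - 1) // (length - 1))
    dilation = rng.randint(1, max_dilation)  # simplification: sampled uniformly
    return weights, bias, dilation

def apply_kernel(series, kernel):
    """Slide one dilated kernel over the series; return (max, proportion of positive values)."""
    weights, bias, dilation = kernel
    span = (len(weights) - 1) * dilation
    outputs = [
        bias + sum(w * series[start + k * dilation] for k, w in enumerate(weights))
        for start in range(len(series) - span)
    ]
    ppv = sum(o > 0 for o in outputs) / len(outputs)
    return max(outputs), ppv

def transform(series, kernels):
    """Concatenate the two features of every kernel into one feature vector."""
    features = []
    for kernel in kernels:
        features.extend(apply_kernel(series, kernel))
    return features

rng = random.Random(0)                                   # fixed seed for repeatability
series = [math.sin(i / 3.0) for i in range(50)]          # hypothetical input series
kernels = [random_kernel(rng, len(series)) for _ in range(100)]
features = transform(series, kernels)                    # 2 features per kernel
```

Because the kernels are never trained, the only learned component is the linear model fitted on `features`, which is what keeps this family of methods fast.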

B DNN Architectures for Time Series

In this section, we provide a descriptive overview of deep-learning-based models for TSC. The focus is on clarifying their architectures and outlining their adaptations to the specific characteristics of time series data.

B.1 Convolutional Neural Networks (CNNs)

Many variants of CNN architectures have been proposed in the literature, but their primary components are very similar. LeNet-5 [296], for example, consists of three types of layers: convolutional, pooling, and fully connected. The purpose of the convolutional layer is to learn feature representations of the inputs. Figure 2(a) shows the architecture of the t-LeNet network, which is a time-series-specific version of LeNet. This figure shows that the convolution layer is composed of several convolution kernels (or filters) used to compute different feature maps. In particular, each neuron of a feature map is connected to a region of neighboring neurons in the previous layer called the receptive field. Feature maps can be created by first convolving inputs with learned kernels and then applying an element-wise nonlinear activation function to the convolved results. It is important to note that all spatial locations of the input share the kernel for each feature map, and several kernels are used to obtain the complete set of feature maps.

Fig. 2.

The feature value at location \((i,j)\) in the \(k\)th feature map of the \(l\)th layer is obtained by

\begin{equation} Z^l_{i,j,k}={{\bf W}^l_k}^T {\bf A}^{l-1}_{i,j}+b^l_k, \end{equation}

(2)

where \({\bf W}^l_k\) and \(b^l_k\) are the weight vector and bias term of the \(k\)th filter of the \(l\)th layer, respectively, and \({\bf A}^{l-1}_{i,j}\) is the input patch centered at location \((i, j)\) of the \(l\)th layer. Note that the kernel \({\bf W}^l_k\) that generates the feature map \(Z^l_{:,:,k}\) is shared. A weight-sharing mechanism has several advantages, such as reducing model complexity and making the network easier to train. Let \(f\left(.\right)\) denote the nonlinear activation function. The activation value of convolutional feature \(Z^l_{i,j,k}\) can be computed as

\begin{equation} {\bf A}^{l}_{i,j,k} = f(Z^l_{i,j,k}). \end{equation}

(3)

The most common activation functions are sigmoid, tanh, and ReLU [297]. As shown in Figure 2(a), a pooling layer is often placed between two convolution layers to reduce the resolution of the feature maps and to achieve shift invariance. Following several convolution stages (a convolution stage is the block comprising convolution, activation, and pooling), there may be one or more fully connected layers that aim to perform high-level reasoning. As discussed in Section 3.1, each neuron in the previous layer is connected to every neuron in the current layer to generate global semantic information. The final layer of a CNN is the output layer, in which the softmax operator is commonly used for classification tasks [40].
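A single convolution stage can be sketched for a univariate time series as follows. This is an illustrative simplification of Equations (2) and (3) with one temporal axis instead of the two spatial indices \((i,j)\); the ReLU activation, the non-overlapping max pooling, and the kernel values are assumptions chosen so that the result can be checked by hand.

```python
def conv1d(series, kernels, biases):
    """Valid convolution: Z[i][k] = W_k . A[i:i+size] + b_k, with W_k shared across all i."""
    size = len(kernels[0])
    return [
        [sum(w * x for w, x in zip(kernel, series[i:i + size])) + b
         for kernel, b in zip(kernels, biases)]
        for i in range(len(series) - size + 1)
    ]

def relu(feature_maps):
    """Element-wise nonlinearity f(.) applied to every feature value."""
    return [[max(0.0, z) for z in row] for row in feature_maps]

def max_pool(feature_maps, width):
    """Non-overlapping max pooling along time, applied to each feature map separately."""
    pooled = []
    for start in range(0, len(feature_maps) - width + 1, width):
        window = feature_maps[start:start + width]
        pooled.append([max(column) for column in zip(*window)])
    return pooled

series = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 1.0, 4.0, 2.0]
kernels = [[-1.0, 0.0, 1.0],   # responds to upward trends
           [1.0, 1.0, 1.0]]    # responds to high local level
biases = [0.0, -4.0]

# One convolution stage: convolution -> activation -> pooling.
pooled = max_pool(relu(conv1d(series, kernels, biases)), width=2)
# pooled == [[3.0, 2.0], [0.0, 1.0], [5.0, 0.0]]
```

Each row of `pooled` is one (downsampled) time position and each column is one feature map; a fully connected layer would then operate on the flattened result.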

B.2 Recurrent Neural Networks (RNNs)

RNNs are a type of neural network specifically designed to process time series and other sequential data. RNNs are conceptually similar to FFNs. While FFNs map from fixed-size inputs to fixed-size outputs, RNNs can process variable-length inputs and produce variable-length outputs. This capability is enabled by sharing parameters over time through directed connections between individual layers. RNN models for TSC can be classified as sequence to sequence or sequence to one based on their outputs. Figure 2(b) shows sequence-to-sequence architectures for RNN models, with an output for each input sub-series. On the other hand, in a sequence-to-one architecture, decisions are made using only \(y^T\), ignoring the other outputs.

At each time step \(t\), RNNs maintain a hidden vector \(h\), which is updated as follows [298, 299]:

\begin{equation} h_t = tanh(Wh_{t-1} + Ix^{t}), \end{equation}

(4)

where \(X =\left\lbrace x^1, \ldots , x^{t-1},x^t, \ldots , x^T\right\rbrace\) contains all of the observations, \(tanh\) denotes the hyperbolic tangent function, and the recurrent weight and the projection matrix are denoted by \(W\) and \(I\), respectively. The hidden-to-hidden connections also model the short-term time dependency. The hidden state \(h\) is used to make a prediction as

\begin{equation} y^t = \sigma _s(Wh_{t-1}), \end{equation}

(5)

where \(\sigma _s\) is a softmax function and provides a normalized probability distribution over the possible classes. As depicted in Figure 2(b), the hidden state \(h\) can be used to stack RNNs in order to build deeper networks:

\begin{equation} h_t^l = \sigma (Wh^l_{t-1} + Ih^{l-1}_t), \end{equation}

(6)

where \(\sigma\) is the logistic sigmoid function. As an alternative to feeding each time step to the RNN, the data can be divided into time windows of \(\omega\) observations, with the option for variable overlaps. Each time window is labeled with the majority response labels within the \(\omega\) window.
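The recurrence in Equation (4) and a sequence-to-one readout can be sketched as below. The update follows \(h_t = tanh(Wh_{t-1} + Ix^{t})\) directly; the separate output matrix `V`, the use of the final hidden state for the prediction, the zero initial state, and all weight values are assumptions introduced for illustration so that the number of classes need not equal the hidden size.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(W, I, h_prev, x_t):
    """One update of the hidden state: h_t = tanh(W h_{t-1} + I x^t)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W, h_prev), matvec(I, x_t))]

def classify_sequence(W, I, V, xs):
    """Sequence-to-one: run the recurrence over all steps, predict from the last state."""
    h = [0.0] * len(W)                 # assumed initial hidden state
    for x_t in xs:                     # the same W and I are reused at every step
        h = rnn_step(W, I, h, x_t)
    return softmax(matvec(V, h))       # V is a hypothetical output projection

W = [[0.5, -0.2], [0.1, 0.3]]          # recurrent weights (hidden size 2)
I = [[1.0], [-1.0]]                    # input projection (1 input channel)
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 classes
xs = [[0.1], [0.4], [-0.3]]            # a series of length 3

probs = classify_sequence(W, I, V, xs)
```

Because the parameters are shared across time steps, the same function accepts a series of any length without modification.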

B.3 Attention-based Model

B.3.1 Self-attention.

The attention mechanism was introduced by [300] for improving the performance of encoder-decoder models [301] in neural machine translation. The encoder-decoder in neural machine translation encodes a source sentence into a vector in latent space and decodes the latent vector into a target language sentence. As shown in Figure 3(a), the attention mechanism allows the decoder to attend to segments of the source for each target step through a context vector \(c_t\). For this model, a variable-length attention vector \(\alpha _t\), whose length equals the number of source time steps, is derived by comparing the current target hidden state \(h_t\) with each source hidden state \(\overline{h}_s\) as follows [302]:

\begin{equation} \alpha _t(s)= \frac{exp(score(h_t,\overline{h}_s))}{\sum _{s^{\prime }}exp(score(h_t,\overline{h}_{s^{\prime }}))}. \end{equation}

(7)

The term \(score\) is referred to as an alignment model and used to compare the target hidden state \(h_t\) with each of the source hidden states \(\overline{h}_s\), and the result is normalized to produce attention weights (a distribution over source positions). There are various choices of the scoring function:

\begin{equation} score(h_t,\overline{h}_s)= {\left\lbrace \begin{array}{ll}h_t^TW\overline{h}_s \\ v_{\alpha }^Ttanh(W_{\alpha }[h_t;\overline{h}_s]). \end{array}\right.} \end{equation}

(8)

These scores influence the attention distribution, impacting how the model attends to different parts of the input sequence during predictions. As shown above, the score function is parameterized as an FFN that is jointly trained with all the other components of the model. The model directly computes soft attention, allowing the cost function’s gradient to be backpropagated [300]. Given the alignment vector as weights, the context vector \(c_t\) is computed as the weighted average over all the source hidden states:

\begin{equation} c_t = \sum _{s}\alpha _t(s)\overline{h}_s. \end{equation}

(9)

Accordingly, the computation path goes from \(h_t\rightarrow \alpha _t \rightarrow c_t \rightarrow \widetilde{h}_t\) and then makes a prediction using a \(Softmax\) function [302]. Note that \(\widetilde{h}_t\) is a refined hidden state that incorporates both the original hidden state \(h_t\) and the context information \(c_t\) obtained through attention mechanisms.
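The path from \(h_t\) to \(\alpha _t\) to \(c_t\) can be sketched as follows. The sketch uses the bilinear form \(h_t^TW\overline{h}_s\) from Equation (8) as the scoring function and stops at the context vector, since the exact form of the refinement to \(\widetilde{h}_t\) is not specified here. The function names and numeric values are hypothetical.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def general_score(h_t, h_s, W):
    """Bilinear alignment score h_t^T W h_s (first option in Eq. 8)."""
    return dot(h_t, matvec(W, h_s))

def attention(h_t, source_states, W):
    """Return the attention weights alpha_t (Eq. 7) and the context vector c_t (Eq. 9)."""
    scores = [general_score(h_t, h_s, W) for h_s in source_states]
    weights = softmax(scores)          # one weight per source time step
    dim = len(source_states[0])
    context = [sum(a * h_s[d] for a, h_s in zip(weights, source_states))
               for d in range(dim)]
    return weights, context

h_t = [1.0, 0.0]                                      # current target hidden state
sources = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]      # source hidden states
W = [[1.0, 0.0], [0.0, 1.0]]                          # identity, for readability

weights, context = attention(h_t, sources, W)
```

In this example the first source state is the only one aligned with \(h_t\), so nearly all of the attention mass falls on it and the context vector is close to that state.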

Fig. 3.

B.3.2 Transformers.

Similar to self-attention and other competitive neural sequence models, the original transformer developed for natural language processing (hereinafter the vanilla transformer) has an encoder-decoder structure that takes as input a sequence of words from the source language and then generates the translation in the target language [93]. Both the encoder and decoder are composed of multiple identical blocks. Each encoder block consists of a multi-head self-attention module and a position-wise FFN, while each decoder block inserts a cross-attention module between the multi-head self-attention module and the position-wise FFN. Unlike RNNs, transformers do not use recurrence and instead model sequence information using the positional encoding in the input embeddings.

The transformer architecture is based on finding associations or correlations between various input segments using the dot product. As shown in Figure 3(b), the attention operation in transformers starts with building three different linearly weighted vectors from the input \(x_i\), referred to as query (\(q_i\)), key (\(k_i\)), and value (\(v_i\)):

\begin{equation} {\bf q}_i = W_q{\bf x}_i, \quad {\bf k}_i = W_k{\bf x}_i, \quad {\bf v}_i = W_v{\bf x}_i, \end{equation}

(10)

where \(W_q,W_k\), and \(W_v\) are learnable weight matrices. The output vectors \({\bf z}_i\) are given by

\begin{equation} {\bf z}_i=\sum _{j}softmax\left(\frac{{\bf q}_i^T{\bf k}_j}{\sqrt {d_q}}\right){\bf v}_j. \end{equation}

(11)

Note that the weighting of each value vector \({\bf v}_j\) depends on the mapped correlation between the query vector \({\bf q}_i\) at position \(i\) and the key vector \({\bf k}_j\) at position \(j\), with the softmax taken over all positions \(j\). The value of the dot product tends to grow with the increasing size of the query and key vectors. As the softmax function is sensitive to large values, the dot products are divided by the square root of the size of the query and key vectors, \(d_q\), before the softmax is applied. The input data may contain several levels of correlation information, and the learning process may benefit from processing the input data in multiple different ways. Multiple attention heads are therefore introduced that operate on the same input in parallel and use different weight matrices \(W_q,W_k\), and \(W_v\) to extract various levels of correlation between the input data.
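A single attention head can be sketched as below, following the standard scaled dot-product formulation in which each output \({\bf z}_i\) is a weighted sum of the value vectors over all positions \(j\), with weights given by a softmax over the scaled query-key dot products. The function names, the identity weight matrices, and the toy input are assumptions for illustration only.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    """One attention head over a sequence X of input vectors x_i."""
    Q = [matvec(Wq, x) for x in X]     # queries q_i = W_q x_i  (Eq. 10)
    K = [matvec(Wk, x) for x in X]     # keys    k_i = W_k x_i
    V = [matvec(Wv, x) for x in X]     # values  v_i = W_v x_i
    scale = math.sqrt(len(Q[0]))       # sqrt(d_q)
    Z = []
    for q in Q:
        # Softmax over positions j of the scaled dot products q_i . k_j
        weights = softmax([dot(q, k) / scale for k in K])
        Z.append([sum(a * v[d] for a, v in zip(weights, V))
                  for d in range(len(V[0]))])
    return Z

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three time steps, two features
Z = self_attention(X, identity, identity, identity)
```

A multi-head layer would call `self_attention` several times with different weight matrices and combine the resulting outputs, which is one way to realize the parallel heads described above.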

References

[1]

Qiang Yang and Xindong Wu. 2006. 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5, 4 (2006), 597–604.