DSTSPYN: a dynamic spatial-temporal similarity pyramid network for traffic flow prediction

Wang, Xing; Chen, Feifei; Jin, Biao; Lin, Mingwei; Zou, Fumin; Zeng, Ruihao

doi:10.1007/s10489-024-06198-z


Open access
Published: 28 December 2024

Volume 55, article number 237, (2025)

Applied Intelligence




Abstract

Traffic flow prediction plays a crucial role in intelligent transportation systems as it enables effective control and management of urban traffic. However, existing methods based on Graph Convolutional Networks (GCNs) primarily use local neighborhood information for message passing, resulting in limited perception of global structures. Extracting spatial-temporal similarity features is also challenging because of the constraints of graph structures. To address these issues, we propose a novel traffic flow prediction model based on a Dynamic Spatial-Temporal Similarity Pyramid Network (DSTSPYN). Our model employs a spatial-temporal pyramid architecture, which dynamically adjusts the weights of central, edge, and global spatial-temporal features using an enhanced attention mechanism. Furthermore, it captures dynamic temporal dependencies at different scales through pyramid gated convolution. The spatial similarity features of different time steps are extracted through the spatial-temporal global similarity (STGS) module. We evaluate our model on four public transportation datasets and demonstrate that the DSTSPYN model outperforms several baseline methods in terms of prediction accuracy. It effectively captures the dynamic spatial-temporal correlations of the road network and edge node features, making it well-suited for long-term traffic flow prediction.


1 Introduction

Traffic data prediction is a fundamental task in spatial-temporal data mining and an important component of intelligent transportation systems. Accurate traffic prediction plays a vital role in aiding traffic management authorities in devising more effective traffic dispatch and control strategies. By forecasting future traffic flow, intelligent transportation systems can optimize signal phasing, adapt lane planning, and ultimately reduce traffic congestion while enhancing traffic mobility. Traffic flow prediction involves forecasting traffic flow conditions on specific road segments through the analysis of historical traffic data [2, 45] and other relevant information, necessitating comprehensive modeling of spatial-temporal correlations.

Early researchers employed classical statistical models [14, 26, 29, 34, 38] to predict future traffic conditions, but these models were limited by assumptions of data linearity and smoothness. Machine learning-based methods [7, 8, 15, 28, 35, 39] can capture nonlinear patterns and interactions in traffic flow data, but they struggle to handle the temporal correlation and dynamic characteristics of traffic flow. Additionally, they may face challenges in processing large-scale data and high-dimensional features. Deep learning methods based on recurrent neural networks (RNNs) [10, 31] overcome these limitations and are widely used to extract long-term and short-term dependencies in time series data. However, they cannot effectively model spatial correlations in the traffic network. As a result, subsequent research introduced CNN-based methods [13, 47] to capture spatial correlations. However, these approaches divide the road network into regular grid structures, which fail to reflect the irregular graph structure of the traffic network and struggle to extract complex spatial correlations between traffic nodes.

With the advancement of graph deep learning, Graph Convolutional Networks (GCNs) and their variants [32, 40, 42, 46] have been extensively applied to spatial-temporal data prediction tasks. These methods typically treat sensors established in the traffic network as nodes and construct a traffic graph based on the road network and distances between nodes. They update node features through information propagation between nodes. However, existing GCNs primarily utilize local neighborhood information for message passing, limiting the model’s perception of global structure. Specifically, the traffic flow at a certain node is influenced not only by its neighboring nodes but also by the traffic conditions at nodes that are far away. As shown in Fig. 1, when a traffic accident occurs at one intersection, it may lead to congestion at another intersection located at a considerable distance. In this scenario, the local message-passing mechanism of traditional GCNs is unable to effectively capture such global dynamic dependencies, limiting the model’s comprehensive understanding of complex traffic situations. Consequently, the limitations of local information can lead to a significant decline in the model’s performance when predicting traffic flow.

In the temporal dimension, traffic flow varies significantly across different time periods. For example, flow patterns during peak hours differ substantially from those during off-peak hours. As shown in Fig. 2, traffic congestion typically occurs at specific times, making it impractical to predict flow during other periods solely based on congestion data. Most existing graph convolutional methods share weights across all time steps, applying the same feature updating mechanism regardless of whether it is a peak or non-peak period. This approach neglects the differences in spatial-temporal correlations between time steps, failing to account for the timeliness and complexity of traffic flow changes. Additionally, current models often face challenges in capturing both short-term and long-term dynamic temporal dependencies in traffic data, further limiting their predictive performance.

To address the aforementioned issues, in this paper, we propose a Dynamic Spatial-Temporal Similarity Pyramid Network (DSTSPYN) model. This model surpasses existing ones by more effectively extracting global dynamic spatial-temporal information. It also delves deeper into modeling spatial similarities in traffic data, successfully tackling the challenges posed by complex traffic scenarios. Firstly, we introduce the Spatial-Temporal Global Correlation (STGC) matrix, which learns graph representations by comparing the difference between the joint distribution of embedded features and the marginal product. This approach successfully captures the global features of the graph. By calculating the similarity of STGC matrices at each time step, we obtain the spatial similarity between adjacent time intervals in the road network. This information is utilized to adjust the weights of graph convolutions at each time step, effectively addressing data heterogeneity. Secondly, we design a spatial-temporal pyramid structure and employ an improved multi-head self-attention mechanism to assign distinct weights to each layer of the spatial pyramid. This design effectively captures spatial-temporal correlations at multiple granularities, improving the model’s adaptability to dynamic traffic environments. The main contributions of this paper are as follows:

We propose the STGC matrix, which captures global spatial information effectively. By examining the similarity of STGC matrices at each time step, it also identifies spatial similarities between consecutive time steps within the road network.
We transform multivariate time series into a multi-layer pyramid structure to efficiently extract spatial-temporal features at different scales. This approach enhances the model’s flexibility in handling diverse traffic patterns.
By employing an improved multi-head self-attention mechanism, the model explores spatial-temporal dependencies and adaptively adjusts the weights of local, global, and graph-level spatial-temporal features. This enhances the model’s ability to perceive dynamic temporal dependencies in the road network.
Extensive experiments on four benchmark datasets show that the proposed model outperforms multiple baseline methods, thoroughly validating its effectiveness in spatial-temporal data prediction.

The remaining parts of this paper are organized as follows. Section 2 provides a literature review on traffic prediction. Section 3 elaborates on the definition of the problem addressed in this paper. Section 4 introduces the details of the DSTSPYN model. Section 5 evaluates the predictive performance of the proposed model. Finally, Section 6 concludes the paper with closing remarks.

2 Related works

2.1 Traffic prediction based on probability and statistics

Traditional methods encompass model-driven approaches and statistical methods. Model-driven approaches mainly aim to elucidate the instantaneous and steady-state relationships among traffic volume, speed, and density. These approaches necessitate the construction of comprehensive and detailed system models based on prior knowledge. Nevertheless, in real-world settings, traffic data is influenced by multiple factors, rendering it arduous for existing models to precisely capture the fluctuations in traffic data. As traffic data collection and storage technologies have rapidly advanced, researchers have begun to redirect their attention towards data-driven methods.

Classical statistical models are representatives of data-driven methods. These methods assume that data samples follow a specific distribution and exhibit linear relationships between the data. Common statistical methods include time series models, linear regression models, and Kalman filter models. In time series analysis, auto-regressive integrated moving average (ARIMA) and its variants are widely used. Hamed et al. [14] used the ARIMA model in 1995 to predict traffic flow on urban roads. Subsequently, Van et al. [38] combined Kohonen self-organizing maps with ARIMA to improve predictive performance. Kamarianakis et al. [17] incorporated spatial features into the spatial-temporal auto-regressive integrated moving average (STARIMA) model. The Vector Auto-Regression (VAR) method [29] regresses lagged variables based on auto-regression and extends to multivariate sequence analysis. Sun et al. [34] proposed a local linear predictor to address interval prediction problems in traffic data time series. The Kalman filter model predicts future traffic conditions based on the traffic states at the previous and current time steps. In 1984, Okutani et al. [26] established a traffic flow state prediction model based on Kalman filter theory, which can estimate dynamic system states from real-time noise data with good performance. However, when traffic flow undergoes drastic changes, the Kalman filter method may result in significant over-prediction or under-prediction. Guo et al. [11] subsequently proposed an adaptive Kalman filter with updatable process variance, demonstrating better adaptability during highly fluctuating traffic flow.

Although traditional traffic prediction methods are computationally simple and have good interpretability, they are limited by the static assumptions of time series and are unable to effectively extract the nonlinear and uncertain characteristics of traffic data. Therefore, their performance is often unsatisfactory. To address these issues, machine learning methods have been applied in the field of traffic prediction and have achieved good results.

2.2 Traffic prediction based on machine learning

Machine learning methods can approximate traffic flow patterns of varying complexity when provided with sufficient historical data. These methods automatically learn statistical patterns in traffic data. Common models include k-nearest neighbor (k-NN), support vector machine (SVM), and Bayesian network models. Xiaoyu et al. [43] proposed a two-layer k-NN algorithm that enhances computational speed and accuracy. Duan et al. [7] introduced the PSO-SVM model, which leverages particle swarm optimization (PSO) to optimize parameter selection for SVM-based traffic flow prediction. Expanding on this, Feng et al. [8] developed an adaptive multi-kernel support vector machine algorithm (AMSVM-STC) that adjusts kernel weights adaptively based on spatial-temporal correlations and real-time traffic trends. Sun et al. [35] proposed a Bayesian network-based method for traffic flow prediction, addressing challenges posed by incomplete data. Petridis et al. [28] introduced a Bayesian combination method (BCM), which generates weighted combinations of predictions using posterior probabilities and Bayesian rules. However, BCM does not account for correlations between historical and current traffic flows. To address this, Wang et al. [39] proposed an improved BCM that assumes traffic flow within a prediction interval correlates only with flows from previous intervals. This enhanced sensitivity to predictor perturbations improves prediction accuracy. Pascale et al. [27] introduced an adaptive Bayesian network model capable of adjusting its topology to accommodate the non-stationary characteristics of traffic flow.

While machine learning-based methods excel at handling complex nonlinear relationships and interactions in traffic data, they often struggle to capture the dynamic characteristics of traffic flow. Furthermore, these methods can face significant challenges when processing large-scale data and high-dimensional features.

2.3 Traffic prediction based on deep learning

With advancements in deep learning, data-driven methods leveraging deep learning techniques have gained prominence [1, 21, 22, 37]. Lv et al. [25] demonstrated the effectiveness of deep learning in handling high-dimensional data. Huang et al. [15] introduced a deep architecture with a bottom deep belief network (DBN) and a top multitask regression layer, marking the initial application of deep learning in traffic research. The multitask learning (MTL) framework utilized weight sharing within DBN to enhance prediction performance. Lingras et al. [24] showed that recurrent neural networks (RNNs) effectively capture the spatial-temporal evolution of traffic flow. Building on this, Zhao et al. [50] applied long short-term memory (LSTM) networks to traffic flow prediction, leveraging their ability to capture long-term dependencies better than traditional RNNs. Fu et al. [10] proposed gated recurrent units (GRUs) as a simpler alternative to LSTMs, offering improved efficiency in traffic flow prediction. However, these approaches overlooked the spatial characteristics of traffic data, failing to extract correlations between traffic nodes. To address this, methods based on convolutional neural networks (CNNs) were introduced. Shi et al. [31] proposed the convolutional LSTM (ConvLSTM) network, which integrates CNNs with LSTMs to capture spatial-temporal dependencies between nodes. Zhang et al. [47] developed the deep spatial-temporal residual network (ST-ResNet), which employs convolutional residual networks to model dependencies between both adjacent and distant regions, assigning different weights to various branches and regions. Guo et al. [13] proposed the spatial-temporal 3D convolutional neural network (ST-3DNet), utilizing 3D convolutions to automatically capture spatial and temporal correlations in traffic data. 
Despite their contributions, these methods divide the road network into regular grid structures, which inadequately represent the irregular graph topology of real road networks. Consequently, they struggle to extract complex spatial correlations between traffic nodes accurately.

To enhance the performance of traffic flow prediction models, GCNs have gained attention for their exceptional feature extraction capabilities on graph-structured data. Li et al. [23] introduced the Diffusion Convolutional Recurrent Neural Network (DCRNN), modeling traffic flow as a diffusion process on a directed graph. Cui et al. [5] proposed the Traffic Graph Convolutional Long Short-Term Memory Neural Network (TGC-LSTM), which adapts to the physical properties of traffic networks and utilizes traffic graph convolutional operators to extract comprehensive features. However, RNN-based models are computationally intensive and challenging to train. To address this, Yu et al. [46] proposed the Spatial-Temporal Graph Convolutional Network (STGCN), which uses a fully convolutional structure to jointly extract spatial and temporal features, achieving faster training and reduced parameter complexity. Nevertheless, these models treat spatial and temporal correlations separately, overlooking the heterogeneity of spatial-temporal data. Song et al. [32] introduced the Spatial-Temporal Synchronous Graph Convolutional Network (STSGCN), which connects individual spatial graphs of adjacent time steps into a unified graph to capture complex local spatial-temporal correlations through a synchronization mechanism. However, predefined graph structures may not accurately represent true dependency relationships. Wu et al. [42] addressed this limitation by introducing Graph WaveNet, which learns adaptive dependency matrices through node embeddings to uncover hidden spatial dependencies in traffic data. Zhang et al. [48] proposed an adaptive graph learning algorithm (AdapGL), optimizing graph learning module parameters through alternating training. Li et al. [20] advanced this with the Permutation Equivariant Graph Framelet Augmented Network (PEGFAN), employing a Haar-type graph framework to extract multi-scale information for integration into graph neural network architectures. 
Zheng et al. [52] proposed the Spatial-Temporal Joint Graph Convolutional Network (STJGCN), constructing both predefined and adaptive spatial-temporal joint graphs (STJG) to model relationships between any two time steps. They introduced expanded causal spatial-temporal joint graph convolutional layers to capture dependencies across multiple spatial and temporal ranges. However, these models show lower robustness with irregular spatial-temporal sequences. Choi et al. [4] developed Spatial-Temporal Graph Neural Control Differential Equations (STG-NCDE), combining NCDE with graph convolution to address scenarios with missing sensor data. Rakkiyappan et al. [30] explored global synchronization and periodic behavior in neural networks using adaptive control strategies. Wang et al. [41] proposed merged attention graphs to estimate matching statuses between targets and nodes. For long-term traffic condition prediction, Guo et al. [12] proposed the Attention-based Spatial-Temporal Graph Convolutional Network (ASTGCN), integrating a spatial-temporal attention mechanism to capture dynamic traffic correlations. Zheng et al. [51] introduced the Graph Multi-Attention Network (GMAN), which uses an encoder-decoder architecture with multiple spatial-temporal attention blocks to simulate traffic dynamics. However, their graph convolutional layers share weights across time steps, ignoring spatial similarity variations as spatial correlations between time steps are not always consistent.

3 Problem description

In this section, we define the traffic prediction problem. A description of the detailed notation can be found in Appendix A.

In this study, we define the traffic road network as a directed graph $\mathcal {G} = \left( {\mathcal {V},\mathcal {E},\mathcal {A}} \right) $, where $\mathcal {V}$ represents the set of nodes (such as sensors) in the road network (${\left| \mathcal {V} \right| = N}$), $\mathcal {E}$ denotes the set of edges, and $\mathcal {A}$ is the adjacency matrix (${\mathcal {A} \in \mathbb {R}^{N \times N}}$) of the road network graph. At each time interval, each node collects D node features, such as traffic flow and speed, at the same sampling frequency. The traffic signal $X_{t} \in \mathbb {R}^{N \times D}$ represents the observed values of all sensors in the traffic road network $\mathcal {G}$ at time step t. Given the historical traffic signal $X^{P}=\left( {X_{t - P + 1},{X}_{t - P + 2},\ldots ,X_{t - 1},X_{t}} \right) \in \mathbb {R}^{N \times D \times P}$, the traffic prediction problem aims to predict $Y^{Q} = \left( {Y_{t + 1},Y_{t + 2},\ldots ,Y_{t + Q - 1},Y_{t + Q}} \right) \in \mathbb {R}^{N \times D \times Q}$, where P is the number of historical time steps and Q is the number of time steps to predict. Thus, the traffic prediction problem can be described as learning the mapping function f from the P historical time steps to the next Q time steps: $X^{P}\overset{f{( \cdot )}}{\rightarrow }{Y}^{Q}$.
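As an illustration of the mapping from $X^{P}$ to $Y^{Q}$, the following minimal sketch slices a raw series into input and target windows. The function name `make_windows`, the `(T, N, D)` storage layout of the raw series, and the toy sizes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def make_windows(series, P, Q):
    """Split a (T, N, D) series into (X^P, Y^Q) pairs.

    Returns X with shape (S, N, D, P) and Y with shape (S, N, D, Q),
    where S = T - P - Q + 1 is the number of samples.
    """
    T = series.shape[0]
    S = T - P - Q + 1
    if S <= 0:
        raise ValueError("series is too short for the requested P and Q")
    X = np.stack([series[s:s + P] for s in range(S)])          # (S, P, N, D)
    Y = np.stack([series[s + P:s + P + Q] for s in range(S)])  # (S, Q, N, D)
    # Move the time axis last to match the N x D x P layout used in the text.
    return np.moveaxis(X, 1, -1), np.moveaxis(Y, 1, -1)

# Toy data: T = 20 time steps, N = 3 sensors, D = 2 features (hypothetical sizes).
series = np.arange(20 * 3 * 2, dtype=float).reshape(20, 3, 2)
X, Y = make_windows(series, P=12, Q=4)
print(X.shape, Y.shape)
```

Each sample pairs the last P observations with the Q observations that immediately follow them, so the first target step of a sample is the time step right after its last input step.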

4 Methodology

In this section, we will provide a detailed description of the structure and functional modules of DSTSPYN, as shown in Fig. 3. DSTSPYN consists of multiple stacked Spatial-Temporal (ST) blocks, each comprising a Spatial-Temporal Attention block and a Spatial-Temporal Pyramid block. The Spatial-Temporal Attention block is used to extract dynamic long-term features of traffic flow, while the Spatial-Temporal Pyramid block is employed to capture spatial-temporal correlations at various scales. Specifically, the Spatial-Temporal Pyramid block consists of a Spatial Pyramid module and a Temporal Pyramid module. The Spatial Pyramid module captures global spatial information and spatial similarity, whereas the Temporal Pyramid module extracts time dependencies at different granularities. After passing through the ST blocks, the data is propagated to the prediction layer via residual connections. The specific details of this model will be discussed in the subsequent subsections.

4.1 ST attention block

Spatial-temporal attention automatically learns the dependency relationships between different timestamps and spatial locations, helping the model gain a deeper understanding of the evolving patterns of traffic data. In this study, we employ an enhanced spatial-temporal self-attention mechanism to extract the dynamic spatial-temporal relationships of traffic flow.

4.1.1 Temporal attention

The adoption of a multi-head attention mechanism facilitates the consideration of multiple attention heads, enabling each attention head to focus on different temporal and spatial features. This enhances the model’s representational capacity and predictive performance. For a multi-head attention with H heads, we define the variables as follows:

$$\begin{aligned} Q^{(h)} = X_{Q}W_{Q}^{(h)},K^{(h)} = X_{K}W_{K}^{(h)},V^{(h)} = X_{V}W_{V}^{(h)}, h = 1,\ldots ,H, \end{aligned}$$

(1)

$$\begin{aligned} {Head}^{(h)} \!= & \! \text {Att}\left( {Q^{(h)},K^{(h)},V^{(h)}} \right) = \text {Softmax}\Bigg ( \frac{Q^{(h)}{K^{(h)}}^{T}}{\sqrt{d_{h}}} \nonumber \\ & + \frac{Q^{({h - 1})}{K^{({h - 1})}}^{T}}{\sqrt{d_{h}}} \Bigg )V^{(h)}, \end{aligned}$$

(2)

$$\begin{aligned} M = \text {LayerNorm}\left( {\text {MLP}\left( {\text {Concat}\left( {{Head}^{(1)},\ldots ,{Head}^{(H)}} \right) } \right) } \right) . \end{aligned}$$

(3)

In this context, $X_{Q,K,V} \in \mathbb {R}^{D^{({l - 1})} \times P \times d}$ represents the output of the ST block at layer $l-1$, which serves as the input for the l-th layer’s ST block. $W_{Q,K,V}^{(h)} \in \mathbb {R}^{d \times d_{h}}\left( {d_{h} = \frac{d}{H}} \right) $ denotes the learnable parameters. Within each ST block, the output of the temporal attention module is connected via residual connections [18] to the output of the temporal attention module in the previous block. This facilitates the transmission and retention of historical information, enabling the model to better capture long-term dependencies in the temporal sequences. Finally, by concatenating the matrices of all attention heads, a more comprehensive and enriched feature representation is constructed. After normalization, $M \in \mathbb {R}^{D^{({l - 1})} \times P \times N}$ is obtained, which serves as the input to the spatial attention module.
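The computation in Eqs. (1) to (3) can be sketched as follows for a single $P \times d$ slice. This is a reading of the equations under several assumptions that the text does not settle: the carried score term is taken from the previous head, exactly as indexed in Eq. (2); the first head has no carried term; the MLP is reduced to one linear map `Wo`; and LayerNorm has no learnable scale or shift. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def temporal_attention(X, Wq, Wk, Wv, Wo):
    """X: (P, d); Wq, Wk, Wv: (H, d, d_h); Wo: (H * d_h, d)."""
    H, _, d_h = Wq.shape
    heads = []
    prev_scores = 0.0  # assumption: no carried term for the first head
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]        # Eq. (1)
        scores = Q @ K.T / np.sqrt(d_h)                  # (P, P) scaled scores
        heads.append(softmax(scores + prev_scores) @ V)  # Eq. (2)
        prev_scores = scores                             # reused by the next head
    concat = np.concatenate(heads, axis=-1)              # (P, H * d_h)
    return layer_norm(concat @ Wo)                       # Eq. (3), linear stand-in for the MLP

rng = np.random.default_rng(0)
P, d, H = 12, 8, 2
X = rng.normal(size=(P, d))
Wq, Wk, Wv = (rng.normal(size=(H, d, d // H)) for _ in range(3))
Wo = rng.normal(size=(d, d))
M = temporal_attention(X, Wq, Wk, Wv, Wo)
print(M.shape)
```

The sketch omits the node and feature dimensions of the full tensor; in practice the same operation would be applied along the temporal axis for every node and feature channel.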

4.1.2 Spatial attention

The temporal attention adaptively considers the importance of different time scales in the modeling process, enabling the model to handle global temporal dependencies more effectively. In the case of traffic data, there are spatial dependencies among different locations. To address multi-scale spatial dependencies, we employ an enhanced spatial pyramid attention mechanism.

Initially, we employ one-dimensional convolution to map the temporal dimension P of the output M generated by the temporal attention module to dimension $d_{E}$ and aggregate the feature dimension $D^{(l-1)}$. Subsequently, through a spatial embedding operation, we obtain $M^{'} \in \mathbb {R}^{N \times d_{E}}$. Similar to the temporal attention module, we define the following variables:

$$\begin{aligned} {Q^{'}}^{(h)}= & M_{Q}^{'}{W^{'}}_{Q}^{(h)},{K^{'}}^{(h)} = M_{K}^{'}{W^{'}}_{K}^{(h)}, \end{aligned}$$

(4)

$$\begin{aligned} {score}^{(h)}= & \frac{{Q^{'}}^{(h)}{{K^{'}}^{(h)}}^{T}}{\sqrt{d_{h}^{'}}}. \end{aligned}$$

(5)

The variable ${score}^{(h)} \in \mathbb {R}^{d \times d_{h}}$ represents the correlation between the query vector ${Q^{'}}^{(h)}$ and the key vector ${K^{'}}^{(h)}$. Unlike traditional Transformers, the obtained attention scores are not used to weight the embedded vector $M^{'}$ and $V^{'}$. Instead, they are used to adjust the weights of each layer within the spatial pyramid, as shown in Fig. 3b. The detailed process is elaborated in Section 4.2.

4.2 Spatial pyramid

4.2.1 Diffusion graph convolution

A transportation network is essentially a graph structure, where the features of each node can be viewed as signals on the graph. GCNs effectively handle graph-structured data by aggregating information from neighboring nodes to derive features for each node, thereby enhancing the model’s understanding of the relationships between different locations in the road network. Therefore, we employ the GCN as the initial layer of the spatial pyramid to capture localized spatial-temporal dependencies within the transportation network. Existing research primarily employs predefined graph structures for graph convolution operations, aggregating information from neighboring nodes to obtain node features. However, real-world transportation networks encompass concealed and uncertain relationships between different roads. Therefore, to model real transportation networks more accurately, it is necessary to adaptively learn these hidden graph structures while also considering their static characteristics.

First, we represent the proximity of different nodes simply based on whether there is a connection between them:

$$\begin{aligned} \mathcal {A}_{i,j} = {\left\{ \begin{array}{ll} 1,& {\text {if}~\mathcal {V}_{i}~\text {connects to}~\mathcal {V}_{j}}\\ {0,}& {\text {otherwise}} \end{array}\right. }. \end{aligned}$$

(6)

Next, inspired by [42], we generate an adaptive adjacency matrix by randomly initializing two learnable node embedding matrices $e_{1},e_{2} \in \mathbb {R}^{N \times c}$, where N is the number of nodes and c is the embedding dimension:

$$\begin{aligned} {\tilde{\mathcal {A}}}_{\text {adp}} = \text {SoftMax}\left( {\text {ReLU}\left( {e_{1}e_{2}^{T}} \right) } \right) . \end{aligned}$$

(7)

Finally, we describe the state transition between nodes as a spatial diffusion process, and simulate it by performing random walks on the graph. This Markov random process converges to a stationary distribution after K time steps. By incorporating both predefined spatial dependencies and self-learned hidden graph dependencies, the adaptive diffusion graph convolutional layer is expressed as:

$$\begin{aligned} Z = \sum \limits _{k = 0}^{K}\left( C_{\text {f}}^{k}XW_{k1}+C_{\text {b}}^{k}XW_{k2} + {\tilde{\mathcal {A}}}_{\text {adp}}XW_{k3} \right) . \end{aligned}$$

(8)

In this equation, $C^{k}$ represents the k-th power of the transition matrix. In the case of a directed graph, the diffusion process encompasses two directions: forward and backward. The forward transition matrix is defined as $C_{\text {f}} = \frac{\mathcal {A}}{rowsum(\mathcal {A})}$, and the backward transition matrix is defined as $C_{\text {b}} = \frac{\mathcal {A}^{T}}{rowsum\left( \mathcal {A}^{T} \right) }$. $X \in \mathbb {R}^{N \times D}$ represents the input signal, $Z \in \mathbb {R}^{N \times F}$ represents the output, and $W \in \mathbb {R}^{D \times F}$ represents the model parameter matrix.
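A minimal sketch of Eqs. (6) to (8) for one time step is given below. It assumes that the adaptive term sits inside the sum over k (one weight matrix $W_{k3}$ per diffusion step), that the softmax in Eq. (7) is applied row-wise, and that the parameters are stored in a single array `W` of shape `(K + 1, 3, D, F)`. These choices, the toy graph, and all names are illustrative assumptions.

```python
import numpy as np

def row_normalize(A):
    """Divide each row by its sum; all-zero rows are left as zeros."""
    s = A.sum(axis=1, keepdims=True)
    return A / np.where(s == 0, 1.0, s)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diffusion_gcn(X, A, e1, e2, W, K):
    """X: (N, D); A: (N, N); e1, e2: (N, c); W: (K + 1, 3, D, F)."""
    N = A.shape[0]
    C_f = row_normalize(A)                                # forward transition matrix
    C_b = row_normalize(A.T)                              # backward transition matrix
    A_adp = softmax(np.maximum(e1 @ e2.T, 0.0), axis=1)   # Eq. (7), row-wise softmax
    Z = np.zeros((N, W.shape[-1]))
    P_f, P_b = np.eye(N), np.eye(N)                       # k = 0 powers
    for k in range(K + 1):
        Z += P_f @ X @ W[k, 0] + P_b @ X @ W[k, 1] + A_adp @ X @ W[k, 2]
        P_f, P_b = P_f @ C_f, P_b @ C_b                   # next powers C_f^{k+1}, C_b^{k+1}
    return Z

rng = np.random.default_rng(0)
N, D, F, c, K = 4, 2, 3, 5, 2
# Hypothetical directed road graph: a ring with one extra edge.
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 0, 0]], dtype=float)
X = rng.normal(size=(N, D))
e1, e2 = rng.normal(size=(N, c)), rng.normal(size=(N, c))
W = rng.normal(size=(K + 1, 3, D, F))
Z = diffusion_gcn(X, A, e1, e2, W, K)
print(Z.shape)
```

Accumulating the matrix powers inside the loop avoids recomputing $C^{k}$ from scratch at every step.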

4.2.2 Spatial-temporal global similarity module

The Adaptive Diffusion Graph Convolution (DGCN) associates traffic flow with diffusion processes, capturing the stochastic nature of traffic dynamics. However, since DGCN relies on a local neighborhood propagation approach, it is limited in terms of information propagation range, particularly for distant nodes or the global structural characteristics of the entire graph. As a result, DGCN exhibits relatively weak modeling capabilities in these scenarios. Furthermore, DGCN shares weights across all time steps, neglecting the spatial variations in correlations between different time steps in traffic data.

Brownian Distance Covariance (BDC) [36], grounded in the theory of characteristic functions, measures the Euclidean distance between the joint characteristic function of two random vectors and the product of their marginal characteristic functions. This approach quantifies the dependence between the vectors by analyzing the discrepancy in their characteristic functions. For a set of m independently and identically distributed observed values $\left\{ \left( x_{1},y_{1} \right) ,\left( x_{2},y_{2} \right) ,\ldots \left( x_{m},y_{m} \right) \right\} $, we define a matrix $\hat{A} = {(\hat{a}}_{kl}) \in \mathbb {R}^{m \times m}$, where ${\hat{a}}_{kl}=\parallel x_{k} - x_{l} \parallel $ represents the Euclidean distance matrix computed based on the observations X. Similarly, we compute the Euclidean distance matrix $\hat{B} = {(\hat{b}}_{kl}) \in \mathbb {R}^{m \times m}$, where ${\hat{b}}_{kl} = \parallel y_{k} - y_{l} \parallel $. We designate $A=(a_{kl})$ as the Spatial-Temporal Global Correlation (STGC) matrix:

$$\begin{aligned} a_{kl}={\hat{a}}_{kl}-\frac{1}{m}\sum \limits _{k = 1}^{m}{\hat{a}}_{kl}-\frac{1}{m}\sum \limits _{l = 1}^{m}{\hat{a}}_{kl}+\frac{1}{m^{2}}\sum \limits _{k = 1}^{m}\sum \limits _{l = 1}^{m}{\hat{a}}_{kl}, \end{aligned}$$

(9)

where the last three terms respectively represent the average values of the l-th column, k-th row, and all values in $\hat{A}$. A similar calculation yields the matrix B. Then, the BDC metric has a closed-form expression proved in [44]:

$$\begin{aligned} \rho \left( {X,Y} \right) = tr\left( {A^{T}B} \right) . \end{aligned}$$

(10)

Since the STGC matrix is symmetric, $\rho ({X,Y})$ can be expressed as the inner product of two STGC vectors, a and b, namely:

$$\begin{aligned} \rho \left( {X,Y} \right) = < a,b > = a^{T}b. \end{aligned}$$

(11)

Here, a and b are obtained by extracting the upper triangular parts of matrices A and B, followed by vectorization.
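The following sketch computes the BDC metric for two small samples using the standard double-centering of a distance matrix (subtract the row and column means, add back the overall mean) and the trace form of Eq. (10). One detail is our own addition rather than something stated in the text: for the inner product of the upper-triangular vectors to equal the trace exactly, the off-diagonal entries are scaled by $\sqrt{2}$, because each of them appears twice in $tr(A^{T}B)$. The helper names and the toy data are hypothetical.

```python
import numpy as np

def bdc_matrix(obs):
    """Double-centered Euclidean distance matrix of the m rows of obs."""
    diff = obs[:, None, :] - obs[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    return D - D.mean(axis=0, keepdims=True) - D.mean(axis=1, keepdims=True) + D.mean()

def bdc(x, y):
    """Trace form of the BDC metric, as in Eq. (10)."""
    return np.trace(bdc_matrix(x).T @ bdc_matrix(y))

def upper_vec(A):
    """Vectorize the upper triangle, scaling off-diagonal entries by sqrt(2)."""
    rows, cols = np.triu_indices(A.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return A[rows, cols] * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))                  # m = 6 observations of X
y = x ** 2 + 0.1 * rng.normal(size=(6, 3))   # nonlinearly dependent on x
rho = bdc(x, y)
a, b = upper_vec(bdc_matrix(x)), upper_vec(bdc_matrix(y))
print(np.isclose(rho, a @ b))
```

Without the scaling, the upper-triangular inner product still orders pairs in a closely related way, but it is no longer numerically identical to the trace.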

Based on the above derivation, it is evident that the BDC metric facilitates the explicit representation of feature matrices. The BDC metric can model all possible relationships, providing a robust measurement for assessing feature dependence. Therefore, we introduce the Spatial-Temporal Global Similarity (STGS) module, which takes the traffic graph at each moment as input and outputs the STGC matrix as a visual representation. The similarity between two time slices is calculated as the inner product of the corresponding STGC matrices. The STGC matrix can effectively capture the marginal features of the graph by learning the difference between the joint distribution of embedded features and the product of marginals. Furthermore, the STGC matrix encapsulates non-linear relationships between channels through the use of Euclidean distance, enabling the extraction of spatial similarities between road networks at two different time steps. We slice the data processed by the diffusion graph convolution along the temporal dimension, i.e.,

$$\begin{aligned} Z = \left( Z^{(1)},Z^{(2)},Z^{(3)},\ldots ,Z^{(P - 1)},Z^{(P)} \right) \in \mathbb {R}^{F \times N \times P}. \end{aligned}$$

(12)

Taking $Z^{(i)}$ as an example, considering the spatial similarity between adjacent time steps, the details of the STGS module are illustrated in Fig. 4. For the traffic feature $Z^{(i)} \in \mathbb {R}^{F \times N}$ at each time step, each column $Z_{k}^{(i)} \in \mathbb {R}^{F}$ or each row (transposed) $Z_{l}^{(i)} \in \mathbb {R}^{N}$ can be regarded as an observation of the random vector $Z^{(i)}$. Following the approach in [44], we take each column $Z_{k}^{(i)}$ as a random observation and first compute the squared Euclidean distance matrix $\tilde{A} =({\tilde{a}}_{kl})$, where $\tilde{a}_{kl}$ is the squared Euclidean distance between the k-th and l-th columns of $Z^{(i)}$. The Euclidean distance matrix $\hat{A} = ({\hat{a}}_{kl}) = (\sqrt{\tilde{a}_{kl}})$ is then obtained; subtracting its row and column means and adding back its overall mean yields the STGC matrix A:

$$\begin{aligned} {\tilde{A}}^{(i)}= & 2\left( J_{N}\left( {Z^{(i)}}^{T}Z^{(i)}\circ I \right) \right) _{sym} - 2{Z^{(i)}}^{T}Z^{(i)}, \end{aligned}$$

(13)

$$\begin{aligned} \hat{A}^{(i)}= & \left( \sqrt{\tilde{a}_{kl}^{(i)}} \right) , \end{aligned}$$

(14)

$$\begin{aligned} A^{(i)}= & \hat{A}^{(i)} - \frac{2}{N}\left( {J_{N}\hat{A}^{(i)}} \right) _{sym} + \frac{1}{N^{2}}J_{N}\hat{A}^{(i)}J_{N}. \end{aligned}$$

(15)

Here, $J_{N} \in \mathbb {R}^{N \times N}$ is the matrix whose elements are all 1, I is the identity matrix, and $\circ $ represents the Hadamard product. We denote $U_{sym} = \frac{1}{2}({U + U^{T}})$.

Equation (10) allows us to compute the BDC measure between $A^{(i)}$ and $A^{(j)}$, which quantifies the similarity of traffic flow features between time slice i and time slice j:

$$\begin{aligned} \rho \left( {Z^{(i)},Z^{(j)}} \right) = < a^{(i)},a^{(j)} > = {a^{(i)}}^{T}a^{(j)}, \end{aligned}$$

(16)

where $a^{(i)}$ and $a^{(j)}$ represent the vector representations of $A^{(i)}$ and $A^{(j)}$, respectively.
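The matrix pipeline of Eqs. (13)-(16) can be illustrated with a small pure-Python sketch. It is not the authors' code: the helper names are hypothetical, a time slice is represented as F rows (channels) by N columns (nodes), the centering divides by the number of nodes N, and the upper-triangular vectorization is assumed to include the diagonal.

```python
import math

def stgc_matrix(z):
    """STGC matrix of one time slice z with F rows (channels) and N columns (nodes)."""
    f, n = len(z), len(z[0])
    # Gram matrix G = Z^T Z over the node columns (N x N).
    gram = [[sum(z[c][k] * z[c][l] for c in range(f)) for l in range(n)]
            for k in range(n)]
    # Squared distance G_kk + G_ll - 2 G_kl, then its square root.
    dist = [[math.sqrt(max(gram[k][k] + gram[l][l] - 2.0 * gram[k][l], 0.0))
             for l in range(n)] for k in range(n)]
    # Double centering; dist is symmetric, so row means equal column means.
    mean = [sum(dist[k][l] for k in range(n)) / n for l in range(n)]
    total_mean = sum(mean) / n
    return [[dist[k][l] - mean[k] - mean[l] + total_mean
             for l in range(n)] for k in range(n)]

def upper_triangle(a):
    """Vectorize the upper-triangular part of a (diagonal included, an assumption)."""
    n = len(a)
    return [a[k][l] for k in range(n) for l in range(k, n)]

def slice_similarity(z_i, z_j):
    """Inner product of the STGC vectors of two time slices."""
    a_i = upper_triangle(stgc_matrix(z_i))
    a_j = upper_triangle(stgc_matrix(z_j))
    return sum(p * q for p, q in zip(a_i, a_j))

# Toy slices with F = 2 channels and N = 3 nodes (illustrative values only).
z_a = [[0.0, 1.0, 3.0], [0.0, 0.0, 0.0]]
z_b = [[0.0, 2.0, 6.0], [0.0, 0.0, 0.0]]   # same layout, doubled distances
z_flat = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]  # all nodes identical
print(slice_similarity(z_a, z_b))
print(slice_similarity(z_a, z_flat))
```

A slice in which every node carries the same feature vector has an all-zero STGC matrix, so its similarity to any other slice vanishes under this formulation.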

The significance of features for traffic prediction increases with the similarity between the current time slice and its adjacent time slices. We assign a weight to each time slice based on its mean similarity to the neighboring time slices, and finally concatenate the weighted features:

$$\begin{aligned} Z_{S} = \text {ReLU}\left( {\text {Concat}\left( {A^{(1)}S^{(1)},\ldots ,A^{(P)}S^{(P)}} \right) } \right) . \end{aligned}$$

(17)

Here, $S^{(i)}$ represents the mean of the BDC metric between $Z^{(i)}$ and the adjacent time slices. This step is regarded as the second layer of the spatial pyramid, capturing global information for nodes.
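A minimal sketch of the weighting step in Eq. (17) follows. Several details are assumptions made only for illustration: the neighbors of slice i are taken to be slices i-1 and i+1 (boundary slices have a single neighbor), the similarity is the full matrix inner product of the STGC matrices, no normalization is applied to the scores, and the scaled matrices are concatenated along the column axis. The names `frobenius_inner` and `weighted_concat` are hypothetical.

```python
def frobenius_inner(a, b):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def weighted_concat(stgc_mats):
    """Scale each STGC matrix by its mean similarity to adjacent slices,
    concatenate the results column-wise, and apply ReLU."""
    p = len(stgc_mats)
    if p < 2:
        raise ValueError("at least two time slices are needed")
    scores = []
    for i in range(p):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < p]
        sims = [frobenius_inner(stgc_mats[i], stgc_mats[j]) for j in neighbours]
        scores.append(sum(sims) / len(sims))
    n = len(stgc_mats[0])
    fused = [[max(stgc_mats[i][r][c] * scores[i], 0.0)
              for i in range(p) for c in range(n)] for r in range(n)]
    return scores, fused

# Three toy 2 x 2 STGC matrices (illustrative values only).
mats = [
    [[1.0, -1.0], [-1.0, 1.0]],
    [[2.0, -2.0], [-2.0, 2.0]],
    [[-1.0, 1.0], [1.0, -1.0]],
]
scores, fused = weighted_concat(mats)
print(scores)
print(fused[0])
```

In this toy case the middle slice is positively similar to its first neighbor and negatively similar to its second, so its mean score cancels and its block of the concatenated output is zero.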

4.2.3 Spatial convolution pyramid

After the STGS module, we perform average pooling and pass through a $1\times 1$ convolutional layer to extract the road network-level features. This step corresponds to the third layer of the spatial pyramid.

$$\begin{aligned} Z_{L} = \text {Conv2D}_{(1,1)}\left( {\text {GlobalMeanPool}\left( Z_{S} \right) } \right) . \end{aligned}$$

(18)

Next, we integrate the low-resolution, high semantic features with the high-resolution, low semantic features by employing top-down pathways and lateral connections, thus forming a spatial pyramid as shown in Fig. 3b. Moreover, the spatial attention scores calculated in Section 4.1 are utilized to adjust the weights of each layer within the spatial pyramid. Assuming that the output of each layer in the pyramid is denoted as $Z_{j}$, then:

$$\begin{aligned} Z_{O} = {\sum \limits _{j = 1}^{J}{Z_{j} \cdot \text {Softmax}\left( {score}^{j} \right) }}, \end{aligned}$$

(19)

where J represents the number of layers in the spatial pyramid.
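The fusion in Eq. (19) can be read as a softmax-weighted sum of the layer outputs. The sketch below assumes one scalar attention score per pyramid layer, a softmax taken across the J layers, and layer outputs that have already been brought to a common shape; these are illustrative assumptions, and `fuse_layers` is a hypothetical name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_outputs, scores):
    """Weighted sum of equally shaped layer outputs with softmax weights."""
    weights = softmax(scores)
    n_rows, n_cols = len(layer_outputs[0]), len(layer_outputs[0][0])
    return [[sum(w * z[r][c] for w, z in zip(weights, layer_outputs))
             for c in range(n_cols)] for r in range(n_rows)]

# Three toy layer outputs of shape 1 x 2 (illustrative values only).
layers = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
equal = fuse_layers(layers, [0.0, 0.0, 0.0])     # equal scores: plain average
peaked = fuse_layers(layers, [0.0, 0.0, 50.0])   # one dominant score
print(equal, peaked)
```

With equal scores the fusion reduces to an average of the layers, and with one dominant score it approaches the output of that single layer.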

4.3 Temporal pyramid

We propose a convolutional module called Pyramid Gated Tanh Unit (PY-GTU) to capture the dynamic temporal information in traffic flow data. The specific structure of this module is illustrated in Fig. 3c and primarily consists of several Gated Tanh Units (GTUs) [6] with varying receptive fields.

We define a pyramid level for each stage, where the input of the i-th pyramid level is denoted as $Z_{\text {in}}^{(i)} \in \mathbb {R}^{N \times D \times P^{(i)}}$. By setting the convolutional kernel size to $1 \times S$, we have:

$$\begin{aligned} Z_{\text {in}}^{(i + 1)} = \Gamma *_{\tau }Z_{\text {in}}^{(i)} = \text {Tanh}(E) \odot \text {Sigmoid}(F) \in \mathbb {R}^{N \times 2D^{(i)} \times \frac{P^{(i)}}{k}}, \end{aligned}$$

(20)

where $*_{\tau }$ represents the gated convolutional operator, and E and F respectively refer to the first and second halves of $Z_{\text {in}}^{(i)}$ along the channel dimension. The parameter k denotes the reduction factor of the temporal dimension in each pyramid level, $k=\frac{P^{(i)}}{(S^{(i)} - 1)}$. Through the bottom-up pathway, we extract temporal dependencies at different scales.
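The gating in Eq. (20) can be sketched for a single node as follows. This toy version assumes that E and F are the two channel halves produced by one temporal convolution with kernel size S, which is a common way to implement gated units, and it exposes the stride as a free parameter rather than deriving it from k. The function names and the filter layout are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gtu(x, kernels, stride=1):
    """Gated Tanh Unit along the time axis for one node.

    x: list of C_in channels, each a list of P time steps.
    kernels: list of 2 * C_out filters, each given as C_in lists of S taps.
    Returns C_out channels computed as tanh(E) * sigmoid(F).
    """
    c_in, p = len(x), len(x[0])
    s = len(kernels[0][0])
    starts = range(0, p - s + 1, stride)
    conv = [[sum(w[c][t] * x[c][start + t] for c in range(c_in) for t in range(s))
             for start in starts] for w in kernels]
    half = len(conv) // 2
    e, f = conv[:half], conv[half:]
    return [[math.tanh(a) * sigmoid(b) for a, b in zip(e_ch, f_ch)]
            for e_ch, f_ch in zip(e, f)]

# One input channel with P = 6 steps; two filters with S = 3 taps (toy values).
x = [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]]
kernels = [
    [[1.0, 0.0, 0.0]],  # E filter: passes the first step of each window
    [[0.0, 0.0, 0.0]],  # F filter: zero pre-activation, so the gate is 0.5
]
out = gtu(x, kernels, stride=3)
print(out)
```

With a stride of 3 the six input steps are reduced to two output steps, which illustrates how each pyramid level shortens the temporal dimension.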

In addition to the topmost layer of the pyramid, we apply a GTU with a $1 \times 1$ convolutional kernel at each level of the pyramid. Through the top-down pathway, the lower pyramid levels integrate coarser but semantically stronger feature maps from the higher levels and generate higher-resolution features through upsampling. Furthermore, through lateral connections, these feature maps are fused with those of the bottom-up pathway and serve as the output of that pyramid level. Finally, we employ trainable parameters to adjust the weights of each temporal pyramid layer:

$$\begin{aligned} \text {Output} = {\sum \limits _{i = 1}^{m}{Z_{\text {out}}^{(i)}W^{(i)}}}. \end{aligned}$$

(21)

Here, m represents the number of layers in the temporal pyramid, $Z_{\text {out}}^{(i)}$ denotes the output of the i-th layer, and $W^{(i)}$ is a trainable parameter. The temporal convolutional pyramid leverages the pyramid structure and contextual information to enhance the model’s representation and localization of features. By propagating and fusing information across the pyramid levels, we obtain a more comprehensive feature representation and improve the model’s ability to capture temporal dynamics.

4.4 Spatial-temporal pyramid summary

The spatial-temporal pyramid captures multi-scale spatial and temporal features of traffic flow data using a hierarchical approach, improving the model’s ability to comprehend complex traffic patterns. As illustrated in Fig. 5, the spatial pyramid comprises three layers, each progressively extracting local and global spatial features from the traffic network. The first layer focuses on local neighborhood characteristics within the road network by employing adaptive diffusion graph convolution to capture spatial-temporal dependencies between nodes. This layer effectively extracts features from the central parts of roads, enhancing the model’s understanding of short-range spatial-temporal relationships between nodes.

The second layer incorporates global spatial features using the STGS module, which captures spatial correlations between adjacent time slices and quantifies relationships among distant nodes across the entire network. This layer is particularly effective at extracting features from the edges of the road network. The third layer aggregates and integrates features from the first two layers by employing global average pooling and convolution operations. This process extracts more abstract spatial features, resulting in spatial representations enriched with global contextual information.

The temporal pyramid focuses on extracting dynamic temporal features of traffic flow across different time scales using the Pyramid-Gated Tanh Unit (PY-GTU) structure. Each layer captures temporal dependencies of varying lengths by adjusting convolution kernel sizes. Lower layers primarily extract fine-grained temporal features over short time ranges, emphasizing local variations in traffic flow. As the hierarchy progresses, the model captures longer temporal dependencies, enabling it to identify global temporal patterns across time slices. At the top layer, the temporal pyramid merges information from all time scales, generating a dynamic feature representation that integrates diverse combinations of temporal dependencies.

By modeling multilayer spatial-temporal dependencies in traffic flow data, the spatial-temporal pyramid enables comprehensive feature extraction, spanning from local to global scales and from short-term to long-term patterns. The spatial pyramid effectively captures both local neighborhood features and global similarity characteristics within the traffic network, while the temporal pyramid focuses on uncovering the dynamic temporal patterns of traffic flow. The integration of these pyramids enhances the model’s prediction accuracy, as well as its robustness and adaptability to complex and evolving traffic conditions.

5 Experiment

5.1 Datasets

To assess the model’s performance, we utilized four datasets, namely PEMS03, PEMS04, PEMS07, and PEMS08, which were collected from real California highways, as published by Song et al. [32]. The traffic data was aggregated every 5 minutes, and specific details are outlined in Table 1.

Table 1 Dataset description


5.2 Baseline methods

We compared DSTSPYN with ten widely used baseline methods from the traffic flow prediction literature, including:

VAR [33]: A classic time series forecasting method based on autoregressive models, which captures dynamic changes and mutual influences among multiple variables by establishing linear regressions at each time step.
GRU [3]: A variant of RNNs that addresses the issues of vanishing and exploding gradients in traditional RNNs by utilizing update and reset gates to control the update of hidden states.
STGCN [46]: This model integrates graph convolution with one-dimensional convolution. The graph convolutional network captures spatial dependencies within the traffic network, while the one-dimensional convolution models dynamic changes along the temporal dimension. This combination enables the model to effectively learn spatial-temporal correlations in traffic data.
TGCN [49]: A combination of GCNs and gated recurrent units (GRU), where GCN captures spatial dependencies in the traffic network and GRU captures dynamic changes in the temporal dimension.
ASTGCN [12]: This model employs a spatial-temporal attention mechanism to dynamically capture spatial-temporal correlations in the traffic network, utilizing a graph convolutional network to capture spatial dependencies.
Graph WaveNet [42]: Combines adaptive GCNs with dilated causal convolution. The adaptive graph convolution learns the adjacency matrix from data, while dilated convolution is used to extract long-term dependencies in the temporal dimension.
STSGCN [32]: This model captures local spatial-temporal correlations through a spatial-temporal synchronous modeling mechanism, ensuring coordinated modeling of spatial-temporal dependencies via synchronous spatial and temporal convolutions.
DSTAGNN [18]: Introduces a dynamic spatial-temporal awareness mechanism that explores dynamic associations between nodes in the traffic network, employing a spatial-temporal attention mechanism to capture evolving spatial-temporal dependencies.
STG-NCDE [4]: Designed two neural controlled differential equations (NCDEs) to learn temporal and spatial dependencies in traffic conditions, combining them into a unified framework.
TESTAM [19]: Proposes a mixture of experts (MoE) model that employs three experts for time modeling, with each expert optimized for different traffic patterns, and reframes the gating mechanism as a classification task with pseudo-labeling.

5.3 Experiment settings

The data was divided into training, validation, and test sets in a ratio of 6:2:2. We used the flow data from 12 consecutive historical time steps to predict the flow for the subsequent 12 time steps, with an interval of 5 minutes between successive time steps. All experiments were trained and tested on a Linux server equipped with an Intel 5318Y CPU (2.1 GHz) and an Nvidia Tesla A10 GPU (24 GB). We employed the Adam optimizer with an initial learning rate of 0.0001, a batch size of 32, 3 attention heads, and temporal pyramid convolutional kernel sizes [S1, S2, S3] = [7, 4, 3]. The model consists of a stack of 4 spatial-temporal blocks. Evaluation metrics include Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). The loss function is the Huber loss [16]. For baseline models, we adhered to the default settings presented in their original papers.
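The data preparation and evaluation described above can be sketched as follows. This is an illustrative outline rather than the authors' pipeline: the function names are hypothetical, the split is assumed to be chronological, zero targets are masked in MAPE, and the Huber threshold `delta=1.0` is an assumed default because the text does not state it.

```python
import math

def chronological_split(series, parts=(6, 2, 2)):
    """Split a sequence into train/validation/test in time order (6:2:2)."""
    n, total = len(series), sum(parts)
    n_train = n * parts[0] // total
    n_val = n * parts[1] // total
    return series[:n_train], series[n_train:n_train + n_val], series[n_train + n_val:]

def sliding_windows(series, history=12, horizon=12):
    """Pair each 12-step history with the 12 steps that follow it."""
    count = len(series) - history - horizon + 1
    return [(series[i:i + history], series[i + history:i + history + horizon])
            for i in range(max(count, 0))]

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Percentage error over non-zero targets (masking zeros is an assumption)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)

def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss; delta=1.0 is an assumed threshold."""
    def point(err):
        a = abs(err)
        return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)
    return sum(point(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy series of 100 five-minute readings (illustrative values only).
flow = list(range(100))
train, val, test = chronological_split(flow)
windows = sliding_windows(train)
y_true, y_pred = [10.0, 20.0, 40.0], [12.0, 18.0, 40.0]
print(len(train), len(val), len(test), len(windows))
print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred), huber(y_true, y_pred))
```

The Huber loss is quadratic for small errors and linear for large ones, which makes training less sensitive to outlying flow readings than a pure squared-error loss.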

5.4 Experiment results

Table 2 and Figs. 6 and 7 display the performance of both the baseline models and our proposed model. Deep learning methods exhibit superior predictive accuracy in nonlinear traffic data forecasting tasks, showcasing their ability to capture complex spatial-temporal dependencies. In contrast, the VAR model, constrained by its linear assumptions and suitability for small-scale data, struggles with the complexities of high-dimensional spatial-temporal data, resulting in poorer predictive performance. While GRU, a variant of recurrent neural networks, demonstrates some effectiveness in time series forecasting, its performance in traffic data prediction falls short compared to most other spatial-temporal models, highlighting the critical role of spatial correlations in traffic prediction. Models like STGCN and T-GCN effectively capture spatial relationships between nodes but lack the flexibility to adapt to dynamic changes, particularly during sudden traffic events. ASTGCN, which employs a spatial-temporal attention mechanism to capture dynamic correlations, also faces challenges in handling rapidly changing traffic patterns. On the PEMS03 and PEMS07 datasets, our model slightly underperforms DSTAGNN in certain metrics. This may be attributed to the larger number of nodes in these datasets, which increases the complexity of the STGS module’s spatial similarity calculations. Additionally, the intricate spatial structures and interrelationships within these datasets present significant challenges for DSTSPYN in modeling spatial similarity. Despite these limitations, our model achieves the best performance on the remaining evaluation metrics. The results support DSTSPYN’s robustness and reliability, particularly in long-term forecasting tasks, and its effectiveness in capturing both the spatial and temporal complexities of traffic data.

Table 2 Performance comparison on the PEMS dataset


To further evaluate the performance of DSTSPYN in short-term forecasting, we predicted traffic flow data for the next 5 minutes, 15 minutes, 30 minutes, and 45 minutes on the PEMS04 and PEMS08 datasets, as shown in Table 3. Figs. 8 and 9 illustrate the performance comparison of DSTSPYN and baseline methods under different prediction horizons. For 5-minute short-term predictions, the performance of DSTSPYN is not notably superior. However, as the prediction horizon extends, the accuracy of the baseline methods decreases significantly, whereas our proposed model maintains high prediction accuracy. DSTSPYN achieves the best performance for 15-minute, 30-minute, and 45-minute forecasts. This highlights the effectiveness of the pyramid architecture, which accounts for dynamic spatial-temporal dependencies and adapts well to traffic flow predictions across various time ranges.

We chose node 115, located on the central road, and node 24, situated on the peripheral road, for analysis. In Fig. 10, we plotted the predictions and ground truth of DSTSPYN and DSTAGNN for lead times of 5 and 60 minutes using the test data snapshot. It is evident that DSTSPYN exhibits more accurate predictions of dynamic changes in peak traffic compared to the baseline methods. Specifically, for the peripheral node, DSTAGNN generates several sharp peaks in blue, deviating significantly from the true values. In contrast, the curve of DSTSPYN consistently remains in the middle of the actual values. This indicates that the STGS module can effectively extract features of edge nodes. These findings validate the superiority of DSTSPYN, highlighting its robustness and reliability in long-term forecasting tasks.

5.5 Ablation experiment

To evaluate the effectiveness of various components within DSTSPYN, we introduced the following variants:

w/o sta: Complete removal of the spatial-temporal attention mechanism. We calculate the weights for each layer in the spatial pyramid using learnable weight parameters.
only-gtu: The time pyramid module solely employs a single GTU to calculate temporal dependencies.
w/o tpy: The time pyramid module utilizes a concatenated structure of three GTUs at different scales without using the pyramid structure.
w/o stgs: Exclusion of the spatial-temporal global similarity module.
w/o gf: Removal of the spatial pyramid layer responsible for extracting graph-level features.
w/o spy: The spatial pyramid module no longer utilizes the pyramid structure.

Table 3 Comparison of short-term prediction performance


We conducted ablation experiments on the PEMS04 dataset for the aforementioned variants. The results for MAE, RMSE, and MAPE are presented in Table 4 and Fig. 11. Experimental results demonstrate the critical role of the multi-head spatial-temporal attention mechanism in enhancing model performance. Completely removing the spatial-temporal attention mechanism (w/o sta) significantly degrades the model’s performance on MAE, RMSE, and MAPE metrics. This highlights the importance of the mechanism in modeling long-term traffic flow dependencies and spatial dependencies at various scales. By dynamically adjusting the weights of spatial-temporal features, the multi-head attention mechanism extracts richer spatial-temporal characteristics, substantially improving predictive accuracy. Among all variants, the time pyramid module ranks second in effectiveness, validating its role in enhancing temporal dynamics modeling through temporal feature propagation and fusion. The STGS module follows, excelling at extracting edge node features and calculating similarities between time slices, which further improves overall performance. Performance declines were also observed when removing the spatial pyramid layer responsible for extracting graph-level features (w/o gf) or omitting the pyramid structure within the spatial pyramid module (w/o spy). These findings emphasize the indispensable roles of graph-level feature extraction and the pyramid structure in understanding complex spatial dependencies. Overall, DSTSPYN outperforms all variants across every evaluation metric. This validates the effectiveness of each model component and confirms DSTSPYN’s superiority in traffic flow prediction.

Table 4 Results of Ablation Experiments


Table 5 Results of Ablation Experiments


Table 6 Experimental results on the NYCTaxi dataset


Table 7 Computational cost


To further analyze the impact of the time pyramid on long-term prediction accuracy, we predicted future traffic data for 120 minutes on the PEMS07 and PEMS08 datasets; the results are shown in Table 5 and Fig. 12.

The prediction accuracy of the model without the time pyramid structure decreases significantly faster than that of the original model as the prediction time step increases. This clearly highlights the critical role of the time pyramid structure in enhancing long-term prediction performance.

5.6 Generalization capability

To verify the generalization ability of DSTSPYN, we conducted experiments on the NYCTaxi [9] dataset, a grid-based citywide trip dataset recording taxi pick-up and drop-off demands and details in New York City from 01/01/2014 to 12/31/2014. The time step is set to 12, meaning traffic data from the past 12 time steps is used to predict the next time step. We selected several models that performed well in parallel experiments for comparative analysis on this dataset. The experimental results, shown in Table 6, indicate that our model achieved optimal results across all metrics. This fully validates the generalization ability of DSTSPYN.

5.7 Computational cost analysis

To assess the computational efficiency and memory consumption of the DSTSPYN model in practical applications, we compared its performance with models such as STSGCN, STG-NCDE, TESTAM, and DSTAGNN on the PEMS07 and PEMS08 datasets. The evaluation metrics include the number of parameters, training time (seconds per epoch), inference time (time required for test set inference), and memory consumption. Based on the results presented in Table 7, the following observations can be drawn:

Number of parameters

DSTSPYN comprises 4.86M parameters on the PEMS07 dataset and 2.05M parameters on the PEMS08 dataset. Compared to DSTAGNN and STSGCN, DSTSPYN maintains a relatively moderate parameter size, balancing model accuracy with computational efficiency. This suggests that DSTSPYN achieves high predictive performance without imposing an excessive computational burden. Conversely, TESTAM has the fewest parameters, making it a suitable choice for deployment on resource-constrained devices.

Training time

On the PEMS07 dataset, DSTSPYN requires 1633.8 seconds per epoch for training, which is higher than STSGCN and DSTAGNN but lower than STG-NCDE. This increased training time is attributed to the time pyramid structure in DSTSPYN, which enables the model to capture more complex temporal dependencies. On the PEMS08 dataset, DSTSPYN’s training time is significantly reduced to 66.74 seconds per epoch, demonstrating its good computational efficiency on smaller datasets compared to more computationally intensive models like STG-NCDE.

Inference time

On the PEMS07 dataset, DSTSPYN’s inference time is 216.47 seconds, which is longer than STSGCN and DSTAGNN but slightly shorter than TESTAM. The additional computational complexity introduced by the time pyramid structure accounts for the longer inference time, which is balanced by its improved accuracy in long-term predictions. On the PEMS08 dataset, DSTSPYN achieves an inference time of 6.12 seconds, reflecting relatively high efficiency when applied to smaller datasets.

Memory consumption

On the PEMS07 dataset, DSTSPYN’s memory consumption is 10.87GB, slightly exceeding that of other models. This is primarily due to the time pyramid structure, which requires additional storage for processing multi-scale temporal features. On the PEMS08 dataset, DSTSPYN’s memory consumption is significantly lower at 2.66GB, reflecting more moderate memory usage on smaller datasets. In practical applications, while DSTSPYN demands slightly more memory than other models, this trade-off is often justified in scenarios where long-term prediction accuracy is a priority.

In summary, the DSTSPYN model demonstrates notable advantages in computational cost, particularly in inference efficiency on small-sized and medium-sized datasets. Although its memory consumption is slightly higher than that of other models, it compensates by delivering superior long-term prediction accuracy. Consequently, DSTSPYN proves to be highly practical for applications requiring high-accuracy, long-term time series predictions.

5.8 Visualization of spatial-temporal attention

To enhance the interpretability of DSTSPYN and illustrate the details of its attention modules, we have visualized the spatial-temporal dependencies captured by the model. The intensity of attention is represented by the color depth of the nodes. Figure 13a showcases the local and edge attentions of the first and second attention heads. We observe that the first attention head primarily focuses on the spatial-temporal dependencies of central streets, demonstrating the effectiveness of the first layer of the spatial pyramid in extracting features from the core areas of the road network. In contrast, the second attention head emphasizes the spatial-temporal dependencies of edge streets, showcasing the STGS module’s strength in capturing features from the network’s periphery. Overall, our model distinguishes and captures distinct spatial-temporal patterns across different street regions. Figure 13b illustrates the global spatial-temporal dependencies captured by the third attention head. This attention head is able to recognize complex traffic situations, such as road intersections.

To further illustrate the effectiveness of the spatial-temporal attention mechanism, we conducted visual experiments on the PEMS03 dataset, showcasing the distribution of attention among the top 25 nodes. The color intensity of the points corresponds proportionally to the attention scores between nodes. We selected three nodes with gradually increasing attention relative to node 10 and compared their daily traffic flow curves with that of node 10, as depicted in Fig. 14. Through observation, we discovered that the attention between node 10 and node 4 was the most prominent, indicating a highly consistent traffic dynamic during peak hours. The attention between node 10 and node 16 was slightly weaker, with nearly identical daily traffic flow except during peak hours. However, the attention from node 10 to node 21 was notably weaker, suggesting a less apparent traffic correlation between them. These findings demonstrate that our enhanced attention mechanism contributes to the model’s improved capability to capture spatial-temporal dependencies between nodes in traffic flow prediction.

In summary, our model shows promising performance in traffic flow prediction and is able to extract intricate information from road networks. This provides deeper insights into understanding and predicting traffic behaviors, thereby offering valuable information for decision-makers in traffic management, planning, and related domains.

6 Conclusion

In this study, we propose the Dynamic Spatial-Temporal Similarity Pyramid Network (DSTSPYN) for traffic flow prediction. This method combines the Spatial-Temporal Global Similarity (STGS) module with a spatial-temporal pyramid architecture to address the limitations of traditional GCNs in capturing global structures and spatial-temporal similarities. Moreover, DSTSPYN incorporates an improved attention mechanism to dynamically adjust the weights of spatial-temporal features and employs pyramid-gated convolutional units to effectively capture dynamic temporal dependencies across different scales. Experimental results demonstrate that DSTSPYN outperforms several existing methods on four public traffic datasets, with particularly notable advantages in long-term traffic flow prediction.

However, while the DSTSPYN model exhibits excellent performance across multiple traffic datasets, its advantages are less pronounced when dealing with large-scale traffic data, especially within complex network structures with high-density nodes. This may be due to the increased diversity of spatial-temporal features as road network complexity rises, leading to bottlenecks in the model’s ability to extract global spatial features and model dynamic dependencies between nodes. To address this issue, future work will focus on several improvements. Firstly, we will optimize the model’s ability to extract spatial-temporal similarities at different time steps for large-scale traffic data, potentially through more efficient global feature extraction mechanisms to overcome the limitations of current models in processing large datasets. Additionally, we will incorporate external factors (such as weather, special events, and holidays) to more comprehensively capture the various influences on traffic flow. By integrating multi-source data, we aim to further enhance the model’s predictive accuracy and adaptability across different traffic scenarios. Simultaneously, we will explore new deep learning architectures and optimization strategies, such as applying multimodal learning techniques to combine unstructured data like images and text with existing spatial-temporal data. Through these approaches, we hope to improve the model’s performance and generalization ability under various complex traffic data conditions.

Data Availability

The datasets used in this study are publicly available and can be accessed from the Caltrans Performance Measurement System (PeMS) data source at https://dot.ca.gov/programs/traffic-operations/mpr/pems-source and the NYCTaxi data source at https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

References

Aslam MS, Radhika T, Chandrasekar A, et al (2024) Improved event-triggered-based output tracking for a class of delayed networked t–s fuzzy systems. Int J Fuzzy Syst 1–14
Chen H, Lin M, Liu J et al (2024) Scalable temporal dimension preserved tensor completion for missing traffic data imputation with orthogonal initialization. IEEE/CAA J Autom Sinica 11(10):2188–2190
Cho K, Van Merriënboer B, Bahdanau D, et al (2014) On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259
Choi J, Choi H, Hwang J, et al (2022) Graph neural controlled differential equations for traffic forecasting. In: Proceedings of the AAAI conference on artificial intelligence, pp 6367–6374
Cui Z, Henrickson K, Ke R et al (2020) Traffic Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting. IEEE Trans Intell Transp Syst 21(11):4883–4894
Dauphin YN, Fan A, Auli M, et al (2017) Language Modeling with Gated Convolutional Networks. In: Proceedings of the 34th international conference on machine learning. PMLR, pp 933–941
Duan M (2018) Short-Time Prediction of Traffic Flow Based on PSO Optimized SVM. In: 2018 International conference on intelligent transportation, big data & smart city (ICITBS), pp 41–45
Feng X, Ling X, Zheng H et al (2019) Adaptive Multi-Kernel SVM With Spatial-Temporal Correlation for Short-Term Traffic Flow Prediction. IEEE Trans Intell Transp Syst 20(6):2001–2013
Ferreira N, Poco J, Vo HT et al (2013) Visual exploration of big spatio-temporal urban data: A study of new york city taxi trips. IEEE Trans Visual Comput Graph 19(12):2149–2158
Fu R, Zhang Z, Li L (2016) Using lstm and gru neural network methods for traffic flow prediction. In: 2016 31st Youth academic annual conference of Chinese association of automation (YAC), IEEE, pp 324–328
Guo J, Huang W, Williams BM (2014) Adaptive Kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification. Transp Res Part C Emerg 43:50–64
Guo S, Lin Y, Feng N et al (2019) Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting. Proc AAAI Conf Artif Intell 33(01):922–929
Guo S, Lin Y, Li S et al (2019) Deep Spatial-Temporal 3D Convolutional Neural Networks for Traffic Data Forecasting. IEEE Trans Intell Transp Syst 20(10):3913–3926
Hamed MM, Al-Masaeid HR, Said ZMB (1995) Short-Term Prediction of Traffic Volume in Urban Arterials. J Transp Eng 121(3):249–254
Huang W, Song G, Hong H et al (2014) Deep Architecture for Traffic Flow Prediction: Deep Belief Networks With Multitask Learning. IEEE Trans Intell Transp Syst 15(5):2191–2201
Huber PJ (1992) Robust estimation of a location parameter. In: Breakthroughs in statistics: Methodology and distribution. Springer, pp 492–518
Kamarianakis Y, Prastacos P (2005) Space–time modeling of traffic flow. Comput Geosci 31(2):119–133
Lan S, Ma Y, Huang W, et al (2022) DSTAGNN: Dynamic spatial-temporal aware graph neural network for traffic flow forecasting. In: International conference on machine learning, PMLR, pp 11906–11917
Lee H, Ko S (2024) TESTAM: a time-enhanced spatio-temporal attention model with mixture of experts. arXiv:2403.02600
Li J, Zheng R, Feng H, et al (2024a) Permutation equivariant graph framelets for heterophilous graph learning. IEEE Trans Neural Netw Learn Syst
Li M, Zhang L, Cui L et al (2023) BLoG: Bootstrapped graph representation learning with local and global regularization for recommendation. Pattern Recognit 144:109874
Li M, Micheli A, Wang YG et al (2024) Guest editorial: deep neural networks for graphs: theory, models, algorithms, and applications. IEEE Trans Neural Netw Learn Syst 35(4):4367–4372
Li Y, Yu R, Shahabi C, et al (2017) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. arXiv:1707.01926
Lingras P, Sharma S (2002) Prediction of Recreational Travel Using Genetically Designed Regression and Time-Delay Neural Network Models. Transp Res Record 1:16–24
Lv Y, Duan Y, Kang W, et al (2014) Traffic Flow Prediction With Big Data: A Deep Learning Approach. IEEE Trans Intell Transp Syst 1–9
Okutani I, Stephanedes Y (1984) Dynamic prediction of traffic volume through Kalman filtering theory. Transp Res Part B: Methodol 18:1–11
Pascale A, Nicoli M (2011) Adaptive Bayesian network for traffic flow prediction. 2011 IEEE Statistical Signal Processing Workshop (SSP). IEEE, Nice, France, pp 177–180
Petridis V, Kehagias A, Petrou L et al (2001) A Bayesian Multiple Models Combination Method for Time Series Prediction. J Intell Robot Syst 31:69–89
Qin D (2011) Rise of VAR modelling approach. J Econ Surv 25(1):156–174
Rakkiyappan R, Kumari EU, Chandrasekar A et al (2016) Synchronization and periodicity of coupled inertial memristive neural networks with supremums. Neurocomputing 214:739–749
Shi X, Chen Z, Wang H, et al (2015) Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Adv Neural Inf Process Syst 28
Song C, Lin Y, Guo S et al (2020) Spatial-Temporal Synchronous Graph Convolutional Networks: A New Framework for Spatial-Temporal Network Data Forecasting. Proc AAAI Conf Artif Intell 34(01):914–921
Stock JH, Watson MW (2001) Vector Autoregressions. J Econ Perspect 15(4):101–115
Sun H, Zhang C, Ran B (2004) Interval prediction for traffic time series using local linear predictor. In: Proceedings. The 7th International IEEE conference on intelligent transportation systems (IEEE Cat. No.04TH8749), pp 410–415
Sun S, Zhang C, Yu G (2006) A Bayesian Network Approach to Traffic Flow Forecasting. IEEE Trans Intell Transp Syst 7:124–132
Székely GJ, Rizzo ML (2009) Brownian distance covariance
Tamil Thendral M, Ganesh Babu TR, Chandrasekar A, et al (2022) Synchronization of Markovian jump neural networks for sampled data control systems with additive delay components: Analysis of image encryption technique. Mathematical methods in the applied sciences
Van Der Voort M, Dougherty M, Watson S (1996) Combining Kohonen maps with ARIMA time series models to forecast traffic flow. Transp Res Part C: Emerg Technol 4(5):307–318
Wang J, Deng W, Guo Y (2014) New Bayesian combination method for short-term traffic flow forecasting. Transp Res Part C: Emerg Technol 43:79–94
Wang X, Zeng R, Zou F et al (2023) STTF: an efficient transformer model for traffic congestion prediction. Int J Comput Intell Syst 16(1):2
Wang Z, Li Z, Leng J et al (2022) Multiple pedestrian tracking with graph attention map on urban road scene. IEEE Trans Intell Transp Syst 24(8):8567–8579
Wu Z, Pan S, Long G, et al (2019) Graph WaveNet for deep spatial-temporal graph modeling. arXiv:1906.00121
Xiaoyu H, Yisheng W, Siyu H (2013) Short-term traffic flow forecasting based on two-tier k-nearest neighbor algorithm. Procedia Soc Behav Sci 96:2529–2536
Xie J, Long F, Lv J et al (2022) Joint Distribution Matters: Deep Brownian Distance Covariance for Few-Shot Classification. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, New Orleans, LA, USA, pp 7962–7971
Xu X, Lin M, Luo X et al (2023) HRST-LR: a Hessian regularization spatio-temporal low rank algorithm for traffic data imputation. IEEE Trans Intell Transp Syst 24(10):11001–11017
Yu B, Yin H, Zhu Z (2017) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv:1709.04875
Zhang J, Zheng Y, Qi D (2017) Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. Proc AAAI Conf Artif Intell 31(1)
Zhang W, Zhu F, Lv Y et al (2022) AdapGL: an adaptive graph learning algorithm for traffic prediction based on spatiotemporal neural networks. Transp Res Part C: Emerg Technol 139:103659
Zhao L, Song Y, Zhang C et al (2018) T-GCN: a temporal graph convolutional network for traffic prediction. IEEE Trans Intell Transp Syst 21(9):3848–3858
Zhao Z, Chen W, Wu X et al (2017) LSTM network: a deep learning approach for short-term traffic forecast. IET Intell Transp Syst 11(2):68–75
Zheng C, Fan X, Wang C et al (2020) GMAN: A Graph Multi-Attention Network for Traffic Prediction. Proc AAAI Conf Artif Intell 34(01):1234–1241
Zheng C, Fan X, Pan S, et al (2023) Spatio-Temporal Joint Graph Convolutional Networks for Traffic Forecasting. IEEE Trans Knowl Data Eng 1–14

Acknowledgements

The authors would like to thank the anonymous reviewers for providing helpful comments.

Funding

Open Access funding enabled and organized by CAUL and its Member Institutions. This research was funded by the Natural Science Foundation of China (Grant No. 62376059) and the Natural Science Foundation of Fujian Province (Grant No. 2024J01070).

Author information

Feifei Chen contributed equally to this work.

Authors and Affiliations

College of Computer and Cyber Security, Fujian Normal University, Fuzhou, 350117, Fujian, China
Xing Wang, Feifei Chen, Biao Jin & Mingwei Lin
Digital Fujian Institute of Big Data Security Technology, Fujian Normal University, Fuzhou, 350117, Fujian, China
Xing Wang, Feifei Chen, Biao Jin & Mingwei Lin
Fujian Key Laboratory of Automotive Electronic and Electrical Drive Technology, Fujian University of Technology, Fuzhou, 350117, Fujian, China
Fumin Zou
School of Civil Engineering, The University of Sydney, Sydney, Australia
Ruihao Zeng

Contributions

Feifei Chen (FC) built the model’s backbone, and Xing Wang (XW) optimized it. FC and XW jointly drafted the initial manuscript. Biao Jin (BJ) carried out the preliminary data processing, Mingwei Lin (ML) and Fumin Zou (FZ) conducted part of the experiments, and Ruihao Zeng (RZ) reviewed and revised the paper.

Corresponding author

Correspondence to Ruihao Zeng.

Ethics declarations

Competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and informed consent for data used

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Nomenclature

Table 8 lists the variables and notation used in DSTSPYN.

Table 8 Nomenclature list of the proposed DSTSPYN

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, X., Chen, F., Jin, B. et al. DSTSPYN: a dynamic spatial-temporal similarity pyramid network for traffic flow prediction. Appl Intell 55, 237 (2025). https://doi.org/10.1007/s10489-024-06198-z

Accepted: 13 December 2024
Published: 28 December 2024
DOI: https://doi.org/10.1007/s10489-024-06198-z
