Article

Spiking Neural Networks for Real-Time Pedestrian Street-Crossing Detection Using Dynamic Vision Sensors in Simulated Adverse Weather Conditions

by Mustafa Sakhai 1,*,†, Szymon Mazurek 1,2,†, Jakub Caputa 1,†, Jan K. Argasiński 3 and Maciej Wielgosz 1
1 Faculty of Computer Science, Electronics and Telecommunications, AGH University of Krakow, 30-059 Krakow, Poland
2 Academic Computer Centre AGH, AGH University of Krakow, 30-950 Krakow, Poland
3 Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, 31-007 Krakow, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Submission received: 15 September 2024 / Revised: 15 October 2024 / Accepted: 29 October 2024 / Published: 31 October 2024
(This article belongs to the Special Issue Autonomous and Connected Vehicles)

Abstract

This study explores the integration of Spiking Neural Networks (SNNs) with Dynamic Vision Sensors (DVSs) to enhance pedestrian street-crossing detection in adverse weather conditions—a critical challenge for autonomous vehicle systems. Utilizing the high temporal resolution and low latency of DVSs, which excel in dynamic, low-light, and high-contrast environments, this research evaluates the effectiveness of SNNs compared to traditional Convolutional Neural Networks (CNNs). The experimental setup involved a custom dataset from the CARLA simulator, designed to mimic real-world variability, including rain, fog, and varying lighting conditions. Additionally, the JAAD dataset was adopted to allow for evaluations using real-world data. The SNN models were optimized using Temporally Effective Batch Normalization (TEBN) and benchmarked against well-established deep learning models, concerning their accuracy, computational efficiency, and energy efficiency in complex weather conditions. This study also conducted a comprehensive analysis of energy consumption, highlighting the significant reduction in energy usage achieved by SNNs when processing DVS data. The results indicate that SNNs, when integrated with DVSs, not only reduce computational overhead but also dramatically lower energy consumption, making them a highly efficient choice for real-time applications in autonomous vehicles (AVs).

1. Introduction

The landscape of Artificial Intelligence (AI) has undergone significant transformation in recent years, with its influence extending across various scientific and technological domains. A notable illustration of this impact can be observed in the autonomous vehicle sector, which heavily leverages AI and neural networks. Self-driving automobiles integrate data from many internal systems and external sensors, propelling a revolution in transportation. The concept of driverless vehicles, equipped with swift computational reflexes and optimized urban navigation capabilities, exemplifies the potential of this technology. However, despite these advancements, the widespread implementation of autonomous vehicles faces obstacles, including regulatory challenges and limitations in current data processing methodologies [1,2]. Furthermore, the proliferation of onboard systems, such as electronic control units (ECUs) and additional sensors, raises concerns regarding energy efficiency.
Waymo One, a pioneer in the commercial deployment of autonomous vehicles, utilizes a combination of LIDAR (Light Detection and Ranging), radars, and long-range RGB cameras [3]. Although this self-driving vehicle operates without human intervention, its functionality is limited to controlled and predictable environments. The constraints of RGB cameras in diverse lighting and weather scenarios necessitate investigating alternative technological solutions.
Dynamic Vision Sensors are a promising solution to these challenges [4,5]. In contrast to conventional RGB cameras, DVS cameras demonstrate superior performance in low-light environments and offer notable advantages in contrast detection. Their distinctive asynchronous operational mode enables them to concentrate on dynamic environmental changes, resulting in reduced latency and minimized data redundancy. These characteristics enhance environmental perception capabilities and contribute to improved energy efficiency [5].
The integration of DVS with SNNs introduces an innovative methodology for processing visual data. SNNs, which emulate the brain’s impulse-based communication system, excel at managing the data stream from DVS cameras with minimal latency and energy expenditure. This synergy represents a substantial advancement in the development of advanced vision systems for autonomous vehicles, potentially catalyzing a transformation in the domains of robotics and AI [5].
This study aims to investigate the efficacy of SNNs in the tasks of pedestrian street-crossing detection and intention prediction. Given the scarcity of readily available data, we employ a simulation environment to generate the dataset for our subsequent experiments. Within this simulated setting, we replicate pedestrian street-crossing scenarios across various urban landscapes and weather conditions, capturing these scenes using both RGB and DVS cameras. We then assess the SNN’s performance in detecting and predicting pedestrian street-crossing behavior, utilizing input sets of sequential frames. The SNN’s performance is benchmarked against established deep learning models, evaluating their respective capabilities when processing data acquired in both favorable and adverse weather conditions.
We further conduct comparative experiments using the JAAD dataset [6], a collection of videos of pedestrians walking and crossing streets in a range of weather conditions, from sunny days to snowstorms. The goal was to check whether our models handle real-world recordings as well as they handle the CARLA simulations. Our findings demonstrate that SNNs can achieve performance comparable to or surpassing that of Convolutional Neural Networks (CNNs) in specific scenarios, particularly when processing DVS data under challenging weather conditions.
Lastly, we analyze the energy usage of the tested models to investigate their applicability within AV systems. We openly share the dataset and the code used to conduct the experiments.
Although some DVS datasets exist for pedestrian detection under adverse weather conditions [7], the current study emphasizes simulated data of pedestrian street-crossing detection scenarios to thoroughly explore performance under diverse and extreme weather conditions using SNNs. The use of simulations allows for precise control over environmental variables, which is not always possible with real-world datasets.

2. Related Work

2.1. ANNs in Pedestrian Street-Crossing Detection and Intention Prediction in AV

The success of deep learning in intelligent transportation systems has led to numerous solutions using artificial neural networks (ANNs) for pedestrian detection, pedestrian behavior, and intention prediction [8,9,10].
Various deep learning architectures are used for those tasks, with CNNs being one of the most established ones. Research by Kaya et al. focusing on pedestrian crosswalk detection proposed an automatic detection system using image processing techniques [11]. Their evaluations of the Faster R-CNN and YOLOv7 models achieved high accuracy. In another study, CNNs were used for trajectory prediction, reaching high performance on a wide range of datasets and outperforming some of the existing temporal models [12].
CNNs are also used in combination with recurrent models to enhance the involvement of temporal information present in the data. Liu et al. propose a hybrid approach using CNNs to detect pedestrians as an initial step [13]. Then, they use the detection results to construct graphs, which are later processed with gated recurrent units [14] and graph convolutions [15]. With such a set of methods, they achieve high crossing prediction accuracy.
More complex architectures, such as the ones based on attention [16,17], have recently shown remarkable performance across many domains, with autonomous driving included. Youan et al. present a transformer model that includes both the temporal relationships between observed pedestrians and their mutual interactions to improve trajectory prediction performance [18]. In their recent work, Rasouli and Kotseruba adapt a transformer architecture to effectively utilize different data modalities derived from images and their annotations [19]. They achieve state-of-the-art results for trajectory and action prediction in JAAD and PIE [20] datasets.
The development of autonomous driving systems is also relevant from the psychological standpoint of the users interacting with them. A controlled experiment by Qi and Menozzi examined the impact of physical crossings on stress levels and critical gap measurements. They found that the lack of physical crossings significantly lowered measured critical gaps and perceived stress levels, highlighting the need for detailed analysis in future field studies [21]. Further research involved an experiment where participants were divided into four groups to assess their reactions to an Intent Communication System (ICS) in an autonomous vehicle. The study revealed varying levels of trust and perceived safety among the groups, depending on their prior knowledge and the presence of the ICS [22].

2.2. Intelligent AV Systems in Adverse Weather Conditions

The problem of solving pedestrian street-crossing-related tasks in AV systems in adverse weather conditions is also widely recognized. Kulhandjian et al. consider the case of pedestrian street-crossing detection and avoidance in AV systems at night time [23]. They show that in such problems, ANNs can effectively be used, reaching nearly perfect prediction accuracy in detecting pedestrians. While promising, the thermal vision cameras used there can struggle when the weather conditions are extreme. Such cases are not evaluated, as the authors treat night scenes as the adverse condition and do not analyze additional weather effects. Additionally, the approach is based on an ensemble of three CNNs, increasing the computational burden. Furthermore, the predictions of each network in the ensemble are manually weighted, adding another hyperparameter that potentially requires tuning when the data distribution changes.
In their recent work, Weihmayr et al. propose an interesting system integrating LIDAR data for pedestrian detection in heavy weather conditions [24]. They do not rely on neural networks; they only use filters to evaluate radar signal attenuation. Their results show that the integration of different sensor modalities can provide informative data for the task of pedestrian detection and recognition. However, the filter-based approach can exhibit limited flexibility, as the method is not based on a learning algorithm. This potentially poses a challenge, as for some types of data the filter may degrade its performance, requiring further extension. Additionally, the authors focus on the prediction of detection accuracy degradation, not on its direct improvement.
In another study, Tumas et al. propose a novel dataset for pedestrian detection in various weather conditions, containing recordings from thermal vision cameras [25]. They show that a model based on YOLO v3 reached satisfying detection results, with mAP @ IoU = 0.5 surpassing 90% in some cases. While effective in the evaluated cases, thermal vision cameras may struggle for reasons similar to those noted above for the work of Kulhandjian et al. The paper shows an example of a recording in heavy rain, where the visibility of any meaningful sources of information is greatly limited.
In our work, we aim to evaluate different approaches to solving vision-related problems in heavy weather conditions. We base our approach purely on learnable neural networks and DVS data sources, exploiting only visual information.

2.3. Integration of DVS Data with Deep Learning Models

Contemporary sensory technologies, including DVS, have been incorporated into autonomous vehicles, introducing novel data modalities for training deep learning models [26]. Wan et al. introduced an innovative methodology for efficiently converting DVS events into frames and developed a feature extraction network capable of recycling features to reduce computational demands [27]. Their approach was evaluated using a custom dataset comprising diverse real-world pedestrian scenarios, yielding improved accuracy in pedestrian street-crossing detection. Moreover, the detection speed was enhanced by approximately 20% compared to previous methods, achieving a processing rate of about 26 frames per second with an accuracy of 87.43%.
Research has additionally demonstrated that the application of transfer learning enables the training of networks capable of processing DVS data with minimal latency. Chen’s work has revealed that detection in real-world environments can be executed at an impressive rate of 100 frames per second while maintaining an average test precision of 40.3% [28].
Several recent studies have explored event-based vision for pedestrian detection in real environments [7]. These studies leverage the high temporal resolution of DVSs to detect pedestrian movement effectively. However, our study uniquely focuses on pedestrian street-crossing detection in adverse weather conditions.

2.4. SNNs in AV Systems and Their Integration with DVS Data

Recent advancements in SNNs have demonstrated their increasing potential in autonomous driving applications, facilitated by the development of novel optimization techniques for training these networks [29,30]. Pascarella et al. illustrated that the combination of frame-based and DVS cameras enables the training of CNNs to effectively address steering angle estimation challenges [31].
Cordone et al. adapted popular artificial neural network (ANN) models into spiking variants for pedestrian and vehicle detection in autonomous vehicle systems utilizing DVS data, yielding promising outcomes [32].
Kim et al. proposed an SNN-based version of the YOLO algorithm, achieving performance nearly equivalent to ANN-based models while offering faster training times and improved energy efficiency [33,34].

2.5. Multi-Sensor Methods

A multi-sensor approach using monocular vision and millimeter-wave (MMW) radar has proven to be effective in pedestrian and vehicle tracking. The fusion of these sensors allows the system to compensate for the limitations of individual sensors, enhancing detection accuracy, reducing false positives, and improving robustness under diverse environmental conditions [35]. Similarly, [36] shows that using this approach enables robust multi-pedestrian tracking under challenging conditions, especially in low-visibility scenarios.
Monocular 3D object detection can also be significantly improved through a multi-sensor approach, such as the integration of depth-aware convolution layers for enhanced spatial awareness, as demonstrated in M3D-RPN. This approach leverages both global and local features from 2D and 3D perspectives, bridging the performance gap between monocular methods and LIDAR-based systems by utilizing shared multi-class detection and optimizing depth estimation [37].
While these methods demonstrate advancements in the field, our setup and goal differ in key ways. Specifically, we focus on a single-sensor approach, comparing DVSs.

2.6. Neuromorphic Platforms for Efficient SNN Deployment

The benefits of SNNs, namely fast and efficient processing, are even more pronounced when the networks are deployed on dedicated hardware [38].
Massa et al. utilized the Loihi chip [39] for gesture recognition with DVS data, achieving results on par with artificial neural networks (ANNs) while substantially reducing energy consumption [40]. Similarly, Viale et al. implemented a comparable approach for object classification in autonomous vehicles, also using the Loihi chip, further supporting these conclusions [41].
The body of research demonstrating the effectiveness of SNNs on neuromorphic platforms is much broader, though it falls outside the scope of this article. However, we want to highlight that such solutions exist and have also been explored in the autonomous vehicle (AV) domain. These developments show promise in extending the applicability of spike-based networks even further.

2.7. Observations from the Literature Study

After the literature review, we recognize the importance of pedestrian street-crossing detection and prediction problems. We note that current state-of-the-art methods are based on ANNs. Of note, most of the studies focus on general pedestrian detection and on pure performance improvement as measured by chosen metrics, without regard for energy usage constraints posed by the AV systems.
Modern sensor systems are also promising as sources of data for such systems, offering unique capabilities for dealing with difficult conditions, such as bad weather. However, we were not able to identify any data source providing recordings of pedestrian street crossing using classic and DVS data in both good and bad weather conditions. We therefore decided to prepare a novel simulation dataset adhering to our requirements and used it to evaluate the potential of SNNs as models capable of effectively detecting and predicting pedestrian behavior in low-resource environments.

3. Theoretical Introduction

This section provides details on the mathematical foundations of the tested SNN models, the CARLA [42] simulation environment used for data creation, and the principles governing DVS.

3.1. Spiking Neural Networks and Neuron Models

Given the significant success of ANNs, adopting proven architectures for training SNNs seems a logical step. In fact, such approaches have demonstrated high effectiveness, enabling performance levels that are competitive with ANNs [43]. However, directly transferring these architectures is not straightforward, as the principles underlying SNNs require certain adjustments to the models [44]. We will first outline the general principles of spiking neurons, followed by a detailed description of the specific neuron model used in this study. Next, we will discuss the learning method and the modifications of ResNet architecture [45] used to enable the processing of spike trains.

3.1.1. Basic Principles of Spiking Neurons

In SNNs, neurons are modeled as units that transmit information through spike trains, which are represented as vectors of binary values. Converting ANNs into SNNs, therefore, involves replacing the nonlinear activation functions with spiking neuron models. The charge of the neuron in a given timestep t is given by
$$H[t] = f(V[t-1], X[t]),$$
where $X$ is the input vector, $V$ is the discharge function, and $f$ is the neuron function, which depends on the type of chosen model and will be described later. The discharge function describes the neuron’s behavior after emitting a spike. It can incorporate hard or soft resets, in both cases resulting in an instant decrease in the membrane potential. In this paper, we use the hard reset approach, thus we give only the corresponding equation:
$$V[t] = H[t] \cdot (1 - S[t]) + V_{reset} \cdot S[t],$$
where $S$ is the neuronal firing function and $V_{reset}$ is the reset voltage value, to which the membrane potential comes back after emitting the spike. The equation describing the firing can be denoted as
$$S[t] = \Theta(H[t] - V_{th}),$$
where $V_{th}$ is the threshold voltage value and $\Theta$ is the Heaviside function, defined as
$$\Theta(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}$$
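The charge, fire, and hard-reset steps above can be traced on a single neuron. The following is a minimal sketch, not the implementation used in the experiments: the function names are hypothetical, and the neuron function f is replaced by a plain non-leaky integrator purely to keep the arithmetic easy to follow.

```python
def heaviside(x):
    """Theta(x): 1 for x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0


def neuron_step(v_prev, x, charge_fn, v_th=1.0, v_reset=0.0):
    """One timestep of a spiking neuron: charge, fire, hard reset.

    Returns the emitted spike S[t] and the new membrane potential V[t].
    """
    h = charge_fn(v_prev, x)           # H[t] = f(V[t-1], X[t])
    s = heaviside(h - v_th)            # S[t] = Theta(H[t] - V_th)
    v = h * (1 - s) + v_reset * s      # V[t] = H[t](1 - S[t]) + V_reset * S[t]
    return s, v


def run(inputs, charge_fn, v_th=1.0, v_reset=0.0):
    """Feed a sequence of scalar inputs to one neuron and collect its spikes."""
    v, spikes = v_reset, []
    for x in inputs:
        s, v = neuron_step(v, x, charge_fn, v_th, v_reset)
        spikes.append(s)
    return spikes


def integrate(v_prev, x):
    """Placeholder neuron function (assumption): integrate without leak."""
    return v_prev + x


spikes = run([0.5, 0.25, 0.5, 0.25, 1.0], integrate)
print(spikes)  # [0, 0, 1, 0, 1]
```

The potential accumulates to 0.75 over the first two steps, crosses the threshold of 1 on the third step, and is reset to V_reset before integration resumes.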

3.1.2. Neuron Model

Various biologically plausible neuron models exist [46], differing primarily in their level of biological realism and computational complexity. In this study, we employed the Parametric Leaky Integrate and Fire (PLIF) model [47], which extends the widely-used Leaky Integrate and Fire (LIF) model. Unlike LIF, PLIF introduces the ability to learn the τ parameter, a membrane time constant that governs the rate at which the membrane potential decays over time. The PLIF neuron model is described as
$$f(V[t-1], X[t]) = V[t-1] + \frac{1}{\tau}\left(X[t] - (V[t-1] - V_{reset})\right),$$
where $\frac{1}{\tau} = \mathrm{sigmoid}(a)$. Here, $a$ is a learnable parameter shared across all neurons in a given layer. The sigmoid function is introduced to ensure that $\tau > 1$.
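As a small worked example of the PLIF charge equation, the sketch below evaluates it for scalar inputs. The helper names are hypothetical and the input values are chosen only for illustration; the point is to show how the learnable parameter a maps to the time constant, and that a = 0 corresponds to τ = 2.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def tau_from_a(a):
    """Membrane time constant implied by the learnable parameter a."""
    return 1.0 / sigmoid(a)


def plif_charge(v_prev, x, a, v_reset=0.0):
    """PLIF charge: V[t-1] + (1/tau) * (X[t] - (V[t-1] - V_reset))."""
    inv_tau = sigmoid(a)               # 1/tau lies in (0, 1), hence tau > 1
    return v_prev + inv_tau * (x - (v_prev - v_reset))


# With a = 0, 1/tau = 0.5, so the potential moves halfway towards the input:
# 0.4 + 0.5 * (1.0 - 0.4) is approximately 0.7.
h = plif_charge(v_prev=0.4, x=1.0, a=0.0)
print(tau_from_a(0.0), h)
```

Larger values of a shorten the time constant, so the membrane potential tracks the input more closely; smaller values make the neuron integrate over a longer window.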

3.1.3. Surrogate Gradient Training of Spiking Neural Networks

The sparsity of SNNs, while offering several advantages, also introduces challenges when training these networks. Since non-continuous functions cannot be differentiated, the widely used gradient-based optimization techniques for training ANNs cannot be directly applied to SNNs. The surrogate gradient method is one approach to approximate the discontinuous functions in SNNs with continuous ones, thereby enabling the use of backpropagation and gradient optimization [29].
In the forward pass, the neuron’s response remains as previously described, represented by a Heaviside function. The derivative of this function corresponds to Dirac’s delta:
$$\Theta'(x) = \begin{cases} \infty, & x = 0 \\ 0, & x \neq 0 \end{cases}$$
which makes the direct application of backpropagation impossible. To address this, during the backward pass, the Heaviside function is approximated by a selected continuous function. In this study, we employed a sigmoid function approximation, defined as
$$\sigma(x, \alpha) = \frac{1}{1 + \exp(-\alpha x)},$$
with $\alpha$ being the hyperparameter controlling the smoothness. Its derivative with respect to $x$ can now be expressed as
$$\sigma'(x, \alpha) = \alpha \, \sigma(x, \alpha) \, (1 - \sigma(x, \alpha)),$$
which is continuous and differentiable. Thus, the firing function during the backward pass becomes
$$S[t] = \sigma(H[t] - V_{th}, \alpha),$$
allowing for the computation of the gradient and error backpropagation. In this work, the PLIF neuron was adopted with the following hyperparameters: initial $\tau = 2$, $V_{th} = 1$, and the smoothing factor for the surrogate function $\alpha = 4$.
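The split between the forward and backward pass can be illustrated without any deep learning framework. The sketch below is an assumption-laden illustration with hypothetical function names: the forward pass uses the Heaviside step, while the backward pass uses the derivative of the sigmoid surrogate, which by the chain rule carries a factor of α. A finite-difference check confirms the analytic derivative.

```python
import math

ALPHA = 4.0   # smoothing factor reported in the text
V_TH = 1.0    # threshold voltage reported in the text


def surrogate(x, alpha=ALPHA):
    """Sigmoid surrogate: 1 / (1 + exp(-alpha * x))."""
    return 1.0 / (1.0 + math.exp(-alpha * x))


def surrogate_grad(x, alpha=ALPHA):
    """Derivative of the surrogate with respect to x."""
    s = surrogate(x, alpha)
    return alpha * s * (1.0 - s)


def fire_forward(h, v_th=V_TH):
    """Forward pass: exact Heaviside firing."""
    return 1.0 if h - v_th >= 0 else 0.0


def fire_backward(h, v_th=V_TH, alpha=ALPHA):
    """Backward pass: dS/dH taken from the surrogate instead of the step."""
    return surrogate_grad(h - v_th, alpha)


# Finite-difference check of the analytic derivative at x = 0.3.
eps = 1e-6
numeric = (surrogate(0.3 + eps) - surrogate(0.3 - eps)) / (2 * eps)
print(fire_forward(0.9), fire_forward(1.0), fire_backward(1.0), numeric)
```

The gradient peaks at the threshold, where it equals α/4, and decays smoothly on both sides, so neurons whose potential is close to firing receive the largest updates.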

3.1.4. ResNet Architecture Adaptations

In this study, we opted to adapt the ResNet architecture into its spiking variant, motivated by the model’s simplicity and its considerable success in various ANN tasks. Fang et al. [48] noted that converting ResNet into a spiking form encounters challenges such as vanishing gradients and difficulties in maintaining proper identity mapping within the residual block for most neuron models. To address these issues, we adopted the solution proposed in the cited work by replacing the addition operation between the residual block’s output, $A[t]$, and its input, $I[t]$, with one of the suggested operands, $G$:
$$G[t] = (\neg A[t]) \wedge I[t]$$
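Because spikes are binary, this element-wise operation reduces to simple bit logic. The sketch below uses a hypothetical helper name and flat Python lists in place of feature maps; it shows that whenever the residual branch is silent, the block passes its input through unchanged, which is the identity mapping the modification is meant to preserve.

```python
def iand(block_output, block_input):
    """Element-wise (NOT A) AND I for two binary spike vectors."""
    if len(block_output) != len(block_input):
        raise ValueError("spike vectors must have the same length")
    return [(1 - a) & i for a, i in zip(block_output, block_input)]


a = [0, 1, 0, 1]   # A[t]: spikes from the residual branch
i = [0, 0, 1, 1]   # I[t]: spikes entering the block
print(iand(a, i))  # [0, 0, 1, 0]

# A silent residual branch yields an identity mapping of the input.
print(iand([0, 0, 0, 0], i))  # [0, 0, 1, 1]
```

The output also stays binary, unlike plain addition, which can produce values above 1 when both branches spike.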
As another modification, we use a Temporally Effective Batch Normalization (TEBN) layer in replacement of the standard batch normalization (BN) layer. This step was guided by the findings of [49], where the authors prove the superiority of TEBN over other normalization techniques [50,51] in SNNs due to its ability to capture the richer properties of the spike trains. The normalized output $\hat{X}[t]$ from the TEBN layer is defined as
$$\hat{X}[t] = \hat{\gamma}[t] \, \frac{X[t] - \mu_{total}}{\sqrt{\sigma_{total}^2 + \epsilon}} + \hat{\beta}[t],$$
$$\hat{\gamma}[t] = \gamma \times p[t], \quad \hat{\beta}[t] = \beta \times p[t].$$
Here, in each TEBN layer, $\gamma$ and $\beta$ are time-invariant BN parameters and $p[t]$ is a set of learnable weight parameters. The mean $\mu_{total}$ and variance $\sigma_{total}^2$ are calculated from samples across all timesteps, and $\epsilon$ is a small constant that ensures numerical stability.
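The following sketch applies the TEBN formula to a single channel to make the role of p[t] concrete. It is a simplification under stated assumptions: the input is a list of T timesteps with a handful of values each (a real layer would pool statistics over batch and spatial dimensions per channel), the biased variance is used, and the function name is hypothetical.

```python
import math


def tebn_single_channel(x, p, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize x (T timesteps, each a list of values) with TEBN-style scaling.

    Mean and variance are shared across all timesteps; the affine
    parameters are rescaled per timestep by the learnable weights p[t].
    """
    values = [v for step in x for v in step]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)

    out = []
    for t, step in enumerate(x):
        gamma_t = gamma * p[t]     # gamma_hat[t] = gamma * p[t]
        beta_t = beta * p[t]       # beta_hat[t]  = beta  * p[t]
        out.append([gamma_t * (v - mean) * inv_std + beta_t for v in step])
    return out


x = [[0.0, 2.0], [4.0, 6.0]]       # T = 2 timesteps, 2 values each
p = [1.0, 0.5]                     # illustrative per-timestep weights
out = tebn_single_channel(x, p, eps=0.0)
print(out)
```

With p[t] fixed to 1 for every timestep, the computation reduces to ordinary batch normalization with statistics pooled over time; the per-timestep weights are what let the layer treat early and late timesteps differently.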
For the evaluation of these modifications’ effectiveness, we refer the reader to Appendix C, where we perform a comparison study, as well as to the literature originally introducing them [48,49].

3.1.5. Network Readout

As the spiking network produces $T$ output spikes given $T$ input ones, we averaged the output spike train to produce the classification logit. Therefore, the network’s output $\hat{Y}$ was equal to
$$\hat{Y} = \frac{1}{T} \sum_{t=1}^{T} y[t],$$
where $y[t]$ is the output spike train value at timestep $t$.
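A short sketch of this rate-based readout follows, assuming a two-class output layer and a hypothetical function name. Each timestep contributes one binary vector, and the time average gives a firing rate per class that serves as the logit.

```python
def readout(spike_train):
    """Average a spike train of shape (T, num_classes) over time."""
    timesteps = len(spike_train)
    num_classes = len(spike_train[0])
    return [sum(step[c] for step in spike_train) / timesteps
            for c in range(num_classes)]


# Four timesteps, two output neurons (illustrative values).
y = [[1, 0], [1, 0], [0, 1], [1, 0]]
logits = readout(y)
print(logits)  # [0.75, 0.25]
```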

3.2. CARLA Simulator and Perception System

CARLA is an open-source simulator for autonomous driving research, developed by the Computer Vision Centre and the Embodied AI Foundation [42]. Built on Unreal Engine 4, it offers a highly realistic urban environment with a variety of scenarios and weather conditions, enabling the safe and controlled testing of algorithms. Its extensive customization options, including adjustable environments, vehicle models, and sensor configurations, make it a powerful tool for advancing autonomous driving research. CARLA supports the integration of custom algorithms and large-scale experiments, making it widely adopted by research institutions and companies worldwide.
One of CARLA’s key features is its detailed customization of weather conditions, offering a wide range of independent parameters that allow for the creation of specific environmental scenarios. Key configurable settings include cloudiness, precipitation, wind intensity, sun position, fog density, and road wetness. These parameters can be adjusted to simulate conditions from clear skies to severe storms, strong winds, or dense fog, providing a flexible platform for testing autonomous vehicle systems in varied and challenging environments. For instance, cloudiness and precipitation can be tuned to mimic everything from a sunny day to heavy rain, while the sun’s position can be controlled through its azimuth and altitude angles. Fog settings influence its density and range, adding realism to low-visibility situations. Moreover, CARLA enables the creation of puddles and wet road surfaces to replicate post-rain conditions. We leverage these features to create diverse driving scenarios, allowing for a thorough evaluation of autonomous driving algorithms under different and challenging conditions.
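As an illustration of how such weather settings can be scripted, the sketch below defines a few presets and applies one through the CARLA Python API. The preset names and all numeric values are hypothetical and are not the configurations used in this study; the host and port are the simulator defaults and are likewise assumptions. The `carla` package is imported lazily so that the presets can be inspected without a running simulator.

```python
# Hypothetical presets; the values are illustrative, not taken from the paper.
WEATHER_PRESETS = {
    "clear_noon": dict(
        cloudiness=10.0, precipitation=0.0, precipitation_deposits=0.0,
        wind_intensity=5.0, sun_azimuth_angle=0.0, sun_altitude_angle=70.0,
        fog_density=0.0, fog_distance=0.0, wetness=0.0),
    "heavy_rain": dict(
        cloudiness=90.0, precipitation=80.0, precipitation_deposits=70.0,
        wind_intensity=60.0, sun_azimuth_angle=0.0, sun_altitude_angle=30.0,
        fog_density=10.0, fog_distance=20.0, wetness=80.0),
    "dense_fog_night": dict(
        cloudiness=60.0, precipitation=0.0, precipitation_deposits=0.0,
        wind_intensity=10.0, sun_azimuth_angle=0.0, sun_altitude_angle=-30.0,
        fog_density=80.0, fog_distance=5.0, wetness=20.0),
}


def apply_weather(world, preset_name):
    """Apply one preset to an existing carla.World object."""
    import carla  # requires the CARLA Python package
    world.set_weather(carla.WeatherParameters(**WEATHER_PRESETS[preset_name]))


def connect_and_apply(preset_name, host="localhost", port=2000):
    """Connect to a running CARLA server and set the requested weather."""
    import carla
    client = carla.Client(host, port)
    client.set_timeout(10.0)
    world = client.get_world()
    apply_weather(world, preset_name)
    return world
```

Calling `connect_and_apply("heavy_rain")` against a running server would switch the current map to the rainy preset. Keeping the presets in plain dictionaries makes it easy to log the exact conditions alongside each recorded clip.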
ScenarioRunner is another tool offered by the CARLA simulator for defining and executing traffic scenarios. It provides the ability to create and validate complex traffic scenarios that can be used to evaluate and benchmark autonomous driving agents. ScenarioRunner allows for the selection of maps, weather, sensors, and textures and manages them in a controlled way.

3.3. Dynamic Vision Sensor (DVS)

The DVS, commonly referred to as an event camera (Figure 1), functions differently from traditional cameras by capturing changes in intensity asynchronously as a continuous stream of events. Each event represents a change in brightness and encodes information about its pixel location, timestamp, and polarity. Event cameras provide several advantages over conventional cameras, such as a high dynamic range, an elimination of motion blur, and high temporal resolution in the microsecond range [5].
To trigger an event, the change in logarithmic intensity must surpass a specified threshold, resulting in a polarity that can be either positive or negative.
CARLA allows access to the DVS camera during simulations, operating in a uniform sampling manner between two consecutive synchronous frames. This requires a high sampling frequency to replicate the high temporal resolution characteristic of a real event camera.
It is important to note that if there is no difference in pixel values between two consecutive synchronous frames, the camera will not output an image. This can happen either in the first frame or in situations where there is no movement between frames. While the DVS camera shares several features with traditional cameras, it also has unique properties that arise from the principles governing event cameras.
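The event-generation rule described in this section can be sketched for two consecutive intensity frames. This is a simplified model under several assumptions: at most one event per pixel per frame pair, a single global threshold, and a small offset added before the logarithm to avoid log(0). The function name and all numeric values are hypothetical, and a real sensor or the simulator may differ in these details.

```python
import math


def dvs_events(prev_frame, next_frame, timestamp, threshold=0.3, eps=1e-3):
    """Emit (x, y, t, polarity) for pixels whose log intensity change
    exceeds the threshold between two frames."""
    events = []
    for y, (row_prev, row_next) in enumerate(zip(prev_frame, next_frame)):
        for x, (i_prev, i_next) in enumerate(zip(row_prev, row_next)):
            delta = math.log(i_next + eps) - math.log(i_prev + eps)
            if abs(delta) > threshold:
                polarity = 1 if delta > 0 else -1
                events.append((x, y, timestamp, polarity))
    return events


prev = [[0.2, 0.5],
        [0.5, 0.5]]
nxt = [[0.8, 0.5],
       [0.5, 0.1]]

print(dvs_events(prev, nxt, timestamp=0.01))   # [(0, 0, 0.01, 1), (1, 1, 0.01, -1)]
print(dvs_events(prev, prev, timestamp=0.02))  # [] : a static scene produces no events
```

The second call shows why identical consecutive frames yield no output: without a brightness change, no pixel crosses the threshold.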

4. Materials and Methods

This section outlines the methodology used to generate the dataset and train the models for pedestrian street-crossing detection in autonomous vehicle (AV) systems. The workflow is depicted in Figure 2.

4.1. Dataset Generation from CARLA Simulator

Our primary goal was to investigate the detection and prediction of pedestrian crossing behavior under challenging weather conditions using neural networks with data from various sensors. Initially, we searched for a dataset that captured pedestrians crossing streets in diverse weather scenarios, including DVS and RGB images. The dataset utilized in our experiments is available at the following link: https://zenodo.org/records/11409259 (accessed on 15 October 2024). Although numerous road-themed datasets exist online, such as n-cars [32], JAAD [6], and DSEC [52], along with various studies analyzing pedestrian crossing intentions [10,21], none fully met our requirements. These datasets either lack DVS data, offer limited instances of pedestrian crossings or insufficient labeling of such events, or do not provide enough video footage featuring adverse weather conditions. As a result, we opted to generate a custom dataset using a simulation environment.
The simulation was developed using CARLA software version 0.9.13 and the scenario repository from the ARCANE project, which focuses on adversarial scenarios for autonomous vehicles [53].
In our simulation environment, the primary aim was to establish a predictable and controlled testing setup rather than to replicate a highly complex or fully randomized traffic scenario. To ensure consistency and reliable evaluation of the models, we chose a simplified configuration consisting of one vehicle and one pedestrian. The pedestrian’s crossing behavior occurs at an arbitrary moment during the simulation, introducing sufficient variability for testing while maintaining control over the conditions. This design facilitates the random occurrence of pedestrian crossings while concentrating on key challenges pertinent to our research, such as adverse weather and low-light conditions. By managing the number, speed, and location of the pedestrian in this manner, we ensure our models are tested under a range of diverse yet systematically manageable conditions, avoiding excessive randomness that could complicate the analysis.

4.1.1. Video Simulation

The simulation began with the creation of urban scenarios where pedestrians would cross streets under various conditions. The scenarios were designed to include different lighting conditions and weather effects, which are crucial for testing the robustness of the detection models.

4.1.2. RGB and DVS Data Collection

Two data types were collected during the simulation: RGB images and DVS events. RGB images provide high-resolution color information, while DVS data offers high temporal resolution, capturing changes in the scene with minimal latency. This dual-modality data is essential for training models that can operate effectively in dynamic and challenging environments.

4.1.3. Manual Data Curation

The raw data collected from the simulation underwent preprocessing to ensure quality and consistency. This step involved data cleaning, where corrupted or irrelevant data was manually removed, and manual labeling, where each frame or event was annotated to indicate the presence or absence of a pedestrian on the street.

4.2. Task Formulation

For the experiments, we formulate two training tasks: detection and prediction. Note that our definitions of these tasks differ from those commonly used in the domain; we therefore explain them in detail.

4.2.1. Detection

At first, we explore the problem of pedestrian street-crossing detection, where we aim to identify if the pedestrian is crossing the street in any frame of a given clip. We randomly extract clips of a given length from the dataset, assigning their labels based on frame labels included in every clip. Intuitively, we label a clip as positive if any of the frames in it has a positive label. Otherwise, the clip is considered a negative example. Formally, this approach can be defined as
$$\max_{i \in \{1, \ldots, N\}} x_i,$$
where $x_i$ is the label of the i-th frame of an extracted clip of length $N$. Examples of label assignments based on clip frame labels can be seen in Figure 3. Such clips with corresponding labels are used as inputs to the network.
Networks were tested under two clip length scenarios: one with clip lengths of 9 frames and 8 frames overlapping, and another with clip lengths of 30 frames and 29 overlapping. The overlap was introduced only in the training set.
With this approach, we observed a class imbalance, with a predominance of negative examples. Therefore, the cross-entropy loss function was weighted for positive samples proportionally to the imbalance observed in a given setting.
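The clip construction and labeling rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the positive-class weight is assumed to be the ratio of negative to positive clips, which is one common way to weight a loss "proportionally to the imbalance".

```python
from typing import List, Tuple


def make_clips(frame_labels: List[int], clip_len: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start_index, clip_label) pairs for a sliding window over one video."""
    stride = clip_len - overlap  # e.g. 9 frames with 8 overlapping gives a stride of 1
    clips = []
    for start in range(0, len(frame_labels) - clip_len + 1, stride):
        window = frame_labels[start:start + clip_len]
        # A clip is positive if any of its frames is positive (max over frame labels).
        clips.append((start, max(window)))
    return clips


def positive_class_weight(clip_labels: List[int]) -> float:
    """Assumed weighting scheme: number of negative clips per positive clip."""
    positives = sum(clip_labels)
    negatives = len(clip_labels) - positives
    return negatives / positives if positives else 1.0


# Toy video of 16 frames in which the pedestrian crosses during frames 10 and 11.
toy_labels = [0] * 10 + [1] * 2 + [0] * 4
toy_clips = make_clips(toy_labels, clip_len=4, overlap=3)
toy_weight = positive_class_weight([label for _, label in toy_clips])
```

Setting `overlap=0` reproduces the non-overlapping extraction used outside the training set.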

4.2.2. Prediction

Building on the detection task, we also evaluated the predictive capabilities of our networks through proxy clip classification. In this case, we specify the prediction horizon of length H, which defines the number of consecutive frames that directly precede the first frame in which the pedestrian starts crossing the street in a given video. We label those frames as positive. To obtain negative ones, we extracted H consecutive frames from the videos in which the pedestrian did not cross the street, starting from a randomly chosen frame. The number of negative frames to extract was configured to match the number of positive samples, preventing a class imbalance. There was no limit to the number of negative samples that could be extracted from a non-event video. We ensured no overlap between negative sets of frames extracted from a given non-event video.
Due to the limited number of positive frames, particularly for shorter prediction horizons (H = 1 s), we chose to construct only clips of length 9 with 8 frames overlapping. In this case, positive clips were created from consecutive positive frames. The same condition was applied to negative label assignments, this time including only negative frames. An example of this clip-labeling strategy can be found in Figure 4.
The rest of the experiment organization was configured in the same way as for the previously described detection task.
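The frame selection for the prediction task can be sketched as below. This is a hedged illustration under stated assumptions: the horizon is expressed in frames (the conversion from seconds depends on a frame rate not given here), the function names are hypothetical, and non-overlapping negative windows are obtained by sampling from a randomly offset grid, which is only one way to satisfy the no-overlap condition.

```python
import random
from typing import List, Optional, Tuple


def positive_horizon(frame_labels: List[int], horizon: int) -> Optional[Tuple[int, int]]:
    """Return [start, end) of the `horizon` frames directly preceding the first crossing frame."""
    if 1 not in frame_labels:
        return None  # non-event video: no positive frames
    first_crossing = frame_labels.index(1)
    start = first_crossing - horizon
    if start < 0:
        return None  # not enough frames before the crossing starts
    return (start, first_crossing)


def negative_windows(num_frames: int, horizon: int, count: int,
                     rng: random.Random) -> List[Tuple[int, int]]:
    """Sample up to `count` non-overlapping [start, end) windows from a non-event video."""
    if num_frames < horizon:
        return []
    # Assumption: windows lie on a grid with a random offset, so they cannot overlap.
    offset = rng.randrange(0, min(horizon, num_frames - horizon + 1))
    starts = list(range(offset, num_frames - horizon + 1, horizon))
    rng.shuffle(starts)
    return sorted((s, s + horizon) for s in starts[:count])
```

Calling `negative_windows` with `count` equal to the number of positive windows keeps the two classes balanced, as described above.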

4.3. Benchmarking the Solution Against the JAAD Dataset

To contextualize our findings within existing research, we adapted the JAAD dataset. Although it does not provide DVS data, its annotations matched those found in the proposed simulation dataset. Namely, we were able to determine whether the pedestrian is crossing the street in a given frame and to categorize the videos into good- and bad-weather subsets. Videos were classified as being recorded in bad weather if the conditions included snow or rainfall. We performed experiments using only RGB data.
Given the limited number and duration of clips, especially under bad weather conditions, we were not able to create positive samples in the same way as in the prediction experiments. Therefore, we evaluated only the detection task.
We also observed some labeling problems in the annotations in comparison with our simulation data, which we describe in Appendix B.

4.4. Model Training

Once the data were prepared, they were used to train two types of neural networks.

Neural Network Training

Both SNNs and CNNs were trained using DVS and RGB data. No augmentations were applied, with resizing being the only preprocessing step.

4.5. Model Evaluation

After training, the models were evaluated on a separate test set, including clips not seen during training. The performance of the SNNs and CNNs was compared using metrics such as accuracy, F1 Score, and Area Under the ROC Curve (AUROC).
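For reference, the two threshold-related metrics can be computed as in the sketch below. It uses the standard definitions of the F1 score and of AUROC as a pairwise ranking probability; the function names are illustrative. It also shows the degenerate case in which a model assigns the same score to every sample and predicts only the negative class, which yields an AUROC of 0.5 and an F-score of 0.

```python
from typing import List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auroc(y_true: List[int], scores: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Degenerate model: constant score, every sample classified as negative.
constant_auroc = auroc([0, 1, 0, 1], [0.3, 0.3, 0.3, 0.3])
all_negative_f1 = f1_score([0, 1, 0, 1], [0, 0, 0, 0])
```

The quadratic pairwise loop is chosen for clarity; a rank-based computation is preferable for large test sets.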

4.6. Measuring Energy Usage

As energy-efficient processing is one of the prominent benefits of SNNs, we compare the energy usage of each trained network. We follow the methodology in the research by Chen et al. [54], measuring the number of multiply-and-accumulate (MAC) and accumulation (AC) operations in a given network, and translating that to the energy usage of such operations in 45 nm technology [54,55].
In the standard feedforward ANNs used in this research, the number of operations is constant in every forward pass. Therefore, for this type of network, we compute the number of operations during a single forward pass with a dummy input tensor of a shape identical to one of the processed clips in experimental tasks.
For SNNs, the number of emitted spikes varies depending on the input sample. Due to this fact, for SNNs, we estimated the consumption by performing the inference on every sample of the test dataset with the corresponding trained model and averaged the number of operations.
We chose bad weather DVS data in the detection task as the evaluation for SNNs. Corresponding evaluations for ANNs were also performed using single-channel dummy data to mimic the DVS data format.
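The conversion from operation counts to energy can be sketched as follows. The per-operation energies are an assumption: 4.6 pJ per MAC and 0.9 pJ per AC are commonly cited figures for 32-bit floating-point arithmetic in 45 nm technology, but the values actually used are not stated in this section. The function names and the split of SNN operations into MACs and ACs per sample are likewise illustrative.

```python
from typing import Sequence

# Assumed per-operation energies for 32-bit floating-point arithmetic in 45 nm
# technology (commonly cited values, not taken from the text above).
E_MAC_PJ = 4.6  # one multiply-and-accumulate, in picojoules
E_AC_PJ = 0.9   # one accumulate, in picojoules
PJ_TO_MJ = 1e-9  # 1 pJ = 1e-12 J = 1e-9 mJ


def ann_energy_mj(macs_per_forward_pass: float) -> float:
    """ANN: the operation count is constant, so one forward pass is enough."""
    return macs_per_forward_pass * E_MAC_PJ * PJ_TO_MJ


def snn_energy_mj(macs_per_sample: Sequence[float], acs_per_sample: Sequence[float]) -> float:
    """SNN: spike counts depend on the input, so energy is averaged over test samples."""
    if not macs_per_sample or len(macs_per_sample) != len(acs_per_sample):
        raise ValueError("expected two non-empty sequences of equal length")
    per_sample = [
        (macs * E_MAC_PJ + acs * E_AC_PJ) * PJ_TO_MJ
        for macs, acs in zip(macs_per_sample, acs_per_sample)
    ]
    return sum(per_sample) / len(per_sample)
```

In this sketch the MAC term for the SNN covers layers that receive non-binary input (typically the first layer), while spike-driven layers contribute only accumulations.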

4.7. Experimental Setup and Data Preprocessing

In this section, we detail the experimental setup using the generated datasets to tackle various tasks. We evaluated four networks: ResNet18; its spiking adaptation, Spiking SEW ResNet18 with TEBN (SPS R18T); and two models designed for video classification, SlowFast R50 [56] and MViTv2 [57].
For the ANNs, pre-trained weights were employed: ResNet18 was trained on the ImageNet1k dataset [58], while SlowFast R50 used the Kinetics400 dataset [59]. Due to architectural modifications, SPS R18T and MViTv2 were trained from scratch. We also note that MViTv2 was modified to accommodate the needs of the experiments. For a full description of those modifications, we refer the reader to Appendix D.
Due to the temporal nature of tasks, the ANN version of ResNet was trained using a “pseudotemporal” scheme, where each frame in the input clip is classified separately. The output logit for each frame contributes to a final prediction, calculated as an average of each logit, as described by Equation (13). We refer to this model as PT ResNet18.
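The pseudotemporal scheme can be illustrated with the short sketch below. It assumes the final prediction is the plain arithmetic mean of per-frame logits, as the sentence above describes; `frame_model` is a hypothetical stand-in for any single-frame classifier that returns one logit.

```python
import math
from typing import Callable, Sequence


def pseudotemporal_logit(clip: Sequence[object],
                         frame_model: Callable[[object], float]) -> float:
    """Classify every frame independently and average the resulting logits."""
    if not clip:
        raise ValueError("clip must contain at least one frame")
    logits = [frame_model(frame) for frame in clip]
    return sum(logits) / len(logits)


def sigmoid(logit: float) -> float:
    """Map the averaged logit to a probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-logit))


# Toy example: each "frame" is already a number and the model is the identity.
toy_logit = pseudotemporal_logit([1.0, -1.0, 3.0], frame_model=lambda frame: frame)
toy_probability = sigmoid(toy_logit)
```

Because frames are processed independently, the model has no access to motion between frames; temporal information enters only through the averaging.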
The videos were divided into training, validation, and testing subsets, with 15% of the total videos set aside for testing. Of the remaining videos, 15% were allocated for validation, and the rest were used for training. The frames were processed according to the specific tasks (see the task descriptions in Section 4.2).
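A video-level split with these proportions can be sketched as follows. The function name, the fixed seed, and the rounding rule are assumptions made for illustration; the key point is that the split is made over videos, so frames of one video never end up in two subsets.

```python
import random
from typing import List, Sequence, Tuple


def split_videos(video_ids: Sequence[str], test_frac: float = 0.15,
                 val_frac: float = 0.15, seed: int = 0
                 ) -> Tuple[List[str], List[str], List[str]]:
    """Split video identifiers into train, validation, and test lists."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_frac)   # 15% of all videos
    test, rest = ids[:n_test], ids[n_test:]
    n_val = round(len(rest) * val_frac)    # 15% of the remaining videos
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


toy_ids = [f"video_{i:03d}" for i in range(100)]
train_ids, val_ids, test_ids = split_videos(toy_ids)
```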
All networks were optimized using the AdamW [60] optimizer to minimize weighted binary cross-entropy loss. The initial learning rate was set to $10^{-3}$, with a weight decay factor of $10^{-1}$. Batch sizes varied between 4 and 64, depending on hardware capabilities. Training was conducted for a maximum of 100 epochs, with early stopping if the validation loss did not improve for 8 consecutive epochs. The best-performing model, as determined by validation loss, was selected for testing. Performance metrics included AUROC and F-score, which are well suited for addressing imbalanced classification challenges.
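The stopping and model-selection rule can be sketched independently of any training framework. The function below is an illustrative helper, not part of the published code: it takes the sequence of validation losses and returns the epoch whose checkpoint would be kept and the epoch at which training would halt.

```python
from typing import Sequence, Tuple


def select_best_epoch(val_losses: Sequence[float], patience: int = 8,
                      max_epochs: int = 100) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run) under early stopping on validation loss."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    last_epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, last_epoch


# Loss improves until epoch 2 and then stagnates.
toy_losses = [1.0, 0.8, 0.7] + [0.9] * 20
toy_best, toy_last = select_best_epoch(toy_losses)
```

With a patience of 8, training in the toy example halts at epoch 10 and the checkpoint from epoch 2 is kept.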
We resized the images from the generated dataset to 450 × 256 to reduce the computational demands of the experiments. This operation introduced some artifacts into the images, although they were not significant enough to prevent model training. For details, we refer the reader to Appendix A.
For all experiments, the code was developed in Python 3.10 with PyTorch 2.2.0 [61] and the Lightning 2.1.3 framework, along with SpikingJelly 0.0.0.15 [62], a library implementing abstractions related to SNNs. All the parameters are given in Table 1. The experimental code is available at https://github.com/szmazurek/snn_dvs (accessed on 15 October 2024).

5. Results

5.1. Street-Crossing Detection

The results of the evaluation of the detection task are summarized in Table 2. For shorter time windows of 9 frames, PT ResNet18 exhibited superior performance across most cases. Specifically, in the bad-weather subset, PT ResNet18 achieved outstanding results, with an AUROC of 0.9882 and an F-score of 93.05 for DVS data, and an AUROC of 0.9194 with an F-score of 58.02 for RGB data. The SPS R18T model showed competitive performance in the DVS modality, with an AUROC of 0.9504 and an F-score of 57.94, though it lagged significantly in RGB performance.
In normal weather conditions, all networks demonstrated robust detection capabilities, particularly in the DVS modality, where the AUROC scores exceeded 0.94 for each network. The PT ResNet18 continued to perform well with RGB data, obtaining an AUROC of 0.9516 and an F-score of 76.95. Conversely, the SPS R18T experienced a noticeable drop in performance in RGB data, registering an AUROC of 0.7995 and an F-score of 49.26.
Surprisingly, MViTv2 failed to converge in any of the setups, showing random predictions with an AUROC of 0.5. An F-score of 0 further indicates that the model fails to identify any positive samples, classifying all samples as negative. When extending the clip length to 30 frames, the observed trends were consistent with those for shorter clips. PT ResNet18 continued to exhibit the best performance overall, although it was slightly surpassed by SlowFast R50 in the good-weather subset. For instance, in bad weather conditions with RGB data, SlowFast R50 markedly improved, increasing its AUROC from 0.5126 in shorter clips to 0.9546 in longer clips, and it improved similarly for DVS data in good weather conditions, achieving an AUROC of 0.9594 and an F-score of 81.68. The SPS R18T model demonstrated comparable performance across various conditions and modalities, regardless of clip length. The performance of MViTv2 remained at the same level as before.

5.2. Street-Crossing Prediction

The results for prediction task experiments are shown in Table 3. For the 1 s predictive horizon, we can see the remarkable performance of the SPS R18T in the bad-weather subset for both modalities, where it outperforms both SlowFast R50 and PT ResNet18. The performance advantage is most visible with the DVS modality. Notably, for the same data modality, SlowFast R50 failed to converge, reaching an AUROC of only 0.5827 and a 0% F-score.
This changed in the good-weather subset, where classic ANNs performed better than the SNN. The best results were achieved on the DVS modality, with the top one reached by PT ResNet18 with an AUROC of 0.9455 and an F-score of 93.17%. In this scenario, MViTv2 shows a performance above randomness levels only for normal-weather RGB subsets, hinting at the high dependency of the model convergence on the specific data distributions.
For the 5 s predictive horizon, again the best class separability in the bad-weather subset was reached by the SPS R18T. Contrary to the shorter lookback observations, this time, RGB data turned out to be more informative for the SNN, where it reached an AUROC of 0.882 and an F-score of 79.39. MViTv2 once again failed to provide any meaningful predictions.
In the normal-weather subset, the networks maintain robust performance on the DVS modality, except for the SPS R18T, which shows a large drop in performance compared to the shorter predictive horizon version. This is similar to what can be seen in the previous experiments with single-frame prediction for this network. A large performance decrease can be seen for the RGB modality, where none of the networks reached results better than random guesses, with AUROC scores below 0.5. For normal RGB data, MViTv2 once again shows improved performance compared with other modalities in good and bad weather, emerging as the best-performing model for that subset and modality.

5.3. Street-Crossing Detection on JAAD Dataset

The results of the street-crossing detection in the JAAD dataset can be seen in Table 4. Overall, the performances of nearly all of the models are lower than the ones observed on the simulation dataset. This could, however, be expected, as this dataset contained fewer clips, making it harder for the models to converge.
SPS R18T was better than the PT ResNet18 and SlowFast R50 counterparts for the shorter clips of 9 frames in bad weather, with an AUROC of 0.5966 and an F-score of 74.03%. It was, however, surpassed by MViTv2, which shows the best performance in this subset. On longer clips in the same subset, the performance gap is reduced, but the spiking network shows the best class separability with an AUROC of 0.642. This is interesting, considering that in the simulation dataset, RGB data posed a significant challenge for SPS R18T.
These trends change in the good-weather subset. The performance of the SNN remains similar to that on the bad-weather data, while PT ResNet18 and SlowFast R50 show significant performance improvements on both short and long samples. The performance of MViTv2 deteriorated to the random level once again.

5.4. Energy Usage Measurement

The results of the energy usage evaluations for the networks used in the previous experiments are shown in Table 5. SPS R18T shows the lowest energy consumption among all networks. For input clips of 9 frames, its average energy consumption was 50.45 mJ, compared with nearly 10 times more for SlowFast R50 and more than 3 times more for PT ResNet18.
Interestingly, the energy usage of SPS R18T decreased when the number of frames in the clip increased, dropping to only 30.37 mJ. The energy usage of ANNs rose proportionally to the number of frames in the input clip.
A possible explanation is that a better-trained SNN model processes data more efficiently, generating fewer spikes during inference. This reduction in spikes would directly result in lower energy consumption, as SNNs are designed to be energy-efficient by firing spikes only when necessary. Thus, as the model becomes more refined, it may rely on fewer spikes, reducing the overall energy usage, in contrast to traditional ANNs, where energy usage scales linearly with the number of frames.
Based on the energy usage measurements, a performance-energy efficiency tradeoff can be established. In Figure 5 two Pareto plots are shown. From the plots, we can see that in the evaluated task and data subset, SNNs provided the best tradeoff between performance and energy usage. It is also tempting to view MViTv2 as the second-best model in that comparison, yet it has to be noted that its performance is at a random level, making the model unusable.
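The notion of a Pareto front used in such plots can be made concrete with the sketch below. The model names and numbers are hypothetical placeholders and are not results from this study. A model is on the front if no other model is at least as good on both energy (lower is better) and score (higher is better) and strictly better on one of them.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (model name, energy in mJ, performance score)


def pareto_front(points: List[Point]) -> List[str]:
    """Return the names of models that are not dominated by any other model."""
    front = []
    for name, energy, score in points:
        dominated = any(
            other_energy <= energy and other_score >= score
            and (other_energy < energy or other_score > score)
            for _, other_energy, other_score in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical values for illustration only.
toy_models = [
    ("model_a", 50.0, 0.95),
    ("model_b", 150.0, 0.99),
    ("model_c", 500.0, 0.90),
    ("model_d", 40.0, 0.50),
]
toy_front = pareto_front(toy_models)
```

Note that `model_d` stays on the front purely because of its low energy, even though its score is at the random level. This is why a front member still has to be checked against a minimum acceptable performance.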

6. Discussion

Our experimental results provide nuanced insights into the performance of traditional and spiking neural networks for detecting pedestrian street-crossing behavior under various weather conditions and modalities.
In the task of crossing behavior detection, SPS R18T exhibited a strong performance when leveraging DVS data in adverse weather scenarios. It achieved an AUROC of 0.9542 and an F-score of 75.5% in the 30-frame detection task, demonstrating its effectiveness in identifying dynamic events such as pedestrian movement in challenging visual conditions, including low light and precipitation (see Table 2).
In contrast, traditional ANNs like PT ResNet18 and SlowFast R50 performed well across all conditions, particularly excelling in normal weather environments. This underscores their effectiveness in processing high-resolution, color-rich data (see Table 2). Such capability is crucial for the accurate detection of pedestrian street crossings, where visual details must remain clear and unobstructed by environmental factors.
The performance of MViTv2 raises interesting questions about the model’s adaptability to data distributions. In detection tasks, it consistently failed to converge across all tested cases. For the prediction task, the situation was nearly identical, except for a surprising improvement on good-weather RGB data. The situation was similar when testing the model on the JAAD dataset, where it reached meaningful predictive capabilities in one of the scenarios.
Such variance in the MViTv2 performance possibly stems from its sensitivity to data distribution. For datasets and tasks where the differences between training samples are minute, it may be difficult for the network to extract meaningful feature embeddings.
While the SPS R18T performs competitively in bad-weather DVS modalities, it exhibits a noticeable performance drop when using RGB data. For example, its performance with RGB data dropped significantly regardless of weather conditions, as noted in its lower AUROC and F-score results compared to other networks. This suggests that while SNNs are particularly adept at processing dynamic visual changes enabled by DVS technology, they may not yet fully exploit the static, detailed visual information provided by RGB imagery as effectively as traditional ANNs.
In the proposed task of behavior prediction, we observe that the patterns present in detection task results prevail. With shorter predictive horizons in bad weather, SPS R18T substantially outperforms ANNs, achieving better results even with RGB data. In normal weather, ANNs come to the forefront, especially on RGB data. With a longer predictive horizon, ANNs perform better, but SNN still prevails in bad weather. Those observations support previous claims that SNNs are valuable when dealing with DVS data, especially in bad weather conditions (see Table 2).
The results of the detection task evaluation on the JAAD dataset align with those previously seen in simulation data, with ANNs performing robustly in good-weather scenarios. However, for the bad-weather clips, the SNN performed significantly better than classic ANNs on shorter clips and comparably on longer ones, despite working only with RGB data. This requires further exploration, as it shows that in certain problems SNNs can perform comparably to ANNs on data from traditional sensors. We leave the exploration of this problem for future work.
The observed performance patterns indicate a strategic approach to selecting network types based on sensory data and environmental conditions. SNNs utilizing DVS data prove particularly effective in adverse weather conditions due to their capability to robustly process dynamic visual changes. In contrast, ANNs excel with RGB data in favorable weather, taking advantage of the rich color and texture information that is readily accessible. These distinctions emphasize the importance of choosing the appropriate neural network architecture tailored to specific sensor data and task requirements.
Lastly, SNNs have shown great energy efficiency compared to ANNs. In certain instances, the difference reached several orders of magnitude. Moreover, their energy usage does not seem to be strongly affected by the length of the analyzed sample.
The Pareto plots in Figure 5 show that for the evaluated case, SNNs provided the best performance–energy usage ratio. However, it should be noted that in the cases where SNN performance was significantly lower than ANNs’, these relationships would change in favor of the latter. Nevertheless, the number of operations performed by ANNs when making predictions is constant; therefore, reducing their energy usage would be harder. On the other hand, SNNs show large room for performance improvement in the cases where they achieve subpar results, making future research in that direction promising.
These findings are especially interesting from the perspective of AVs, as these systems must operate in energy-constrained environments.

6.1. Challenges and Potential of SNNs with RGB Data in Pedestrian Street-Crossing Detection

In this study, we observed a performance gap between SNNs and CNNs when processing RGB data for pedestrian street-crossing detection, particularly in adverse weather conditions. This gap can be attributed to the inherent differences in how these networks handle data. CNNs are highly specialized for processing dense, continuous visual data, such as RGB images, by leveraging convolutional layers to extract spatial features across multiple resolutions, including edges, textures, and color gradients. RGB data is rich in spatial information that is essential for accurate pedestrian street-crossing detection, especially in complex scenes where color, lighting, and texture variations help distinguish objects and individuals.
SNNs, in contrast, are optimized for processing sparse, event-based data, such as that produced by DVSs, which capture temporal changes in pixel intensities asynchronously. SNNs rely on spike-based information transmission, where the representation of data is encoded in the timing of spikes, which inherently limits their ability to fully capture the detailed spatial and chromatic information present in RGB images. When applied to RGB data, SNNs may lose finer details crucial for tasks like pedestrian street-crossing detection, particularly when subtle color differences or gradients are critical for distinguishing between pedestrians and background objects.
This limitation presents several opportunities for advancing the application of SNNs in RGB-based tasks. One possible approach involves spiking convolutional networks that integrate convolutional operations within the spiking domain, allowing for more effective extraction of spatial features while maintaining the temporal efficiency of SNNs. In particular, the development of spiking convolutional layers that mimic the hierarchical feature extraction of CNNs could significantly improve SNNs’ ability to process RGB data, potentially bridging the gap in performance.
Additionally, exploring more advanced spike encoding schemes for RGB data could enhance SNNs’ capacity to process detailed visual information. Traditional rate-based encoding, which translates pixel intensities into spike rates, may not fully capture the richness of RGB images. Temporal coding strategies, such as latency coding or phase-based coding, could improve how spatial and color information is represented in spike trains, thereby enhancing SNNs’ ability to leverage RGB data.
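The difference between rate coding and latency coding can be illustrated for a single pixel as below. This is a simplified sketch, not an encoder used in this study: intensities are assumed to be normalized to [0, 1], rate coding is modeled as Bernoulli sampling per time step, and latency coding uses a linear mapping from intensity to spike time.

```python
import random
from typing import List


def rate_encode(intensity: float, num_steps: int, rng: random.Random) -> List[int]:
    """Rate coding: at each time step, spike with probability equal to the intensity."""
    return [1 if rng.random() < intensity else 0 for _ in range(num_steps)]


def latency_encode(intensity: float, num_steps: int) -> List[int]:
    """Latency coding: a single spike whose timing encodes intensity (brighter is earlier)."""
    train = [0] * num_steps
    if intensity > 0.0:
        spike_time = round((1.0 - intensity) * (num_steps - 1))
        train[spike_time] = 1
    return train


rng = random.Random(0)
bright_rate = rate_encode(0.9, num_steps=100, rng=rng)  # many spikes
bright_latency = latency_encode(1.0, num_steps=10)      # one spike at the first step
mid_latency = latency_encode(0.5, num_steps=11)         # one spike in the middle
```

Latency coding emits at most one spike per pixel, which is attractive for energy usage, whereas rate coding needs many time steps to approximate an intensity accurately.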
Furthermore, hybrid models that combine CNN-SNN architectures could offer a promising solution. By utilizing CNNs to preprocess and extract spatial features from RGB images, and then passing the processed information to SNNs for efficient temporal analysis, such models could exploit the strengths of both network types. This hybrid approach could be particularly beneficial in applications where both spatial resolution and temporal precision are necessary, such as pedestrian street-crossing detection under dynamic, real-world conditions.
Finally, the fusion of multi-modal sensory data, such as integrating RGB and DVS inputs, offers another path forward. By allowing CNNs to handle the rich spatial details of RGB data and SNNs to process the high-temporal resolution of DVS data, we can develop systems that leverage complementary sensor modalities to improve robustness, particularly in challenging environments like low-light or adverse weather conditions.
Despite these challenges, SNNs’ advantages in low energy consumption and event-based processing make them a compelling choice for autonomous vehicle systems, where energy efficiency and real-time processing are critical. Future work will focus on improving the integration of SNNs with RGB data through architectural and encoding innovations, potentially enabling their wider application in complex visual processing tasks.

6.2. Limitations in Using Simulation Data

Although simulation data are more easily accessible and controllable than real data, we acknowledge the need for extensive evaluations of the proposed algorithms in real-world scenarios. We show initial experiment results on the JAAD dataset with RGB frames, noting that DVS recordings were unavailable for this dataset. Thus, in the future, we aim to expand our evaluation to existing datasets for both modalities, using RGB-to-event conversion methods to obtain missing DVS data. We have already developed such pipelines, the results of which can be seen in Figure 6. While promising, this approach requires thorough investigation and evaluation. It is therefore beyond the scope of this article and will be addressed in our future work.
We also plan to incorporate additional datasets, such as those presented in [7], as part of our future work. Integrating these real-world datasets will require modifications to our existing pipeline. At this stage, the use of simulation data has provided a strong foundation for subsequent experiments with real-world datasets. Moreover, we aim to evaluate the model’s performance on mixed datasets, combining both real and simulated data, to assess its robustness across varied scenarios.

7. Conclusions

This study successfully illustrates the use of Spiking Neural Networks (SNNs) integrated with Dynamic Vision Sensors (DVSs) to enhance pedestrian street-crossing detection in challenging weather conditions, which is essential for advancing autonomous vehicle technologies. Our research concentrated on evaluating the effectiveness of SNNs in comparison to traditional Convolutional Neural Networks (CNNs) across various sensory modalities and environmental scenarios.
Key findings from our experiments include:
  • Improved Detection in Challenging Conditions: When combined with DVS, SNNs exhibit strong performance in low-light and adverse weather scenarios, utilizing the high dynamic range and temporal resolution of DVS to identify subtle movements and variations in pixel intensity.
  • Energy Efficiency: SNNs show considerable reductions in energy consumption compared to conventional CNNs, making them suitable for real-time applications in autonomous driving, where power efficiency and rapid response times are critical.
  • Challenges with RGB Data: Although SNNs excel with DVS data, their performance with standard RGB data poses challenges, particularly in fluctuating weather conditions, where traditional CNNs generally outperform them. However, SNNs demonstrated robust performance with RGB data in the JAAD dataset. This issue will be investigated in future research.
  • Influence of Clip Length and Complexity: Our findings indicate that SNNs may enhance their performance as task complexity increases, such as with longer clip lengths for pedestrian prediction, suggesting their ability to manage more intricate temporal sequences.
This research emphasizes the significance of selecting the appropriate neural network architecture and sensor modality tailored to specific operational environments, advocating for a hybrid approach where SNNs are utilized alongside traditional artificial neural networks (ANNs) to leverage the strengths of each according to situational requirements.
Future research will tackle the identified challenges, particularly in improving the processing capabilities of SNNs with RGB data and further optimizing the integration of SNNs with DVS for wider applications in intelligent transportation systems. Moreover, investigating more advanced models and training methods to enhance the robustness and accuracy of pedestrian street-crossing detection under varying environmental conditions will be crucial. We will also prioritize the integration of real-world and mixed datasets in our experiments.

Author Contributions

Conceptualization, M.S. and S.M.; methodology, M.S. and S.M.; software, M.S., S.M. and J.C.; validation, M.S., S.M. and J.C.; formal analysis, M.S. and S.M.; investigation, M.S.; resources, M.S. and S.M.; data curation, M.S. and J.C.; writing—original draft preparation, M.S., S.M. and M.W.; writing—review and editing, M.S., S.M., J.K.A. and M.W.; visualization, S.M. and M.W.; supervision, M.W.; project administration, M.W.; funding acquisition, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded within the project of the Minister of Science and Higher Education “Support for the activity of Centers of Excellence established in Poland under Horizon 2020” on the basis of the contract number MEiN/2023/DIR/3796. This publication is supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement Sano No 857533. This publication is supported by the Sano project, carried out within the International Research Agendas programme of the Foundation for Polish Science, co-financed by the European Union under the European Regional Development Fund.

Data Availability Statement

The code for the experiments is available at https://github.com/szmazurek/snn_dvs (accessed on 15 October 2024). Instructions for downloading the dataset are also provided there.

Acknowledgments

We gratefully acknowledge Poland’s high-performance Infrastructure PLGrid ACK Cyfronet AGH for providing computer facilities and support within computational grant no. PLG/2023/016767.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AV: Autonomous vehicle
ANN: Artificial neural network
CNN: Convolutional Neural Network
SNN: Spiking Neural Network
DVS: Dynamic Vision Sensor
LIDAR: Light Detection and Ranging
ECU: Electronic Control Unit
RGB: Red, Green, Blue
JAAD: Joint Attention in Autonomous Driving Dataset
PIE: PIE Dataset
ICS: Intent Communication System
PLIF: Parametric Leaky Integrate and Fire
LIF: Leaky Integrate and Fire
TEBN: Temporally Effective Batch Normalization
BN: Batch Normalization
AUROC: Area Under the ROC Curve
PT ResNet18: Pseudotemporal ResNet-18
SPS R18T: Spiking Sew ResNet-18 with Temporally Effective Batch Normalization
MAC: Multiply-and-accumulate

Appendix A

As stated in this article, to reduce the computational demands of the various experiments conducted, we opted to resize the images from their original dimensions of 1600 × 600 for RGB and 1542 × 587 for the DVS modality down to a resolution of 450 × 256. This resizing process requires the use of an interpolation procedure. Since interpolation methods utilize the values of neighboring pixels to calculate new pixel values in the resized image, this presents a potential challenge when handling DVS data. This modality relies on highly accurate measurements of light change at specific points, meaning that approximations could compromise quality. We investigated this issue and found that, from a human observer’s perspective, the DVS images were significantly impacted by interpolation, regardless of the method employed. Examples of these transformations can be observed in Figure A1.
Nevertheless, we noted in preliminary trials that networks trained on this data were still capable of converging and achieving high performance using the nearest neighbor interpolation method. Consequently, we proceeded with this approach in the experiments presented in this paper. Other resizing techniques, such as seam carving or super-resolution networks, might also address this issue, and we will reserve their evaluation for future research.
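The property that distinguishes nearest neighbor interpolation from the other methods can be seen in the minimal sketch below. It is a pure-Python illustration with a hypothetical function name and a simple floor-based index mapping; image libraries may use a different pixel-center convention. Every output pixel is a copy of one input pixel, so no new intermediate values are created, whereas bilinear or bicubic interpolation blends neighboring values.

```python
from typing import List


def resize_nearest(image: List[List[int]], new_h: int, new_w: int) -> List[List[int]]:
    """Resize a 2D image by copying, for each output pixel, the nearest source pixel."""
    old_h, old_w = len(image), len(image[0])
    resized = []
    for row in range(new_h):
        src_row = min(old_h - 1, int(row * old_h / new_h))
        resized.append([
            image[src_row][min(old_w - 1, int(col * old_w / new_w))]
            for col in range(new_w)
        ])
    return resized


toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
toy_small = resize_nearest(toy_image, new_h=2, new_w=2)
```

Downscaling this way still discards pixels, so sparse events that fall on skipped rows or columns are lost. This is consistent with the visible degradation described above.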
Figure A1. Comparison of an original-size DVS image with the pedestrian crossing the street in bad weather with the same frame after interpolation. The top image is the original frame, and the grid below shows the interpolation results for different methods. (Top left): bilinear, (top right): nearest neighbor, (bottom left): areal, (bottom right): bicubic. Note that after the interpolation, the image quality degrades significantly, making the visibility of the pedestrian much lower for the human observer.

Appendix B

Due to performance declines observed in the JAAD dataset, we conducted further error analysis by visualizing test set samples to which the model assigned incorrect labels. We found that errors occurred consistently across specific clips, regardless of the task or model used. Some examples are shown in Figure A2. In these videos, the labels assigned based on the original annotations can be seen as “corner cases”. For instance, in the top right video, a man is directly crossing the path of an oncoming car. Consequently, based on other examples in the dataset, the networks accurately predict that crossing behavior is occurring or will soon occur, yet this prediction is deemed incorrect. This is somewhat anticipated, as the JAAD dataset primarily focuses on the intention to cross.
In the simulation datasets introduced in this work, such inconsistencies are absent, allowing the experimenter to concentrate solely on the algorithm’s effectiveness without the complications associated with data labeling.
Figure A2. Examples of incorrectly classified frames evaluated on the bad-weather subset of the JAAD dataset. The predictions were made using the SPS R18T model in a single-frame prediction task. In these examples, the labels assigned to the given frame do not precisely match the observed pedestrian behavior, which leads to the prediction being considered incorrect.

Appendix C

We developed our Spiking Neural Network (SNN) by integrating the findings from [48,49] into the original ResNet18 architecture, converting it to its spiking variant. To assess the effectiveness and impact of these modifications on the final model’s performance and latency, we conducted a comparative study. We evaluated three models: the baseline Spiking ResNet18 (SP R18), Spiking SEW ResNet18 [48] (SPS R18), and the SPS R18 model with Temporally Effective Batch Normalization (TEBN) layers replacing the original batch normalization [49] (SPS R18T). The networks were tested in a single-frame detection task, where a single input frame with an associated label was presented to the network either once or repeated ten times in a sample. The results are summarized in Table A1.
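The two presentation modes (one frame versus the same frame repeated ten times) can be sketched as follows. The leaky integrate-and-fire neuron, its time constant `tau`, the threshold, and the constant input value are illustrative assumptions, not the configuration used in the evaluated networks. The sketch only shows one intuition for why repetition can matter: a spiking neuron integrates its input over time steps, so an input that is sub-threshold in one step may trigger spikes once it is presented repeatedly.

```python
def repeat_frame(frame, repeats):
    """Build a pseudo-temporal sample by presenting the same frame `repeats` times."""
    return [frame for _ in range(repeats)]


def lif_spike_count(inputs, tau=2.0, v_threshold=0.5):
    """Count spikes of one leaky integrate-and-fire neuron with a hard reset to 0."""
    v, spikes = 0.0, 0
    for x in inputs:
        v += (x - v) / tau  # leaky integration toward the current input
        if v >= v_threshold:
            spikes += 1
            v = 0.0  # hard reset after a spike
    return spikes


pixel = 0.8  # hypothetical constant input from a single pixel
once = lif_spike_count(repeat_frame(pixel, 1))
ten = lif_spike_count(repeat_frame(pixel, 10))
print(once, ten)
```

With these assumed parameters the neuron stays silent after a single presentation and fires on every second step when the input is repeated.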
When presented with a single input frame, SPS R18 demonstrated significant improvements over SP R18 across all data cases, except for the DVS modality in the bad-weather subset, where SP R18 achieved a comparable AUROC. The inclusion of TEBN yielded a performance similar to that of SPS R18, though, surprisingly, the model without TEBN retained a slight advantage. An exception was seen in the RGB data within the normal-weather subset, where SPS R18T outperformed SPS R18, achieving an AUROC of 0.8541 and an F-score of 58.61 (versus 0.5686 and 34.2 for SPS R18).
For the scenario where the input frame was repeated ten times in the samples, SP R18 exhibited substantial performance gains, often surpassing the enhanced versions. SPS R18 improved performance with RGB data in the normal-weather subset but declined in the bad-weather subset for the same modality compared to the one-frame approach. Once again, SPS R18T showed stable performance across all modalities.
After examining these results, we ultimately selected SPS R18T for further evaluation due to its stable and robust performance across most modalities. Even in cases where it did not achieve the highest results, it remained close to the best-performing network. Notably, this network’s ability to perform nearly as well with samples consisting of just one frame without repetitions is advantageous, as it directly reduces latency, increases computational speed, and lowers energy consumption for potential future applications.
Table A1. Comparison of SNN architecture performance and frame latency on different datasets in the single-frame classification task, evaluated by test AUROC and F-score. F-score is expressed in percentages. SP R18 stands for the baseline Spiking ResNet18, SPS R18 stands for the Spiking SEW ResNet18, SPS R18T stands for SPS R18 with TEBN, and TEBN stands for Temporal Effective Batch Normalization. The best outcomes for each subset and modality are emphasized in bold.

| Network | Bad Weather, DVS: AUROC | Bad Weather, DVS: F-Score | Bad Weather, RGB: AUROC | Bad Weather, RGB: F-Score | Normal Weather, DVS: AUROC | Normal Weather, DVS: F-Score | Normal Weather, RGB: AUROC | Normal Weather, RGB: F-Score |
|---|---|---|---|---|---|---|---|---|
| Clip length: 1 frame | | | | | | | | |
| SP R18 | 0.9429 | 75.86 | 0.5594 | 29.94 | 0.8365 | 68.64 | 0.5 | 0 |
| SPS R18 | 0.9478 | 79.97 | 0.7676 | 42.02 | 0.9301 | 75.28 | 0.5686 | 34.2 |
| SPS R18T | 0.9467 | 79.16 | 0.741 | 39.38 | 0.8985 | 72.25 | 0.8541 | 58.61 |
| Frame repeats: 10 frames | | | | | | | | |
| SP R18 | 0.9637 | 78.57 | 0.737 | 42.58 | 0.9485 | 77.58 | 0.9116 | 68.97 |
| SPS R18 | 0.9771 | 83.08 | 0.5759 | 11.54 | 0.9421 | 73.95 | 0.8735 | 63.4 |
| SPS R18T | 0.9636 | 83.61 | 0.7226 | 37.91 | 0.9288 | 74.22 | 0.8325 | 54.32 |

Appendix D

The MViTv2 model was trained from scratch using a smaller number of layers to limit the computational burden. By reducing the number of parameters, we also sought to reduce the risk of overfitting, as vision transformers usually require orders of magnitude more data to converge than is available in our datasets.
The feature extractor consisted of five multiscale attention blocks, having 1, 2, 4, 4, and 8 attention heads and 192, 384, 384, 384, and 768 convolutional filters, respectively. The classifier head was a dense layer with 768 neurons.
Additionally, we wanted to avoid the modification of the original size of the convolutional layer performing patch extraction. Therefore, we interpolated the positional embeddings to accommodate tensor shapes.
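Interpolating positional embeddings to a new length can be sketched as below. This is a one-dimensional, linear, endpoint-aligned version written in plain Python for illustration; the source does not state which interpolation mode or which embedding layout was used, so the function name `interpolate_positional_embeddings` and every detail of the resampling are assumptions.

```python
def interpolate_positional_embeddings(embeddings, new_len):
    """Linearly resample a sequence of embedding vectors to `new_len` positions."""
    old_len = len(embeddings)
    if new_len == 1 or old_len == 1:
        return [list(embeddings[0]) for _ in range(new_len)]
    resampled = []
    for i in range(new_len):
        # Location of the new index on the old grid (first and last positions aligned).
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        w = pos - lo
        resampled.append(
            [(1 - w) * a + w * b for a, b in zip(embeddings[lo], embeddings[hi])]
        )
    return resampled


# Three positions with 2-dimensional embeddings, stretched to five positions.
pos_embed = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
stretched = interpolate_positional_embeddings(pos_embed, 5)
print(stretched)
```

The embedding dimension is left untouched; only the number of positions changes, which is what allows the patch-extraction layer to keep its original size.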

References

  1. Bathla, G.; Bhadane, K.; Singh, R.K.; Kumar, R.; Aluvalu, R.; Krishnamurthi, R.; Kumar, A.; Thakur, R.N.; Basheer, S. Autonomous Vehicles and Intelligent Automation: Applications, Challenges, and Opportunities. Mob. Inf. Syst. 2022, 2022, 7632892.
  2. Burd, J.T.J. Regulatory Sandboxes for Safety Assurance of Autonomous Vehicles. Univ. Pa. J. Law Public Aff. 2021, 7, 5.
  3. Lillo, L.D.; Gode, T.; Zhou, X.; Atzei, M.; Chen, R.; Victor, T. Comparative Safety Performance of Autonomous- and Human Drivers: A Real-World Case Study of the Waymo One Service. arXiv 2023, arXiv:2309.01206.
  4. Cazzato, D.; Bono, F. An Application-Driven Survey on Event-Based Neuromorphic Computer Vision. Information 2024, 15, 472.
  5. Shariff, W.; Dilmaghani, M.S.; Kielty, P.; Moustafa, M.; Lemley, J.; Corcoran, P. Event Cameras in Automotive Sensing: A Review. IEEE Access 2024, 12, 51275–51306.
  6. Rasouli, A.; Kotseruba, I.; Tsotsos, J.K. Are they going to cross? A benchmark dataset and baseline for pedestrian crosswalk behavior. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 206–213.
  7. Wang, H.; Nie, Y.; Li, Y.; Liu, H.; Liu, M.; Cheng, W.; Wang, Y. Research, Applications and Prospects of Event-Based Pedestrian Detection: A Survey. arXiv 2024, arXiv:2407.04277.
  8. Elallid, B.B.; Benamar, N.; Hafid, A.S.; Rachidi, T.; Mrani, N. A Comprehensive Survey on the Application of Deep and Reinforcement Learning Approaches in Autonomous Driving. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 7366–7390.
  9. Brunetti, A.; Buongiorno, D.; Trotta, G.F.; Bevilacqua, V. Computer vision and deep learning techniques for pedestrian detection and tracking: A survey. Neurocomputing 2018, 300, 17–33.
  10. Zhang, C.; Berger, C. Pedestrian Behavior Prediction Using Deep Learning Methods for Urban Scenarios: A Review. IEEE Trans. Intell. Transp. Syst. 2023, 24, 10279–10301.
  11. Kaya, O.; Codur, M.Y.; Mustafaraj, E. Automatic Detection of Pedestrian Crosswalk with Faster R-CNN and YOLOv7. Buildings 2023, 13, 1070.
  12. Zamboni, S.; Kefato, Z.T.; Girdzijauskas, S.; Norén, C.; Dal Col, L. Pedestrian trajectory prediction with convolutional neural networks. Pattern Recognit. 2022, 121, 108252.
  13. Liu, B.; Adeli, E.; Cao, Z.; Lee, K.H.; Shenoi, A.; Gaidon, A.; Niebles, J.C. Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction. In Proceedings of the IEEE Robotics and Automation Letters (IEEE RA-L) and International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–4 June 2020; IEEE: Piscataway, NJ, USA, 2020.
  14. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Moschitti, A., Pang, B., Daelemans, W., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1724–1734.
  15. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; pp. 6000–6010.
  17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  18. Yuan, Y.; Weng, X.; Ou, Y.; Kitani, K. AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021.
  19. Rasouli, A.; Kotseruba, I. PedFormer: Pedestrian Behavior Prediction via Cross-Modal Attention Modulation and Gated Multitask Learning. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 9844–9851.
  20. Rasouli, A.; Kotseruba, I.; Kunic, T.; Tsotsos, J. PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6261–6270.
  21. Qi, S.; Menozzi, M. Untersuchung des Entscheidungsverhaltens von Fußgängern bei überqueren mit autonomen Fahrzeugen in Virtueller Realität. Z. Arbeitswissenschaft 2023, 77, 218–229.
  22. Matthews, M.; Chowdhary, G.; Kieson, E. Intent Communication between Autonomous Vehicles and Pedestrians. arXiv 2017, arXiv:1708.07123.
  23. Kulhandjian, H.; Barron, J.; Tamiyasu, M.; Thompson, M.; Kulhandjian, M. AI-Based Pedestrian Detection and Avoidance at Night Using Multiple Sensors. J. Sens. Actuator Netw. 2024, 13, 34.
  24. Weihmayr, D.; Sezgin, F.; Tolksdorf, L.; Birkner, C.; Jazar, R.N. Predicting the Influence of Adverse Weather on Pedestrian Detection with Automotive Radar and Lidar Sensors. arXiv 2024, arXiv:2405.12736.
  25. Tumas, P.; Nowosielski, A.; Serackis, A. Pedestrian Detection in Severe Weather Conditions. IEEE Access 2020, 8, 62775–62784.
  26. Vogginger, B.; Kreutz, F.; López-Randulfe, J.; Liu, C.; Dietrich, R.; Gonzalez, H.A.; Scholz, D.; Reeb, N.; Auge, D.; Hille, J.; et al. Automotive Radar Processing with Spiking Neural Networks: Concepts and Challenges. Front. Neurosci. 2022, 16, 851774.
  27. Wan, J.; Xia, M.; Huang, Z.; Tian, L.; Zheng, X.; Chang, V.; Zhu, Y.; Wang, H. Event-Based Pedestrian Detection Using Dynamic Vision Sensors. Electronics 2021, 10, 888.
  28. Chen, N.F.Y. Pseudo-Labels for Supervised Learning on Dynamic Vision Sensor Data, Applied to Object Detection Under Ego-Motion. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 757–75709.
  29. Neftci, E.O.; Mostafa, H.; Zenke, F. Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-based optimization to spiking neural networks. IEEE Signal Process. Mag. 2019, 36, 51–63.
  30. Wang, S.; Cheng, T.; Lim, M.H. A hierarchical taxonomic survey of spiking neural networks. Memetic Comput. 2022, 14, 335–354.
  31. Pascarella, L.; Magno, M. Grayscale and Event-Based Sensor Fusion for Robust Steering Prediction for Self-Driving Cars. In Proceedings of the 2023 IEEE Sensors Applications Symposium (SAS), Ottawa, ON, Canada, 18–20 July 2023; pp. 1–6.
  32. Cordone, L.; Miramond, B.; Thierion, P. Object Detection with Spiking Neural Networks on Automotive Event Data. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8.
  33. Kim, S.; Park, S.; Na, B.; Yoon, S. Spiking-yolo: Spiking neural network for real-time object detection. arXiv 2019, arXiv:1903.06530.
  34. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
  35. Wang, X.; Xu, L.; Sun, H.; Xin, J.; Zheng, N. On-Road Vehicle Detection and Tracking Using MMW Radar and Monovision Fusion. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2075–2084.
  36. Zhu, Y.; Wang, T.; Zhu, S. Adaptive Multi-Pedestrian Tracking by Multi-Sensor: Track-to-Track Fusion Using Monocular 3D Detection and MMW Radar. Remote Sens. 2022, 14, 1837.
  37. Brazil, G.; Liu, X. M3D-RPN: Monocular 3D Region Proposal Network for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9286–9295.
  38. Al Abdul Wahid, S.; Asad, A.; Mohammadi, F. A Survey on Neuromorphic Architectures for Running Artificial Intelligence Algorithms. Electronics 2024, 13, 2963.
  39. Davies, M.; Srinivasa, N.; Lin, T.H.; Chinya, G.; Cao, Y.; Choday, S.H.; Dimou, G.; Joshi, P.; Imam, N.; Jain, S.; et al. Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. IEEE Micro 2018, 38, 82–99.
  40. Massa, R.; Marchisio, A.; Martina, M.; Shafique, M. An Efficient Spiking Neural Network for Recognizing Gestures with a DVS Camera on the Loihi Neuromorphic Processor. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–9.
  41. Viale, A.; Marchisio, A.; Martina, M.; Masera, G.; Shafique, M. CarSNN: An Efficient Spiking Neural Network for Event-Based Autonomous Cars on the Loihi Neuromorphic Research Processor. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual, 18–22 July 2021; pp. 1–10.
  42. Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robotics, Taichung, Taiwan, 10–12 April 2017.
  43. Zenke, F.; Vogels, T.P. The Remarkable Robustness of Surrogate Gradient Learning for Instilling Complex Function in Spiking Neural Networks. Neural Comput. 2021, 33, 899–925.
  44. Ding, J.; Yu, Z.; Tian, Y.; Huang, T. Optimal ANN-SNN Conversion for Fast and Accurate Inference in Deep Spiking Neural Networks. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Montreal, QC, Canada, 19–27 August 2021; Zhou, Z.H., Ed.; International Joint Conferences on Artificial Intelligence Organization: Primary, CA, USA, 2021; Volume 8, pp. 2328–2336.
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
  46. Izhikevich, E. Which model to use for cortical spiking neurons? IEEE Trans. Neural Netw. 2004, 15, 1063–1070.
  47. Fang, W.; Yu, Z.; Chen, Y.; Masquelier, T.; Huang, T.; Tian, Y. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2661–2671.
  48. Fang, W.; Yu, Z.; Chen, Y.; Huang, T.; Masquelier, T.; Tian, Y. Deep residual learning in spiking neural networks. Adv. Neural Inf. Process. Syst. 2021, 34, 21056–21069.
  49. Duan, C.; Ding, J.; Chen, S.; Yu, Z.; Huang, T. Temporal effective batch normalization in spiking neural networks. Adv. Neural Inf. Process. Syst. 2022, 35, 34377–34390.
  50. Kim, Y.; Panda, P. Revisiting batch normalization for training low-latency deep spiking neural networks from scratch. Front. Neurosci. 2021, 15, 773954.
  51. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (PMLR), Lille, France, 6–11 July 2015; pp. 448–456.
  52. Gehrig, M.; Aarents, W.; Gehrig, D.; Scaramuzza, D. DSEC: A Stereo Event Camera Dataset for Driving Scenarios. IEEE Robot. Autom. Lett. 2021, 6, 4947–4954.
  53. Riaz, M.N.; Wielgosz, M.; Romera, A.G.; López, A.M. Synthetic Data Generation Framework, Dataset, and Efficient Deep Model for Pedestrian Intention Prediction. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 24–28 September 2023; pp. 2742–2749.
  54. Chen, G.; Peng, P.; Li, G.; Tian, Y. Training Full Spike Neural Networks via Auxiliary Accumulation Pathway. arXiv 2023, arXiv:2301.11929.
  55. Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In Proceedings of the 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 9–13 February 2014; pp. 10–14.
  56. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211.
  57. Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. MViTv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022.
  58. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255.
  59. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics Human Action Video Dataset. arXiv 2017, arXiv:1705.06950.
  60. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  61. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32.
  62. Fang, W.; Chen, Y.; Ding, J.; Yu, Z.; Masquelier, T.; Chen, D.; Huang, L.; Zhou, H.; Li, G.; Tian, Y. SpikingJelly: An open-source machine learning infrastructure platform for spike-based intelligence. Sci. Adv. 2023, 9, eadi1480.
Figure 1. Example frames from the dataset. The first row shows samples from the good-weather subset, and the second row shows samples from the bad-weather subset. The first column displays images captured by the DVS camera, while the second column shows the corresponding RGB images. Each DVS-RGB pair represents the same frame within a given subset. A pedestrian is present in the scene in both of the presented frames. Note that in the bad weather frames, the pedestrian is barely visible to the human observer in the RGB images on the right.
Figure 2. Pipeline of the Dataset Generation and Model Training Process. The process includes video simulation, data collection, preprocessing, model training, evaluation, and final implementation in the autonomous vehicle system.
Figure 3. An example of a label assignment approach for clips in the detection task. A positive label is assigned if even a single frame contains a pedestrian on the street within the given context.
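The clip-level labeling rule described in this caption can be sketched as follows. The helper name `make_clips`, the per-frame label list, and the `stride` parameter are hypothetical and used only for illustration; the only part taken from the text is the rule that a clip is positive if at least one of its frames contains a pedestrian on the street.

```python
def make_clips(frame_labels, clip_len, stride=1):
    """Return (start_index, clip_label) pairs for sliding windows over frame labels."""
    clips = []
    for start in range(0, len(frame_labels) - clip_len + 1, stride):
        window = frame_labels[start:start + clip_len]
        # Positive clip if at least one frame shows a pedestrian on the street.
        clips.append((start, int(any(window))))
    return clips


# Per-frame labels: 1 = pedestrian on the street, 0 = otherwise.
frame_labels = [0, 0, 0, 1, 0, 0, 0, 0]
clips = make_clips(frame_labels, clip_len=3)
print(clips)
```

A single positive frame therefore marks every window that overlaps it as positive, so longer clips yield a larger share of positive samples.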
Figure 4. Graphical depiction of labeling frames in the prediction task.
Figure 5. Pareto plots of the trade-off between performance and energy usage for the evaluated models in the bad-weather detection task with DVS data, for clip lengths of 9 (a) and 30 (b). Performance is measured via AUROC. The number next to each point is the Euclidean distance between that point and the crossing point of the X and Y axes (larger values are better).
Figure 6. Examples of RGB-to-DVS data conversion of frames from the JAAD dataset. The examples show the potential of using such methods to impute the missing DVS modality in datasets containing only RGB recordings.
Table 1. Training Parameters.

| Parameter | Value | Comments |
|---|---|---|
| Initial Learning Rate | $10^{-3}$ | Learning rate used at the start of training |
| Weight Decay Factor | $10^{-1}$ | Regularization parameter to prevent overfitting |
| Batch Size | 4 to 64 | Varies based on hardware capacity |
| Maximum Epochs | 100 | Maximum number of training epochs |
| Early Stopping | 8 epochs | Stops training if no improvement in validation loss for 8 consecutive epochs |
| Loss Function | Weighted binary cross-entropy | Weights calculated based on the negative-to-positive sample ratio |
| Optimization Algorithm | AdamW | Optimizer used for training |
| Performance Metrics | AUROC, F-score | Metrics used for evaluating model performance |
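The weighted binary cross-entropy listed in Table 1 can be sketched in plain Python as below. The sketch assumes that the weight is applied to the positive term and equals the negative-to-positive sample ratio, which is similar in spirit to the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`; whether the authors applied the weight in exactly this way is not stated, and the helper names are hypothetical.

```python
import math


def positive_weight(labels):
    """Negative-to-positive sample ratio, used to up-weight the rarer positive class."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives


def weighted_bce(probs, labels, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled by `pos_weight`."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 0, 0]
w = positive_weight(labels)  # 3 negatives / 1 positive
loss = weighted_bce([0.5, 0.5, 0.5, 0.5], labels, w)
print(w, loss)
```

With this weighting, the single positive sample contributes as much to the loss as the three negative samples combined, which counteracts the class imbalance.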
Table 2. The performance of the networks across various datasets in the detection task, evaluated by AUROC and F-score. The F-score is expressed in percentages. “PT ResNet18” refers to the pseudotemporal variant of the standard ResNet18, and “SPS R18T” represents the Spiking SEW ResNet18 enhanced with Temporally Effective Batch Normalization. The best outcomes for each subset and modality are emphasized in bold.

| Network | Bad Weather, DVS: AUROC | Bad Weather, DVS: F-Score | Bad Weather, RGB: AUROC | Bad Weather, RGB: F-Score | Normal Weather, DVS: AUROC | Normal Weather, DVS: F-Score | Normal Weather, RGB: AUROC | Normal Weather, RGB: F-Score |
|---|---|---|---|---|---|---|---|---|
| Clip length: 9 frames | | | | | | | | |
| PT ResNet18 | 0.9882 | 93.05 | 0.9194 | 58.02 | 0.9685 | 83.83 | 0.9516 | 76.95 |
| SlowFast R50 | 0.7446 | 42.17 | 0.5126 | 20.01 | 0.9487 | 72.61 | 0.9932 | 92.81 |
| MViTv2 | 0.5 | 0 | 0.5 | 0 | 0.5 | 0 | 0.5 | 0 |
| SPS R18T | 0.9504 | 57.94 | 0.6771 | 28.11 | 0.9421 | 74.4 | 0.7995 | 49.26 |
| Clip length: 30 frames | | | | | | | | |
| PT ResNet18 | 0.975 | 91.94 | 0.9825 | 83.56 | 0.9548 | 79.87 | 0.9616 | 80.5 |
| SlowFast R50 | 0.9535 | 80.31 | 0.9546 | 84.3 | 0.9594 | 81.68 | 0.9784 | 87.68 |
| MViTv2 | 0.5 | 0 | 0.5 | 0 | 0.5 | 0 | 0.4951 | 0 |
| SPS R18T | 0.9542 | 75.5 | 0.6811 | 45.05 | 0.9289 | 73.43 | 0.8012 | 54.39 |
Table 3. The performance of the compared networks on different datasets in prediction tasks, evaluated by the test AUROC and F-score. The F-score is expressed in percentages. All clip lengths were set to 9 frames, with 8 frames overlapping. PT ResNet18 stands for the “pseudotemporal” version of the normal ResNet18, and SPS R18T denotes Spiking SEW ResNet18 with Temporally Effective Batch Normalization. The best outcomes for each subset and modality are emphasized in bold.

| Network | Bad Weather, DVS: AUROC | Bad Weather, DVS: F-Score | Bad Weather, RGB: AUROC | Bad Weather, RGB: F-Score | Normal Weather, DVS: AUROC | Normal Weather, DVS: F-Score | Normal Weather, RGB: AUROC | Normal Weather, RGB: F-Score |
|---|---|---|---|---|---|---|---|---|
| Horizon: 30 frames (1 s) | | | | | | | | |
| PT ResNet18 | 0.7959 | 80 | 0.6476 | 70.97 | 0.9455 | 93.17 | 0.7177 | 65.45 |
| SlowFast R50 | 0.5827 | 0 | 0.8333 | 89.08 | 0.9422 | 88.37 | 0.8173 | 76.34 |
| MViTv2 | 0.5210 | 0 | 0.4443 | 0 | 0.455 | 0 | 0.6795 | 0 |
| SPS R18T | 0.852 | 83.27 | 0.8643 | 68.75 | 0.8802 | 82.55 | 0.3449 | 52.33 |
| Horizon: 150 frames (5 s) | | | | | | | | |
| PT ResNet18 | 0.6737 | 59.86 | 0.8633 | 81.7 | 0.8608 | 79.56 | 0.4053 | 56.07 |
| SlowFast R50 | 0.7903 | 69.97 | 0.6271 | 64.81 | 0.8291 | 73.71 | 0.454 | 57.08 |
| MViTv2 | 0.4715 | 0 | 0.4696 | 0 | 0.5 | 0 | 0.67 | 1.49 |
| SPS R18T | 0.8249 | 67.99 | 0.882 | 79.39 | 0.4423 | 56.42 | 0.4128 | 62.96 |
Table 4. The performance of the compared networks on subsets extracted from the JAAD dataset in the detection task, evaluated by AUROC and F-score. The F-score is expressed in percentages. PT ResNet18 stands for the “pseudotemporal” version of the normal ResNet18, and SPS R18T denotes Spiking SEW ResNet18 with Temporally Effective Batch Normalization. The best outcomes for each subset and modality are emphasized in bold.

| Network | Bad Weather: AUROC | Bad Weather: F-Score | Normal Weather: AUROC | Normal Weather: F-Score |
|---|---|---|---|---|
| Clip length: 9 frames | | | | |
| PT ResNet18 | 0.4366 | 56.3 | 0.7597 | 71.92 |
| SlowFast R50 | 0.3585 | 52.89 | 0.7513 | 76.97 |
| MViTv2 | 0.664 | 75.52 | 0.4745 | 0 |
| SPS R18T | 0.5966 | 74.03 | 0.6169 | 70.2 |
| Clip length: 30 frames | | | | |
| PT ResNet18 | 0.5966 | 74.42 | 0.7861 | 46.24 |
| SlowFast R50 | 0.4773 | 56.25 | 0.5269 | 80.7 |
| MViTv2 | 0.4034 | 74.41 | 0.5 | 0 |
| SPS R18T | 0.642 | 68.57 | 0.6093 | 65.78 |
Table 5. Comparison of the number of parameters, energy usage, and the number of specific operations in the used networks. The best values are emphasized in bold (the lower the better).

| Model Name | Parameters | ACs | MACs | Energy Usage [mJ] |
|---|---|---|---|---|
| Clip length: 9 frames | | | | |
| PT ResNet18 | 11.17 M | 0 | 36.16 G | 166.33 |
| SlowFast R50 | 31.53 M | 0 | 109.94 G | 505.73 |
| MViTv2 | 12.26 M | 0 | 20.1 G | 92.48 |
| SPS R18T | 11.17 M | 22.75 G | 6.52 G | 50.45 |
| Clip length: 30 frames | | | | |
| PT ResNet18 | 11.17 M | 0 | 120.53 G | 554.42 |
| SlowFast R50 | 31.63 M | 0 | 363.65 G | 1672.8 |
| MViTv2 | 12.26 M | 0 | 20.1 G | 92.48 |
| SPS R18T | 11.17 M | 4.63 G | 6.13 G | 32.37 |
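The relation between operation counts and energy usage in Table 5 can be approximated with a short calculation. The per-operation energies below (4.6 pJ per multiply-accumulate and 0.9 pJ per accumulate) are assumptions: they are the values commonly attributed to the 45 nm CMOS estimates in [55] and are not stated in this section, although they reproduce the tabulated energies closely. The function name `energy_mj` is hypothetical.

```python
E_MAC_PJ = 4.6  # assumed energy per multiply-accumulate (MAC), in picojoules
E_AC_PJ = 0.9   # assumed energy per accumulate (AC), in picojoules


def energy_mj(acs_giga, macs_giga):
    """Estimate energy in millijoules from operation counts given in giga-operations."""
    # 1e9 operations * 1e-12 J per picojoule = 1e-3 J, so the result is in mJ.
    return acs_giga * E_AC_PJ + macs_giga * E_MAC_PJ


# Operation counts taken from Table 5, clip length of 9 frames.
pt_resnet18 = energy_mj(0.0, 36.16)
sps_r18t = energy_mj(22.75, 6.52)
print(round(pt_resnet18, 2), round(sps_r18t, 2))
```

Under these assumptions the spiking network is cheaper mainly because most of its operations are accumulates, which cost roughly a fifth of a multiply-accumulate; small differences from the table are expected because the operation counts are rounded.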
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sakhai, M.; Mazurek, S.; Caputa, J.; Argasiński, J.K.; Wielgosz, M. Spiking Neural Networks for Real-Time Pedestrian Street-Crossing Detection Using Dynamic Vision Sensors in Simulated Adverse Weather Conditions. Electronics 2024, 13, 4280. https://doi.org/10.3390/electronics13214280
